Clean Opera Task
The CleanOperaTask transforms Opera raw data into the common cleaned schema used across the ETL pipeline. It is implemented as a BaseCleanerTask to reuse shared cleaning utilities.
What this task does
- Generates stable UUIDs for reservation and guest records
- Removes duplicate reservations using `lastmodifydatetime` with status ties
- Parses booking and room stay dates
- Matches daily rates to room stay rows
- Extracts and normalizes guest details and profiles
- Converts monetary columns to a standard currency (via `convert_currency`) and calculates per-stay revenue fields
Data Transformations
Guest Data
- Extracts names, emails, phone numbers, and nationalities
- Generates a stable `guest_id` UUID
- Normalizes country and email formats
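As an illustration of how a stable guest identifier can be derived, here is a minimal sketch using Python's standard `uuid.uuid5`. The namespace value, the helper names (`normalize_email`, `stable_guest_id`), and the choice of name plus email as the key are assumptions made for this example; the actual task may build its IDs from different fields.

```python
import uuid

# Hypothetical namespace for this example only; the real task may use another.
GUEST_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "guests.example.invalid")


def normalize_email(email):
    """Trim whitespace and lowercase; return None for missing or empty values."""
    if email is None:
        return None
    cleaned = email.strip().lower()
    return cleaned or None


def stable_guest_id(first_name, last_name, email):
    """Derive a deterministic UUID from normalized guest attributes.

    The same normalized inputs always yield the same UUID, so re-running
    the cleaning step does not produce new identifiers.
    """
    parts = [
        (first_name or "").strip().lower(),
        (last_name or "").strip().lower(),
        normalize_email(email) or "",
    ]
    return str(uuid.uuid5(GUEST_NAMESPACE, "|".join(parts)))
```

Because the inputs are normalized before hashing, `" Ana@Example.com "` and `"ana@example.com"` map to the same `guest_id`.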
Reservation Data
- Standardizes status codes (e.g., `CheckedOut` → `checked_out`)
- Extracts `res_id_og` and computes stable `res_id` UUIDs
- Maps booking channel and source information
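The status standardization can be pictured as a CamelCase to snake_case conversion. The sketch below is one possible way to do it with a regular expression; `standardize_status` is a hypothetical helper, and the real task may use an explicit lookup table instead.

```python
import re


def standardize_status(raw_status):
    """Convert a status such as 'CheckedOut' to 'checked_out'."""
    if raw_status is None:
        return None
    status = raw_status.strip()
    # Insert an underscore at each lowercase/digit -> uppercase boundary.
    status = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", status)
    # Treat runs of whitespace or hyphens as a single separator.
    status = re.sub(r"[\s\-]+", "_", status)
    return status.lower()
```

A regex keeps the example short, but a fixed mapping is usually safer when the set of source statuses is known, because unexpected values can then be flagged rather than silently converted.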
Room Data
- Explodes daily stay dates (`room_stay_date`) using `check_in`/`check_out`
- Joins with rates to compute daily room revenue
- Sets `room_pm`, room type and categorical fields
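The explode-and-join step can be sketched in plain Python as follows. This is an illustration of the logic only, not the task's actual code: the function names, the `room_revenue` field name, and the convention that `check_out` is exclusive (one row per night) are assumptions.

```python
from datetime import date, timedelta


def explode_stay_dates(check_in, check_out):
    """Return one date per night; check_in inclusive, check_out exclusive (assumed)."""
    nights = (check_out - check_in).days
    return [check_in + timedelta(days=i) for i in range(nights)]


def daily_room_rows(reservation, rates):
    """Build one row per stay date and attach the matching daily rate.

    `rates` is a dict keyed by (res_id_og, room_stay_date). A missing key
    yields None, mirroring a left join where no rate row matched.
    """
    rows = []
    for stay_date in explode_stay_dates(reservation["check_in"], reservation["check_out"]):
        key = (reservation["res_id_og"], stay_date)
        rows.append(
            {
                "res_id_og": reservation["res_id_og"],
                "room_stay_date": stay_date,
                "room_revenue": rates.get(key),
            }
        )
    return rows
```

For example, a reservation from 2024-01-01 to 2024-01-04 produces three rows, one for each of the nights of January 1, 2, and 3.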
Models Required
- `RawReservationModel`
- `RawRateModel`
- `RawProfileModel`
Models Provided
- `CleanGuestModel`
- `CleanReservationModel`
- `CleanRoomModel`
Implementation Notes
- Date/timestamp parsing uses `to_date` and `to_timestamp` with stable UDF-based tie-breaks.
- Duplicate rows are removed using partition/window functions.
- Rates are matched to rooms by `(res_id_og, room_stay_date)`.
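The deduplication rule (keep the most recently modified row per reservation, breaking ties on status) can be sketched without a dataframe engine. The example below is plain Python rather than the window-function implementation itself, and the `STATUS_PRIORITY` ordering is a hypothetical choice made for illustration.

```python
from datetime import datetime

# Hypothetical tie-break order: a higher value wins when timestamps are equal.
STATUS_PRIORITY = {"checked_out": 3, "checked_in": 2, "reserved": 1}


def deduplicate_reservations(rows):
    """Keep one row per res_id_og: latest lastmodifydatetime, then status priority."""
    best = {}
    for row in rows:
        rank = (row["lastmodifydatetime"], STATUS_PRIORITY.get(row["status"], 0))
        key = row["res_id_og"]
        # Replace the stored row only when this one ranks strictly higher.
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]
```

In a window-function formulation, the same effect is typically achieved by partitioning on the reservation key, ordering by the same two criteria in descending order, and keeping the first row of each partition.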