
Clean Opera Task

The CleanOperaTask transforms raw Opera data into the common cleaned schema used across the ETL pipeline. It extends BaseCleanerTask so that it can reuse the shared cleaning utilities.

What this task does

  • Generates stable UUIDs for reservation and guest records
  • Removes duplicate reservations, keeping the row with the latest lastmodifydatetime and breaking ties by status
  • Parses booking and room stay dates
  • Matches daily rates to room stay rows
  • Extracts and normalizes guest details and profiles
  • Converts monetary columns to a standard currency (via convert_currency) and calculates per-stay revenue fields

Data Transformations

Guest Data

  • Extracts names, emails, phone numbers, and nationalities
  • Generates a stable guest_id UUID
  • Normalizes country and email formats
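A stable guest_id can be derived by hashing normalized guest fields into a name-based UUID, so the same guest gets the same ID on every run. The sketch below is plain Python and only illustrates the idea: the namespace, the choice of key fields (name and email), and the helper names normalize_email and stable_guest_id are all assumptions, not the task's actual implementation.

```python
import uuid

# Hypothetical namespace; the namespace and key fields used by the real task
# are not documented here.
GUEST_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "guests.example.invalid")


def normalize_email(email):
    """Trim whitespace and lowercase an email; return None for empty values."""
    if email is None:
        return None
    cleaned = email.strip().lower()
    return cleaned or None


def stable_guest_id(first_name, last_name, email):
    """Derive a deterministic (version 5) UUID from normalized guest fields."""
    parts = [
        (first_name or "").strip().lower(),
        (last_name or "").strip().lower(),
        normalize_email(email) or "",
    ]
    # uuid5 is a pure function of (namespace, name), so re-running the
    # pipeline on the same guest yields the same ID.
    return str(uuid.uuid5(GUEST_NAMESPACE, "|".join(parts)))
```

Normalizing before hashing matters: without it, "Ana@Example.com" and "ana@example.com" would produce two different guest IDs for what is likely the same person.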

Reservation Data

  • Standardizes status codes (e.g., CheckedOut → checked_out)
  • Extracts res_id_og and computes stable res_id UUIDs
  • Maps booking channel and source information
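Status standardization of the kind shown above (a CamelCase value such as CheckedOut becoming checked_out) can be sketched with a small regular-expression helper. The function name normalize_status is hypothetical, and the real task may instead use an explicit lookup table of Opera status codes.

```python
import re


def normalize_status(raw):
    """Convert a CamelCase or spaced status such as 'CheckedOut' to snake_case."""
    if raw is None:
        return None
    text = raw.strip()
    # Insert an underscore at each boundary where an uppercase letter
    # follows a lowercase letter or digit: "CheckedOut" -> "Checked_Out".
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
    return text.lower().replace(" ", "_")
```

A rule-based conversion handles unseen status values without a code change, whereas a lookup table makes unexpected values fail loudly. Which trade-off the task makes is not stated here.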

Room Data

  • Explodes daily stay dates (room_stay_date) using check_in/check_out
  • Joins with rates to compute daily room revenue
  • Sets room_pm, room type and categorical fields
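The explode-and-join step can be illustrated in plain Python: one row is produced per night of the stay, and each row picks up its rate by the (res_id_og, room_stay_date) key. This is a sketch under stated assumptions, not the task's code. It assumes check_out is exclusive (the departure day is not a stay night), that rates arrive as a dictionary keyed by that pair, and the function names and the room_revenue output field are hypothetical.

```python
from datetime import date, timedelta


def explode_stay_dates(check_in, check_out):
    """Return one date per night: check_in inclusive, check_out exclusive (assumed)."""
    nights = (check_out - check_in).days
    # A zero or negative night count yields an empty list.
    return [check_in + timedelta(days=i) for i in range(nights)]


def daily_room_rows(reservation, rates):
    """Build one row per room_stay_date and attach the matching daily rate.

    `rates` maps (res_id_og, room_stay_date) -> amount. A stay date with no
    matching rate keeps room_revenue as None, like a left join.
    """
    rows = []
    for stay_date in explode_stay_dates(reservation["check_in"], reservation["check_out"]):
        key = (reservation["res_id_og"], stay_date)
        rows.append({
            "res_id_og": reservation["res_id_og"],
            "room_stay_date": stay_date,
            "room_revenue": rates.get(key),
        })
    return rows


# Usage: a two-night stay where only the first night has a rate.
reservation = {"res_id_og": "R1", "check_in": date(2024, 1, 1), "check_out": date(2024, 1, 3)}
rates = {("R1", date(2024, 1, 1)): 100.0}
rows = daily_room_rows(reservation, rates)
```

Under the exclusive check_out assumption, a same-day arrival and departure produces no rows. A pipeline that needs to keep day-use stays would have to handle them separately.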

Models Required

  • RawReservationModel
  • RawRateModel
  • RawProfileModel

Models Provided

  • CleanGuestModel
  • CleanReservationModel
  • CleanRoomModel

Implementation Notes

  • Date/timestamp parsing uses to_date and to_timestamp with stable UDF-based tie-breaks.
  • Duplicate rows are removed using partition/window functions.
  • Rates are matched to rooms by (res_id_og, room_stay_date).
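The window-based duplicate removal amounts to ranking rows within each reservation and keeping the top one. The plain-Python sketch below mirrors that logic: partition by reservation, order by lastmodifydatetime descending, then by a status priority. Using res_id_og as the partition key, the STATUS_PRIORITY ordering, and the function name are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical tie-break order; a higher number wins when timestamps are equal.
STATUS_PRIORITY = {"checked_out": 3, "checked_in": 2, "reserved": 1, "cancelled": 0}


def deduplicate_reservations(rows):
    """Keep one row per res_id_og: latest lastmodifydatetime, ties broken by status."""
    best = {}
    for row in rows:
        # Tuples compare element by element, so the status priority only
        # matters when the timestamps are equal.
        rank = (row["lastmodifydatetime"], STATUS_PRIORITY.get(row["status"], -1))
        key = row["res_id_og"]
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]
```

A deterministic tie-break matters because two rows with identical timestamps would otherwise be chosen arbitrarily, which could make the cleaned output differ between runs.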