Opera Cleaning

The OperaCleaner transforms raw Oracle Opera PMS data into our standardized format. Opera is used by the Cheval Collection properties and is one of the most widely used property management systems in the hospitality industry.

Overview

The Opera cleaning process has several unique characteristics:

Nested JSON structure with complex field paths (e.g., roomstay.arrivaldate)
Multiple ID lists requiring custom extraction functions
Attached profiles for companies and travel agents
Separate daily rates table with detailed revenue breakdowns
Pseudo rooms (PM rooms) for forecasting and overbooking management

Data Sources

The cleaner reads from three tables in the etl_gp_raw_chevalcollection database:

reservations - Main reservation data with nested structures
guest_profiles - Additional guest information (nationality)
daily_rates - Daily revenue breakdown by date

Cleaning Workflow

Helper Functions

Opera uses custom utility functions to extract data from complex structures:

Extract Reservation ID


def get_res_id(col_array, type_substring):
    """Finds the reservation id from list"""
    return F.when(
        F.size(
            F.expr(f"filter({col_array}, x -> lower(x.type) like '%{type_substring.lower()}%')")
        ) > 0,
        F.expr(f"filter({col_array}, x -> lower(x.type) like '%{type_substring.lower()}%')")[0]["id"],
    ).otherwise(None)

Usage: Extracts the reservation ID from reservationidlist array where type contains “reservation”:


df = df.withColumn("reservation_id_og", get_res_id("reservationidlist", "reservation"))

Extract Profile Information


def get_profile(col_array, type_substring: str):
    """Finds the Travel Agent and Company"""
    filter_expr = f"""
        filter({col_array}, x -> lower(x.reservationProfileType) like '%{type_substring}%')
    """
    return F.when(
        F.size(F.expr(filter_expr)) > 0,
        F.expr(f"element_at({filter_expr}, 1).name")
    ).otherwise(None)

Usage: Extracts company and travel agent names from attachedprofiles array:


df = df.withColumn("company", get_profile("attachedprofiles", "Company"))
df = df.withColumn("travel_agent", get_profile("attachedprofiles", "TravelAgent"))

Deduplication Strategy

Opera uses a priority-based deduplication similar to Athenaeum:


status_priority = (
    F.when(F.col("reservationstatus") == "CheckedOut", 3)
    .when(F.col("reservationstatus") == "CheckedIn", 2)
    .when(F.col("reservationstatus") == "Reserved", 1)
    .otherwise(0)
)
 
window = Window.partitionBy("reservation_id_og").orderBy(
    F.to_date(F.col("lastmodifydatetime")).desc(),
    status_priority.desc(),
)

Key: For duplicate reservation IDs, the most recent update with the highest status priority wins.

Field Mapping

Reservation Fields

Opera Field	Standard Schema Field	Notes
`reservationidlist` → extract by type	`reservation_id_og`	Extracted from array using `get_res_id()`
`reservationguest.id`	`guest_id_og`	Primary guest ID
`createdatetime`	`booking_date`, `created_timestamp`	When reservation was created
`lastmodifydatetime`	`last_modified_timestamp`	Last update time
`roomstay.arrivaldate`	`check_in_date`	Check-in date
`roomstay.departuredate`	`check_out_date`	Check-out date
`reservationstatus`	`status`	Reservation status
`roomstay.bookingchannelcode`	`channel`	Booking channel
`roomstay.sourcecode`	`source`	Source code
`roomstay.marketcode`	`market_segment`	Market segment
`roomstay.adultcount`	`guests_adults`	Number of adults
`roomstay.childcount`	`guests_children`	Number of children
`roomstay.rateamount.currencycode`	`currency`	Currency code
`attachedprofiles` → extract by type	`company`, `travel_agent`	Parsed using `get_profile()`

Guest Fields (from nested `reservationguest`)

Opera Field	Standard Schema Field
`reservationguest.givenname`	`first_name`
`reservationguest.surname`	`last_name`
`reservationguest.email`	`email`
`reservationguest.birthdate`	`birth_date`
`reservationguest.phonenumber`	`phone_number`
`reservationguest.address.streetaddress`	`address_street`
`reservationguest.address.cityname`	`address_city`
`reservationguest.address.postalcode`	`address_postcode`
`reservationguest.address.state`	`address_state`
`reservationguest.address.country.code`	`address_country`
`reservationguest.vip.vipCode`	`membership_type`

Room Fields (from `roomstay`)

Opera Field	Standard Schema Field
`roomstay.roomid`	`room_id`
`roomstay.roomtype`	`room_type`
`roomstay.rateplancode`	`room_rate_code`
`roomstay.roomclass`	`room_type_category`
`roomstay.roomtypecharged`	`room_type_charged`
`roomstay.pseudoRoom`	`room_pm` (pseudo/PM room flag)

Status Mapping

Opera Status	Standard Status
`Cancelled`	`cancelled`
`CheckedOut`	`checked_out`
`NoShow`	`no_show`
`CheckedIn`	`checked_in`
`Reserved`	`confirmed`
`DueIn`	`confirmed`
`DueOut`	`confirmed`
`InHouse`	`in_house`

Daily Rates Processing

Opera provides detailed daily revenue in the daily_rates table:

Source Structure


{
  "reservationid": "12345",
  "dailyrates": {
    "details": [
      {"summarydate": "2025-01-15", "revenue": 250.00},
      {"summarydate": "2025-01-16", "revenue": 250.00}
    ]
  }
}

Processing Logic


# Explode nested daily rates
df = df.withColumn("detail", F.explode("dailyrates.details"))
 
df = df.select(
    F.col("reservationid").alias("reservation_id_og"),
    F.to_date("detail.summarydate").alias("room_stay_date"),
    F.col("detail.revenue").cast("double").alias("room_stay_date_rate_net"),
    F.col("synced_date"),
)

Deduplication

For each reservation and date, keep only the most recent synced data:


window = Window.partitionBy("reservation_id_og", "room_stay_date").orderBy(
    F.col("synced_date").desc()
)
 
df = (
    df.withColumn("rn", F.row_number().over(window))
      .filter("rn = 1")
      .drop("rn", "synced_date")
)

Guest Profile Enrichment

Additional guest information (nationality) is loaded from guest_profiles:


df = self._job_context.catalog.read_table(
    database="etl_gp_raw_chevalcollection",
    table="guest_profiles",
)
 
df = (
    df.withColumn("guest_id_og", get_res_id("guestIdList", "Profile"))
      .withColumn("nationality", F.col("guestDetails.customer.nationality"))
)
 
# Join with reservation guests
guest_df = guest_df.join(profiles_df, on="guest_id_og", how="left")

Rooms Processing

Stay Date Expansion

Daily room records are generated by expanding the stay into individual dates:


df = df.withColumn(
    "room_stay_date",
    F.explode(
        F.when(
            # Same-day checkout
            F.to_date("check_in_date") == F.col("check_out_date"),
            F.array(F.to_date("check_in_date"))
        ).otherwise(
            # Multi-night stay
            F.sequence(
                F.to_date("check_in_date"),
                F.date_sub(F.to_date("check_out_date"), 1),
            )
        )
    ),
)

Revenue Join

Daily rates are joined by reservation ID and stay date:


df = df.join(
    rates,
    on=["reservation_id_og", "room_stay_date"],
    how="left",
)

F&B and Other Revenue

Opera provides aggregate F&B and other revenue at the reservation level:


df = df.withColumn(
    "room_stay_date_fnb_net",
    F.coalesce(
        F.col("revenuesandbalances.foodAndBevRevenue.amount").cast("double"),
        F.lit(0.0),
    ),
).withColumn(
    "room_stay_date_other_net",
    F.coalesce(
        F.col("revenuesandbalances.otherRevenue.amount").cast("double"),
        F.lit(0.0),
    ),
)

Important: These totals are only assigned to the check-in date to avoid double-counting:


df = df.withColumn(
    "room_stay_date_fnb_net",
    F.when(
        F.col("room_stay_date") == F.to_date("room_check_in_date"),
        F.col("room_stay_date_fnb_net")
    ).otherwise(F.lit(0.0))
)

Pseudo Rooms (PM Rooms)

Opera supports pseudo rooms (PM rooms) for inventory management:


df = df.withColumn("room_pm", F.col("roomstay.pseudoRoom"))

Purpose: PM rooms are used for:

Forecasting and planning
Overbooking management
Group blocks without specific room assignments

Records with room_pm = true don’t have actual room assignments yet.

Data Validation

Standard cleaning utilities are applied:


df = clean_country(df, "address_country")
df = clean_country(df, "nationality")
df = clean_email(df, "email")
df = clean_phone_number(df, "phone_number")

Output Tables

Clean Reservations

Database: etl_gp_clean_chevalcollection
Table: reservations
Key: reservation_id_og (extracted from Opera ID list)
Row Count: One per reservation

Clean Guests

Database: etl_gp_clean_chevalcollection
Table: guests
Key: guest_id (stable UUID)
Source: Combined from reservation guest data + guest profiles
Enrichment: Nationality from guest_profiles table

Clean Rooms

Database: etl_gp_clean_chevalcollection
Table: rooms
Key: Composite (res_id, room_stay_date)
Granularity: One row per reservation per day
Revenue: Room rates from daily_rates, F&B/Other from reservation totals

Configuration

Property mapping is applied via configuration:


df = self._map_property(df, "property_id")

Challenges & Solutions

Challenge	Solution
Complex nested JSON structures	Use dot notation to access nested fields
Multiple ID lists	Custom `get_res_id()` function to extract by type
Attached profiles array	Custom `get_profile()` function to parse by profile type
Daily rates in separate table	Join by reservation ID and stay date
F&B/Other totals not daily	Assign entire total to check-in date only
Pseudo rooms	Preserve `room_pm` flag for identification

Opera-Specific Features

ID Lists

Opera stores multiple IDs (confirmation numbers, external IDs, etc.) in arrays. The cleaner extracts the primary reservation ID using type filters.

Attached Profiles

Companies and travel agents are stored in an attachedprofiles array rather than dedicated fields. The cleaner parses these using profile type filters.

Revenue Structure

Opera provides both:

Daily room rates (in separate table) - Detailed daily breakdown
Aggregate F&B/Other (in reservation) - Totals only

Room Type vs Charged Type

Opera tracks both the original room_type and the room_type_charged when a guest is assigned a different room category.

VIP Status

Opera’s VIP codes are preserved in the membership_type field.