Opera Cleaning
The OperaCleaner transforms raw Oracle Opera PMS data into our standardized format. Opera is used by the Cheval Collection properties and is one of the most widely used property management systems in the hospitality industry.
Overview
The Opera cleaning process has several unique characteristics:
- Nested JSON structure with complex field paths (e.g.,
roomstay.arrivaldate) - Multiple ID lists requiring custom extraction functions
- Attached profiles for companies and travel agents
- Separate daily rates table with detailed revenue breakdowns
- Pseudo rooms (PM rooms) for forecasting and overbooking management
Data Sources
The cleaner reads from three tables in the etl_gp_raw_chevalcollection database:
reservations- Main reservation data with nested structuresguest_profiles- Additional guest information (nationality)daily_rates- Daily revenue breakdown by date
Cleaning Workflow
Helper Functions
Opera uses custom utility functions to extract data from complex structures:
Extract Reservation ID
def get_res_id(col_array, type_substring):
"""Finds the reservation id from list"""
return F.when(
F.size(
F.expr(f"filter({col_array}, x -> lower(x.type) like '%{type_substring.lower()}%')")
) > 0,
F.expr(f"filter({col_array}, x -> lower(x.type) like '%{type_substring.lower()}%')")[0]["id"],
).otherwise(None)Usage: Extracts the reservation ID from reservationidlist array where type contains “reservation”:
df = df.withColumn("reservation_id_og", get_res_id("reservationidlist", "reservation"))Extract Profile Information
def get_profile(col_array, type_substring: str):
"""Finds the Travel Agent and Company"""
filter_expr = f"""
filter({col_array}, x -> lower(x.reservationProfileType) like '%{type_substring}%')
"""
return F.when(
F.size(F.expr(filter_expr)) > 0,
F.expr(f"element_at({filter_expr}, 1).name")
).otherwise(None)Usage: Extracts company and travel agent names from attachedprofiles array:
df = df.withColumn("company", get_profile("attachedprofiles", "Company"))
df = df.withColumn("travel_agent", get_profile("attachedprofiles", "TravelAgent"))Deduplication Strategy
Opera uses a priority-based deduplication similar to Athenaeum:
status_priority = (
F.when(F.col("reservationstatus") == "CheckedOut", 3)
.when(F.col("reservationstatus") == "CheckedIn", 2)
.when(F.col("reservationstatus") == "Reserved", 1)
.otherwise(0)
)
window = Window.partitionBy("reservation_id_og").orderBy(
F.to_date(F.col("lastmodifydatetime")).desc(),
status_priority.desc(),
)Key: For duplicate reservation IDs, the most recent update with the highest status priority wins.
Field Mapping
Reservation Fields
| Opera Field | Standard Schema Field | Notes |
|---|---|---|
reservationidlist → extract by type | reservation_id_og | Extracted from array using get_res_id() |
reservationguest.id | guest_id_og | Primary guest ID |
createdatetime | booking_date, created_timestamp | When reservation was created |
lastmodifydatetime | last_modified_timestamp | Last update time |
roomstay.arrivaldate | check_in_date | Check-in date |
roomstay.departuredate | check_out_date | Check-out date |
reservationstatus | status | Reservation status |
roomstay.bookingchannelcode | channel | Booking channel |
roomstay.sourcecode | source | Source code |
roomstay.marketcode | market_segment | Market segment |
roomstay.adultcount | guests_adults | Number of adults |
roomstay.childcount | guests_children | Number of children |
roomstay.rateamount.currencycode | currency | Currency code |
attachedprofiles → extract by type | company, travel_agent | Parsed using get_profile() |
Guest Fields (from nested reservationguest)
| Opera Field | Standard Schema Field |
|---|---|
reservationguest.givenname | first_name |
reservationguest.surname | last_name |
reservationguest.email | email |
reservationguest.birthdate | birth_date |
reservationguest.phonenumber | phone_number |
reservationguest.address.streetaddress | address_street |
reservationguest.address.cityname | address_city |
reservationguest.address.postalcode | address_postcode |
reservationguest.address.state | address_state |
reservationguest.address.country.code | address_country |
reservationguest.vip.vipCode | membership_type |
Room Fields (from roomstay)
| Opera Field | Standard Schema Field |
|---|---|
roomstay.roomid | room_id |
roomstay.roomtype | room_type |
roomstay.rateplancode | room_rate_code |
roomstay.roomclass | room_type_category |
roomstay.roomtypecharged | room_type_charged |
roomstay.pseudoRoom | room_pm (pseudo/PM room flag) |
Status Mapping
| Opera Status | Standard Status |
|---|---|
Cancelled | cancelled |
CheckedOut | checked_out |
NoShow | no_show |
CheckedIn | checked_in |
Reserved | confirmed |
DueIn | confirmed |
DueOut | confirmed |
InHouse | in_house |
Daily Rates Processing
Opera provides detailed daily revenue in the daily_rates table:
Source Structure
{
"reservationid": "12345",
"dailyrates": {
"details": [
{"summarydate": "2025-01-15", "revenue": 250.00},
{"summarydate": "2025-01-16", "revenue": 250.00}
]
}
}Processing Logic
# Explode nested daily rates
df = df.withColumn("detail", F.explode("dailyrates.details"))
df = df.select(
F.col("reservationid").alias("reservation_id_og"),
F.to_date("detail.summarydate").alias("room_stay_date"),
F.col("detail.revenue").cast("double").alias("room_stay_date_rate_net"),
F.col("synced_date"),
)Deduplication
For each reservation and date, keep only the most recent synced data:
window = Window.partitionBy("reservation_id_og", "room_stay_date").orderBy(
F.col("synced_date").desc()
)
df = (
df.withColumn("rn", F.row_number().over(window))
.filter("rn = 1")
.drop("rn", "synced_date")
)Guest Profile Enrichment
Additional guest information (nationality) is loaded from guest_profiles:
df = self._job_context.catalog.read_table(
database="etl_gp_raw_chevalcollection",
table="guest_profiles",
)
df = (
df.withColumn("guest_id_og", get_res_id("guestIdList", "Profile"))
.withColumn("nationality", F.col("guestDetails.customer.nationality"))
)
# Join with reservation guests
guest_df = guest_df.join(profiles_df, on="guest_id_og", how="left")Rooms Processing
Stay Date Expansion
Daily room records are generated by expanding the stay into individual dates:
df = df.withColumn(
"room_stay_date",
F.explode(
F.when(
# Same-day checkout
F.to_date("check_in_date") == F.col("check_out_date"),
F.array(F.to_date("check_in_date"))
).otherwise(
# Multi-night stay
F.sequence(
F.to_date("check_in_date"),
F.date_sub(F.to_date("check_out_date"), 1),
)
)
),
)Revenue Join
Daily rates are joined by reservation ID and stay date:
df = df.join(
rates,
on=["reservation_id_og", "room_stay_date"],
how="left",
)F&B and Other Revenue
Opera provides aggregate F&B and other revenue at the reservation level:
df = df.withColumn(
"room_stay_date_fnb_net",
F.coalesce(
F.col("revenuesandbalances.foodAndBevRevenue.amount").cast("double"),
F.lit(0.0),
),
).withColumn(
"room_stay_date_other_net",
F.coalesce(
F.col("revenuesandbalances.otherRevenue.amount").cast("double"),
F.lit(0.0),
),
)Important: These totals are only assigned to the check-in date to avoid double-counting:
df = df.withColumn(
"room_stay_date_fnb_net",
F.when(
F.col("room_stay_date") == F.to_date("room_check_in_date"),
F.col("room_stay_date_fnb_net")
).otherwise(F.lit(0.0))
)Pseudo Rooms (PM Rooms)
Opera supports pseudo rooms (PM rooms) for inventory management:
df = df.withColumn("room_pm", F.col("roomstay.pseudoRoom"))Purpose: PM rooms are used for:
- Forecasting and planning
- Overbooking management
- Group blocks without specific room assignments
Records with room_pm = true don’t have actual room assignments yet.
Data Validation
Standard cleaning utilities are applied:
df = clean_country(df, "address_country")
df = clean_country(df, "nationality")
df = clean_email(df, "email")
df = clean_phone_number(df, "phone_number")Output Tables
Clean Reservations
- Database:
etl_gp_clean_chevalcollection - Table:
reservations - Key:
reservation_id_og(extracted from Opera ID list) - Row Count: One per reservation
Clean Guests
- Database:
etl_gp_clean_chevalcollection - Table:
guests - Key:
guest_id(stable UUID) - Source: Combined from reservation guest data + guest profiles
- Enrichment: Nationality from guest_profiles table
Clean Rooms
- Database:
etl_gp_clean_chevalcollection - Table:
rooms - Key: Composite (
res_id,room_stay_date) - Granularity: One row per reservation per day
- Revenue: Room rates from daily_rates, F&B/Other from reservation totals
Configuration
Property mapping is applied via configuration:
df = self._map_property(df, "property_id")Challenges & Solutions
| Challenge | Solution |
|---|---|
| Complex nested JSON structures | Use dot notation to access nested fields |
| Multiple ID lists | Custom get_res_id() function to extract by type |
| Attached profiles array | Custom get_profile() function to parse by profile type |
| Daily rates in separate table | Join by reservation ID and stay date |
| F&B/Other totals not daily | Assign entire total to check-in date only |
| Pseudo rooms | Preserve room_pm flag for identification |
Opera-Specific Features
ID Lists
Opera stores multiple IDs (confirmation numbers, external IDs, etc.) in arrays. The cleaner extracts the primary reservation ID using type filters.
Attached Profiles
Companies and travel agents are stored in an attachedprofiles array rather than dedicated fields. The cleaner parses these using profile type filters.
Revenue Structure
Opera provides both:
- Daily room rates (in separate table) - Detailed daily breakdown
- Aggregate F&B/Other (in reservation) - Totals only
Room Type vs Charged Type
Opera tracks both the original room_type and the room_type_charged when a guest is assigned a different room category.
VIP Status
Opera’s VIP codes are preserved in the membership_type field.
See Also
- Cleaning Overview - General cleaning process
- Schema Documentation - Field definitions
- Processing Stage - Next stage after cleaning
- Oracle Hospitality Opera - Official Opera documentation