Athenaeum Cleaning

The AthenaeumCleaner transforms raw Athenaeum PMS data into our standardized format. Athenaeum is a property management system used by the Athenaeum Hotel in London.

Overview

The Athenaeum cleaning process is unique in several ways:

Guest data is extracted from revenue transactions rather than a dedicated guests table
Complex transaction aggregation handles both current and future room transactions
Multiple transaction types are categorized: Room, Adjustment, F&B, and Other
Custom fields are preserved (e.g., donotmove room preferences)

Data Sources

The cleaner reads from three raw tables in the etl_gp_raw_athenaeum database:

reservations - Main reservation data
revenues - Transaction-level revenue data (also contains guest information)
No dedicated guests table - guest info extracted from revenues

Cleaning Workflow

Deduplication Strategy

Athenaeum uses a priority-based deduplication approach:


status_priority = (
    F.when(F.col("statuscode") == "CHKOUT", 3)    # Checked out wins
    .when(F.col("statuscode") == "CHKIN", 2)       # Then checked in
    .when(F.col("statuscode") == "CONFIRMED", 1)   # Then confirmed
    .otherwise(0)
)
 
window = Window.partitionBy("confirmationnumber").orderBy(
    F.to_date(F.col("updatedon")).desc(),
    status_priority.desc(),
)

Key: For duplicate confirmation numbers, the most recent update with the highest status priority wins.

Field Mapping

Reservation Fields

Athenaeum Field	Standard Schema Field	Notes
`profileid`	`guest_id_og`	Original guest identifier
`confirmationnumber`	`reservation_id_og`	Unique reservation identifier
`resvdate`	`booking_date`	When reservation was made
`statuscode`	`status`	Mapped to standard statuses
`arrivaldate`	`check_in_date`	Check-in date
`departuredate`	`check_out_date`	Check-out date
`companyname`	`company`	Corporate account
`taname1`	`travel_agent`	Travel agent name
`marketsegment`	`market_segment`	Market segment
`groupname`	`group`	Group booking name
`adults`	`guests_adults`	Number of adults
`children`	`guests_children`	Number of children
`room`	`room_id`	Room number
`roomtype`	`room_type`	Room type code
`ROOMTYPCHG`	`room_type_charged`	Charged room type (if different)
`ROOMTYPCHGREASON`	`room_type_charged_reason`	Reason for room type change
`rateplan`	`room_rate_code`	Rate plan code
`emailadd`	`email`	Guest email
`mobilenumber`	`phone_number`	Guest phone
`country`	`address_country`	Country code
`nationality`	`nationality`	Nationality code

Status Mapping

Athenaeum Status	Standard Status
`CANCELED`	`cancelled`
`CHKOUT`	`checked_out`
`NOSHOW`	`no_show`
`CHKIN`	`checked_in`
`CONFIRMED`	`confirmed`
`INHOUSE`	`checked_in`
`WAITLIST`	`waitlist`

Guest Extraction

Athenaeum doesn’t provide a separate guests table. Instead, guest names are extracted from the revenue transactions:


# Extract first and last name from RVL_ACCOUNTNAME
# Remove common prefixes: Mr, Mrs, Ms, Dr, Miss, Prof
filtered_col = F.expr(
    """
    filter(
        split(
            trim(
                regexp_replace(RVL_ACCOUNTNAME, 
                    '(?i)\\b(Mr|Mrs|Ms|Dr|Miss|Mister|Prof)\\b\\.?\\s*', 
                    ''
                )
            ), 
            ' '
        ),
        x -> length(x) > 1
    )
    """
)
 
df = (
    df.withColumn("filtered_parts", filtered_col)
      .withColumn("first_name", F.trim(F.lower(F.expr("filtered_parts[0]"))))
      .withColumn("last_name", F.trim(F.lower(F.expr("filtered_parts[size(filtered_parts)-1]"))))
)

Process:

Remove title prefixes (Mr, Mrs, Dr, etc.)
Split name on spaces
Filter out short segments (≤1 character)
First element → first_name
Last element → last_name

Transaction Aggregation

The Athenaeum cleaner performs sophisticated revenue aggregation to create daily room rates. Transactions are categorized into four types:

Transaction Categories

Room - Accommodation charges
Adjustment - Room rate adjustments (discounts, credits)
F&B - Food & Beverage charges
Other - Miscellaneous charges (packages, gratuities)

Current vs Future Transactions

Athenaeum distinguishes between:

Current transactions (rvl_recordtype == "CURRENT") - Historical/posted charges
Future transactions - Forecasted charges beyond the latest current date


# Find latest current room transaction date
max_current_date = room_current.agg(F.max("rvl_date")).collect()[0][0]
 
# Filter future transactions beyond that date
room_future = df.filter(
    F.col("rvl_transactioncode").isin(room_codes) &
    (F.to_date(F.col("rvl_date")) > F.lit(max_current_date))
)

For future transactions, only the most recent update per reservation/date/revenue type is kept:


window = Window.partitionBy(
    "rvl_confirmationnumber",
    "rvl_date",
    "rvl_revenuetype",
).orderBy(F.to_date(F.col("rvl_updatedon")).desc())
 
room_future = room_future.withColumn("row_num", F.row_number().over(window))
                         .filter(F.col("row_num") == 1)

Revenue Fields

Transactions are aggregated by reservation and date:

Revenue Type	Net Amount Field	Gross Amount Field
Room	`room_stay_date_rate_net`	`room_stay_date_listing_rate_gross`
Adjustment	`room_stay_date_rate_net` (adjusts room)	-
F&B	`room_stay_date_fnb_net`	-
Other	`room_stay_date_other_net`	-

Custom Fields

Athenaeum-specific fields are preserved with the athenaeum_ prefix:

athenaeum_donotmove - Flag indicating guest should not be moved to a different room
athenaeum_donotmove_reason - Reason for the “do not move” restriction

Rooms Processing

The cleaner generates daily room records by:

Expanding stays into individual dates:


df = df.withColumn(
    "room_stay_date",
    F.explode(
        F.sequence(
            F.to_date(F.col("check_in_date")),
            F.date_sub(F.to_date(F.col("check_out_date")), 1),
        )
    ),
)

Joining revenue aggregations by reservation ID and stay date
Handling unmatched transactions - Creates separate records for:
- Room charges without matching reservations
- F&B charges without matching dates
- Other charges without matching dates
Marking non-room revenue:


df = df.withColumn("booking_type", F.lit(BookingTypes.NON_ROOM.value))

Data Validation

Standard cleaning utilities are applied:


string_columns = [
    "first_name", "last_name", "phone_number", "email",
    "company", "travel_agent", "group", "source", 
    "secondary_source", "room_rate_code"
]
 
df = clean_string_columns(df, string_columns)
df = clean_country(df, "address_country")
df = clean_country(df, "nationality")
df = clean_email(df, "email")
df = clean_phone_number(df, "phone_number")

Additionally, empty/invalid values are cleaned:


df = df.select([
    F.when(F.col(c).isin("NaN", "", "N/A", "***"), None)
     .otherwise(F.col(c))
     .alias(c) 
    for c in df.columns
])

Output Tables

Clean Reservations

Database: etl_gp_clean_athenaeum
Table: reservations
Key: reservation_id_og (confirmation number)
Row Count: One per reservation

Clean Guests

Database: etl_gp_clean_athenaeum
Table: guests
Key: guest_id (stable UUID)
Deduplication: Latest by created_timestamp
Source: Extracted from revenue transactions

Clean Rooms

Database: etl_gp_clean_athenaeum
Table: rooms
Key: Composite (property_id, res_id, room_stay_date, booking_type)
Granularity: One row per reservation per day
Special: Includes non-room revenue records

Configuration

Transaction codes are defined in the job configuration:


self.trx_codes = self._job_context.config.get_key("transaction_codes")
# {
#   "room": ["CODE1", "CODE2", ...],
#   "adjustment": ["ADJ1", "ADJ2", ...],
#   "fnb": mapped via transaction types (NONRM1, NONRM2),
#   "other": mapped via transaction types (NONRM3, NONRM4, NONRM7, PKG, GRATS)
# }

Challenges & Solutions

Challenge	Solution
No dedicated guests table	Extract guest names from revenue account names
Current vs future transactions	Track max current date and filter accordingly
Multiple transaction types	Categorize and aggregate separately, then join
Unmatched revenue	Create separate records for non-room bookings
Room type changes	Preserve both original and charged room types