Skip to Content

Athenaeum Cleaning

The AthenaeumCleaner transforms raw Athenaeum PMS data into our standardized format. Athenaeum is a property management system used by the Athenaeum Hotel in London.

Overview

The Athenaeum cleaning process is unique in several ways:

  1. Guest data is extracted from revenue transactions rather than a dedicated guests table
  2. Complex transaction aggregation handles both current and future room transactions
  3. Multiple transaction types are categorized: Room, Adjustment, F&B, and Other
  4. Custom fields are preserved (e.g., donotmove room preferences)

Data Sources

The cleaner reads from three raw tables in the etl_gp_raw_athenaeum database:

  • reservations - Main reservation data
  • revenues - Transaction-level revenue data (also contains guest information)
  • No dedicated guests table - guest info extracted from revenues

Cleaning Workflow

Deduplication Strategy

Athenaeum uses a priority-based deduplication approach:

status_priority = ( F.when(F.col("statuscode") == "CHKOUT", 3) # Checked out wins .when(F.col("statuscode") == "CHKIN", 2) # Then checked in .when(F.col("statuscode") == "CONFIRMED", 1) # Then confirmed .otherwise(0) ) window = Window.partitionBy("confirmationnumber").orderBy( F.to_date(F.col("updatedon")).desc(), status_priority.desc(), )

Key: For duplicate confirmation numbers, the most recent update with the highest status priority wins.

Field Mapping

Reservation Fields

Athenaeum FieldStandard Schema FieldNotes
profileidguest_id_ogOriginal guest identifier
confirmationnumberreservation_id_ogUnique reservation identifier
resvdatebooking_dateWhen reservation was made
statuscodestatusMapped to standard statuses
arrivaldatecheck_in_dateCheck-in date
departuredatecheck_out_dateCheck-out date
companynamecompanyCorporate account
taname1travel_agentTravel agent name
marketsegmentmarket_segmentMarket segment
groupnamegroupGroup booking name
adultsguests_adultsNumber of adults
childrenguests_childrenNumber of children
roomroom_idRoom number
roomtyperoom_typeRoom type code
ROOMTYPCHGroom_type_chargedCharged room type (if different)
ROOMTYPCHGREASONroom_type_charged_reasonReason for room type change
rateplanroom_rate_codeRate plan code
emailaddemailGuest email
mobilenumberphone_numberGuest phone
countryaddress_countryCountry code
nationalitynationalityNationality code

Status Mapping

Athenaeum StatusStandard Status
CANCELEDcancelled
CHKOUTchecked_out
NOSHOWno_show
CHKINchecked_in
CONFIRMEDconfirmed
INHOUSEchecked_in
WAITLISTwaitlist

Guest Extraction

Athenaeum doesn’t provide a separate guests table. Instead, guest names are extracted from the revenue transactions:

# Extract first and last name from RVL_ACCOUNTNAME # Remove common prefixes: Mr, Mrs, Ms, Dr, Miss, Prof filtered_col = F.expr( """ filter( split( trim( regexp_replace(RVL_ACCOUNTNAME, '(?i)\\b(Mr|Mrs|Ms|Dr|Miss|Mister|Prof)\\b\\.?\\s*', '' ) ), ' ' ), x -> length(x) > 1 ) """ ) df = ( df.withColumn("filtered_parts", filtered_col) .withColumn("first_name", F.trim(F.lower(F.expr("filtered_parts[0]")))) .withColumn("last_name", F.trim(F.lower(F.expr("filtered_parts[size(filtered_parts)-1]")))) )

Process:

  1. Remove title prefixes (Mr, Mrs, Dr, etc.)
  2. Split name on spaces
  3. Filter out short segments (≤1 character)
  4. First element → first_name
  5. Last element → last_name

Transaction Aggregation

The Athenaeum cleaner performs sophisticated revenue aggregation to create daily room rates. Transactions are categorized into four types:

Transaction Categories

  1. Room - Accommodation charges
  2. Adjustment - Room rate adjustments (discounts, credits)
  3. F&B - Food & Beverage charges
  4. Other - Miscellaneous charges (packages, gratuities)

Current vs Future Transactions

Athenaeum distinguishes between:

  • Current transactions (rvl_recordtype == "CURRENT") - Historical/posted charges
  • Future transactions - Forecasted charges beyond the latest current date
# Find latest current room transaction date max_current_date = room_current.agg(F.max("rvl_date")).collect()[0][0] # Filter future transactions beyond that date room_future = df.filter( F.col("rvl_transactioncode").isin(room_codes) & (F.to_date(F.col("rvl_date")) > F.lit(max_current_date)) )

For future transactions, only the most recent update per reservation/date/revenue type is kept:

window = Window.partitionBy( "rvl_confirmationnumber", "rvl_date", "rvl_revenuetype", ).orderBy(F.to_date(F.col("rvl_updatedon")).desc()) room_future = room_future.withColumn("row_num", F.row_number().over(window)) .filter(F.col("row_num") == 1)

Revenue Fields

Transactions are aggregated by reservation and date:

Revenue TypeNet Amount FieldGross Amount Field
Roomroom_stay_date_rate_netroom_stay_date_listing_rate_gross
Adjustmentroom_stay_date_rate_net (adjusts room)-
F&Broom_stay_date_fnb_net-
Otherroom_stay_date_other_net-

Custom Fields

Athenaeum-specific fields are preserved with the athenaeum_ prefix:

  • athenaeum_donotmove - Flag indicating guest should not be moved to a different room
  • athenaeum_donotmove_reason - Reason for the “do not move” restriction

Rooms Processing

The cleaner generates daily room records by:

  1. Expanding stays into individual dates:
df = df.withColumn( "room_stay_date", F.explode( F.sequence( F.to_date(F.col("check_in_date")), F.date_sub(F.to_date(F.col("check_out_date")), 1), ) ), )
  1. Joining revenue aggregations by reservation ID and stay date

  2. Handling unmatched transactions - Creates separate records for:

    • Room charges without matching reservations
    • F&B charges without matching dates
    • Other charges without matching dates
  3. Marking non-room revenue:

df = df.withColumn("booking_type", F.lit(BookingTypes.NON_ROOM.value))

Data Validation

Standard cleaning utilities are applied:

string_columns = [ "first_name", "last_name", "phone_number", "email", "company", "travel_agent", "group", "source", "secondary_source", "room_rate_code" ] df = clean_string_columns(df, string_columns) df = clean_country(df, "address_country") df = clean_country(df, "nationality") df = clean_email(df, "email") df = clean_phone_number(df, "phone_number")

Additionally, empty/invalid values are cleaned:

df = df.select([ F.when(F.col(c).isin("NaN", "", "N/A", "***"), None) .otherwise(F.col(c)) .alias(c) for c in df.columns ])

Output Tables

Clean Reservations

  • Database: etl_gp_clean_athenaeum
  • Table: reservations
  • Key: reservation_id_og (confirmation number)
  • Row Count: One per reservation

Clean Guests

  • Database: etl_gp_clean_athenaeum
  • Table: guests
  • Key: guest_id (stable UUID)
  • Deduplication: Latest by created_timestamp
  • Source: Extracted from revenue transactions

Clean Rooms

  • Database: etl_gp_clean_athenaeum
  • Table: rooms
  • Key: Composite (property_id, res_id, room_stay_date, booking_type)
  • Granularity: One row per reservation per day
  • Special: Includes non-room revenue records

Configuration

Transaction codes are defined in the job configuration:

self.trx_codes = self._job_context.config.get_key("transaction_codes") # { # "room": ["CODE1", "CODE2", ...], # "adjustment": ["ADJ1", "ADJ2", ...], # "fnb": mapped via transaction types (NONRM1, NONRM2), # "other": mapped via transaction types (NONRM3, NONRM4, NONRM7, PKG, GRATS) # }

Challenges & Solutions

ChallengeSolution
No dedicated guests tableExtract guest names from revenue account names
Current vs future transactionsTrack max current date and filter accordingly
Multiple transaction typesCategorize and aggregate separately, then join
Unmatched revenueCreate separate records for non-room bookings
Room type changesPreserve both original and charged room types

See Also

Last updated on