Athenaeum Cleaning
The AthenaeumCleaner transforms raw Athenaeum PMS data into our standardized format. Athenaeum is a property management system used by the Athenaeum Hotel in London.
Overview
The Athenaeum cleaning process is unique in several ways:
- Guest data is extracted from revenue transactions rather than a dedicated guests table
- Complex transaction aggregation handles both current and future room transactions
- Multiple transaction types are categorized: Room, Adjustment, F&B, and Other
- Custom fields are preserved (e.g.,
donotmoveroom preferences)
Data Sources
The cleaner reads from three raw tables in the etl_gp_raw_athenaeum database:
reservations- Main reservation datarevenues- Transaction-level revenue data (also contains guest information)- No dedicated guests table - guest info extracted from revenues
Cleaning Workflow
Deduplication Strategy
Athenaeum uses a priority-based deduplication approach:
status_priority = (
F.when(F.col("statuscode") == "CHKOUT", 3) # Checked out wins
.when(F.col("statuscode") == "CHKIN", 2) # Then checked in
.when(F.col("statuscode") == "CONFIRMED", 1) # Then confirmed
.otherwise(0)
)
window = Window.partitionBy("confirmationnumber").orderBy(
F.to_date(F.col("updatedon")).desc(),
status_priority.desc(),
)Key: For duplicate confirmation numbers, the most recent update with the highest status priority wins.
Field Mapping
Reservation Fields
| Athenaeum Field | Standard Schema Field | Notes |
|---|---|---|
profileid | guest_id_og | Original guest identifier |
confirmationnumber | reservation_id_og | Unique reservation identifier |
resvdate | booking_date | When reservation was made |
statuscode | status | Mapped to standard statuses |
arrivaldate | check_in_date | Check-in date |
departuredate | check_out_date | Check-out date |
companyname | company | Corporate account |
taname1 | travel_agent | Travel agent name |
marketsegment | market_segment | Market segment |
groupname | group | Group booking name |
adults | guests_adults | Number of adults |
children | guests_children | Number of children |
room | room_id | Room number |
roomtype | room_type | Room type code |
ROOMTYPCHG | room_type_charged | Charged room type (if different) |
ROOMTYPCHGREASON | room_type_charged_reason | Reason for room type change |
rateplan | room_rate_code | Rate plan code |
emailadd | email | Guest email |
mobilenumber | phone_number | Guest phone |
country | address_country | Country code |
nationality | nationality | Nationality code |
Status Mapping
| Athenaeum Status | Standard Status |
|---|---|
CANCELED | cancelled |
CHKOUT | checked_out |
NOSHOW | no_show |
CHKIN | checked_in |
CONFIRMED | confirmed |
INHOUSE | checked_in |
WAITLIST | waitlist |
Guest Extraction
Athenaeum doesn’t provide a separate guests table. Instead, guest names are extracted from the revenue transactions:
# Extract first and last name from RVL_ACCOUNTNAME
# Remove common prefixes: Mr, Mrs, Ms, Dr, Miss, Prof
filtered_col = F.expr(
"""
filter(
split(
trim(
regexp_replace(RVL_ACCOUNTNAME,
'(?i)\\b(Mr|Mrs|Ms|Dr|Miss|Mister|Prof)\\b\\.?\\s*',
''
)
),
' '
),
x -> length(x) > 1
)
"""
)
df = (
df.withColumn("filtered_parts", filtered_col)
.withColumn("first_name", F.trim(F.lower(F.expr("filtered_parts[0]"))))
.withColumn("last_name", F.trim(F.lower(F.expr("filtered_parts[size(filtered_parts)-1]"))))
)Process:
- Remove title prefixes (Mr, Mrs, Dr, etc.)
- Split name on spaces
- Filter out short segments (≤1 character)
- First element →
first_name - Last element →
last_name
Transaction Aggregation
The Athenaeum cleaner performs sophisticated revenue aggregation to create daily room rates. Transactions are categorized into four types:
Transaction Categories
- Room - Accommodation charges
- Adjustment - Room rate adjustments (discounts, credits)
- F&B - Food & Beverage charges
- Other - Miscellaneous charges (packages, gratuities)
Current vs Future Transactions
Athenaeum distinguishes between:
- Current transactions (
rvl_recordtype == "CURRENT") - Historical/posted charges - Future transactions - Forecasted charges beyond the latest current date
# Find latest current room transaction date
max_current_date = room_current.agg(F.max("rvl_date")).collect()[0][0]
# Filter future transactions beyond that date
room_future = df.filter(
F.col("rvl_transactioncode").isin(room_codes) &
(F.to_date(F.col("rvl_date")) > F.lit(max_current_date))
)For future transactions, only the most recent update per reservation/date/revenue type is kept:
window = Window.partitionBy(
"rvl_confirmationnumber",
"rvl_date",
"rvl_revenuetype",
).orderBy(F.to_date(F.col("rvl_updatedon")).desc())
room_future = room_future.withColumn("row_num", F.row_number().over(window))
.filter(F.col("row_num") == 1)Revenue Fields
Transactions are aggregated by reservation and date:
| Revenue Type | Net Amount Field | Gross Amount Field |
|---|---|---|
| Room | room_stay_date_rate_net | room_stay_date_listing_rate_gross |
| Adjustment | room_stay_date_rate_net (adjusts room) | - |
| F&B | room_stay_date_fnb_net | - |
| Other | room_stay_date_other_net | - |
Custom Fields
Athenaeum-specific fields are preserved with the athenaeum_ prefix:
athenaeum_donotmove- Flag indicating guest should not be moved to a different roomathenaeum_donotmove_reason- Reason for the “do not move” restriction
Rooms Processing
The cleaner generates daily room records by:
- Expanding stays into individual dates:
df = df.withColumn(
"room_stay_date",
F.explode(
F.sequence(
F.to_date(F.col("check_in_date")),
F.date_sub(F.to_date(F.col("check_out_date")), 1),
)
),
)-
Joining revenue aggregations by reservation ID and stay date
-
Handling unmatched transactions - Creates separate records for:
- Room charges without matching reservations
- F&B charges without matching dates
- Other charges without matching dates
-
Marking non-room revenue:
df = df.withColumn("booking_type", F.lit(BookingTypes.NON_ROOM.value))Data Validation
Standard cleaning utilities are applied:
string_columns = [
"first_name", "last_name", "phone_number", "email",
"company", "travel_agent", "group", "source",
"secondary_source", "room_rate_code"
]
df = clean_string_columns(df, string_columns)
df = clean_country(df, "address_country")
df = clean_country(df, "nationality")
df = clean_email(df, "email")
df = clean_phone_number(df, "phone_number")Additionally, empty/invalid values are cleaned:
df = df.select([
F.when(F.col(c).isin("NaN", "", "N/A", "***"), None)
.otherwise(F.col(c))
.alias(c)
for c in df.columns
])Output Tables
Clean Reservations
- Database:
etl_gp_clean_athenaeum - Table:
reservations - Key:
reservation_id_og(confirmation number) - Row Count: One per reservation
Clean Guests
- Database:
etl_gp_clean_athenaeum - Table:
guests - Key:
guest_id(stable UUID) - Deduplication: Latest by
created_timestamp - Source: Extracted from revenue transactions
Clean Rooms
- Database:
etl_gp_clean_athenaeum - Table:
rooms - Key: Composite (
property_id,res_id,room_stay_date,booking_type) - Granularity: One row per reservation per day
- Special: Includes non-room revenue records
Configuration
Transaction codes are defined in the job configuration:
self.trx_codes = self._job_context.config.get_key("transaction_codes")
# {
# "room": ["CODE1", "CODE2", ...],
# "adjustment": ["ADJ1", "ADJ2", ...],
# "fnb": mapped via transaction types (NONRM1, NONRM2),
# "other": mapped via transaction types (NONRM3, NONRM4, NONRM7, PKG, GRATS)
# }Challenges & Solutions
| Challenge | Solution |
|---|---|
| No dedicated guests table | Extract guest names from revenue account names |
| Current vs future transactions | Track max current date and filter accordingly |
| Multiple transaction types | Categorize and aggregate separately, then join |
| Unmatched revenue | Create separate records for non-room bookings |
| Room type changes | Preserve both original and charged room types |
See Also
- Cleaning Overview - General cleaning process
- Schema Documentation - Field definitions
- Processing Stage - Next stage after cleaning