Guestline Cleaning
The GuestlineCleaner transforms raw Guestline PMS data into our standardized format. Guestline is used by the Queensway properties and provides a comprehensive web-based property management system.
Overview
The Guestline cleaning process has several distinctive features:
- Nested data structure with deep field paths (e.g.,
data.roompick.ratelines) - Ratelines - Complex rate structures with multiple rate periods per reservation
- Multi-level revenue - Revenue items nested within ratelines
- Aggregate-based date calculation - Check-in/out dates computed from ratelines
- Separate person profiles table for guest enrichment
Data Sources
The cleaner reads from two tables in the etl_gp_raw_queensway database:
roompicks- Main reservation data with nested ratelines and revenuepersonprofiles- Guest profiles with contact information
Cleaning Workflow
Deduplication Strategy
Guestline uses a priority-based deduplication similar to other systems:
status_priority = (
F.when(F.col("data.roompick.statuscode") == "CHKO", 3)
.when(F.col("data.roompick.statuscode") == "CheckedIn", 2)
.when(F.col("data.roompick.statuscode") == "RES", 1)
.when(F.col("data.roompick.statuscode") == "Confirmed", 1)
.otherwise(0)
)
window = Window.partitionBy("data.entityid").orderBy(
F.to_timestamp("data.roompick.lasteditedtimestamp").desc(),
status_priority.desc(),
)Key: For duplicate entity IDs, the most recently edited record with the highest status priority wins.
Field Mapping
Reservation Fields
| Guestline Field | Standard Schema Field | Notes |
|---|---|---|
data.entityid | reservation_id_og | Unique booking identifier |
data.roompick.statuscode | status | Reservation status |
data.roompick.mediasource | source | Booking source |
data.roompick.channeldescription | channel | Booking channel |
data.roompick.ratelines.marketsegment[0] | market_segment | Market segment |
data.roompick.ratelines.guests.adults[0] | guests_adults | Number of adults |
data.roompick.ratelines.guests.children[0] | guests_children | Number of children |
data.roompick.creationdate | booking_date, created_timestamp | When booking was created |
data.roompick.lasteditedtimestamp | last_modified_timestamp | Last update time |
data.roompick.cancellationdate | cancellation_date | Cancellation date |
data.roompick.guestprofiles[0].profileref | guest_id_og | Guest profile reference |
data.roompick.agentname | travel_agent | Travel agent name |
data.roompick.companyname | company | Company name |
data.currencycode | currency | Currency code |
Guest Fields (from nested data.contact)
| Guestline Field | Standard Schema Field |
|---|---|
data.contact.contactgivenname | first_name |
data.contact.contactfamilyname | last_name |
data.contact.contactaddress.town | address_city |
data.contact.contactaddress.state | address_state |
data.contact.contactaddress.postcode | address_postcode |
data.contact.contactaddress.countrycode | address_country |
Person Profile Fields
Additional guest information from personprofiles table:
| Profile Field | Standard Schema Field |
|---|---|
forename | first_name |
surname | last_name |
email | email |
gender | gender |
telephonenumber | phone_number |
street | address_street |
town | address_city |
postcode | address_postcode |
country | address_country |
nationality | nationality |
vip | membership_type |
Status Mapping
| Guestline Status | Standard Status |
|---|---|
CAN | cancelled |
CHKO | checked_out |
NoSh | no_show |
CheckedIn | checked_in |
RES | confirmed |
Confirmed | confirmed |
InHouse | checked_in |
Computed Check-in/Check-out Dates
Unlike other PMS systems, Guestline doesn’t have top-level arrival/departure dates. Instead, they must be computed from ratelines:
Check-in Date (Earliest)
df = df.withColumn(
"check_in_date",
F.expr("""
aggregate(
data.roompick.ratelines,
date('9999-12-31'),
(acc, x) -> least(acc, to_date(x.from))
)
""")
)Logic: Start with a far-future date, then take the minimum of all rateline.from dates.
Check-out Date (Latest)
df = df.withColumn(
"check_out_date",
F.expr("""
aggregate(
data.roompick.ratelines,
date('0001-01-01'),
(acc, x) -> greatest(acc, to_date(x.to))
)
""")
)Logic: Start with a far-past date, then take the maximum of all rateline.to dates.
Ratelines Structure
Guestline uses ratelines to represent different rate periods within a single reservation:
{
"data": {
"roompick": {
"ratelines": [
{
"from": "2025-01-15",
"to": "2025-01-17",
"roomid": "101",
"roomtypecode": "DBL",
"rateplancode": "BAR",
"guests": {"adults": 2, "children": 0},
"marketsegment": "LEISURE",
"revenue": [
{
"revenuetypegroup": "Accommodation",
"amountbeforetax": 250.00,
"amountaftertax": 287.50
},
{
"revenuetypegroup": "FoodAndDrink",
"amountbeforetax": 50.00,
"amountaftertax": 57.50
}
]
}
]
}
}
}Key features:
- Multiple ratelines per reservation (e.g., room changes, rate changes)
- Each rateline has a date range (
from-to) - Revenue items are nested within each rateline
- Guest counts and market segment are rateline-specific
Guest Profile Enrichment
Person profiles are loaded separately and joined with reservation data:
profiles_df = self._job_context.catalog.read_table(
database="etl_gp_raw_queensway",
table="personprofiles",
dedup_keys=["entityid"],
sort_by=["synced_date"],
)
# Join on guest profile reference
df = profiles_df.join(
res_df.select("guest_id", "guest_id_og", "address_state", "created_timestamp")
.drop_duplicates(subset=["guest_id_og"]),
on=res_df["guest_id_og"] == profiles_df["profileref"],
how="left"
)Note: The join preserves address_state from reservations (not available in profiles).
Rooms Processing
The rooms processing is more complex for Guestline due to the ratelines structure:
Step 1: Explode Ratelines
df = df.withColumn("detail", F.explode("ratelines"))Each rateline becomes a separate row.
Step 2: Generate Stay Dates per Rateline
df = df.withColumn(
"room_stay_date",
F.explode(
F.sequence(
F.to_date(F.col("detail.from")),
F.date_sub(F.to_date(F.col("detail.to")), 1),
)
),
)Expands each rateline into daily records.
Step 3: Explode Revenue within Rateline
df = df.withColumn("revenue", F.explode("detail.revenue"))Each revenue item within a rateline becomes a separate row.
Step 4: Extract Room and Revenue Fields
df = (
df.withColumn("room_id", F.col("detail.roomid"))
.withColumn("room_rate_code", F.col("detail.rateplancode"))
.withColumn("room_type_code", F.col("detail.roomtypecode"))
.withColumn("amount_before_tax", F.col("revenue.amountbeforetax").cast("double"))
.withColumn("amount_after_tax", F.col("revenue.amountaftertax").cast("double"))
.withColumn("revenue_type_group", F.col("revenue.revenuetypegroup"))
)Step 5: Map Revenue Types
df = (
df.withColumn(
"room_stay_date_rate_net",
F.when(
F.col("revenue_type_group") == "Accommodation",
F.col("amount_before_tax")
).otherwise(0)
)
.withColumn(
"room_stay_date_fnb_net",
F.when(
F.col("revenue_type_group") == "FoodAndDrink",
F.col("amount_before_tax")
).otherwise(0)
)
.withColumn(
"room_stay_date_other_net",
F.when(
F.col("revenue_type_group") == "Other",
F.col("amount_before_tax")
).otherwise(0)
)
)Step 6: Aggregate by Reservation and Date
df = df.groupBy(
"chain_id", "property_id", "res_id", "room_id",
"room_type_code", "room_rate_code", "room_check_in_date",
"room_check_out_date", "room_guests_adults", "room_guests_children",
"room_stay_date", "booking_type"
).agg(
F.sum("room_stay_date_rate_net").alias("room_stay_date_rate_net"),
F.sum("room_stay_date_fnb_net").alias("room_stay_date_fnb_net"),
F.sum("room_stay_date_other_net").alias("room_stay_date_other_net"),
F.sum("room_stay_date_listing_rate_net").alias("room_stay_date_listing_rate_net"),
F.sum("room_stay_date_listing_rate_gross").alias("room_stay_date_listing_rate_gross"),
)This aggregation combines multiple revenue items for the same day into totals.
Revenue Type Groups
Guestline categorizes revenue into three groups:
Accommodation→room_stay_date_rate_netFoodAndDrink→room_stay_date_fnb_netOther→room_stay_date_other_net
Both net (before tax) and gross (after tax) amounts are available for accommodation.
Data Validation
Standard cleaning utilities are applied:
df = clean_country(df, "address_country")
df = clean_country(df, "nationality")
df = clean_email(df, "email")
df = clean_phone_number(df, "phone_number")Output Tables
Clean Reservations
- Database:
etl_gp_clean_queensway - Table:
reservations - Key:
reservation_id_og(entity ID) - Row Count: One per reservation
Clean Guests
- Database:
etl_gp_clean_queensway - Table:
guests - Key:
guest_id(stable UUID) - Source: Combined from roompicks contact data + person profiles
- Deduplication: Latest by
created_timestamp
Clean Rooms
- Database:
etl_gp_clean_queensway - Table:
rooms - Key: Composite (
property_id,res_id,room_id,room_stay_date) - Granularity: One row per reservation per room per day
- Revenue: Aggregated from exploded ratelines and revenue items
Challenges & Solutions
| Challenge | Solution |
|---|---|
| No top-level check-in/out dates | Compute from ratelines using aggregate functions |
| Complex nested ratelines | Multi-level explosion: ratelines → dates → revenue |
| Revenue at rateline level | Explode and aggregate by date |
| Multiple rate periods | Each rateline generates separate room records |
| Heavy nested JSON | Drop after extraction to reduce memory |
| Guest data in two places | Join roompicks contact with person profiles |
Guestline-Specific Features
Ratelines Architecture
Guestline’s rateline model allows for:
- Multiple rate periods within one reservation
- Room changes mid-stay
- Rate plan changes mid-stay
- Per-period guest counts and market segments
Revenue Nesting
Revenue items are nested within ratelines rather than at the reservation level, providing more granular revenue tracking.
Profile Reference System
Guests are linked via profileref in the roompick, which maps to entityid in person profiles.
VIP Status
Guestline tracks VIP status in person profiles, which is preserved in membership_type.
Example: Multi-Rateline Reservation
Consider a reservation with two ratelines:
Rateline 1: Jan 15-17, Room 101, Rate BAR, 2 adults
Revenue: Accommodation $250, F&B $50
Rateline 2: Jan 17-18, Room 202, Rate PROMO, 2 adults
Revenue: Accommodation $150, F&B $30After processing, this produces 3 daily room records:
- Jan 15: Room 101, Accommodation $250, F&B $50
- Jan 16: Room 101, Accommodation $250, F&B $50
- Jan 17: Room 202, Accommodation $150, F&B $30
See Also
- Cleaning Overview - General cleaning process
- Schema Documentation - Field definitions
- Processing Stage - Next stage after cleaning
- Guestline Documentation - Official Guestline information