Processing
Processing is where business rules and calculations are applied consistently.
The processing stage follows the process chain described below; runtime order is defined by etl_lib.workflow.Workflow.
The processing stage runs a small set of focused Spark workflows that enrich and normalise the reservation, room and guest data used by reporting and downstream workflows. Below is the canonical order used in our jobs; individual pipelines may skip or repeat steps depending on configuration. Illustrative sketches for each stage follow the list.
Stages
- Reservation Metrics — Compute booking and cancellation windows and stay nights on the reservation DataFrame.
- Room Metrics — Compute per-room-day flags (stay day / stay night), length-of-stay and per-reservation aggregates such as total net revenue and guest counts.
- Guest Matching — Run Splink-based record linkage to group guest records into clusters (uses a saved model per chain; trains if missing). Produces guest_cluster_id and keeps the most recent record per cluster.
- Guest Matching Check — Lightweight safeguard that ensures every guest has a guest_cluster_id (copies guest_id if missing).
- Guest Loyalty — Aggregate reservation history per guest_cluster_id and annotate guests and reservations with loyalty metrics (stays, total nights, lifetime value, last booked property/channel/source, nth booking, returning guest flag, etc.).
- Replace Values — Apply configured value mappings to reservation-, room- and guest-level columns. Original values are preserved in an *_og column per replaced field.
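As a rough illustration of the Reservation Metrics stage, the following PySpark sketch computes the three metrics named above; the date column names (booking_date, cancellation_date, arrival_date, departure_date) are assumptions, not the pipeline's actual schema.

```python
from pyspark.sql import functions as F

def add_reservation_metrics(res_df):
    """Booking/cancellation windows and stay nights (column names assumed)."""
    return (
        res_df
        # days between booking and arrival
        .withColumn("booking_window", F.datediff("arrival_date", "booking_date"))
        # days between cancellation and arrival (null when not cancelled)
        .withColumn("cancellation_window", F.datediff("arrival_date", "cancellation_date"))
        # number of nights stayed
        .withColumn("stay_nights", F.datediff("departure_date", "arrival_date"))
    )
```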
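The Room Metrics stage operates on the per-room-day grain. The sketch below shows one plausible shape; the rooms_df columns (reservation_id, room_date, arrival_date, departure_date, net_revenue, guest_count) and the stay-night rule are assumed rather than taken from the actual job.

```python
from pyspark.sql import functions as F

def add_room_metrics(rooms_df):
    """Per-room-day flags plus per-reservation aggregates (columns assumed)."""
    flagged = (
        rooms_df
        # a room-day counts as a stay night if it falls before the departure date
        .withColumn("is_stay_night", (F.col("room_date") < F.col("departure_date")).cast("int"))
        # length of stay of the reservation the room-day belongs to
        .withColumn("length_of_stay", F.datediff("departure_date", "arrival_date"))
    )
    # per-reservation aggregates such as total net revenue and guest counts
    totals = flagged.groupBy("reservation_id").agg(
        F.sum("net_revenue").alias("total_net_revenue"),
        F.sum("guest_count").alias("total_guests"),
    )
    return flagged.join(totals, on="reservation_id", how="left")
```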
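The Splink linkage itself is model-driven and chain-specific, so it is not reproduced here. The sketch below covers the two plain-Spark behaviours described above, assuming clustering has already added guest_cluster_id and that guest records carry an updated_at timestamp: keep the most recent record per cluster, and the Guest Matching Check fallback that copies guest_id when the cluster id is missing.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def keep_latest_per_cluster(guest_df):
    """Keep the most recent guest record per cluster (updated_at is assumed)."""
    w = Window.partitionBy("guest_cluster_id").orderBy(F.col("updated_at").desc())
    return (
        guest_df
        .withColumn("_rn", F.row_number().over(w))
        .filter(F.col("_rn") == 1)
        .drop("_rn")
    )

def ensure_cluster_id(guest_df):
    """Guest Matching Check: fall back to guest_id when guest_cluster_id is null."""
    return guest_df.withColumn(
        "guest_cluster_id", F.coalesce("guest_cluster_id", "guest_id")
    )
```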
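Guest Loyalty is an aggregation over reservation history per guest_cluster_id. The sketch below computes a few of the listed metrics with assumed column names (stay_nights, total_net_revenue, booking_date) and joins them back onto the guest DataFrame.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def add_loyalty_metrics(res_df, guest_df):
    """Aggregate history per cluster and annotate guests and reservations (columns assumed)."""
    loyalty = res_df.groupBy("guest_cluster_id").agg(
        F.count("*").alias("stays"),
        F.sum("stay_nights").alias("total_nights"),
        F.sum("total_net_revenue").alias("lifetime_value"),
    )
    # nth booking and returning-guest flag per reservation
    w = Window.partitionBy("guest_cluster_id").orderBy("booking_date")
    res_df = (
        res_df
        .withColumn("nth_booking", F.row_number().over(w))
        .withColumn("is_returning_guest", (F.col("nth_booking") > 1).cast("int"))
    )
    guest_df = guest_df.join(loyalty, on="guest_cluster_id", how="left")
    return res_df, guest_df
```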
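Replace Values can be pictured as a per-column mapping applied while the original value is preserved in an *_og column. The helper and mapping structure below are illustrative assumptions, not the configured implementation.

```python
from pyspark.sql import functions as F

def replace_values(df, mappings):
    """Apply {column: {old: new}} mappings, keeping the original in <column>_og."""
    for column, mapping in mappings.items():
        # preserve the original value before replacing
        df = df.withColumn(f"{column}_og", F.col(column))
        replaced = F.col(column)
        for old, new in mapping.items():
            replaced = F.when(F.col(column) == old, F.lit(new)).otherwise(replaced)
        df = df.withColumn(column, replaced)
    return df

# e.g. replace_values(res_df, {"channel": {"B.COM": "Booking.com"}})
```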
Each stage expects certain DataFrames to be present on the workflow context (typically res_df, rooms_df and guest_df) and writes the enriched DataFrames back to the same attributes. See each stage's page for required input columns and example behaviours.
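As a purely illustrative sketch of how a stage interacts with that context (the real base classes and attribute names come from etl_lib.workflow and may differ), a stage might read res_df, enrich it, and write it back:

```python
from pyspark.sql import functions as F

class ReservationMetricsStage:
    """Hypothetical stage shape: read res_df from the context, write it back enriched."""

    def run(self, context):
        # context is the shared workflow state carrying res_df / rooms_df / guest_df
        context.res_df = context.res_df.withColumn(
            "stay_nights", F.datediff("departure_date", "arrival_date")
        )
        return context
```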