
Processing

Processing is where business rules and calculations are applied consistently.

The processing stage follows a fixed chain of steps; the runtime order is defined by etl_lib.workflow.Workflow.


The processing stage runs a small set of focused Spark workflows that enrich and normalise reservation, room and guest data used by reporting and downstream workflows. Below is the canonical order used in our jobs; individual pipelines may skip or repeat steps depending on configuration.

Stages

  • Reservation Metrics — Compute booking and cancellation windows and stay nights on the reservation DataFrame (sketched below).
  • Room Metrics — Compute per-room-day flags (stay day / stay night), length of stay and per-reservation aggregates such as total net revenue and guest counts (sketched below).
  • Guest Matching — Run Splink-based record linkage to group guest records into clusters (uses a saved model per chain; trains one if missing). Produces guest_cluster_id and keeps the most recent record per cluster.
  • Guest Matching Check — Lightweight safeguard that ensures every guest has a guest_cluster_id, copying guest_id if it is missing (sketched below).
  • Guest Loyalty — Aggregate reservation history per guest_cluster_id and annotate guests and reservations with loyalty metrics (stays, total nights, lifetime value, last booked property/channel/source, nth booking, returning-guest flag, etc.) (sketched below).
  • Replace Values — Apply configured value mappings to reservation-, room- and guest-level columns. Original values are preserved in an *_og column per replaced field (sketched below).
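
To make the Reservation Metrics step concrete, here is a minimal PySpark sketch of the window and stay-night calculations. The column names (booking_date, cancellation_date, arrival_date, departure_date) are assumptions for illustration, not the actual reservation schema.

```python
# Minimal sketch of the Reservation Metrics step. Column names are
# hypothetical; the real reservation schema may differ.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

res_df = spark.createDataFrame(
    [
        ("R1", "2024-04-01", None, "2024-05-10", "2024-05-13"),
        ("R2", "2024-04-05", "2024-04-20", "2024-06-01", "2024-06-04"),
    ],
    ["reservation_id", "booking_date", "cancellation_date", "arrival_date", "departure_date"],
)

res_df = (
    res_df
    # nights between arrival and departure
    .withColumn("stay_nights", F.datediff(F.to_date("departure_date"), F.to_date("arrival_date")))
    # days between booking and arrival (booking window)
    .withColumn("booking_window", F.datediff(F.to_date("arrival_date"), F.to_date("booking_date")))
    # days between booking and cancellation; stays null when not cancelled
    .withColumn("cancellation_window", F.datediff(F.to_date("cancellation_date"), F.to_date("booking_date")))
)
```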
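
Similarly, the per-reservation aggregates in the Room Metrics step amount to a groupBy over the room-level rows. This is only a sketch; rooms_df and the net_revenue and guest_count columns are assumed names.

```python
# Illustrative Room Metrics aggregation: roll room-level rows up to
# per-reservation totals. Column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

rooms_df = spark.createDataFrame(
    [("R1", "101", 3, 310.0, 2), ("R1", "102", 3, 280.0, 1)],
    ["reservation_id", "room_id", "stay_nights", "net_revenue", "guest_count"],
)

per_reservation = rooms_df.groupBy("reservation_id").agg(
    F.sum("net_revenue").alias("total_net_revenue"),
    F.sum("guest_count").alias("total_guests"),
    F.max("stay_nights").alias("length_of_stay"),
)
```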
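
The Guest Matching Check fallback is essentially a coalesce. A sketch, assuming guest_df carries guest_id and a possibly-null guest_cluster_id:

```python
# Guest Matching Check sketch: fall back to guest_id when the clustering
# step did not assign a guest_cluster_id.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

guest_df = spark.createDataFrame(
    [("G1", "C7"), ("G2", None)], ["guest_id", "guest_cluster_id"]
)

guest_df = guest_df.withColumn(
    "guest_cluster_id",
    F.coalesce(F.col("guest_cluster_id"), F.col("guest_id")),
)
```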
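
The Guest Loyalty step can be pictured as an aggregation over guest_cluster_id joined back onto the guest DataFrame. The metric and column names below (total_stays, total_nights, lifetime_value, last_booked_date) are illustrative assumptions, not the production schema.

```python
# Guest Loyalty sketch: aggregate reservation history per guest_cluster_id,
# then annotate guests with the resulting metrics. Names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

res_df = spark.createDataFrame(
    [("R1", "C7", 3, 420.0, "2024-04-01"), ("R2", "C7", 2, 260.0, "2024-06-15")],
    ["reservation_id", "guest_cluster_id", "stay_nights", "net_revenue", "booking_date"],
)
guest_df = spark.createDataFrame([("C7", "Ada")], ["guest_cluster_id", "first_name"])

loyalty = res_df.groupBy("guest_cluster_id").agg(
    F.count("*").alias("total_stays"),
    F.sum("stay_nights").alias("total_nights"),
    F.sum("net_revenue").alias("lifetime_value"),
    F.max("booking_date").alias("last_booked_date"),
)

guest_df = (
    guest_df.join(loyalty, "guest_cluster_id", "left")
            .withColumn("returning_guest", F.col("total_stays") > F.lit(1))
)
```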
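
The Replace Values behaviour, including the *_og backup column, can be illustrated like this; the channel column and its mapping are invented for the example.

```python
# Replace Values sketch: keep the original value in an *_og column, then
# apply a configured mapping. The mapping here is made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

res_df = spark.createDataFrame(
    [("R1", "BKG"), ("R2", "WEB")], ["reservation_id", "channel"]
)

channel_map = {"BKG": "Booking.com", "WEB": "Direct"}
mapping = F.create_map(*[F.lit(x) for kv in channel_map.items() for x in kv])

res_df = (
    res_df
    # preserve the original value before replacing it
    .withColumn("channel_og", F.col("channel"))
    # look the value up in the mapping; fall back to the original if unmapped
    .withColumn("channel", F.coalesce(F.element_at(mapping, F.col("channel_og")), F.col("channel_og")))
)
```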

Each stage expects certain DataFrames to be present on the workflow context (typically res_df, rooms_df and guest_df) and writes enriched DataFrames back to the same attributes. See each stage's page for required input columns and example behaviours.
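
As a mental model only (the real etl_lib.workflow.Workflow API is not reproduced here and may differ), a stage can be read as a callable that pulls DataFrames off a shared context and writes enriched ones back:

```python
# Hypothetical shape of a stage operating on a workflow context; the actual
# etl_lib.workflow classes are not shown here.
from dataclasses import dataclass
from typing import Optional

from pyspark.sql import DataFrame, functions as F


@dataclass
class Context:
    res_df: Optional[DataFrame] = None
    rooms_df: Optional[DataFrame] = None
    guest_df: Optional[DataFrame] = None


def reservation_metrics(ctx: Context) -> Context:
    # Read res_df from the context, enrich it, and write it back.
    ctx.res_df = ctx.res_df.withColumn(
        "stay_nights",
        F.datediff(F.to_date("departure_date"), F.to_date("arrival_date")),
    )
    return ctx
```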
