Skip to Content

ProcessingTask

The ProcessingTask is an orchestrator task that runs all common data processing operations in sequence. It coordinates subtasks that calculate metrics, match guests, and prepare data for reporting.

Overview

ProcessingTask serves as the main entry point for the processing stage of the ETL pipeline. It:

  • Orchestrates 6 core processing subtasks
  • Transforms cleaned data into processed models
  • Calculates reservation and room metrics
  • Performs guest matching and deduplication
  • Computes guest loyalty metrics
  • Applies value replacements from configuration

Flow Diagram

Data Flow

Subtasks

ProcessingTask runs the following subtasks in order:

  1. ReservationMetricsTask - Calculates booking window, cancellation window, and stay nights
  2. RoomMetricsTask - Computes daily room metrics and stay flags
  3. GuestMatchingTask - Performs guest deduplication using Splink
  4. GuestMatchingCheckTask - Ensures all guests have cluster IDs
  5. GuestLoyaltyTask - Calculates loyalty metrics and flags
  6. ReplaceValuesTask - Applies configured value replacements

Constructor Options

  • subtasks — Optional custom list of Task objects. If none provided, the default sequence (ReservationMetrics -> RoomMetrics -> GuestMatching -> GuestMatchingCheck -> GuestLoyalty -> ReplaceValues) is used.
  • skip_subtasks — List of names of subtasks to skip. Useful for testing and incremental workflows.
  • is_incremental — Can be set explicitly, otherwise taken from job_context.is_incremental.
  • write_to_catalog — Controls whether the final outputs are written to catalog (default True).

Incremental Behavior

  • When is_incremental = True, subtasks process added/changed records where supported and produce ProcessedAdded* outputs. The parent ProcessingTask gathers outputs and will only write ProcessedAdded* models if they contain data.
  • The orchestrator will also only write the main Processed* models if those DataFrames contain rows.

Models

Requires

  • CleanGuestModel - Cleaned guest data
  • CleanReservationModel - Cleaned reservation data
  • CleanRoomModel - Cleaned room stay data

Provides

  • ProcessedGuestModel - Processed guest data with loyalty metrics
  • ProcessedReservationModel - Processed reservations with calculated metrics
  • ProcessedRoomModel - Processed room data with daily metrics

Provides (incremental)

  • ProcessedAddedGuestModel - Newly added guest records (incremental)
  • ProcessedAddedReservationModel - Newly added reservations (incremental)
  • ProcessedAddedRoomModel - Newly added rooms (incremental)
  • Models - Input/output model definitions
  • Workflow - How ProcessingTask fits in the pipeline
Last updated on