
GuestMatchingTask

The GuestMatchingTask performs probabilistic record linkage to identify and deduplicate guests across multiple bookings. It uses the Splink library, whose match model is trained with machine learning, to link guest records even when data-entry variations such as typos or formatting differences exist.

Key Options

  • model_path (Optional[os.PathLike]) — Path to the Splink model JSON file saved to S3. Default: S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json".
  • force_retrain (bool) — Force retraining the Splink model even if a model already exists. Default: False.
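
For illustration, the default path can be reproduced as follows. This is a sketch only: the S3Paths enum value and bucket layout are placeholders, and a PathLike wrapper for s3:// URIs (here cloudpathlib's S3Path) is an assumption, not something this page documents.

```python
from enum import Enum

from cloudpathlib import S3Path  # assumed PathLike wrapper for s3:// URIs

# Hypothetical stand-in for the internal S3Paths enum; the real bucket
# and prefix are not shown in this document.
class S3Paths(Enum):
    PROCESSED_SPLINK_MODELS = S3Path("s3://example-bucket/processed/splink_models")

chain_id = "chain_123"  # illustrative tenant/chain identifier
model_path = S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"
print(model_path)  # s3://example-bucket/processed/splink_models/chain_123.json
```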

Constructor Parameters

  • job_context — workflow context with spark, chain_id, catalog IO helpers, and job config.
  • write_to_catalog — If true, writes the processed outputs to the configured catalog.
  • is_incremental — Run in incremental mode; when enabled, the task checks for existing ProcessedGuestModel records and runs a link_only match of newly added guests against them.
  • model_path — Path to the Splink JSON model on S3. Default: S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json".
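
A minimal instantiation sketch, assuming internal imports, JobContext field names, and a run() entry point that this page does not document:

```python
# Illustrative only: the JobContext fields and the task's entry point are
# assumptions based on the parameter descriptions above.
job_context = JobContext(
    spark=spark,           # active SparkSession
    chain_id="chain_123",  # tenant/chain identifier
    # catalog IO helpers and job config are also carried on the context
)

task = GuestMatchingTask(
    job_context=job_context,
    write_to_catalog=True,  # persist outputs to the configured catalog
    is_incremental=False,   # full deduplication run
    model_path=None,        # None -> default per-chain S3 model path
)
task.run()  # assumed entry point; the actual method name may differ
```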

Incremental Behavior

  • Incremental mode only matches newly added guests against existing processed guests using Splink’s link_only mode.
  • If there are no existing processed guests, incremental mode falls back to full deduplication and writes matched guests to ProcessedAddedGuestModel.
  • When running in full mode, the task deduplicates the entire guest dataset and writes to ProcessedGuestModel and ProcessedGuestMatchModel.
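
The branching above can be summarised in pseudocode. Every helper name here (read_processed_guests, dedupe_all_guests, link_new_guests_against, write) is hypothetical shorthand for behaviour described on this page, not the task's real methods:

```python
def run_matching(task):
    """Paraphrase of the incremental/full branching described above."""
    if task.is_incremental:
        existing = task.read_processed_guests()  # existing ProcessedGuestModel rows
        if existing.isEmpty():
            # No prior guests: fall back to full deduplication, writing
            # the matched guests to ProcessedAddedGuestModel.
            clusters = task.dedupe_all_guests()               # Splink dedupe_only
            task.write(clusters, "ProcessedAddedGuestModel")
        else:
            # Match only newly added guests against existing ones.
            matches = task.link_new_guests_against(existing)  # Splink link_only
            task.write(matches, "ProcessedAddedGuestModel")
    else:
        # Full mode: deduplicate the entire guest dataset.
        clusters = task.dedupe_all_guests()                   # Splink dedupe_only
        task.write(clusters, "ProcessedGuestModel")
        task.write(task.match_results, "ProcessedGuestMatchModel")
```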

Matching Process (Summary)

  • The task builds a Splink SettingsCreator from the columns available in the input data and runs either dedupe_only (full mode) or link_only (incremental mode); see the settings sketch below.
  • Key comparison functions are EmailComparison, NameComparison (over lower-cased expressions), and ExactMatch for dates, genders, passport numbers, phone numbers, and addresses.
  • Blocking rules reduce the number of candidate pairs. They are generated dynamically from the available columns and include rules such as block_on("email"), block_on("first_name", "last_name"), and address/birth-date combinations.
  • The model is trained with expectation maximisation (EM) after u-probabilities are estimated via random sampling. Training occurs only when force_retrain is True or the model file is missing.
  • Results are scored with linker.inference.predict(threshold_match_probability=0.95) and clustered pairwise with linker.clustering.cluster_pairwise_predictions_at_threshold; see the training/inference sketch below.
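
To make the first three bullets concrete, here is a minimal Splink 4 settings sketch. The column names are illustrative, since the task derives comparisons and blocking rules from whichever columns are actually present:

```python
import splink.comparison_library as cl
from splink import ColumnExpression, SettingsCreator, block_on

settings = SettingsCreator(
    link_type="dedupe_only",  # "link_only" in incremental mode
    comparisons=[
        cl.EmailComparison("email"),
        # Lower-cased expressions so casing differences do not break name matches.
        cl.NameComparison(ColumnExpression("first_name").lower()),
        cl.NameComparison(ColumnExpression("last_name").lower()),
        cl.ExactMatch("date_of_birth"),
        cl.ExactMatch("gender"),
        cl.ExactMatch("passport_number"),
        cl.ExactMatch("phone"),
        cl.ExactMatch("address"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("email"),
        block_on("first_name", "last_name"),
        block_on("address", "date_of_birth"),  # address/birth-date combination
    ],
)
```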
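And a sketch of the training and inference flow from the last two bullets, continuing from the settings above. The existence check (model_file_exists) and the S3 handling around save_model_to_json are simplified assumptions:

```python
from splink import Linker, SparkAPI, block_on

db_api = SparkAPI(spark_session=spark)        # spark: active SparkSession
linker = Linker(guests_df, settings, db_api)  # settings from the sketch above

if force_retrain or not model_file_exists(model_path):  # model_file_exists: assumed helper
    # Estimate u-probabilities from randomly sampled record pairs first...
    linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
    # ...then estimate the remaining parameters with expectation maximisation.
    linker.training.estimate_parameters_using_expectation_maximisation(
        block_on("email")
    )
    linker.misc.save_model_to_json("model.json", overwrite=True)  # then upload to S3

# Score candidate pairs and cluster them into guest profiles.
df_predict = linker.inference.predict(threshold_match_probability=0.95)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
# clusters carries a cluster_id per record, which becomes guest_cluster_id.
```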

Outputs

  • ProcessedGuestModel — The deduplicated guest profiles with guest_cluster_id.
  • ProcessedGuestMatchModel — Pre-merge matching results (cluster assignments per guest).
  • ProcessedAddedGuestModel (incremental) — Newly added/merged guest records when running in incremental mode.

Notes

  • Monitor model performance and false positives: threshold_match_probability (default 0.95) can be tuned to balance precision and recall.
  • Use checkpointing and StorageLevel.MEMORY_AND_DISK for large datasets to break long Spark lineage chains; see the sketch below.
  • For large tenant datasets, enable is_incremental and reuse trained models to avoid the computational cost of retraining.
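
A minimal sketch of the persistence pattern from the second bullet, using standard PySpark; where exactly the task persists and checkpoints is an assumption:

```python
from pyspark import StorageLevel

# A checkpoint directory must be set before calling checkpoint().
spark.sparkContext.setCheckpointDir("s3://example-bucket/checkpoints/")  # illustrative path

# Keep intermediate guest data in memory, spilling to disk under pressure.
guests_df = guests_df.persist(StorageLevel.MEMORY_AND_DISK)

# Checkpointing materialises the data and truncates the logical plan, which
# otherwise grows across Splink's iterative EM steps.
guests_df = guests_df.checkpoint()
```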


External Dependencies

  • Splink: Probabilistic record linkage library

  • Model storage: trained Splink models are saved to S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json" by default.
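
A sketch of the save/load round trip. Splink's save_model_to_json writes a local file, so the boto3 upload/download shown here is an assumption about how the task moves it to and from S3:

```python
import boto3
from splink import Linker

BUCKET = "example-bucket"                         # illustrative
KEY = f"processed/splink_models/{chain_id}.json"  # mirrors the default layout

# After training: serialise the model locally, then upload it to S3.
linker.misc.save_model_to_json("/tmp/model.json", overwrite=True)
boto3.client("s3").upload_file("/tmp/model.json", BUCKET, KEY)

# On later runs: download the JSON and pass its path as the Linker settings,
# which restores the trained model and skips retraining.
boto3.client("s3").download_file(BUCKET, KEY, "/tmp/model.json")
linker = Linker(guests_df, "/tmp/model.json", db_api)
```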

Best Practices

  1. Train once per chain: Model training is expensive; reuse the trained model across runs.
  2. Monitor match quality: Regularly review clusters for false positives and false negatives.
  3. Clean input data: Better input data yields better matching results.
  4. Consider incremental updates: For large databases, prefer incremental matching to reduce training time and compute.
  5. Test threshold values: The default of 0.95 works well but may need adjustment per use case.
  6. Document changes: Track model versions when retraining.