GuestMatchingTask
The GuestMatchingTask performs probabilistic record linkage to identify and deduplicate guests across multiple bookings. It uses the Splink library to train a probabilistic matching model, so guest records are matched even in the presence of data-entry variations such as typos, formatting differences, or missing fields.
Key Options
- `model_path` (`Optional[os.PathLike]`) — Path to the Splink model JSON file saved to S3. Default: `S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"`.
- `force_retrain` (`bool`) — Force retraining the Splink model even if a model already exists. Default: `False`.
Constructor Parameters
- `job_context` — Workflow context with `spark`, `chain_id`, catalog IO helpers, and job config.
- `write_to_catalog` — If true, writes the processed outputs to the configured catalog.
- `is_incremental` — Run in incremental mode; if enabled, the task checks for `ProcessedGuestModel` and performs `link_only` on new guests.
- `model_path` — Path to the Splink JSON model on S3. Default: `S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"`.
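A minimal instantiation sketch based on the parameters above; the import path, the `job_context` fixture, and the `run()` entry point are illustrative assumptions, not confirmed names:

```python
# Hypothetical usage sketch; the module path and run() entry point are
# assumptions for illustration only.
from pipeline.tasks.guest_matching import GuestMatchingTask  # hypothetical import

task = GuestMatchingTask(
    job_context=job_context,   # carries spark, chain_id, catalog IO, job config
    write_to_catalog=True,     # persist outputs to the configured catalog
    is_incremental=False,      # full deduplication run
    # model_path defaults to S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"
)
task.run()  # hypothetical entry point
```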
Incremental Behavior
- Incremental mode only matches newly added guests against existing processed guests using Splink’s `link_only` mode.
- If there are no existing processed guests, incremental mode falls back to full deduplication and writes matched guests to `ProcessedAddedGuestModel`.
- When running in full mode, the task deduplicates the entire guest dataset and writes to `ProcessedGuestModel` and `ProcessedGuestMatchModel`.
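A sketch of that branching logic with Splink 4; the DataFrame names and the `make_settings` helper are hypothetical, and the full settings definition appears in the matching-process sketch below:

```python
from splink import Linker, SparkAPI

db_api = SparkAPI(spark_session=spark)

# make_settings is a hypothetical helper that builds the SettingsCreator shown
# in the matching-process sketch below, parameterised by link type.
if existing_guests_df.isEmpty():  # DataFrame.isEmpty() requires Spark >= 3.3
    # No ProcessedGuestModel yet: fall back to full deduplication.
    linker = Linker(new_guests_df, make_settings("dedupe_only"), db_api=db_api)
else:
    # Match only the newly added guests against already-processed guests.
    linker = Linker(
        [existing_guests_df, new_guests_df],
        make_settings("link_only"),
        db_api=db_api,
    )
```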
Matching Process (Summary)
- The task builds a Splink `SettingsCreator` from the available columns and runs either `dedupe_only` (full) or `link_only` (incremental) depending on the inputs.
- The key comparison functions are `EmailComparison`, `NameComparison` (with lower-cased expressions), and `ExactMatch` for dates, genders, passport numbers, phone numbers, and addresses.
- Blocking rules reduce the number of candidate pairs. They are generated dynamically from the available columns and include rules such as `block_on("email")`, `block_on("first_name", "last_name")`, and address/birthdate combinations.
- The model is trained with expectation-maximisation (EM) after estimating u-probabilities via random sampling. Training occurs only when `force_retrain` is `True` or the model file is missing.
- Results are predicted with `linker.inference.predict(threshold_match_probability=0.95)` and clustered pairwise with `linker.clustering.cluster_pairwise_predictions_at_threshold`.
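Put together, the steps above map onto Splink 4's API roughly as follows. This is a sketch, not the task's actual code: the column names and the particular comparison and blocking choices are illustrative and depend on which columns exist for the chain.

```python
import splink.comparison_library as cl
from splink import ColumnExpression, Linker, SettingsCreator, SparkAPI, block_on

settings = SettingsCreator(
    link_type="dedupe_only",  # "link_only" in incremental mode
    comparisons=[
        cl.EmailComparison("email"),
        # Lower-cased name expressions, per the notes above.
        cl.NameComparison(ColumnExpression("first_name").lower()),
        cl.NameComparison(ColumnExpression("last_name").lower()),
        cl.ExactMatch("birth_date"),
        cl.ExactMatch("gender"),
        cl.ExactMatch("passport_number"),
        cl.ExactMatch("phone"),
        cl.ExactMatch("address"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("email"),
        block_on("first_name", "last_name"),
        block_on("address", "birth_date"),
    ],
)

linker = Linker(guests_df, settings, db_api=SparkAPI(spark_session=spark))

# Estimate u-probabilities from randomly sampled record pairs, then EM training.
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "last_name")
)

# Score candidate pairs, then cluster the pairwise predictions into guests.
predictions = linker.inference.predict(threshold_match_probability=0.95)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)
```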
Outputs
- `ProcessedGuestModel` — The deduplicated guest profiles with `guest_cluster_id`.
- `ProcessedGuestMatchModel` — Pre-merge matching results (cluster assignments per guest).
- `ProcessedAddedGuestModel` (incremental) — Newly added/merged guest records when running in incremental mode.
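A hedged sketch of shaping Splink's cluster output into these models, continuing the names from the sketch above; the `as_spark_dataframe()` accessor is the Spark backend's conversion method, and the column rename is illustrative:

```python
# Splink's clustering output carries a cluster_id per input record; the task
# exposes it as guest_cluster_id (illustrative mapping, not confirmed code).
clusters_sdf = clusters.as_spark_dataframe()
processed_guests = clusters_sdf.withColumnRenamed("cluster_id", "guest_cluster_id")

# The pre-merge pairwise scores would back ProcessedGuestMatchModel.
guest_matches = predictions.as_spark_dataframe()
```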
Notes / Best Practices
- Monitor model performance and false positives: `threshold_match_probability` can be tuned (default 0.95) to balance precision and recall.
- Use checkpointing and `StorageLevel.MEMORY_AND_DISK` for large datasets to avoid Spark lineage issues, as sketched below.
- For large tenant datasets, enable `is_incremental` and reuse trained models to reduce the computational cost of retraining.
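For example, a lineage-breaking setup before handing the data to Splink (the checkpoint path is a placeholder):

```python
from pyspark import StorageLevel

# Persist to memory-and-disk and checkpoint to durable storage so repeated
# Splink actions do not recompute the full upstream lineage.
spark.sparkContext.setCheckpointDir("s3://<bucket>/checkpoints/guest-matching/")  # placeholder
guests_df = guests_df.persist(StorageLevel.MEMORY_AND_DISK)
guests_df = guests_df.checkpoint()  # truncates the logical plan
```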
Related Tasks
- `ProcessingTask` - Parent orchestrator
- `GuestMatchingCheckTask` - Ensures cluster IDs exist
- `GuestLoyaltyTask` - Uses `guest_cluster_id`
- `CleanAthenaeumTask` - Produces `CleanGuestModel`
External Dependencies
- Splink: Probabilistic record linkage library
  - Documentation: https://moj-analytical-services.github.io/splink/
  - Uses the Spark backend for scalability
- Model storage: trained Splink models are saved to `S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"` by default.
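A sketch of the retrain-or-reuse decision around that path, continuing names from the earlier sketches; the `model_exists` helper is hypothetical, and writing directly to an `s3://` path may need a filesystem-aware wrapper in practice:

```python
model_path = S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"

if force_retrain or not model_exists(model_path):  # model_exists: hypothetical helper
    linker = Linker(guests_df, settings, db_api=db_api)
    linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
    linker.training.estimate_parameters_using_expectation_maximisation(block_on("email"))
    linker.misc.save_model_to_json(str(model_path), overwrite=True)
else:
    # Splink accepts a path to a saved model JSON in place of a SettingsCreator.
    linker = Linker(guests_df, str(model_path), db_api=db_api)
```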
Best Practices
- Train once per chain: model training is expensive; reuse trained models.
- Monitor match quality: regularly review clusters for false positives and false negatives.
- Clean input data: better input data yields better matching results.
- Consider incremental updates: for large databases, prefer incremental matching to reduce training time and compute.
- Test threshold values: the default of 0.95 works well but may need adjustment per use case; see the sweep sketched after this list.
- Document changes: track model versions when retraining.
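A quick way to sanity-check candidate thresholds before committing to one (names follow the earlier sketches; pair counts are only a rough proxy, so spot-check clusters as well):

```python
# Compare how many pairs survive at each candidate threshold.
for t in (0.90, 0.95, 0.99):
    preds = linker.inference.predict(threshold_match_probability=t)
    n_pairs = preds.as_pandas_dataframe().shape[0]
    print(f"threshold={t}: {n_pairs} candidate match pairs")
```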