GuestMatchingTask
The GuestMatchingTask performs probabilistic record linkage to identify and deduplicate guests across multiple bookings. It uses the Splink library to train a probabilistic matching model, so guest records are matched even in the presence of data-entry variations such as typos, formatting differences, or missing fields.
Key Options
- `model_path` (`Optional[os.PathLike]`) — Path to the Splink model JSON file saved to S3. Default: `S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"`.
- `force_retrain` (`bool`) — Force retraining the Splink model even if a model already exists. Default: `False`.
Constructor Parameters
- `job_context` — Workflow context with `spark`, `chain_id`, catalog IO helpers, and job config.
- `write_to_catalog` — If true, writes the processed outputs to the configured catalog.
- `is_incremental` — Run in incremental mode; if enabled, the task checks for `ProcessedGuestModel` and performs `link_only` on new guests.
- `model_path` — Path to the Splink JSON model on S3. Default: `S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"`.
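A minimal instantiation sketch based on the parameters above; the import path, the `job_context` fixture, and the `run()` entry point are illustrative assumptions, not confirmed names:

```python
# Hypothetical usage sketch; the module path and run() entry point are
# assumptions for illustration only.
from pipeline.tasks.guest_matching import GuestMatchingTask  # hypothetical import

task = GuestMatchingTask(
    job_context=job_context,   # carries spark, chain_id, catalog IO, job config
    write_to_catalog=True,     # persist outputs to the configured catalog
    is_incremental=False,      # full deduplication run
    # model_path defaults to S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"
)
task.run()  # hypothetical entry point
```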
Incremental Behavior
- Incremental mode only matches newly added guests against existing processed guests using Splink’s `link_only` mode.
- If there are no existing processed guests, incremental mode falls back to full deduplication and writes matched guests to `ProcessedAddedGuestModel`.
- When running in full mode, the task deduplicates the entire guest dataset and writes to `ProcessedGuestModel` and `ProcessedGuestMatchModel`.
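A sketch of that branching logic with Splink 4; the DataFrame names and the `make_settings` helper are hypothetical, and the full settings definition appears in the matching-process sketch below:

```python
from splink import Linker, SparkAPI

db_api = SparkAPI(spark_session=spark)

# make_settings is a hypothetical helper that builds the SettingsCreator shown
# in the matching-process sketch below, parameterised by link type.
if existing_guests_df.isEmpty():  # DataFrame.isEmpty() requires Spark >= 3.3
    # No ProcessedGuestModel yet: fall back to full deduplication.
    linker = Linker(new_guests_df, make_settings("dedupe_only"), db_api=db_api)
else:
    # Match only the newly added guests against already-processed guests.
    linker = Linker(
        [existing_guests_df, new_guests_df],
        make_settings("link_only"),
        db_api=db_api,
    )
```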
Matching Process (Summary)
- The task builds a Splink `SettingsCreator` from the available columns and runs either `dedupe_only` (full) or `link_only` (incremental) depending on the inputs.
- The key comparison functions are `EmailComparison`, `NameComparison` (with lower-cased expressions), and `ExactMatch` for dates, genders, passport numbers, phone numbers, and addresses.
- Blocking rules reduce the number of candidate pairs. They are generated dynamically from the available columns and include rules such as `block_on("email")`, `block_on("first_name", "last_name")`, and address/birthdate combinations.
- The model is trained with expectation-maximisation (EM) after estimating u-probabilities via random sampling. Training occurs only when `force_retrain` is `True` or the model file is missing.
- Results are predicted with `linker.inference.predict(threshold_match_probability=0.95)` and clustered pairwise with `linker.clustering.cluster_pairwise_predictions_at_threshold`.
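Put together, the steps above map onto Splink 4's API roughly as follows. This is a sketch, not the task's actual code: the column names and the particular comparison and blocking choices are illustrative and depend on which columns exist for the chain.

```python
import splink.comparison_library as cl
from splink import ColumnExpression, Linker, SettingsCreator, SparkAPI, block_on

settings = SettingsCreator(
    link_type="dedupe_only",  # "link_only" in incremental mode
    comparisons=[
        cl.EmailComparison("email"),
        # Lower-cased name expressions, per the notes above.
        cl.NameComparison(ColumnExpression("first_name").lower()),
        cl.NameComparison(ColumnExpression("last_name").lower()),
        cl.ExactMatch("birth_date"),
        cl.ExactMatch("gender"),
        cl.ExactMatch("passport_number"),
        cl.ExactMatch("phone"),
        cl.ExactMatch("address"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("email"),
        block_on("first_name", "last_name"),
        block_on("address", "birth_date"),
    ],
)

linker = Linker(guests_df, settings, db_api=SparkAPI(spark_session=spark))

# Estimate u-probabilities from randomly sampled record pairs, then EM training.
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "last_name")
)

# Score candidate pairs, then cluster the pairwise predictions into guests.
predictions = linker.inference.predict(threshold_match_probability=0.95)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)
```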
Outputs
- `ProcessedGuestModel` — The deduplicated guest profiles with `guest_cluster_id`.
- `ProcessedGuestMatchModel` — Pre-merge matching results (cluster assignments per guest).
- `ProcessedAddedGuestModel` (incremental) — Newly added/merged guest records when running in incremental mode.
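A hedged sketch of shaping Splink's cluster output into these models, continuing the names from the sketch above; the `as_spark_dataframe()` accessor is the Spark backend's conversion method, and the column rename is illustrative:

```python
# Splink's clustering output carries a cluster_id per input record; the task
# exposes it as guest_cluster_id (illustrative mapping, not confirmed code).
clusters_sdf = clusters.as_spark_dataframe()
processed_guests = clusters_sdf.withColumnRenamed("cluster_id", "guest_cluster_id")

# The pre-merge pairwise scores would back ProcessedGuestMatchModel.
guest_matches = predictions.as_spark_dataframe()
```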
Notes / Best Practices
- Monitor model performance and false positives: `threshold_match_probability` can be tuned (default 0.95) to balance precision and recall.
- Use checkpointing and `StorageLevel.MEMORY_AND_DISK` for large datasets to avoid Spark lineage issues, as sketched below.
- For large tenant datasets, enable `is_incremental` and reuse trained models to reduce the computational cost of retraining.
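For example, a lineage-breaking setup before handing the data to Splink (the checkpoint path is a placeholder):

```python
from pyspark import StorageLevel

# Persist to memory-and-disk and checkpoint to durable storage so repeated
# Splink actions do not recompute the full upstream lineage.
spark.sparkContext.setCheckpointDir("s3://<bucket>/checkpoints/guest-matching/")  # placeholder
guests_df = guests_df.persist(StorageLevel.MEMORY_AND_DISK)
guests_df = guests_df.checkpoint()  # truncates the logical plan
```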
Related Tasks
- `ProcessingTask` - Parent orchestrator
- `GuestMatchingCheckTask` - Ensures cluster IDs exist
- `GuestLoyaltyTask` - Uses `guest_cluster_id`
- `CleanAthenaeumTask` - Produces `CleanGuestModel`
External Dependencies
- Splink: Probabilistic record linkage library
  - Documentation: https://moj-analytical-services.github.io/splink/
  - Uses the Spark backend for scalability
- Model storage: trained Splink models are saved to `S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"` by default.
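A sketch of the retrain-or-reuse decision around that path, continuing names from the earlier sketches; the `model_exists` helper is hypothetical, and writing directly to an `s3://` path may need a filesystem-aware wrapper in practice:

```python
model_path = S3Paths.PROCESSED_SPLINK_MODELS.value / f"{chain_id}.json"

if force_retrain or not model_exists(model_path):  # model_exists: hypothetical helper
    linker = Linker(guests_df, settings, db_api=db_api)
    linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
    linker.training.estimate_parameters_using_expectation_maximisation(block_on("email"))
    linker.misc.save_model_to_json(str(model_path), overwrite=True)
else:
    # Splink accepts a path to a saved model JSON in place of a SettingsCreator.
    linker = Linker(guests_df, str(model_path), db_api=db_api)
```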
Best Practices
- Train once per chain: model training is expensive; reuse trained models.
- Monitor match quality: regularly review clusters for false positives and false negatives.
- Clean input data: better input data yields better matching results.
- Consider incremental updates: for large databases, prefer incremental matching to reduce training time and compute.
- Test threshold values: the default of 0.95 works well but may need adjustment per use case; see the sweep sketched after this list.
- Document changes: track model versions when retraining.
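A quick way to sanity-check candidate thresholds before committing to one (names follow the earlier sketches; pair counts are only a rough proxy, so spot-check clusters as well):

```python
# Compare how many pairs survive at each candidate threshold.
for t in (0.90, 0.95, 0.99):
    preds = linker.inference.predict(threshold_match_probability=t)
    n_pairs = preds.as_pandas_dataframe().shape[0]
    print(f"threshold={t}: {n_pairs} candidate match pairs")
```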