Guest Matching
This process uses Splink to probabilistically deduplicate guests.
It filters and deduplicates the input, then builds a matching configuration that combines name similarity with exact comparisons on email, phone, and other personal information.
If no saved model exists, it estimates the u probabilities (how often fields agree by chance among non-matching records), learns the remaining parameters via Expectation-Maximization (EM), and saves the model. It then loads the model and runs inference at a high threshold (0.95) to produce pairwise matches, which are clustered into entities. Each guest gets a stable cluster ID (falling back to their own ID if unlinked), fields are enriched within clusters, and the most recent record per cluster becomes the canonical guest.
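The clustering step can be pictured as finding connected components over the pairs that score at or above the threshold. The following is a minimal sketch of that idea in plain Python, not the runnable's actual code: the function name cluster_pairs, the tuple layout of the scored pairs, and the choice of the smallest ID as the cluster ID are all assumptions made for illustration.

```python
def cluster_pairs(record_ids, scored_pairs, threshold=0.95):
    """Group records into clusters using union-find (connected components)."""
    # Every record starts as its own cluster, so unlinked guests keep their own ID.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, probability in scored_pairs:
        if probability >= threshold:
            root_l, root_r = find(left), find(right)
            if root_l != root_r:
                # Keep the smaller ID as root so the cluster ID is deterministic.
                small, large = sorted((root_l, root_r))
                parent[large] = small

    return {rid: find(rid) for rid in record_ids}


# Hypothetical scored pairs: (left_id, right_id, match_probability).
pairs = [("g1", "g2", 0.99), ("g2", "g3", 0.97), ("g4", "g5", 0.60)]
clusters = cluster_pairs(["g1", "g2", "g3", "g4", "g5"], pairs)
```

Here g1, g2 and g3 end up in one cluster through the chain g1-g2-g3, while g4 and g5 stay separate because their pair scores below 0.95.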
Guest Matching Check
This process ensures that every guest has a cluster ID: if a guest has no guest_cluster_id, it copies that guest's guest_id into guest_cluster_id.
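A minimal sketch of this fallback, assuming guest_df is a pandas DataFrame (the document does not say which dataframe library is used). The function name ensure_cluster_id is hypothetical, and the sketch covers both a missing column and missing values within an existing column.

```python
import pandas as pd


def ensure_cluster_id(guest_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every row has a guest_cluster_id."""
    out = guest_df.copy()
    if "guest_cluster_id" not in out.columns:
        # No matching output at all: every guest becomes its own cluster.
        out["guest_cluster_id"] = out["guest_id"]
    else:
        # Fill only the rows where the cluster ID is missing.
        out["guest_cluster_id"] = out["guest_cluster_id"].fillna(out["guest_id"])
    return out


guests = pd.DataFrame(
    {"guest_id": ["g1", "g2", "g3"], "guest_cluster_id": ["g1", None, "g1"]}
)
checked = ensure_cluster_id(guests)
```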
Inputs
- Requires guest_df on the workflow context. The runnable inspects a fixed set of candidate columns (first_name, last_name, email, birth_date, phone_number, address_* fields, nationality, passport_number, gender) and keeps only those that exist and have at least one non-null value.
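The column selection described above could look roughly like this, again assuming guest_df is a pandas DataFrame. The helper name usable_columns is hypothetical, and because the exact address_* field names are not listed here, the sketch matches them by prefix.

```python
import pandas as pd

# Fixed candidate columns named in this document; address fields are matched by prefix.
CANDIDATE_COLUMNS = {
    "first_name", "last_name", "email", "birth_date", "phone_number",
    "nationality", "passport_number", "gender",
}


def usable_columns(guest_df: pd.DataFrame) -> list:
    """Candidate columns that exist and contain at least one non-null value."""
    present = [
        col for col in guest_df.columns
        if col in CANDIDATE_COLUMNS or str(col).startswith("address_")
    ]
    return [col for col in present if guest_df[col].notna().any()]


guests = pd.DataFrame(
    {
        "guest_id": [1, 2],
        "email": ["a@example.com", None],
        "gender": [None, None],           # present but entirely null: dropped
        "address_city": ["Lyon", "Lyon"],
    }
)
kept = usable_columns(guests)
```

In this example only email and address_city survive: guest_id is not a candidate column, and gender has no non-null values.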
Outputs
- Produces an updated guest_df where each row has a guest_cluster_id field. The runnable also ensures the final guest_df contains the most recent guest record per cluster (ordered by created_timestamp if present).
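One way to keep the most recent record per cluster, sketched with pandas under the assumption that guest_df is a pandas DataFrame. The function name latest_per_cluster is hypothetical; when created_timestamp is absent, this sketch simply keeps the last row seen for each cluster, which may differ from what the runnable does.

```python
import pandas as pd


def latest_per_cluster(guest_df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per guest_cluster_id, preferring the newest record."""
    out = guest_df
    if "created_timestamp" in out.columns:
        # Stable sort so that ties keep their original relative order.
        out = out.sort_values("created_timestamp", kind="mergesort")
    return out.drop_duplicates("guest_cluster_id", keep="last").reset_index(drop=True)


guests = pd.DataFrame(
    {
        "guest_id": ["g1", "g2", "g3"],
        "guest_cluster_id": ["g1", "g1", "g3"],
        "created_timestamp": pd.to_datetime(
            ["2024-01-01", "2024-03-01", "2024-02-01"]
        ),
    }
)
canonical = latest_per_cluster(guests)
```

For cluster g1 the record g2 is kept because it has the later timestamp.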
Behaviour and configuration
- Model path: a per-chain JSON model is read from and written to S3 (path derived from the job chain ID). If the model file does not exist, the runnable trains a new Splink model and persists it.
- Splink settings: base comparisons include an email comparison and a forename/surname comparison; optional exact comparisons (birth_date, gender, nationality, phone_number, passport_number and address pieces) are added if present in the data.
- Blocking rules: the default rules block on email, on phone_number, and on the combinations email+first_name and first_name+last_name; extended rules are added when birth_date, address, phone or passport fields are present.
- Training: the runnable estimates u using random sampling (max pairs 500k) and runs EM on a few representative blocking rules. This can be computationally heavy for large guest tables.
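To illustrate the blocking-rule bullet above: Splink accepts blocking rules as SQL conditions over the l and r table aliases, so rules can be generated only for columns that survived filtering. This is a sketch under stated assumptions, not the runnable's rule set. The helper blocking_rules is hypothetical, and the last two candidate combinations are invented placeholders for the extended rules, which this document does not enumerate.

```python
# Column combinations to block on. A rule is emitted only if all its columns exist.
CANDIDATE_BLOCKS = [
    ("email",),
    ("phone_number",),
    ("email", "first_name"),
    ("first_name", "last_name"),
    # Hypothetical extended rules; the real combinations are not given here.
    ("last_name", "birth_date"),
    ("passport_number",),
]


def blocking_rules(available_columns):
    """Build SQL blocking conditions for the column combinations that are available."""
    available = set(available_columns)
    rules = []
    for cols in CANDIDATE_BLOCKS:
        if available.issuperset(cols):
            rules.append(" AND ".join(f"l.{col} = r.{col}" for col in cols))
    return rules


rules = blocking_rules(["first_name", "last_name", "email"])
```

With only names and email available, three rules are produced; the phone, birth date and passport rules are skipped rather than failing on missing columns.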
Edge cases
- If the filtered guest_df contains zero rows (no non-empty columns) the runnable raises a ValueError.
- When guest_cluster_id is missing, Guest Matching Check will copy guest_id to guest_cluster_id to keep downstream steps robust.
- The algorithm relies on certain identity-bearing fields (email/phone/passport/name); when those are missing the matching signal weakens and more false negatives may occur.