
Guest Matching

This process uses Splink to probabilistically deduplicate guests.

It filters and deduplicates the input, then builds a matching configuration that combines name similarity with exact comparisons on email, phone, and other personal information.

If no saved model exists, the process first estimates background match rates, learns the model parameters via Expectation-Maximization (EM), and saves the model. It then loads the saved model and runs inference with a high match threshold (0.95), producing pairwise matches that are clustered into entities. Each guest gets a stable cluster ID (falling back to their own ID if unlinked), fields are enriched within clusters, and the most recent record per cluster becomes the canonical guest.
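The clustering step can be pictured as connected components over the pairs that clear the threshold. The following is a minimal sketch in plain Python, not the Splink implementation. The function name `cluster_pairs`, the `(left, right, probability)` tuple layout, the inclusive threshold, and the choice of the smallest member ID as the cluster ID are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple


def cluster_pairs(
    guest_ids: Iterable[Hashable],
    pairs: Iterable[Tuple[Hashable, Hashable, float]],
    threshold: float = 0.95,
) -> Dict[Hashable, Hashable]:
    """Map each guest ID to a cluster ID using union-find.

    Pairs scoring below `threshold` are ignored. Guests with no surviving
    link end up in a singleton cluster whose ID is their own guest ID.
    """
    parent = {g: g for g in guest_ids}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, probability in pairs:
        if probability < threshold:
            continue
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Keep the smaller ID as root so the cluster ID is deterministic.
            parent[max(root_l, root_r)] = min(root_l, root_r)

    return {g: find(g) for g in parent}


clusters = cluster_pairs(
    guest_ids=["g1", "g2", "g3", "g4"],
    pairs=[("g1", "g2", 0.99), ("g2", "g3", 0.97), ("g3", "g4", 0.60)],
)
print(clusters)
```

In this example g1, g2, and g3 are linked transitively, while g4 stays unlinked because its only pair scores below 0.95 and therefore keeps its own ID as cluster ID.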

Guest Matching Check

This process ensures that every guest has a cluster ID. If a guest has no guest_cluster_id, it copies the guest_id into guest_cluster_id.
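A minimal sketch of this fallback, assuming guest_df is a pandas DataFrame with a guest_id column. The function name `ensure_cluster_id` is hypothetical, and the actual runnable may work on a different DataFrame type.

```python
import pandas as pd


def ensure_cluster_id(guest_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every row has a guest_cluster_id."""
    out = guest_df.copy()
    if "guest_cluster_id" not in out.columns:
        # No matching output at all: every guest is its own cluster.
        out["guest_cluster_id"] = out["guest_id"]
    else:
        # Fill only the rows where the cluster ID is missing.
        out["guest_cluster_id"] = out["guest_cluster_id"].fillna(out["guest_id"])
    return out


checked = ensure_cluster_id(
    pd.DataFrame(
        {
            "guest_id": ["g1", "g2", "g3"],
            "guest_cluster_id": ["g1", None, "g1"],
        }
    )
)
print(checked["guest_cluster_id"].tolist())
```

Only g2 is changed here: it had no cluster ID, so it receives its own guest_id.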

Inputs

  • Requires guest_df on the workflow context. The runnable inspects a fixed set of candidate columns (first_name, last_name, email, birth_date, phone_number, address_* fields, nationality, passport_number, gender) and keeps only those that exist and have at least one non-null value.
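The column inspection described above might look like the following sketch, assuming guest_df is a pandas DataFrame with string column names. The names `CANDIDATE_COLUMNS` and `select_usable_columns` are hypothetical. Address columns are matched by the `address_` prefix because the individual field names are not listed here, and how identifier columns such as guest_id are carried along is not covered.

```python
import pandas as pd

# Candidate columns named in this section; address fields are matched by prefix.
CANDIDATE_COLUMNS = [
    "first_name", "last_name", "email", "birth_date", "phone_number",
    "nationality", "passport_number", "gender",
]


def select_usable_columns(guest_df: pd.DataFrame) -> list:
    """Return candidate columns that exist and hold at least one non-null value."""
    candidates = [
        c for c in guest_df.columns
        if c in CANDIDATE_COLUMNS or c.startswith("address_")
    ]
    return [c for c in candidates if guest_df[c].notna().any()]


df = pd.DataFrame(
    {
        "guest_id": ["g1", "g2"],
        "email": ["a@example.com", None],
        "phone_number": [None, None],  # present but entirely null
        "address_city": ["Lisbon", "Porto"],
    }
)
usable = select_usable_columns(df)
print(usable)
```

In this example phone_number is dropped because it holds no values, and guest_id is not a candidate column.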

Outputs

  • Produces an updated guest_df in which each row has a guest_cluster_id field. The runnable also ensures the final guest_df contains the most recent guest record per cluster (ordered by created_timestamp, if that column is present).
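Selecting the most recent record per cluster can be sketched as follows, assuming guest_df is a pandas DataFrame. The function name `latest_per_cluster` is hypothetical. Two behaviours are assumptions of this sketch rather than documented ones: when created_timestamp is absent, the last row in the existing order is kept, and rows with a missing timestamp are treated as oldest.

```python
import pandas as pd


def latest_per_cluster(guest_df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per guest_cluster_id, preferring the newest record."""
    out = guest_df
    if "created_timestamp" in out.columns:
        # Stable sort; rows with a missing timestamp sort first so they
        # never win over a dated record.
        out = out.sort_values(
            "created_timestamp", kind="mergesort", na_position="first"
        )
    # After sorting, the last row of each cluster is the most recent one.
    return out.drop_duplicates("guest_cluster_id", keep="last").reset_index(drop=True)


canonical = latest_per_cluster(
    pd.DataFrame(
        {
            "guest_id": ["g1", "g2", "g3"],
            "guest_cluster_id": ["g1", "g1", "g3"],
            "created_timestamp": pd.to_datetime(
                ["2024-01-05", "2024-03-01", "2024-02-10"]
            ),
        }
    )
)
print(canonical[["guest_id", "guest_cluster_id"]])
```

Here cluster g1 is represented by record g2, the newer of its two rows, and cluster g3 by its only record.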

Behaviour and configuration

  • Model path: a per-chain JSON model is read from and written to S3 (the path is derived from the job chain ID). If the model file does not exist, the runnable trains a new Splink model and persists it.
  • Splink settings: base comparisons include an email comparison and a forename/surname comparison; optional exact comparisons (birth_date, gender, nationality, phone_number, passport_number and address pieces) are added if present in the data.
  • Blocking rules: the default rules block on email, on phone_number, and on the combinations email+first_name and first_name+last_name. Extended rules are added when birth_date, address, phone, or passport fields are present.
  • Training: the runnable estimates the u probabilities (how often each comparison level is observed among non-matching records) using random sampling (at most 500k pairs), then runs EM on a few representative blocking rules. This can be computationally heavy for large guest tables.
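To make the blocking rules concrete, the sketch below builds them as SQL equality conditions over the aliases `l` and `r`, a form Splink accepts for blocking rules. It is an illustration in plain Python, not the runnable's code: `block_on`, `build_blocking_rules`, and the single entry in `EXTENDED_BLOCKS` are hypothetical, since the actual extended rules are not spelled out here.

```python
from typing import Iterable, List, Sequence


def block_on(columns: Sequence[str]) -> str:
    """Build an equality condition between the left (l) and right (r) record."""
    return " AND ".join(f"l.{c} = r.{c}" for c in columns)


# Column combinations listed in this section as defaults.
DEFAULT_BLOCKS = [
    ("email",),
    ("phone_number",),
    ("email", "first_name"),
    ("first_name", "last_name"),
]

# Hypothetical extended combination, for illustration only.
EXTENDED_BLOCKS = [
    ("last_name", "birth_date"),
]


def build_blocking_rules(available: Iterable[str]) -> List[str]:
    """Return SQL conditions for every block whose columns are all available."""
    present = set(available)
    return [
        block_on(cols)
        for cols in DEFAULT_BLOCKS + EXTENDED_BLOCKS
        if present.issuperset(cols)
    ]


rules = build_blocking_rules(["first_name", "last_name", "email", "birth_date"])
for rule in rules:
    print(rule)
```

Because phone_number is not among the available columns in this example, the phone rule is skipped and four conditions remain. Skipping rules whose columns are absent is a design choice of the sketch.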

Edge cases

  • If the filtered guest_df contains zero rows (no non-empty columns), the runnable raises a ValueError.
  • When guest_cluster_id is missing, Guest Matching Check will copy guest_id to guest_cluster_id to keep downstream steps robust.
  • The algorithm relies on certain identity-bearing fields (email/phone/passport/name); when those are missing the matching signal weakens and more false negatives may occur.