Skip to Content

Mews Cleaning

The MewsCleaner transforms raw Mews PMS data into our standardized format. Mews is a modern cloud-based property management system with a comprehensive API and rich data model.

Overview

The Mews cleaning process has several distinguishing characteristics:

  1. Rich relational model with separate tables for customers, rates, business segments, resource categories, etc.
  2. Multilingual support with localized field names
  3. Complex guest counting using age categories and person counts
  4. Order-based revenue model with product classifications
  5. Flexible table handling - gracefully handles missing optional tables

Data Sources

The cleaner reads from multiple tables in the etl_gp_raw_mews database:

Required Tables

  • reservations - Main reservation data
  • customers - Guest profiles and contact information
  • order_items - Line-item charges for rooms, products, and services

Optional Tables

  • resource_categories - Room type definitions
  • rates - Rate plan configurations
  • business_segments - Market segment definitions
  • age_categories - Age classification (Adult/Child/Infant) for guest counting
  • products - Product catalog with F&B classifications

The cleaner checks for table existence before loading optional tables and continues gracefully if they’re not available.

Cleaning Workflow

Deduplication Strategy

Mews uses a simple timestamp-based deduplication:

window = Window.partitionBy("Id").orderBy( F.coalesce(F.col("UpdatedUtc"), F.col("CreatedUtc")).desc() ) df = df.withColumn("row_num", F.row_number().over(window)) df = df.filter(F.col("row_num") == 1).drop("row_num")

Key: For duplicate reservation IDs, the most recently updated (or created, if no update timestamp) record wins.

Field Mapping

Reservation Fields

Mews FieldStandard Schema FieldNotes
Idreservation_id_ogUnique reservation identifier (UUID)
BookerIdguest_id_ogPrimary guest/booker ID
CreatedUtccreated_timestampWhen reservation was created
UpdatedUtclast_modified_timestampLast update time
StartUtccheck_in_dateCheck-in date/time
EndUtccheck_out_dateCheck-out date/time
CancelledUtccancellation_dateCancellation timestamp
StatestatusReservation state
OriginsourceBooking source/origin
CompanyIdmews_company_idCorporate account ID
TravelAgencyIdmews_travel_agency_idTravel agency ID
GroupIdmews_group_idGroup booking ID
VoucherCodepromo_codePromotional/voucher code
ServiceIdproperty_idMaps to property via config
PersonCountsguests_adults, guests_childrenParsed via age categories

Status Mapping

Mews StateStandard Status
Canceledcancelled
Confirmedconfirmed
Startedchecked_in
Processedchecked_out
Optionaltentative

Guest Counting

Mews uses a sophisticated age category system for guest counting:

PersonCounts Structure

PersonCounts: [ {"AgeCategoryId": "uuid-adult", "Count": 2}, {"AgeCategoryId": "uuid-child", "Count": 1} ]

Processing Logic

  1. Explode PersonCounts array into individual rows
  2. Join with age_categories table to get Classification (Adult/Child/Infant)
  3. Aggregate by classification:
    df_counts = df_exploded.groupBy("Id").agg( F.sum(F.when(F.col("Classification") == "Adult", F.col("Count"))).alias("AdultCount"), F.sum(F.when(F.col("Classification") == "Child", F.col("Count"))).alias("ChildCount"), )
  4. Join counts back to reservation dataframe

Fallback Handling

If age_categories table is not available:

df = df.withColumn("AdultCount", F.lit(None).cast(IntegerType())) df = df.withColumn("ChildCount", F.lit(None).cast(IntegerType()))

Lookup Table Joins

Business Segments (Market Segments)

if self.business_segments is not None: business_seg_lookup = self.business_segments.select( F.col("Id").alias("BusinessSegmentId"), F.col("Name").alias("market_segment") ) df = df.join(business_seg_lookup, on="BusinessSegmentId", how="left")

Rates (Rate Plans)

if self.rates is not None: rate_lookup = self.rates.select( F.col("Id").alias("RateId"), F.col("Name").alias("room_rate_code") ) df = df.join(rate_lookup, on="RateId", how="left")

Customers (Guest Information)

if self.customers is not None: customer_info = self.customers.select( F.col("Id").alias("guest_id_og"), F.col("FirstName").alias("first_name"), F.col("LastName").alias("last_name"), F.col("Email").alias("email"), F.col("Phone").alias("phone_number"), F.col("NationalityCode").alias("nationality"), ) df = df.join(customer_info, on="guest_id_og", how="left")

Resource Categories (Room Types)

Mews stores multilingual names. We extract the English name:

if self.resource_categories is not None: resource_lookup = self.resource_categories.select( F.col("Id").alias("RequestedResourceCategoryId"), F.col("`Names.en-US`").alias("room_type") # Note: backticks for dotted field ) df = df.join(resource_lookup, on="RequestedResourceCategoryId", how="left")

Revenue Processing (Order Items)

Mews uses an order-based model where charges are stored as order items.

Order Item Types

  • SpaceOrder - Room/accommodation charges
  • ProductOrder - Product sales (F&B, retail, etc.)
  • Other types - Miscellaneous charges

F&B Classification

Products are classified using the products table:

if self.products is not None: # Extract product ID from nested field revenue = revenue.withColumn( "product_id", F.when(F.col("Data.Discriminator") == "Product", F.col("Data.Product.ProductId")) ) # Identify F&B products fnb_products = self.products.filter( (F.col("Classifications.Food") == True) | (F.col("Classifications.Beverage") == True) ) # Join to mark F&B items revenue = revenue.join(fnb_products, on="product_id", how="left")

Revenue Aggregation

Revenue is aggregated by reservation and consumed date:

revenue_agg = revenue.groupBy( F.col("ServiceOrderId").alias("rev_res_id_og"), F.col("stay_date") ).agg( # Room charges F.sum( F.when(F.col("Type") == "SpaceOrder", F.col("net_value")) ).alias("room_stay_date_rate_net"), # F&B charges F.sum( F.when( (F.col("Type") == "ProductOrder") & (F.col("is_fnb") == True), F.col("net_value") ) ).alias("room_stay_date_fnb_net"), # Other charges F.sum( F.when( (F.col("Type") != "SpaceOrder") & ((F.col("Type") != "ProductOrder") | (F.col("is_fnb") == False)), F.col("net_value") ) ).alias("room_stay_date_other_net"), )

Amount Field Handling

The cleaner handles variations in the amount field structure:

if "Amount.NetValue" in revenue.columns: revenue = revenue.withColumn("net_value", F.col("`Amount.NetValue`")) elif "Amount" in revenue.columns: revenue = revenue.withColumn("net_value", F.col("Amount.NetValue")) else: logger.warning("Cannot find Amount.NetValue field, defaulting to 0") revenue = revenue.withColumn("net_value", F.lit(0.0))

Rooms Processing

Stay Date Expansion

Daily room records are generated by expanding the stay:

df = df.withColumn( "room_stay_date", F.explode( F.when( # Same-day checkout F.to_date("check_in_date") == F.col("check_out_date"), F.array(F.to_date("check_in_date")) ).otherwise( # Multi-night stay F.sequence( F.to_date("check_in_date"), F.date_sub(F.to_date("check_out_date"), 1), ) ) ), )

Revenue Join

Revenue is joined by reservation ID and stay date:

df = df.join( revenue_agg, (df["reservation_id_og"] == revenue_agg["rev_res_id_og"]) & (df["room_stay_date"] == revenue_agg["room_stay_date"]), how="left" )

Data Validation

Standard cleaning utilities are applied:

string_columns = [ "first_name", "last_name", "email", "phone_number", "source" ] # Add optional columns if they exist if "market_segment" in df.columns: string_columns.append("market_segment") if "room_rate_code" in df.columns: string_columns.append("room_rate_code") df = clean_string_columns(df, string_columns) df = clean_email(df) df = clean_phone_number(df)

Selective Field Inclusion

The cleaner only includes fields that actually exist in the dataframe:

schema_fields = [] for field in all_required_fields: field_name = nof(field) if field_name in df.columns: schema_fields.append(field) else: logger.log(30, f"Warning: Field {field_name} not found, skipping")

This allows the cleaner to work even when optional lookup tables are unavailable.

Output Tables

Clean Reservations

  • Database: etl_gp_clean_mews
  • Table: reservations
  • Key: reservation_id_og (Mews reservation UUID)
  • Row Count: One per reservation

Clean Guests

  • Database: etl_gp_clean_mews
  • Table: guests
  • Key: guest_id (stable UUID)
  • Source: Mews customers table
  • Deduplication: Latest by created_timestamp

Clean Rooms

  • Database: etl_gp_clean_mews
  • Table: rooms
  • Key: Composite (property_id, res_id, room_stay_date, booking_type)
  • Granularity: One row per reservation per day
  • Revenue: Joined from order_items by stay date

Configuration

Property Mapping

Mews ServiceId is mapped to property_id via configuration:

df = self._map_property(df, "ServiceId")

Configuration should define the mapping:

properties: "service-uuid-1": "Property Name 1" "service-uuid-2": "Property Name 2"

Challenges & Solutions

ChallengeSolution
Optional tables may not existCheck table existence before loading
Multilingual field namesExtract specific locale (e.g., Names.en-US)
Complex guest countingParse PersonCounts with age category classifications
Nested amount fieldsHandle multiple field structures gracefully
Order-based revenueAggregate by type and date, join to rooms
Product F&B classificationJoin with products table to identify food/beverage

Mews-Specific Features

Promo Codes

Mews supports voucher codes which are preserved in promo_code field.

Service-Based Architecture

Mews uses ServiceId to represent bookable services (typically accommodation). This maps to our property_id.

Rich Metadata

Mews provides extensive relational data (business segments, rate plans, resource categories) that enhance the standardized data.

UUID Identifiers

All Mews entities use UUID identifiers, providing globally unique keys.

See Also

Last updated on