Mews Cleaning

The MewsCleaner transforms raw Mews PMS data into our standardized format. Mews is a modern cloud-based property management system with a comprehensive API and rich data model.

Overview

The Mews cleaning process has several distinguishing characteristics:

Rich relational model with separate tables for customers, rates, business segments, resource categories, etc.
Multilingual support with localized field names
Complex guest counting using age categories and person counts
Order-based revenue model with product classifications
Flexible table handling - gracefully handles missing optional tables

Data Sources

The cleaner reads from multiple tables in the etl_gp_raw_mews database:

Required Tables

reservations - Main reservation data
customers - Guest profiles and contact information
order_items - Line-item charges for rooms, products, and services

Optional Tables

resource_categories - Room type definitions
rates - Rate plan configurations
business_segments - Market segment definitions
age_categories - Age classification (Adult/Child/Infant) for guest counting
products - Product catalog with F&B classifications

The cleaner checks for table existence before loading optional tables and continues gracefully if they’re not available.

Cleaning Workflow

Deduplication Strategy

Mews uses a simple timestamp-based deduplication:


window = Window.partitionBy("Id").orderBy(
    F.coalesce(F.col("UpdatedUtc"), F.col("CreatedUtc")).desc()
)
 
df = df.withColumn("row_num", F.row_number().over(window))
df = df.filter(F.col("row_num") == 1).drop("row_num")

Key: For duplicate reservation IDs, the most recently updated (or created, if no update timestamp) record wins.

Field Mapping

Reservation Fields

Mews Field	Standard Schema Field	Notes
`Id`	`reservation_id_og`	Unique reservation identifier (UUID)
`BookerId`	`guest_id_og`	Primary guest/booker ID
`CreatedUtc`	`created_timestamp`	When reservation was created
`UpdatedUtc`	`last_modified_timestamp`	Last update time
`StartUtc`	`check_in_date`	Check-in date/time
`EndUtc`	`check_out_date`	Check-out date/time
`CancelledUtc`	`cancellation_date`	Cancellation timestamp
`State`	`status`	Reservation state
`Origin`	`source`	Booking source/origin
`CompanyId`	`mews_company_id`	Corporate account ID
`TravelAgencyId`	`mews_travel_agency_id`	Travel agency ID
`GroupId`	`mews_group_id`	Group booking ID
`VoucherCode`	`promo_code`	Promotional/voucher code
`ServiceId`	`property_id`	Maps to property via config
`PersonCounts`	`guests_adults`, `guests_children`	Parsed via age categories

Status Mapping

Mews State	Standard Status
`Canceled`	`cancelled`
`Confirmed`	`confirmed`
`Started`	`checked_in`
`Processed`	`checked_out`
`Optional`	`tentative`

Guest Counting

Mews uses a sophisticated age category system for guest counting:

PersonCounts Structure


PersonCounts: [
  {"AgeCategoryId": "uuid-adult", "Count": 2},
  {"AgeCategoryId": "uuid-child", "Count": 1}
]

Processing Logic

Explode PersonCounts array into individual rows
Join with age_categories table to get Classification (Adult/Child/Infant)

Aggregate by classification:


df_counts = df_exploded.groupBy("Id").agg(
    F.sum(F.when(F.col("Classification") == "Adult", F.col("Count"))).alias("AdultCount"),
    F.sum(F.when(F.col("Classification") == "Child", F.col("Count"))).alias("ChildCount"),
)

Join counts back to reservation dataframe

Fallback Handling

If age_categories table is not available:


df = df.withColumn("AdultCount", F.lit(None).cast(IntegerType()))
df = df.withColumn("ChildCount", F.lit(None).cast(IntegerType()))

Lookup Table Joins

Business Segments (Market Segments)


if self.business_segments is not None:
    business_seg_lookup = self.business_segments.select(
        F.col("Id").alias("BusinessSegmentId"),
        F.col("Name").alias("market_segment")
    )
    df = df.join(business_seg_lookup, on="BusinessSegmentId", how="left")

Rates (Rate Plans)


if self.rates is not None:
    rate_lookup = self.rates.select(
        F.col("Id").alias("RateId"),
        F.col("Name").alias("room_rate_code")
    )
    df = df.join(rate_lookup, on="RateId", how="left")

Customers (Guest Information)


if self.customers is not None:
    customer_info = self.customers.select(
        F.col("Id").alias("guest_id_og"),
        F.col("FirstName").alias("first_name"),
        F.col("LastName").alias("last_name"),
        F.col("Email").alias("email"),
        F.col("Phone").alias("phone_number"),
        F.col("NationalityCode").alias("nationality"),
    )
    df = df.join(customer_info, on="guest_id_og", how="left")

Resource Categories (Room Types)

Mews stores multilingual names. We extract the English name:


if self.resource_categories is not None:
    resource_lookup = self.resource_categories.select(
        F.col("Id").alias("RequestedResourceCategoryId"),
        F.col("`Names.en-US`").alias("room_type")  # Note: backticks for dotted field
    )
    df = df.join(resource_lookup, on="RequestedResourceCategoryId", how="left")

Revenue Processing (Order Items)

Mews uses an order-based model where charges are stored as order items.

Order Item Types

SpaceOrder - Room/accommodation charges
ProductOrder - Product sales (F&B, retail, etc.)
Other types - Miscellaneous charges

F&B Classification

Products are classified using the products table:


if self.products is not None:
    # Extract product ID from nested field
    revenue = revenue.withColumn(
        "product_id",
        F.when(F.col("Data.Discriminator") == "Product", 
               F.col("Data.Product.ProductId"))
    )
    
    # Identify F&B products
    fnb_products = self.products.filter(
        (F.col("Classifications.Food") == True) | 
        (F.col("Classifications.Beverage") == True)
    )
    
    # Join to mark F&B items
    revenue = revenue.join(fnb_products, on="product_id", how="left")

Revenue Aggregation

Revenue is aggregated by reservation and consumed date:


revenue_agg = revenue.groupBy(
    F.col("ServiceOrderId").alias("rev_res_id_og"),
    F.col("stay_date")
).agg(
    # Room charges
    F.sum(
        F.when(F.col("Type") == "SpaceOrder", F.col("net_value"))
    ).alias("room_stay_date_rate_net"),
    
    # F&B charges
    F.sum(
        F.when(
            (F.col("Type") == "ProductOrder") & (F.col("is_fnb") == True),
            F.col("net_value")
        )
    ).alias("room_stay_date_fnb_net"),
    
    # Other charges
    F.sum(
        F.when(
            (F.col("Type") != "SpaceOrder") & 
            ((F.col("Type") != "ProductOrder") | (F.col("is_fnb") == False)),
            F.col("net_value")
        )
    ).alias("room_stay_date_other_net"),
)

Amount Field Handling

The cleaner handles variations in the amount field structure:


if "Amount.NetValue" in revenue.columns:
    revenue = revenue.withColumn("net_value", F.col("`Amount.NetValue`"))
elif "Amount" in revenue.columns:
    revenue = revenue.withColumn("net_value", F.col("Amount.NetValue"))
else:
    logger.warning("Cannot find Amount.NetValue field, defaulting to 0")
    revenue = revenue.withColumn("net_value", F.lit(0.0))

Rooms Processing

Stay Date Expansion

Daily room records are generated by expanding the stay:


df = df.withColumn(
    "room_stay_date",
    F.explode(
        F.when(
            # Same-day checkout
            F.to_date("check_in_date") == F.col("check_out_date"),
            F.array(F.to_date("check_in_date"))
        ).otherwise(
            # Multi-night stay
            F.sequence(
                F.to_date("check_in_date"),
                F.date_sub(F.to_date("check_out_date"), 1),
            )
        )
    ),
)

Revenue Join

Revenue is joined by reservation ID and stay date:


df = df.join(
    revenue_agg,
    (df["reservation_id_og"] == revenue_agg["rev_res_id_og"]) &
    (df["room_stay_date"] == revenue_agg["room_stay_date"]),
    how="left"
)

Data Validation

Standard cleaning utilities are applied:


string_columns = [
    "first_name", "last_name", "email", 
    "phone_number", "source"
]
 
# Add optional columns if they exist
if "market_segment" in df.columns:
    string_columns.append("market_segment")
if "room_rate_code" in df.columns:
    string_columns.append("room_rate_code")
 
df = clean_string_columns(df, string_columns)
df = clean_email(df)
df = clean_phone_number(df)

Selective Field Inclusion

The cleaner only includes fields that actually exist in the dataframe:


schema_fields = []
for field in all_required_fields:
    field_name = nof(field)
    if field_name in df.columns:
        schema_fields.append(field)
    else:
        logger.log(30, f"Warning: Field {field_name} not found, skipping")

This allows the cleaner to work even when optional lookup tables are unavailable.

Output Tables

Clean Reservations

Database: etl_gp_clean_mews
Table: reservations
Key: reservation_id_og (Mews reservation UUID)
Row Count: One per reservation

Clean Guests

Database: etl_gp_clean_mews
Table: guests
Key: guest_id (stable UUID)
Source: Mews customers table
Deduplication: Latest by created_timestamp

Clean Rooms

Database: etl_gp_clean_mews
Table: rooms
Key: Composite (property_id, res_id, room_stay_date, booking_type)
Granularity: One row per reservation per day
Revenue: Joined from order_items by stay date

Configuration

Property Mapping

Mews ServiceId is mapped to property_id via configuration:


df = self._map_property(df, "ServiceId")

Configuration should define the mapping:


properties:
  "service-uuid-1": "Property Name 1"
  "service-uuid-2": "Property Name 2"

Challenges & Solutions

Challenge	Solution
Optional tables may not exist	Check table existence before loading
Multilingual field names	Extract specific locale (e.g., `Names.en-US`)
Complex guest counting	Parse PersonCounts with age category classifications
Nested amount fields	Handle multiple field structures gracefully
Order-based revenue	Aggregate by type and date, join to rooms
Product F&B classification	Join with products table to identify food/beverage

Mews-Specific Features

Mews supports voucher codes which are preserved in promo_code field.

Service-Based Architecture

Mews uses ServiceId to represent bookable services (typically accommodation). This maps to our property_id.

Rich Metadata

Mews provides extensive relational data (business segments, rate plans, resource categories) that enhance the standardized data.

UUID Identifiers

All Mews entities use UUID identifiers, providing globally unique keys.