Mews Cleaning
The MewsCleaner transforms raw Mews PMS data into our standardized format. Mews is a modern cloud-based property management system with a comprehensive API and rich data model.
Overview
The Mews cleaning process has several distinguishing characteristics:
- Rich relational model with separate tables for customers, rates, business segments, resource categories, etc.
- Multilingual support with localized field names
- Complex guest counting using age categories and person counts
- Order-based revenue model with product classifications
- Flexible table handling - gracefully handles missing optional tables
Data Sources
The cleaner reads from multiple tables in the etl_gp_raw_mews database:
Required Tables
reservations- Main reservation datacustomers- Guest profiles and contact informationorder_items- Line-item charges for rooms, products, and services
Optional Tables
resource_categories- Room type definitionsrates- Rate plan configurationsbusiness_segments- Market segment definitionsage_categories- Age classification (Adult/Child/Infant) for guest countingproducts- Product catalog with F&B classifications
The cleaner checks for table existence before loading optional tables and continues gracefully if they’re not available.
Cleaning Workflow
Deduplication Strategy
Mews uses a simple timestamp-based deduplication:
window = Window.partitionBy("Id").orderBy(
F.coalesce(F.col("UpdatedUtc"), F.col("CreatedUtc")).desc()
)
df = df.withColumn("row_num", F.row_number().over(window))
df = df.filter(F.col("row_num") == 1).drop("row_num")Key: For duplicate reservation IDs, the most recently updated (or created, if no update timestamp) record wins.
Field Mapping
Reservation Fields
| Mews Field | Standard Schema Field | Notes |
|---|---|---|
Id | reservation_id_og | Unique reservation identifier (UUID) |
BookerId | guest_id_og | Primary guest/booker ID |
CreatedUtc | created_timestamp | When reservation was created |
UpdatedUtc | last_modified_timestamp | Last update time |
StartUtc | check_in_date | Check-in date/time |
EndUtc | check_out_date | Check-out date/time |
CancelledUtc | cancellation_date | Cancellation timestamp |
State | status | Reservation state |
Origin | source | Booking source/origin |
CompanyId | mews_company_id | Corporate account ID |
TravelAgencyId | mews_travel_agency_id | Travel agency ID |
GroupId | mews_group_id | Group booking ID |
VoucherCode | promo_code | Promotional/voucher code |
ServiceId | property_id | Maps to property via config |
PersonCounts | guests_adults, guests_children | Parsed via age categories |
Status Mapping
| Mews State | Standard Status |
|---|---|
Canceled | cancelled |
Confirmed | confirmed |
Started | checked_in |
Processed | checked_out |
Optional | tentative |
Guest Counting
Mews uses a sophisticated age category system for guest counting:
PersonCounts Structure
PersonCounts: [
{"AgeCategoryId": "uuid-adult", "Count": 2},
{"AgeCategoryId": "uuid-child", "Count": 1}
]Processing Logic
- Explode PersonCounts array into individual rows
- Join with age_categories table to get Classification (Adult/Child/Infant)
- Aggregate by classification:
df_counts = df_exploded.groupBy("Id").agg( F.sum(F.when(F.col("Classification") == "Adult", F.col("Count"))).alias("AdultCount"), F.sum(F.when(F.col("Classification") == "Child", F.col("Count"))).alias("ChildCount"), ) - Join counts back to reservation dataframe
Fallback Handling
If age_categories table is not available:
df = df.withColumn("AdultCount", F.lit(None).cast(IntegerType()))
df = df.withColumn("ChildCount", F.lit(None).cast(IntegerType()))Lookup Table Joins
Business Segments (Market Segments)
if self.business_segments is not None:
business_seg_lookup = self.business_segments.select(
F.col("Id").alias("BusinessSegmentId"),
F.col("Name").alias("market_segment")
)
df = df.join(business_seg_lookup, on="BusinessSegmentId", how="left")Rates (Rate Plans)
if self.rates is not None:
rate_lookup = self.rates.select(
F.col("Id").alias("RateId"),
F.col("Name").alias("room_rate_code")
)
df = df.join(rate_lookup, on="RateId", how="left")Customers (Guest Information)
if self.customers is not None:
customer_info = self.customers.select(
F.col("Id").alias("guest_id_og"),
F.col("FirstName").alias("first_name"),
F.col("LastName").alias("last_name"),
F.col("Email").alias("email"),
F.col("Phone").alias("phone_number"),
F.col("NationalityCode").alias("nationality"),
)
df = df.join(customer_info, on="guest_id_og", how="left")Resource Categories (Room Types)
Mews stores multilingual names. We extract the English name:
if self.resource_categories is not None:
resource_lookup = self.resource_categories.select(
F.col("Id").alias("RequestedResourceCategoryId"),
F.col("`Names.en-US`").alias("room_type") # Note: backticks for dotted field
)
df = df.join(resource_lookup, on="RequestedResourceCategoryId", how="left")Revenue Processing (Order Items)
Mews uses an order-based model where charges are stored as order items.
Order Item Types
SpaceOrder- Room/accommodation chargesProductOrder- Product sales (F&B, retail, etc.)- Other types - Miscellaneous charges
F&B Classification
Products are classified using the products table:
if self.products is not None:
# Extract product ID from nested field
revenue = revenue.withColumn(
"product_id",
F.when(F.col("Data.Discriminator") == "Product",
F.col("Data.Product.ProductId"))
)
# Identify F&B products
fnb_products = self.products.filter(
(F.col("Classifications.Food") == True) |
(F.col("Classifications.Beverage") == True)
)
# Join to mark F&B items
revenue = revenue.join(fnb_products, on="product_id", how="left")Revenue Aggregation
Revenue is aggregated by reservation and consumed date:
revenue_agg = revenue.groupBy(
F.col("ServiceOrderId").alias("rev_res_id_og"),
F.col("stay_date")
).agg(
# Room charges
F.sum(
F.when(F.col("Type") == "SpaceOrder", F.col("net_value"))
).alias("room_stay_date_rate_net"),
# F&B charges
F.sum(
F.when(
(F.col("Type") == "ProductOrder") & (F.col("is_fnb") == True),
F.col("net_value")
)
).alias("room_stay_date_fnb_net"),
# Other charges
F.sum(
F.when(
(F.col("Type") != "SpaceOrder") &
((F.col("Type") != "ProductOrder") | (F.col("is_fnb") == False)),
F.col("net_value")
)
).alias("room_stay_date_other_net"),
)Amount Field Handling
The cleaner handles variations in the amount field structure:
if "Amount.NetValue" in revenue.columns:
revenue = revenue.withColumn("net_value", F.col("`Amount.NetValue`"))
elif "Amount" in revenue.columns:
revenue = revenue.withColumn("net_value", F.col("Amount.NetValue"))
else:
logger.warning("Cannot find Amount.NetValue field, defaulting to 0")
revenue = revenue.withColumn("net_value", F.lit(0.0))Rooms Processing
Stay Date Expansion
Daily room records are generated by expanding the stay:
df = df.withColumn(
"room_stay_date",
F.explode(
F.when(
# Same-day checkout
F.to_date("check_in_date") == F.col("check_out_date"),
F.array(F.to_date("check_in_date"))
).otherwise(
# Multi-night stay
F.sequence(
F.to_date("check_in_date"),
F.date_sub(F.to_date("check_out_date"), 1),
)
)
),
)Revenue Join
Revenue is joined by reservation ID and stay date:
df = df.join(
revenue_agg,
(df["reservation_id_og"] == revenue_agg["rev_res_id_og"]) &
(df["room_stay_date"] == revenue_agg["room_stay_date"]),
how="left"
)Data Validation
Standard cleaning utilities are applied:
string_columns = [
"first_name", "last_name", "email",
"phone_number", "source"
]
# Add optional columns if they exist
if "market_segment" in df.columns:
string_columns.append("market_segment")
if "room_rate_code" in df.columns:
string_columns.append("room_rate_code")
df = clean_string_columns(df, string_columns)
df = clean_email(df)
df = clean_phone_number(df)Selective Field Inclusion
The cleaner only includes fields that actually exist in the dataframe:
schema_fields = []
for field in all_required_fields:
field_name = nof(field)
if field_name in df.columns:
schema_fields.append(field)
else:
logger.log(30, f"Warning: Field {field_name} not found, skipping")This allows the cleaner to work even when optional lookup tables are unavailable.
Output Tables
Clean Reservations
- Database:
etl_gp_clean_mews - Table:
reservations - Key:
reservation_id_og(Mews reservation UUID) - Row Count: One per reservation
Clean Guests
- Database:
etl_gp_clean_mews - Table:
guests - Key:
guest_id(stable UUID) - Source: Mews customers table
- Deduplication: Latest by
created_timestamp
Clean Rooms
- Database:
etl_gp_clean_mews - Table:
rooms - Key: Composite (
property_id,res_id,room_stay_date,booking_type) - Granularity: One row per reservation per day
- Revenue: Joined from order_items by stay date
Configuration
Property Mapping
Mews ServiceId is mapped to property_id via configuration:
df = self._map_property(df, "ServiceId")Configuration should define the mapping:
properties:
"service-uuid-1": "Property Name 1"
"service-uuid-2": "Property Name 2"Challenges & Solutions
| Challenge | Solution |
|---|---|
| Optional tables may not exist | Check table existence before loading |
| Multilingual field names | Extract specific locale (e.g., Names.en-US) |
| Complex guest counting | Parse PersonCounts with age category classifications |
| Nested amount fields | Handle multiple field structures gracefully |
| Order-based revenue | Aggregate by type and date, join to rooms |
| Product F&B classification | Join with products table to identify food/beverage |
Mews-Specific Features
Promo Codes
Mews supports voucher codes which are preserved in promo_code field.
Service-Based Architecture
Mews uses ServiceId to represent bookable services (typically accommodation). This maps to our property_id.
Rich Metadata
Mews provides extensive relational data (business segments, rate plans, resource categories) that enhance the standardized data.
UUID Identifiers
All Mews entities use UUID identifiers, providing globally unique keys.
See Also
- Cleaning Overview - General cleaning process
- Schema Documentation - Field definitions
- Processing Stage - Next stage after cleaning
- Mews API Documentation - Official Mews API docs