Benefits of Modular ETL
Old Guestpulse ETL Implementation
The old structure of the ETL groups uses the following classes: `Workflow`, `WorkflowContext`, and `WorkflowRunnable`.
WorkflowRunnable
- The most basic unit of work that can be run in the app; the different processes in the application inherit from it.
WorkflowContext
- Holds the information that the workflow will use, e.g. it currently holds the `res_df`, `guest_df`, and `rooms_df` DataFrames.
Workflow
- Houses the set of `WorkflowRunnable` instances that will be run for the whole process on each hotel chain.
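The relationship between these three classes can be sketched as follows. This is a hypothetical minimal reconstruction: the class names and the `res_df`/`guest_df`/`rooms_df` attributes come from the text above, but the method signatures and the `TagLongStays` example are assumptions (plain dicts stand in for DataFrames to keep the sketch dependency-free).

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowContext:
    # Holds the data the runnables operate on (dicts stand in for DataFrames).
    res_df: dict = field(default_factory=dict)
    guest_df: dict = field(default_factory=dict)
    rooms_df: dict = field(default_factory=dict)


class WorkflowRunnable:
    # Base unit of work; concrete processes override run().
    def run(self, context: WorkflowContext) -> None:
        raise NotImplementedError


class Workflow:
    # Houses the runnables executed for one hotel chain.
    def __init__(self, runnables: list[WorkflowRunnable]):
        self.runnables = runnables

    def run(self, context: WorkflowContext) -> None:
        for runnable in self.runnables:
            runnable.run(context)


class TagLongStays(WorkflowRunnable):
    # Hypothetical process: flags reservations longer than a week.
    def run(self, context: WorkflowContext) -> None:
        context.res_df["long_stay"] = context.res_df.get("nights", 0) > 7


ctx = WorkflowContext(res_df={"nights": 10})
Workflow([TagLongStays()]).run(ctx)
print(ctx.res_df["long_stay"])  # True
```

Every runnable receives the same shared context, which is why the old design was tightly coupled to the fixed set of DataFrames it carried.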
Orchestrated Workflow
- After the ingester and crawler are run, the following processes and reports are started, using `Cleaner` as a dependency for its `get_clean_reservations`, `get_clean_guests`, and `get_clean_rooms` methods.
- `Cleaner` is injected as a dependency into the `Workflow`, usually via the `from_cleaner` static method.
- The `Workflow` is separated into a process chain and reports.
- The process chain is a list of `WorkflowRunnable` instances that add calculated values and transformations to the existing cleaned data, generating processed data.
- Reports is also a list of `WorkflowRunnable` instances that analyze the processed data and save it to the database sink (BigQuery).
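The `from_cleaner` injection described above can be sketched like this. Only the `Cleaner` method names and `from_cleaner` come from the text; the method bodies, constructor shape, and sample data are assumptions.

```python
class Cleaner:
    # Assumed stub bodies; the real methods return cleaned DataFrames.
    def get_clean_reservations(self):
        return [{"id": 1, "nights": 3}]

    def get_clean_guests(self):
        return [{"id": 1, "name": "Ada"}]

    def get_clean_rooms(self):
        return [{"id": 101, "type": "suite"}]


class Workflow:
    def __init__(self, reservations, guests, rooms):
        self.reservations = reservations
        self.guests = guests
        self.rooms = rooms
        self.process_chain = []  # runnables that transform cleaned data
        self.reports = []        # runnables that write results to the sink

    @staticmethod
    def from_cleaner(cleaner: "Cleaner") -> "Workflow":
        # The cleaner is the single source of cleaned inputs.
        return Workflow(
            cleaner.get_clean_reservations(),
            cleaner.get_clean_guests(),
            cleaner.get_clean_rooms(),
        )


wf = Workflow.from_cleaner(Cleaner())
print(len(wf.reservations))  # 1
```

Injecting the cleaner rather than constructing it inside the workflow keeps the cleaning step swappable, but note that the workflow is still hard-wired to exactly three entity types.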
Benefits of the New Architecture
The new modular ETL architecture addresses the limitations of the previous system by introducing a flexible, task-based approach. This shift brings several key benefits that improve scalability, maintainability, and performance.
1. Incremental Processing Support
One of the most significant improvements is the native support for incremental data processing.
- Efficiency: Instead of reprocessing the entire dataset every time, tasks can be configured to process only new or modified data.
- Built-in Logic: The `Task` class provides standard methods (`_run_incremental` and `_merge_incremental`) to handle delta loads seamlessly.
- Cost Reduction: By processing smaller batches of data, compute resources are utilized more efficiently, leading to faster execution times and lower cloud costs.
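A minimal sketch of the incremental pattern: `_run_incremental` and `_merge_incremental` are named in the text, but their signatures and the watermark logic below are assumptions made for illustration.

```python
class Task:
    def __init__(self, watermark: int = 0):
        self.watermark = watermark  # highest row id processed so far
        self.output: list[dict] = []

    def _run_incremental(self, rows: list[dict]) -> list[dict]:
        # Process only rows newer than the watermark (the delta),
        # instead of reprocessing the entire dataset.
        delta = [r for r in rows if r["id"] > self.watermark]
        return [dict(r, processed=True) for r in delta]

    def _merge_incremental(self, delta: list[dict]) -> None:
        # Merge the delta into existing output and advance the watermark.
        self.output.extend(delta)
        if delta:
            self.watermark = max(r["id"] for r in delta)

    def run(self, rows: list[dict]) -> None:
        self._merge_incremental(self._run_incremental(rows))


task = Task()
task.run([{"id": 1}, {"id": 2}])
task.run([{"id": 1}, {"id": 2}, {"id": 3}])  # only id 3 is new work
print(len(task.output), task.watermark)  # 3 3
```

On the second run, rows 1 and 2 are skipped entirely; only the delta is computed and merged, which is where the cost savings come from.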
2. Unlimited Data Extensibility
The old workflow was rigidly structured around specific entities like guests, rooms, and reservations. The new architecture removes these constraints.
- Generic Data Models: You are no longer limited to a fixed set of data types. You can define `Model` classes for any kind of data, such as emails, loyalty transactions, external API responses, or custom business metrics.
- Flexible Integration: Adding a new data source is as simple as defining a new `Model` and a corresponding `Task`. The system treats all data types with the same importance and capability.
- Future-Proofing: As business requirements evolve, the pipeline can easily adapt to ingest and process novel datasets without requiring a rewrite of the core framework.
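Adding a new entity can be sketched as the following. The `Model`/`Task` split comes from the text, but the base-class shape, the `load` method, and the `LoyaltyTransactions` example are hypothetical, with a stub standing in for real I/O.

```python
class Model:
    # Describes WHAT the data is: schema and storage location.
    table: str = ""
    schema: dict = {}

    @classmethod
    def load(cls) -> list[dict]:
        raise NotImplementedError


class LoyaltyTransactions(Model):
    # A brand-new entity: defined without touching the core framework.
    table = "loyalty_transactions"
    schema = {"guest_id": int, "points": int}

    @classmethod
    def load(cls) -> list[dict]:
        # Stub standing in for a real warehouse read.
        return [{"guest_id": 1, "points": 120}]


class SumPointsTask:
    # Describes HOW the data is transformed.
    def run(self) -> int:
        return sum(row["points"] for row in LoyaltyTransactions.load())


print(SumPointsTask().run())  # 120
```

Nothing about loyalty transactions is special-cased anywhere else; the same two definitions would suffice for emails or API responses.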
3. Modular and Decoupled Design
The codebase promotes a strict separation of concerns, making it easier to understand, test, and maintain.
- Tasks: Focus solely on how data is transformed. They are isolated units of logic that declare their inputs and outputs explicitly.
- Models: Focus solely on what the data is (schema, location). They handle I/O operations, decoupling storage details from processing logic.
- Orchestrator: Focuses solely on when tasks run. It manages the execution flow, ensuring that dependencies are met before a task starts.
4. Automated Dependency Management
The Orchestrator eliminates the need for manual task chaining.
- Self-Organizing Workflows: Tasks declare what they `require()` and what they `provide()`. The Orchestrator uses this information to automatically build a dependency graph.
- Topological Execution: The system automatically determines the optimal execution order. If Task B requires data from Task A, the Orchestrator guarantees that Task A runs first.
- Error Prevention: Circular dependencies and missing inputs are detected at the planning stage, preventing runtime failures.
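The planning step described above can be sketched with the standard library's topological sorter. `require()`/`provide()` come from the text; the `plan` function, task classes, and error handling are assumptions about how such an orchestrator might work.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


class Task:
    name = ""
    def require(self) -> set[str]: return set()
    def provide(self) -> set[str]: return set()


class CleanTask(Task):
    name = "clean"
    def provide(self): return {"clean_reservations"}


class ReportTask(Task):
    name = "report"
    def require(self): return {"clean_reservations"}
    def provide(self): return {"report"}


def plan(tasks: list[Task]) -> list[str]:
    # Map each declared output to the task that provides it.
    providers = {out: t.name for t in tasks for out in t.provide()}
    graph: dict[str, set[str]] = {}
    for t in tasks:
        deps = set()
        for inp in t.require():
            if inp not in providers:  # missing input caught at planning time
                raise ValueError(f"no task provides {inp!r}")
            deps.add(providers[inp])
        graph[t.name] = deps
    # TopologicalSorter raises CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


print(plan([ReportTask(), CleanTask()]))  # ['clean', 'report']
```

Even though `ReportTask` is listed first, planning orders `clean` before `report`, and both missing inputs and cycles surface before any task executes.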