Benefits of Modular ETL
Old Guestpulse ETL Implementation
The old structure of the ETL groups uses the following classes: `Workflow`, `WorkflowContext`, and `WorkflowRunnable`.
WorkflowRunnable
- The most basic unit of work that can be run in the app; the different processes in the application inherit from it.
WorkflowContext
- Holds the information that the workflow will use, e.g. it currently holds the `res_df`, `guest_df`, and `rooms_df` DataFrames.
Workflow
- Houses the set of `WorkflowRunnable` instances that will be run for the whole process on each hotel chain.
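The relationship between these three classes can be sketched as follows. This is a hypothetical minimal reconstruction: the class names and the `res_df`/`guest_df`/`rooms_df` attributes come from the text above, but the method signatures and the `TagLongStays` example are assumptions (plain dicts stand in for DataFrames to keep the sketch dependency-free).

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowContext:
    # Holds the data the runnables operate on (dicts stand in for DataFrames).
    res_df: dict = field(default_factory=dict)
    guest_df: dict = field(default_factory=dict)
    rooms_df: dict = field(default_factory=dict)


class WorkflowRunnable:
    # Base unit of work; concrete processes override run().
    def run(self, context: WorkflowContext) -> None:
        raise NotImplementedError


class Workflow:
    # Houses the runnables executed for one hotel chain.
    def __init__(self, runnables: list[WorkflowRunnable]):
        self.runnables = runnables

    def run(self, context: WorkflowContext) -> None:
        for runnable in self.runnables:
            runnable.run(context)


class TagLongStays(WorkflowRunnable):
    # Hypothetical process: flags reservations longer than a week.
    def run(self, context: WorkflowContext) -> None:
        context.res_df["long_stay"] = context.res_df.get("nights", 0) > 7


ctx = WorkflowContext(res_df={"nights": 10})
Workflow([TagLongStays()]).run(ctx)
print(ctx.res_df["long_stay"])  # True
```

Every runnable receives the same shared context, which is why the old design was tightly coupled to the fixed set of DataFrames it carried.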
Orchestrated Workflow
- After the ingester and crawler are run, the following processes and reports are started, using `Cleaner` as a dependency for its `get_clean_reservations`, `get_clean_guests`, and `get_clean_rooms` methods.
- `Cleaner` is injected as a dependency into the `Workflow`, usually via the `from_cleaner` static method.
- The `Workflow` is separated into a process chain and reports.
- The process chain is a list of `WorkflowRunnable` instances that add calculated values and transformations to the existing cleaned data, generating processed data.
- Reports is also a list of `WorkflowRunnable` instances that analyze the processed data and save it to the database sink (BigQuery).
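The `from_cleaner` injection described above can be sketched like this. Only the `Cleaner` method names and `from_cleaner` come from the text; the method bodies, constructor shape, and sample data are assumptions.

```python
class Cleaner:
    # Assumed stub bodies; the real methods return cleaned DataFrames.
    def get_clean_reservations(self):
        return [{"id": 1, "nights": 3}]

    def get_clean_guests(self):
        return [{"id": 1, "name": "Ada"}]

    def get_clean_rooms(self):
        return [{"id": 101, "type": "suite"}]


class Workflow:
    def __init__(self, reservations, guests, rooms):
        self.reservations = reservations
        self.guests = guests
        self.rooms = rooms
        self.process_chain = []  # runnables that transform cleaned data
        self.reports = []        # runnables that write results to the sink

    @staticmethod
    def from_cleaner(cleaner: "Cleaner") -> "Workflow":
        # The cleaner is the single source of cleaned inputs.
        return Workflow(
            cleaner.get_clean_reservations(),
            cleaner.get_clean_guests(),
            cleaner.get_clean_rooms(),
        )


wf = Workflow.from_cleaner(Cleaner())
print(len(wf.reservations))  # 1
```

Injecting the cleaner rather than constructing it inside the workflow keeps the cleaning step swappable, but note that the workflow is still hard-wired to exactly three entity types.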
Benefits of the New Architecture
The new modular ETL architecture addresses the limitations of the previous system by introducing a flexible, task-based approach. This shift brings several key benefits that improve scalability, maintainability, and performance.
1. Incremental Processing Support
One of the most significant improvements is the native support for incremental data processing.
- Efficiency: Instead of reprocessing the entire dataset every time, tasks can be configured to process only new or modified data.
- Built-in Logic: The `Task` class provides standard methods (`_run_incremental` and `_merge_incremental`) to handle delta loads seamlessly.
- Cost Reduction: By processing smaller batches of data, compute resources are utilized more efficiently, leading to faster execution times and lower cloud costs.
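A minimal sketch of the incremental pattern: `_run_incremental` and `_merge_incremental` are named in the text, but their signatures and the watermark logic below are assumptions made for illustration.

```python
class Task:
    def __init__(self, watermark: int = 0):
        self.watermark = watermark  # highest row id processed so far
        self.output: list[dict] = []

    def _run_incremental(self, rows: list[dict]) -> list[dict]:
        # Process only rows newer than the watermark (the delta),
        # instead of reprocessing the entire dataset.
        delta = [r for r in rows if r["id"] > self.watermark]
        return [dict(r, processed=True) for r in delta]

    def _merge_incremental(self, delta: list[dict]) -> None:
        # Merge the delta into existing output and advance the watermark.
        self.output.extend(delta)
        if delta:
            self.watermark = max(r["id"] for r in delta)

    def run(self, rows: list[dict]) -> None:
        self._merge_incremental(self._run_incremental(rows))


task = Task()
task.run([{"id": 1}, {"id": 2}])
task.run([{"id": 1}, {"id": 2}, {"id": 3}])  # only id 3 is new work
print(len(task.output), task.watermark)  # 3 3
```

On the second run, rows 1 and 2 are skipped entirely; only the delta is computed and merged, which is where the cost savings come from.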
2. Unlimited Data Extensibility
The old workflow was rigidly structured around specific entities like guests, rooms, and reservations. The new architecture removes these constraints.
- Generic Data Models: You are no longer limited to a fixed set of data types. You can define `Model` classes for any kind of data, such as emails, loyalty transactions, external API responses, or custom business metrics.
- Flexible Integration: Adding a new data source is as simple as defining a new `Model` and a corresponding `Task`. The system treats all data types with the same importance and capability.
- Future-Proofing: As business requirements evolve, the pipeline can easily adapt to ingest and process novel datasets without requiring a rewrite of the core framework.
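Adding a new entity can be sketched as the following. The `Model`/`Task` split comes from the text, but the base-class shape, the `load` method, and the `LoyaltyTransactions` example are hypothetical, with a stub standing in for real I/O.

```python
class Model:
    # Describes WHAT the data is: schema and storage location.
    table: str = ""
    schema: dict = {}

    @classmethod
    def load(cls) -> list[dict]:
        raise NotImplementedError


class LoyaltyTransactions(Model):
    # A brand-new entity: defined without touching the core framework.
    table = "loyalty_transactions"
    schema = {"guest_id": int, "points": int}

    @classmethod
    def load(cls) -> list[dict]:
        # Stub standing in for a real warehouse read.
        return [{"guest_id": 1, "points": 120}]


class SumPointsTask:
    # Describes HOW the data is transformed.
    def run(self) -> int:
        return sum(row["points"] for row in LoyaltyTransactions.load())


print(SumPointsTask().run())  # 120
```

Nothing about loyalty transactions is special-cased anywhere else; the same two definitions would suffice for emails or API responses.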
3. Modular and Decoupled Design
The codebase promotes a strict separation of concerns, making it easier to understand, test, and maintain.
- Tasks: Focus solely on how data is transformed. They are isolated units of logic that declare their inputs and outputs explicitly.
- Models: Focus solely on what the data is (schema, location). They handle I/O operations, decoupling storage details from processing logic.
- Orchestrator: Focuses solely on when tasks run. It manages the execution flow, ensuring that dependencies are met before a task starts.
4. Automated Dependency Management
The Orchestrator eliminates the need for manual task chaining.
- Self-Organizing Workflows: Tasks declare what they `require()` and what they `provide()`. The Orchestrator uses this information to automatically build a dependency graph.
- Topological Execution: The system automatically determines the optimal execution order. If Task B requires data from Task A, the Orchestrator guarantees that Task A runs first.
- Error Prevention: Circular dependencies and missing inputs are detected at the planning stage, preventing runtime failures.
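The planning step described above can be sketched with the standard library's topological sorter. `require()`/`provide()` come from the text; the `plan` function, task classes, and error handling are assumptions about how such an orchestrator might work.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


class Task:
    name = ""
    def require(self) -> set[str]: return set()
    def provide(self) -> set[str]: return set()


class CleanTask(Task):
    name = "clean"
    def provide(self): return {"clean_reservations"}


class ReportTask(Task):
    name = "report"
    def require(self): return {"clean_reservations"}
    def provide(self): return {"report"}


def plan(tasks: list[Task]) -> list[str]:
    # Map each declared output to the task that provides it.
    providers = {out: t.name for t in tasks for out in t.provide()}
    graph: dict[str, set[str]] = {}
    for t in tasks:
        deps = set()
        for inp in t.require():
            if inp not in providers:  # missing input caught at planning time
                raise ValueError(f"no task provides {inp!r}")
            deps.add(providers[inp])
        graph[t.name] = deps
    # TopologicalSorter raises CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


print(plan([ReportTask(), CleanTask()]))  # ['clean', 'report']
```

Even though `ReportTask` is listed first, planning orders `clean` before `report`, and both missing inputs and cycles surface before any task executes.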