ETL Pipeline Fundamentals

Building ETL Pipelines: A Practical Guide

ETL (Extract, Transform, Load) pipelines move data from source systems to data warehouses.

Batch vs Streaming

Batch Processing

Process data at scheduled intervals. Simpler, cheaper, and easier to test:

def extract_orders():

return pd.read_sql("SELECT * FROM orders WHERE date = CURRENT_DATE", conn)

def transform_orders(df):

df['total'] = df['quantity'] * df['price']

return df

def load_orders(df):

df.to_sql('daily_orders', warehouse_conn, if_exists='append')

Stream Processing

Process data as it arrives with millisecond latency. Use for real-time dashboards, fraud detection, and event-driven architectures.

Orchestration with Airflow

with DAG('orders_etl', schedule_interval='0 6 * * *'):

extract = PythonOperator(task_id='extract', python_callable=extract_orders)

transform = PythonOperator(task_id='transform', python_callable=transform_orders)

load = PythonOperator(task_id='load', python_callable=load_orders)

extract >> transform >> load

Transformation with dbt

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\-- models/staging/stg_orders.sql

with source as (

select * from {{ source('ecommerce', 'orders') }}

)

select id as order_id, customer_id, amount from source

Data Quality

data_quality_checks:

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\- check: row_count > 1000

severity: critical

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\- check: max_timestamp >= yesterday

severity: warning

Conclusion

Choose batch for simplicity and streaming for real-time needs. Use Airflow for orchestration and dbt for transformations. Implement data quality checks at every stage. Design for idempotency and incremental loads.