Data Warehousing Concepts and Modern Tools

Data Warehousing Concepts

A data warehouse centralizes data from multiple sources for analysis and reporting. It is optimized for read-heavy analytical queries.

Star Schema

A central fact table connected to dimension tables:

CREATE TABLE fact_sales (

sale_id BIGSERIAL PRIMARY KEY,

date_key INT REFERENCES dim_date(date_key),

product_key INT REFERENCES dim_product(product_key),

customer_key INT REFERENCES dim_customer(customer_key),

quantity INT NOT NULL,

unit_price DECIMAL(10,2) NOT NULL,

total_amount DECIMAL(12,2) GENERATED ALWAYS AS

(quantity * unit_price) STORED

);

CREATE TABLE dim_date (

date_key INT PRIMARY KEY,

date DATE NOT NULL, year SMALLINT, quarter SMALLINT,

month SMALLINT, day SMALLINT, is_holiday BOOLEAN

);

Snowflake Schema

Normalized dimensions for storage efficiency. Dimensions are split into sub-dimensions, saving storage at the cost of more joins.

ETL Pipeline

class ETLPipeline:

def extract(self, query):

return pd.read_sql(query, self.source_engine, chunksize=10000)

def transform(self, df):

df = df.drop_duplicates(subset=["order_id"])

df["order_date"] = pd.to_datetime(df["order_date"])

df["date_key"] = df["order_date"].dt.strftime("%Y%m%d").astype(int)

return df

def load(self, df, table_name):

df.to_sql(table_name, self.warehouse_engine, if_exists="append", index=False)

Modern Data Warehousing

Cloud data warehouses like Snowflake and BigQuery separate storage and compute, enabling elastic scaling. Materialized views pre-compute aggregations for dashboard queries.

Conclusion

Design with star schema for performance. Build resilient ETL pipelines. Leverage cloud warehouses for elastic scaling. Start simple and evolve.