A data pipeline is the set of steps that move data from its source to wherever it needs to go — cleaned, transformed, and ready for analysis or model training. Building reliable pipelines is one of the most critical (and underappreciated) parts of real-world AI.

Stage 1 — Ingestion

Data arrives from many sources: databases, APIs, spreadsheets, web scraping, sensors, log files, user events. Ingestion is the process of getting this data into a central location — typically a data warehouse or data lake.

Batch ingestion happens on a schedule (nightly database dumps). Stream ingestion happens continuously in real time (user click events, sensor readings). Modern AI systems often need both.
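A minimal sketch of batch ingestion, using a hypothetical `fetch_orders_batch()` source and SQLite as a stand-in for a real warehouse (both are assumptions for illustration):

```python
import sqlite3

def fetch_orders_batch():
    # Hypothetical source: in practice this would be an API call,
    # a database export, or a file drop. Returns (order_id, customer_id, value).
    return [
        (1, "c-001", 25.00),
        (2, "c-002", 40.00),
        (3, "c-001", 15.50),
    ]

def ingest(conn, rows):
    # Idempotent load: INSERT OR REPLACE means re-running the nightly
    # job does not create duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, customer_id, order_value) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the central warehouse
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id TEXT, order_value REAL)"
)
ingest(conn, fetch_orders_batch())
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 3
```

A stream-ingestion version would replace the scheduled fetch with a consumer that processes events one at a time as they arrive.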

Stage 2 — Storage

Where data lives determines how it can be used:

  • Data warehouses hold structured, cleaned data optimised for SQL queries and analytics (e.g. Snowflake, BigQuery, Redshift).
  • Data lakes hold raw data of any type — CSVs, images, logs — cheaply and at scale (e.g. object stores like Amazon S3); the data must be processed before use.

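The warehouse/lake contrast can be illustrated with a toy sketch — a temp directory standing in for the lake and SQLite for the warehouse (both are illustrative stand-ins): the lake accepts anything as-is, while the warehouse enforces a typed schema at write time.

```python
import json, sqlite3, tempfile, pathlib

# "Lake": raw events land as-is; structure is applied only on read.
lake = pathlib.Path(tempfile.mkdtemp())
raw_event = {"customer_id": "c-001", "clicked": "yes", "extra": {"ua": "firefox"}}
(lake / "events.json").write_text(json.dumps(raw_event))

# "Warehouse": a fixed, typed schema enforced at write time.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE events (customer_id TEXT NOT NULL, clicked INTEGER NOT NULL)")

# Loading from lake to warehouse forces cleanup: types coerced, extras dropped.
event = json.loads((lake / "events.json").read_text())
wh.execute(
    "INSERT INTO events VALUES (?, ?)",
    (event["customer_id"], 1 if event["clicked"] == "yes" else 0),
)
print(wh.execute("SELECT clicked FROM events").fetchone()[0])  # 1
```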
Stage 3 — Transformation (ETL / ELT)

Raw data is almost never in the right format. Transformation includes: joining tables, aggregating values, converting data types, creating derived features, handling missing values, normalising text.

-- SQL transformation example: turn raw orders into per-customer features
SELECT
  customer_id,
  COUNT(order_id) AS total_orders,
  SUM(order_value) AS lifetime_value,
  MAX(order_date) AS last_order_date,
  DATEDIFF(NOW(), MAX(order_date)) AS days_since_last_order
FROM orders
GROUP BY customer_id;
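The same transformation can be sketched in pandas (column names assumed to match the SQL example; a fixed "today" is used in place of NOW() so the result is reproducible):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": ["c-001", "c-002", "c-001"],
    "order_value": [25.0, 40.0, 15.5],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-01"]),
})

# GROUP BY customer_id with named aggregations
features = orders.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    lifetime_value=("order_value", "sum"),
    last_order_date=("order_date", "max"),
)

# DATEDIFF(NOW(), MAX(order_date)) equivalent, with a fixed reference date
features["days_since_last_order"] = (
    pd.Timestamp("2024-04-01") - features["last_order_date"]
).dt.days
print(features)
```

Either tool works; SQL pushes the computation to the warehouse, while pandas pulls the data into application memory.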

Stage 4 — Validation

Data quality degrades silently. Validation checks ensure data meets expectations before it flows downstream. Schema checks (are columns the right type?), range checks (are values in expected bounds?), completeness checks (are there more missing values than usual?) — these catch problems before they corrupt your models.

# Python: basic data validation with pandas
import pandas as pd

df = pd.read_csv("customers.csv")

# Schema check: expected columns are present
assert {"customer_id", "age", "email"}.issubset(df.columns), "Missing columns"

# Range check: ages within plausible bounds
assert df["age"].between(0, 120).all(), "Invalid ages detected"

# Format check: na=False makes missing emails fail rather than pass silently
assert df["email"].str.contains("@", na=False).all(), "Invalid emails detected"

# Uniqueness check: no duplicate customer IDs
assert df.duplicated("customer_id").sum() == 0, "Duplicate IDs found"
print("Validation passed")

Stage 5 — Serving

Processed data or model outputs need to be served to applications. This might be a REST API that returns predictions in real time, a dashboard that refreshes hourly, or a batch job that scores all customers overnight.
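The batch-scoring flavour of serving can be sketched as follows — `score()` is a hypothetical stand-in for a trained model, and the feature values are made up for illustration:

```python
def score(customer):
    # Hypothetical model: a real job would load a trained model artifact here.
    return min(1.0, customer["total_orders"] * 0.1
               + customer["lifetime_value"] / 1000)

customers = [
    {"customer_id": "c-001", "total_orders": 2, "lifetime_value": 40.5},
    {"customer_id": "c-002", "total_orders": 1, "lifetime_value": 40.0},
]

# Overnight job: score every customer and publish the results
# somewhere downstream applications can read them.
scores = {c["customer_id"]: score(c) for c in customers}
print(scores)
```

The real-time alternative wraps the same `score()` call behind a REST endpoint so applications can request predictions on demand.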

Pipeline reliability

A pipeline that fails silently is worse than one that fails loudly. Good pipelines have monitoring, alerts, and clear failure modes. Many real-world AI incidents trace back to a silent data pipeline failure — models trained on stale, corrupt, or missing data.
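One common pattern for failing loudly is a wrapper that logs every step and re-raises on error, so the scheduler marks the run as failed instead of quietly continuing (the alerting hook named in the comment is an assumption, not a prescribed tool):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_step(name, fn, *args):
    # Run one pipeline step; log success, and on failure log loudly
    # and re-raise. Never swallow the exception.
    try:
        result = fn(*args)
        log.info("step %s succeeded", name)
        return result
    except Exception:
        log.exception("step %s FAILED; aborting pipeline", name)
        # In production, fire an alert (e.g. Slack, PagerDuty, email) here.
        raise

# Example: a validation step that refuses to pass bad data downstream.
def validate(rows):
    if not rows:
        raise ValueError("no rows ingested; upstream source may be down")
    return rows

rows = run_step("validate", validate, [{"customer_id": "c-001"}])
print(len(rows))  # 1
```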

Key takeaways

  • Data pipelines move data from source to analysis/training, in cleaned and transformed form
  • Five stages: ingest → store → transform → validate → serve
  • Data warehouses hold structured, clean data; data lakes hold raw data of any type
  • SQL is the lingua franca of data transformation — an essential skill
  • Validation is critical — silent data quality failures corrupt models downstream