SQL Patterns I Use for Reliable ETL
Production-tested SQL patterns for building ETL pipelines that handle edge cases, maintain data quality, and fail gracefully.
After years of building ETL pipelines, I've accumulated a set of SQL patterns that make transformations more reliable, maintainable, and debuggable.
These aren't theoretical patterns—they're production-tested approaches that handle real-world edge cases and data quality issues.
Pattern 1: Incremental Loads with Deduplication
Incremental loads are faster than full refreshes, but they require careful handling of duplicates and updates.
-- Incremental load pattern
WITH source_data AS (
SELECT
id,
name,
email,
updated_at,
ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY updated_at DESC
) AS rn
FROM staging.customers
WHERE updated_at > '{{ last_run_timestamp }}'
),
deduplicated AS (
SELECT
id,
name,
email,
updated_at
FROM source_data
WHERE rn = 1
)
MERGE INTO marts.customers AS target
USING deduplicated AS source
ON target.id = source.id
WHEN MATCHED AND target.updated_at < source.updated_at THEN
UPDATE SET
name = source.name,
email = source.email,
updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (id, name, email, updated_at)
VALUES (source.id, source.name, source.email, source.updated_at);
Key features:
- Handles duplicates in source data
- Only updates if source is newer
- Works on engines that support MERGE (Snowflake, BigQuery, SQL Server, Oracle, PostgreSQL 15+); for engines without it, see the upsert sketch below
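On engines without MERGE, the same load can be expressed as an upsert. A minimal sketch for PostgreSQL, assuming id is the primary key (or carries a unique constraint):
-- Same pattern with INSERT ... ON CONFLICT (PostgreSQL)
INSERT INTO marts.customers AS target (id, name, email, updated_at)
SELECT id, name, email, updated_at
FROM (
    SELECT
        id, name, email, updated_at,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM staging.customers
    WHERE updated_at > '{{ last_run_timestamp }}'
) AS deduplicated
WHERE rn = 1
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    email = EXCLUDED.email,
    updated_at = EXCLUDED.updated_at
WHERE target.updated_at < EXCLUDED.updated_at;  -- only update if source is newer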
Pattern 2: Data Quality Checks Embedded in Transformation
Don't wait for separate quality checks—validate during transformation:
WITH validated_data AS (
SELECT
order_id,
customer_id,
order_date,
amount,
-- Quality checks
CASE
WHEN amount < 0 THEN 'ERROR: Negative amount'
WHEN amount > 1000000 THEN 'WARNING: Unusually large amount'
ELSE NULL
END AS quality_flag,
CASE
WHEN order_date > CURRENT_DATE THEN 'ERROR: Future date'
WHEN order_date < '2020-01-01' THEN 'WARNING: Very old date'
ELSE NULL
END AS date_quality_flag
FROM staging.orders
WHERE order_date >= '{{ start_date }}'
)
SELECT
order_id,
customer_id,
order_date,
amount,
quality_flag,
date_quality_flag
FROM validated_data
WHERE (quality_flag IS NULL OR quality_flag LIKE 'WARNING%')
  AND (date_quality_flag IS NULL OR date_quality_flag LIKE 'WARNING%');
-- Filter out errors on either flag, keep warnings for review
This pattern:
- Catches issues early
- Provides context (WARNING vs ERROR)
- Allows selective filtering based on severity
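A useful companion is a per-run rollup of flag counts, so warning rates can be trended over time. A minimal sketch, assuming the validated output above is materialized as a table (the name marts.orders_validated is hypothetical):
-- Summarize quality flags for monitoring
SELECT
    -- COALESCE picks the first non-null flag; a row can carry both
    COALESCE(quality_flag, date_quality_flag, 'OK') AS flag,
    COUNT(*) AS row_count
FROM marts.orders_validated
GROUP BY COALESCE(quality_flag, date_quality_flag, 'OK')
ORDER BY row_count DESC;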
Pattern 3: Slowly Changing Dimensions (Type 2)
Track historical changes while maintaining query performance:
-- SCD Type 2 pattern: close changed records, then insert new versions.
-- Run both statements in one transaction so readers never see a gap.
-- (UPDATE ... FROM syntax as in PostgreSQL; adapt or use MERGE elsewhere.)

-- Step 1: close current records whose attributes have changed
UPDATE marts.customers_scd2 AS target
SET
    valid_to = CURRENT_TIMESTAMP,
    is_current = 0
FROM (
    SELECT
        customer_id,
        -- Delimited hash for change detection; COALESCE inputs if they can be NULL
        MD5(CONCAT_WS('|', name, email, address)) AS row_hash
    FROM staging.customers
) AS source
WHERE target.customer_id = source.customer_id
  AND target.is_current = 1
  AND target.row_hash != source.row_hash;

-- Step 2: insert a new current version for new and changed customers
INSERT INTO marts.customers_scd2
    (customer_id, name, email, address, row_hash, valid_from, valid_to, is_current)
SELECT
    s.customer_id,
    s.name,
    s.email,
    s.address,
    MD5(CONCAT_WS('|', s.name, s.email, s.address)) AS row_hash,
    CURRENT_TIMESTAMP AS valid_from,
    NULL AS valid_to,
    1 AS is_current
FROM staging.customers AS s
LEFT JOIN marts.customers_scd2 AS e
    ON s.customer_id = e.customer_id
   AND e.is_current = 1
WHERE e.customer_id IS NULL;  -- changed customers were closed in step 1, so they match nothing here
This pattern:
- Maintains complete history
- Makes current records easy to query
- Handles inserts, updates, and unchanged records
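The payoff of Type 2 is the point-in-time query. A minimal sketch, assuming valid_from/valid_to form half-open intervals and the customer_id and date values are illustrative:
-- Reconstruct a customer as of a given date
SELECT customer_id, name, email, address
FROM marts.customers_scd2
WHERE customer_id = 42
  AND valid_from <= '2024-06-01'
  AND (valid_to > '2024-06-01' OR valid_to IS NULL);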
Pattern 4: Pivot with Dynamic Columns
When you need to pivot but don't know all possible values upfront:
-- Dynamic pivot using aggregation
SELECT
date,
SUM(CASE WHEN product_category = 'Electronics' THEN revenue ELSE 0 END) AS electronics_revenue,
SUM(CASE WHEN product_category = 'Clothing' THEN revenue ELSE 0 END) AS clothing_revenue,
SUM(CASE WHEN product_category = 'Food' THEN revenue ELSE 0 END) AS food_revenue,
-- Add more categories as needed
SUM(revenue) AS total_revenue
FROM marts.sales
WHERE date >= '{{ start_date }}'
GROUP BY date
ORDER BY date;
For truly dynamic pivots, use your ETL tool's templating (dbt Jinja, Airflow Jinja) to generate CASE statements from a configuration table.
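A minimal dbt-style sketch of that approach, with the category list hard-coded (in practice it could come from a seed table or dbt_utils.get_column_values):
-- Jinja-generated pivot (dbt model sketch)
{% set categories = ['Electronics', 'Clothing', 'Food'] %}
SELECT
    date,
    {% for category in categories %}
    SUM(CASE WHEN product_category = '{{ category }}' THEN revenue ELSE 0 END)
        AS {{ category | lower }}_revenue,
    {% endfor %}
    SUM(revenue) AS total_revenue
FROM marts.sales
GROUP BY date;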
Pattern 5: Window Functions for Time-Based Calculations
Window functions are powerful for time-series analysis:
-- Calculate running totals and period-over-period changes
SELECT
date,
revenue,
-- Running total
SUM(revenue) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total,
-- Previous period value
LAG(revenue, 1) OVER (ORDER BY date) AS previous_day_revenue,
-- Period-over-period change
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change,
-- 7-day moving average
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS seven_day_avg,
-- Rank within month
RANK() OVER (
PARTITION BY DATE_TRUNC('month', date)
ORDER BY revenue DESC
) AS rank_in_month
FROM marts.daily_revenue
ORDER BY date;
Window functions are essential for:
- Running calculations
- Period comparisons
- Moving averages
- Rankings and percentiles
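One caveat: ROWS BETWEEN 6 PRECEDING counts rows, not days, so gaps in the calendar silently widen the window. On engines that support RANGE frames with an interval offset (PostgreSQL 11+, for example), a date-based frame avoids this:
-- Date-based frame: a true 7-day window even with missing dates
SELECT
    date,
    AVG(revenue) OVER (
        ORDER BY date
        RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
    ) AS seven_day_avg
FROM marts.daily_revenue;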
Pattern 6: Handling NULLs and Defaults
NULLs break calculations. Handle them explicitly:
-- NULL-safe calculations
SELECT
order_id,
-- Coalesce for defaults
COALESCE(discount_amount, 0) AS discount_amount,
-- NULLIF to convert sentinel values to NULL
NULLIF(status, 'UNKNOWN') AS status,
-- CASE for conditional defaults
CASE
WHEN customer_type IS NULL THEN 'UNKNOWN'
ELSE customer_type
END AS customer_type,
-- Safe division
CASE
WHEN total_orders > 0 THEN revenue / total_orders
ELSE NULL
END AS avg_order_value
FROM staging.orders;
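The safe-division CASE has a compact equivalent: NULLIF turns a zero divisor into NULL, so the division returns NULL instead of raising a divide-by-zero error:
-- Compact NULL-safe division
SELECT
    order_id,
    revenue / NULLIF(total_orders, 0) AS avg_order_value
FROM staging.orders;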
Pattern 7: Transactional Consistency
Ensure related records stay in sync:
-- Use transactions for multi-table updates
BEGIN TRANSACTION;
-- Update fact table
UPDATE marts.orders
SET status = 'CANCELLED'
WHERE order_id IN (SELECT order_id FROM staging.cancellations);
-- Update dimension table
UPDATE marts.customers
SET last_order_date = (
    SELECT MAX(order_date)
    FROM marts.orders
    WHERE orders.customer_id = customers.customer_id
      AND orders.status != 'CANCELLED'  -- exclude the orders just cancelled
)
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM staging.cancellations
);
COMMIT;
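It can also pay to verify the updates landed before committing. A minimal sketch, assuming the transaction is driven from an ETL script that can branch on the result:
-- Sanity check inside the transaction: COMMIT only if the counts agree
SELECT
    (SELECT COUNT(*) FROM staging.cancellations) AS expected_cancellations,
    (SELECT COUNT(*)
     FROM marts.orders
     WHERE status = 'CANCELLED'
       AND order_id IN (SELECT order_id FROM staging.cancellations)) AS actual_cancellations;
-- If the two counts differ, issue ROLLBACK instead of COMMIT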
Pattern 8: Error Handling and Logging
Log issues without failing the entire pipeline:
-- Separate valid and invalid records.
-- A CTE is scoped to a single statement, so materialize the
-- validation once (a temp table here) and reuse it in both inserts.
CREATE TEMPORARY TABLE validated AS
SELECT
    *,
    CASE
        WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER_ID'
        WHEN order_date IS NULL THEN 'MISSING_ORDER_DATE'
        WHEN amount IS NULL OR amount <= 0 THEN 'INVALID_AMOUNT'
        ELSE 'VALID'
    END AS validation_status
FROM staging.orders;

-- Process valid records
INSERT INTO marts.orders (customer_id, order_date, amount, product_id)
SELECT
    customer_id, order_date, amount, product_id
FROM validated
WHERE validation_status = 'VALID';

-- Log invalid records for review
INSERT INTO etl.error_log
SELECT
    'orders' AS table_name,
    validation_status AS error_type,
    CURRENT_TIMESTAMP AS error_timestamp,
    * -- All columns for debugging
FROM validated
WHERE validation_status != 'VALID';
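With errors accumulating in one place, a simple rollup turns the log into a monitoring signal:
-- Daily error counts by table and type
SELECT
    table_name,
    error_type,
    DATE_TRUNC('day', error_timestamp) AS error_day,
    COUNT(*) AS error_count
FROM etl.error_log
GROUP BY table_name, error_type, DATE_TRUNC('day', error_timestamp)
ORDER BY error_day DESC, error_count DESC;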
Putting It Together
These patterns work best when combined:
- Use incremental loads for performance
- Embed quality checks for reliability
- Use window functions for analytics
- Handle NULLs explicitly
- Log errors without failing
The Bottom Line
Reliable ETL isn't about writing perfect SQL—it's about patterns that handle edge cases, provide visibility, and fail gracefully.
Adopt these patterns, adapt them to your needs, and build pipelines that work in production, not just in development.