Data Engineering · July 10, 2024 · 11 min read

SQL Patterns I Use for Reliable ETL

Production-tested SQL patterns for building ETL pipelines that handle edge cases, maintain data quality, and fail gracefully.

SQL · ETL · Data Engineering · Best Practices

After years of building ETL pipelines, I've accumulated a set of SQL patterns that make transformations more reliable, maintainable, and debuggable.

These aren't theoretical patterns—they're production-tested approaches that handle real-world edge cases and data quality issues.

Pattern 1: Incremental Loads with Deduplication

Incremental loads are faster than full refreshes, but they require careful handling of duplicates and updates.

-- Incremental load pattern
WITH source_data AS (
    SELECT
        id,
        name,
        email,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY id 
            ORDER BY updated_at DESC
        ) AS rn
    FROM staging.customers
    WHERE updated_at > '{{ last_run_timestamp }}'
),
deduplicated AS (
    SELECT
        id,
        name,
        email,
        updated_at
    FROM source_data
    WHERE rn = 1
)
MERGE INTO marts.customers AS target
USING deduplicated AS source
ON target.id = source.id
WHEN MATCHED AND target.updated_at < source.updated_at THEN
    UPDATE SET
        name = source.name,
        email = source.email,
        updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (id, name, email, updated_at)
    VALUES (source.id, source.name, source.email, source.updated_at);

Key features:

  • Handles duplicates in the source by keeping only the latest row per id
  • Only updates the target when the source row is newer
  • Uses MERGE, which most warehouses (and PostgreSQL 15+) support; for engines without it, see the upsert sketch below
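
If your database lacks MERGE, the same upsert can be written with INSERT ... ON CONFLICT. A minimal sketch in PostgreSQL syntax, assuming id is the primary key of marts.customers:

-- PostgreSQL-style upsert equivalent of the MERGE above
WITH deduplicated AS (
    SELECT DISTINCT ON (id) id, name, email, updated_at
    FROM staging.customers
    WHERE updated_at > '{{ last_run_timestamp }}'
    ORDER BY id, updated_at DESC  -- keep the latest row per id
)
INSERT INTO marts.customers AS target (id, name, email, updated_at)
SELECT id, name, email, updated_at
FROM deduplicated
ON CONFLICT (id) DO UPDATE
SET name       = EXCLUDED.name,
    email      = EXCLUDED.email,
    updated_at = EXCLUDED.updated_at
-- Only overwrite when the incoming row is newer
WHERE target.updated_at < EXCLUDED.updated_at;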

Pattern 2: Data Quality Checks Embedded in Transformation

Don't wait for separate quality checks—validate during transformation:

WITH validated_data AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        amount,
        -- Quality checks
        CASE 
            WHEN amount < 0 THEN 'ERROR: Negative amount'
            WHEN amount > 1000000 THEN 'WARNING: Unusually large amount'
            ELSE NULL
        END AS quality_flag,
        CASE 
            WHEN order_date > CURRENT_DATE THEN 'ERROR: Future date'
            WHEN order_date < '2020-01-01' THEN 'WARNING: Very old date'
            ELSE NULL
        END AS date_quality_flag
    FROM staging.orders
    WHERE order_date >= '{{ start_date }}'
)
SELECT
    order_id,
    customer_id,
    order_date,
    amount,
    quality_flag,
    date_quality_flag
FROM validated_data
WHERE (quality_flag IS NULL OR quality_flag LIKE 'WARNING%')
  AND (date_quality_flag IS NULL OR date_quality_flag LIKE 'WARNING%');
-- Filter out errors on either flag, keep warnings for review

This pattern:

  • Catches issues early
  • Provides context (WARNING vs ERROR)
  • Allows selective filtering based on severity
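
The severity labels also give your orchestrator something concrete to gate on. A minimal sketch, assuming the flagged rows are additionally materialized into a table (staging.orders_validated here is hypothetical) so a follow-up query can fail the run when errors show up:

-- Gate query run after the transform: fail the job if error_count > 0,
-- alert if warning_count crosses a threshold you choose
SELECT
    SUM(CASE WHEN quality_flag LIKE 'ERROR%'
              OR date_quality_flag LIKE 'ERROR%' THEN 1 ELSE 0 END) AS error_count,
    SUM(CASE WHEN quality_flag LIKE 'WARNING%'
              OR date_quality_flag LIKE 'WARNING%' THEN 1 ELSE 0 END) AS warning_count
FROM staging.orders_validated;  -- hypothetical materialization of validated_data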

Pattern 3: Slowly Changing Dimensions (Type 2)

Track historical changes while maintaining query performance:

-- SCD Type 2 pattern: close changed records, then insert new versions
-- (run both statements inside one transaction)

-- Step 1: close current records whose tracked attributes have changed
-- (UPDATE ... FROM syntax; some engines prefer MERGE here)
UPDATE marts.customers_scd2 AS target
SET valid_to = CURRENT_TIMESTAMP,
    is_current = 0
FROM (
    SELECT
        customer_id,
        -- Hash of tracked attributes for change detection;
        -- COALESCE so a single NULL doesn't null out the whole hash
        MD5(CONCAT(COALESCE(name, ''), '|', COALESCE(email, ''), '|', COALESCE(address, ''))) AS row_hash
    FROM staging.customers
) AS source
WHERE target.customer_id = source.customer_id
  AND target.is_current = 1
  AND target.row_hash != source.row_hash;

-- Step 2: insert a new current version for changed and brand-new customers
WITH source_with_hashes AS (
    SELECT
        customer_id,
        name,
        email,
        address,
        MD5(CONCAT(COALESCE(name, ''), '|', COALESCE(email, ''), '|', COALESCE(address, ''))) AS row_hash
    FROM staging.customers
)
INSERT INTO marts.customers_scd2
    (customer_id, name, email, address, row_hash, valid_from, valid_to, is_current)
SELECT
    s.customer_id,
    s.name,
    s.email,
    s.address,
    s.row_hash,
    CURRENT_TIMESTAMP AS valid_from,
    NULL AS valid_to,
    1 AS is_current
FROM source_with_hashes AS s
LEFT JOIN marts.customers_scd2 AS e
    ON s.customer_id = e.customer_id
   AND e.is_current = 1
-- No current row left means the customer is new or was just closed in step 1
WHERE e.customer_id IS NULL;

This pattern:

  • Maintains complete history
  • Makes current records easy to query
  • Handles inserts, updates, and unchanged records
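
Two queries that fall straight out of this structure:

-- Current view of every customer
SELECT customer_id, name, email, address
FROM marts.customers_scd2
WHERE is_current = 1;

-- Point-in-time view: the customer as of a given date
SELECT customer_id, name, email, address
FROM marts.customers_scd2
WHERE valid_from <= '2024-01-01'
  AND (valid_to > '2024-01-01' OR valid_to IS NULL);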

Pattern 4: Pivot with Dynamic Columns

When you need to pivot but don't know all possible values upfront:

-- Dynamic pivot using aggregation
SELECT
    date,
    SUM(CASE WHEN product_category = 'Electronics' THEN revenue ELSE 0 END) AS electronics_revenue,
    SUM(CASE WHEN product_category = 'Clothing' THEN revenue ELSE 0 END) AS clothing_revenue,
    SUM(CASE WHEN product_category = 'Food' THEN revenue ELSE 0 END) AS food_revenue,
    -- Add more categories as needed
    SUM(revenue) AS total_revenue
FROM marts.sales
WHERE date >= '{{ start_date }}'
GROUP BY date
ORDER BY date;

For truly dynamic pivots, use your ETL tool's templating (dbt Jinja, Airflow Jinja) to generate CASE statements from a configuration table.
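
As an illustration, here's a hedged sketch of what that looks like in dbt, assuming the dbt_utils package is installed and sales is a dbt model with a product_category column (this version pulls the category list from the data itself; reading it from a config table works the same way):

-- dbt model: one revenue column per category value found in the data
{% set categories = dbt_utils.get_column_values(ref('sales'), 'product_category') %}

SELECT
    date,
    {% for category in categories %}
    SUM(CASE WHEN product_category = '{{ category }}' THEN revenue ELSE 0 END)
        AS {{ category | lower | replace(' ', '_') }}_revenue,
    {% endfor %}
    SUM(revenue) AS total_revenue
FROM {{ ref('sales') }}
GROUP BY date
ORDER BY date

dbt_utils also ships a pivot macro that packages the same idea.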

Pattern 5: Window Functions for Time-Based Calculations

Window functions are powerful for time-series analysis:

-- Calculate running totals and period-over-period changes
SELECT
    date,
    revenue,
    -- Running total
    SUM(revenue) OVER (
        ORDER BY date 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    -- Previous period value
    LAG(revenue, 1) OVER (ORDER BY date) AS previous_day_revenue,
    -- Period-over-period change
    revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day_change,
    -- 7-day moving average
    AVG(revenue) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS seven_day_avg,
    -- Rank within month
    RANK() OVER (
        PARTITION BY DATE_TRUNC('month', date)
        ORDER BY revenue DESC
    ) AS rank_in_month
FROM marts.daily_revenue
ORDER BY date;

Window functions are essential for:

  • Running calculations
  • Period comparisons
  • Moving averages
  • Rankings and percentiles
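
For the percentile side of that list, the same windowing approach applies; a small example:

-- Percentile-style metrics over daily revenue
SELECT
    date,
    revenue,
    -- Relative rank from 0 to 1: (rank - 1) / (rows - 1)
    PERCENT_RANK() OVER (ORDER BY revenue) AS revenue_percent_rank,
    -- Quartile bucket: 1 = lowest 25% of days, 4 = highest
    NTILE(4) OVER (ORDER BY revenue) AS revenue_quartile
FROM marts.daily_revenue;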

Pattern 6: Handling NULLs and Defaults

NULLs break calculations. Handle them explicitly:

-- NULL-safe calculations
SELECT
    order_id,
    -- Coalesce for defaults
    COALESCE(discount_amount, 0) AS discount_amount,
    -- NULLIF to convert sentinel values to NULL
    NULLIF(status, 'UNKNOWN') AS status,
    -- CASE for conditional defaults
    CASE 
        WHEN customer_type IS NULL THEN 'UNKNOWN'
        ELSE customer_type
    END AS customer_type,
    -- Safe division
    CASE 
        WHEN total_orders > 0 THEN revenue / total_orders
        ELSE NULL
    END AS avg_order_value
FROM staging.orders;
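
NULLs are just as sneaky in filters: any comparison with NULL evaluates to unknown rather than true or false, so rows silently drop out of both branches of a predicate. A small illustration:

-- A NULL status fails both predicates, so those rows appear in neither count
SELECT COUNT(*) FROM staging.orders WHERE status = 'CANCELLED';
SELECT COUNT(*) FROM staging.orders WHERE status != 'CANCELLED';

-- Make the intent explicit instead
SELECT COUNT(*) FROM staging.orders
WHERE status != 'CANCELLED' OR status IS NULL;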

Pattern 7: Transactional Consistency

Ensure related records stay in sync:

-- Use transactions for multi-table updates
BEGIN TRANSACTION;

-- Update fact table
UPDATE marts.orders
SET status = 'CANCELLED'
WHERE order_id IN (SELECT order_id FROM staging.cancellations);

-- Update dimension table to reflect the cancellations
UPDATE marts.customers AS c
SET last_order_date = (
    SELECT MAX(o.order_date)
    FROM marts.orders AS o
    WHERE o.customer_id = c.customer_id
      AND o.status != 'CANCELLED'  -- ignore the orders cancelled above
)
WHERE c.customer_id IN (
    SELECT DISTINCT customer_id
    FROM staging.cancellations
);

COMMIT;

Pattern 8: Error Handling and Logging

Log issues without failing the entire pipeline:

-- Separate valid and invalid records
-- (materialized once so both inserts below can read it; a CTE would be
--  scoped to a single statement)
CREATE TEMPORARY TABLE validated AS
SELECT
    *,
    CASE 
        WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER_ID'
        WHEN order_date IS NULL THEN 'MISSING_ORDER_DATE'
        WHEN amount IS NULL OR amount <= 0 THEN 'INVALID_AMOUNT'
        ELSE 'VALID'
    END AS validation_status
FROM staging.orders;

-- Process valid records
INSERT INTO marts.orders (order_id, customer_id, order_date, amount, product_id)
SELECT
    order_id, customer_id, order_date, amount, product_id
FROM validated
WHERE validation_status = 'VALID';

-- Log invalid records for review
INSERT INTO etl.error_log
SELECT
    'orders' AS table_name,
    validation_status AS error_type,
    CURRENT_TIMESTAMP AS error_timestamp,
    *  -- all source columns (plus validation_status) for debugging;
       -- etl.error_log needs a matching column list
FROM validated
WHERE validation_status != 'VALID';

Putting It Together

These patterns work best when combined:

  1. Use incremental loads for performance
  2. Embed quality checks for reliability
  3. Use window functions for analytics
  4. Handle NULLs explicitly
  5. Log errors without failing

The Bottom Line

Reliable ETL isn't about writing perfect SQL—it's about patterns that handle edge cases, provide visibility, and fail gracefully.

Adopt these patterns, adapt them to your needs, and build pipelines that work in production, not just in development.