Keeping Your Data Fresh (The wake-up call at 3am that taught us about observability)
Previously on the dbt migration chronicles: We escaped retail data chaos by adopting dbt, structuring our pipeline into staging/intermediate/marts layers, and using variables to handle multiple store databases with a single project. But building the pipeline was just the beginning.
The 3am Wake-Up Call
It was a Tuesday when our VP of Sales noticed something wrong: Monday's revenue numbers looked too low, and the board meeting was in four hours. We dove into the data and discovered that the San Francisco store had not sent updates in 18 hours. The pipeline had run successfully on stale data, and nobody knew anything was wrong until it was too late.
That morning taught us a crucial lesson: a successful dbt run does not mean your data is fresh, accurate, or complete. You need observability.
What Actually Is Observability?
In data pipelines, observability means answering three questions at any moment:
- Is my source data fresh? When did each retail store last send updates? Are we processing today’s orders or yesterday’s?
- Did my transformations succeed? Not just did the job finish, but did it produce valid results that passed quality checks?
- How is my pipeline performing? Which models are slow? Which tests fail most often? Where should we optimize?
Source Freshness: Never Be Blindsided Again
dbt has a built-in solution for freshness. You define freshness expectations in your source configuration:
```yaml
sources:
  - name: retail_source
    schema: "{{ var('source_schema') }}"
    tables:
      - name: orders
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 30, period: minute}
          error_after: {count: 2, period: hour}
```
Now when we run dbt source freshness, we get clear answers: Is retail_store_sf current? Has retail_store_ny stopped sending data?
With Elementary (an open-source observability tool for dbt) configured, we get Slack alerts when stores go quiet. No more surprises. No more board meetings with stale numbers.
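Elementary ships as a dbt package, so setup starts in the project itself. A minimal sketch of the install looks like this (the version number is illustrative; check Elementary's docs for the current release):

```yaml
# packages.yml -- pull in the Elementary observability package
packages:
  - package: elementary-data/elementary
    version: 0.16.1   # illustrative; pin whatever version you actually test against
```

After `dbt deps` and a run that builds Elementary's models, its monitor command can push freshness and test failures to a Slack webhook configured in its profile; the alerting details vary by version, so treat this as the shape of the setup rather than a recipe.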
The Performance Problem
6 hours: time to rebuild all orders from 15 stores
Our initial dbt models worked beautifully for the first month. Then Black Friday happened. Orders exploded, and the orders table grew to millions of rows per store.
Our nightly full-refresh runs started missing their morning deadline. A full rebuild across 15 stores took around six hours.
We needed a smarter approach.
Incremental Models: Process Only What Changed
Full refreshes rebuild tables from scratch every time. Incremental models are smarter: they process only new or changed rows.
Here is the transformation that saved our pipeline:
```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='delete+insert',
    schema=var('target_schema')
) }}

select
    id as order_id,
    customer_id,
    store_id,
    order_total_cents,
    order_status,
    updated_at
from {{ source('retail_source', 'orders') }}

{% if is_incremental() %}
where updated_at >= (select max(updated_at) from {{ this }})
{% endif %}
```
The magic is in the is_incremental() block. On the first run, it processes everything. On subsequent runs, it only grabs rows that changed since the last run.
How It Works
The unique_key identifies which rows to update.
The where clause filters to only recent changes.
The delete+insert strategy removes old versions of changed rows before inserting new ones.
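To make the strategy concrete: on an incremental run, dbt compiles the model into warehouse statements roughly like the ones below. This is a simplified sketch (table and schema names are illustrative, and the exact SQL varies by adapter):

```sql
-- 1. Stage only the rows that changed since the last run
create temporary table orders__tmp as
select id as order_id, customer_id, store_id, order_total_cents, order_status, updated_at
from retail_source.orders
where updated_at >= (select max(updated_at) from analytics.orders);

-- 2. delete+insert: remove the old versions of changed rows...
delete from analytics.orders
where order_id in (select order_id from orders__tmp);

-- 3. ...then insert the fresh versions
insert into analytics.orders
select * from orders__tmp;
```

The key point is that steps 2 and 3 only touch the changed rows, which is why the run time tracks the day's change volume rather than the table's total size.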
Result: Six-hour runs became 15-minute runs. Same accuracy, 96% less processing time.
The History Problem
Three months into production, marketing asked a question we could not answer: "When this customer made their November purchase, what loyalty tier were they in?"
We had their current tier and the November order, but we had lost the historical state. Customer records get updated. Addresses change. Loyalty tiers evolve. Products get repriced.
We needed to preserve history.
Snapshots: Time Travel for Your Data
dbt snapshots implement Type 2 Slowly Changing Dimensions (SCD): keep all versions of a record with timestamps showing when each version was valid.
```sql
{% snapshot customers_snapshot %}

{{ config(
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at',
    target_schema=var('target_schema')
) }}

select
    id as customer_id,
    email,
    loyalty_tier,
    city,
    updated_at
from {{ source('retail_source', 'customers') }}

{% endsnapshot %}
```
Now when we run dbt snapshot, dbt checks for changes and preserves history. Each customer record gets dbt_valid_from and dbt_valid_to timestamps.
The power of snapshots: we can now answer questions like "Show me all orders where the customer was in the Gold tier at the time of purchase" by joining orders to snapshot tables and filtering by validity ranges.
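A point-in-time join of that kind might look like the following sketch (schema names and the open-ended sentinel date are illustrative):

```sql
-- Orders placed while the customer was in the Gold tier,
-- using the snapshot's validity window instead of the current record
select
    o.order_id,
    o.order_total_cents,
    c.loyalty_tier
from analytics.orders o
join snapshots.customers_snapshot c
  on o.customer_id = c.customer_id
 and o.updated_at >= c.dbt_valid_from
 and o.updated_at <  coalesce(c.dbt_valid_to, '9999-12-31')  -- null means "still current"
where c.loyalty_tier = 'Gold'
```

The `coalesce` handles the current version of each customer, whose `dbt_valid_to` is null until the record changes again.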
Putting It All Together
Our production pipeline now combines all three techniques:
- Morning routine: dbt source freshness checks that overnight store feeds arrived. Alerts fire if any store is late.
- Transformation run: dbt run executes incremental models for facts (orders, order_items) and full refreshes for small dimensions.
- History preservation: dbt snapshot captures changes to customers and products before they are overwritten.
- Quality gates: dbt test validates uniqueness, referential integrity, and custom business rules while Elementary dashboards show health metrics and alert patterns.
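Strung together, the daily job is just a sequence of dbt commands. A minimal orchestration sketch, assuming a plain shell runner (scheduler and alerting wiring omitted):

```bash
#!/usr/bin/env bash
set -euo pipefail              # stop the job on the first failure

dbt source freshness           # 1. verify overnight store feeds arrived
dbt snapshot                   # 2. preserve history before records are overwritten
dbt run                        # 3. incremental facts, full-refresh small dimensions
dbt test                       # 4. quality gates before anyone reads the marts
```

Running `dbt snapshot` before `dbt run` matters: if a customer record changes, the snapshot must capture the old version before downstream models consume the new one.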
The result is a pipeline that is fast, reliable, and observable. We catch problems before they reach executives, process millions of rows in minutes, and answer historical questions that were previously impossible.
The next board meeting went differently. Revenue numbers were fresh, accurate, and ready an hour early. When the VP of Sales asked, "What was our customer retention rate for Gold tier members who joined in Q3?" we had the answer in under a minute. Historical snapshots made it possible.
Coming Up Next
As the project grows, we need strategies for reusable code, better project organization, and ways to extend to new business entities without chaos. In Episode 3, we'll explore macros, project management patterns, and the art of scaling dbt.
The dbt Migration Chronicles · Episode 2 of 4
Written for data teams who learned observability the hard way