
How We Migrated Our Data Warehouse Without a Big Bang

In the Kimball book (that data warehouse bible), there's this Pitfall #8: "Tackle a galactic multiyear project rather than pursuing more manageable, although still compelling, iterative development efforts."

This may be familiar to you: a team decides to migrate their data warehouse. They plan for it, assemble the team, kick off the project, work on it, test it, work on it, test it, work on it, test it, and so on and so on, until months have passed, quarters have passed, or years have passed, without the data warehouse being used in even one dashboard.

Until one day, they do the big switch. The big bang. Every table gets replaced. Every dashboard gets updated. Sometimes it works. Often it doesn't. And just like a big bang, the scale of the wrongness can be colossal.

And even when it does work technically, there's this awkward phase where nobody trusts the new numbers yet. People keep questioning them. Everyone has a feeling something isn't right.

We tried something different when we migrated our PostgreSQL data warehouse to BigQuery. We didn't want the big bang. We didn't want months or years without actual results we could feel in our hands. I want to share what we learned.

When "iterative development" should be "iterative deployments"

When people say "iterative," they usually mean things like:

  • Demos every two weeks
  • Routine progress report to stakeholders
  • Weekly meetings
  • Sprint planning, sprint review
  • Get feedback
  • Keep building

Yes, that's nice, and we should do that. But nice isn't enough when people are asking "when can we actually USE it?"

Right. Iterative doesn't mean showing progress just for the sake of being busy. Iterative means iteratively useful. For me, it means shipping to production often, using it early, and letting real users use it. Even when it's incomplete.

In our first iteration we migrated just one dashboard. The one that one business line team checks every morning and afternoon. That's it. Not all dashboards. Not all tables. Not all data sources. Just that one dashboard.

We were worried, but we were not scared. What if the numbers were wrong? What if people got confused? What if they lost trust in our team? Those are valid concerns. But that's why we always have a test phase before deployments, just like in the bible, I mean the toolkit book.
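To make that test phase a bit more concrete, here is a minimal reconciliation sketch: compare the old PostgreSQL table with the new BigQuery table before repointing anything at it. This is an illustration, not our actual harness; the connection string, table names, and columns below are assumptions.

```python
# Minimal reconciliation sketch: compare the old PostgreSQL table with the
# new BigQuery table before a dashboard is repointed at it.
# Table names, columns, and connection details are illustrative assumptions.
import psycopg2
from google.cloud import bigquery

CHECK_SQL = "SELECT COUNT(*) AS row_count, SUM(amount) AS total_amount FROM {table}"

def postgres_metrics(dsn: str, table: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(CHECK_SQL.format(table=table))
        row_count, total_amount = cur.fetchone()
        return row_count, float(total_amount or 0)

def bigquery_metrics(table: str):
    client = bigquery.Client()
    row = next(iter(client.query(CHECK_SQL.format(table=table)).result()))
    return row.row_count, float(row.total_amount or 0)

if __name__ == "__main__":
    old = postgres_metrics("dbname=dwh user=report", "mart.daily_sales")
    new = bigquery_metrics("analytics_mart.daily_sales")
    for name, a, b in [("row_count", old[0], new[0]), ("total_amount", old[1], new[1])]:
        status = "OK" if abs(a - b) < 1e-6 else "MISMATCH"
        print(f"{name}: postgres={a} bigquery={b} -> {status}")
```

Row counts and a summed measure are just the cheapest checks; the same pattern extends to per-day breakdowns or whatever number the dashboard actually shows.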

BUT

We needed to know. Does it actually work? Do people trust it? Are the numbers right? What breaks in real usage?

Turns out, you can never answer these questions in a staging environment. You need real users hitting real data.

Why iterative deployments aren't the default "iterative" mode

Many reasons. Complexity, consistency concerns, technical constraints, costs. I'm not saying it's easy. No one ever said an iterative process is easy. It's an approach with challenges, and very sweet rewards.

But I want to emphasize one particular reason: it's not obvious. The iteration itself is not that easy to define.

In software engineering, iterations are easy to define. "This sprint we'll ship the login feature. Next sprint, the input form." But in a data warehouse migration, how would you define one? What is a "feature"? What gets deployed in a deployment? What is the unit of an iteration?

Is one iteration a single table? That's too small. A table by itself isn't useful. Is it a whole subject area? That might be too big. Takes months, defeats the purpose.

This is where we need to think a little harder to define the chunk.

For us, a unit needs to be something a real person can actually use to make a real decision. A single table might be the unit if it gets used directly, such as a data mart. In some cases one table is not enough, because one chart might need joins across several tables.
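One way to pin that down, as a sketch rather than what we literally wrote, is to describe each iteration as a small manifest: the dashboard a real person opens, the tables it needs, and who owns it. Everything named below is hypothetical.

```python
# Hypothetical "unit of iteration" manifest: one dashboard a real person uses,
# plus everything that has to move with it. All names here are made up.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MigrationUnit:
    dashboard: str                 # the thing someone actually opens every morning
    owner: str                     # the analytics team responsible for it
    tables: List[str] = field(default_factory=list)  # marts and dimensions it joins

    def is_shippable(self) -> bool:
        # a unit is only worth deploying if it ends in something people can use
        return bool(self.dashboard) and bool(self.tables)

unit = MigrationUnit(
    dashboard="Daily Sales Overview",
    owner="commerce-analytics",
    tables=["mart.daily_sales", "dim.stores", "dim.products"],
)
print(unit.is_shippable())  # True
```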

But wait, even migrating a single table might not be that easy. It may have octopus legs of table dependencies!

Untangling the dependencies

I have to be honest: one of the reasons the iterative deployment approach worked for us is that we had already invested in the most crucial parts: data catalog, data ownership, and data lineage.

If we didn't have that, I'd direct my team to work on that first.

The data catalog tells us what we need to migrate: which tables, which dashboards. Data ownership tells us who uses that data and that dashboard, and which analytics team prepares that report. Data lineage tells us which tables a dashboard uses, which tables are used to build a given table, and, if we migrate this table, what breaks downstream.
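In practice this came from our catalog and lineage tooling, but the core idea boils down to walking a dependency graph in both directions. A toy sketch with invented table names:

```python
# Toy lineage graph: edges point from a dashboard/table to the tables it reads.
# In reality this came from our data catalog; names here are invented.
LINEAGE = {
    "dashboard.daily_sales": ["mart.daily_sales"],
    "mart.daily_sales": ["stg.orders", "stg.stores"],
    "dashboard.finance_close": ["mart.revenue"],
    "mart.revenue": ["stg.orders"],
}

def upstream(node: str, graph=LINEAGE) -> set:
    """Everything we must migrate before `node` works (what to migrate)."""
    deps = set()
    for parent in graph.get(node, []):
        deps.add(parent)
        deps |= upstream(parent, graph)
    return deps

def downstream(node: str, graph=LINEAGE) -> set:
    """Everything that reads `node`, directly or not (the blast radius)."""
    hits = set()
    for consumer, sources in graph.items():
        if node in sources:
            hits.add(consumer)
            hits |= downstream(consumer, graph)
    return hits

print(upstream("dashboard.daily_sales"))   # {'mart.daily_sales', 'stg.orders', 'stg.stores'}
print(downstream("stg.orders"))            # every mart and dashboard that would break
```

Upstream answers "what has to move for this dashboard to work"; downstream answers "what breaks if we get this table wrong."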

This made prioritization less mysterious. Dependencies were clear. The unit of iteration, that challenging thing, got easier to define. Instead of guessing "let's migrate the users table first!", we could ask:

  • Who owns this?
  • Can we collaborate with them?
  • How many things depend on this?
  • What's the blast radius if we mess up?

Without this, we'd be gambling. With this, we were making informed decisions. Still risky, but at least we knew what we were risking.

Collaborate (don't screw this up alone)

We didn't do this alone in a dark room. We couldn't. We're fine but we're not that smart.

We worked closely with the data analyst teams.

They know what users actually need. They know which reports are critical. They know which numbers, if wrong, cause panic at 8am Monday morning.

We sat with them. "What do you need first? What would make your life easier?" And the most important question: "What are you afraid of if we migrate this?"

They told us, we plan together, we build together, we test together. They found issues. We fixed them. Repeat.

We also talked to end users in the company.

The business team. Sales. Marketing. Finance. VP and C-level.

We told them that we were doing a migration. We asked about and listened to every concern they might have. We assured them the numbers should look the same, but we were also honest that we needed their help to tell us when something didn't look right.

Sometimes they found discrepancies. Things that passed our tests. We investigated. Sometimes it was our bug. Sometimes it was a bug in the OLD system that nobody noticed. Either way, it was worth it. Their dashboard is now both much faster and correct.

The sweet rewards of doing iterative deployments

People need to trust your data. And trust isn't built by announcing "the migration is complete!" Trust is built slowly, by people checking the numbers and seeing they're right.

Make it make sense

From management's perspective, our approach was less scary:

Traditional big bang:

  • Year 1: building, no release, "trust us"
  • Year 2: building, no release, "we're almost done"
  • Year 3: release, maybe value? maybe chaos?

What we did:

  • Month 1: Three dashboards in production, some people using them.
  • Month 2: Ten more reports, more people using them.
  • Month 6: Done. Fully migrated. People didn't even realize we had done a big migration. They thought the whole thing was just BAU.

Management could see actual progress. Not "we're 60% done" progress. But "200 users tried the new system this week" progress.

Final thoughts

We, as engineers, have the privilege to release an incomplete product. We're not building refrigerators. Our moms would be mad if we delivered a fridge without a door. But software? Data? We can!

Let's take advantage of it. Release the incomplete product. It's fine. Because we will keep iterating until it's complete.

Dimas Gilang
LinkedIn · Twitter (I'm not calling it X) · Instagram