Join Shane Gibson as he chats with Andrew Jones on the pattern of Data Contracts.
Listen
Listen on all good podcast hosts or over at:
https://podcast.agiledata.io/e/agiledata-54-data-contracts-with-andrew-jones/
Read
Read the podcast transcript at:
https://agiledata.io/podcast/agiledata-podcast/data-contracts-with-andrew-jones/#read
Google NotebookLM Briefing
Briefing Document: Data Contracts with Andrew Jones
Introduction
This briefing document summarises the key points discussed in the Agile Data Podcast episode featuring Andrew Jones, a pioneer in the concept of "Data Contracts." Jones, with 20 years in the tech industry, explains what data contracts are, why they're important, and how they can be implemented. It's all about getting your data ducks in a row, eh?
Key Themes and Ideas
What is a Data Contract?
At its core, a data contract is a description of your data – think of it as metadata on steroids. It's like a good ol' spec sheet for your data.
It includes a schema definition, owner, and any other relevant metadata for your situation.
It's used to create interfaces that move away from the traditional capture-and-dump approach of dealing with upstream data changes.
The aim? To stop upstream changes from breaking all your downstream data pipelines. "The problem I was trying to solve initially was we had upstream changes, breaking downstream data pipelines and data applications."
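To make that a bit more concrete, here is a minimal sketch of a contract as a schema plus an owner plus free-form metadata, with a check that a record conforms. The names used here (FieldSpec, DataContract, violations, the "orders" example and its fields) are made up for illustration and do not come from the podcast or from any particular data contract tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FieldSpec:
    name: str
    type: type
    required: bool = True


@dataclass(frozen=True)
class DataContract:
    name: str
    owner: str                      # team accountable for the data (assumed attribute)
    version: str
    fields: tuple[FieldSpec, ...]   # the schema definition
    metadata: dict = field(default_factory=dict)  # anything else relevant to you

    def violations(self, record: dict) -> list[str]:
        """Return a list of problems; an empty list means the record conforms."""
        problems = []
        for spec in self.fields:
            value = record.get(spec.name)
            if value is None:
                if spec.required:
                    problems.append(f"missing required field: {spec.name}")
                continue
            if not isinstance(value, spec.type):
                problems.append(
                    f"{spec.name}: expected {spec.type.__name__}, "
                    f"got {type(value).__name__}"
                )
        return problems


# Hypothetical contract published by a hypothetical checkout team.
orders_contract = DataContract(
    name="orders",
    owner="checkout-team",
    version="1.0.0",
    fields=(
        FieldSpec("order_id", str),
        FieldSpec("amount_cents", int),
        FieldSpec("coupon_code", str, required=False),
    ),
    metadata={"description": "One record per completed order"},
)
```

A producer could run violations() before publishing, so a breaking upstream change shows up at the source rather than in a downstream pipeline.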
Beyond the Tech - It's About People and Agreements
A data contract isn't just about the tech – it's about having an agreement between the data provider and consumer. "It’s only partly about tech. In some ways, it’s more about the people side and the agreement and what you’re promising each other."
It's not about one side dictating to the other: "Here's the schema, make sure you follow it." It's about collaboration and agreement, like a proper yarn.
The contract defines who owns the data, who is accountable for it, and who is responsible for it.
The Data Producer Should Own the Contract
Andrew reckons data producers should be the ones creating and owning the contract, as they are the only ones who can meet the requirements. "They’re going to be people who have the ability to change the data, to change the schemas. They’re going to be people who have all the context around the data they’re creating."
It’s about starting a conversation and getting everyone on board.
It makes it easier to get the required commitment to producing quality data from the source.
How To Start
Start small, focusing on solving a particular business problem rather than trying to change everything at once. You don’t want to go “full noise” from the get-go.
Focus on the experience of those who will be using the contract, particularly software engineers. Make it easy for them. "If I want the data contract to be owned by the software engineering team, I need to make it as easy as I can for them to do that."
Involve the software engineers early in the solution so that they become champions of the solution. "Software engineers love solving problems as well. If they feel like they’re part of producing the solution, they’re more likely to use it."
Data Contracts with Out-of-the-Box Software (SaaS)
Implementing data contracts becomes trickier when dealing with SaaS providers.
Aim to put the contract as close to the source as possible.
The admin who manages that SaaS service should own the data contract if possible. It's about finding who owns the gear and getting them involved.
If it's a smaller SaaS vendor, it might be possible to build Data Quality SLOs into your actual legal agreement.
Why Not Just Use CDC?
While CDC (Change Data Capture) is useful, applying data contracts to CDC data directly doesn't really work out.
Software engineers need the ability to change their schemas to add new features, but having a data team review that every time doesn't help anybody. "Now suddenly, the software engineering teams can’t change their schema with autonomy. They’re kind of stuck waiting for a review by some data team."
A Data Contract acts as an abstraction over the database, similar to an API, for better stability.
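The API comparison can be sketched roughly like this: the software team keeps a small mapping from its internal table shape to the published contract shape, and updates that mapping whenever the internal schema changes. Both row layouts and the function names (publish_v1, publish_v2) are hypothetical; the point is only that consumers see the same record before and after the refactor.

```python
from datetime import datetime, timezone


def publish_v1(row: dict) -> dict:
    """Map the original (hypothetical) customers row to the contract shape."""
    return {
        "customer_id": str(row["id"]),
        "email": row["email_address"],
        "signed_up_at": row["created"].isoformat(),
    }


def publish_v2(row: dict) -> dict:
    """Same contract shape after an internal refactor renamed and nested columns."""
    signed_up = datetime.fromtimestamp(row["created_epoch"], tz=timezone.utc)
    return {
        "customer_id": str(row["customer_pk"]),
        "email": row["contact"]["email"],
        "signed_up_at": signed_up.isoformat(),
    }


# The internal schema before and after the (made-up) refactor.
old_row = {
    "id": 42,
    "email_address": "ana@example.com",
    "created": datetime(2024, 1, 1, tzinfo=timezone.utc),
}
new_row = {
    "customer_pk": 42,
    "contact": {"email": "ana@example.com"},
    "created_epoch": 1704067200,
}

# Downstream consumers receive an identical record either way.
assert publish_v1(old_row) == publish_v2(new_row)
```

With raw CDC, by contrast, the rename from id to customer_pk would land directly in the warehouse and every downstream query would need to change.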
Data Contracts and Flexibility vs Reliability
There's a perception that using CDC offers more flexibility because you capture all the data and you can decide on the design later.
However, this means data transformations are often brittle, and logic implemented in the data warehouse gets out of sync with changes in the source systems.
Data changes aren’t as frequent as people think, so the contract approach isn’t as rigid as some might believe.
"You are giving up a bit of flexibility, but in exchange, you’re giving [...] And maybe it was better to prioritise flexibility when all you were doing was reporting."
Standardization
There is an effort to create open standards for data contract definitions. The idea is to have interoperability between different tools.
It's useful, because data contracts end up needing to be converted into various formats so that you can work with different tooling. Imagine not having to do that, how good would that be?
The Open Data Contract Standard (ODCS) is backed by the Linux Foundation.
It's all a bit early days, so things are still developing, but it looks promising.
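As a rough illustration of the conversion problem, the sketch below takes one contract definition and renders it as both a JSON Schema document and a SQL CREATE TABLE statement. The dictionary layout is a made-up minimal format, not the Open Data Contract Standard, and the type mapping in SQL_TYPES is an assumption that a real platform would need to extend.

```python
# A made-up minimal contract format, used only for this illustration.
contract = {
    "name": "orders",
    "fields": [
        {"name": "order_id", "type": "string", "required": True},
        {"name": "amount_cents", "type": "integer", "required": True},
        {"name": "coupon_code", "type": "string", "required": False},
    ],
}

# Assumed mapping from contract types to SQL column types.
SQL_TYPES = {"string": "TEXT", "integer": "BIGINT"}


def to_json_schema(contract: dict) -> dict:
    """Render the contract as a JSON Schema object definition."""
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "title": contract["name"],
        "type": "object",
        "properties": {f["name"]: {"type": f["type"]} for f in contract["fields"]},
        "required": [f["name"] for f in contract["fields"] if f["required"]],
    }


def to_create_table(contract: dict) -> str:
    """Render the contract as a SQL CREATE TABLE statement."""
    columns = [
        f'{f["name"]} {SQL_TYPES[f["type"]]}' + (" NOT NULL" if f["required"] else "")
        for f in contract["fields"]
    ]
    return f'CREATE TABLE {contract["name"]} (\n  ' + ",\n  ".join(columns) + "\n);"
```

Every extra target (a message schema, a catalogue entry, a test suite) means another converter like these, which is the duplication a shared standard aims to remove.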
Data Teams Supporting Software Teams
Data teams need to be thinking about how they can help software teams implement data contracts.
This means providing tooling and platforms that help software engineers publish data with contracts.
Data platform teams are providing capabilities around orchestration, data contracts, data retention, backup and access control.
Everything is being built on top of the data contract. "And we found for any platform capability we wanted to add, we can add on top of data contracts."
Data Contracts as Policy
Data Contracts can become the basis for data policy. The data generator knows the data so they can classify it, for example, whether it is PII or not.
The data platform enforces that policy, so you get things like automated data masking. The data producer doesn't need to think about the implementation of that policy, just what the data is.
This concept also extends to other things, like retention and backup policies. It’s all about saying “this is the data” and the system takes it from there.
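Here is a small sketch of that split of responsibilities: the producer only declares a classification per field, and a platform function decides what masking means. The classification labels, the mask function, and the choice to treat unclassified fields as PII are all assumptions made for the example, not something described in the podcast.

```python
import hashlib

# Declared by the data producer in the contract (hypothetical labels).
classifications = {
    "customer_id": "internal",
    "email": "pii",
    "country": "public",
}


def mask(value) -> str:
    """Platform-owned masking: a truncated SHA-256 digest of the value."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def apply_policy(record: dict, classifications: dict, viewer_can_see_pii: bool = False) -> dict:
    """Mask PII fields unless the viewer is allowed to see them.

    Fields missing from the classification map are treated as PII,
    which is a cautious default chosen for this sketch.
    """
    result = {}
    for name, value in record.items():
        is_pii = classifications.get(name, "pii") == "pii"
        result[name] = mask(value) if is_pii and not viewer_can_see_pii else value
    return result


record = {"customer_id": "c-42", "email": "ana@example.com", "country": "NZ"}
masked = apply_policy(record, classifications)
```

In masked, the email is replaced by a 12-character digest while customer_id and country pass through untouched. Retention or backup rules could hang off the same classification map in the same way.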
Moving Away From Heavy Data Transformations?
With data contracts, the aim is to produce quality data at the source.
This means less need for complex transformations in the data warehouse. The goal is to reduce brittleness, and provide more consistent data.
Data engineering work might move towards joining existing data products rather than constant transformation of brittle raw data.
Data Contracts and Data Combination
There can still be issues when joining data from different systems.
Data contracts between SaaS systems are harder, but at least you have a baseline of what to expect.
Libraries that enforce consistent key formats can assist with source data.
A central ID service can be used to generate IDs that get injected into the data streams. This is where Andrew sees the future going.
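A shared key-format library might look something like the sketch below. The "cus_" prefix, the eight hex characters, and the function name normalise_customer_key are all invented for illustration; the idea is simply that every source system calls the same function before emitting a key, so joins line up downstream.

```python
import re

# Hypothetical canonical format: "cus_" followed by eight lowercase hex characters.
KEY_PATTERN = re.compile(r"cus_[0-9a-f]{8}")


def normalise_customer_key(raw: str) -> str:
    """Return the canonical form of a customer key, or raise ValueError."""
    key = raw.strip().lower()
    if not key.startswith("cus_"):
        key = "cus_" + key
    if not KEY_PATTERN.fullmatch(key):
        raise ValueError(f"not a valid customer key: {raw!r}")
    return key


# Two systems that format the same customer differently (made-up data).
billing = {normalise_customer_key(" CUS_00AB12CD "): {"plan": "pro"}}
support = {normalise_customer_key("00ab12cd"): {"open_tickets": 2}}

# Because both sides share one key format, a plain dictionary join works.
joined = {k: {**billing[k], **support[k]} for k in billing.keys() & support.keys()}
```

A central ID service goes a step further by issuing the IDs itself, so there is nothing to normalise after the fact.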
Data Contracts for Every Handoff
Data Contracts are useful at every handoff, even internal transformations.
Where you have some kind of ownership change, a contract makes sense. But even without ownership changes, it could make sense if the tools are easy and provide value.
The future could be more about describing what needs to happen rather than the how, moving away from SQL and dbt.
Dealing with Broken Contracts
When data doesn't meet a contract, what should happen? Dead letter queues are a common approach but have their own challenges.
The correct approach is really dependent on what is expected from the data user. Do they prefer partial data, no data, or a complete halt until the data is fixed?
This decision needs to come from an agreement between provider and consumer. "If, as a user, you’d rather not see any data, if it’s not all there, you care most about completeness, then maybe you want that."
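The three options (partial data, no data, or a complete halt) can be sketched as a setting that provider and consumer agree on and record alongside the contract. The enum values, the process_batch function, and the in-memory list standing in for a dead letter queue are all hypothetical.

```python
from enum import Enum


class OnViolation(Enum):
    DELIVER_VALID = "deliver_valid"      # partial data; bad records are set aside
    DELIVER_NOTHING = "deliver_nothing"  # completeness matters most; hold the batch
    HALT = "halt"                        # stop until someone fixes the data


class ContractViolation(Exception):
    """Raised when the agreed behaviour is to halt on bad data."""


def process_batch(records, is_valid, on_violation):
    """Return (delivered, dead_letter) according to the agreed behaviour."""
    good = [r for r in records if is_valid(r)]
    bad = [r for r in records if not is_valid(r)]
    if not bad:
        return good, []
    if on_violation is OnViolation.HALT:
        raise ContractViolation(f"{len(bad)} of {len(records)} records broke the contract")
    if on_violation is OnViolation.DELIVER_NOTHING:
        # Hold back the whole batch so it can be replayed once fixed.
        return [], list(records)
    return good, bad


def has_order_id(record: dict) -> bool:
    """Stand-in for a real contract check."""
    return "order_id" in record


batch = [{"order_id": "o-1"}, {"amount_cents": 500}]
delivered, dead_letter = process_batch(batch, has_order_id, OnViolation.DELIVER_VALID)
```

With DELIVER_VALID the consumer gets the one good record and the bad one is set aside; a consumer who cares most about completeness would agree on DELIVER_NOTHING instead.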
Data Contracts and Dashboards
Even dashboards should have data contracts associated with them, especially if they are critical for the business.
If you don't have confidence in the data in a dashboard, why would anyone use it?
Data Contracts as a Maturity Curve
The use of data contracts is part of a general maturity curve in data engineering, and the move towards more discipline.
As the value of the data we produce increases, so does the need for more disciplined approaches. This echoes software engineering practices, such as code reviews and automated deployments.
Data Contracts - A Journey
Don't expect to implement data contracts overnight; it's a journey that will take time. You’ve gotta put in the mahi.
Focus on the culture change as much as the technology. "It’s a journey that involves a bit of tech and a lot of communication, a lot of people side. That’s probably the harder part, but it’s the necessary part."
Over time, people will start to understand the value of data contracts and how they lead to more active and consistent data.
Key Quote:
"We are moving from more passive data generation to more active data generation. We’re actively providing data to you because you want, you need to use it downstream for these reasons."
Next Steps / Call to Action
Check out Andrew Jones' website: andrew-jones.com
Download his white paper: dc101.io
Follow Andrew on LinkedIn.
Consider how you might use data contracts in your own work, focusing on solving key problems and collaborating with data producers.
Conclusion
Data contracts, while a relatively new concept, offer a more structured, collaborative, and reliable approach to managing data. It's not just about the technology, it's about establishing agreements and relationships between those who generate the data and those who use it. It’s about getting everyone rowing in the same waka, aye? This briefing document provides a good overview of the key points raised in the podcast and how they might apply to your organisation.