Join Shane Gibson as he chats with Hans Hultgren on the core Data Vault Modeling patterns.
Listen
Listen on all good podcast hosts or over at:
https://agiledata.podbean.com/e/agiledata-41-the-patterns-of-data-vault-with-hans-hultgren/
Read
Read or download the podcast transcript at:
https://agiledata.io/podcast/agiledata-podcast/the-patterns-of-data-vault-with-hans-hultgren/#read
Google NotebookLM Briefing
Data Vault Modelling: A Deep Dive into Agile Data Architecture
This briefing document summarises key concepts, patterns, and benefits of Data Vault modelling, drawing insights directly from the "The patterns of Data Vault with Hans Hultgren - AgileData #41" podcast. It explores Data Vault's core components, its relationship with agile methodologies, and its advantages over traditional data modelling techniques.
Introduction to Data Vault
Data Vault is a data modelling technique that emerged in the late 1990s, gaining traction for its pattern-based approach to data warehousing and business intelligence. Hans Hultgren, a pioneer in the field and co-founder of Genesee Academy, describes his discovery of Data Vault as akin to "I love this razor so much I bought the company." This sentiment highlights the profound impact Data Vault had on his approach to data management.
At its core, Data Vault aims to "democratize data modelling," moving away from the traditional model where a single specialist creates a monolithic entity model that is often difficult to implement. Instead, Data Vault provides a set of understandable and applicable patterns that can be used by a wider range of practitioners.
Core Data Vault Patterns: Hubs, Satellites, and Links
The foundational elements of Data Vault are Hubs, Satellites, and Links. Because there are only three physical modelling objects, Data Vault construction can be automated: building and loading an entire Vault reduces to "six bits of code" (create Hub, populate Hub, create Satellite, populate Satellite, create Link, populate Link). This automation leads to "incredibly hardened" and "bulletproof" code, essential for agile data operations and CI/CD pipelines.
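The "six bits of code" idea can be sketched as three small template functions, each returning a create statement and a populate statement. This is a minimal illustration, not the tooling discussed in the podcast: the metadata dictionaries, the table naming (`hub_`, `sat_`, `link_`), the `load_ts` and `record_source` columns, and the use of SQLite are all assumptions made for the example.

```python
import sqlite3

# Hypothetical metadata: one concept, its descriptive attributes, one relationship.
HUB = {"concept": "customer", "key": "customer_id"}
SAT = {"concept": "customer", "key": "customer_id", "attributes": ["name", "city"]}
LINK = {"name": "customer_order", "keys": ["customer_id", "order_id"]}

def hub_sql(concept, key):
    """Return (create, populate) statements for a Hub."""
    create = (f"CREATE TABLE hub_{concept} ("
              f"{key} TEXT PRIMARY KEY, load_ts TEXT, record_source TEXT)")
    populate = f"INSERT OR IGNORE INTO hub_{concept} VALUES (?, ?, ?)"
    return create, populate

def satellite_sql(concept, key, attributes):
    """Return (create, populate) statements for a Satellite keyed by the Hub key."""
    cols = ", ".join(f"{a} TEXT" for a in attributes)
    marks = ", ".join("?" for _ in attributes)
    create = (f"CREATE TABLE sat_{concept} ({key} TEXT, load_ts TEXT, {cols}, "
              f"PRIMARY KEY ({key}, load_ts))")
    populate = f"INSERT INTO sat_{concept} VALUES (?, ?, {marks})"
    return create, populate

def link_sql(name, keys):
    """Return (create, populate) statements for a Link over several Hub keys."""
    cols = ", ".join(f"{k} TEXT" for k in keys)
    key_list = ", ".join(keys)
    marks = ", ".join("?" for _ in keys)
    create = (f"CREATE TABLE link_{name} ({cols}, load_ts TEXT, "
              f"PRIMARY KEY ({key_list}))")
    populate = f"INSERT OR IGNORE INTO link_{name} VALUES ({marks}, ?)"
    return create, populate

# The six statements: create/populate for Hub, Satellite, and Link.
statements = [*hub_sql(**HUB), *satellite_sql(**SAT), *link_sql(**LINK)]
create_hub, load_hub, create_sat, load_sat, create_link, load_link = statements

# Run them against an in-memory database to show they execute.
conn = sqlite3.connect(":memory:")
for ddl in (create_hub, create_sat, create_link):
    conn.execute(ddl)
conn.execute(load_hub, ("C001", "2024-01-01", "crm"))
conn.execute(load_sat, ("C001", "2024-01-01", "Ann Smith", "Oslo"))
conn.execute(load_link, ("C001", "O-1", "2024-01-01"))
```

Because every Hub, Satellite, and Link comes from the same three templates, adding a concept means adding metadata rather than writing new code, which is one way to read the "hardened" code claim.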
Hubs
Definition: A Hub represents a unique "Core Business Concept" or a uniquely identifiable key for an entity (e.g., Customer, Product, Order).
Purpose: To establish an "Enterprise wide key" for a concept, recognising that data may originate from multiple departments or systems. This is the "first challenge" in enterprise warehousing.
Key Characteristics:
Immutability: Once a key is identified and stored in a Hub, it is never deleted. "The Hub just lives on its own as only basically one attribute." This ensures an "immutable record," a core benefit of Data Vault.
Simplicity: A Hub "just holds a key," with a technical wrapper for identification and source tracking.
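A Hub load can be sketched as an insert-only operation that ignores keys it has already seen. The example below is a minimal sketch under assumptions: the `hub_customer` table, the `load_ts` and `record_source` wrapper columns, the source names `crm` and `billing`, and the use of SQLite's `INSERT OR IGNORE` are illustrative choices, not details from the podcast.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_id   TEXT PRIMARY KEY,  -- the enterprise-wide business key
        load_ts       TEXT NOT NULL,     -- when the key was first seen
        record_source TEXT NOT NULL      -- which system delivered it first
    )
""")

def load_hub(keys, source):
    """Insert keys not yet in the Hub; existing keys are left untouched."""
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?)",
        [(key, now, source) for key in keys],
    )

# Two hypothetical source systems deliver overlapping customer keys.
load_hub(["C001", "C002"], "crm")
load_hub(["C002", "C003"], "billing")

rows = conn.execute(
    "SELECT customer_id, record_source FROM hub_customer ORDER BY customer_id"
).fetchall()
```

The key `C002` arrives from both systems but is stored once, with the source that delivered it first. Nothing in the load path updates or deletes a Hub row.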
Satellites
Definition: Satellites hold descriptive attributes or "context" about a Hub. They are always keyed by the Hub's key.
Purpose: To store the "detail about the customer" or other business concepts, separating attributes from the core identifier.
Key Characteristics:
Temporal (SCD2 by Default): Satellites are designed to track history automatically. "Every time we see a customer and let's say they change their name... we insert a new record." This "rack and stacking" of temporal changes is a "core pattern of Vault."
Logical Grouping: Attributes within Satellites are logically grouped (e.g., based on data rate of change, function, or logical meaning). This allows for incremental building: "if a new function of the business starts to deliver additional context about a customer, it can put that additional context in a new satellite without any re-engineering."
Flexibility: A Hub can have multiple Satellites. In practice, "three to seven satellites on the foundational concepts that we model" is typical, and the distribution is a bell curve that rarely exceeds "a couple dozen."
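The insert-only history tracking described for Satellites can be sketched as follows. This is an illustration under assumptions: the `sat_customer_details` table, the `name` and `city` attributes, and the hash-based change check (`hash_diff`) are choices made for the example. A hash comparison is one common way to detect change; the podcast does not prescribe it.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_customer_details (
        customer_id TEXT NOT NULL,   -- the Hub key
        load_ts     TEXT NOT NULL,   -- when this version was seen
        hash_diff   TEXT NOT NULL,   -- fingerprint of the attribute values
        name        TEXT,
        city        TEXT,
        PRIMARY KEY (customer_id, load_ts)
    )
""")

def hash_diff(*values):
    """Fingerprint a set of attribute values so changes are cheap to detect."""
    joined = "|".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def load_satellite(customer_id, load_ts, name, city):
    """Insert a new version only if the attributes differ from the latest one."""
    new_hash = hash_diff(name, city)
    latest = conn.execute(
        "SELECT hash_diff FROM sat_customer_details "
        "WHERE customer_id = ? ORDER BY load_ts DESC LIMIT 1",
        (customer_id,),
    ).fetchone()
    if latest is not None and latest[0] == new_hash:
        return False  # nothing changed, so no new record
    conn.execute(
        "INSERT INTO sat_customer_details VALUES (?, ?, ?, ?, ?)",
        (customer_id, load_ts, new_hash, name, city),
    )
    return True

first = load_satellite("C001", "2024-01-01", "Ann Smith", "Oslo")
unchanged = load_satellite("C001", "2024-02-01", "Ann Smith", "Oslo")
renamed = load_satellite("C001", "2024-03-01", "Ann Jones", "Oslo")

history = conn.execute(
    "SELECT load_ts, name FROM sat_customer_details ORDER BY load_ts"
).fetchall()
```

Earlier versions are never updated; the name change simply stacks a new row on top of the old one. A new group of attributes from another business function would go into a separate Satellite table with the same key, leaving this one as it is.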
Links
Definition: Links represent "relationships between Concepts," forming "a combination of keys that form a relationship." They are akin to foreign key constraints in traditional modelling, but externalised from the Hubs.
Purpose: To model "natural business relationships" and "core business processes" (e.g., Customer orders Product).
Key Characteristics:
Separation of Relationships: Unlike 3NF, where foreign keys are embedded in entities, in Data Vault you "move those relationships out of the concept." This is crucial for "Unified Decomposition."
Natural Business Relationship Driven: The decision on "how to combine what keys is going to be based on what we refer to as the natural business relationship." For instance, a "sale" requiring a customer, employee, and store would be combined in a singular Link.
Event-Based Modelling (Peter the Fly): Hans Hultgren uses the "Peter the fly" analogy to illustrate how to model Links: "you're in the organization you're standing there and you're Peter the fly you're stuck on the wall and you're watching these things happen what does he see." This direct observation of business processes drives Link creation.
Many-to-Many Relationships: Links inherently support many-to-many relationships. If data indicates an unexpected relationship (e.g., "a customer ID and an order ID turn up with a nil product"), the Link "just absorbs that," serving as an observability point to highlight potential business process or data capture issues.
Additive and Incremental: New business processes or changing relationships result in new Links, rather than re-engineering existing ones. "The relationships that are unique to shipment are not the same relationships that are unique to sale. You don't have to change modify or impact them in any way you just bolt on the new one."
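The Link characteristics above can be sketched with two small tables. The example is illustrative and rests on assumptions: the table names, the `NIL_KEY` placeholder used for a missing product, and the sample events are hypothetical, not taken from the podcast.

```python
import sqlite3

NIL_KEY = "-1"  # hypothetical placeholder for a missing key

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE link_customer_order_product (
        customer_id TEXT NOT NULL,
        order_id    TEXT NOT NULL,
        product_id  TEXT NOT NULL,
        load_ts     TEXT NOT NULL,
        PRIMARY KEY (customer_id, order_id, product_id)
    )
""")

def load_sale_link(events, load_ts):
    """Insert each observed key combination once; missing products become NIL_KEY."""
    rows = [
        (customer, order, product if product is not None else NIL_KEY, load_ts)
        for customer, order, product in events
    ]
    conn.executemany(
        "INSERT OR IGNORE INTO link_customer_order_product VALUES (?, ?, ?, ?)",
        rows,
    )

# One order with two products (many-to-many) and one order with no product.
load_sale_link(
    [("C001", "O-1", "P-10"), ("C001", "O-1", "P-11"), ("C002", "O-2", None)],
    "2024-01-01",
)

# The Link absorbed the odd row; querying for it gives an observability point.
suspect = conn.execute(
    "SELECT customer_id, order_id FROM link_customer_order_product "
    "WHERE product_id = ?",
    (NIL_KEY,),
).fetchall()

# A new business process (shipment) is bolted on as a new Link.
conn.execute("""
    CREATE TABLE link_order_shipment (
        order_id    TEXT NOT NULL,
        shipment_id TEXT NOT NULL,
        load_ts     TEXT NOT NULL,
        PRIMARY KEY (order_id, shipment_id)
    )
""")
conn.execute("INSERT INTO link_order_shipment VALUES ('O-1', 'S-1', '2024-01-02')")

sale_rows = conn.execute(
    "SELECT COUNT(*) FROM link_customer_order_product"
).fetchone()[0]
```

Adding the shipment Link required no change to the sale Link or its rows, which is the additive behaviour the quote about "bolt on the new one" describes.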
Advanced Data Vault Patterns
While Hubs, Satellites, and Links form the core, Data Vault also incorporates specialised patterns for specific challenges:
Same-As Links (SALs): Used for de-duplication ("duping"), that is, identifying potential matches between records from different sources that refer to the same real-world entity. For example, linking two customer records believed to be the same person. They allow "as many as you want" matching algorithms to determine the strength of a match without impacting the core Hub.
Hierarchical Links (HALs): Model parent-child relationships, providing flexibility to represent multiple, varying hierarchies (e.g., product hierarchies based on marketing, sales, or kit assembly). They overcome the challenges of "jagged and sparse hierarchies" often seen in dimensional modelling.
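A Same-As Link with several matching algorithms might look like the sketch below. Everything specific here is an assumption for illustration: the `sal_customer` table, the two sample records, and the two algorithms (an exact email comparison and a name similarity ratio from Python's `difflib`). Storing the score directly on the Link is a simplification to keep the example short.

```python
import sqlite3
from difflib import SequenceMatcher

# Hypothetical customer records from two source systems.
crm = {"CRM-1": {"name": "Jon Smith", "email": "jon@example.com"}}
billing = {"BIL-9": {"name": "John Smith", "email": "jon@example.com"}}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sal_customer (
        customer_id_a TEXT NOT NULL,   -- Hub key from one source
        customer_id_b TEXT NOT NULL,   -- Hub key believed to be the same entity
        algorithm     TEXT NOT NULL,   -- which matching rule produced the row
        match_score   REAL NOT NULL,   -- strength of the match, 0.0 to 1.0
        PRIMARY KEY (customer_id_a, customer_id_b, algorithm)
    )
""")

def email_match(a, b):
    """Score 1.0 when the email addresses are identical, otherwise 0.0."""
    return 1.0 if a["email"].lower() == b["email"].lower() else 0.0

def name_similarity(a, b):
    """Score the similarity of the two names as a ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

# Add or remove algorithms here; the Hub is never touched.
ALGORITHMS = {"email_exact": email_match, "name_similarity": name_similarity}

for id_a, rec_a in crm.items():
    for id_b, rec_b in billing.items():
        for algo_name, algo in ALGORITHMS.items():
            conn.execute(
                "INSERT INTO sal_customer VALUES (?, ?, ?, ?)",
                (id_a, id_b, algo_name, algo(rec_a, rec_b)),
            )

matches = conn.execute(
    "SELECT algorithm, match_score FROM sal_customer ORDER BY algorithm"
).fetchall()
```

Each algorithm contributes its own row for the same pair of keys, so consumers can choose which rule, or which score threshold, they trust. A Hierarchical Link could follow the same shape, with parent and child keys plus a column naming the hierarchy.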
Data Vault and Agile Methodologies
Data Vault's design principles align strongly with agile data practices:
Adaptability to Change: The modular nature of Hubs, Satellites, and Links means that "if something needs to change we create a new link or if the data changes it absorbs that change for us." This "very small change on the rest of the model" enables continuous adaptation.
Incremental Delivery: Data Vault allows teams to "deliver incrementally" and "deliver small bits of value fast." As an example, one can "build customer orders product deliver that value to stakeholders early and then say right now we're going to go and deal with shipment."
Automation (DataOps): The standardisation of Data Vault patterns facilitates automation. "Our code becomes hardened over time so it's the data ops write once use many approach."
Feedback Loop with Business: Data Vault's mirroring of business processes acts as a "catalyst" for communication. If a model reveals a "gap wherein the data you have can't fill the model," it signals either a model adjustment or a deficiency in the source system's data capture, prompting valuable discussions with the business.
Data Vault's Relationship with Other Modelling Techniques
Data Vault serves as an "Ensemble modelling" approach, sharing commonalities with other methods like Anchor Modelling and 6NF. There is significant overlap, with "85 [to] 90% the same" at least in foundational principles, often referred to as "Unified decomposition."
While Data Vault is a distinct modelling paradigm, it acknowledges and often integrates with other popular techniques for consumption:
Dimensional Modelling (Star Schemas): A common pattern is to "turn it back into a dimensional model for consumption by tools." This can be done physically or virtually as views. The podcast highlights that "analysts everybody knows how to use a star schema," making it a familiar "lens" for end-users.
3rd Normal Form (3NF): Data Vault represents a "counterintuitive" shift from 3NF's encapsulated concept model. The concept of "circular relationships bad" in 3NF is challenged in Data Vault, which embraces multiple Links between the same concepts as long as they represent distinct business processes.
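The virtual option for the dimensional "lens" can be sketched as views over Hub, Satellite, and Link tables. The table names, view names, and sample rows below are assumptions for illustration, and SQLite stands in for whatever analytical database is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_id TEXT PRIMARY KEY);
    CREATE TABLE sat_customer (
        customer_id TEXT, load_ts TEXT, name TEXT,
        PRIMARY KEY (customer_id, load_ts)
    );
    CREATE TABLE link_customer_order (
        customer_id TEXT, order_id TEXT,
        PRIMARY KEY (customer_id, order_id)
    );

    INSERT INTO hub_customer VALUES ('C001'), ('C002');
    INSERT INTO sat_customer VALUES
        ('C001', '2024-01-01', 'Ann Smith'),
        ('C001', '2024-03-01', 'Ann Jones'),
        ('C002', '2024-02-01', 'Bob Lee');
    INSERT INTO link_customer_order VALUES
        ('C001', 'O-1'), ('C001', 'O-2'), ('C002', 'O-3');

    -- Dimension: the Hub key plus the most recent Satellite row.
    CREATE VIEW dim_customer AS
    SELECT h.customer_id, s.name
    FROM hub_customer AS h
    JOIN sat_customer AS s ON s.customer_id = h.customer_id
    WHERE s.load_ts = (
        SELECT MAX(load_ts) FROM sat_customer
        WHERE customer_id = h.customer_id
    );

    -- Fact: one row per relationship recorded in the Link.
    CREATE VIEW fact_order AS
    SELECT customer_id, order_id FROM link_customer_order;
""")

# An analyst queries the star schema without seeing the Vault tables.
report = conn.execute("""
    SELECT d.name, COUNT(*) AS orders
    FROM fact_order AS f
    JOIN dim_customer AS d USING (customer_id)
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
```

The views carry no data of their own, so the same Vault tables could also feed a physical star schema if query cost made that worthwhile.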
Addressing Common Concerns and Benefits
Complexity (Table Count): A common criticism is the "plethora of tables" generated by Data Vault. However, with modern cloud-based analytical databases, "storage is relatively cheap," and they "can hold thousands of tables." The perceived complexity is mitigated by the ability to automate view creation for consumption.
Cost of Compute: More joins might be needed because the data is spread across multiple tables. However, restructuring the model so that commonly used fields are grouped in one set, or creating specific views for frequently accessed data, can bring the compute cost down.
Master Data Management: Data Vault "deals with Master data" by providing a framework to centralise and manage common business concepts across an enterprise, helping to achieve a "single view of customer" even when data originates from disparate systems.
Total Cost of Ownership (TCO): Surprisingly, Data Vault often leads to a lower initial and total cost of ownership. The ability to "start with one small component that once you finish it does not have to be re-engineered again" reduces future effort and increases efficiency.
Global Adoption and Future Outlook
Hans Hultgren observes a higher adoption rate of Data Vault in Europe (particularly the Nordics and Netherlands), Australia, and New Zealand compared to the USA. He speculates this might be due to a cultural difference in accountability, with enterprise architects in these regions having "a little more of the authority and the weight and the responsibility" for warehouse programs.
Despite acknowledging that "Vault has problems" like any modelling pattern, Hultgren remains a strong advocate, stating, "I think it's awesome and I think it's easy I think it's really the best way to go for a lot of things." He stresses the importance of continuous learning and adaptability in the data profession, echoing the sentiment: "as soon as the next best thing comes out I will jump on it in heartbeat."
The overarching message is the crucial role of "modeling" in data and analytics, regardless of the specific flavour. "We need to get into modeling again and there's a lot of things that are effective and can work and what we're focused on here in these discussions is what might work better for certain purposes and that's it." Data Vault offers a robust, agile, and adaptable solution for complex enterprise data challenges.
«oo»
Stakeholder - “That's not what I wanted!”
Data Team - “But that's what you asked for!”
Struggling to gather data requirements and constantly hearing the conversation above?
Want to learn how to capture data and information requirements in a repeatable way, using the Information Product Canvas, so stakeholders love them and data teams can build from them?
Have I got the book for you!
Start your journey to a new Agile Data Way of Working.