The Data Vault methodology is a data modeling and architectural approach for data warehousing and business intelligence. It provides a systematic, scalable framework for integrating and managing large volumes of data from diverse sources while ensuring data integrity, traceability, and flexibility. The methodology was developed by Dan Linstedt, first published in the early 2000s, and has since gained wide adoption in the data management community.
Here is a detailed description of the key components and principles of the Data Vault methodology:
- Hub: A Hub is a central repository that stores a unique list of business keys for a particular business entity, such as a customer or a product. It represents the core entities of the business and acts as a single point of reference. Hubs are designed to be immutable: once a record is created in the Hub, it never changes. Each record in the Hub is assigned a surrogate key, a system-generated identifier (in Data Vault 2.0, typically a hash of the business key) used to reference the entity across the data model.
- Link: A Link connects two or more Hubs and represents the relationships between the entities. It contains the surrogate keys of the connected Hubs as foreign keys and may also include descriptive attributes related to the relationship itself. Links are used to capture many-to-many relationships and provide a way to navigate the data model. Like Hubs, Links are also immutable and are never updated after their creation.
- Satellite: A Satellite contains the descriptive attributes of a Hub or a Link. It provides the context and history of the data stored in the Hubs and Links. Satellite records are identified by the surrogate key of the corresponding Hub or Link together with a load date, and include metadata such as source system identifiers and timestamps. Unlike Hubs and Links, Satellites capture change over time: rather than updating records in place, a new Satellite row is inserted whenever the descriptive attributes change, preserving the full history of those changes.
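The three core structures above can be sketched as relational tables. The following is a minimal illustration using Python's built-in sqlite3 module; the table and column names (hub_customer, sat_customer_details, etc.) are hypothetical, not part of any standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per unique business key, insert-only
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,       -- surrogate/hash key
    customer_id   TEXT NOT NULL UNIQUE,   -- business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

CREATE TABLE hub_product (
    product_hk    TEXT PRIMARY KEY,
    product_id    TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Link: connects Hubs, capturing a many-to-many relationship
CREATE TABLE link_purchase (
    purchase_hk   TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    product_hk    TEXT NOT NULL REFERENCES hub_product(product_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Satellite: descriptive attributes of the customer Hub, historized
-- by (parent key, load_date) so each change becomes a new row
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    email         TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")
```

Note how the Satellite's composite primary key (parent key plus load date) is what allows multiple historical versions of the same entity's attributes to coexist.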
- Raw Vault: The Raw Data Vault (often simply the Raw Vault) is the central component of the Data Vault methodology. It consists of interconnected Hubs, Links, and Satellites that capture the raw, unaltered data from the source systems. The Raw Vault acts as a persistent, auditable, and scalable store of the data, enabling traceability and data lineage. It serves as the foundation for downstream structures such as the Business Vault, which layers soft business rules on top of it, and the data marts that feed reporting.
- Loading Patterns: The Data Vault methodology defines specific, repeatable patterns for populating the Hubs, Links, and Satellites. The three main loading patterns are:
- Initial Load: In the initial load, all available historical data from the source systems is loaded into the Hubs, Links, and Satellites, capturing as complete a history as the sources can provide.
- Delta Load: The delta load captures incremental changes in the source systems. New business keys are added to the Hubs, new relationships are recorded in the Links, and new Satellite rows are inserted for any descriptive attributes that have changed.
- Historical Load: The historical load is used to backfill or correct historical data when errors or late-arriving updates are detected in the source systems.
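The delta-load pattern above can be sketched in a few lines of Python. The function below is a simplified, hypothetical example (table names, hash-key scheme, and the crm source label are illustrative): it inserts a business key into the Hub only if it is new, and appends a Satellite row only when the descriptive attributes have actually changed, so both tables remain insert-only:

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

# Minimal in-memory setup with hypothetical table names
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY, customer_id TEXT UNIQUE,
    load_date TEXT, record_source TEXT);
CREATE TABLE sat_customer_details (
    customer_hk TEXT, load_date TEXT, name TEXT, email TEXT,
    record_source TEXT, PRIMARY KEY (customer_hk, load_date));
""")

def hash_key(*parts):
    # Deterministic key derived from the business key(s),
    # in the spirit of Data Vault 2.0 hash keys
    return hashlib.md5("|".join(parts).encode()).hexdigest()

def delta_load_customer(conn, customer_id, name, email, source="crm"):
    hk = hash_key(customer_id)
    now = datetime.now(timezone.utc).isoformat()
    # Hub is insert-only: add the business key only if it is new
    conn.execute("INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
                 (hk, customer_id, now, source))
    # Satellite: append a new row only when the attributes changed
    latest = conn.execute(
        "SELECT name, email FROM sat_customer_details "
        "WHERE customer_hk = ? ORDER BY load_date DESC LIMIT 1", (hk,)
    ).fetchone()
    if latest != (name, email):
        conn.execute("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?, ?)",
                     (hk, now, name, email, source))
    return hk
```

Loading the same record twice leaves both tables unchanged, while a changed email adds exactly one new Satellite row and leaves the Hub untouched, which is the behavior the delta-load pattern is meant to guarantee.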
- Agile and Iterative Development: The Data Vault methodology promotes an agile and iterative development approach. It allows for incremental delivery of data models and encourages quick iterations, making it easier to adapt to changing business requirements. The methodology supports the concept of "just-in-time" modeling, where data structures are designed and implemented as needed, reducing upfront design and development efforts.
- Scalability and Flexibility: The Data Vault methodology is designed to handle large volumes of data and accommodate changes in the data model over time. The modular nature of the methodology allows for easy integration of new data sources and extension of the existing model without requiring significant rework. The scalability and flexibility make it well-suited for data warehousing and business intelligence solutions.
Overall, the Data Vault methodology provides a robust, scalable approach to data modeling and data management in a data warehousing environment. Its emphasis on traceability, flexibility, and scalability makes it well suited to organizations dealing with complex and rapidly changing data landscapes.