Join Shane Gibson as he chats with Ahmed Elsamadisi about the Activity Schema data modelling pattern.
Listen
Listen on all good podcast hosts or over at:
Read
Read or download the podcast transcript at:
https://agiledata.io/podcast/agiledata-podcast/the-patterns-of-activity-schema-with-ahmed-elsamadisi/#read
Google NotebookLM Briefing
The Activity Schema: Revolutionising Data Modelling
Source: Excerpts from "The Activity Schema data modeling pattern with Ahmed Elsamadisi - AgileData #42" podcast.
Date: 2024-02-23
Purpose: To provide a comprehensive overview of the Activity Schema data modelling pattern, its origins, core principles, advantages, challenges, and its place within the broader data landscape, as discussed by Ahmed Elsamadisi and Shane Gibson.
Executive Summary
The podcast episode with Ahmed Elsamadisi introduces the Activity Schema as a novel data modelling pattern designed to address the inherent scalability and usability issues commonly faced by modern data teams. Born from experiences at companies like WeWork, where traditional data stack implementations and modelling approaches repeatedly failed to deliver, Activity Schema proposes a shift from siloed, question-specific tables to a unified, event-centric model. Its core tenets include "rack and stacking" all relevant activities (events) into a single, comprehensive table per core entity (e.g., customer, product), and leveraging a set of "temporal joins" to answer complex, time-based behavioural questions. This approach aims to make data more accessible, reduce modelling complexity, and enable faster iteration on analytical queries, particularly valuable in fast-growing and dynamic business environments.
The Problem with Traditional Data Modelling and Stacks
Ahmed Elsamadisi’s journey, from self-driving cars to missile defence and then to WeWork, highlighted a pervasive issue: data modelling is the core reason data scales terribly. At WeWork, they cycled through "maybe four or five data stack implementations," switching databases (Postgres, Redshift, Snowflake), schedulers (Unicron, Luigi, Airflow), and BI tools (Chartio, Tableau, Looker, Anodot, Dictionary, Lineage). Despite these significant and expensive changes, the fundamental problem persisted: "numbers are not matching," "people are still yelling at us," and questions remained hard to answer.
Tooling vs. Way of Working: Shane Gibson notes that "as data people and as technologists we love to refactor our stack… but we hardly ever refactor our way of working." The focus often remains on faster databases or new tools, while the underlying approach to data organisation remains unchanged.
Scaling with Bodies, Not Systems: A story from a large "FAANG" company (e.g., Airbnb, Amazon) illustrates this: when asked how they deal with the data problem, the response was, "I just go to Slack, I log a call, 15 minutes later the data team have got me the data." This signifies scaling "with bodies, not with ways of working and not with systems," which is unsustainable for most organisations.
The Thousand Tables Problem: The common outcome of traditional modelling is an explosion of tables: "Why do you have a thousand tables?" This complexity leads to difficulties in finding data, understanding column meanings, and maintaining data lineage and dictionaries.
Star Schema Limitations in Modern Contexts: The Star Schema, while brilliant for its original context, was designed for a world with "one data system" (e.g., one Postgres database) and expensive compute. It excelled at aggregating data within a single system (e.g., "how many comments are on my ticket?"). However, in today's multi-system world, where businesses ask complex behavioural questions across disparate systems (e.g., "when people submit a ticket are they more likely to churn?"), traditional Star Schema joins become "very complicated," "very nuanced," and "never predefined," leading to the creation of numerous, unique, and hard-to-maintain tables.
The "Model" Misnomer (DBT): The term "model" in the DBT world, often referring to "a blob of code," has caused "so much damage in the marketplace" by fostering an environment of "5,000 Blobs of code," making rigour difficult to maintain, especially with self-service democratisation.
Analyst Preference for "One Big Table": Shane Gibson argues that analysts generally prefer "one big table" of data, challenging the notion that they inherently "love star schemas and understand it."
Iterative Questioning & SQL Complexity: The process of answering business questions is iterative, moving from "how many" to "where," "how long," "exceptions," and ultimately "why." Current SQL approaches treat "every step like a production system," leading to long, complex, and error-prone queries (due to potential row duplication or dropping) that can take "two weeks per question" and result in "50 tables" to maintain.
Ambiguity of Terms: Simple terms like "sales" can have vastly different meanings to different departments (e.g., when contract signed, when reservation started, when first payment made). Traditional column naming often fails to capture this nuance, leading to misinterpretation and "bad decisions."
Core Principles of Activity Schema
The Activity Schema pattern fundamentally rethinks data organisation by focusing on actions and entities.
Focus on Activities: The first core insight was: "What if we can actually take a structure that represents the way people talk about data and what if we can actually relate data in the same way that people think about relating data." This means breaking the world into "activities" – discrete actions or events (e.g., "completed order," "got shipped," "signed contract," "started subscription").
"Rack and Stack" into a Single Table: Activities related to a core concept are "rack and stacked... into a single table," creating "one place to go" for questions related to that entity. This aims for "one table to rule them all," simplifying data discovery, lineage, and dictionary needs.
Temporal Joins / Analytical Functions: Instead of traditional joins, Activity Schema uses a defined set of "temporal joins" (or "analytical functions") to understand relationships between activities over time. Ahmed states there are "only 10 of them," making querying consistent and standardised. Examples include (a query sketch follows this list):
First Ever / Last Ever: (e.g., "first web visit," "last order").
First Before / Last Before: (e.g., "last time before [this order] did they submit a ticket").
In Between: (e.g., "in between orders did they email").
These allow adding "as many more Dimensions as people ask follow-up questions and slice some nice side data without ever having to worry about going back to the original definitions of models."
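A minimal sketch of a "Last Before" temporal join, written against the illustrative customer_stream table above. In practice the pattern's tooling generates these queries; this hand-written version just shows the shape:

```sql
-- "Last Before": for each completed order, find the last support ticket
-- the same customer submitted before that order.
SELECT
    o.customer,
    o.ts AS order_at,
    (SELECT MAX(t.ts)
       FROM customer_stream t
      WHERE t.customer = o.customer
        AND t.activity = 'submitted_ticket'
        AND t.ts < o.ts) AS last_ticket_before_order_at
FROM customer_stream o
WHERE o.activity = 'completed_order';
```

Because each question maps onto one of the ten relationship patterns, the query shape stays the same regardless of which two activities are being related.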
Core Actor/Entity Binding: Activities are always related to a "core actor or the entity that you care about." For most businesses, this primary entity is a "customer," "person," or "organization" ("who does what"). This means adapting language (e.g., "customer has product shipped to them from the store" instead of "store ships product"). This binding prevents "bad decisions" by ensuring analysis is tied to the entity experiencing the action (e.g., "when a customer first sees a class" rather than "when a class gets launched").
Multiple Activity Streams: While aiming for "one table," complex businesses may have "multiple entities and multiple activity schemas" (called "activity streams"). For instance, a "person stream" and a "company stream," or a "customer stream" and a "product stream." The decision to create a separate stream is driven by the questions being asked (e.g., "how long [does it] take for a product to be manufactured" would go to the product stream).
Duplication Across Streams (Intentionally): It's "okay for activities to be in both streams" if relevant to multiple core concepts (e.g., "customer has product manufactured" activity could be in both customer and product streams). This avoids complex "cross-join[s] across those two activity schemas."
Features as JSON Blobs: While initially supporting only three "features" (dimensional attributes) per activity, the pattern now allows "a Json blob" for "as many features as you want." This embeds detailed information directly into the activity row, eliminating the need for separate lookups (e.g., order_feature_json.order_value).
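For example, a feature can be read straight off the activity row at query time rather than joined from a lookup table. A sketch, noting that JSON syntax varies by warehouse (Snowflake would use feature_json:order_value; the BigQuery-style form is shown below):

```sql
-- Pull the order_value feature out of the JSON blob,
-- with no join to a separate orders dimension.
SELECT
    customer,
    ts,
    JSON_EXTRACT_SCALAR(feature_json, '$.order_value') AS order_value
FROM customer_stream
WHERE activity = 'completed_order';
```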
"Attributes" as Activities: The pattern encourages treating traditionally seen "attributes" as activities if they represent an action or event. For example, instead of an "ad source" being an attribute of a page view, it could be modelled as "customer viewed ad," correlating it with "customer viewed web page."
Agile and Iterative: The Activity Schema is an "agile modeling technique." Users are encouraged to "start small," typically with a customer activity stream and a few events, adding more as questions arise. If an activity is never used, it can be "removed" as the data is "just captured in the warehouse."
Advantages of Activity Schema
Scalability: Directly addresses the "why data scales terribly" problem by standardising data structure.
Simplicity and Usability: Significantly reduces the number of tables, making data easier to find and understand ("if you had one table only... problem solved").
Consistent Querying: The 10 temporal join patterns provide a "more consistent standardized way of relating data," reducing ad-hoc, complex SQL.
Faster Analysis: Enables rapid iteration on questions. "You can always add as many more Dimensions as people ask follow-up questions and slice some nice side data without ever having to worry about going back to the original definitions of models."
Improved Decision Making: By explicitly tying activities to core entities (e.g., customers), it helps prevent misinterpretations and "bad decisions" often caused by ambiguous data.
Self-Explanatory Data: When structured consistently, the table "should be self-explanatory," reducing the need for extensive lineage and data dictionaries.
Facilitates Sharing and Reuse: A standardised structure allows for sharing of "packages" (common activities, dashboards, analyses) across companies within the same industry.
Compatibility with AI/ML: Its rigid, consistent structure, aligned with how humans articulate questions, makes it ideal for text-to-SQL tools and ChatGPT integration: "the activity schema is, I believe, the only way that ChatGPT will ever work on top of data."
Anti-Patterns & Considerations
Ahmed outlines several anti-patterns and guiding principles:
More Than Two Streams: "If you built more than two streams you should ask yourself why."
More Than 30 Activities: "If you have more than 30 activities you should stop and say why." Most companies don't need this many.
Query Time > 10 Minutes: "If you find yourself taking more than 10 minutes to answer a question you just stop and say why."
Activity Definition Time > 15 Minutes: "If you're spending more than 15 minutes to Define an activity you should stop and ask why." Activities should be "simple," "slight mapping," not "really complicated things."
Shoving Derived Data into Features: Avoid putting derived attributes (like "ad source of the order" or "attribution of the order") directly into activity features. These are often better modelled as separate activities (e.g., "customer viewed ad") that can then be temporally joined.
Over-reliance on Features: While JSON blobs allow many features, it's recommended "that you don't put more than three or five." The ability to "borrow features from other activities" via temporal joins often negates the need for excessive features on a single activity.
Slowly Changing Dimensions (SCDs): While the pure Activity Schema would model changes (e.g., customer name change) as new activities (e.g., "customer updated their location"), it can support SCDs via background processes for specific edge cases.
Not a Replacement for All Modelling: For companies like Nike with "core metrics" or "CFO reports" requiring "very specific" definitions, traditional semantic layers or dimensional models might still be appropriate. Activity Schema excels in "open-ended fuzzy repetitive question[s]" and "chaos situations."
Software Engineers Tracking Everything: While beneficial for software engineers to engage with data, their tendency to "track everything" (e.g., "log open modal") can lead to "garbage activities." The recommendation is to "start from what people are asking and build models that represents that language," rather than what's represented in the database.
Activity Schema's Place in the Data Landscape
Contrast with Data Vault: While Shane Gibson initially thought Data Vault could be a middle "combining layer," Ahmed suggests that "most [of] our customers don't do that," preferring to build activities directly "on top of production systems."
Collaboration and Team Structure: The pattern naturally supports domain-bound data teams (e.g., "customer activities team," "manufacturing activities team"). Hand-offs occur when activities defined by one team contain a core concept relevant to another.
Complementary to Data Mesh (Potentially): While the podcast doesn't explicitly endorse Activity Schema as Data Mesh, it aligns with principles of pushing data ownership closer to source and domain teams. However, Ahmed cautions against engineering teams blindly tracking everything, advocating for business-driven activity definitions.
Strong Alignment with Generative AI (ChatGPT): The opinionated, standardised structure and the use of natural language in defining activities and temporal joins make Activity Schema highly compatible with generative AI tools for text-to-SQL functionality. It provides the "controlled space" and "standard structure" needed for these tools to perform effectively.
Conclusion
The Activity Schema offers a compelling alternative to traditional data modelling, particularly for organisations grappling with data scalability, analytical complexity, and the need for rapid insights from diverse data sources. By centralising event data around core entities and standardising temporal relationships, it aims to democratise data access and accelerate the analytical workflow. While it presents a shift in mindset and requires adherence to its core principles, its promise of simplifying complex questions and enabling advanced analytics makes it a significant development in modern data architecture.
Stakeholder - “That's not what I wanted!”
Data Team - “But that's what you asked for!”
Struggling to gather data requirements and constantly hearing the conversation above?
Want to learn how to capture data and information requirements in a repeatable way, so stakeholders love them and data teams can build from them, using the Information Product Canvas?
Have I got the book for you!
Start your journey to a new Agile Data Way of Working.