Join Shane Gibson as he chats with Chris Gambill about a number of Data Engineering patterns.
Listen
Listen on all good podcast hosts or over at:
https://podcast.agiledata.io/e/data-engineering-patterns-with-chris-gambill-episode-65/
Read
Read or download the podcast transcript at:
https://agiledata.io/podcast/agiledata-podcast/data-engineering-patterns-with-chris-gambill/#read
Google NotebookLM Briefing
Briefing: Data Engineering Patterns Explained
Source: Excerpts from "Data Engineering Patterns Explained" - Agile Data Podcast with Chris Gambill
Date: [Implicit - Recent, given discussion of Fabric and current tech trends]
Prepared For: Anyone interested in modern data engineering practices, particularly those seeking to understand repeatable solutions to common problems in data architecture and operations.
Executive Summary
This podcast delves into the concept of "data engineering patterns" – repeatable solutions for common data-related problems, often fitting specific contexts. Chris Gambill, a data engineering veteran with 25 years of experience, shares his insights on various patterns he's encountered, particularly within the Microsoft and AWS ecosystems. A key takeaway is the dynamic nature of these patterns, necessitating continuous review due to the rapid evolution of data technologies. The discussion also highlights the importance of documentation, context-awareness, and the emerging role of AI in leveraging and sharing these patterns.
Key Themes and Most Important Ideas/Facts
Defining Data Engineering Patterns:
Patterns are conceptualised as "solutions for common problems which fit a certain context." This analogy is drawn from architectural patterns in building design, emphasizing their repeatability and suitability to specific scenarios.
Examples across various domains illustrate the concept: "the way people submit code to Git," "the way people pair program," "the five ceremonies of Scrum," or "a four tier data architecture." Each serves as a "solution to a common problem."
Crucially, a pattern's suitability is context-dependent: "given your context, it may fit, it may be valuable, or it may actually be an anti-pattern."
Core Data Engineering Patterns and Their Nuances:
Python Script ETL/ELT with Docker & AWS Fargate (Batch Processing):
Pattern: Writing a Python script for extract, transform, and load (ETL/ELT), containerising it with Docker, deploying it to AWS Elastic Container Registry (ECR), and orchestrating its execution via AWS Fargate and cron schedules (a minimal sketch follows this list).
Context/Use Case: Ideal for batch scheduling, "maybe once or twice a day, bigger processes."
Anti-Pattern: Not suitable for high-frequency loads (e.g., "every 15 minutes" or "five minute load") or real-time streaming, where "you need to probably go the Kafka route or like I said, Lambdas." Fargate costs can be high for frequent runs.
Deployment: Typically "spin it up and then you're killing it at the end," adopting a "deploy and destroy" serverless pattern.
AWS Components:
ECR (Elastic Container Registry): Stores container images.
ECS (Elastic Container Service): Manages scheduling and resource allocation (e.g., EC2 or Fargate clusters).
Fargate: Serverless compute for running containers.
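A minimal sketch of the one-shot container script this pattern describes, assuming a hypothetical source API, S3 landing bucket, and object key. In practice the script is baked into a Docker image, pushed to ECR, and launched on a cron schedule as an ECS/Fargate task.

```python
# Hypothetical batch ETL script intended to run as a one-shot Fargate container.
# The source URL, bucket, and key are placeholders, not from the episode.
import json
import urllib.request

import boto3  # AWS SDK; the task's IAM role needs s3:PutObject on the bucket

SOURCE_URL = "https://api.example.com/v1/orders"  # hypothetical source system
LANDING_BUCKET = "example-landing-zone"           # hypothetical S3 landing bucket


def extract() -> list[dict]:
    """Pull one batch of records from the source API."""
    with urllib.request.urlopen(SOURCE_URL) as resp:
        return json.loads(resp.read())


def load(records: list[dict]) -> None:
    """Land the raw batch as newline-delimited JSON in S3."""
    body = "\n".join(json.dumps(r) for r in records)
    boto3.client("s3").put_object(
        Bucket=LANDING_BUCKET,
        Key="orders/batch.jsonl",
        Body=body.encode("utf-8"),
    )


if __name__ == "__main__":
    load(extract())  # the container exits here, so Fargate tears the task down
```

Because the container exits as soon as the batch finishes, compute is only paid for while the load runs, which is why this fits once-or-twice-a-day batches rather than 15-minute or streaming loads.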
Landing Data in File-Based Storage (Data Lake Pattern):
Pattern: Landing raw data into file-based storage like Google Cloud Storage (GCS), AWS S3, or Azure Data Lake Storage (ADLS Gen2/OneLake).
Benefit: Enables cost-effective loading into downstream systems (e.g., "if we load from Google Cloud storage into BigQuery, it's free. We don't pay any compute").
Exceptions: Out-of-the-box adapters (e.g., GA4 to BigQuery) may bypass this layer for convenience.
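As a sketch of the load step behind that benefit (object storage into BigQuery), assuming the google-cloud-bigquery client library, application default credentials, and hypothetical bucket, project, dataset, and table names:

```python
# Load newline-delimited JSON files landed in GCS into a raw BigQuery table.
# Batch load jobs like this use BigQuery's shared load slots, so no query
# compute is billed for the load itself. All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema for the raw landing table
)

load_job = client.load_table_from_uri(
    "gs://example-landing-zone/orders/*.jsonl",   # landed files in GCS
    "example-project.raw.orders",                 # destination table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```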
Data Adapters/Connectors:
Challenge: Organizations often have a mix of well-known systems with off-the-shelf adapters (e.g., HubSpot, Salesforce) and custom or niche systems requiring bespoke adapters ("bring your own adapter" - BYOA).
Considerations: Customisations to off-the-shelf packages can render commercial adapters unusable, forcing custom builds. Security and governance requirements (e.g., in cybersecurity, government domains) might favour custom Python scripts over "low-code/no-code" tools due to greater control.
Databricks Serverless Library Management (Bootstrap Notebooks):
Problem: Databricks serverless compute, being ephemeral, doesn't support traditional initialisation scripts for pre-loading libraries. Every new orchestration effectively starts with an empty container.
Pattern (Bootstrap Notebook): A "bootstrap notebook" is used to consistently install required Python libraries. This notebook determines if a "wheel" (a local, static file of the library) is available in a data lake (ADLS, S3) for faster, version-controlled installation. If not, it falls back to PyPI (a sketch follows this list).
Benefits: "one central location for maintainability and for consistency across your full environment so that when you're troubleshooting, you have one place to go."
"Wheel": A pre-compiled or static package for Python libraries, offering consistency and faster installation compared to downloading from PyPI every time. It allows for version control, ensuring "a consistent version of that library."
Azure Data Factory (ADF) Orchestration:
Pattern: Using ADF pipelines to orchestrate various tasks, including running Databricks notebooks and other ADF pipelines. Includes built-in success/failure handling, alerting (e.g., Teams messages), and metadata tracking (logging run stats, records loaded, run IDs).
Comparison to Airflow:
ADF: Can orchestrate within the Azure ecosystem, but may require creative solutions for external systems without native connectors. People often "schedule and forget" ADF pipelines, underutilising its orchestration capabilities.
Airflow: Often seen as the go-to orchestration tool, especially for hybrid cloud environments or orchestrating disparate systems due to its robust connectivity.
Databricks Native Orchestration: While not a "native orchestrator" like Airflow or ADF, custom Python notebooks within Databricks can orchestrate child notebooks, leveraging Spark's concurrency for sequential or parallel execution. These can be "table-driven" for dynamic orchestration.
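A sketch of that table-driven child-notebook orchestration, assuming a Databricks notebook context (spark and dbutils are provided by the runtime) and a hypothetical config table etl.orchestration_config with columns notebook_path, run_group, and is_enabled:

```python
# Run child notebooks in groups: groups execute sequentially, and the notebooks
# within a group run in parallel on the same cluster. Table and column names
# are placeholders for whatever the config ("context") tables actually hold.
from concurrent.futures import ThreadPoolExecutor

config = (
    spark.table("etl.orchestration_config")
    .where("is_enabled = true")
    .collect()
)

# Bucket the enabled notebooks by their run group
groups: dict = {}
for row in config:
    groups.setdefault(row["run_group"], []).append(row["notebook_path"])


def run_child(path: str) -> str:
    """Execute a child notebook and return its exit value."""
    return dbutils.notebook.run(path, 3600)  # 1-hour timeout per child


for group in sorted(groups):
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_child, groups[group]))
    print(f"run_group {group} finished: {results}")
```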
Cross-Cutting Principles and Challenges:
Context is King: The suitability of any pattern is entirely dependent on the specific organizational, technical, and business context. "If you think about it again, that context is key." This includes factors like data residency, security requirements, existing technology stacks, and skill availability.
Patterns Within Patterns: Data engineering solutions are often composed of multiple interlocking patterns. An orchestration pattern, for instance, encompasses sub-patterns for alerting, logging, and dynamic configuration.
The Technical Deployment May Change: Even if the core conceptual pattern remains, its technical implementation will vary significantly across different technologies (e.g., Databricks vs. Azure Synapse). "The technical deployment of a pattern may change depending on which technology you use."
Rapid Technological Evolution: The data platform landscape changes at an incredible pace. Patterns need to be "iterated over time" and reviewed frequently (Chris aims for "at least once every 12 months, if not every six months") to ensure efficiency and relevance. Microsoft Fabric is cited as a prime example of rapid evolution from a "marketing architecture" to a robust platform.
Vendor Patterns/Strategies:
Microsoft (Fabric): Focus on unified control planes, a "one lake" data strategy, and integrated UI. Initially priced to attract large customers as early adopters, it's now more accessible and robust. Microsoft is "coercing people towards Fabric in many different ways" (e.g., deprecating Power BI licenses, changing certifications).
Databricks/Snowflake: A tendency to "announce features that don't exist now" at summits, followed by phased rollouts (internal, trusted users, early adopters, generally available) over a ~12-month cycle.
Snowflake's "End-to-End" Strategy: Increasingly integrating third-party functionalities directly into their platform, putting pressure on customers to "decommission all the third party products."
dbt (Fusion): As an open-source project transitioning to a commercial entity with VC funding, it's a "common pattern" for more features to move behind a paywall to drive growth and profitability.
Dynamic/Table-Driven Architecture (Mega Pattern - "Context Layer"):
Pattern: Storing configuration, business logic, dependencies, and attributes in relational tables to dynamically generate and orchestrate data processes, rather than hardcoding them (a sketch follows this list).
Benefits: Reduces "overhead," improves maintainability, especially for inherited codebases. "It's so much more maintainable than hard coding it."
Analogy: Think of it as a "context layer" that drives orchestration and ETL processes, making them adaptable to change without extensive code modification.
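A minimal sketch of the context-layer idea, with an inline list standing in for what would normally be a relational config table; the table names, keys, and Delta-style MERGE syntax are assumptions for illustration:

```python
# Generate load statements from configuration instead of hardcoding one script
# per table. In a real pipeline CONFIG would be read from a relational table
# and the generated SQL executed, e.g. with spark.sql(sql).
CONFIG = [
    {"source": "raw.customers", "target": "curated.customers", "keys": ["customer_id"]},
    {"source": "raw.orders",    "target": "curated.orders",    "keys": ["order_id"]},
]


def build_merge_sql(entry: dict) -> str:
    """Build a simple Delta-style MERGE statement from one config entry."""
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in entry["keys"])
    return (
        f"MERGE INTO {entry['target']} t "
        f"USING {entry['source']} s ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


for entry in CONFIG:
    print(build_merge_sql(entry))
```

Adding a new entity then means adding a row to the config table rather than writing and deploying another hardcoded load script.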
The Overwhelming Landscape for Newcomers: The sheer volume of tools, platforms, languages (Python, Java, SQL), and undocumented patterns makes data engineering incredibly challenging for junior professionals. "You don't know what you don't know."
The Challenge of Documentation & Sharing:
Importance: Crucial to avoid "tribal knowledge" where patterns are lost if key personnel leave. "So important I think, to document these things and write them down."
Reality: Documentation is often an "overhead" for consultants paid hourly, making it less incentivised. Even documented patterns can be hard to find and reuse, leading engineers to "just write from scratch" based on memory.
AI's Potential Role: LLMs could potentially act as repositories for documented patterns, providing step-by-step guidance and even generating code based on project descriptions, helping less experienced individuals navigate the complexity.
Future Outlook
The rapid evolution of data technologies and vendor strategies means that data engineering patterns will continue to adapt. The increasing consolidation of platforms (like Microsoft Fabric and Snowflake's expansion) reflects a desire for integrated, end-to-end solutions. The rise of AI/LLMs holds promise for democratising access to, and leveraging of, documented patterns, potentially mitigating the "overwhelming" nature of the field for newcomers and fostering better knowledge sharing. However, the onus remains on experienced professionals to document and refine these patterns.
«oo»
Stakeholder - “That's not what I wanted!”
Data Team - “But that's what you asked for!”
Struggling to gather data requirements and constantly hearing the conversation above?
Want to learn how to capture data and information requirements in a repeatable way, using the Information Product Canvas, so stakeholders love them and data teams can build from them?
Have I got the book for you!
Start your journey to a new Agile Data Way of Working.