As we started to deliver our Fractional Data Service to customers, we began to experience the variability of data integration challenges you would expect when dealing with different organisations, in different industries and different countries, all using different systems of capture and different technologies.
But as you would expect, we have identified and adopted a core set of data collection patterns that we reuse to help us scale the machines, not the humans, following the Define it Once, Reuse it Often (DORO) principle that is key to our AgileData Way of Working.
Over the last five years we have settled on 5 core data collection patterns.
Push: data is pushed to a secure Google Cloud Storage “landing zone” from the System of Capture.
Pull: data is pulled from the System of Capture by AgileData using a Google Cloud service or a third party SaaS Data Collection service.
Stream: data is streamed to a Google Cloud Pub/Sub topic or directly into the underlying Google Cloud BigQuery instance.
Share: data is shared between partner organisations, ensuring controlled access and collaboration across parties.
File Drop: data is manually uploaded via the AgileData App, or manually dropped into a secure Google Cloud Storage bucket.
Of course there are always patterns within patterns. For example, we may use a delta detection pattern on the AgileData side to detect changes, we may rely on the system of capture to push change data records to us, or we may be relying on only new events being streamed to us.
But when we first look at a new Data Collection problem, we always start with which of these 5 core Data Collection patterns we are going to use, before we delve into the finer implementation details.
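To make the delta detection idea concrete, here is a minimal sketch, assuming both the freshly landed data and the previously loaded history already sit in BigQuery tables. The project, dataset and table names are made up for the example and are not our actual naming conventions.

```python
# Hedged sketch: find new or changed rows by comparing the landed table
# against the history table. Table names are illustrative only.
from google.cloud import bigquery

client = bigquery.Client()

# EXCEPT DISTINCT keeps only the rows in the landed table that do not
# already exist in the history table, i.e. the delta.
delta_sql = """
SELECT * FROM `example-project.landing.customers`
EXCEPT DISTINCT
SELECT * FROM `example-project.history.customers`
"""

for row in client.query(delta_sql).result():
    print(dict(row))
```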
We also have a toolkit of technology options that we have defined and tested that help us quickly leverage one or many of these patterns.
For example:
Customers can manually upload CSV or JSON files to a secure Google Cloud Storage bucket, or use the file upload screen in our AgileData App to do the same data task.
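As a rough illustration of that File Drop pattern, here is a minimal sketch using the Google Cloud Storage Python client. The bucket name and object path are placeholders for the example, not our actual landing zone layout.

```python
# Hedged sketch: drop a local CSV file into a secure Cloud Storage bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-agiledata-landing-zone")

# A per-source prefix keeps each customer's dropped files separate so the
# downstream collection jobs know where to look.
blob = bucket.blob("crm/customers/2024-01-01/customers.csv")
blob.upload_from_filename("customers.csv")
```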
We can automagically collect data from Amazon S3, Azure Blob Storage or any other form of file-based “data lake” using the Google Cloud Storage Transfer Service.
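For the curious, an S3 to Cloud Storage transfer job looks roughly like the sketch below, using the Storage Transfer Service Python client. The project, bucket names, credentials handling and one-off schedule are placeholders for illustration, not our production setup.

```python
# Hedged sketch: a one-off transfer job that copies a customer's S3 bucket
# into a Cloud Storage landing bucket. All names and keys are placeholders.
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    "project_id": "example-project",
    "description": "Collect files from a customer S3 data lake",
    "status": storage_transfer.TransferJob.Status.ENABLED,
    # Start and end on the same date, i.e. run once.
    "schedule": {
        "schedule_start_date": {"year": 2024, "month": 1, "day": 1},
        "schedule_end_date": {"year": 2024, "month": 1, "day": 1},
    },
    "transfer_spec": {
        "aws_s3_data_source": {
            "bucket_name": "example-customer-data-lake",
            "aws_access_key": {
                "access_key_id": "AWS_ACCESS_KEY_ID",
                "secret_access_key": "AWS_SECRET_ACCESS_KEY",
            },
        },
        "gcs_data_sink": {"bucket_name": "example-agiledata-landing-zone"},
    },
}

job = client.create_transfer_job({"transfer_job": transfer_job})
print(job.name)
```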
We can stream Google Analytics or Google Ads data directly to Google BigQuery using the native Google data collectors.
We can stream data to Google Cloud Pub/Sub.
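A minimal sketch of that streaming pattern, using the Pub/Sub Python client, is shown below. The project, topic and event payload are illustrative only.

```python
# Hedged sketch: publish an event to a Pub/Sub topic that a downstream
# collector drains into BigQuery. Names and payload are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "customer-events")

event = {"customer_id": "c-123", "event_type": "signup"}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))

# publish() is asynchronous; result() blocks until the message id comes back.
print(future.result())
```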
We can Pull data from hundreds of SaaS data sources using Dataddo, or we can create a custom data collector using Meltano if Dataddo doesn’t already have one.
This is just a subset of the technology patterns we now use, and no doubt we will add to them the next time we onboard a new Customer with a new technology problem we need to solve.
The key IMHO is to abstract the Data Collection patterns from the technical implementation patterns, to create a shared language.
Our shared language always starts with:
Do we need to Push, Pull, Stream, Share or File Drop the data to get it into AgileData?
If there is a pattern you think we haven’t found yet, feel free to drop it into the comments.