I’ve been a data analyst consumer in a large retail organization, and the EDW team provided the current state table, with the historical one suffixed with _HIST. It was pretty straightforward to use and understand, and once I started using dbt and actually running data tests, I helped them discover and debug some issues.
This is not a new idea at all; it is at least 30 years old. Maintaining a separate current state snapshot for the high proportion of analysts who only care about current state is a pattern consistent with minimising processing on constrained resources, and even more so now that some cloud platforms charge for the entire set of rows in the table, not the subset extracted. What has changed is the rise of Data Science driven feature engineering, interested in variable windows applied to that history. A similar approach to creating a current snapshot can be applied to maintaining the outputs of feature engineering. If multiple Data Scientists require the same feature, why would you require them all to re-run that process?
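A minimal sketch of the snapshot pattern, assuming an SCD2-style history table with validity columns (all table and column names here are hypothetical):

```sql
-- Hypothetical SCD2 history table: customer_hist(customer_id, ..., valid_from, valid_to)
-- Derive the current-state table once, so analysts never have to scan full history.
CREATE OR REPLACE TABLE customer AS
SELECT customer_id,
       customer_name,
       segment,
       valid_from AS last_changed_at
FROM   customer_hist
WHERE  valid_to IS NULL;  -- an open-ended row represents current state
```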
It’s helpful to share old ideas because there will always be people where it’s the first time they’re hearing them
Absolutely!
But despite that, some people claim that keeping the history of change is pointless. It is not. Snowflake has "time travel", which in many respects achieves something similar, but is definitely not the same.
SCD2 just keeps the deltas and marks the prior record as the previous state.
Time travel is presented as a complete copy of the table (even if, under the covers, that is not what is stored).
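To make the contrast concrete, here is a sketch of the same point-in-time question asked both ways (table and column names are hypothetical; the AT syntax is Snowflake's):

```sql
-- SCD2: reconstruct state as of a timestamp from the validity window.
-- Works over the entire stored history, however far back it goes.
SELECT *
FROM   customer_hist
WHERE  valid_from <= '2024-06-01'::timestamp
  AND  (valid_to IS NULL OR valid_to > '2024-06-01'::timestamp);

-- Snowflake time travel: the whole table as it looked at that moment,
-- but only within the retention period (at most 90 days), not decades of history.
SELECT *
FROM   customer AT (TIMESTAMP => '2024-06-01'::timestamp);
```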
Not sure I ever said it was a new idea, did I, Mike?
No, you did not. But I am clearly stating it is not new so that a newbie reading this understands that is the case. There are patterns associated with the use of SCD2 that both simplify and perform well. It takes a while to learn and appreciate them. The Data Science pattern that I described is something I have implemented in several environments, and it has been appreciated by Data Scientists in all of them. You being a pattern fanatic, I thought you would appreciate it.
Was just making sure I hadn’t implied anywhere that I created or invented that pattern in any way. At best I innovate on patterns that were invented by others; more often than not I just discover them and disseminate the fact that they should be used, and more importantly, when they are useful.
Sharing is Caring!
And yup, I liked the pattern you described of automating the Analytical Feature Flag tables (as compared to App deployment feature flags) so anybody can use them and not have to invent their own every time.
They are not just analytical feature flags. DS features can be flags [0,1], continuous (e.g. amount spent on something), or discrete values such as age group. The point is that once they prove useful in one ML model, they are likely to be useful in another, or multiple Data Scientists will want to see if they are significant in their model. You do not want a team of Data Scientists running the same engineering process independently if you can do it once for all of them.
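As a rough illustration of computing features once for everyone (all names, windows, and bucket boundaries are hypothetical), a shared feature table might mix all three feature types:

```sql
-- Hypothetical shared feature table, rebuilt on a schedule so every
-- Data Scientist reads the same engineered features instead of
-- re-deriving them per project.
CREATE OR REPLACE TABLE customer_features AS
SELECT c.customer_id,
       -- flag feature [0,1]
       CASE WHEN MAX(o.order_date) >= CURRENT_DATE - 90
            THEN 1 ELSE 0 END                AS active_last_90d,
       -- continuous feature
       COALESCE(SUM(o.order_amount), 0)      AS spend_last_365d,
       -- discrete feature
       CASE WHEN c.age < 25 THEN '18-24'
            WHEN c.age < 45 THEN '25-44'
            ELSE '45+' END                   AS age_group
FROM   customer c
LEFT JOIN orders o
       ON o.customer_id = c.customer_id
      AND o.order_date >= CURRENT_DATE - 365
GROUP BY c.customer_id, c.age;
```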
Wait until you hear about the client I'm working with that insists on type 145 fact tables EVERYWHERE... Classically, of course, it depends. Definitely consumers shouldn't be concerned with the technical minutiae: they should have a requirement, and that requirement should be valuable and fulfilled in a usable, viable and timely manner. I'm sure I have a biased experience, but I know that many BI tools prefer a dimensional model, and so some analysts' preferences will be swayed by their tooling.
Deffo agree on the tooling sometimes influencing the choice of physical modeling patterns.
From memory, MicroStrategy always preferred snowflake data models (that’s snowflake the modelling pattern, not Snowflake the cloud analytics database).
It’s always about the right pattern given your context.
Even so... as a Kimballite, I'd probably build OBTs as a mart on top of a dimensional model so consumers have a choice.
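A minimal sketch of that layering, assuming a simple star schema underneath (all table and column names are hypothetical):

```sql
-- Mart layer: flatten the star schema into One Big Table (OBT)
-- so consumers can pick either the dimensional model or the OBT.
CREATE OR REPLACE VIEW mart.sales_obt AS
SELECT f.order_id,
       f.order_date,
       f.order_amount,
       c.customer_name,
       c.segment,
       p.product_name,
       p.category
FROM   fact_sales   f
JOIN   dim_customer c ON c.customer_key = f.customer_key
JOIN   dim_product  p ON p.product_key  = f.product_key;
```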
IMHO that is often a great layered data architecture pattern.