I’ve been a data analyst consumer in a large retail organization, and the EDW team provided the current state table, with the historical one suffixed with _HIST. It was pretty straightforward to use and understand, and once I started using dbt and actually running data tests, I helped them discover and debug some issues.
This is not a new idea at all; it is at least 30 years old. Maintaining a separate current state snapshot for the high proportion of analysts who only care about current state is a pattern consistent with minimising processing on constrained resources, and even more so now that some cloud platforms charge for the entire set of rows in the table, not the subset extracted. What has changed is the rise of Data Science driven feature engineering, interested in variable windows applied to that history. A similar approach to creating a current snapshot can be applied to maintaining the outputs of feature engineering. If multiple Data Scientists require the same feature, why would you require them all to re-run that process?
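A minimal sketch of the snapshot pattern, assuming an SCD2-style history table with validity columns (all table and column names here are hypothetical):

```sql
-- Hypothetical SCD2 history table: customer_hist(customer_id, ..., valid_from, valid_to)
-- Derive the current-state table once, so analysts never have to scan full history.
CREATE OR REPLACE TABLE customer AS
SELECT customer_id,
       customer_name,
       segment,
       valid_from AS last_changed_at
FROM   customer_hist
WHERE  valid_to IS NULL;  -- an open-ended row represents current state
```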
It’s helpful to share old ideas because there will always be people where it’s the first time they’re hearing them
Absolutely!
But despite that, some people claim that keeping the history of change is pointless. It is not. Snowflake has "time travel", which in many respects achieves something similar, but is definitely not the same.
SCD2 just keeps the deltas and marks the prior record as the previous state.
Time travel is presented as a complete copy of the table (even if, under the covers, that is not what is stored).
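To make the contrast concrete, here is a sketch of the same point-in-time question asked both ways (table and column names are hypothetical; the AT syntax is Snowflake's):

```sql
-- SCD2: reconstruct state as of a timestamp from the validity window.
-- Works over the entire stored history, however far back it goes.
SELECT *
FROM   customer_hist
WHERE  valid_from <= '2024-06-01'::timestamp
  AND  (valid_to IS NULL OR valid_to > '2024-06-01'::timestamp);

-- Snowflake time travel: the whole table as it looked at that moment,
-- but only within the retention period (at most 90 days), not decades of history.
SELECT *
FROM   customer AT (TIMESTAMP => '2024-06-01'::timestamp);
```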
Not sure I ever said it was a new idea, did I, Mike?
No, you did not. But I am clearly stating it is not new so that a newbie reading this understands that is the case. There are patterns associated with the use of SCD2 that both simplify and perform well. It takes a while to learn and appreciate them. The Data Science pattern that I described is something I have implemented in several environments, and it has been appreciated by Data Scientists in all of them. You being a pattern fanatic, I thought you would appreciate it.
Was just making sure I hadn’t implied anywhere that I created or invented that pattern in any way. At best I innovate on patterns that were invented by others; more often than not I just discover them and disseminate the fact that they should be used, and more importantly, when they are useful.
Sharing is Caring!
And yup, I liked the pattern you described of automating the Analytical Feature Flag tables (as compared to App deployment feature flags) so anybody can use them and not have to invent their own every time.
They are not just analytical feature flags. DS features can be flags [0,1], continuous (e.g. amount spent on something), or discrete values such as age group. The point is that once they prove useful in one ML model, they are likely to be useful in another, or multiple Data Scientists will want to see if they are significant in their model. You do not want a team of Data Scientists running the same engineering process independently if you can do it once for all of them.
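As a rough illustration of computing features once for everyone (all names, windows, and bucket boundaries are hypothetical), a shared feature table might mix all three feature types:

```sql
-- Hypothetical shared feature table, rebuilt on a schedule so every
-- Data Scientist reads the same engineered features instead of
-- re-deriving them per project.
CREATE OR REPLACE TABLE customer_features AS
SELECT c.customer_id,
       -- flag feature [0,1]
       CASE WHEN MAX(o.order_date) >= CURRENT_DATE - 90
            THEN 1 ELSE 0 END                AS active_last_90d,
       -- continuous feature
       COALESCE(SUM(o.order_amount), 0)      AS spend_last_365d,
       -- discrete feature
       CASE WHEN c.age < 25 THEN '18-24'
            WHEN c.age < 45 THEN '25-44'
            ELSE '45+' END                   AS age_group
FROM   customer c
LEFT JOIN orders o
       ON o.customer_id = c.customer_id
      AND o.order_date >= CURRENT_DATE - 365
GROUP BY c.customer_id, c.age;
```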
Wait until you hear about the client I'm working with that insists on type 145 fact tables EVERYWHERE... Classically, of course, it depends. Definitely consumers shouldn't be concerned with the technical minutiae: they should have a requirement, and that requirement should be valuable and fulfilled in a usable, viable and timely manner. I'm sure I have a biased experience, but I know that many BI tools prefer a dimensional model, and so some analysts' preferences will be swayed by their tooling.
Deffo agree on the tooling sometimes influencing the choice of physical modeling patterns.
From memory, MicroStrategy always preferred snowflake data models (that’s snowflake the modelling pattern, not Snowflake the cloud analytics database).
It’s always about the right pattern given your context.
Even so... as a Kimballite, I'd probably build OBTs as a mart on top of a dimensional model so consumers have a choice.
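A minimal sketch of that layering, assuming a simple star schema underneath (all table and column names are hypothetical):

```sql
-- Mart layer: flatten the star schema into One Big Table (OBT)
-- so consumers can pick either the dimensional model or the OBT.
CREATE OR REPLACE VIEW mart.sales_obt AS
SELECT f.order_id,
       f.order_date,
       f.order_amount,
       c.customer_name,
       c.segment,
       p.product_name,
       p.category
FROM   fact_sales   f
JOIN   dim_customer c ON c.customer_key = f.customer_key
JOIN   dim_product  p ON p.product_key  = f.product_key;
```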
IMHO that is often a great layered data architecture pattern.