Sunday, August 7, 2016

Let's talk about how Slowly we are Changing

Data warehouses are great. I'd better think that, since I've made a career out of developing them. Like any software, they exist to provide value and solve problems, and like any software there are design patterns for solving particular problems. It is my opinion that these design patterns are too often applied without enough thought about, or understanding of, exactly what they are for and when they are not suitable. I will spend the next few posts explaining this, and suggesting some variations for particular issues that I seem to come across quite often and which are not adequately solved by the existing data warehousing patterns.

Ok, there's the abstract over - what am I actually talking about here? The standard data warehousing tools you have in your tool-belt: transaction facts, snapshot facts and slowly changing dimensions (of various kinds). Sorry if some of this gets a little simple for you - I think it's important to go over the basics and investigate the guiding principles, especially since most DW specialists I know of picked their knowledge up on the job in one form or another, and have spent less time than they might have thinking purely about theory (I know I've been guilty of this, though I have also been fortunate enough to be surrounded by some really excellent people who like talking about things). So let's start with some definitions.

Data warehouses are made to allow analysis of data, particularly over time. This is why the date dimension is so important - in the words of Kimball, it should be attached to virtually every fact table, so that's cool.

Next, a fact is something that has happened - something measurable, even if the measurement is just the observation that it happened. This needs fleshing out more, but I'll get to that later; this is good enough for now.

The definition of the word "dimension" is a little tricky, because different people mean different things by it. Here are a couple:
  • A dimension is a group of related attributes by which we measure our facts (space-time might be a dimension in this context). This is likely what a data warehousing specialist means when they say the word "dimension".
  • A dimension is the axis by which we measure things, for example the x axis in a graph, or the three dimensions that we live in, or time. Note that under the first definition this would be called a dimension attribute: one of the related attributes in the group. This is the closest version to the non-jargon meaning of the word.
  • A dimension is a plane of existence, and you can create cool portals and kill versions of yourself in other dimensions to reap their power for yourself. I won't talk about this one any more, as it's irrelevant to us.
Now, even though we as data warehousing specialists typically use the first definition, we managed to mire ourselves with the term "slowly-changing dimension" or SCD. This is the most awkwardly named term I've run across in the field, because while it's definitely jargon, the "dimension" which is slowly changing is actually a dimension attribute according to our jargon, and that's without even getting into the whole "slowly-changing" part. So that sucks, but it's good to remember this, or it can and will cause confusion.
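To make the first definition concrete, here is a minimal sketch in Python. Everything in it is a made-up illustration rather than anything from a real warehouse: `ProductDim`, `SaleFact` and `total_by_category` are hypothetical names. The dimension is the whole group of attributes (`name`, `category`), while the thing you actually slice by in a report is a single dimension attribute (`category`).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ProductDim:
    """One row of a hypothetical product dimension: a group of related attributes."""
    product_key: int   # surrogate key referenced by the fact table
    name: str          # a dimension attribute
    category: str      # another dimension attribute


@dataclass(frozen=True)
class SaleFact:
    """One row of a hypothetical transaction fact: something measurable that happened."""
    date_key: int      # points at the date dimension, e.g. 20160807
    product_key: int   # points at the product dimension
    amount: float      # the measurement


def total_by_category(facts: Iterable[SaleFact],
                      products: Dict[int, ProductDim]) -> Dict[str, float]:
    """Sum the fact measure, grouped by one attribute of the product dimension."""
    totals: Dict[str, float] = defaultdict(float)
    for fact in facts:
        totals[products[fact.product_key].category] += fact.amount
    return dict(totals)


products = {
    1: ProductDim(1, "Widget", "Hardware"),
    2: ProductDim(2, "Gadget", "Hardware"),
    3: ProductDim(3, "Manual", "Books"),
}
facts = [
    SaleFact(20160807, 1, 10.0),
    SaleFact(20160807, 2, 5.5),
    SaleFact(20160807, 3, 2.5),
]
totals = total_by_category(facts, products)
```

When an SCD "changes", it is a value like `category` on one of these rows that changes, not the dimension as a whole.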

SCDs of course come in a variety of types, and exist for one reason only: to decide how the data warehouse should treat changing data in source systems. Different SCD types are suitable for different issues, and no SCD should be considered a "default" lest we make asses out of you and me. That, of course, doesn't stop us from treating Type 1 as a default, but let's pretend we don't do that for a while, because I want to talk about how much they don't work.
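For anyone who wants a reminder of what "treating changing data" looks like in practice, here is a rough Python sketch of the two most commonly used types, as I understand the usual Kimball descriptions: Type 1 overwrites the attribute, Type 2 closes off the current row and adds a new one. The `CustomerRow` layout, the `valid_from`/`valid_to` columns and the function names are assumptions made up for this example, not a reference implementation.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    """One row of a hypothetical customer dimension."""
    surrogate_key: int
    customer_id: str                  # natural key from the source system
    city: str                         # the attribute that changes in the source
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current row


def apply_type1(rows: List[CustomerRow], customer_id: str,
                new_city: str) -> List[CustomerRow]:
    """Type 1: overwrite the attribute in place, so the old value is lost."""
    return [replace(r, city=new_city) if r.customer_id == customer_id else r
            for r in rows]


def apply_type2(rows: List[CustomerRow], customer_id: str, new_city: str,
                change_date: date) -> List[CustomerRow]:
    """Type 2: end-date the current row and add a new row, so history is kept."""
    next_key = max((r.surrogate_key for r in rows), default=0) + 1
    result: List[CustomerRow] = []
    for r in rows:
        if r.customer_id == customer_id and r.valid_to is None:
            result.append(replace(r, valid_to=change_date))
            result.append(CustomerRow(next_key, customer_id, new_city, change_date))
        else:
            result.append(r)
    return result


dim = [CustomerRow(1, "C42", "Auckland", date(2015, 1, 1))]
type1 = apply_type1(dim, "C42", "Wellington")
type2 = apply_type2(dim, "C42", "Wellington", date(2016, 8, 7))
```

After the same source change, `type1` still has one row and no memory of the old city, while `type2` has two rows with different surrogate keys. Which of those a fact row should point at is exactly the kind of question the choice of SCD type is meant to answer.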

Of course, they do work, but they don't work in every situation, and I see people trying to make up for that with odd, ad-hoc solutions based on what seemed like a good idea at the time. This is painful to me, as that is exactly what design patterns are supposed to solve in the first place. Slowly-changing dimensions are seen as the only options, and that's not great, because we should be able to define new patterns (the same is true for fact tables - Kimball defines 3 fundamental types, but there are other options - something I will be getting into over the next few weeks).

So, we have a claim: SCDs work in some situations and not in others. Obviously the question is, which SCDs work in which situations? This is intimately wrapped up in the question of which fact tables work where, and I will start to get into it thoroughly in my next post, but the underlying principle is that SCDs (like much Kimball doctrine, and for a reason) are really geared towards transaction fact tables, and quickly fall short in many real-world scenarios that I've come across. So, maybe it's time for us as a community to change a little less slowly, and start working with a few extra patterns.

See my next post on the topic for more specifics on how the patterns currently serve us.
