The Warehouse of Data: August 2016

Monday, August 29, 2016

DW Patterns: The good - type 3

Ok, today I will be continuing on from last week - this time talking about the first of the less-well-known SCD types:

Type 3: Add new attribute

See definition on Wikipedia

This is an interesting pattern - of all of the SCD patterns that actually retain attribute history, this is the only one that provides easy access to any attribute value other than the current value (i.e. type 1) or the as-at value (i.e. type 2). With type 3 we get the current value, as well as one or more immediately prior values. So, why would you want to use this?

The type 3 is great for transitions. Now, I'm not talking about transitions from one entity state to another - if you're wanting that, you're almost certainly better off with Type 2, 4 or 6 (coming in the next post). Those are great at managing the changes to entities as the react with your facts. What I mean is changes in business state; things like changes to business rules, changes to organisational structure and so on.

Type 3 is a good tool for evaluating the effectiveness of changes - you can compare, for example, the effectiveness of a new workload distribution method vs. the previous method, side-by-side. Once you've decided that the new method is better, the old doesn't matter - and when you're then assessing another new method later, you probably don't care what the old, worse method was - that history is unnecessary. You might even decide, once your evaluation period is over, that the field doesn't need to be Type 3 any more, and get rid of the "previous value" attribute altogether, which is fine.

This really comes into its own when the grain of a dimension becomes complicated - if it's a single key, then you can quite happily implement Type 3 over the top of a Type 2 or 6 by using a view; if that involves matching on multiple keys though it could become quite nasty.

It's also dead simple to implement - if your manager says "can you put together something that lets us compare the current method with the old method? Take your time, my meeting isn't for another two hours..." then Type 3 is probably a good option.

This series is continued with commentary on Type 4.

Monday, August 22, 2016

DW Patterns: The good - type 2

I'm afraid this is going to be another short one - buying a house apparently takes time, and so does watching Netflix. It's a tough job, but someone's gotta do it.

Anyway, the topic today is Type 2 - probably the most misunderstood of dimensions. Since this is The Good, I'm going to be outlining just what it is good for, but just a word of warning - Type 2 SCD only solves one problem in terms of change tracking - it's not a magic "now we have all history" bullet, and under no circumstances should you apply it as a default pattern. If you do so, you risk doing some fairly hefty damage to your ability to perform meaningful analysis.

Type 2: Add new row

See definition on Wikipedia

This is the first kind of SCD where we actually do tracking of changes to dimension entities over time. An important thing to get your head around is the word "entity" - this is where you have a business object which forms the grain of your dimension table. If you have a dimension based around such an entity, you will probably have at least some attributes that you want to be type 2. As an example, you might have Dim_Student; the student may have names (which you probably don't want to make type 2 as I mentioned previously), an ID, probably demographics like ethnicity and age.

Type 2 will work for some of these, and not others. Some candidates are obvious - we know that age changes over time, and we want to be able to relate our facts to the age of the student at the time that the event that our fact measures occurred (e.g. enrolment). This works well because we have an attribute (age) of our entity (student) which changes over time, and most importantly the value changing over time represents a change in the entity over time.

Let's take another example - ethnicity. Students' ethnicity can change over time for a variety of reasons - maybe they didn't know their ancestry before, maybe they've reconnected with it, or in some cases they're claiming it because they realize it allows them to apply for certain scholarships. Whatever the reason, it doesn't represent a change in the student, it represents an update to what we know about the student. This is a Type 1 attribute.

"Hold on, we need to match what we report to the government, so it has to be Type 2."

Absolutely - in this case, your Dim_Student is not modelling a student - it's modelling a student record. It's very important that you are completely honest with yourself about what it is that you're modelling - student records have very different needs from actual students. Fear not, though, I'm not suggesting that you have a Dim_Student and a Dim_Student_Record dimension - you can have the best of both worlds and model both entities in one dimension, though you may wind up making some compromises in order to get that simplicity. Stay tuned to find out how! (Just kidding - it's type 6. I really will get there though...)

Continue your journey through my mind with DW Patterns: The good - type 3

Monday, August 15, 2016

DW Patterns: The good - types 0 - 1

In my previous post I talked about data warehousing design patterns and the idea that we might need to start branching out. Before we get to actually looking at potential new patterns, we should take stock of what patterns we currently have and what they're good for. This post and the next couple will be about the different types of Slowly Changing Dimensions, then we'll pick up on facts. Importantly, I will be looking at the SCDs as currently presented in Wikipedia.

Before I get into the details, I need to debunk a common misconception. No Slowly Changing Dimension is better than any other. The choice of which SCD to use depends on the situation. Type 2 is not "better" than type 1, unless the situation needs it. Type 0 is not "worse" than the other types, except in that it is less frequently required. So, without further ado:

Type 0: Retain Original

See definition on Wikipedia

This is where we never update data once it's gone into the table. The only example anyone's been able to provide me with in regards to this is the Date dimension, but that's kind of cheating because the source data doesn't really change, or if it does (e.g. holidays etc) then it's actually type 1. It is conceivable that it might come in handy though, and I'm sure you'll know if it does (please let me know in comments if you have come across any examples of this in the wild).

As I was writing the second-to-last paragraph of this article I realized that junk dimensions are another case for being called type 0, but since there's nothing really being updated in the source system (just different combinations over time) I'm not convinced that those can be legitimately called Type 0 either.

Type 1: Overwrite

See definition on Wikipedia

In a Type 1 attribute, if data in the source system changes then we overwrite it. In a solely type 1 table, date won't form part of the grain (unlike in type 2) unless you're talking about a date-related table, like the Date dimension.

Overwrite attributes are the singularly most over-used attribute type in data warehousing. This is because it's easy, and frequently acceptable. We're also not here to talk about when they shouldn't be used, so when should you use Type 1?

If you want to be able to find all historical facts for the latest version of something then you want Type 1. A great example here is that of people's names. If I go to look up a student's records, I don't want to have to specify every name they've ever had. That's just common sense. Of course, it depends on the situation; marital status is a highly situational example: in some situations it may be very important to know if a person was married at the time, while for other businesses you may only care about what they are now.

Another use for Type 1, as alluded to earlier, is if you just don't care. It's a lot of effort to set up type 2+, so if it's an attribute that you don't care about then type 1 may be what you want. If you don't think something's going to change at all, you might set it up as type 1 and rely on backups to let you backdate a type 2 fix if it turns out that it does change.

Notably, Type 1 dimensions only work for dimensions that represent entities. If you try to deal with a junk dimension (or any other grain-free dimension for that matter) with Type 1 you will have a bad time - don't do it.

Sorry this one's a bit short, I've had a busy weekend. Stay tuned for my next post at the end of the week continuing discussion about where our patterns do work.

Sunday, August 7, 2016

Let's talk about how Slowly we are Changing

Data warehouses are great. I better think that, I've made a career out of developing them. Like any software, they exist to provide value and solve problems, and like any software there are design patterns for solving particular problems. It is my opinion that these design patterns are too often applied without enough thought or understanding of exactly what they are for, and when they are not suitable - I will spend the next few posts explaining this, and suggesting some variations for particular issues that I seem to come across quite often, and which are not adequately solved by the existing data warehousing patterns.

Ok, there's the abstract over - what am I actually talking about here? The standard data warehousing tools you have in your tool-belt: transaction facts, snapshot facts and slowly changing dimensions (of various kinds). Sorry if some of this gets a little simple for you - I think it's important to go over the basics and investigate the guiding principles , especially since most DW specialists I know of picked their knowledge up on the job in one form or another, and have spent less time than they might thinking purely theory (I know I've been guilty of this, though I have also been fortunate enough to be surrounded by some really excellent people who like talking about things). So let's start with some definitions.

Data warehouses are made to allow analysis of data, particularly over time. This is why the date dimension is so important - in the words of Kimball, it should be attached to virtually every fact table, so that's cool.

Next, a fact is something that has happened - something measurable, even if the measurement is just the observation that it happened. This needs fleshing out more, but I'll get to that later, this is good enough for now.

The definition of the word "dimension" is a little tricky, because different people mean different things by it. Here are a couple:

A dimension is a a group of related attributes by which we measure our facts (space-time might be a dimension in this context). This is likely what a data warehousing specialist means when they say the word "dimension".
A dimension is the axis by which we measure things, for example the x axis in a graph, or the three dimensions that we live in, or time. Note that this is what is referred to as a dimension attribute above. This is the closest version to the non-jargon meaning of the word.
A dimension is a plane of existence, and you can create cool portals and kill versions of yourself in other dimensions to reap their power for yourself. I won't talk about this one any more, as it's irrelevant to us.

Now, even though we as data warehousing specialists typically use the first definition, we managed to mire ourselves with the term "slowly-changing dimension" or SCD. This is the most awkwardly named term I've run across in the field, because while it's definitely jargon, the "dimension" which is slowly changing is actually a dimension attribute according to our jargon, and that's without even getting in to the whole "slowly-changing" part. So that sucks, but it's good to make sure it's remember this, or it can and will cause confusion.

SCDs of course come in a variety, and exist for one reason only: to decide how the data warehouse should treat changing data in source systems. Different SCD types are suitable for different issues, and no SCD should be considered a "default" lest we make asses out of you and me. That, of course, doesn't stop us from treating Type 1 as a default, but let's pretend we don't do that for a while, because I want to talk about how much they don't work.

Of course, they do work, but they don't work in every situation, and I see people trying to make up for that with odd, ad-hoc solutions based on what seemed like a good idea at the time. This is painful to me, as that is exactly what design patterns are supposed to solve in the first place. Slowly-changing dimensions are seen as the only options, and that's not great, because we should be able to define new patterns (the same is true for fact tables - Kimball defines 3 fundamental types, but there are other options - something I will be getting in to over the next few weeks).

So, we have a claim: SCDs work in some situations and not in others. Obviously the question is, which situations work where? This is intimately wrapped up in the question of what fact tables work where, and I will start to get in to it thoroughly in my next post, but the underlying principle is that SCDs (like much Kimball doctrine, and for a reason) are really geared towards transaction fact tables, and quickly fall short in many real-world scenarios that I've come across. So, maybe it's time for us as a community to change a little less slowly, and start working with a few extra patterns.

See my next post on the topic for more specifics on how the patterns currently serve us