The Warehouse of Data: December 2016

Saturday, December 24, 2016

DW Patterns: The good - a recap

Ok, the next obvious step for looking at DW design patterns is fact design patterns - which I will start on next time. First, however, I want to take a quick look back and summarize the dimension patterns.

So the main takeaway from this is that Type 2 is the master pattern, at least for the under-the-hood part. Thought should be put in as to whether your users actually want type 2, but even if they don't, it's still the way to go. Unless you're confident that you're users will never want it, or that your disk space will suffer too much from it, it should be your default implementation.

Let me just say that again: you should default to Type 2 and only consider not retaining the history in exceptional circumstances. This is because if you have the history in type 2, it's trivially easy to implement types 0, 1, 2, 3 and 6 as a view - plus it renders type 4 pointless. That leaves us with types 4K and 5K, which are both optimizations, and thus should only be considered in exceptional circumstances anyway.

So: if you're making an ETL automation tool, or anything like that, you should start with Type 2 and add others only as they are needed.

Stepping back from the individual patterns for a moment though, let's talk about what kinds of problems they solve and what assumptions they make about our data. First, the most important aspect to understand about SCDs - they do not store all history. The only thing any of the above SCDs track is changes to the entities represented by dimensions. This is great when your dimensions nicely fit into business concepts, but breaks down the moment you have something like, say, a junk dimension. Or a dimension with an awkwardly split scope (e.g. a student dimension that's split into two records - one for the student's international study, one for their domestic study). Or any other such information.

The weakness of SCDs for tracking scope doesn't stop there though - what if a fact record (an event) is updated so that it refers to a completely different dimension member (e.g. a student's enrolment is updated to change the paper they're enrolled in). Since no entity is changed, SCDs are hopeless here -we have to hope that our facts deal with it instead.

Sadly, the standard patterns don't do this, as we'll see...

Saturday, December 3, 2016

DW Patterns: The good - type 6

Alright, carrying on from last week where we looked at Type 5, this time we'll be spending a fair amount of word-count on...

Type 6: Hybrid

See definition on Wikipedia

Yay, mommy and daddy are speaking again! For Type 6, Kimball and Wikipedia are once again in agreement in terms of definitions. Type six is apparently named because it's like combining types 1, 2 and 3, which because it's such a perfect number both multiply and add to 6. The similarities are this:

It's like Type 1 because we have a type 1 column for the attribute
It's like Type 2 because we also have a type 2 column for the attribute
It's like Type 3 because there's more than one column. Yup.

I guess that sarcasm about type 3 isn't really warranted - it is kinda like taking Type 3 to the extreme - instead of "current and previous" or "current, previous and time-before-that", it's "current and at-the-time". I won't let that stop me making snarky remarks though, because that's just who I am.

Anyway, I've said before that if you're going to pick a default, you should pick type 6 rather than type 1 or 2. It also makes Type 5's "outrigger" redundant, since we can just lump the Type 1 attributes in the same spot as the Type 2s and be done with it. Not only that, but the same concept allows us to produce Type 3 attributes, if that's what you're going for.

Actually we get all of this from keeping Type 2 history, so that's the important bit. The reason that I like the Type 6 as a default is that if you're not going to make a decision about whether an attribute should be type 1 or 2, you should allow the user to do so. Otherwise, someone (or a lot of people) are going to think they're looking at one thing when they're actually looking at another. If they get confused and ask you what the current versus historical attributes mean, well great - you can explain and make sure they get it right.

Let's back up a minute though, because this blog hasn't had any SQL in it since my discussion of gluten-free dimension baking and I think it's time to rectify that, because I'm kind of horrified at the thought that someone might actually take Kimball or the wiki article's design seriously. To be clear, your table should look like this:

Dim_Student
SK_Table	TableID	Type2Attribute
1	1234	Original Value
2	1234	New Value
3	1234	Newer Value
4	1234	Newest Value

That's right - no Type 1 columns, no "last-value" columns, nothing, just Type 2. What you then do is calculate whatever else you need in the view. In terms of performance, if it start's getting slow, you can always throw an index or two on - it should be very easy to get the necessary indexes to make this trivial for almost any number of records. Let's have some examples though, using modern tsql windowing functions:

select

SK_Table,

TableID,

Type2Attribute,

FIRST_VALUE(Type2Attribute) over (partition by TableID order by SK_Table desc) as Type2AttributeAsType1,

Lag(Type2Attribute) over (partition by TableID order by SK_Table asc) as Type2AttributeAsType3

from type6.Dim_Something

Chuck this in a view, and bam, all the most useful actually functionally different dimension types (you can even achieve Type 0 by taking the Type 1 and ordering asc instead of desc). You'd probably want to order by something other than the surrogate key in practice (RowEffectiveDate or something like that) but the principle stands.

So that means that we can achieve Types 0, 1, 2, and 3. It's also like a better version of Type 4, and Type 4k is just a performance optimization for Type 2 dimensions, so who cares about that. We even make Type 5 redundant, since we can just use the mini-dimension and skip the outrigger. So really, what it comes down to is that Type 2 is god - as long as you know how to use it. And now you do!

So in terms of what it is and how to do it, that's all very well and good - but how about what it's useful for? Well... lots of things. The most common use case has to be "Type 2 is what we need, but Type 1 is so darn useful for filtering". What do I mean by that? Well, let's take one of the textbook cases of a Type 2 dimension attribute Employee Team (or department or whatever). You want to be able to have the team as it was when things happened, because that's important context. However, isn't it really handy to just be able to say "How do the people in Bob's team compare to the people in Terry's team?"

Moral questions of comparing people like this aside (it's just the first example that popped into my head - there are plenty of non-worrying questions like it), the Type 2 is necessary for being able to do analysis of the relevant attribute, but Type 1 is really nice entry point into the dimensional model. It's great to have as a filter in situations like these, where you want to evaluate the whole history of a group of individuals even before they were in the group you're using to identify them.

To take the flipped case, you actually find there could be a use case for Type 0 attributes when along-side a Type 2. Ever had someone ask something like "we have a lot of students dropping out of our BAs - where are they going?" You then proceeded to beat your head against the wall, and cry, because arbitrary digraphs aren't exactly one of data warehousing's strong points. With a Type 0/Type 2 combination, however, you can get at least an approximation of this, by filtering Programme Type 0 to "BA" and then grouping by the Type 2 dimension - bam! You've got an idea of how many students are transferring to which programme, and it didn't even take five minutes. Incidentally, I will get to discussing how you can deal with arbitrary digraphs in DWs and, specifically, SSAS/Power BI, since those are my technologies of choice.

DW Patterns: The good - type 5

Ok, so my dramatic return from hiatus didn't work out so well - turns out life is very life-like in that respect. This time I'm writing up a backlog so that when life happens I can continue updating, so there shouldn't be another hiatus until I actually run out of things to say.

Anyway, the last episode was about SCD 4 attributes, so today it's time for...

Type 5k: Add Mini-Dimension and a Type 1 Outrigger

See Kimball definition

Last time you may remember that Kimball and Wikipedia disagreed as to what SCD 4 meant (not just in details, but completely different concepts). Well, Type 5 doesn't even exist on the Wikipedia definition, so this time we just have to deal with the Kimball definition, so that's good.

What's not so good is that Kimball's Type 5 is even worse than the type 4 - it's an optimization on an optimization and the way they do it offends one of the most basic DW design sensibilities beaten in to me by Kimball themselves - don't snowflake. We're getting ahead of ourselves though - what is Type 5? Well, in Type 4 we pull out rapidly-changing Type 2 attributes into a separate table called a mini-dimension in order to spare us some space in the main dimension. Type 5 adds an "outrigger" to this, which is basically a snowflake from the original dimension containing Type 1 versions of the attributes in the mini-dimension. Why Kimball suggest you snowflake this I can't conceive - if it's that you want it attached to the original dimension, why not just add the Type 1 attributes there? If you want it as a separate dimension, just make it separate!

Anyway, as we'll see when we get to Type 6, there are better ways of doing this any way, even if you really must have the Type 4 mini-dimension, so I'd rate this SCD type as basically a non-pattern - why waste our consciousness remembering this when we could just attach a comment to the design and learn it on-the-spot if it's ever used?

I know this is supposed to be "The Good" of the SCDs, but I find myself hard-pressed to write anything good about this design.

The series continues with a discussion of the significantly more useful (despite not introducing any new concepts), Type 6