Saturday, December 3, 2016

DW Patterns: The good - type 6

Alright, carrying on from last week where we looked at Type 5, this time we'll be spending a fair amount of word-count on...

Type 6: Hybrid

Yay, mommy and daddy are speaking again! For Type 6, Kimball and Wikipedia are once again in agreement in terms of definitions. Type six is apparently named because it's like combining types 1, 2 and 3, which because it's such a perfect number both multiply and add to 6. The similarities are this:
  • It's like Type 1 because we have a type 1 column for the attribute
  • It's like Type 2 because we also have a type 2 column for the attribute
  • It's like Type 3 because there's more than one column. Yup.
I guess that sarcasm about type 3 isn't really warranted - it is kinda like taking Type 3 to the extreme - instead of "current and previous" or "current, previous and time-before-that", it's "current and at-the-time". I won't let that stop me making snarky remarks though, because that's just who I am.

Anyway, I've said before that if you're going to pick a default, you should pick type 6 rather than type 1 or 2. It also makes Type 5's "outrigger" redundant, since we can just lump the Type 1 attributes in the same spot as the Type 2s and be done with it. Not only that, but the same concept allows us to produce Type 3 attributes, if that's what you're going for.

Actually we get all of this from keeping Type 2 history, so that's the important bit. The reason that I like the Type 6 as a default is that if you're not going to make a decision about whether an attribute should be type 1 or 2, you should allow the user to do so. Otherwise, someone (or a lot of people) are going to think they're looking at one thing when they're actually looking at another. If they get confused and ask you what the current versus historical attributes mean, well great - you can explain and make sure they get it right.

Let's back up a minute though, because this blog hasn't had any SQL in it since my discussion of gluten-free dimension baking and I think it's time to rectify that, because I'm kind of horrified at the thought that someone might actually take Kimball or the wiki article's design seriously. To be clear, your table should look like this:

Dim_Student
SK_TableTableIDType2Attribute
11234Original Value
21234New Value
31234Newer Value
41234Newest Value

That's right - no Type 1 columns, no "last-value" columns, nothing, just Type 2. What you then do is calculate whatever else you need in the view. In terms of performance, if it start's getting slow, you can always throw an index or two on - it should be very easy to get the necessary indexes to make this trivial for almost any number of records. Let's have some examples though, using modern tsql windowing functions:

select
       SK_Table,
       TableID,
       Type2Attribute,
       FIRST_VALUE(Type2Attribute) over (partition by TableID order by SK_Table desc) as Type2AttributeAsType1,
       Lag(Type2Attribute) over (partition by TableID order by SK_Table asc) as Type2AttributeAsType3
from type6.Dim_Something

Chuck this in a view, and bam, all the most useful actually functionally different dimension types (you can even achieve Type 0 by taking the Type 1 and ordering asc instead of desc). You'd probably want to order by something other than the surrogate key in practice (RowEffectiveDate or something like that) but the principle stands.

So that means that we can achieve Types 0, 1, 2,  and 3. It's also like a better version of Type 4, and Type 4k is just a performance optimization for Type 2 dimensions, so who cares about that. We even make Type 5 redundant, since we can just use the mini-dimension and skip the outrigger. So really, what it comes down to is that Type 2 is god - as long as you know how to use it. And now you do!

So in terms of what it is and how to do it, that's all very well and good - but how about what it's useful for? Well... lots of things. The most common use case has to be "Type 2 is what we need, but Type 1 is so darn useful for filtering". What do I mean by that? Well, let's take one of the textbook cases of a Type 2 dimension attribute Employee Team (or department or whatever). You want to be able to have the team as it was when things happened, because that's important context. However, isn't it really handy to just be able to say "How do the people in Bob's team compare to the people in Terry's team?"

Moral questions of comparing people like this aside (it's just the first example that popped into my head - there are plenty of non-worrying questions like it), the Type 2 is necessary for being able to do analysis of the relevant attribute, but Type 1 is really nice entry point into the dimensional model. It's great to have as a filter in situations like these, where you want to evaluate the whole history of a group of individuals even before they were in the group you're using to identify them.

To take the flipped case, you actually find there could be a use case for Type 0 attributes when along-side a Type 2. Ever had someone ask something like "we have a lot of students dropping out of our BAs - where are they going?" You then proceeded to beat your head against the wall, and cry, because arbitrary digraphs aren't exactly one of data warehousing's strong points. With a Type 0/Type 2 combination, however, you can get at least an approximation of this, by filtering Programme Type 0 to "BA" and then grouping by the Type 2 dimension - bam! You've got an idea of how many students are transferring to which programme, and it didn't even take five minutes. Incidentally, I will get to discussing how you can deal with arbitrary digraphs in DWs and, specifically, SSAS/Power BI, since those are my technologies of choice.

No comments:

Post a Comment