Saturday, September 17, 2016

DW Patterns: The good - type 4

I have to apologize - it's been three weeks since my last update. This is because two weeks ago my fiancée and I bought a house. I know that's a pretty pathetic excuse for being busy, but there it is. Anywho, I'm not here to lament my debt, I'm here to continue my series on data warehousing design patterns. I last left off at Type 3, so the next in line is...

Type 4: ???

Ok so there's a bid of disagreement from here on. Wikipedia says Type 4 is "Add history table" while Kimball Group say that it's "Add mini-dimension". Anyway, it's not my ambition to disambiguate this (currently) so I'm going to cover both here - in which case let's get on with it:

Type 4: Add history table

See definition on Wikipedia

This is just how it sounds - you keep a history table mirroring the dimension, containing the history of the data in that table. Now, the Wikipedia article (currently) isn't completely clear as to how the temporal table is referenced in the fact table ("Both the surrogate keys are referenced in the Fact table to enhance query performance"), meaning there are a couple of interpretations:
  • The fact table only contains references to the "current" record (i.e. type 1) and the history table doesn't have direct links (hereafter referenced as type 4a)
  • The fact table contains two references - one to the "current" record (i.e. type 1) and another to the "as-at" record (i.e. type 2) (hereafter referenced as type 4b)
Now, I haven't used either of these before, so I'm going to be commenting based on comparing these strategies with problems I've encountered before. 4b is interesting, but isn't terrifically user-friendly - imagine trying to use a pivot table against a cube with 20 dimensions, where there are two copies of each! It is, however, a relatively easy-to-implement version of Type 6, for every attribute on the dimension, which might be quite attractive in certain situations. I wouldn't recommend using this for every dimension, but one or two could actually improve usability - I could imagine using this for something like a Course dimension, where lots might change (e.g. the name or even the code, which some might consider the natural key) and users want to be able to do analysis on the current state or look up a specific version.
Type 4a is actually really cool, and if you're using SQL server 2016 then this actually comes out of the box with Temporal Tables. Just whip up a Type 1 dimension, make it a Temporal Table and you're done - simple as that. The reason I like this is that if I were working on 2016 and wasn't storing history of my data in another way, I would make Type 4a my default - it's strictly better than Type 1. This is because if you've made everything a type 4a and your BA decides "now X attribute needs to be type 2/4a/6/etc. then you have the necessary information to backdate it. Otherwise you're stuck in the frustrating situation of having to say "we can make it Type 2, but only going forwards - it won't have history from before today" which is a conversation I always hate having.


Alright cool, so what about the Kimball one, which I'm going to refer to as...

Type 4k: Add mini-dimension

I'm not going to talk much about this, as the mini-dimension is purely an optimization technique. The idea is to pull out a few very volatile or extremely frequently queried attributes from a Type 2 dimension and form them into their own junk dimension. Because it's an optimization technique only, I'm not a huge fan, and I would suggest looking at indexes or checking your dimensional architecture to see if you've made a dimension granularity mistake. If, however, your dimension meets all of the following conditions, you might consider pulling some attributes out into a mini-dimension:
  • The dimension you're looking at is Type 2-ish (2, 4a, 4b, 6 etc...)
  • The dimension is extremely large (to the point that it slows queries significantly) and a few attributes are in high demand (e.g. an invoice line-item dimension for a large company where item-type and quantity banding is used a lot)
  • Or the dimension is unmanageably fast-growing due to a few attributes (I can't even think of an example that I don't immediately rule out as a bad design for this)
  • User-transparent methods for optimizing (e.g. indexing, stats etc.) have failed to provide adequate performance increases
So it could be useful occasionally, but it's not exactly something I'd teach in Data Warehousing 101...

Stick with us as I continue this series on existing DW design patterns by looking in to SCD Type 5

No comments:

Post a Comment