The Warehouse of Data: Sorry, my Dimension isn't having grains right now

You should always be careful about your grains. You should know that picking the grain of the fact table is critical - there is a lot of advice about this (see what Kimball says about grain for example). What about dimensions though?

I haven't seen a lot said about grains for dimension tables - maybe I'm looking in all the wrong places, or maybe it's something that people just don't talk about much, but lack of thought about the grain of dimension tables has bitten me in the backside historically. Maybe it's because the grain of dimension tables always seems like it should be straightforward and simple - have a Student dimension? Clearly the grain will be one row per student. Product dimension? One row per product. Nothing to think about, right?

Sure, except that the whole point of a dimensional model is to make it as intuitive and easy for end-users to find fields they want (or discover new ones) as possible. Where would a user go to view whether a student is an International student? Without thinking, it makes sense to go to the Student dimension right? But wait, in this hypothetical situation my school defines a student as International based on the study they are taking - and we have these students over here that are doing some International study and some Domestic study. Well, we have a few choices:

We can have a "mixed" definition
We can split it out into an extra dimension (say, a StudentProgramme dimension or StudentPaper or something)
We can have multiple rows in the dimension for each student
We can classify a student as the majority of what they study

Now, the first and fourth options are terrible, because suddenly we can't see what proportion of their study is International and what is Domestic. The third would mess with the grain of the Student dimension table so that's out, so we're left with option 2 - well, it kind of makes sense, we can stick all our other StudentPaper attributes on there too; let me just go find some attributes to justify it, I'm sure we can rustle up something.

Hang on, wait just a moment - did you see what we just did there? We made life worse for the users, in order to keep true to our modelling ideals. I'm not saying that's always a bad thing, but we should probably take a very close look at what we're talking about, since making users' lives easy should be our foremost ideal. So where did we go wrong? Or maybe "where can we compromise to produce the best result?" should be the question, as it is so often with these kinds of things.

Well, my opinion is that we should take option three, International-ness is (to our users) a property of a student (and for the vast majority of cases that holds true) so it should be on the student. This is not a case of a slowly-changing dimension, since the two papers can be studied concurrently. So, we're left splitting the grain. I think that's fine - let me explain how I would do (and have successfully done) this.

Keep the grain in a separate table

Create a table that has all of the natural keys that define the grain of the dimension, and include the surrogate key. When building the dimension table this is dead easy - build a temporary table that all of the dimension attributes including the natural keys, and use something like Dense_Rank to provide the surrogate key, as ordered by the columns that will actually be in the dimension table (only include the natural keys in this if they have business meaning - a Student ID might, but a purely system-internal id might not). Then, SELECT DISTINCT them all (excluding any natural keys that don't need to be in the dim table) into the dimension, and select just the surrogate key and natural keys into your separate "grain" table. Here's an example:

Dim_Student
SK_Student	StudentID	International_Indicator	...
1	1234	International	...
2	1234	Domestic	...

Grain_Student
SK_Student	StudentID	PaperID
1	1234	1963
1	1234	2345
2	1234	5684

So here, our grain is Per Student, Per Paper - in fact exactly what's in our grain table. What you might notice here is that had we kept our grain in the dimension, we'd have not only an extra column, making the dimension table wider, we'd also have extra rows. This has two effects (I haven't found any good sources as to the severity of the impact on these sorry). Our dimension table is wider which increases space used and reduces performance, and we will have more surrogate keys in the fact table, which also reduces performance - at least if you're using column-store indexes as you should be. This happens because the fewer values in a column, the better compression we can get in our column-store indexes, which translates into better performance.

The benefits don't stop at performance, though - there are two major "soft" benefits I see to this pattern. One is that it's really easy to join to from other tables (e.g. when building your fact tables) - you know exactly what columns you need, and there is not going to be any "wait, you're telling me I can't just join to DimStudent with StudentID? What a dumb design" kind of reflex by other people using the dimension. The other benefit being that it makes it less scary to change the grain of the dimension in the future, meaning you'll be more free to make good user-centric design decisions. This is part psychological and part that you can much more easily track down any references to Grain_Student, and you can be sure that every one of those references will need updating (not necessarily true for every reference to a dimension).

Check out my next post for SQL details as to how I would go about implementing this.

The Warehouse of Data

Friday, July 22, 2016

Sorry, my Dimension isn't having grains right now

Keep the grain in a separate table

No comments:

Post a Comment