The Warehouse of Data: July 2016

In my last post on the topic of grain-free dimensions I talked about how it's possible to easily generate a "grain" table that provides access to the dimension when building a dimension. Here I'm going to get into the gritty details of how to do it - feel free to rip my SQL apart in the comments and suggest better methods for doing this, or completely alternative ways! I'm always keen to learn from others' experience.

Anyway, let's get started. I will continue with the example from last time, so let's kick this off with some revision. We started off with a Student dimension (with the grain defined by the StudentID), and wanted to add an International_Indicator column. In our imaginary school, this necessitates changing the grain of the Student table to include PaperID - since apparently International study is defined by the paper the student is studying. So, we want to wind up with something like this:

Dim_Student
SK_Student	StudentID	International_Indicator	...
1	1234	International	...
2	1234	Domestic	...

Grain_Student
SK_Student	StudentID	PaperID
1	1234	1963
1	1234	2345
2	1234	5684

Before we get into how we build this, a disclaimer: I'm afraid that I've only ever implemented this for medium-sized businesses where truncating and re-populating was the method of choice, so that's what I'm going to present here. I suppose that if you were persisting the tables you'd have to resort to a row-hashing technique, but that's out of scope for this post.

Ok, so how do we get there? Well, let's check out my database in SSMS - I've skipped the part of actually changing the dimension structure, since I think that's pretty trivial:

Now, there's an important aspect here that's not shown above: Dim_student.SK_Student is not an identity column - we will load in the values manually. Ok, so let's take a look at the code required to load this - I'll be using a SQL stored procedure to do it, but it should be easy to port to another method. The basic principle here is that we want to go through three steps: build a table including all of the columns (IDs, attributes and surrogate key). Then, load into the dimension, and load into the grain table. And here's the code that I use to do that:

CREATE PROCEDURE grainfreedimension.Load_Dim_Student

BEGIN

SET NOCOUNT ON;

truncate table grainfreedimension.Dim_Student

truncate table grainfreedimension.Grain_Student

select

Paper.PaperID,

Student.StudentID,

Student.Name as Student_Name,

case

when Paper.InternationalIndicator = 1 then 'International'

else 'Domestic'

end as International_Indicator,

DENSE_RANK() over (

order by

Student.StudentID,

Student.Name,

Paper.InternationalIndicator

) as SK_Student

into #transform

from grainfreedimension.Paper

inner join grainfreedimension.StudentPaper on Paper.PaperID = StudentPaper.PaperID

inner join grainfreedimension.Student on StudentPaper.StudentID = Student.StudentID

insert into grainfreedimension.Dim_Student (

SK_Student,

StudentID,

International_Indicator,

Student_Name

)

select distinct

SK_Student,

StudentID,

International_Indicator,

Student_Name

from #transform

insert into grainfreedimension.Grain_Student (

SK_Student,

StudentID,

PaperID

)

select distinct

SK_Student,

StudentID,

PaperID

from #transform

END

It's really pretty simple code - it's really just the use of the DENSE_RANK function, allowing us to build the transform table without having to do joins on a bunch of dimension attributes, that makes this really cool.

You should always be careful about your grains. You should know that picking the grain of the fact table is critical - there is a lot of advice about this (see what Kimball says about grain for example). What about dimensions though?

I haven't seen a lot said about grains for dimension tables - maybe I'm looking in all the wrong places, or maybe it's something that people just don't talk about much, but lack of thought about the grain of dimension tables has bitten me in the backside historically. Maybe it's because the grain of dimension tables always seems like it should be straightforward and simple - have a Student dimension? Clearly the grain will be one row per student. Product dimension? One row per product. Nothing to think about, right?

Sure, except that the whole point of a dimensional model is to make it as intuitive and easy for end-users to find fields they want (or discover new ones) as possible. Where would a user go to view whether a student is an International student? Without thinking, it makes sense to go to the Student dimension right? But wait, in this hypothetical situation my school defines a student as International based on the study they are taking - and we have these students over here that are doing some International study and some Domestic study. Well, we have a few choices:

We can have a "mixed" definition
We can split it out into an extra dimension (say, a StudentProgramme dimension or StudentPaper or something)
We can have multiple rows in the dimension for each student
We can classify a student as the majority of what they study

Now, the first and fourth options are terrible, because suddenly we can't see what proportion of their study is International and what is Domestic. The third would mess with the grain of the Student dimension table so that's out, so we're left with option 2 - well, it kind of makes sense, we can stick all our other StudentPaper attributes on there too; let me just go find some attributes to justify it, I'm sure we can rustle up something.

Hang on, wait just a moment - did you see what we just did there? We made life worse for the users, in order to keep true to our modelling ideals. I'm not saying that's always a bad thing, but we should probably take a very close look at what we're talking about, since making users' lives easy should be our foremost ideal. So where did we go wrong? Or maybe "where can we compromise to produce the best result?" should be the question, as it is so often with these kinds of things.

Well, my opinion is that we should take option three, International-ness is (to our users) a property of a student (and for the vast majority of cases that holds true) so it should be on the student. This is not a case of a slowly-changing dimension, since the two papers can be studied concurrently. So, we're left splitting the grain. I think that's fine - let me explain how I would do (and have successfully done) this.

Keep the grain in a separate table

Create a table that has all of the natural keys that define the grain of the dimension, and include the surrogate key. When building the dimension table this is dead easy - build a temporary table that all of the dimension attributes including the natural keys, and use something like Dense_Rank to provide the surrogate key, as ordered by the columns that will actually be in the dimension table (only include the natural keys in this if they have business meaning - a Student ID might, but a purely system-internal id might not). Then, SELECT DISTINCT them all (excluding any natural keys that don't need to be in the dim table) into the dimension, and select just the surrogate key and natural keys into your separate "grain" table. Here's an example:

Dim_Student
SK_Student	StudentID	International_Indicator	...
1	1234	International	...
2	1234	Domestic	...

Grain_Student
SK_Student	StudentID	PaperID
1	1234	1963
1	1234	2345
2	1234	5684

So here, our grain is Per Student, Per Paper - in fact exactly what's in our grain table. What you might notice here is that had we kept our grain in the dimension, we'd have not only an extra column, making the dimension table wider, we'd also have extra rows. This has two effects (I haven't found any good sources as to the severity of the impact on these sorry). Our dimension table is wider which increases space used and reduces performance, and we will have more surrogate keys in the fact table, which also reduces performance - at least if you're using column-store indexes as you should be. This happens because the fewer values in a column, the better compression we can get in our column-store indexes, which translates into better performance.

The benefits don't stop at performance, though - there are two major "soft" benefits I see to this pattern. One is that it's really easy to join to from other tables (e.g. when building your fact tables) - you know exactly what columns you need, and there is not going to be any "wait, you're telling me I can't just join to DimStudent with StudentID? What a dumb design" kind of reflex by other people using the dimension. The other benefit being that it makes it less scary to change the grain of the dimension in the future, meaning you'll be more free to make good user-centric design decisions. This is part psychological and part that you can much more easily track down any references to Grain_Student, and you can be sure that every one of those references will need updating (not necessarily true for every reference to a dimension).

Check out my next post for SQL details as to how I would go about implementing this.

The Warehouse of Data

Saturday, July 30, 2016

How to bake a coeliac-friendly dimension

Friday, July 22, 2016

Sorry, my Dimension isn't having grains right now

Keep the grain in a separate table