Saturday, December 24, 2016

DW Patterns: The good - a recap

Ok, the next obvious step for looking at DW design patterns is fact design patterns - which I will start on next time. First, however, I want to take a quick look back and summarize the dimension patterns.


So the main takeaway from the dimension patterns is that Type 2 is the master pattern, at least for the under-the-hood part. Thought should be put into whether your users actually want Type 2, but even if they don't, it's still the way to go. Unless you're confident that your users will never want it, or that your disk space will suffer too much from it, it should be your default implementation.

Let me just say that again: you should default to Type 2 and only consider not retaining the history in exceptional circumstances. This is because if you have the history in type 2, it's trivially easy to implement types 0, 1, 2, 3 and 6 as a view - plus it renders type 4 pointless. That leaves us with types 4k and 5k, which are both optimizations, and thus should only be considered in exceptional circumstances anyway.

So: if you're making an ETL automation tool, or anything like that, you should start with Type 2 and add others only as they are needed.


Stepping back from the individual patterns for a moment though, let's talk about what kinds of problems they solve and what assumptions they make about our data. First, the most important aspect to understand about SCDs - they do not store all history. The only thing any of the above SCDs track is changes to the entities represented by dimensions. This is great when your dimensions nicely fit into business concepts, but breaks down the moment you have something like, say, a junk dimension. Or a dimension with an awkwardly split scope (e.g. a student dimension that's split into two records - one for the student's international study, one for their domestic study). Or anything else where the rows don't map neatly onto a single business entity.

The weakness of SCDs for tracking scope doesn't stop there though - what if a fact record (an event) is updated so that it refers to a completely different dimension member (e.g. a student's enrolment is updated to change the paper they're enrolled in)? Since no entity is changed, SCDs are hopeless here - we have to hope that our facts deal with it instead.

Sadly, the standard patterns don't do this, as we'll see...

Saturday, December 3, 2016

DW Patterns: The good - type 6

Alright, carrying on from last week where we looked at Type 5, this time we'll be spending a fair amount of word-count on...

Type 6: Hybrid

Yay, mommy and daddy are speaking again! For Type 6, Kimball and Wikipedia are once again in agreement in terms of definitions. Type 6 is apparently so named because it combines types 1, 2 and 3, which - 6 being such a perfect number - both multiply and add to 6. The similarities are these:
  • It's like Type 1 because we have a type 1 column for the attribute
  • It's like Type 2 because we also have a type 2 column for the attribute
  • It's like Type 3 because there's more than one column. Yup.
I guess that sarcasm about type 3 isn't really warranted - it is kinda like taking Type 3 to the extreme - instead of "current and previous" or "current, previous and time-before-that", it's "current and at-the-time". I won't let that stop me making snarky remarks though, because that's just who I am.

Anyway, I've said before that if you're going to pick a default, you should pick type 6 rather than type 1 or 2. It also makes Type 5's "outrigger" redundant, since we can just lump the Type 1 attributes in the same spot as the Type 2s and be done with it. Not only that, but the same concept allows us to produce Type 3 attributes, if that's what you're going for.

Actually we get all of this from keeping Type 2 history, so that's the important bit. The reason that I like Type 6 as a default is that if you're not going to make a decision about whether an attribute should be type 1 or 2, you should allow the user to do so. Otherwise, someone (or a lot of people) is going to think they're looking at one thing when they're actually looking at another. If they get confused and ask you what the current versus historical attributes mean, well great - you can explain and make sure they get it right.

Let's back up a minute though, because this blog hasn't had any SQL in it since my discussion of gluten-free dimension baking, and I think it's time to rectify that - I'm kind of horrified at the thought that someone might actually take Kimball or the wiki article's design seriously. To be clear, your table should look like this:

Dim_Student
SK_Table  TableID  Type2Attribute
1         1234     Original Value
2         1234     New Value
3         1234     Newer Value
4         1234     Newest Value

That's right - no Type 1 columns, no "last-value" columns, nothing, just Type 2. What you then do is calculate whatever else you need in the view. In terms of performance, if it starts getting slow, you can always throw an index or two on - it should be very easy to get the necessary indexes to make this trivial for almost any number of records. Let's have some examples though, using modern T-SQL windowing functions:

select
       SK_Table,
       TableID,
       Type2Attribute,
       FIRST_VALUE(Type2Attribute) over (partition by TableID order by SK_Table desc) as Type2AttributeAsType1,
       Lag(Type2Attribute) over (partition by TableID order by SK_Table asc) as Type2AttributeAsType3
from type6.Dim_Something

Chuck this in a view, and bam - all the most useful, actually functionally different dimension types (you can even achieve Type 0 by taking the Type 1 calculation and ordering asc instead of desc). You'd probably want to order by something other than the surrogate key in practice (RowEffectiveDate or something like that) but the principle stands.
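
To make that concrete, here's what the whole thing might look like wrapped up as a view - a minimal sketch only, with a made-up view name, and assuming a RowEffectiveDate column exists on the dimension (it isn't in the table above):

create view type6.Dim_Something_AllTypes
as
select
       SK_Table,
       TableID,
       Type2Attribute,  -- as-at value (Type 2)
       FIRST_VALUE(Type2Attribute) over (partition by TableID order by RowEffectiveDate desc) as Type2AttributeAsType1, -- latest value (Type 1)
       FIRST_VALUE(Type2Attribute) over (partition by TableID order by RowEffectiveDate asc) as Type2AttributeAsType0,  -- original value (Type 0)
       LAG(Type2Attribute) over (partition by TableID order by RowEffectiveDate asc) as Type2AttributeAsType3           -- previous value (Type 3)
from type6.Dim_Something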

So that means that we can achieve Types 0, 1, 2 and 3. It's also like a better version of Type 4, and Type 4k is just a performance optimization for Type 2 dimensions, so who cares about that. We even make Type 5 redundant, since we can just use the mini-dimension and skip the outrigger. So really, what it comes down to is that Type 2 is god - as long as you know how to use it. And now you do!

So in terms of what it is and how to do it, that's all very well and good - but how about what it's useful for? Well... lots of things. The most common use case has to be "Type 2 is what we need, but Type 1 is so darn useful for filtering". What do I mean by that? Well, let's take one of the textbook cases of a Type 2 dimension attribute: Employee Team (or department, or whatever). You want to be able to have the team as it was when things happened, because that's important context. However, isn't it really handy to just be able to say "How do the people in Bob's team compare to the people in Terry's team?"

Moral questions of comparing people like this aside (it's just the first example that popped into my head - there are plenty of non-worrying questions like it), the Type 2 is necessary for being able to do analysis of the relevant attribute, but the Type 1 is a really nice entry point into the dimensional model. It's great to have as a filter in situations like these, where you want to evaluate the whole history of a group of individuals even before they were in the group you're using to identify them.
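
As a sketch of what that looks like in practice - assuming a hypothetical Fact_Timesheet and a Dim_Employee view exposing both the Type 2 Team and a derived TeamAsType1 column, along the lines of the view above - the query is just a filter on one and a group-by on the other:

select
       e.TeamAsType1 as Current_Team,      -- Type 1: the team they're in now
       e.Team as Team_At_The_Time,         -- Type 2: the team they were in when the work happened
       sum(f.Hours_Worked) as Hours_Worked
from dbo.Fact_Timesheet f
inner join dbo.Dim_Employee e on f.SK_Employee = e.SK_Employee
where e.TeamAsType1 in ('Bob''s Team', 'Terry''s Team')
group by
       e.TeamAsType1,
       e.Team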

To take the flipped case, you actually find there could be a use case for Type 0 attributes when placed alongside a Type 2. Ever had someone ask something like "we have a lot of students dropping out of our BAs - where are they going?" You then proceeded to beat your head against the wall, and cry, because arbitrary digraphs aren't exactly one of data warehousing's strong points. With a Type 0/Type 2 combination, however, you can get at least an approximation of this, by filtering the Type 0 Programme to "BA" and then grouping by the Type 2 attribute - bam! You've got an idea of how many students are transferring to which programme, and it didn't even take five minutes. Incidentally, I will get to discussing how you can deal with arbitrary digraphs in DWs and, specifically, SSAS/Power BI, since those are my technologies of choice.
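
For the sake of illustration, that query might look something like the sketch below - the table and column names (Fact_Enrolment, ProgrammeAsType0 and so on) are made up for the example:

select
       s.Programme as Programme_Transferred_To,
       count(distinct s.StudentID) as Students
from dbo.Fact_Enrolment f
inner join dbo.Dim_Student s on f.SK_Student = s.SK_Student
where s.ProgrammeAsType0 = 'BA'   -- started out in a BA
  and s.Programme <> 'BA'         -- but the as-at programme is now something else
group by s.Programme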

DW Patterns: The good - type 5

Ok, so my dramatic return from hiatus didn't work out so well - turns out life is very life-like in that respect. This time I'm writing up a backlog so that when life happens I can continue updating, so there shouldn't be another hiatus until I actually run out of things to say.

Anyway, the last episode was about SCD 4 attributes, so today it's time for...


Type 5k: Add Mini-Dimension and a Type 1 Outrigger

Last time you may remember that Kimball and Wikipedia disagreed as to what SCD 4 meant (not just in details, but as completely different concepts). Well, Type 5 doesn't even exist in the Wikipedia definition, so this time we just have to deal with the Kimball definition, so that's good.

What's not so good is that Kimball's Type 5 is even worse than the Type 4 - it's an optimization on an optimization, and the way they do it offends one of the most basic DW design sensibilities beaten into me by Kimball themselves - don't snowflake. We're getting ahead of ourselves though - what is Type 5? Well, in Type 4k we pull out rapidly-changing Type 2 attributes into a separate table called a mini-dimension in order to spare us some space in the main dimension. Type 5 adds an "outrigger" to this, which is basically a snowflake from the original dimension containing Type 1 versions of the attributes in the mini-dimension. Why Kimball suggests you snowflake this I can't conceive - if it's that you want it attached to the original dimension, why not just add the Type 1 attributes there? If you want it as a separate dimension, just make it separate!
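
Just so the shape of the thing is clear, here's a rough sketch of what the Kimball layout ends up looking like - all of the names are made up, and the mini-dimension is the same sort of table described in the Type 4k post:

create table dbo.Dim_CustomerDemographics (   -- the Type 4k mini-dimension
       SK_Demographics int not null primary key,
       Age_Band varchar(20) not null,
       Income_Band varchar(20) not null
)

create table dbo.Dim_Customer (               -- the base Type 2 dimension
       SK_Customer int not null primary key,
       CustomerID int not null,
       Customer_Name varchar(100) not null,
       SK_CurrentDemographics int not null    -- Type 1 "current" pointer: this is the outrigger/snowflake
              references dbo.Dim_CustomerDemographics (SK_Demographics)
)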

Anyway, as we'll see when we get to Type 6, there are better ways of doing this anyway, even if you really must have the Type 4 mini-dimension, so I'd rate this SCD type as basically a non-pattern - why waste our consciousness remembering this when we could just attach a comment to the design and learn it on-the-spot if it's ever used?

I know this is supposed to be "The Good" of the SCDs, but I find myself hard-pressed to write anything good about this design.

The series continues with a discussion of the significantly more useful (despite not introducing any new concepts) Type 6.

Saturday, September 17, 2016

DW Patterns: The good - type 4

I have to apologize - it's been three weeks since my last update. This is because two weeks ago my fiancée and I bought a house. I know that's a pretty pathetic excuse for being busy, but there it is. Anywho, I'm not here to lament my debt, I'm here to continue my series on data warehousing design patterns. I last left off at Type 3, so the next in line is...

Type 4: ???

Ok so there's a bit of disagreement from here on. Wikipedia says Type 4 is "Add history table" while Kimball Group say that it's "Add mini-dimension". Anyway, it's not my ambition to disambiguate this (currently) so I'm going to cover both here - in which case let's get on with it:

Type 4: Add history table

See definition on Wikipedia

This is just how it sounds - you keep a history table mirroring the dimension, containing the history of the data in that table. Now, the Wikipedia article (currently) isn't completely clear as to how the history table is referenced in the fact table ("Both the surrogate keys are referenced in the Fact table to enhance query performance"), meaning there are a couple of interpretations:
  • The fact table only contains references to the "current" record (i.e. type 1) and the history table doesn't have direct links (hereafter referenced as type 4a)
  • The fact table contains two references - one to the "current" record (i.e. type 1) and another to the "as-at" record (i.e. type 2) (hereafter referenced as type 4b)
Now, I haven't used either of these before, so I'm going to be commenting based on comparing these strategies with problems I've encountered before. 4b is interesting, but isn't terrifically user-friendly - imagine trying to use a pivot table against a cube with 20 dimensions, where there are two copies of each! It is, however, a relatively easy-to-implement version of Type 6, for every attribute on the dimension, which might be quite attractive in certain situations. I wouldn't recommend using this for every dimension, but one or two could actually improve usability - I could imagine using this for something like a Course dimension, where lots might change (e.g. the name or even the code, which some might consider the natural key) and users want to be able to do analysis on the current state or look up a specific version.
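
As a sketch of the 4b interpretation (names are illustrative only, borrowing the Course example), the fact simply carries two keys for the same logical dimension - one pointing at the current row in the main dimension table, one at the as-at row in the history table:

create table dbo.Fact_Something (
       SK_Course_Current int not null,  -- points at the current row in Dim_Course (Type 1-style)
       SK_Course_AsAt int not null,     -- points at the as-at row in Dim_Course_History (Type 2-style)
       SK_Date int not null,
       Enrolment_Count int not null
)
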
Type 4a is actually really cool, and if you're using SQL Server 2016 then this actually comes out of the box with Temporal Tables. Just whip up a Type 1 dimension, make it a Temporal Table and you're done - simple as that. The reason I like this is that if I were working on 2016 and wasn't storing history of my data in another way, I would make Type 4a my default - it's strictly better than Type 1. This is because if you've made everything a Type 4a and your BA decides "now X attribute needs to be type 2/4a/6/etc." then you have the necessary information to backdate it. Otherwise you're stuck in the frustrating situation of having to say "we can make it Type 2, but only going forwards - it won't have history from before today", which is a conversation I always hate having.
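
To illustrate, here's a minimal sketch of what that looks like in 2016 - the table and column names are mine, and SQL Server maintains the history table for you:

create table dbo.Dim_Student (
       SK_Student int identity(1,1) not null primary key,
       StudentID int not null,
       Student_Name varchar(100) not null,
       ValidFrom datetime2 generated always as row start not null,
       ValidTo datetime2 generated always as row end not null,
       period for system_time (ValidFrom, ValidTo)
)
with (system_versioning = on (history_table = dbo.Dim_Student_History))

-- and to see the dimension as it was at some point in time:
-- select * from dbo.Dim_Student for system_time as of '2016-01-01'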


Alright cool, so what about the Kimball one, which I'm going to refer to as...

Type 4k: Add mini-dimension

I'm not going to talk much about this, as the mini-dimension is purely an optimization technique. The idea is to pull out a few very volatile or extremely frequently queried attributes from a Type 2 dimension and form them into their own junk dimension. Because it's an optimization technique only, I'm not a huge fan, and I would suggest looking at indexes or checking your dimensional architecture to see if you've made a dimension granularity mistake. If, however, your dimension meets all of the following conditions, you might consider pulling some attributes out into a mini-dimension:
  • The dimension you're looking at is Type 2-ish (2, 4a, 4b, 6 etc...)
  • The dimension is extremely large (to the point that it slows queries significantly) and a few attributes are in high demand (e.g. an invoice line-item dimension for a large company where item-type and quantity banding is used a lot)
  • Or the dimension is unmanageably fast-growing due to a few attributes (I can't even think of an example that I don't immediately rule out as a bad design for this)
  • User-transparent methods for optimizing (e.g. indexing, stats etc.) have failed to provide adequate performance increases
So it could be useful occasionally, but it's not exactly something I'd teach in Data Warehousing 101...
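
For what it's worth, the distinguishing feature of the 4k layout is that the mini-dimension gets its own surrogate key directly on the fact table, alongside the base dimension's key - a rough sketch with made-up names:

create table dbo.Fact_Sales (
       SK_Customer int not null,       -- the big, slower-moving Type 2 dimension
       SK_Demographics int not null,   -- the small, volatile attributes split out into the mini-dimension
       SK_Date int not null,
       Sales_Amount money not null
)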

Stick with us as I continue this series on existing DW design patterns by looking into SCD Type 5.

Monday, August 29, 2016

DW Patterns: The good - type 3

Ok, today I will be continuing on from last week - this time talking about the first of the less-well-known SCD types:

Type 3: Add new attribute

See definition on Wikipedia

This is an interesting pattern - of all of the SCD patterns that actually retain attribute history, this is the only one that provides easy access to any attribute value other than the current value (i.e. type 1) or the as-at value (i.e. type 2). With type 3 we get the current value, as well as one or more immediately prior values. So, why would you want to use this?

The Type 3 is great for transitions. Now, I'm not talking about transitions from one entity state to another - if you're wanting that, you're almost certainly better off with Type 2, 4 or 6 (coming in the next post). Those are great at managing the changes to entities as they interact with your facts. What I mean is changes in business state; things like changes to business rules, changes to organisational structure and so on.

Type 3 is a good tool for evaluating the effectiveness of changes - you can compare, for example, the effectiveness of a new workload distribution method vs. the previous method, side-by-side. Once you've decided that the new method is better, the old doesn't matter - and when you're then assessing another new method later, you probably don't care what the old, worse method was - that history is unnecessary. You might even decide, once your evaluation period is over, that the field doesn't need to be Type 3 any more, and get rid of the "previous value" attribute altogether, which is fine.

This really comes into its own when the grain of a dimension becomes complicated - if the grain is a single key, then you can quite happily implement Type 3 over the top of a Type 2 or 6 by using a view; if it involves matching on multiple keys, though, that view could become quite nasty.

It's also dead simple to implement - if your manager says "can you put together something that lets us compare the current method with the old method? Take your time, my meeting isn't for another two hours..." then Type 3 is probably a good option.
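
For the simple case, here's a minimal sketch of what "dead simple" means - assuming a hypothetical Dim_Team with a Distribution_Method attribute and a staging.Team source table:

alter table dbo.Dim_Team
       add Previous_Distribution_Method varchar(50) null
GO

-- on each load, push the current value into the "previous" column before overwriting it
update d
set    d.Previous_Distribution_Method = d.Distribution_Method,
       d.Distribution_Method = s.Distribution_Method
from dbo.Dim_Team d
inner join staging.Team s on d.TeamID = s.TeamID
where d.Distribution_Method <> s.Distribution_Method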

This series is continued with commentary on Type 4.

Monday, August 22, 2016

DW Patterns: The good - type 2

I'm afraid this is going to be another short one - buying a house apparently takes time, and so does watching Netflix. It's a tough job, but someone's gotta do it.

Anyway, the topic today is Type 2 - probably the most misunderstood of dimensions. Since this is The Good, I'm going to be outlining just what it is good for, but just a word of warning - Type 2 SCD only solves one problem in terms of change tracking - it's not a magic "now we have all history" bullet, and under no circumstances should you apply it as a default pattern. If you do so, you risk doing some fairly hefty damage to your ability to perform meaningful analysis.

Type 2: Add new row

See definition on Wikipedia

This is the first kind of SCD where we actually do tracking of changes to dimension entities over time. An important thing to get your head around is the word "entity" - this is where you have a business object which forms the grain of your dimension table. If you have a dimension based around such an entity, you will probably have at least some attributes that you want to be type 2. As an example, you might have Dim_Student; the student may have names (which you probably don't want to make type 2 as I mentioned previously), an ID, probably demographics like ethnicity and age.

Type 2 will work for some of these, and not others. Some candidates are obvious - we know that age changes over time, and we want to be able to relate our facts to the age of the student at the time that the event that our fact measures occurred (e.g. enrolment). This works well because we have an attribute (age) of our entity (student) which changes over time, and, most importantly, the changing value represents a change in the entity itself.
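
This post isn't really about the mechanics, but for completeness, here's a minimal sketch of how a Type 2 change is usually applied - assuming hypothetical RowEffectiveDate/RowEndDate/Is_Current housekeeping columns, an identity surrogate key, and a staging.Student source table:

-- 1. Expire the current row for any student whose Type 2 attribute has changed
update d
set    d.RowEndDate = getdate(),
       d.Is_Current = 0
from dbo.Dim_Student d
inner join staging.Student s on d.StudentID = s.StudentID
where d.Is_Current = 1
  and d.Age <> s.Age

-- 2. Insert a fresh row for any student with no current row left
--    (covers brand-new students and the rows expired in step 1)
insert into dbo.Dim_Student (StudentID, Age, RowEffectiveDate, RowEndDate, Is_Current)
select s.StudentID, s.Age, getdate(), '9999-12-31', 1
from staging.Student s
left join dbo.Dim_Student d on d.StudentID = s.StudentID and d.Is_Current = 1
where d.StudentID is null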

Let's take another example - ethnicity. Students' ethnicity can change over time for a variety of reasons - maybe they didn't know their ancestry before, maybe they've reconnected with it, or in some cases they're claiming it because they realize it allows them to apply for certain scholarships. Whatever the reason, it doesn't represent a change in the student, it represents an update to what we know about the student. This is a Type 1 attribute.

"Hold on, we need to match what we report to the government, so it has to be Type 2."
Absolutely - in this case, your Dim_Student is not modelling a student - it's modelling a student record. It's very important that you are completely honest with yourself about what it is that you're modelling - student records have very different needs from actual students. Fear not, though, I'm not suggesting that you have a Dim_Student and a Dim_Student_Record dimension - you can have the best of both worlds and model both entities in one dimension, though you may wind up making some compromises in order to get that simplicity. Stay tuned to find out how! (Just kidding - it's type 6. I really will get there though...)

Continue your journey through my mind with DW Patterns: The good - type 3

Monday, August 15, 2016

DW Patterns: The good - types 0 - 1

In my previous post I talked about data warehousing design patterns and the idea that we might need to start branching out. Before we get to actually looking at potential new patterns, we should take stock of what patterns we currently have and what they're good for. This post and the next couple will be about the different types of Slowly Changing Dimensions, then we'll pick up on facts. Importantly, I will be looking at the SCDs as currently presented in Wikipedia.

Before I get into the details, I need to debunk a common misconception. No Slowly Changing Dimension is better than any other. The choice of which SCD to use depends on the situation. Type 2 is not "better" than type 1, unless the situation needs it. Type 0 is not "worse" than the other types, except in that it is less frequently required. So, without further ado:

Type 0: Retain Original


This is where we never update data once it's gone into the table. The only example anyone's been able to provide me with in regard to this is the Date dimension, but that's kind of cheating because the source data doesn't really change, or if it does (e.g. holidays, etc.) then it's actually type 1. It is conceivable that it might come in handy though, and I'm sure you'll know if it does (please let me know in comments if you have come across any examples of this in the wild).

As I was writing the second-to-last paragraph of this article I realized that junk dimensions are another candidate for being called Type 0, but since there's nothing really being updated in the source system (just different combinations over time) I'm not convinced that those can legitimately be called Type 0 either.

Type 1: Overwrite


In a Type 1 attribute, if data in the source system changes then we overwrite it. In a solely type 1 table, date won't form part of the grain (unlike in type 2) unless you're talking about a date-related table, like the Date dimension.

Overwrite attributes are the single most over-used attribute type in data warehousing. This is because they're easy, and frequently acceptable. We're also not here to talk about when they shouldn't be used, so when should you use Type 1?

If you want to be able to find all historical facts for the latest version of something then you want Type 1. A great example here is that of people's names. If I go to look up a student's records, I don't want to have to specify every name they've ever had. That's just common sense. Of course, it depends on the situation; marital status is a good example: for some businesses it may be very important to know if a person was married at the time, while for others you may only care about what they are now.
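
For concreteness, here's a minimal sketch of the overwrite itself, using the names example and a hypothetical staging.Student source table:

-- just update the attribute in place wherever the source now says something different
update d
set    d.Student_Name = s.Student_Name
from dbo.Dim_Student d
inner join staging.Student s on d.StudentID = s.StudentID
where d.Student_Name <> s.Student_Name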

Another use for Type 1, as alluded to earlier, is if you just don't care. It's a lot of effort to set up type 2+, so if it's an attribute that you don't care about then type 1 may be what you want. If you don't think something's going to change at all, you might set it up as type 1 and rely on backups to let you backdate a type 2 fix if it turns out that it does change.

Notably, Type 1 dimensions only work for dimensions that represent entities. If you try to deal with a junk dimension (or any other grain-free dimension for that matter) with Type 1 you will have a bad time - don't do it.

Sorry this one's a bit short, I've had a busy weekend. Stay tuned for my next post at the end of the week continuing discussion about where our patterns do work.

Sunday, August 7, 2016

Let's talk about how Slowly we are Changing

Data warehouses are great. I better think that, I've made a career out of developing them. Like any software, they exist to provide value and solve problems, and like any software there are design patterns for solving particular problems. It is my opinion that these design patterns are too often applied without enough thought or understanding of exactly what they are for, and when they are not suitable - I will spend the next few posts explaining this, and suggesting some variations for particular issues that I seem to come across quite often, and which are not adequately solved by the existing data warehousing patterns.

Ok, there's the abstract over - what am I actually talking about here? The standard data warehousing tools you have in your tool-belt: transaction facts, snapshot facts and slowly changing dimensions (of various kinds). Sorry if some of this gets a little simple for you - I think it's important to go over the basics and investigate the guiding principles, especially since most DW specialists I know of picked their knowledge up on the job in one form or another, and have spent less time than they might thinking about pure theory (I know I've been guilty of this, though I have also been fortunate enough to be surrounded by some really excellent people who like talking about things). So let's start with some definitions.

Data warehouses are made to allow analysis of data, particularly over time. This is why the date dimension is so important - in the words of Kimball, it should be attached to virtually every fact table, so that's cool.

Next, a fact is something that has happened - something measurable, even if the measurement is just the observation that it happened. This needs fleshing out more, but I'll get to that later, this is good enough for now.

The definition of the word "dimension" is a little tricky, because different people mean different things by it. Here are a couple:
  • A dimension is a group of related attributes by which we measure our facts (space-time might be a dimension in this context). This is likely what a data warehousing specialist means when they say the word "dimension".
  • A dimension is the axis by which we measure things, for example the x axis in a graph, or the three dimensions that we live in, or time. Note that this is what is referred to as a dimension attribute above. This is the closest version to the non-jargon meaning of the word.
  • A dimension is a plane of existence, and you can create cool portals and kill versions of yourself in other dimensions to reap their power for yourself. I won't talk about this one any more, as it's irrelevant to us.
Now, even though we as data warehousing specialists typically use the first definition, we managed to mire ourselves with the term "slowly-changing dimension" or SCD. This is the most awkwardly named term I've run across in the field, because while it's definitely jargon, the "dimension" which is slowly changing is actually a dimension attribute according to our jargon, and that's without even getting into the whole "slowly-changing" part. So that sucks, but it's good to make sure you remember this, or it can and will cause confusion.

SCDs of course come in a variety of types, and exist for one reason only: to decide how the data warehouse should treat changing data in source systems. Different SCD types are suitable for different issues, and no SCD should be considered a "default" lest we make asses out of you and me. That, of course, doesn't stop us from treating Type 1 as a default, but let's pretend we don't do that for a while, because I want to talk about how much they don't work.

Of course, they do work, but they don't work in every situation, and I see people trying to make up for that with odd, ad-hoc solutions based on what seemed like a good idea at the time. This is painful to me, as that is exactly what design patterns are supposed to solve in the first place. Slowly-changing dimensions are seen as the only options, and that's not great, because we should be able to define new patterns (the same is true for fact tables - Kimball defines 3 fundamental types, but there are other options - something I will be getting into over the next few weeks).

So, we have a claim: SCDs work in some situations and not in others. Obviously the question is, which patterns work in which situations? This is intimately wrapped up in the question of which fact tables work where, and I will start to get into it thoroughly in my next post, but the underlying principle is that SCDs (like much Kimball doctrine, and for a reason) are really geared towards transaction fact tables, and quickly fall short in many real-world scenarios that I've come across. So, maybe it's time for us as a community to change a little less slowly, and start working with a few extra patterns.

See my next post on the topic for more specifics on how the patterns currently serve us

Saturday, July 30, 2016

How to bake a coeliac-friendly dimension

In my last post on the topic of grain-free dimensions I talked about how it's possible to easily generate a "grain" table that provides access to the dimension, as part of building that dimension. Here I'm going to get into the gritty details of how to do it - feel free to rip my SQL apart in the comments and suggest better methods for doing this, or completely alternative ways! I'm always keen to learn from others' experience.

Anyway, let's get started. I will continue with the example from last time, so let's kick this off with some revision. We started off with a Student dimension (with the grain defined by the StudentID), and wanted to add an International_Indicator column. In our imaginary school, this necessitates changing the grain of the Student table to include PaperID - since apparently International study is defined by the paper the student is studying. So, we want to wind up with something like this:

Dim_Student
SK_Student  StudentID  International_Indicator  ...
1           1234       International            ...
2           1234       Domestic                 ...

Grain_Student
SK_Student  StudentID  PaperID
1           1234       1963
1           1234       2345
2           1234       5684


Before we get into how we build this, a disclaimer: I'm afraid that I've only ever implemented this for medium-sized businesses where truncating and re-populating was the method of choice, so that's what I'm going to present here. I suppose that if you were persisting the tables you'd have to resort to a row-hashing technique, but that's out of scope for this post.

Ok, so how do we get there? Well, let's check out my database in SSMS - I've skipped the part of actually changing the dimension structure, since I think that's pretty trivial:

Now, there's an important aspect here that's not shown above: Dim_Student.SK_Student is not an identity column - we will load the values in manually. Ok, so let's take a look at the code required to load this - I'll be using a SQL stored procedure to do it, but it should be easy to port to another method. The basic principle is that we want to go through three steps: build a table including all of the columns (IDs, attributes and surrogate key), then load into the dimension, then load into the grain table. And here's the code that I use to do that:

CREATE PROCEDURE grainfreedimension.Load_Dim_Student
AS
BEGIN
       SET NOCOUNT ON;
      
       truncate table grainfreedimension.Dim_Student
       truncate table grainfreedimension.Grain_Student

       select
              Paper.PaperID,
              Student.StudentID,
              Student.Name as Student_Name,
              case
                     when Paper.InternationalIndicator = 1 then 'International'
                     else 'Domestic'
              end as International_Indicator,
              DENSE_RANK() over (
                     order by
                           Student.StudentID,
                           Student.Name,
                           Paper.InternationalIndicator
              ) as SK_Student
       into #transform
       from grainfreedimension.Paper
       inner join grainfreedimension.StudentPaper on Paper.PaperID = StudentPaper.PaperID
       inner join grainfreedimension.Student on StudentPaper.StudentID = Student.StudentID

       insert into grainfreedimension.Dim_Student (
              SK_Student,
              StudentID,
              International_Indicator,
              Student_Name
       )
       select distinct
              SK_Student,
              StudentID,
              International_Indicator,
              Student_Name
       from #transform
      
       insert into grainfreedimension.Grain_Student (
              SK_Student,
              StudentID,
              PaperID
       )
       select distinct
              SK_Student,
              StudentID,
              PaperID
       from #transform
END
GO

It's pretty simple code - it's really just the use of the DENSE_RANK function, allowing us to build the transform table without having to do joins on a bunch of dimension attributes, that makes this really cool.

Friday, July 22, 2016

Sorry, my Dimension isn't having grains right now

You should always be careful about your grains. You should know that picking the grain of the fact table is critical - there is a lot of advice about this (see what Kimball says about grain for example). What about dimensions though?

I haven't seen a lot said about grains for dimension tables - maybe I'm looking in all the wrong places, or maybe it's something that people just don't talk about much, but lack of thought about the grain of dimension tables has bitten me in the backside historically. Maybe it's because the grain of dimension tables always seems like it should be straightforward and simple - have a Student dimension? Clearly the grain will be one row per student. Product dimension? One row per product. Nothing to think about, right?

Sure, except that the whole point of a dimensional model is to make it as intuitive and easy as possible for end-users to find fields they want (or discover new ones). Where would a user go to view whether a student is an International student? Without thinking, it makes sense to go to the Student dimension, right? But wait, in this hypothetical situation my school defines a student as International based on the study they are taking - and we have these students over here that are doing some International study and some Domestic study. Well, we have a few choices:
  1. We can have a "mixed" definition
  2. We can split it out into an extra dimension (say, a StudentProgramme dimension or StudentPaper or something)
  3. We can have multiple rows in the dimension for each student
  4. We can classify a student as the majority of what they study
Now, the first and fourth options are terrible, because suddenly we can't see what proportion of their study is International and what is Domestic. The third would mess with the grain of the Student dimension table so that's out, so we're left with option 2 - well, it kind of makes sense, we can stick all our other StudentPaper attributes on there too; let me just go find some attributes to justify it, I'm sure we can rustle up something.

Hang on, wait just a moment - did you see what we just did there? We made life worse for the users, in order to keep true to our modelling ideals. I'm not saying that's always a bad thing, but we should probably take a very close look at what we're talking about, since making users' lives easy should be our foremost ideal. So where did we go wrong? Or maybe "where can we compromise to produce the best result?" should be the question, as it is so often with these kinds of things.

Well, my opinion is that we should take option three: International-ness is (to our users) a property of a student (and for the vast majority of cases that holds true), so it should be on the student. This is not a case of a slowly-changing dimension, since the two papers can be studied concurrently. So, we're left splitting the grain. I think that's fine - let me explain how I would do (and have successfully done) this.

Keep the grain in a separate table


Create a table that has all of the natural keys that define the grain of the dimension, and include the surrogate key. When building the dimension table this is dead easy - build a temporary table that has all of the dimension attributes including the natural keys, and use something like DENSE_RANK to provide the surrogate key, as ordered by the columns that will actually be in the dimension table (only include the natural keys in this if they have business meaning - a Student ID might, but a purely system-internal id might not). Then, SELECT DISTINCT them all (excluding any natural keys that don't need to be in the dim table) into the dimension, and select just the surrogate key and natural keys into your separate "grain" table. Here's an example:

Dim_Student
SK_Student StudentID International_Indicator ...
1 1234 International ...
2 1234 Domestic ...

Grain_Student
SK_Student StudentID PaperID
1 1234 1963
1 1234 2345
2 1234 5684

So here, our grain is Per Student, Per Paper - in fact exactly what's in our grain table. What you might notice here is that had we kept our grain in the dimension, we'd have not only an extra column, making the dimension table wider, we'd also have extra rows. This has two effects (I haven't found any good sources as to the severity of the impact of these, sorry). Our dimension table is wider, which increases space used and reduces performance, and we will have more distinct surrogate key values in the fact table, which also reduces performance - at least if you're using column-store indexes, as you should be. This happens because the fewer distinct values in a column, the better compression we can get in our column-store indexes, which translates into better performance.

The benefits don't stop at performance, though - there are two major "soft" benefits I see to this pattern. One is that it's really easy to join to from other tables (e.g. when building your fact tables) - you know exactly what columns you need, and there is not going to be any "wait, you're telling me I can't just join to DimStudent with StudentID? What a dumb design" kind of reflex by other people using the dimension. The other benefit is that it makes it less scary to change the grain of the dimension in the future, meaning you'll be more free to make good user-centric design decisions. This is part psychological and part that you can much more easily track down any references to Grain_Student, and you can be sure that every one of those references will need updating (not necessarily true for every reference to a dimension).
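
To illustrate that first point, here's a sketch of a fact load leaning on the grain table - the Enrolment source, Fact_Enrolment and Dim_Date are made up for the example, but Grain_Student is exactly the table above:

insert into grainfreedimension.Fact_Enrolment (SK_Student, SK_Date, Enrolment_Count)
select
       g.SK_Student,
       d.SK_Date,
       count(*) as Enrolment_Count
from grainfreedimension.Enrolment e
inner join grainfreedimension.Grain_Student g
       on e.StudentID = g.StudentID
       and e.PaperID = g.PaperID
inner join grainfreedimension.Dim_Date d on d.Calendar_Date = e.Enrolment_Date
group by
       g.SK_Student,
       d.SK_Date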

Check out my next post for SQL details as to how I would go about implementing this.