Tuesday, January 31, 2017

DW Patterns: The good - Periodic Snapshot facts

Last time I left off talking about the Transactional fact type, and what it does for us (matches how we want data warehouses to behave for many scenarios) and what it doesn't do (provide any history tracking). This time I'll get into...
This one is the "other" fact type - the main fallback for any DW developer for whom Transactional facts don't work. The principle behind this pattern is that rather than an "event-based" measurement (which is what the Transactional fact gives us), we have a "polling" measurement. This is where we take stock at the end of the day and say "ah, we had X amount of Y stock in the warehouse today." If you take that reading each day, you get to see trends.

So, what you wind up with is a log that looks something like this:

Periodic_Snapshot_Fact

SK_Product   SK_Date    SK_Warehouse   Amount_Of_Stock
63           20160528   7              58
64           20160528   12             103
63           20160529   7              54
64           20160529   12             91

Now, there are a couple of important things to note here. First, the measure (in fact almost any measure associated with this table) is semi-additive at best - summing stock across products or warehouses on a single day is fine, but summing it across days is meaningless. This is a pain, but not the end of the world. Worse than that, though, is that we're no longer tracking each individual item - our data is not at the finest possible grain. This may not be a problem, and don't get me wrong - when this pattern is used it's frequently as good as it gets (an important term to get used to...), and it may not be fixable (if we don't track data about each item, this is as good as it gets) - but consider the case where one of the products is cabbages.
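
To make that concrete, here's a minimal sketch in Python/pandas (using the illustrative values from the table above) of what does and doesn't work when you aggregate a snapshot measure across time:

```python
import pandas as pd

# The snapshot table above (values illustrative only).
snapshot = pd.DataFrame({
    "sk_product":      [63, 64, 63, 64],
    "sk_date":         ["2016-05-28", "2016-05-28", "2016-05-29", "2016-05-29"],
    "sk_warehouse":    [7, 12, 7, 12],
    "amount_of_stock": [58, 103, 54, 91],
})

# Summing across dates double-counts: product 63 never held 58 + 54 = 112 units.
misleading_total = snapshot.groupby("sk_product")["amount_of_stock"].sum()

# Across time you aggregate with mean/min/max/"latest" instead.
average_stock = snapshot.groupby("sk_product")["amount_of_stock"].mean()
latest_stock = (snapshot.sort_values("sk_date")
                        .groupby("sk_product")["amount_of_stock"]
                        .last())

print(misleading_total, average_stock, latest_stock, sep="\n\n")
```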

Now, to start with, the manager just said "please show me my stock of cabbages each day". No problem! Daily snapshot, that's what you use for tracking stock, so we'll go ahead and slap it in.

Next day, the manager says "Oh, that's great. Now how do I get this to show me the longest amount of time that a cabbage sits in the warehouse? We've had some complaints of spoiled cabbages and I want to make sure the problem is downstream of us." Huh. Well, OK, no, you can't do that - I'll have to start from scratch. If you read my previous post, you'll have already noted that I would suggest treating each incoming and outgoing item as a transaction - then you can track how long each item has spent in the warehouse. Again, that may not be possible, which is one reason why this pattern exists - after all, design patterns are there to help in tricky situations.
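
For comparison, here's a rough sketch (Python/pandas, with hypothetical item-level movement data - the source system may well not have it, which is the whole problem) of how an item-grained transaction fact answers the "longest time in the warehouse" question:

```python
import pandas as pd

# Hypothetical item-level movements: one row per cabbage arriving or leaving.
movements = pd.DataFrame({
    "item_id":   [1, 1, 2, 2, 3],
    "direction": ["in", "out", "in", "out", "in"],
    "ts": pd.to_datetime(["2016-05-01", "2016-05-09",
                          "2016-05-03", "2016-05-04",
                          "2016-05-07"]),
})

# One row per item with its in/out timestamps (item 3 is still in the warehouse).
dwell = movements.pivot(index="item_id", columns="direction", values="ts")
dwell["days_in_warehouse"] = (dwell["out"] - dwell["in"]).dt.days

# The question the daily snapshot couldn't answer:
print(dwell["days_in_warehouse"].max())  # longest stay so far, in days
```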

The other situation where periodic snapshots come into play is when the amount of data is so phenomenal that you can't use a transaction fact. If possible you should of course use the Transactional pattern and do things like partitioning, but frequently life isn't that friendly. As far as pre-aggregated data goes, this pattern is a good candidate, but as always when violating principles, make sure you need to, and make sure that this is the way you need to do it - there may be other patterns that are more appropriate.
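
If you do end up pre-aggregating for volume, the roll-up itself is mechanical. Here's a sketch (Python/pandas, invented movement data - the column names are mine, not a standard) of turning signed stock movements into daily snapshot rows:

```python
import pandas as pd

# Invented stock movements: positive deltas for stock in, negative for stock out.
movements = pd.DataFrame({
    "sk_product":   [63, 63, 63, 64, 64],
    "sk_warehouse": [7, 7, 7, 12, 12],
    "sk_date": pd.to_datetime(["2016-05-28", "2016-05-28", "2016-05-29",
                               "2016-05-28", "2016-05-29"]),
    "delta": [60, -2, -4, 103, -12],
})

# Net movement per product/warehouse/day, then a running total per
# product/warehouse gives the end-of-day stock level - the snapshot rows.
daily_snapshot = (movements
    .groupby(["sk_product", "sk_warehouse", "sk_date"])["delta"].sum()
    .groupby(level=["sk_product", "sk_warehouse"]).cumsum()
    .rename("amount_of_stock")
    .reset_index())

print(daily_snapshot)
```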

Finally, and more importantly than the above scenarios (depressing, since those are the ones most frequently espoused as the situations where this pattern is appropriate), there actually are facts that by their very nature don't support transactional data. Of course, they don't perfectly fit this pattern as it's typically outlined either, which is a shame. Let me give you an example.

Consider whatever national park service you have in your country. In New Zealand, that's the Department of Conservation. They might be interested in temperature readings, in order to care for an endangered species. This species is very sensitive to high temperatures, but can survive cold ones OK. So, they set up a temperature monitor in the creature's nest that measures every hour, except during the summer months, when it measures every minute (they want to conserve power during the winter).

Now, let's take a look at how we could model that. Periodic snapshot looks bad as written, since the pattern assumes one specific period, but we have variable periods. We could decide our period is the fastest one (per minute) and simply copy the previous reading forward during the winter months, but that stores 60 times as much data as necessary during winter, and the data will be misleading anyway (those copied readings really are fake readings - the temperature certainly won't stay constant throughout each hour). In addition, that's asking for trouble when the rangers get new devices that measure every 30 seconds. Instead, we can alter the pattern to use an arbitrary period - a pretty minor change, really. All of our aggregate functions still work: we won't be using sums, but averages, max, min and many of the more arcane methods you might think of behave as normal.
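
As a sketch of that claim (Python/pandas, made-up readings - the real device and table would of course look different): min and max don't care how long each period is, and an honest average just weights each reading by how long it was in force.

```python
import pandas as pd

# Made-up readings at irregular intervals (hourly here, per-minute in summer).
readings = pd.DataFrame({
    "ts": pd.to_datetime(["2016-07-01 04:00", "2016-07-01 05:00",
                          "2016-07-01 06:00", "2016-07-01 07:00"]),
    "temp_c": [3.1, 2.4, 2.9, 4.0],
}).set_index("ts")

# Min/max are oblivious to the length of each period.
print(readings["temp_c"].min(), readings["temp_c"].max())

# Time-weighted mean: each reading counts for the time until the next reading
# (in this toy version the final reading simply gets zero weight).
weights = readings.index.to_series().diff().shift(-1).dt.total_seconds().fillna(0)
time_weighted_mean = (readings["temp_c"] * weights).sum() / weights.sum()
print(time_weighted_mean)
```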

Alternatively, we could treat each reading as a transaction - the difference between a reading and the previous reading becomes a delta: if the temperature went up by 1.3 degrees, we'd store 1.3; if it went down by 0.8, we'd store -0.8. This way the temperature would be summable - but have we gained anything? Our rationale for wanting to swap snapshots for transactions was to avoid reducing our grain - but in this case the snapshot is already at the finest grain we can get. The only way to get a finer grain is to have the instruments measure more often, in which case our snapshot moves to that finer grain anyway. Furthermore, if you keep the snapshot, then when someone asks "what was the temperature at 5:30 on a midwinter morning?" you can have an aggregate function that interpolates between the 5:00 reading and the 6:00 reading - something that wouldn't be (easily) possible with the transactional version.
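
And a minimal sketch of that interpolating lookup (Python/pandas, the same sort of invented winter readings), estimating the 5:30 temperature from the 5:00 and 6:00 readings:

```python
import pandas as pd

# Invented hourly winter readings, indexed by timestamp.
readings = pd.Series(
    [2.4, 2.9],
    index=pd.to_datetime(["2016-07-01 05:00", "2016-07-01 06:00"]),
)

# Add the instant we were asked about, then interpolate between its neighbours.
asked = pd.Timestamp("2016-07-01 05:30")
estimate = (readings
    .reindex(readings.index.union([asked]))
    .interpolate(method="time")
    .loc[asked])

print(estimate)  # 2.65 - halfway between the 05:00 and 06:00 readings
```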

It's probably worth pointing out here that we now have two main uses (with different implementations) for the Periodic Snapshot. One is a performance optimization, or a necessary evil, when we would otherwise rather use a Transaction fact. The other is when we have a continuous variable over time, and the snapshot is (probably) the most appropriate option. This might actually be worth splitting out and giving two names, but I'll get to that later...

So, our snapshot fact has some legitimate uses, some worrying uses, and some outright harmful uses. That's cool. There's one more thing to check though - history tracking. Now in my previous post I mentioned three cases where we might see changes to a fact:

  1. Data entry error (the value was always supposed to be X, but it was originally measured as Y)
  2. The grain of the measure doesn't match our fact table because the quantity we're measuring is not temporally related to when the event occurred (e.g. attention score)
  3. The measure is a continuous variable (e.g. influence of individuals, or temperature)
You'll see that point 3 above naturally falls under the Periodic Snapshot's area of effectiveness. What about 1 and 2? Well, let's assume that at 4:00 on the 5th of May, our temperature monitor recorded a value of 80°C. That seems unlikely, so we go back and update the source database to an average of the surrounding readings. The snapshot table has no way of handling this kind of change other than overwriting the relevant record. That's fine in some circumstances and fails in others (I'll get more into this in the next post), but regardless, it doesn't track the change, so it fails our test.
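
In snapshot terms that correction is nothing more than an overwrite in place - roughly this (Python/pandas sketch, invented values), which is exactly why no trace of the change survives:

```python
import pandas as pd

# Invented readings; the 04:00 value is clearly a sensor glitch.
readings = pd.Series(
    [2.8, 80.0, 3.1],
    index=pd.to_datetime(["2016-05-05 03:00", "2016-05-05 04:00",
                          "2016-05-05 05:00"]),
)

# Replace the bad reading with the average of its neighbours...
before = pd.Timestamp("2016-05-05 03:00")
bad    = pd.Timestamp("2016-05-05 04:00")
after  = pd.Timestamp("2016-05-05 05:00")
readings.loc[bad] = (readings.loc[before] + readings.loc[after]) / 2

# ...and the 80°C value is simply gone - the table keeps no record of the fix.
print(readings)
```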

For 2, it seems on the surface like the snapshot fact is great. We're good at tracking continuous variables, right? Well, it's not really continuous. You see, the attention score is altered every time a person tweets about a publication, or it's mentioned in a policy document, or whatever. These are discrete events. However, each paper is also published on a certain date, and many measures are specifically interested in that date. This is getting a little confusing, so let's jump to some example questions:

"How many papers were published by my department in January?" - OK, so we need a count of publications, department and the publication date. Not a problem, we can just use a transaction fact, with each record representing a publication.

"What is the average attention score of papers published from my department?" - Also easy; again, this can just be the transaction fact. We'll have the up-to-date attention score as a fact, and just overwrite it (incidentally, overwriting fact as part of day-to-day business is a BIG RED FLAG that you've got your fact very wrong, because...)

"Ok, some of the papers are 5 years old, and some are only a week old, so we need a better average. Let's try... 'What is the average attention score for each paper one year after it was published?'" - Uh... OK, we can just stop storing the attention score after one year...

"Actually, I'm not sure if one year is the right threshold - can we just have that be a parameter?" ... Ok, we'll change this to a snapshot fact table. Oh, except now we can't answer the "How many papers were published by my department in January" question, since we don't have the transaction date, since our date dimension is our snapshot date...

You might notice that the obvious solution is to include another date dimension for the snapshot date as well (there's a quick sketch of that below), which is perfectly valid, but this situation is really just a specific example of a more generic problem. We do have one fact pattern left, though, so let's hold off on that discussion until we've taken a look at the Accumulating Snapshot (sounds kinda promising, right?).
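
Since I mentioned it, here's that quick sketch (Python/pandas, invented papers and scores; the column names are mine) of the two-date version - one row per paper per snapshot date, carrying both the publication date and the snapshot date - and how it answers both questions:

```python
import pandas as pd

# Invented snapshot fact: one row per paper per snapshot date, carrying BOTH dates.
fact = pd.DataFrame({
    "paper_id":         [1, 1, 2, 2],
    "department":       ["Physics", "Physics", "Physics", "Physics"],
    "publication_date": pd.to_datetime(["2016-01-10", "2016-01-10",
                                        "2016-01-20", "2016-01-20"]),
    "snapshot_date":    pd.to_datetime(["2016-07-08", "2017-01-09",
                                        "2016-07-18", "2017-01-19"]),
    "attention_score":  [12, 30, 4, 9],
})

# "How many papers did my department publish in January?" - slice on the
# publication date, counting each paper once (a stand-in for COUNT(DISTINCT)).
papers = fact.drop_duplicates("paper_id")
published_jan = (papers[papers["publication_date"].dt.month == 1]
                 .groupby("department")["paper_id"].count())

# "Average attention score N days after publication?" - N is now just a parameter.
n_days = 365
age = (fact["snapshot_date"] - fact["publication_date"]).dt.days
avg_at_n = fact[age == n_days].groupby("department")["attention_score"].mean()

print(published_jan, avg_at_n, sep="\n\n")
```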
