Sunday, February 26, 2017

DW Patterns: The good - Accumulating Snapshot facts

In the last post I talked about Periodic Snapshot facts: where they are useful, why they're not actually that great for the things people usually recommend them for, and why they're good for some other things instead.

This post will focus on the other kind of standard snapshot fact. This is, of course, the...


Accumulating Snapshot Fact

While each row in a periodic snapshot represents an observation of the current state of an ongoing process (like a reading of a thermometer, or a count of the number of items in inventory), a row in the Accumulating Snapshot table represents an entire process, from beginning to end. The idea is that you have one date/time column for each step in the process, recording when the row reached that stage. It would be possible to have separate rows and treat it as a transaction fact, but frequently there will be measures that only relate to one step in the process - and then you'd have to have columns that only contain values for a small proportion of rows, figure out what to do in the other cases, and so on.
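To make the shape concrete, here's a minimal sketch in Python of what one such row might look like, using a hypothetical order-fulfilment process - the stage names, measures and values are all invented for illustration, and a real table would reference date/time dimensions through surrogate keys rather than storing raw dates:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# One row per process instance (here, one row per order). Each stage gets
# its own date column, plus any measures that only make sense at that stage.
@dataclass
class OrderFulfilmentFact:
    order_id: int
    date_ordered: Optional[date] = None     # stage 1
    date_packed: Optional[date] = None      # stage 2
    date_shipped: Optional[date] = None     # stage 3
    date_delivered: Optional[date] = None   # stage 4
    packing_minutes: Optional[int] = None   # measure relevant only to stage 2
    freight_cost: Optional[float] = None    # measure relevant only to stage 3

# The same row is updated as the order moves through the process,
# rather than a new row being inserted for each event.
row = OrderFulfilmentFact(order_id=42, date_ordered=date(2017, 2, 20))
row.date_packed, row.packing_minutes = date(2017, 2, 21), 35
row.date_shipped, row.freight_cost = date(2017, 2, 22), 12.50
```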

One important thing to note here is that the process must have absolutely well-defined stages in order for this pattern to work - if it's possible for the process to move backwards and forwards, skip steps or go through the stages in a non-linear order, this pattern is useless. If the process has an undefined number of stages, this pattern is useless.

In cases where you do have a strictly defined process, however, this isn't just a good solution, it's ideal - you minimize the amount of data stored, you don't waste space on pointless columns for most records (some records may have empty columns, but only when the process hasn't yet reached that stage), and you can easily see where each item going through the process currently is. It's great - it just seems to me a bit too oddly specific to earn a place among the Three Patterns of Facts.

Does it fix the problems that were left by the previous two types though? It seems to have invented its own niche problem and solved that rather than helping with these basic questions of history tracking, but let's check. The problems I was evaluating against previously when fact data changed were:
  1. Data entry error (the value was always supposed to be X, but it was originally measured as Y)
  2. The grain of the measure doesn't match our fact table because the quantity we're measuring is not temporally related to when the event occurred (e.g. attention score)
  3. The measure is a continuous variable (e.g. influence of individuals, or temperature)
As with the previous two types, problem 1 is not covered - our only option is to overwrite one of the existing measurements, with no way to track what the incorrect value was. Not too promising - what about number 2? To achieve that kind of tracking we'd need an unbounded number of columns, which doesn't seem feasible... Number 3 already falls under the Periodic Snapshot's area of expertise, so we can skip it (this pattern can't really do much there anyway).

So now we have three problems remaining - problems 1 and 2 from above, plus the new problem of modelling processes where we'd like to use the Accumulating Snapshot but can't, because the process is flexible. One example I'm particularly familiar with is student enrollment - it generally involves a lot of back-and-forth between the institution, the student and various other parties, and it doesn't follow a set path (or if it does, it's such a convoluted one that the simplistic modelling of the Accumulating Snapshot can't handle it).

So what can be done? Well, Kimball does have a couple of pointers that may help and we even alluded to some in the previous post, but that will be the topic for next time.

Tuesday, January 31, 2017

DW Patterns: The good - Periodic Snapshot facts

Last time I left off talking about the Transactional fact type, and what it does for us (it matches how we want data warehouses to behave in many scenarios) and what it doesn't do (provide any history tracking). This time I'll get into the...

Periodic Snapshot Fact

This one is the "other" fact type - the main fallback for any DW developer for whom Transactional facts don't work. The principle behind this pattern is that rather than an "event-based" measurement (which is what the Transactional fact would be), we have a "polling" measurement. This is where we take stock at the end of the day and say "ah, we had X amount of Y stock in the warehouse today." If you take that reading each day, you get to see trends.

So, what you wind up with is a log, that looks something like this:

Periodic_Snapshot_Fact
SK_Product   SK_Date    SK_Warehouse   Amount_Of_Stock
63           20160528   7              58
64           20160528   12             103
63           20160529   7              54
64           20160529   12             91

Now, there are a couple of important things to note here. First, the measure (in fact almost any measure associated with this table) is non-additive. This is a pain, but not the end of the world. Worse than that, though, is that we're no longer tracking each individual item - our data is not at the finest possible grain it could be. That may not be a problem, and don't get me wrong - when this pattern is used it's frequently as good as it gets (an important phrase to get used to...), and it may not be fixable (if we don't track data about each item, this is as good as it gets) - but consider the case where one of the products is cabbages.
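Before we get to the cabbages, the non-additivity point is easy to see in a couple of lines, using the made-up numbers from the table above:

```python
# Two days of stock snapshots for product 63 in warehouse 7 (from the table above).
snapshots = [
    {"sk_product": 63, "sk_date": 20160528, "sk_warehouse": 7, "amount_of_stock": 58},
    {"sk_product": 63, "sk_date": 20160529, "sk_warehouse": 7, "amount_of_stock": 54},
]

# Summing the measure across days gives 112 - a number that describes nothing real.
meaningless_sum = sum(r["amount_of_stock"] for r in snapshots)

# Averaging (or max, min, or a single day's reading) is still perfectly meaningful.
average_daily_stock = meaningless_sum / len(snapshots)
print(meaningless_sum, average_daily_stock)  # 112 56.0
```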

Now, to start with the manager just said "please show me my stock of cabbages each day". No problem! Daily snapshot, that's what you use for tracking stock, so we'll go ahead and slap it in.

Next day, the manager says "Oh, that's great. Now how do I get this to show me the longest amount of time that a cabbage sits in the warehouse? We've had some complaints of spoiled cabbages and I want to make sure the problem is downstream of us." Huh, well, OK, no, you can't do that - we'll have to start from scratch. If you read my previous post, you'll have already noted that I would suggest treating each incoming and outgoing item as a transaction - then you can track how long each one has spent in the warehouse. Again, that may not be possible, which is one reason why this pattern exists - after all, design patterns are here to help in tricky situations.
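As a rough sketch of that transactional alternative (item IDs and dates invented), the manager's question falls straight out of the in/out events:

```python
from datetime import date

# Each arrival and departure is its own transaction.
movements = [
    {"item_id": "cabbage-001", "event": "in",  "date": date(2016, 5, 28)},
    {"item_id": "cabbage-002", "event": "in",  "date": date(2016, 5, 29)},
    {"item_id": "cabbage-002", "event": "out", "date": date(2016, 6, 2)},
    {"item_id": "cabbage-001", "event": "out", "date": date(2016, 6, 10)},
]

# Dwell time is just the "out" date minus the "in" date for each item.
arrivals = {m["item_id"]: m["date"] for m in movements if m["event"] == "in"}
dwell_days = {m["item_id"]: (m["date"] - arrivals[m["item_id"]]).days
              for m in movements if m["event"] == "out"}
print(max(dwell_days.values()))  # 13 - the longest any cabbage sat in the warehouse
```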

The other situation where periodic snapshots shine is when the amount of data is so phenomenal that you can't use a transaction fact. If possible, of course, you should use the Transactional pattern and do things like partitioning, but frequently life isn't all friendly like that. As far as pre-aggregated data goes, this pattern is a good candidate, but as always when violating principles, make sure you need to, and make sure that this is the way you need to do it. There may be other patterns more appropriate.

Finally, and more importantly than the above scenarios (depressing, since those are the ones most frequently espoused as the times this pattern is appropriate), there actually are facts that by their very nature don't support transactional data. Of course, they don't perfectly fit this pattern as it is typically outlined either, so that's a shame. Let me give you an example.

Consider whatever national park service you have in your country. In New Zealand, that's the Department of Conservation. They might be interested in temperature readings, in order to care for an endangered species. This species is very sensitive to high temperatures, but can survive cold ones OK. So, they set up a temperature monitor in the creature's nest that measures every hour, except during the summer months, when it measures every minute (the slower rate is to conserve power during the winter).

Now, let's take a look at how we could model that. The periodic snapshot looks bad as written, since it assumes a fixed period, but we have variable periods. We could decide our period is the fastest one (per minute) and simply copy out the previous reading during the winter months, but that stores 60 times as much data as necessary during winter, and the data will be misleading anyway (those copied readings really are fake readings - the temperature certainly won't stay constant throughout each hour). In addition, that's asking for trouble when the rangers get new devices that measure every 30 seconds. Instead, we can alter the pattern to use an arbitrary period - a pretty minor change really. All of our aggregate functions still work (we won't be using sums, but averages, max, min and many of the more arcane functions you might think of will still work as normal).
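Here's a small sketch of that arbitrary-period variant, with invented times and temperatures (New Zealand seasons, so per-minute readings in January and hourly readings in July):

```python
from datetime import datetime

# Readings are stored as they arrive, at whatever interval the device uses.
readings = [
    (datetime(2016, 1, 15, 4, 0), 21.3),   # summer: one reading per minute
    (datetime(2016, 1, 15, 4, 1), 21.4),
    (datetime(2016, 1, 15, 4, 2), 21.6),
    (datetime(2016, 7, 1, 4, 0), 2.1),     # winter: one reading per hour
    (datetime(2016, 7, 1, 5, 0), 1.8),
    (datetime(2016, 7, 1, 6, 0), 2.4),
]

temps = [temp for _, temp in readings]
# Max, min and average are all still meaningful regardless of the interval;
# a plain sum of temperatures never was.
print(max(temps), min(temps), sum(temps) / len(temps))
```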

Alternatively, we could treat each reading as a transaction - the difference between the reading and the previous reading could be treated as a delta: if the temperature went up by 1.3 degrees, we'd store 1.3; if it went down by 0.8, we'd store -0.8. This way the temperature would be summable - but have we gained anything? Our rationale for wanting to swap snapshots for transactions was to not lose grain - but in this case the snapshot is already at the finest grain we can get. The only way to get a finer grain is for the measuring instruments to take readings more often, in which case the snapshot captures that finer grain anyway. Furthermore, with the snapshot, if someone asks "what was the temperature at 5:30 on a midwinter morning?" you can have an aggregate function interpolate between the 5:00 reading and the 6:00 reading - something that wouldn't be (easily) possible with the transactional version.
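For what it's worth, here's a sketch of that interpolation idea, with invented readings and simple linear interpolation between the two surrounding snapshots:

```python
from datetime import datetime

def interpolate(t0, v0, t1, v1, t):
    """Linearly interpolate between two timestamped readings."""
    fraction = (t - t0).total_seconds() / (t1 - t0).total_seconds()
    return v0 + fraction * (v1 - v0)

# "What was the temperature at 5:30 on a midwinter morning?"
estimate = interpolate(datetime(2016, 7, 1, 5, 0), 1.8,
                       datetime(2016, 7, 1, 6, 0), 2.4,
                       datetime(2016, 7, 1, 5, 30))
print(estimate)  # 2.1 - easy from the snapshots, awkward to derive from deltas
```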

It's probably worth pointing out here that we now have two main uses (with different implementations) for the Periodic Snapshot. One is a performance optimization, or a necessary evil, when we would otherwise rather use a Transaction fact. The other is when we have a continuous variable over time, and the snapshot is (probably) the most appropriate option. This might actually be worth splitting out and giving two names, but I'll get to that later...

So, our snapshot fact has some legitimate uses, some worrying uses, and some outright harmful uses. That's cool. There's one more thing to check though - history tracking. Now in my previous post I mentioned three cases where we might see changes to a fact:

  1. Data entry error (the value was always supposed to be X, but it was originally measured as Y)
  2. The grain of the measure doesn't match our fact table because the quantity we're measuring is not temporally related to when the event occurred (e.g. attention score)
  3. The measure is a continuous variable (e.g. influence of individuals, or temperature)
You'll see that point 3 above naturally falls under the Periodic Snapshot's area of effectiveness. What about 1 and 2? Well, let's assume that at 4:00 on the 5th of May, our temperature reader recorded a value of 80°C. That seems unlikely, so we go back and update the source database to an average of the surrounding reads. The snapshot table has no way of handling this kind of change other than overwriting the relevant record. That's fine in some circumstances and fails in others (I'll get into this more in the next post), but regardless, it doesn't track the change, so it fails our test.
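Concretely, the best the snapshot table can offer for that correction is something like this (invented readings) - overwrite the bad value and lose the original:

```python
# Hourly snapshot readings; the 4:00 value is obviously a faulty read.
readings = {
    "2016-05-05 03:00": 14.2,
    "2016-05-05 04:00": 80.0,
    "2016-05-05 05:00": 13.8,
}

# Our only option: overwrite it with the average of the surrounding reads.
readings["2016-05-05 04:00"] = (readings["2016-05-05 03:00"] +
                                readings["2016-05-05 05:00"]) / 2

# Nothing records that the value was ever 80.0 - the history is simply gone.
```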

For 2, it seems on the surface like the snapshot fact is great. We're good at tracking continuous variables, right? Well, it's not really continuous. You see, the attention score is altered every time a person tweets about a publication, or it's mentioned in a policy document, or whatever - these are discrete events. However, each paper is also published on a certain date, and many measures are interested in specifically this date. This is getting a little confusing, so let's jump to some example questions:

"How many papers were published by my department in January?" - OK, so we need a count of publications, department and the publication date. Not a problem, we can just use a transaction fact, with each record representing a publication.

"What is the average attention score of papers published from my department?" - Also easy; again, this can just be the transaction fact. We'll have the up-to-date attention score as a fact, and just overwrite it (incidentally, overwriting fact as part of day-to-day business is a BIG RED FLAG that you've got your fact very wrong, because...)

"Ok, some of the papers are 5 years old, and some are only a week old, so we need a better average. Let's try... 'What is the average attention score for each paper one year after it was published?'" - Uh... OK, we can just stop storing the attention score after one year...

"Actually, I'm not sure if one year is the right threshold - can we just have that be a parameter?" ... Ok, we'll change this to a snapshot fact table. Oh, except now we can't answer the "How many papers were published by my department in January" question, since we don't have the transaction date, since our date dimension is our snapshot date...

You might notice that the obvious solution is to carry both dates - a publication date dimension and a snapshot date dimension - which is very valid, but this situation is really only a specific example of a more generic problem. We have one fact pattern left though, so let's hold off on that discussion until we've taken a look at the accumulating snapshot (sounds kinda promising, right?)
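To make that "both dates" idea concrete, here's a rough sketch (invented rows; in a real warehouse both dates would be role-playing date dimensions keyed by surrogates) of a snapshot fact that carries the publication date alongside the snapshot date, so the one-year threshold becomes a query-time parameter:

```python
from datetime import date

# Attention-score snapshots, each row carrying both date roles.
rows = [
    {"paper": "A", "published": date(2015, 1, 10), "snapshot": date(2016, 1, 9),  "attention": 42},
    {"paper": "A", "published": date(2015, 1, 10), "snapshot": date(2017, 1, 9),  "attention": 55},
    {"paper": "B", "published": date(2016, 1, 20), "snapshot": date(2017, 1, 19), "attention": 17},
]

def average_attention_at_age(rows, max_age_days):
    """Average each paper's latest attention score taken within max_age_days of publication."""
    latest = {}
    for r in rows:
        age = (r["snapshot"] - r["published"]).days
        if age <= max_age_days and (r["paper"] not in latest or age > latest[r["paper"]][0]):
            latest[r["paper"]] = (age, r["attention"])
    scores = [score for _, score in latest.values()]
    return sum(scores) / len(scores) if scores else None

print(average_attention_at_age(rows, 365))  # (42 + 17) / 2 = 29.5
```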

Saturday, January 7, 2017

DW Patterns: The good - Transactional facts

So we left off last time recapping the different kinds of SCD patterns, and I mentioned some of the ways that SCDs don't provide adequate history capture. Some of these are, for example, when data related to an event changes but lives in a junk dimension rather than an "entity" dimension, or when the entity related to an event is changed - something that Kimball would probably call a bad design decision; I'm not so sure.

There's another type of change which can occur - when the amount measured changes. These can be measurement errors that we correct, or they can represent a grain that isn't factored into the fact table in the traditional way - for example, when a research paper is released, the public sentiment about it can change over time (Altmetric, for example, measure something similar), or we might track the influence of people over time a la Forbes' list of influential people (assuming they base that list on data and not thumb-sucking - I don't know either way). There are also a couple of other situations I'll get into through the remainder of this series.

Anyway, due to these issues with dimensions, next we need to take a look at the ways that the fact table can contribute - can we make up for these shortcomings using standard design patterns? Wikipedia lists four design patterns, while Kimball lists three. Fortunately this time they mostly agree on what they are - Wikipedia's fourth pattern is a variant of one of the other three. The place to start however is always with...

Transactional Fact

These are the most basic, intuitive facts, and with a bit of thought you can normally model the other fact types as transactional facts (and should probably at least go through the exercise of doing so) - though sometimes you don't have access to that kind of information. Anyway, the basic premise of this fact is that we measure each event that happens - the classic example being sales transactions, where we basically know everything about the sale when it happens, so we can store it, measure it, aggregate it and display it in really nice over-time charts.

In fact, as I mentioned earlier, just about everything can be measured as a transactional fact in some way or another. One example that often arises of when you might not want to use the transaction fact is measuring stock levels: the amount of stock in a warehouse each day is the usual recommended scenario for Daily Snapshot facts, but if you were to record a positive-valued event for every item that entered the warehouse, and a negative-valued event for every item that left, you could happily sum them up over time and get correct values.
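A quick sketch of that (quantities and dates invented): signed movements accumulate into exactly the stock level a daily snapshot would report, while keeping the finer per-movement grain:

```python
from datetime import date
from itertools import accumulate

# Signed stock movements for one product in one warehouse.
movements = [
    (date(2016, 5, 27), +60),  # a delivery arrives
    (date(2016, 5, 28), -2),   # items shipped out
    (date(2016, 5, 29), -4),
]

# The running total at any point in time is the stock level a snapshot would show.
running_stock = list(accumulate(qty for _, qty in movements))
print(running_stock)  # [60, 58, 54]
```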

Looking back at our list of history tracking requirements, though, we can see pretty clearly that we don't get anything from this fact type. That's not to say you can't combine it with other kinds of patterns to provide history tracking, but this pattern itself doesn't provide anything. Updates to a fact either won't get represented, or they'll overwrite the existing records - either way we don't get history.

Next time we'll take a look at the periodic snapshot fact, and see if that can get us what we want...