DBMS - September 1998
DBMS Online: Data Warehouse Architect By Ralph Kimball

Help for Hierarchies

Helper tables handle dimensions with complex hierarchies


In last month's column, I talked about the difficult modeling situation in which one dimension you want to attach to a fact table takes on multiple values for each fact table record. The most vexing aspect of the problem was that you often don't know how many values the dimension takes on until you see the data itself. The example I discussed was from the healthcare industry where one patient could have multiple diagnoses. The solution to last month's problem was to create a special helper table between the fact table and the dimension table. This helper table created a many-to-many link between the fact table and the dimension table. As a purist star-join dimensional designer, I like to avoid such situations, but there are a few compelling circumstances, such as the multiple-diagnosis example, where the physical world demands modeling a many-to-many relationship between a fact table and a dimension table.

This month I'll tackle another real-world modeling situation where the solution will turn out to be another helper table between the fact table and the dimension table. In this case, however, it is not because of a many-to-many relationship. This time the dimension has a complex hierarchical structure of variable depth, and I want to navigate this structure with my dimensional model and standard SQL.

Consider the simple business situation shown in Figure 1 You can imagine that the fact table represents the revenue from consulting services that a fictitious company called Big Nine Consultants sells to various corporate clients. The grain of the fact table is the line item on each consulting invoice sent to one of the corporate clients. The fact table has a straightforward set of dimensions, including:

I call the Invoice Number degenerate because when you try to make a normal dimension from this key, you discover that you have already used all the interesting information in the other dimensions that might otherwise have been stored with this key. This result is characteristic of dimensional models. Very often the invoice number, or the bill of lading number, or the ticket number, is a degenerate dimension. You want these keys in the design because they are the basis for grouping line items on a particular invoice or ticket, but you don't need to bother creating a dimension when there are no attributes for such keys.

In Figure 1, the main attention is focused on the Customer dimension, which I have shown in detail. With this schema design you can run all sorts of interesting queries by constraining and grouping various attributes in the Customer dimension. You can add up consulting revenue and hours billed for any configuration of Customer. The Customer table joins directly to the fact table.

Perhaps, as you are working on the design of this consulting invoices data mart, a user interview participant points out that the consulting services for the largest and most complex customers are sold at several different organizational levels. This user would like to create reports that show total consulting sold not only to individual departments, but also to divisions, subsidiaries, and overall enterprises; the report still must correctly add up the separate consulting revenues for each organization structure. Figure 2, shows a simple organizational structure, where each node in the tree is a Customer consulting services are sold to.

Figure 1 does not contain any information about how these separate Customers relate to each other. A simple computer science approach to storing such information would add a Parent Key field to the Customer dimension. The Parent Key field would be a recursive pointer that would contain the proper key value for the parent of any given customer. A special null value would be required for the topmost Customer in any given overall enterprise. Although this simple recursive pointer lets you represent an arbitrary organizational tree structure of any depth, there is a killer problem that defeats its use in your data warehouse.

The problem is that you cannot use the recursive pointer with SQL to join the dimension table with the fact table and add up the consulting revenues or hours for a set of organizations, as in Figure 2. NSI-standard SQL makes no attempt to deal with recursive pointers, and even such facilities as Oracle's CONNECT BY do not let you use a join in the same SQL statement as CONNECT BY. Thus in Oracle, although you can enumerate an organizational hierarchy defined via a recursive pointer field in a dimension table, you cannot add anything up by joining that dimension table to a fact table.

Instead of using a recursive pointer, you can solve this modeling problem by inserting a helper table between the dimension table and the fact table, as shown in Figure 3. Amazingly enough, you don't have to make any changes to either the dimension table or the fact table; you just rip the join apart and insert the helper table.

The helper table contains one record for each separate path from each node in the organization tree to itself and to every node below it. There are, then, more records in the helper table than there are nodes in the tree. In Figure 3 we need a total of 43 records in the helper table. See if you can work this out.

Each record in the helper table contains the fields:

If you are descending the tree from certain selected parents to various subsidiaries, you join the dimension table to the helper table and the helper table to the fact table with the joins as shown in Figure 3. The Depth From Parent field counts how many levels the subsidiary is below the parent. The Lowest Flag field is True only if the subsidiary has no further nodes beneath it. The Topmost Flag field is True only if the parent has no further nodes above it.

The beauty of this design is that you can place any normal dimensional constraint against the Customer dimension table and the helper table will cause all the fact table records for the directly constrained customers plus all their subsidiaries to be correctly summarized. In other words, you can use your standard relational databases and your standard query tools to analyze the hierarchical structure.

If the field Depth From Parent is equal to one, then only the immediate subsidiaries of the directly constrained customers will be summarized. If the Lowest Flag is True, then only the lowest subsidiaries of the directly constrained customers will be summarized.

The joins shown in Figure 3 let you summarize an organizational structure downward from the directly constrained nodes. By reversing the sense of the joins (for example, connecting the customer dimension primary key to the subsidiary customer key), you can move up the organizational structure instead. When Depth From Parent is equal to one, then you are referring to the immediate parent of a directly constrained customer. When Topmost Flag is True, you have selected the supreme parent of a directly constrained customer.

You can generalize this scheme by adding Begin Effective Date and End Effective Date to each record in the helper table. In this way, you can represent changing organizational structures. When a group of nodes is moved from one part of an organizational structure to another, such as with an acquisition, only the records that refer to paths from outside parents into the moved structure need to be changed. All records referring to paths entirely within the moved structure are unaffected. This is an advantage over other tree representation schemes where all the nodes in the tree need to be numbered in a global order. Also, these other representation schemes generally do not preserve the ability of standard SQL to summarize the results in an associated fact table the way this scheme does.

If you have an organization where a subsidiary is jointly owned by two or more parents, then you can add a Weighting Factor field to the helper table. Strictly speaking this is no longer a tree. I call this an irregular tree. In irregular trees with joint ownership situations, you identify those nodes with two or more direct parents. The fraction of ownership by each parent is identified, and the sum of the fractions for that jointly owned node must be equal to one. Now every helper table record from any parent that terminates at the jointly owned node or crosses through the jointly owned node must use the proper Weighting Factor. The Weighting Factor, if present in the design, must then be multiplied at query time against all additive facts being summarized from the fact table. In this way, the correct contributions to consulting revenue and total hours (in the original example) will be added up through the tree.

You can also use this approach, up to a point, to model manufacturing parts explosions. The tree structures of manufacturing parts explosions can be manipulated to fit the examples discussed in this article. You can even represent a repeated subassembly in the same way as a jointly owned subsidiary, although in this case you don't need a Weighting Factor because the subassembly is really repeated, not shared. The main limitation in using this approach for manufacturing parts explosions is the sheer number of subassemblies and parts present in a big example. A huge parts explosion with hundreds of thousands or millions of parts would almost certainly result in a helper table with "more records than there are molecules in the universe." At some point, this helper table becomes infeasible.

This column is my last in this role as "Data Warehouse Architect" for DBMS magazine. Next month the magazine and I are going to be reborn as Intelligent Enterprise and the "Warehouse Architect," respectively. I am writing a special expanded column for the inaugural issue of the new magazine on the Brave New Requirements we now face in designing data warehouses. Looking forward to seeing you there. Bye for now.



Figure 1. A revenue reporting data mart showing the Revenue fact table whose grain is the indivdual invoice line item, as well as the Customer dimension table whose grain is the individual billed customer.


Figure 2. A schematic diagram of customer organizations that Big Nine Consultants sells consulting services to. Big Nine sells to all of these customers, and they are hierarchically related.


Figure 3. I have inserted a helper table in between the fact and dimension table that lets me navigate the organizational hierarchy.


Ralph Kimball, Ph.d., was coinventor of the Xerox Star workstation. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. He is the author of The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996). You can reach Ralph through his Web page at www.rkimball.com.
What did you think of this article? Send a letter to the editor.


Subscribe to DBMS -- It's free for qualified readers in the United States
September 1998 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Sugust 7, 1998