Reprinted with permission from John Wiley & Sons, Inc.
(Introduction, pages xxiii to xxviii)
Snatching Defeat from the Jaws of Victory
We are in danger of letting the relational database revolution pass by without making good on the original promise of delivering data that can be accessed "every which way." Why did we begin turning our IS shops upside down in the early 1980's, replacing our flat file databases and our hierarchical databases with relational databases, if it wasn't for the dream of much more flexible access? We are now very far along with replacing all of our production database systems with relational technology, yet amazingly, we have lost track of the original "fatal attraction" that led us to relational databases in the first place.
Relational databases were supposed to provide equal access. If your sales database was built around product, market, and time entities, it wasn't supposed to matter whether you asked the product question before you asked the market question, or vice versa. Either approach was just as good. The older hierarchical databases, like IMS, forced you to ask your business questions in a fixed order, beginning with the root of the database. Freed from the shackles of asking business questions in a set way, with relational databases we were supposed to finally unlock our corporate information.
The early writings on relational databases were full of the promise of equal access. Chris Date's wonderful book, An Introduction to Database Systems, Addison-Wesley, published in the early 1980's, is filled with simple and compelling examples of equal access. Who can forget his examples of parts and suppliers and cities? Or is it suppliers and cities and parts? It doesn't matter... It's relational. It is very revealing to page through this book and see that there are no discussions of transaction processing and E/R diagrams. All of that came later.
Even though we bought the relational dream in the early 1980's, we couldn't use relational databases very effectively because there wasn't any data in them yet. So we began to use relational technology to capture primary business production data like orders, invoices, and business transactions. Almost immediately we ran into a serious problem: the early relational systems were pitifully slow for transaction processing. A typical transaction rate was about one per second. No large business could possibly run at one transaction per second. For example, today's SABRE system, the reservation system for American Airlines, routinely processes 4000 transactions per second during heavy loads, and is capable of peak bursts in excess of 13,000 transactions per second.
The transaction processing performance crisis we had in the 1980's was both good news and bad news. It was good because it caused the database vendors to seriously improve and strengthen the relational database software so that transactions could be performed fast and reliably. This was a necessary precursor to making relational databases "production capable." Today it is possible to buy off the shelf a UNIX processor and a relational database that are capable of a sustained transaction rate of 1000 per second. This is a monstrous improvement from the early 1980's that would have seemed impossible at that time. It used to be said that "the price we have to pay for the wonderful flexibility of relational databases is that they will always be slow". Fortunately, this view was wrong.
The good news of the big transaction processing performance gain, however, is tempered with the bad news that we have become fixated by transaction processing. We have become so steeped in the terminology and discipline of OLTP, that we have a whole generation of IS professionals who think that point of relational databases is to get data in rather than get data out. Actually, the opposite is true. Transaction processing is an awkward graft onto relational databases, and to do transaction processing well in this environment, we have to ignore or even misuse many of the core facilities in relational databases. In Chapter 2, we will systematically explore the differences between OLTP and dimensional data warehousing, and learn how to think separately about getting the data in and getting the data out. We need to return to our roots.
The Goals of a Data Warehouse
The data warehouse is the place where people can access their data. The fundamental goals of a data warehouse can be developed by walking around the halls of any large organization and listening to management talk. The recurring themes heard from management sound like:
"We have mountains of data in this company but we can't get access to it."
"Nothing drives senior management crazier than to have two people present the same business result but with different numbers."
"We want to slice and dice the data every which way."
"Just show me what is important."
"Everyone knows that some of the data isn't very good."
These concerns are so universal that they drive the bedrock requirements for the data warehouse. Let us turn these problems into opportunities and state them as requirements:
1. The data warehouse provides access to corporate or organizational data.
Access means several things. The managers and analysts of an organization must be able to connect to the data warehouse from their personal computers. This connection must be immediate, on demand, and with high performance. It is not acceptable if the access is through another person, or if the access is unreliable or slow. High performance access means the tiniest queries run in less than 1 second. Access also means the tools available to the managers and analysts are very easy to use. Chapter 17 will discuss this issue in much more detail, but "easy to use" means a useful report can be run with one button click after opening the tool, and the report can be changed and re-run with two button clicks.
2. The data in a data warehouse is consistent.
Consistency means that when two people request the sales for the Southeast region for January they get the same number, even if they request the data at different times. Consistency also means that when these people ask the data warehouse what the definition of the "sales" data element is, they get a useful answer that lets them know what they are fetching from the database. Consistency also means that if yesterday's data has not been completely loaded, the analyst is warned that the data load is not complete and to not expect the final data load until tomorrow.
3. The data in a data warehouse can be separated and combined by means of every possible measure in the business. (The classic slice and dice requirement).
The slicing and dicing requirement speaks directly to the dimensional approach. We will see in Chapter 2 that a more operational definition of slicing and dicing is "row headers and constraints". Row headers and constraints will turn out to be the fundamental building blocks of every data warehouse application and row headers and constraints will come directly from the dimensions in our data model.
4. The data warehouse is not just data, but is a also set of tools to query, analyze, and present information.
The "back room" components, namely the central data warehouse hardware, the relational database software, and the data itself, are only about 60% of what is needed for a successful data warehouse. The remaining 40% is the set of front end tools that query, analyze and present the data. The "show me what is important" requirement needs all of these components. These points are developed in detail in Chapter 17.
5. The data warehouse is the place where we publish used data.
The responsibility to publish is at the very core of the data warehouse. Data is not simply accumulated at a central point and "let loose". Rather, data is carefully assembled from a variety of information sources around the organization, cleaned up, quality assured, and then released only if it is fit for use. If the data is unreliable or incomplete, the responsible data quality manager does not allow it be published to the user community. The data quality manager plays much the same role as a magazine editor or a book publisher. He or she is responsible for the content and quality of the publication and is identified with the deliverable.
6. The quality of the data in the data warehouse is a driver of business re-engineering.
The best data in any company is the record of how much money someone else owes the company. Data quality then goes downhill after that. Frequently a data element would be very interesting if it were of high quality, but it either isn't collected at all or it is "optional". Optional is the kiss of death for data.
The data warehouse cannot fix poor quality data. If an automobile insurer does not require the "Cause of Accident" data to be collected by the field adjusters, then there is nothing the data warehouse can do when this data arrives at the front door. The only way to fix poor quality data is return to the source of the data with better systems, better management, and better visibility of the value of good data, both by affected data entry personnel and by management. Interestingly, often a good way to justify such a business re-engineering project is to go ahead and publish the incomplete data, and then let natural pressure arise within the organization when people see how valuable the data would be if only it was better quality. In this way, the data warehouse can play a key role in the business re-engineering efforts in an organization.
The Goals of this Book
This book is a practical guide to owning and building a data warehouse. A major goal of the book is to teach the processes of 1) assembling a proper set of requirements for the data warehouse in your organization; 2) doing the logical design of the warehouse data structures; 3) planning the data extract and transformation steps down to the individual data element; 4) building a front end tool suite; and 5) managing the completed data warehouse. This book is written from years of hands on experience designing large data warehouses and launching IS organizations in the right direction. Every one of the examples developed in Chapters 3 through 12 is an operational data warehouse in a large corporation designed by the author. All of these warehouses are in the 10 gigabyte to 1 terabyte range and the largest queryable tables range up to 1 billion records.
A second goal of this book is to communicate a set of standard techniques for data warehouse design. The OLTP side of the house has developed a powerful and precise vocabulary for OLTP design. Serious OLTP designers know that transaction systems that pass the ACID test possess the qualities of Atomicity, Consistency, Isolation, and Durability. We on the data warehouse side of the house need an equally powerful vocabulary and set of design principles. Fortunately, data warehouse design is replete with a number of powerful, consistent principles. This book names and categorizes these principles and ties each of them to characteristic businesses. In this way they are easier to remember. The Design Principles appendix of this book summarizes this data warehouse design vocabulary.
A third goal of this book is to help the reader understand where the data warehouse part of the market is going and to help the reader become an effective voice for necessary change. Data warehousing is not very mature. Although the vendors are doing their best to define what is important in this market, most of the advances have been far too technology driven and not enough user driven. It is amazing that we are being sold on the benefits of scanning billion row tables from start to finish on parallel processors but most of the vendors don't have a STOP command for runaway queries. It is amazing that we have logically sophisticated SQL commands like EXISTS and ANY being implemented faithfully by all the DBMS vendors, but we don't have commands or tools that can effectively compare one number with another. If you don't believe this, stay tuned. A final goal of this book is to show how well (and how poorly) SQL and relational databases match with simple business analysis.