DBMS, April 1996
Enterprise C/S By Judith Hurwitz

Preparing for the Warehouse

When You Are Beginning a Data Warehouse Project, Start With the Metadata.

Metadata is hot these days. Everyone I run into, from major users to vendors, is planning to address the issue of metadata. It may sound straightforward, but managing metadata is actually a complex endeavor that can cause severe problems if you do not ha ndle it in the right way. Focusing on metadata can have substantial rewards for companies, as long as they approach it in a comprehensive manner.

Let me explain why this problem is so complex. First, consider what metadata is. In simple terms, metadata is the information about the data stored in databases, and it incorporates the rules about how to use that information. In this column, I outline t he complexities of this process and the techniques that organizations should use to solve this problem.

The Problem

Most large organizations have a common problem: They have multiple data sources. Each of these data sources has its own set of predefined rules, naming conventions, and unique file formats. Even when dealing with a single data source, there are complexit ies inherent in metadata. For example, a finance company with which I've worked had a single Sybase database system that contained more than 600 tables. In some cases, there were as many as 10 or 12 different representations of the same information. Beca use each version had slightly different sources, processing rules, and formatting, it was almost impossible for the user or administrator to know which data source to use under different circumstances.

As you increase the number of data sources, the complexities involved with a data warehouse increase exponentially. This complexity becomes the problem and concern of those charged with building and managing the data warehouse. Organizations now must tak e information from several different data sources and push them together to create the illusion of a centralized corporate data source. They can take several steps to accomplish this.

First, an organization must understand what problem it is trying to solve with a data warehouse project. If it is too broad (that is, if the warehouse is being designed to solve too many problems at once), the warehouse will be too complex to implement. If it is too narrow (that is, too focused on a subset of a problem), it may not help the users.

Once you understand the problem, you must then understand the differences between how a particular piece of business information is represented and stored in multiple sources. In most cases, the metadata only represents storage information. Metadata shou ld include information about the business rules, the source of that data, and how that data is being used. Without understanding all of this, it is impossible to bring these pieces of metadata together into a single data source. Therefore, you must first document the business metadata associated with the existing data sources. Once your organization understands the rules, it must agree on a common representation of the data.

Importantly, you must base your data representation on how the data is going to be used. For example, a corporate accounting database might have a large amount of financial data calculated to very precise detail. For purposes of a warehouse, analysis at the more macro level (rounded to the nearest hundreds of dollars) might be sufficient for analysis. The usage should be documented so that users of the warehouse understand how this data may differ from formats and meanings with which they are familiar.

Now you must define the data model for the target database. This can be particularly complex if you are integrating a number of different legacy data models. It is common to integrate relational, hierarchical, and flat files into a single model, but it i s also difficult because you must create a data model that reflects the business from the highest level. If you are gathering information from a local data source, such as from a departmental server, it must make sense in the context of this enterprise v iew.

Next comes actual data transformation. In this stage, data must be mapped from one source to another. If some data comes from a hierarchical mainframe source, and other data comes from relational databases, you must put them into a common format for stor age in the warehouse. At the same time, you must take the format of the warehouse into consideration. Needless to say, this mapping process is complex. Moving data is a two-step process: the initial loading or seeding of the warehouse, followed by the in cremental updating from the source database.

During the initial loading process, it is necessary to establish a metadata catalog or dictionary. In many cases, corporations find that they must hire programmers to perform the initial load because so many variables are involved, including platform, da tabase, applications, and so on. The complexity increases as the volume of data expands. If you're talking about terabytes of data, you could have a long night.

Next Steps

Once you understand the complexities and scope of the problem, you must deal with the implementation issues. Following are some recommendations about how to get started:
  • Have a clearly defined scope for your project. For the pilot project, start small and keep the complexity low. With your first few warehousing projects, you'll learn a lot about what not to do.
  • Manage user expectations. Are users expecting miracles from the warehouse? Do they understand their goals? Are they expecting a two-week project? Setting expectations will help the IT organization deliver something that is credible and effecti ve.
  • Make sure that you understand all the pieces necessary to implement a warehouse. There are no simple turnkey products on the market. Therefore, you'll have to look for software that will do everything from querying to cleansing data from multi ple sources. So far, I have discussed the requirements for designing and implementing a warehouse. But organizations considering this project must also consider the deployment or data administration issues. Without a well-thought-out strategy, the data warehouse will fail. At the very least, a warehouse deployment strategy must have three key goals: consistent and synchronized data, data quality, and adequate end-user query and access tools.

    Consistent and Synchronized Data. The problem with corporate data is that it changes constantly. Organizations must provide a mechanism with which to update a data warehouse at preset intervals. This requires techniques such as synchronizing data via replication. If you are dealing with large volumes of data, you must ensure that you have structured the warehouse so that it can be updated efficiently. For example, you may want to segment a particularly large warehouse so that updates do not take countless hours.

    Data Quality. As new information is made available to the warehouse, the organization must ensure that the new data conforms to the rules and metadata designed for the warehouse. Without attention to data quality, the warehouse will be ineffective in providing key information.

    Query and Data Access. Ensuring that users have the ability to ask questions of the warehouse and are guided through a maze of information is critical to its effectiveness. Query tools must provide a syntax that is easy for businesspeople to under stand. For example, terms such as account name (which all users recognize immediately) can be linked via a query tool. In addition, organizations should look at data mining tools that provide a more intelligent process for finding expected relationships between data elements. I will write a column later in the year that looks at specific data mining tools.

    Should You Do It?

    With all of the complexities I've mentioned, should you still move forward with your data warehouse project? My answer is a qualified yes. If you can identify clear business need and establish a well-defined project with a manageable scope, move forward. You may want to look for companies that have expertise in integrating components, unless you have a lot of internal expertise. Organizations that have created well-designed warehouses are realizing significant business benefits from them. Therefore, it is probably worth the trouble.


    Judith Hurwitz is president of Hurwitz Consulting Group Inc., a consulting, publishing, and research services firm specializing in client/server development tools, client/server infrastructure, and systems management. Hurwitz Consulting Group is based in Newton, Massachusetts. You can reach Judith at 617-965-6900 or via email at Jhurwitz@hurwitz.com.
    Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
    April 1996 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

    DBMS and Internet Systems (http://www.dbmsmag.com)
    Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
    Redistribution without permission is prohibited.
    Please send questions or comments to dbms@mfi.com
    Updated Monday, November 4, 1996.