Metadata is hot these days. Everyone I run into, from major users to vendors, is planning to address the issue of metadata. It may sound straightforward, but managing metadata is actually a complex endeavor that can cause severe problems if you do not ha ndle it in the right way. Focusing on metadata can have substantial rewards for companies, as long as they approach it in a comprehensive manner.
Let me explain why this problem is so complex. First, consider what metadata is. In simple terms, metadata is the information about the data stored in databases, and it incorporates the rules about how to use that information. In this column, I outline t he complexities of this process and the techniques that organizations should use to solve this problem.
As you increase the number of data sources, the complexities involved with a data warehouse increase exponentially. This complexity becomes the problem and concern of those charged with building and managing the data warehouse. Organizations now must tak e information from several different data sources and push them together to create the illusion of a centralized corporate data source. They can take several steps to accomplish this.
First, an organization must understand what problem it is trying to solve with a data warehouse project. If it is too broad (that is, if the warehouse is being designed to solve too many problems at once), the warehouse will be too complex to implement. If it is too narrow (that is, too focused on a subset of a problem), it may not help the users.
Once you understand the problem, you must then understand the differences between how a particular piece of business information is represented and stored in multiple sources. In most cases, the metadata only represents storage information. Metadata shou ld include information about the business rules, the source of that data, and how that data is being used. Without understanding all of this, it is impossible to bring these pieces of metadata together into a single data source. Therefore, you must first document the business metadata associated with the existing data sources. Once your organization understands the rules, it must agree on a common representation of the data.
Importantly, you must base your data representation on how the data is going to be used. For example, a corporate accounting database might have a large amount of financial data calculated to very precise detail. For purposes of a warehouse, analysis at the more macro level (rounded to the nearest hundreds of dollars) might be sufficient for analysis. The usage should be documented so that users of the warehouse understand how this data may differ from formats and meanings with which they are familiar.
Now you must define the data model for the target database. This can be particularly complex if you are integrating a number of different legacy data models. It is common to integrate relational, hierarchical, and flat files into a single model, but it i s also difficult because you must create a data model that reflects the business from the highest level. If you are gathering information from a local data source, such as from a departmental server, it must make sense in the context of this enterprise v iew.
Next comes actual data transformation. In this stage, data must be mapped from one source to another. If some data comes from a hierarchical mainframe source, and other data comes from relational databases, you must put them into a common format for stor age in the warehouse. At the same time, you must take the format of the warehouse into consideration. Needless to say, this mapping process is complex. Moving data is a two-step process: the initial loading or seeding of the warehouse, followed by the in cremental updating from the source database.
During the initial loading process, it is necessary to establish a metadata catalog or dictionary. In many cases, corporations find that they must hire programmers to perform the initial load because so many variables are involved, including platform, da tabase, applications, and so on. The complexity increases as the volume of data expands. If you're talking about terabytes of data, you could have a long night.
Consistent and Synchronized Data. The problem with corporate data is that it changes constantly. Organizations must provide a mechanism with which to update a data warehouse at preset intervals. This requires techniques such as synchronizing data via replication. If you are dealing with large volumes of data, you must ensure that you have structured the warehouse so that it can be updated efficiently. For example, you may want to segment a particularly large warehouse so that updates do not take countless hours.
Data Quality. As new information is made available to the warehouse, the organization must ensure that the new data conforms to the rules and metadata designed for the warehouse. Without attention to data quality, the warehouse will be ineffective in providing key information.
Query and Data Access. Ensuring that users have the ability to ask questions of the warehouse and are guided through a maze of information is critical to its effectiveness. Query tools must provide a syntax that is easy for businesspeople to under stand. For example, terms such as account name (which all users recognize immediately) can be linked via a query tool. In addition, organizations should look at data mining tools that provide a more intelligent process for finding expected relationships between data elements. I will write a column later in the year that looks at specific data mining tools.