The data conversion process for a data warehouse is complex, time-consuming, and unglamorous. It is also the very root of a good, functional data warehouse. After all, "data" is the operative word in "data warehouse." Quality data conversions are important to the data warehouse because the warehouse holds the information that is key to a corporation's decision-making process. For a corporation to obtain the ultimate goals and promises of a data warehouse, it must understand just how critical the data conversion process is.
The data conversion team examines old, dirty source systems that usually reside on mainframes. Typically, they use good old Cobol for the majority of their work on the mainframe, and they are likely to have the tightest service target of any part of the warehouse. The business users might wait a little longer to obtain the results of their queries, but they will not wait an extra hour for their data loads. In addition, this team faces pressures because the data warehouse is likely to uncover more problems with source data than any other single activity. Typically, the conversion system is questioned because, as it's the newest system, users assume that the conversion system is what introduced the problems.
To truly understand the complexities and resource requirements involved in a quality conversion, you must examine the overall process.
This team follows a data conversion process that starts with development of a conversion plan.
The conversion plan determines the best route to migrate source data to the data warehouse. It considers available resources, source data volume, number of different source schemas, number and types of different access methods and platforms, data warehouse structure, and amount of aggregation required.
The plan documents each source system's platform, access method, and programmatic language required for data extraction. If your team will extract from many source systems residing on many machines, your data conversion plan will outline the appropriate strategy for gathering the data extractions to a common staging area to condition, clean, transform, and integrate the data. (See Figure 1.)
Your strategy for moving data to the common staging area must take into account available machine resources, collective skill set of the data conversion team, and volume of source data. For example, the source systems are MVS-based, the technical skill set of the data conversion team is MVS and Cobol, and the volume of data is high. In this case, your data conversion plan would recommend a staging area that resides on one of the MVS machines. If, however, the MVS machine resources are limited and your team's technical skill set is strongest with Unix and C, your plan would recommend a staging area that resides on a Unix machine and possibly the actual data warehouse server.
In addition, your data conversion plan must consider the structure of the data warehouse and the target database schemas. Typically, data moves from its source system schema to an intermediate schema in the staging area. The intermediate schema is the common interface to which all source systems are extracted. The intermediate schemas do not exactly match the source schemas or the target schemas. They are usually somewhere in between and contain additional fields, such as percentage numbers for use in calculations or key fields to read reference or lookup tables, that enhance the data conditioning, data cleaning, and data transformation routines. Finally, your data conversion plan must consider the flow of data throughout conversion, addressing the following issues:
Typically, you will document the conversion specifications in a word-processor document or a spreadsheet printed for customer review and sign-off. However, if you are using a data migration tool, you can document the conversion specifications directly in the tool and generate conversion specification reports for customer review and sign-off.
Either way, the conversion specifications are extremely important metadata for the data warehouse. Data migration tools automate the process of storing conversion metadata in a central place, tracking the conversion metadata over time, and exporting the conversion metadata to the data warehouse's metadata repository. If you generate the conversion routines manually, you must also generate the conversion metadata manually. The data conversion plan specifies the format of the conversion metadata as well as how you will store, update, and export the metadata.
Whether conversion specifications are generated manually or automatically into programmatic code, you should use conversion specifications to understand the data mapping and required processing logic. Programmatic code consists of six types of routines that perform the following functions:
You design the data extraction routines to isolate only that data that is required to migrate to the data warehouse. The design is effective for both the initial load and incremental updates of the data warehouse, and it is important because it reduces the amount of data that migrates to the data warehouse, which in turn reduces computer and network resource requirements for the conversion. The data extraction design should consider how the source system indicates changed and new data, such as time stamps or periodic archives.
Time stamps and periodic archives (such as month-end files of paid claims for an insurance company) are ideal because the data extraction routines easily identify and isolate the changed and new data. If the source system does not indicate changed or new data, you compare the source data with master files of warehouse data to find the changed or new data.
The data extraction routines usually execute within the source system's environment, performing functions that transform, convert from binary, convert from packed decimal, and compare, combine, and parse the source data. These types of data conditioning are more easily performed in the environment that created these types of data than in the target environment that might not support these functions or types of data. The conditioning process relies heavily on knowledge of the data source; structure of the data as it resides in the operational systems or in the data warehouse; structure of the data as perceived by the end user or the warehouse applications; knowledge of the rules required to identify, map, clean, transform, and aggregate the data; and security features associated with various metadata objects, such as source system logons and data warehouse database security rules.
For the data conversion team to implement data cleaning services, the data warehouse architecture design must accommodate constraints imposed by the data cleaning software, such as any platforms on which the data cleaning software does or does not run. You only use this software if you cannot correct the data at the source. In addition, you identify all cleaning requirements in the data conversion plan.
The following are major components of data cleaning:
If you are extracting data from more than one subject area or from more than one version within a subject area, you must integrate that data into a single view. In addition, you must address data anomalies and corrections to previously migrated and loaded data.
The techniques of data conditioning, cleaning, transformation, and integration must be applied in an iterative fashion. Once you and the customer are satisfied with the condition of the data, you create load-record images for the atomic-level, dimension, and fact data.
Atomic-level data is the lowest level of data detail in the warehouse and is needed to provide the level of granularity required to effectively condition, clean, transform, and integrate the data. If the data warehouse feeds data marts that are typically designed with a star-join schema, you use the atomic-level load data to create the data mart dimension tables. If the processing for this activity is sequential, you sort the resulting files of dimension-load data to eliminate any duplicate records.
To track and generate keys for each of the dimensions of the star-join schema, you must create a key administration application. A variety of key generation and administration strategies exist. Examples are source system key integration and system-generated keys. Source system key integration transforms logical keys from multiple source systems into a unique physical key. A hospital patient might be identified with a social security number in one system and by name and birth date in another system. In the data warehouse, the keys are integrated. System-generated keys are those that the conversion system or RDBMS assigns. The conversion plan documents the strategy that is best suited for the data warehouse and its users.
To create data to load into fact tables, you generate an application that first matches the dimension-load data to the atomic-level load data. Then it populates the fact data records with the keys of the dimensions and the associated quantitative facts contained in the atomic-level data. This application reports any records from the dimension-load data that do not match back to the atomic-level load data. If the application performs properly, absolutely no mismatch conditions occur.
Typically, you execute these functions sequentially rather than within the RDBMS for three reasons. First, external sort utilities such as SyncSort (from Syncsoft Inc., Woodcliff Lake, N.J.) are fast. Second, the new aggregate data is stored apart from the data warehouse server as part of disaster-recovery requirements. If a disaster requires a data recovery, you simply reload the aggregates. Third, you use a bulk loader to populate the target database rather than an application with embedded SQL. This method is fast and efficient.
Whether you decide to update indexes during the bulk load process or later depends on time constraints and the capabilities of the RDBMS. Several RDBMSs let DBAs segment the table indexes. Then you can drop a portion of the index, bulk load the data, and rebuild the indexes as needed. Segmented table indexes reduce the amount of time needed to bulk load and index the data. Reducing the amount of time spent in the "load window" has a positive impact on the availability of the warehouse and is especially important if the data warehouse users are located around the world in many time zones.
You must work closely with the customer or end user to create the conversion plan and the conversion specifications. As mentioned earlier, you examine the current processing environment to plan the best route by which to move the data from the source systems to the data warehouse. The plan is documented and reviewed with the customer. In this manner, the customer knows up-front what to expect during the data conversion process. To show understanding, the customer formally signs the conversion plan.
To create effective conversion specifications, you should learn as much as possible about the source data, including its structures and the meanings of each of its fields. You should also document how the source data populates the target data, document your understanding, and review it with the customer. Not only will customers be able to validate the meanings of the source data, but they will also better understand how to relate the source data to the data in the data warehouse. To show agreement, the customer formally signs the conversion specifications.
The first place to reconcile the conversion process is after you extract the source data to the intermediate schemas. Total the number of extract records and other additive data, such as counts and monetary amounts, from the intermediate schemas. These figures tie to the totals from the source system. As a rule, reconcile the data early and often in the data conversion process, and strive to find problems as early as possible to avoid having to reexecute multiple steps of the process.
During the conditioning, cleaning, transformation, and integration of the data, you and the customer make decisions as the actual values of the data are examined. When you uncover errant data, report the findings to the customer. Together you and the customer decide to correct the data in the source system or with the data conversion routines. If you correct the data with the data conversion routines, you should also denote the corrections in the conversion metadata.
The next place to reconcile the data is after you complete the atomic-level load data. In the same manner as you did with the data extracts, total the number of extract records and other additive data. As you tie the figures from the atomic-level load data to the source systems, consider the actions you took as you conditioned, cleaned, transformed, and integrated the data. Probably, totals no longer match. However, you must reconcile the differences between the source systems and the atomic-level data.
Finally, reconcile the data after you create the fact-load data and the aggregate fact-load data. The totals from all fact tables tie exactly to each other and the atomic-level load data.
Figure 1.

--The data conversion process.
June 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home
DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.