DBMS
 

 


 Converting Data for Warehouses - Understanding the Complexities and Resource Requirements Involved in a Quality Data Conversion. By Kathy Bohn.

The data conversion process for a data warehouse is complex, time-consuming, and unglamorous. It is also the very root of a good, functional data warehouse. After all, "data" is the operative word in "data warehouse." Quality data conversions are important to the data warehouse because the warehouse holds the information that is key to a corporation's decision-making process. For a corporation to obtain the ultimate goals and promises of a data warehouse, it must understand just how critical the data conversion process is.

The data conversion team examines old, dirty source systems that usually reside on mainframes. Typically, they use good old Cobol for the majority of their work on the mainframe, and they are likely to have the tightest service target of any part of the warehouse. The business users might wait a little longer to obtain the results of their queries, but they will not wait an extra hour for their data loads. In addition, this team faces pressures because the data warehouse is likely to uncover more problems with source data than any other single activity. Typically, the conversion system is questioned because, as it's the newest system, users assume that the conversion system is what introduced the problems.

To truly understand the complexities and resource requirements involved in a quality conversion, you must examine the overall process.

Who Is Involved

Before beginning the conversion process, the data warehouse team completes the design and physical data model for the data warehouse and generates the target schemas. The data conversion team consists of business and technical people who design the warehouse structures; analyze the source data; identify data mappings; gather and/or create the external data; determine the logic to convert the data; plan and generate the conversion routines; and quality assure the data. They also select and use data migration, conversion, and cleaning tools. (For more information on data migration, cleansing, and conversion tools, refer to Joseph Williams' article this month on page 69.)

This team follows a data conversion process that starts with development of a conversion plan.

Conversion Plan

A key critical success factor for data conversion is that all members of your team thoroughly understand conversion requirements and flow. Data conversion for a warehouse is usually very large, and you have options that allow different activities to occur simultaneously so that the conversion timeline is as short as possible. However, if you are unaware of or misinterpret the conversion requirements and flow, the activities will not link together smoothly, if at all, at the end of the process.

The conversion plan determines the best route to migrate source data to the data warehouse. It considers available resources, source data volume, number of different source schemas, number and types of different access methods and platforms, data warehouse structure, and amount of aggregation required.

The plan documents each source system's platform, access method, and programmatic language required for data extraction. If your team will extract from many source systems residing on many machines, your data conversion plan will outline the appropriate strategy for gathering the data extractions to a common staging area to condition, clean, transform, and integrate the data. (See Figure 1.)

Your strategy for moving data to the common staging area must take into account available machine resources, collective skill set of the data conversion team, and volume of source data. For example, the source systems are MVS-based, the technical skill set of the data conversion team is MVS and Cobol, and the volume of data is high. In this case, your data conversion plan would recommend a staging area that resides on one of the MVS machines. If, however, the MVS machine resources are limited and your team's technical skill set is strongest with Unix and C, your plan would recommend a staging area that resides on a Unix machine and possibly the actual data warehouse server.

In addition, your data conversion plan must consider the structure of the data warehouse and the target database schemas. Typically, data moves from its source system schema to an intermediate schema in the staging area. The intermediate schema is the common interface to which all source systems are extracted. The intermediate schemas do not exactly match the source schemas or the target schemas. They are usually somewhere in between and contain additional fields, such as percentage numbers for use in calculations or key fields to read reference or lookup tables, that enhance the data conditioning, data cleaning, and data transformation routines. Finally, your data conversion plan must consider the flow of data throughout conversion, addressing the following issues:

Creating Conversion Specifications

After establishing a plan, your next step is to analyze the source data carefully. You should determine, on a data attribute by data attribute basis, what source data maps to what target data and what logic is required to migrate from source to target. You also identify external information such as tables that cross-reference locally used codes to industry-standard codes -- for example, healthcare procedure codes to use and/or create to transform the source data--and then design any intermediate schemas required to support the conversion plan.

Typically, you will document the conversion specifications in a word-processor document or a spreadsheet printed for customer review and sign-off. However, if you are using a data migration tool, you can document the conversion specifications directly in the tool and generate conversion specification reports for customer review and sign-off.

Either way, the conversion specifications are extremely important metadata for the data warehouse. Data migration tools automate the process of storing conversion metadata in a central place, tracking the conversion metadata over time, and exporting the conversion metadata to the data warehouse's metadata repository. If you generate the conversion routines manually, you must also generate the conversion metadata manually. The data conversion plan specifies the format of the conversion metadata as well as how you will store, update, and export the metadata.

Whether conversion specifications are generated manually or automatically into programmatic code, you should use conversion specifications to understand the data mapping and required processing logic. Programmatic code consists of six types of routines that perform the following functions:

Extract Source Data to Intermediate Schemas

In the extract source data activity, data extraction routines read the source data, convert it to an intermediate schema, and move it to a common staging area (a temporary work area in which data is held in intermediate schemas). The process of acquiring data in a common staging area greatly enhances the reusability of programmatic code. Some code is necessarily unique because of the source system's access method, platform, and language. However, the code that conditions, cleans, transforms, and integrates the data is common for all source data if that data is first brought to intermediate schemas in a common place.

You design the data extraction routines to isolate only that data that is required to migrate to the data warehouse. The design is effective for both the initial load and incremental updates of the data warehouse, and it is important because it reduces the amount of data that migrates to the data warehouse, which in turn reduces computer and network resource requirements for the conversion. The data extraction design should consider how the source system indicates changed and new data, such as time stamps or periodic archives.

Time stamps and periodic archives (such as month-end files of paid claims for an insurance company) are ideal because the data extraction routines easily identify and isolate the changed and new data. If the source system does not indicate changed or new data, you compare the source data with master files of warehouse data to find the changed or new data.

The data extraction routines usually execute within the source system's environment, performing functions that transform, convert from binary, convert from packed decimal, and compare, combine, and parse the source data. These types of data conditioning are more easily performed in the environment that created these types of data than in the target environment that might not support these functions or types of data. The conditioning process relies heavily on knowledge of the data source; structure of the data as it resides in the operational systems or in the data warehouse; structure of the data as perceived by the end user or the warehouse applications; knowledge of the rules required to identify, map, clean, transform, and aggregate the data; and security features associated with various metadata objects, such as source system logons and data warehouse database security rules.

Convert Intermediate Schemas to Load Data

Once the source data is gathered to a staging area, it is time to execute the conversion routines that might clean the data. Data cleaning ensures the data's integrity through special programs that correct data, improving the data's accuracy and overall usefulness. (See Ralph Kimball's article in the August 1996 DBMS Data Warehouse Supplement, page S15, for more information.) You might use software that corrects and enhances errant data fields such as names and addresses. Such data cleaning software is a separate step in a batch job stream, callable subroutines executed in a batch or online environment, or an external service that is not under the control of the traditional batch cycle.

For the data conversion team to implement data cleaning services, the data warehouse architecture design must accommodate constraints imposed by the data cleaning software, such as any platforms on which the data cleaning software does or does not run. You only use this software if you cannot correct the data at the source. In addition, you identify all cleaning requirements in the data conversion plan.

The following are major components of data cleaning:

Other data conversion routines transform data. The source systems usually have data attributes that require a transformation process. For example, the source system codes the gender of a person as "1" for male and "2" for female. However, the data warehouse codes the gender of a person as "M" for male and "F" for female. The transformation process ensures a consistent mapping of codes and keys between the source systems and the data warehouse.\

If you are extracting data from more than one subject area or from more than one version within a subject area, you must integrate that data into a single view. In addition, you must address data anomalies and corrections to previously migrated and loaded data.

The techniques of data conditioning, cleaning, transformation, and integration must be applied in an iterative fashion. Once you and the customer are satisfied with the condition of the data, you create load-record images for the atomic-level, dimension, and fact data.

Atomic-level data is the lowest level of data detail in the warehouse and is needed to provide the level of granularity required to effectively condition, clean, transform, and integrate the data. If the data warehouse feeds data marts that are typically designed with a star-join schema, you use the atomic-level load data to create the data mart dimension tables. If the processing for this activity is sequential, you sort the resulting files of dimension-load data to eliminate any duplicate records.

To track and generate keys for each of the dimensions of the star-join schema, you must create a key administration application. A variety of key generation and administration strategies exist. Examples are source system key integration and system-generated keys. Source system key integration transforms logical keys from multiple source systems into a unique physical key. A hospital patient might be identified with a social security number in one system and by name and birth date in another system. In the data warehouse, the keys are integrated. System-generated keys are those that the conversion system or RDBMS assigns. The conversion plan documents the strategy that is best suited for the data warehouse and its users.

To create data to load into fact tables, you generate an application that first matches the dimension-load data to the atomic-level load data. Then it populates the fact data records with the keys of the dimensions and the associated quantitative facts contained in the atomic-level data. This application reports any records from the dimension-load data that do not match back to the atomic-level load data. If the application performs properly, absolutely no mismatch conditions occur.

Aggregate Load Data

Load data is aggregated by executing a series of sorts and the previously mentioned key administration application numerous times. The result is all aggregate dimension load data and all associated aggregated fact load data defined in the data warehouse design.

Typically, you execute these functions sequentially rather than within the RDBMS for three reasons. First, external sort utilities such as SyncSort (from Syncsoft Inc., Woodcliff Lake, N.J.) are fast. Second, the new aggregate data is stored apart from the data warehouse server as part of disaster-recovery requirements. If a disaster requires a data recovery, you simply reload the aggregates. Third, you use a bulk loader to populate the target database rather than an application with embedded SQL. This method is fast and efficient.

Move, Load, and Index the Data

Next, you must move the data from the staging area to the data warehouse server, if the staging area is not on the server. Once the data is assembled on the data warehouse server, you load it into the database using the RDBMS bulk-load utility. You use referential integrity during the load process to ensure that the fact table key contains true foreign keys from the dimension tables. Fact tables normally contain several million -- if not a billion or more -- rows. Should the load data contain rows that violate referential integrity, you can easily find the offending records and correct them. Otherwise, you have little chance of even knowing about the problem.

Whether you decide to update indexes during the bulk load process or later depends on time constraints and the capabilities of the RDBMS. Several RDBMSs let DBAs segment the table indexes. Then you can drop a portion of the index, bulk load the data, and rebuild the indexes as needed. Segmented table indexes reduce the amount of time needed to bulk load and index the data. Reducing the amount of time spent in the "load window" has a positive impact on the availability of the warehouse and is especially important if the data warehouse users are located around the world in many time zones.

Quality Assure Data

You ensure the quality of the data not as a separate activity but throughout the data conversion process. The conversion plan specifies customer or end-user reviews and sign-off procedures, data validation and correction procedures, and the process for reconciling data with the source system.

You must work closely with the customer or end user to create the conversion plan and the conversion specifications. As mentioned earlier, you examine the current processing environment to plan the best route by which to move the data from the source systems to the data warehouse. The plan is documented and reviewed with the customer. In this manner, the customer knows up-front what to expect during the data conversion process. To show understanding, the customer formally signs the conversion plan.

To create effective conversion specifications, you should learn as much as possible about the source data, including its structures and the meanings of each of its fields. You should also document how the source data populates the target data, document your understanding, and review it with the customer. Not only will customers be able to validate the meanings of the source data, but they will also better understand how to relate the source data to the data in the data warehouse. To show agreement, the customer formally signs the conversion specifications.

The first place to reconcile the conversion process is after you extract the source data to the intermediate schemas. Total the number of extract records and other additive data, such as counts and monetary amounts, from the intermediate schemas. These figures tie to the totals from the source system. As a rule, reconcile the data early and often in the data conversion process, and strive to find problems as early as possible to avoid having to reexecute multiple steps of the process.

During the conditioning, cleaning, transformation, and integration of the data, you and the customer make decisions as the actual values of the data are examined. When you uncover errant data, report the findings to the customer. Together you and the customer decide to correct the data in the source system or with the data conversion routines. If you correct the data with the data conversion routines, you should also denote the corrections in the conversion metadata.

The next place to reconcile the data is after you complete the atomic-level load data. In the same manner as you did with the data extracts, total the number of extract records and other additive data. As you tie the figures from the atomic-level load data to the source systems, consider the actions you took as you conditioned, cleaned, transformed, and integrated the data. Probably, totals no longer match. However, you must reconcile the differences between the source systems and the atomic-level data.

Finally, reconcile the data after you create the fact-load data and the aggregate fact-load data. The totals from all fact tables tie exactly to each other and the atomic-level load data.

No Shortcuts Allowed

The complexities of the data conversion process change for each data warehouse implementation depending on the data warehouse design, source system structures, source data cleanliness, and source systems integration requirements. The data conversion activities, however, rarely change -- you can choose the order of the activities, but you can't eliminate any of them. The quality of the warehouse data depends on you paying detailed attention to each of the data conversion activities.


Kathy Bohn is a consultant for Plano, Texas-based EDS Health Care Industry Group, which provides a broad range of information services to the healthcare industry, including consulting, systems design, development, management, and integration. You can email Kathy at msusds01.kbohn01@eds.com.

Figure 1.


--The data conversion process.


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
June 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Friday, May 16, 1997