DBMS
 

 


Tools for Traveling Data - A Specialized Breed of Tools Eases the Task of Extracting, Migrating, and Cleansing Data Before Loading it into Your Warehouse Database. By Joseph Williams.

One of the most important steps in building a data warehouse is loading data into the warehouse database. Unfortunately, one of the most overlooked and underscoped steps in building the data warehouse is also loading data into the warehouse database. Most companies grossly underestimate the complexity of the data load process and the effort required.

Businesses look to the decision-support capabilities of the data warehouse to guide them strategically through the marketplace maze. Unreliable, "dirty," and otherwise bad data can result in misinformed decisions that have direct adverse effects on profit. No business would purposely strive to decrease its bottom line; however, many are doing just that by neglecting to address the extraction, transformation, and data load process thoroughly. Perhaps even worse are the companies that give only a cursory glance at data quality and then proceed to bulk-load dirty data and base important business decisions on it.

These oversights often stem from good intentions, because companies eagerly anticipate the many benefits of the data warehouse. The promise of finding more value in existing data is exciting. Most companies think that this discovery process begins after the warehouse is populated, when analysts begin their work with data mining tools and sophisticated query tools. I submit to you, however, that the initial loading of data can yield a dramatic, beneficial discovery process that has a tremendous impact on the business. Loading source data should be the lengthiest and most carefully planned step of warehouse development, but instead it often becomes a simple replication process that duplicates and even exaggerates inconsistencies in the source data.

Many tools on the market facilitate extracting, cleansing, and transforming data as well as analyzing and ensuring data quality. These tools account for a large percentage of the money spent annually on data warehousing. Nearly one in three vendors at the DCI Data Warehouse Conference held in February 1997 in Orlando, Florida, offered some combination of these services. The problem is that no one tool provides all of these services, so in order to maximize the budget dollar, it is extremely important to understand the categories of tools and what each tool has to offer. This knowledge will help you choose the right tools for the job and use existing resources effectively.

In this article, I discuss three broad categories of tools that help you load a data warehouse: data quality, extraction and transformation, and cleansing. There is some overlap, because most tools do not fit neatly inside one category.

First Things First

The first step in loading the data warehouse is to assess the current state of existing source data. Several factors affect this source data, and loading the warehouse repository is a process that varies greatly according to the condition and status of this data -- rarely is data coming from a single source. Multiple sources necessitate matching, blending, and massaging data to shape the final product into something that is usable and meaningful.

Data quality refers to the condition of the data being examined and corrected. Data must be accurate to ensure that the decisions being made based on it are good business decisions. Additionally, data must be relevant to the business need and consistent across redundant sources. Finally, data must be complete and not lacking required information. The data quality audit is an assessment of the data itself and the business rules associated with that data. Once the business rules are defined, the data can be measured for consistency and completeness against these rules. A data quality tool must not only assess the current status of the data but provide control of improving and monitoring that data. QDB Solutions Inc., WizSoft Inc. (a subsidiary of WizSoft Ltd.), and Unitech Systems Inc. are three companies that offer data quality auditing packages.

QDB's Analyze product begins by assessing the quality of source data and comparing it to the destination system requirements and answers the following questions:

Larry English, an independent data quality expert and the president and principal of Information Impact International Inc., believes that we must take the focus off cleansing dirty data and shift our mindset to designing quality data. This is the only way, English believes, to permanently relieve the effects of dirty source data. QDB/Analyze addresses the source data, the migration process, and the destination system requirements. The initial baseline assessment is based on business rules, new system requirements, data models, and data transformation rules. This assessment not only gives you a clear picture of existing data quality but also identifies defects such as incompleteness, duplication, invalid formatting, and invalid values. QDB/Analyze software makes specific recommendations on how to clean up and organize your data to prepare it for extraction and conversion.

During the data migration process, QDB/Analyze tests the data quality to ensure successful data translation and transformation and to ensure quality throughout the migration process. To aid in the conversion process, QDB/Analyze creates exception reports pinpointing data that will require manual intervention, provides control reports of the conversion process that monitor the level of defects and observe trends in a database over time, provides a baseline data quality report that provides selected metrics such as duplicate records, and creates a plan for ongoing data quality monitoring and management. The product also validates and monitors data quality trends, synchronization with source data, and adherence to business rules for the new system.

Those who cannot bring themselves to purchase an enterprise-level tool such as QDB/Analyze should consider WizRule by WizSoft. The tool is priced at approximately $500, and it will provide a great return on your investment. WizRule is a rules-discovery tool that searches through every record in a table and determines the rules that are associated with it. The user can determine which fields are examined. An example of a rule might be, "If Customer is Williams or Carr, then City is Atlanta; probability 0.95. The rule exists in 612 records." WizRule then identifies records that deviate from the established rules. WizRule defines a rule as an event that happens in 95 percent of occurrences. To have more control over how rules are defined, users can change that threshold according to their needs.

WizRules supports ASCII files and any ODBC-compliant database, so it works well as a sampling tool to get a quick diagnosis on smaller projects. WizRules permits immediate access to deviated records, allowing for discrepancies in the source data to be reconciled on the fly. WizRules is a good tool for small projects and for sampling larger databases.

The Move

The simplest way to move data into the warehouse repository is to replicate the data that resides in the source databases. This is fast, simple, and very dangerous. Many companies try this approach, only to find later that the ease of data population is more than offset by the arduous process of cleaning up bad data in the destination database. One way to circumvent this scenario is to use proven data extraction and migration tools along with dependable processes and techniques. (Many companies program data migration applications in Cobol or other development tools.)

Four products lead the extraction and transformation charge: Evolutionary Technologies International's (ETI) Extract, Carleton Corp.'s Passport, Prism Solutions Inc.'s Warehouse Manager, and Platinum Technology Inc.'s InfoSuite. The common strength of these products is their ability to automate the manual process of extracting, transforming, and loading source data. Automation can dramatically reduce the amount time and the number of resources devoted to the load process, even when you consider the additional time needed to evaluate, install, and learn new tools.

The Prism solution provides an open platform to develop data warehouses and data marts in an evolving environment. Prism has a three-piece approach, providing source modules, target modules, and data capture modules. The Source modules provide the functionality to pull out the desired source data from heterogeneous client/server and legacy systems. These systems include DB2, Tandem Enscribe, IMS, Informix, Oracle, SAP R/3, Sybase, NCR Teradata, and VSAM. Data from multiple source systems can be combined and sent to a single destination or reworked and sent to multiple outputs. Prism makes it easy to verify and correct table and field attributes.

Target modules create the output files that will populate the warehouse repository and data marts. Additionally, Prism target modules interact with the load utilities of the target database to generate the instructions necessary to perform the initial load of the data as well as manage the incremental changes needed for warehouse maintenance. Prism supports the parallel data load capabilities of leading database vendors. Supported target databases include DB2, Informix, Microsoft SQL Server, Oracle, Red Brick Warehouse, Sybase, and Teradata. For all supported databases, Prism facilitates creating the data definition language (DDL) to create target tables and the appropriate load statements.

The Prism changed-data capture modules automatically respond to changes in the source databases. Supported changed-data source databases include DB2, IMS, and Oracle. These captured changes can be applied to the warehouse repository or data mart at user-specified times with legacy databases. The changed-data capture modules also minimize the impact to legacy systems through the processing of archival log files, let users set parameters for extracting relevant records for specific time frames, and capture committed database changes only. Prism works on Windows 95 or Windows NT workstations that have at least a 486 processor and 24MB of RAM.

ETI specializes in Enterprise Data Integration (EDI), and the company actively markets to many disciplines outside of the data warehouse area. Many of the projects that rely on ETI solutions are based around SAP, PeopleSoft, and Dun and Bradstreet Software (acquired by Geac Computer Corp. in March).

The ETI Extract tool suite gives users great flexibility in creating transformations. All transformations are created within the ETI tool itself and automatically stored in the metadata repository. ETI Extract tightly integrates all activity of the extraction and transformation process into the metadata layer.

ETI Extract generates applications that will manipulate and format data from any database or file system. This includes mainframe hierarchical files, standard relational tables, and proprietary application file systems such as SAP. Not only does the ETI suite create the transformation and migration programs in many languages including Cobol, C, ABAP/4, and RPG; ETI also creates the associated scripts and JCL code. Once the ETI tools are configured, you can manipulate data in a heterogeneous environment without actually writing code.

The program that ETI creates moves data directly from the source database into the destination database or a staging area. This eliminates the need for an intermediate platform or transport mechanism if it is not desired. The ability to create programs in any language also lets you take full advantage of the features available on any given platform. You can even distribute processes on different platforms to maximize the benefits each one has to offer.

ETI uses a straightforward graphical point-and-click interface as its development editor. With this interface, you can deal with almost any selective and conditional data retrieval, transformation, and data load. The tool also supports multistep conditional data transformations, joins, calculations, aggregations, and summarization between records, fields, or individual data elements from multiple data sources.

Carleton Passport is another metadata-driven tool that collects, conditions, and populates data from multiple sources including DB2, IMS, Oracle, Sybase, and VSAM. Passport uses a GUI to automate the process of extracting, transforming, cleansing, formatting, and mapping data. Passport lets you create applications that will access information on a mainframe, midrange, PC workstation, or Unix environment.

During the transformation process, you can reformat data into new datatypes that support your end needs. It is also easy to manipulate data values and datatypes based on existing conditions. Passport lets you perform detailed calculations, summarizations, and aggregations. Because of the tight metadata integration, Passport provides a comprehensive audit trail and an abundant supply of audit reports.

Passport supports up to 100 different data sources that can be combined in one consolidation program. Passport also supports 100 different output destinations from a single pass of an input file.

Unlike ETI and Prism, Carleton also bills its Passport product as a data cleansing tool. Passport supports the following cleansing functions:

One thing I particularly appreciate is Carleton's vision of the integrated metadata environment. Carleton has reached beyond a single-vendor tools approach and embraced a best-of-breed mindset centered around metadata management. (I discuss this in more detail later when I address the current metadata dilemma.)

Platinum provides a suite of tools that lets administrators selectively choose and scrub operational data and then move it to a destination database. InfoPump provides bidirectional replication of data between heterogeneous databases. This is primarily geared to the client/server community and supports Sybase SQL Server, Microsoft SQL Server, Oracle, Informix, IBM DB2, Lotus Notes, and other ODBC-compliant databases. InfoPump provides a scheduler that allows extracts to run automatically at a given time. Additionally, items such as downloads can be event-driven so that they respond at the proper time to actions in the system.

For nonrelational access, InfoHub is an application development/database access tool that provides direct SQL access to nonrelational mainframe data through client/server applications. InfoHub provides SQL access to a wide range of MVS relational and nonrelational databases from PC-based applications or front-end query and reporting tools. InfoHub supports ODBC, allowing access from most popular client platforms. InfoHub can also be used in conjunction with InfoPump to move data between MVS databases and client/server platforms, and with InfoRefiner for data replication from Adabas, IDMS, or Datacom.

InfoRefiner automates the extraction, refinement, and movement of large volumes of legacy data to client/server platforms for decision support and general data warehousing. An example of this might be phasing out an IMS database and replacing it with DB2. InfoRefiner will manage mapping the data and migrating it from IMS to DB2 tables.

Platinum has strong support for mainframe-based legacy systems. I particularly appreciate the way the Platinum family of tools is encapsulated. Each of the tools can be used by itself, in concert with other Platinum tools, or in conjunction with a mixed vendor solution.

What's the Difference

All four of these extraction and transformation suites are well suited for handling the job of loading source data into the warehouse repository or data mart. Each toolkit supports a long list of source and destination databases and will easily handle rudimentary transformation scenarios such as summarization, aggregation, and simple calculations. All four products can read from multiple sources and move that data into different outputs in a single pass. Platinum is a strong candidate for situations in which much of the source information resides on a mainframe environment.

Although all four of these products rely heavily on a tight metadata-driven approach to operation, the implementation is different. Prism lets you specify the database engine for the metadata repository. Platinum gives you two options: Repository/MVS, in which the application and database servers reside on DB2, and Repository Open Enterprise Edition (OEE), which supports Oracle 7.x and Sybase System 10 and 11. ETI uses an object-oriented database for its repository and considers one of its differentiating characteristics to be its metadata exchange and metadata merge functionality. Carleton is growing its MetaCenter warehouse management technology, which is centered around metadata repository and metadata dictionary. The company's evolving strategy is open integration through the MetaCenter.

Prism's Changed Data Capture modules appear to do the best job of handling changes in the source data automatically. All of these products can be configured to address this issue manually through transformation procedures if necessary.

Marts and Martyrs

Data mart management tools are software packages that are extremely useful in populating the database and data marts. These tools are designed and performing at their best when used for managing data marts and small- to medium-scale data warehouses. The two major players in this niche are Sagent Technology Inc.'s Data Mart Solution and Informatica Corp.'s PowerMart Suite.

Sagent provides a suite of transformation objects for extracting, transforming, and loading data into the warehouse database or data mart. Data Flow objects let you perform visual and manual SQL queries, create output grids, join columns from multiple input sources, combine multiple sources into one result set, split a result set into two duplicate sets for further processing, and save to a table. The Analysis Transforms objects include ranking, sorting, averages, and total.

Sagent supports databases that use star schemas and provide transform objects to facilitate populating fact and dimension tables. Sagent provides Batch Load Transforms for Microsoft, Oracle, and Sybase to assist in populating the destination database. The Sagent Data Mart Solution does not have robust internal data translation capabilities for performing calculations or value substitutions to source data. However, you can add custom transforms using C++ or Microsoft Visual Basic Scripting Edition (VBScript), which is a subset of the Microsoft Visual Basic programming system.

The Informatica PowerMart Suite is an integrated suite of client/server software tools for building deploying, and managing enterprise-scalable data marts. Informatica provides a graphical data extraction and transformation tool that is composed of three components: the source analyzer, warehouse analyzer, and transformation designer.

The source analyzer interprets the database schema and provides the formats and structure of the operational source data. Schema information could be in the form of an RDBMS catalog or a flat file.

The warehouse designer creates the data mart model. A wizard is included to walk you through the process of data warehouse design. PowerMart provides drag-and-drop capabilities for physical data mart design and modifications, as well as index and additional table constraint support. Once design is complete, PowerMart will create the necessary DDL for the target database to create tables, indexes, and constraints. The transformation designer provides a point-and-click interface for building the transformation procedures. PowerMart supports conditional logic, calculations, and the creation of aggregate and summary tables. All mapping and transformation information is stored in the metadata repository. These three components -- the source analyzer, the warehouse designer, and the transformation designer -- all combine to make up the Designer module of PowerMart. This module visually defines mappings and transformation in the PowerMart suite.

Sizing Up the Mart Tools

Both tools provide a solid mechanism for populating and managing the smaller warehouse and the data mart. I believe that Sagent provides a more simplified transformation interface that provides a great deal of bang for the buck. The Sagent product also supports the ability to perform many operations in a transformation that Informatica requires multiple passes to perform.

The Cleansing

It is crucial to the process of loading the data warehouse that you not just rearrange the grouping of source data and deliver it to a destination database, but that you also ensure the quality of that data. Data cleansing is vitally important to the overall health of your warehouse project and ultimately affects the health of your company. I'd like to look at a couple of the cleansing tools in detail and then briefly describe some of the major forces in the name- and address-cleansing arena.

Integrity from Vality Technology Inc. is a quality-analysis and cleansing tool that investigates, standardizes, conditions, enriches, transforms, and integrates data from a number of legacy files and source systems. Integrity employs a four-phase reengineering approach to data quality: investigation, conditioning and standardization, integration, and survivorship and formatting.

Integrity places a great deal of emphasis on the metadata. A typical survey to examine metadata parses every value in a field for unique business words. The results from this type of analysis feed the warehouse data model and ensure that it accurately reflects the way a company does business. This searching through data dictionaries, copybooks, and meta tools is known as metadata analysis.

Integrity works on the basis of discovery-based processing, as opposed to rules-based processing. In rules-based processing, the business rules are entered and then the data is checked against the rules. This limits the analysis to only the constraints and rules that you know about. Discovery-based processing makes no preliminary assumptions about your data but rather uses pattern processing to analyze source data. Using this method will reveal new and essential business metadata that would go undiscovered using rules-based techniques. One company used this process to discover redundant entries in a chemical formula field to drastically reduce levels of physical inventory. Another differentiating factor for Integrity is the fact that it is not limited to name and address fields. Because it works on the basis of pattern-processing, any field type can be searched and analyzed.

Integrity complements data extract products because it is used as a preliminary step. If we are going to build quality in, it is important that we clean the data before it is passed to the extraction tools for migration. In some cases with simple environments, Integrity can eliminate the need for extensive extraction and transformation steps and enable simple load utilities to be the only migration tool needed.

Integrity runs on IBM MVS mainframes, Unix, Windows NT, and AS/400 servers. Because the input and output files are ASCII, character-delimited files, the data feeds are source- and target-independent. Another tool that possesses extraction and transformation capabilities as well as data cleansing capabilities is Apertus Technologies Inc.'s Enterprise/Integrator. Enterprise/Integrator supports relational, hierarchical, network, and object data models. The tool uses Object Query Language (OQL), a declarative, SQL92-based rule-specification language. The metadata repository is based on Object Data Management Group (ODMG) standards.

The first step in the process is mapping and transforming source data. Data maps are used to define the conditions where a record should be transformed and to store it in a logically equivalent record in another database. OQL is used to write all transformation rules; the language contains a rich set of data conversion functions.

Creating validation rules is the next step in the Enterprise/Integrator model. These rules tell us the condition of the source data. Again, we use system-defined functions and tables as part of the validation procedure. The next step matches common customers across heterogeneous source systems using fuzzy logic that is fine-tuned by the user. Fuzzy logic will find the closest matches to requested information, even if no exact matches exist. This can be an important factor in cleaning names and addresses that can potentially be spelled any number of ways.

The last step prior to mapping and transforming data to the target database is merging, cleansing, and enhancing the data. Enterprise/Integrator lets you specify rules to merge the matched set of records into a consolidated view. You can select from several methods to ensure that you are using the best possible source of data. You can even set up merging and cleansing rules that create a source hierarchy that considers each aspect of consolidated data. Finally, the selected, validated, matched, merged, cleansed, and enhanced data is mapped and transformed to the target database using several available methods to populate the target database.

Enterprise/Integrator is a powerful tool. It provides good scheduling capabilities and simplifies the burden of warehouse maintenance. Rules that are created within Enterprise/Integrator are independent of any specific and particular use of the rules. This makes these rules reusable across the project. Enterprise/Integrator also supports metadata versioning to the degree that data maps, schemas, and projects are all independently versionable. Finally, OQL is a capable language but requires someone fluent in it to give Enterprise/Integrator the power it needs.

Constellar Corp.'s Constellar Hub is a tool based on a hub-and-spoke architecture in which one or more multithreaded transformation hubs perform transformation that is based on defined and developed rules. These transformation hubs run as servers on Unix or Windows NT systems, and they make extensive use of Oracle tools, databases, and recovery features for handling data and metadata. Spokes are merely data paths between transformation hubs and source or target data. Business rules are defined using Transformation Definition Language (TDL), a high-level procedural language that generates Oracle PL/SQL statements and, in some cases, Cobol.

Constellar uses a graphical point-and-click interface for defining business rules and mapping source data. Constellar Hub was also designed to accommodate large databases. This tool provides good multiplatform support, as well as support for parallel processing, multiple simultaneous sources and targets, and VLDBs.

Rounding Out the Field(s)

Several products emerged from the widespread importance of mailing lists, and they have attracted vendors with a special focus on the unique and ubiquitous needs related to mailing-list cleanup. The following is a description of the major products in this industry.

NADIS by Group 1 Software Inc. uses expert system technology to process customer data. NADIS, which stands for Name and Address Data Integrity Software, is comprised of three products: ScrubMaster, SearchMaster, and Onlooker. ScrubMaster structures and standardizes names and addresses so that you can manage and model your data. This is accomplished by identifying every element in a name or address file and converting it to the Universal Name and Address data standard. This provides a quantitative measure of name and address integrity. SearchMaster takes the output of ScrubMaster and analyzes the relationships that exist in the data. Onlooker cleans and processes these name and address transactions and can be called from other applications such as your order-entry system. Customization and integration with existing systems is one of the strengths of NADIS. NADIS also delivers error reports and matching reports to help you clean name and address data.

id Centric (a division of Postalsoft Inc.) has three products that cleanse, enhance, and match data in customer-based databases. DataRight works with name, title, business-name, and address data integrity issues. ACE is used for address parsing, geocoding, address handling, and data enhancement. ACE also consolidates household-level data. Merge/Purge is used for the matching and consolidation of database information. Together these tools and services enhance, validate, and consolidate customer data to achieve a single customer view. id Centric supports a number of built-in transformation capabilities and also supports callable C library products.

Trillium from Harte-Hanks Data Technologies (a subsidiary of Harte-Hanks Communications Inc.) identifies and validates customer profile information and standardizes and verifies customer data. Trillium provides a base set of rules that provides a strong foundation for database analysis. You can also create, save, and reuse custom rules to broaden the capabilities of the existing rules. With the use of a Census Geocoder add-on product, Trillium can validate known addresses down to the census block group level. Trillium also provides householding capabilities.

Innovative Systems Inc.'s Dictionary, Match, and Analyzer products identify, standardize, and verify customer data. This is accomplished by processing records against word, phrase, and pattern tables that contain industry, language, or geographic specific data. ISI's Match product identifies duplicate names and records The Analyzer tool analyzes and enumerates data integrity problems in a customer information database. The Edit and Scrub products provide a good cleanup facility for all customer data. ISI has done a good job of integrating the capabilities in these five products and continues to be a major force in customer data cleanup.

Tool Selection

Tool selection can be time-consuming and frustrating. There is no single tool that does everything. Many of the tools I discussed in this article can, and should, be used in conjunction with one another to provide the desired results. For example, you may start the process with a quality audit using QDB and then feed an extraction tool such as ETI Extract Suite.

When selecting a tool, your first priority should be to know what you are attempting to accomplish. This should be a natural by-product of designing and planning the data warehouse. Next, set an evaluation period and stick to it. Tool evaluations can go on forever unless you realize that there is not a perfect tool; commit to finding the one that most closely gives you what you need. It will help if you have some established, written guidelines for the product evaluation. Carefully consider metadata management. As more and more products are added to the company's list, there is a good chance that more and more metadata repositories are being added as well. Managing multiple metadata repositories can be tedious and frustrating. You should not discount or skip the initial quality audit step. Finally, select a tool and move forward. No tool does everything, but it will do what it is designed to do. After you know where you are going, know where you are, and know the criteria for the tools selection, make a decision and enjoy the journey.


The accompanying product chart contains information on the companies mentioned in this article.


Joseph Williams is database developer based in Atlanta. You can reach him at jewill@ mindspring.com.
Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
June 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Friday, May 16, 1997.