DBMS

Dealing with Dirty Data

By Ralph Kimball
DBMS, September 1996

The science of maintaining clean data in your warehouse, and why nobody talks about it.


A theme in data warehousing that is universally recognized but far too often ignored is the cleanliness of the warehouse data. From hundreds of meetings with data processing and IS staff, I have identified three consistent themes. Although these three themes stand out dramatically as the biggest problems in corporate data access, the same data processing and IS staffs that identify them are only attacking the first two. The three themes can be expressed by the following comments:

Given the universality of these comments, it is strange that entire industries are being organized around the first two issues but the third issue seems to be something that we don't want to talk about.

The database marketplace has responded to the need for data access with client/server architectures, dedicated data warehouse hardware and software, and whole families of communications schemes to connect users to their data. The query tool marketplace is an embarrassment of riches. There are dozens of ad hoc query tools, report writers, and application development environments. We are well into the second generation of powerful tools for data warehousing end-user applications with dimensional OLAP/ ROLAP tools and the hot new data mining tools. Yet the third issue, data integrity, languishes in the backwater of data warehousing. It is talked about briefly and then the door is closed. There is a definite avoidance of the topic and very few plans in place to address data integrity at the same level as those plans that address data access or query tools.

Applications in which Good Data is Critical

One of the reasons for the lack of planning to address data integrity is that IS staffs don't adequately consider the business impact of bad data. To put it more positively, we must better appreciate our opportunities for really powerful data warehouse applications that vitally depend on good data. Following is a sampler of such applications, starting with a whole series of customer-based applications.

Marketing Communications. If you want to understand who your customers are, and if you want to communicate effectively with them on the phone and via mail, you must have an extraordinarily clean customer list. You destroy your credibility by using nonsensical or misspelled addresses or by sending multiple letters to the same person. Even worse, if the address is invalid in some way, the letter never arrives.

Customer Matching. You want to find the same customer when the customer buys a second or a third product from you. Customer matching is a major issue in banking and healthcare, where separate customer (or patient) encounters are often listed separately. The average bank has great difficulty listing all of the separate accounts of a given individual, although the reverse process of listing all the individuals in a specific account causes no trouble.

Retail Householding. You want to find a group of people who comprise a family, each of whom is a customer. When you correctly identify the household, you can communicate coherently with its members. At that time you can identify the overall needs of the household and suggest an effective consolidation or expansion of products. This process is called "cross-selling." Cross-selling your own customer base is recognized as one of the most effective ways to increase sales.

Commercial Householding. You want to look at your customers to find multiple commercial organizations that turn out to be part of a larger parent organization. Many times you may not be aware that you are dealing with various "parts of a whole."

Targeted Marketing. You want to generate a mailing list by sifting through the demographic and behavioral attributes of a large customer list. The completeness and correctness of the attributes is crucial.

Combining Information Systems After an Acquisition. An increasingly common IS problem is merging customer lists and product lists from incompatible information systems after an acquisition. Organizationally, it may be easy to imagine combining staffs and moving everything to the corporate system after an acquisition, but what do you do with the data itself? Sometimes the customer list is the most valuable asset of the acquired organization.

Merging External Data with Internal Data. This problem is structurally very similar to combining data after an acquisition. Although the external data might be syndicated and cleaner than average, it will still probably not match internal data, such as product names, customer names, or geographic names.

Tracking Retail Sales. In large retail environments with hundreds of stores and hundreds of thousands of products, it is crucial to have a centrally maintained master product file with clean descriptors. A good retail product file will have at least 50 separate descriptors for each product. These descriptors are used by managers and analysts throughout the organization to group and dissect the various kinds of product data, all the way from purchasing at the back door to sales at the front door. I recently designed a comprehensive data warehouse for a large drugstore chain. A major category of product in this chain was "lozenges," and the word "lozenge" was spelled 20 different ways in the product file!

Medical Records. All of us are affected by the quality of data in medical records. We want the diagnosis and the procedure to be correct and readable, both for our medical safety and for billing and insurance purposes. A major headache in the medical insurance industry is the correct identification of the provider physician, clinic, or hospital. This is another variant of the customer matching application I discussed previously.

The previous list is, in actuality, endless. Every meaningful data warehouse application needs good data. William Weil of Innovative Systems Inc. (a data-cleaning vendor profiled later in this article) recently performed an interesting analysis of the cost of bad data. He assumes that a given customer list for a company is 90 percent accurate. Of the 10 percent of the customers with inaccuracies, 5 percent (0.5 percent of the overall file) have unusable addresses that could have been repaired. In my experience, these numbers seem reasonable -perhaps even low.

Weil goes on to assume that the annual value of cross-selling or retaining each customer is between $100 and $1000. In a big business with a million customers, 0.5 percent (or 5000 customers) will be lost because you can't find them in your database. Multiplying the 5000 affected customers by $100 to $1000 results in a direct cost of $500,000 to $5,000,000 annually for bad data. Although the detailed numbers are debatable, there is a fundamental and compelling ring to this argument. Bad data is expensive, and it is time we face the data-cleaning problem head-on.

The Science of Data Cleaning

Although data cleaning can take many forms, most of the important data-cleaning examples from the previous list of applications arise from the need for good descriptions of tangible things such as customers, products, procedures, and diagnoses. The current marketplace and the current technology for data cleaning are heavily focused on customer lists. I will use customer lists to talk about the underlying science.

Data cleaning is much more than simply updating a record with good data. Serious data cleaning involves decomposing and reassembling the data. You can break down the cleaning into six steps: elementizing, standardizing, verifying, matching, householding, and documenting.

To illustrate these six steps, consider the following fictitious address:

Ralph B and Julianne Kimball Trustees for Kimball Fred C
Ste. 116
13150 Hiway 9
Box 1234 Boulder Crk
Colo 95006

Perhaps this address has been entered in five fields called Address_1 though Address_5. There may be no reliable ordering of the parts of the address. A critical part of this address has been entered incorrectly. We will find this mistake in a minute.

The first step in cleaning this address is to elementize it: data cleaners' jargon for parsing it correctly. Elementizing the address could produce a result such as:

Addressee First Name(1): Ralph
Addressee Middle Initial(1): B
Addressee Last Name(1): Kimball
Addressee First Name(2): Julianne
Addressee Last Name(2): Kimball
Addressee Relationship: Trustees for
Relationship Person First Name: Fred
Relationship Person Middle Name: C
Relationship Person Last Name: Kimball
Street Address Number: 13150
Street Name: Hiway 9
Suite Number: 116
Post Office Box Number: 1234
City: Boulder Crk
State: Colo
Five Digit Zip: 95006

This list of standard elements is dependent on what is found when the address is analyzed. Ralph and Julianne could have been trustees for an organization rather than an individual, in which case some of the element types would have been different.

The second step is to standardize the elements. At least four of the elements must be put in a more standard form. We have already dropped "Ste" because we have recognized it as "suite." We may suspect that "Hiway 9" should actually read "Highway 9." Make this a provisional change, and in the verification pass make sure that the actual street is called "Highway 9." Also, change "Boulder Crk" to "Boulder Creek" and "Colo" to "Colorado."

The third step is to verify the consistency of the standardized elements. In other words, are there any mistakes in content? Although it may not be obvious, there is a glaring content error in the address. Boulder Creek in zip code 95006 is in California, not Colorado. Because two of the three pieces of data point to California, change the state name to California. You should probably also flag this record for further verification. For example, if you have another address instance for either Ralph Kimball or Julianne Kimball, you could verify the correct state. This would be even more urgent if you discovered a legitimate Boulder Creek in Colorado.

Now that you have elementized, standardized, and verified the address, you are in a good position to perform the fourth and fifth steps: matching and householding. Matching consists of finding either Ralph Kimball or Julianne Kimball in other customer records and ensuring that all of the elements of all the addresses are identical. Note that you handle the problem of legitimately changed addresses by assigning separate element types to "previous" and "current" addresses.

Householding consists of recognizing that Ralph and Julianne constitute a household because they share the same address, although you must be careful to exclude people who live in different apartments in a single large building. You may have information in another internal or external data source that indicates that Ralph and Julianne are married.

The sixth step of data cleaning consists of documenting the results of elementizing, standardizing, verifying, matching, and householding in metadata. This helps ensure that subsequent cleaning episodes will be better able to recognize addresses and that end-user applications will be better able to slice, dice, and understand the customer database.

All of the leading data-cleaning vendors interviewed for this article noted that these six data-cleaning steps require sophisticated software and extensive embedded expert knowledge. The expert knowledge is embedded in fuzzy matching algorithms, address parsing algorithms, and large multimillion-entry lookup tables that provide synonyms for parts of names and addresses. In other words, a serious data-cleaning system is a large software system.

The Leading Vendors

Three companies dominate the data-cleaning marketplace, and all three specialize in cleaning large customer address lists. The three companies are Harte-Hanks Data Technologies, Innovative Systems Inc. (ISI), and Vality Technology. Harte-Hanks and ISI were both founded in 1968, while Vality was founded relatively recently in 1989. The primary technology of all three companies is IBM mainframe-based, generally running under MVS. All three companies are moving products and services to PC and Unix platforms. Following is an overview of each of their products.

Harte-Hanks Data Technologies
Harte-Hanks Data Technologies is a division of Harte-Hanks Communications. The parent company offers a wide variety of services for marketing organizations, mostly oriented around customer-list processing and promotional marketing. The Data Technologies division's primary product is the Trillium Software System, a collection of C-written mainframe utilities intended for data discovery, lexical analysis, data scrubbing, and data transformation activities. The components of Trillium generally run on MVS, Unix, Windows NT, and OS/2 platforms. You can find detailed information about Trillium on the Web at http://www.trilliumsoft.com. The most interesting parts of Trillium include:

Innovative Systems Inc.
ISI is a software development and consulting company that helps organizations build and maintain data warehouses and customer information systems for improved client relationship management. ISI is a division of the Innovative Group headquartered in Pittsburgh. Detailed information about ISI's products can be found on the Innovative Group home page at http://www.innovgrp.com. ISI's main products include: Vality Technology Inc.
Vality is a Boston-based supplier of data-quality software tools and data-reengineering consulting services. For detailed information on Vality's products, call the company's office in Boston at 617-338-0300. Vality has a very rudimentary Web page, but Alta Vista lists more than 50 Web pages from other sources that also talk about Vality. Vality's main product is the Integrity Data Reengineering Tool. Integrity relies more on software sophistication than on lookup tables. This is a key differentiating factor for Integrity. Vality guarantees 99-percent accuracy in data migration projects. It claims to have the industry's strongest matching engine. Vality also recently announced a cooperative marketing vendor relationship with IBM.

The Marketplace

The three highly respected companies reviewed in this article serve a fairly narrow band of marketing-oriented clients and focus almost exclusively on the issues of name and address list processing. As such, they have developed impressive technology for this one task. Although all three companies have a fairly traditional mainframe software focus, they are clearly moving toward Unix and client/server architectures.

The main question I have is whether these companies realize how huge the data-cleaning marketplace may be. I also wonder whether they are prepared for an explosive growth of demand for multiple hardware and operating system platforms and for application areas other than name and address list processing. The data warehouse marketplace has tremendous vitality; thousands of IS organizations worldwide feel driven to build data warehouses. Countless IS organizations realize they have a data-cleaning problem but defer facing this issue in their haste to bring up their first warehouse. All that is needed to grow this marketplace exponentially is someone with a potent, widely applicable product for data cleaning and the ability to convince the marketplace that it is easy to use.

It seems very possible that the whole marketplace might suddenly decide that it needs a data-cleaning capability that:

I suspect that all three of the companies I discussed would give their eyeteeth to address this market opportunity and that all three are moving in this direction. However, data cleaning has been a somewhat quiet backwater of data warehousing. The fierce stimulation of the mainline marketplace has not been applied to these companies yet. To put it another way, there may still be room for one or more new players to enter this marketplace.

The traditional data-extract providers have largely left advanced data cleaning to Harte-Hanks, ISI, and Vality. Although the extract providers talk about data cleaning, they have not yet seized upon advanced data cleaning as a proprietary strength of their products. (See my Data Warehouse Architect column in the July DBMS for a discussion of Prism, Carleton, and ETI.) I would not be surprised to see all of the data extract providers move into this area. Prism, at the moment, lists ISI as an alliance vendor on its Web page and describes ISI as bundling Prism's Data Warehouse Manager product into ISI's overall solutions for customer information management systems. This is the only obvious collaboration between the extract providers and the data-cleaning vendors that I have been able to discern from the companies' Web pages or their printed literature. Vality takes a more pugnacious stand, stating in its company literature that a good data-cleaning tool may well "obviate the need for a separate data extract tool."

There are additional market forces at work that will make the value of data cleaning more obvious. The Year 2000 crisis has everyone trying to analyze how his or her information systems may have dependencies on date processing. The data-cleaning companies have begun to suggest that their tools can be used to ferret out dates from all types of legacy data and thereby give a heads-up warning to IS about all of the places where dates may arise.

Data mining, the hot new trend in data warehousing, is the scanning of large amounts of data to reveal unexpected patterns or correlations. The analysis of demographic or behavioral descriptions of customers is an important part of data mining. Obviously, all of the arguments in this article directly drive the quality of a data mining analysis. Increasingly, the data mining companies are talking publicly about the value of clean data. Perhaps one or more of these companies will try to enter the data-cleaning marketplace as well. In a way, when a data miner finds something exciting, it's either a valuable insight into the business or a bogus finding from bad data.

Data Integrity Drives Business Reengineering

It is worthwhile to step back from this close look at data cleaning to get some perspective. Why is data dirty in the first place? Much of the time it is because of poor systems or poor practices at the moment of data capture. Often, someone such as a salesperson, field adjuster, or buyer is stuck with the job of entering critical data. That person may have an awkward system with no built-in support for data integrity. For example, a buyer may simply have a "Product Description" field in which to enter the characteristics of the product he or she has just acquired. No wonder there were 20 spellings of the word "lozenge" in the drugstore.

At the same time, the person responsible for data entry may not be motivated to drive data quality to the 100-percent level. If your job is to drop data down a pipe and no one ever tells you that your efforts are worthwhile, then you aren't going to do a perfect job for very long. You probably won't even know what a perfect job entails.

The answer to the problem of data-entry quality is a kind of business reengineering, which requires multiple steps such as:

Data warehouse and data-cleaning tools play a unique role in defining the need for this type of business reengineering. The data warehouse is the perfect place to see the value of good data. Sometimes, paradoxically, the data warehouse must make imperfect data available in order to show the organization how valuable perfect data would be.

Data-cleaning tools alert IS to the exact issues involved in clean data. The final architecture for data delivery will ideally be balanced between data originally entered as accurately as possible and powerful data-cleaning systems operating downstream in the data extract process to bring data up to the shining goal of being 100-percent correct.


Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996) is now available. You can reach Ralph through his Web page at http://www.rkimball.com.


* Harte-Hanks Data Technologies (a subsidiary of Harte-Hanks Communications Inc.), 25 Linnell Circle, Billerica, MA 01821; 508-663-9955 or fax 508-667-7297; http://www.harte-hanks.com.
* Innovative Systems Inc. (a division of the Innovative Group), 790 Holiday Dr., Pittsburgh, PA 15220; 800-622-6390, 412-937-9300, or fax 412-937-9309; http://www.innovgrp.com.
* Vality Technology Inc., 286 Congress St., Boston, MA 02210; 617-338-0300 or fax 617-338-0338; http://world.std.com/~vality/vality. html.
Table of Contents - September 1996 | Home Page
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to mfrank@mfi.com
Updated Friday, September 20, 1996