
In this issue, Ralph Kimball has written a feature article titled "Dealing With Dirty Data." In it, he reveals that data quality is far too often ignored. Ralph also notes that many shops acknowledge the problem but do little about it. I have spent my share of nights and weekends cleaning up messy databases, and I doubt my experience is unique. Ask 10 DBAs or developers about dirty data, and I suspect at least nine of them will groan and start telling you gruesome war stories.
If you're going to get serious about data quality, you need more than just anecdotes. If you do not yet have a data quality campaign under way, start by getting a handle on the problem. Don't just rush off in search of a solution, even if you know in your gut that something must be done. To better understand the nature and degree of your problem, start by asking questions such as:
You might think that the upstream weaknesses in data-entry programs can be solved easily. But mechanically patching some validation logic here and there may do little more than plug a few leaks. The problem is usually more systemic because most large companies have redundant applications that capture some of the same data more than once. Even if you think you fixed a data error in one application, it may still occur in another. If you're lucky, you corrected the source that feeds the data warehouse, but maybe you didn't.
Cleaning dirty data in the data warehouse is certainly necessary and valuable, but it is not enough by itself. You must also replicate the corrections back into the source databases. This is essential for several reasons. If your data warehouse is periodically refreshed with data from operational systems, your data warehouse may receive and fix the same data error over and over again. Also, any analytical use of the data directly against the operational database or against an extract or replica of the original data will be plagued by the data errors unless they are fixed at the source.
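To make the refresh problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the three sample rows, the toy rule that normalizes a state code, and the function names are assumptions made for illustration, not a description of any particular product or of Ralph's method. It simply counts how many corrections each refresh has to make, with and without writing the corrections back to the source.

```python
# Hypothetical illustration: the sample rows, the cleaning rule, and all
# function names are assumptions, not part of any real warehouse product.

def clean_state(value):
    """Toy cleaning rule: normalize a state code to trimmed upper case."""
    return value.strip().upper()


def refresh_warehouse(source_rows, write_back=False):
    """Copy rows from the source into a fresh warehouse table, cleaning as we go.

    Returns the warehouse rows and the number of corrections made.
    If write_back is True, each correction is also applied to the source row.
    """
    warehouse = []
    fixes = 0
    for row in source_rows:
        cleaned = clean_state(row["state"])
        if cleaned != row["state"]:
            fixes += 1
            if write_back:
                row["state"] = cleaned  # fix the error at its origin
        warehouse.append({"id": row["id"], "state": cleaned})
    return warehouse, fixes


def count_fixes_over_refreshes(write_back, refreshes=3):
    """Run several refreshes against one source; return the fix count of each."""
    source = [
        {"id": 1, "state": "CA"},
        {"id": 2, "state": " ca "},
        {"id": 3, "state": "ny"},
    ]
    return [refresh_warehouse(source, write_back)[1] for _ in range(refreshes)]


if __name__ == "__main__":
    print(count_fixes_over_refreshes(write_back=False))  # [2, 2, 2]
    print(count_fixes_over_refreshes(write_back=True))   # [2, 0, 0]
```

Without the write-back, every refresh repeats the same two corrections. With it, the first refresh fixes the source and later refreshes find nothing to repair. A real system would also have to decide which application owns each field before writing anything back, which this sketch ignores.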
Data warehouses typically manage transactional data from relational databases. Finding and fixing dirty data in simple character, numeric, and date fields is not always easy, but at least the data is fairly discrete. As relational and object-relational ("universal") databases store and manage more text, multimedia, and complex objects, ensuring data quality will become even more challenging. If you are not yet paying enough attention to data quality, then start now. Unless your databases are getting smaller and simpler, the problem won't go away by itself.
Internet Systems: Next Month and Next Year
The May 1996 issue of DBMS shipped with the first edition of our new Internet Systems supplement. Many of you wrote to express how useful the articles were and to ask when Internet Systems would appear again. Next month, the October issue of DBMS will ship with the second Internet Systems supplement. Beginning in January 1997, we will publish the Internet Systems supplement every other month (six times a year). There is no separate subscription to Internet Systems, so if you are not a DBMS subscriber, use the free subscription qualification card in this issue or complete the equivalent form on our Web site at http://www.dbmsmag.com. (Free subscriptions are available only in the United States.) By the way, if you missed the first Internet Systems, you can find the full text of all the articles on the DBMS Web site.