Increasing Data Integration Exposes X-rated Data Quality Problems
As more and more systems are moved from batch to online designs, or are replaced entirely by product suites and ERP systems, the problem of dirty data is becoming more apparent. In the past, bad data was rarely exposed to the outside world. Now that systems are so interconnected, polluted data multiplies quickly. My most recent experience with this came when Clickstream Data Warehousing was published.
At the time, the information that showed up on the Amazon and Barnes & Noble web pages for the book contained errors. We got the errors corrected quickly enough, but they reappeared a week later. It turned out the publisher had made mistakes entering the information, and that data propagated from their system to the distributors' systems, and from there to several online booksellers. Any correction made downstream was overwritten whenever new updates came in, so even fixing the data at the distributor did not guarantee the errors wouldn't recur.
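The overwrite cycle behind this can be sketched in a few lines. The records and the sync function here are hypothetical, not any real publisher's or retailer's feed, but the pattern is the one described above: a full-overwrite feed from upstream silently clobbers fixes made downstream.

```python
# Hypothetical product records; the field names and values are invented.
publisher_record = {"sku": "VID-1001", "title": "Fighting of the Sholin Monks"}   # typo at the source
retailer_record = {"sku": "VID-1001", "title": "Fighting of the Shaolin Monks"}   # corrected downstream

def naive_sync(upstream: dict, downstream: dict) -> dict:
    """Full-overwrite feed: every upstream field replaces the downstream copy."""
    downstream.update(upstream)
    return downstream

# The next scheduled feed reintroduces the source error, undoing the fix.
naive_sync(publisher_record, retailer_record)
print(retailer_record["title"])  # the typo is back
```

A smarter sync would track which fields were corrected downstream, or better yet, the fix would be made once at the source so every feed carries clean data.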
This kind of problem happens all the time with product information, and in many different industries. One of my favorite examples of propagating bogus product data is also from Amazon. The problem: the image references for a batch of Shaolin Kung-fu movies pointed not to the video covers, but to softcore porn videos. A query for, say, "Fighting of the Shaolin Monks" turned up this page as a result [screen capture, 366K]. Here's a much smaller closeup of the video box for Fighting of the Shaolin Monks. This is a style of Kung-fu I'm not familiar with.
Studies in the direct mail industry say the industry-wide costs of erroneous mailings caused by bad data run into the hundreds of millions of dollars. Not too long ago, the state of Alaska had a single mailing problem that ended up costing somewhere between $65,000 and $95,000 to correct.
Even though bad data can lead to commercial loss (or possibly unexpected sales growth in the teenage male segment, in the movie example), most companies don't have a data quality program in place. This is not necessarily bad, since data quality is only as good as the processes that produce and handle the data. Putting a program and people in place to correct data quality problems is not normally one person's job: it's a process problem that requires changing software development, systems integration, and data management practices.
Without this perspective, it's easy to fall into the trap of buying data cleansing tools to scrub the data into good shape while never fixing the source of the problems. This is a lot like building a filtration plant for your water supply instead of stopping the pollution upstream. It costs more, and it treats the symptom rather than the cause.
Posted by Mark Wednesday, July 02, 2003 12:24:00 PM |