XP: Not for Data Warehouses
I've always been skeptical of development methodologies. They ignore the fact that different organizations develop software under different constraints and in different ways. What's important is the development process, and that it address the needs of the organization while managing the process well enough to produce useful, reasonable-quality goods.
Methodologies occasionally work if they are focused on specific problem domains, such as the data warehouse development methodology espoused in The Data Warehouse Lifecycle Toolkit. Methodologies should be like training wheels: you use them to learn the problem domain and ways to go about solving the problems in that domain, then you take the wheels off and evolve your process. More often, methodologies are top-down efforts to fix problems that are caused by ham-handed management.
That said, I agreed to let a development manager try out XP on portions of the ETL development for a data warehouse project. The first problem was that the "user" was really me, since the actual users had already specified the requirements from which we built the dimensional models. That meant the user focus of XP came down to following the data mappings that populated dimension tables with the proper data.
The result? Simple tasks took three times as long as they would have taken in a more standard fashion. The total lack of documentation meant that when data anomalies popped up, the developers had to read code to figure out how to deal with them. Some simple documentation could have specified that transactions X, Y and Z were handled, so that when transaction A showed up we would have known something had been missed in the data feed. User stories were basically "get data from these places to those places, and do these things to make sure it's clean", not much different from the data mapping rules and flowcharts.
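As a rough sketch of that "simple documentation" idea: even a short, explicit list of the transaction types a load handles lets the code flag anything unexpected. The type codes, the record layout and the `classify` helper are all hypothetical, made up for illustration rather than taken from the project.

```python
# Hypothetical transaction type codes; the real feed's codes are not given here.
HANDLED_TYPES = {"X", "Y", "Z"}


def classify(transactions):
    """Split transactions into handled ones and ones with undocumented types."""
    handled, unknown = [], []
    for txn in transactions:
        if txn["type"] in HANDLED_TYPES:
            handled.append(txn)
        else:
            unknown.append(txn)
    return handled, unknown


# Made-up feed: transaction 2 has a type nobody documented.
feed = [{"id": 1, "type": "X"}, {"id": 2, "type": "A"}, {"id": 3, "type": "Z"}]
handled, unknown = classify(feed)
print(unknown)
```

Anything that lands in `unknown` is a signal that the feed has changed or that a case was missed, without anyone having to read the ETL code to find out.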
The worst part was what should be XP's strength: testing. This was a joke because the expected results of a test require that you pull production data and process it to get the correct output. Without the ETL program to generate that output, you have to work out the results manually based on your understanding of the source data. That's fine when you can eyeball the data and work out the desired results, but it does not account for data quality problems. A few bad values in a column can cause a join to fail, so that data goes missing, and a test built from hand-checked data will never catch that.
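To make the join problem concrete, here is a minimal sketch in plain Python. The table contents, the column names and the `inner_join` helper are invented for illustration. The hand-built test data joins cleanly, while a single stray space in a production-like key makes a row disappear without any error.

```python
def inner_join(facts, dimension, key):
    """Inner join: fact rows whose key has no match in the dimension are dropped."""
    lookup = {row[key]: row for row in dimension}
    return [{**fact, **lookup[fact[key]]} for fact in facts if fact[key] in lookup]


customers = [{"cust_id": "C1", "name": "Acme"}, {"cust_id": "C2", "name": "Bolt"}]

# Hand-built test data: every key is clean, so the test passes.
test_sales = [{"cust_id": "C1", "amount": 10}, {"cust_id": "C2", "amount": 20}]
assert len(inner_join(test_sales, customers, "cust_id")) == len(test_sales)

# Production-like data: a trailing space in one key makes that row vanish silently.
prod_sales = [{"cust_id": "C1", "amount": 10}, {"cust_id": "C2 ", "amount": 20}]
joined = inner_join(prod_sales, customers, "cust_id")
dropped = len(prod_sales) - len(joined)
print(dropped)
```

The unit test is correct as far as it goes. It simply never sees the kind of value that causes the loss.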
The developers couldn't develop every possible test case for full-coverage testing because a simple dimension extract might pull 15 columns from 6 different tables, joining via 8 different columns, with hundreds of potential data values for each column. The combinatorial explosion of data values and relationships makes this extremely difficult, not to mention that many of the test cases would exercise error conditions that are so unlikely in production data that it's hardly worth trying to catch them.
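A back-of-the-envelope calculation shows the scale. The figures below are assumptions chosen for illustration: 100 distinct values per column (the low end of "hundreds"), columns treated as independent, and a test harness that somehow runs a million cases per second. The joins and table relationships are ignored entirely, which only makes the estimate kinder.

```python
columns = 15
values_per_column = 100  # assumption: low end of "hundreds"
combinations = values_per_column ** columns

tests_per_second = 1_000_000  # assumption: a very generous test rate
seconds_per_year = 60 * 60 * 24 * 365
years = combinations / (tests_per_second * seconds_per_year)

print(f"{combinations:.1e} combinations, about {years:.1e} years to run them all")
```

Even under these friendly assumptions, exhaustive coverage is out of reach, so any test suite is a small sample of the possible inputs.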
Lastly, real production data changes over time so there's no way to guarantee that today's test cases cover tomorrow's production data. ETL programs don't have the luxury of controlling the inputs, only the outputs. The primary goal is winnowing bad data so that only the good gets through and the bad is flagged with a reason for rejection so it may be corrected at the source.
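A minimal sketch of that winnowing step might look like the following. The `winnow` function, the two checks and the row layout are hypothetical. The point is only that each rejected row carries a reason that can be sent back to the source system.

```python
def winnow(rows, checks):
    """Return (good, rejected); each rejected entry records why it failed."""
    good, rejected = [], []
    for row in rows:
        # The first failing check supplies the rejection reason.
        reason = next((msg for test, msg in checks if not test(row)), None)
        if reason is None:
            good.append(row)
        else:
            rejected.append({"row": row, "reason": reason})
    return good, rejected


# Hypothetical data quality rules as (predicate, rejection reason) pairs.
checks = [
    (lambda r: r.get("cust_id") is not None, "missing cust_id"),
    (lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
     "amount is not a non-negative number"),
]

rows = [
    {"cust_id": "C1", "amount": 10},
    {"cust_id": None, "amount": 5},
    {"cust_id": "C2", "amount": -3},
]
good, rejected = winnow(rows, checks)
for entry in rejected:
    print(entry["reason"], entry["row"])
```

Because the rules run against whatever arrives, they keep working when tomorrow's production data looks different from today's test cases.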
The non-XP half of the development team had all the core scheduling, dependency checking and logging code and three dimensions done before the first dimension under the XP process saw the light of day. That one dimension passed all its tests, but it failed the first time it hit production data because of a data quality problem.
We stopped using XP at this point, much to the relief of the developers. The kicker? These developers were all trained on location by Kent Beck and one of his associates for another project, but we picked them up while they were idle.
There are some domains for which XP does not work, and systems integration - at least of the type done in data warehousing - appears to be one of them.
Posted by Mark, Tuesday, October 07, 2003 10:53:00 PM