style="margin-top:70px;" Clickstream


Evaluating ETL Products is Hard

Comparing ETL and integration products is tough.
They all look alike.
The sales presentations are all the same.
The canned demos are indistinguishable.
Information from IT analyst firms is so high level it's worthless, even from the ones who cover the BI/DW space pretty well.
The standards, sources, targets and other checkbox-style comparisons don't provide the information you need to select the right tools.

Only a series of questions about implementation details exposes the things you need to know in order to make a decision. For example, one recent comparison I was doing went something like this:
Does the product bring data back to the engine or push transformations into SQL?
"It depends. It could do either.:

If the engine does some work then what happens when I bring over a mammoth amount of data?
"It buffers the data."

OK, so if I have a 4 GB SQLserver table and an 8 GB Oracle table to join, how is this accomplished?
"It accesses the data from the database using SQL."

OK, but it's different databases, so how does the tool do the join?
Vendor A: "One table is brought back and used as a lookup to the other remote table."
Vendor B: "Both tables are retrieved locally."

If it's that large, isn't caching locally a problem? Does it cursor through the data to keep from sucking up all the physical memory?

Vendor A: "You have to manage the input buffering, or it spills into a file/hash structure." (In other words, performance is poor enough that you won't see results)

Vendor B: "If you exceed 2 GB of buffering then the job will fail, so you have to work out a way to limit the data coming back." (In other words, more data than memory = crashed job)

In a simple example like that, you can follow the trail to a problem you might face. There are examples you'd never think to ask about unless you were already familiar with the products, in which case you wouldn't need to ask.

This is why I always do a hands-on demo or proof-of-concept on real data, using logic simpler than "map column A to column B". Sure, 70% of the integration work is simple mapping, but the other 30% is multiple tables, outer joins, calculations, lookups, and dealing with bad data. I wish vendors would demo the hard stuff instead of the easy stuff.

That's why I'm extending the ETL evaluation course at the Data Warehouse Institute to a full day. The first half is all about what to evaluate and how to evaluate it. The second half will be live, with vendors showing how to accomplish a given task. I'll start easy, and follow the usual path into the weeds of poor data quality and bad development planning. Maybe I'll finally see some problems fixed in some of my favorite tools, since they won't want to be embarrassed during a live demo with competitors. It should be fun! Look for the first run of the revised class at TDWI in Las Vegas next February.

Meanwhile, if there are vendors you'd like to see outside of Informatica and Ascential, er, IBM, I'm open to suggestions. You can leave a comment on this post.

How do you rate SAS ETL ?
I currently don't rate them. I evaluated ETL Studio and found it under-featured for the price a little over a year ago. That's largely because, if you aren't a SAS shop, you have to buy a set of SAS products that inflates the price to a level beyond what you would pay for the same features in comparable products. I'm planning to re-evaluate SAS in the next few months so we'll see how they stack up at that time.
I thought the post made some good points on extracting data, I use python for simple extracting data,data extraction can be a time consuming process but for larger projects like documents, files, or the web i tried "extracting data" which worked great, they build quick custom screen scrapers, extracting data, and data parsing programs
Post a Comment


Data warehousing, business intelligence, IT strategy and architecture, and occasional interesting bits.

Subscribe to XML feed

Bio / About Me

Check out my book

Clickstream data warehousing book cover Buy clickstream data warehousing from

Search this site or  the web

Site search   Web search
powered by FreeFind
Popular Posts
Primate programming.
Why development in crunch mode doesn't work.
Enterprise data modeling sucks big rocks.
XP Exaggerated.
Ping-pong in the matrix.
Time management for anarchists.
Is Ab Initio worth evaluating?
Job posting: omniscient architect.
Why hiring more sales people won't grow revenues faster.
Some resources for Open Source CMS.

Reading List
The Cruise of the Snark
Blue Latitudes
Everyone in Silico
The Klamath Knot
Swarm Intelligence (Bonabeau)
A three year backlog of F&SF

Listening List
Toots and the Maytals
The Buena Vista Social Club
American Idiot

Watching List
Winged Migration Quicktime trailer
Ghengis Blues
Howl's Moving Castls
A Bronx Tale

Daily KOS
Due Diligence
Boing Boing
Kevin Kelly (Recomendo)
Not Geniuses
3 Quarks Daily

War in Context
Valmiki's Ramayana
Choose the Blue
Third Nature
Mark Madsen
The Data Warehouse Institute
James Howard Kunstler
Clickstream Data Warehousing
Technorati Profile

04/01/2003 - 05/01/2003 05/01/2003 - 06/01/2003 06/01/2003 - 07/01/2003 07/01/2003 - 08/01/2003 08/01/2003 - 09/01/2003 09/01/2003 - 10/01/2003 10/01/2003 - 11/01/2003 11/01/2003 - 12/01/2003 12/01/2003 - 01/01/2004 05/01/2004 - 06/01/2004 06/01/2004 - 07/01/2004 07/01/2004 - 08/01/2004 08/01/2004 - 09/01/2004 09/01/2004 - 10/01/2004 10/01/2004 - 11/01/2004 11/01/2004 - 12/01/2004 12/01/2004 - 01/01/2005 01/01/2005 - 02/01/2005 02/01/2005 - 03/01/2005 03/01/2005 - 04/01/2005 05/01/2005 - 06/01/2005 06/01/2005 - 07/01/2005 07/01/2005 - 08/01/2005 08/01/2005 - 09/01/2005 09/01/2005 - 10/01/2005 10/01/2005 - 11/01/2005 11/01/2005 - 12/01/2005 12/01/2005 - 01/01/2006 01/01/2006 - 02/01/2006 03/01/2006 - 04/01/2006 05/01/2006 - 06/01/2006 06/01/2006 - 07/01/2006 07/01/2006 - 08/01/2006 08/01/2006 - 09/01/2006 09/01/2006 - 10/01/2006 10/01/2006 - 11/01/2006 01/01/2007 - 02/01/2007 02/01/2007 - 03/01/2007 03/01/2007 - 04/01/2007 04/01/2007 - 05/01/2007 05/01/2007 - 06/01/2007 06/01/2007 - 07/01/2007 07/01/2007 - 08/01/2007 08/01/2007 - 09/01/2007 09/01/2007 - 10/01/2007 10/01/2007 - 11/01/2007 11/01/2007 - 12/01/2007 12/01/2007 - 01/01/2008 01/01/2008 - 02/01/2008 02/01/2008 - 03/01/2008 03/01/2008 - 04/01/2008 08/01/2008 - 09/01/2008 06/01/2009 - 07/01/2009 08/01/2009 - 09/01/2009 10/01/2009 - 11/01/2009 01/01/2010 - 02/01/2010 09/01/2011 - 10/01/2011 04/01/2013 - 05/01/2013

Powered by Blogger.

Creative Commons License
This work is licensed under this Creative Commons License except where indicated.