MySQL Hits the Data Warehousing Mainstream?
You know the time is coming for open source databases when a big company goes public with a large-scale implementation and it hits the Wall Street Journal. Media coverage in the mainstream press generally means that a technology has matured enough that it's safe for the corporate masses who follow in the shadow of forward-thinking companies.
In this case, the "big company" is Cox Communications and the database is MySQL:
"One user of an open-source database is Cox Communications Inc. The Atlanta-based cable-TV operator is using the software to monitor the performance of more than 1.5 million cable modems providing customers with high-speed Internet access. Mark Cotner, manager of network application development, originally got the system up and running on spare hardware and free software he downloaded from the Web site of MySQL AB, based in Sweden. The database now has 2.4 billion rows of information, totaling about 600 gigabytes of data."
The primary driver for Cox is cost. According to their statements, it was a choice between $300K for Oracle (not including support) or about $1K for MySQL and $12K for support. In the Big Understatement department, the CEO of Sybase is quoted as saying "It's very hard to compete with free. It lowers the price point."
I have some skepticism about the spin surrounding the system described. I have not talked to anyone yet for more details, but what I've turned up in online research does not point to a traditional data warehouse. In this press release they talk about 4 distributed copies of the database being maintained through replication, "4 million inserts every two hours" and "27 collection servers with over 3,600 MySQL tables."
When someone mentions a data warehouse you normally think of the query schema, and their 2.4 billion row count appears to come from all of the rows across all the databases. I can come up with 2 billion rows if I include all of my ETL infrastructure and ignore the query tables completely. The 4 million inserts in two hours might sound large until you work out that it's about 555 rows inserted per second. I'm doing 1,200 inserts per second for a single dimension load, and loading 20 different dimensions. Their data loads are probably done in short bursts, so the loads likely spike much higher than this modest average. Without further details it's hard to say. They name 600GB as the database size, which is large even if it's spread across several servers.
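The back-of-the-envelope math above is easy to check. A few lines of Python, using only the figures quoted in the press release and my own load numbers:

```python
# Sanity-check the insert-rate claims from the press release.
SECONDS_PER_HOUR = 3600

# "4 million inserts every two hours" works out to a modest sustained rate.
cox_rate = 4_000_000 / (2 * SECONDS_PER_HOUR)
print(f"Cox sustained rate: {cox_rate:.0f} inserts/sec")  # ~556

# For comparison: my own loads run 1,200 inserts/sec per dimension,
# across 20 different dimensions.
my_single_dimension_rate = 1200
print(f"My single-dimension load rate: {my_single_dimension_rate} inserts/sec")
```

The sustained average says little about peak rates, of course; if their collectors load in bursts, the instantaneous rate could be many times higher.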
I believe open source platforms are already reasonable for smaller (traditional) data warehousing projects. To test this out I just installed a Linux system at home with several hundred GB of storage, with DB2 UDB 7.2, Oracle 9i, Postgres and MySQL and will be comparing equivalent performance and functionality between these databases.
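For the load-speed part of that comparison, the harness I have in mind is simple: time a bulk insert into a fact-style table and report rows per second. Here is a minimal sketch using Python's DB-API; I use the built-in sqlite3 module as a stand-in so the code is self-contained, and the table name, columns, and row count are purely illustrative:

```python
import sqlite3
import time

def time_bulk_load(conn, n_rows=100_000):
    """Insert n_rows into a simple fact-like table and return rows/sec."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE modem_stats "
        "(modem_id INTEGER, ts INTEGER, bytes_up INTEGER, bytes_down INTEGER)"
    )
    # Synthetic rows: 5,000 distinct modems reporting over time.
    rows = [(i % 5000, i, i * 3, i * 7) for i in range(n_rows)]
    start = time.perf_counter()
    cur.executemany("INSERT INTO modem_stats VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    elapsed = time.perf_counter() - start
    return n_rows / elapsed

conn = sqlite3.connect(":memory:")
print(f"{time_bulk_load(conn):,.0f} rows/sec")
```

Against the real databases the connection line and parameter style would change (each has its own DB-API driver, and the serious bulk paths are tools like LOAD DATA INFILE or COPY rather than row-at-a-time inserts), but the same timing structure applies.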
What open source databases need in order to be taken seriously for data warehousing is less hype and more demonstrated capability. The primary needs are the ability to manage large tables with many tens or hundreds of millions of rows, fast data load speeds, easy data integration with other databases, and decent query support for commercial business intelligence tools. We'll see how MySQL and Postgres compare with the other databases. I don't expect any problems from Linux or ETL code built with open source tools.
Posted by Mark Wednesday, July 16, 2003 10:22:00 PM |