MySQL Hits the Data Warehousing Mainstream?
You know the time is coming for open source databases when a big company goes public with a large-scale implementation and it hits the Wall Street Journal. Media coverage in the mainstream press generally means that a technology has matured enough that it's safe for the corporate masses who follow in the shadow of forward-thinking companies.
In this case, the "big company" is Cox Communications and the database is MySQL:
"One user of an open-source database is Cox Communications Inc. The Atlanta-based cable-TV operator is using the software to monitor the performance of more than 1.5 million cable modems providing customers with high-speed Internet access. Mark Cotner, manager of network application development, originally got the system up and running on spare hardware and free software he downloaded from the Web site of MySQL AB, based in Sweden. The database now has 2.4 billion rows of information, totaling about 600 gigabytes of data."
The primary driver for Cox is cost. According to their statements, it was a choice between $300K for Oracle (not including support) or about $1K for MySQL and $12K for support. In the Big Understatement department, the CEO of Sybase is quoted as saying "It's very hard to compete with free. It lowers the price point."
I have some skepticism about the spin surrounding the system described. I have not talked to anyone yet for more details, but what I've turned up in online research does not point to a traditional data warehouse. In this press release they talk about 4 distributed copies of the database being maintained through replication, "4 million inserts every two hours" and "27 collection servers with over 3,600 MySQL tables."
When someone mentions a data warehouse you normally think of the query schema, and their 2.4 billion row count appears to come from all of the rows across all the databases. I can come up with 2 billion rows if I include all of my ETL infrastructure and ignore the query tables completely. The 4 million inserts in two hours might sound large until you work out that it's about 555 rows inserted per second. I'm doing 1,200 inserts per second for a single dimension load, and loading 20 different dimensions. Their data loads are probably done in short bursts, so the loads likely spike much higher than this modest average. Without further details it's hard to say. They name 600GB as the database size, which is large even if it's spread across several servers.
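The back-of-the-envelope math above is easy to check. A few lines of Python, using only the figures quoted in the press release and my own load numbers:

```python
# Sanity-check the insert-rate claims from the press release.
SECONDS_PER_HOUR = 3600

# "4 million inserts every two hours" works out to a modest sustained rate.
cox_rate = 4_000_000 / (2 * SECONDS_PER_HOUR)
print(f"Cox sustained rate: {cox_rate:.0f} inserts/sec")  # ~556

# For comparison: my own loads run 1,200 inserts/sec per dimension,
# across 20 different dimensions.
my_single_dimension_rate = 1200
print(f"My single-dimension load rate: {my_single_dimension_rate} inserts/sec")
```

The sustained average says little about peak rates, of course; if their collectors load in bursts, the instantaneous rate could be many times higher.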
I believe open source platforms are already reasonable for smaller (traditional) data warehousing projects. To test this out I just installed a Linux system at home with several hundred GB of storage, with DB2 UDB 7.2, Oracle 9i, Postgres and MySQL and will be comparing equivalent performance and functionality between these databases.
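For the load-speed part of that comparison, the harness I have in mind is simple: time a bulk insert into a fact-style table and report rows per second. Here is a minimal sketch using Python's DB-API; I use the built-in sqlite3 module as a stand-in so the code is self-contained, and the table name, columns, and row count are purely illustrative:

```python
import sqlite3
import time

def time_bulk_load(conn, n_rows=100_000):
    """Insert n_rows into a simple fact-like table and return rows/sec."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE modem_stats "
        "(modem_id INTEGER, ts INTEGER, bytes_up INTEGER, bytes_down INTEGER)"
    )
    # Synthetic rows: 5,000 distinct modems reporting over time.
    rows = [(i % 5000, i, i * 3, i * 7) for i in range(n_rows)]
    start = time.perf_counter()
    cur.executemany("INSERT INTO modem_stats VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    elapsed = time.perf_counter() - start
    return n_rows / elapsed

conn = sqlite3.connect(":memory:")
print(f"{time_bulk_load(conn):,.0f} rows/sec")
```

Against the real databases the connection line and parameter style would change (each has its own DB-API driver, and the serious bulk paths are tools like LOAD DATA INFILE or COPY rather than row-at-a-time inserts), but the same timing structure applies.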
What open source databases need in order to be taken seriously for data warehousing is less hype and more demonstrated capability. The primary needs are the ability to manage large tables with many tens or hundreds of millions of rows, fast data load speeds, easy data integration with other databases, and decent query support for commercial business intelligence tools. We'll see how MySQL and Postgres compare with the other databases. I don't expect any problems from Linux or ETL code built with open source tools.
Posted by Mark Wednesday, July 16, 2003 10:22:00 PM |