Clickstream


Brio Bites the Dust

The business intelligence market loses another company as Hyperion buys Brio. Brio has long been my favorite ad-hoc query tool. Easy to set up, easy interface, not a lot of confusing administrative work to get going. On the good side, lots of people use Essbase and Brio would provide a window into both Essbase and relational databases, so the acquisition makes sense. It will probably fare better than some of the Crystal products under Business Objects. If you don't like the UK reporting, here's a US news report.

Crystal Decisions Acquisition: BI Tool Consolidation Continues

The latest large acquisition in the BI market is Business Objects' purchase of Crystal Decisions. This is a consolidation of two large players and that may push the combined company to the #1 slot in the BI market [I have not checked out the market share and revenue numbers yet].

I wonder about culture clash as the companies merge operations. BOb is a French company, while Crystal is based in San Jose. There's also fear that work in San Jose may be shifted out, further damaging the Silicon Valley economy. The tools do overlap a little, but for the most part they serve different segments of the market and come at it with different approaches. Hopefully that means they tweak the products to work better together without dropping many product lines.

It's going to be interesting for all those Crystal customers out there. I'm one, but not in a big way. Fortunately we have two other platforms we can shift to if we need to make a move.

Books for the Data Warehouse User Interface

I've had Edward Tufte's books for years now and I still refer back to them occasionally if I'm working on presenting numbers and facts. Each of the three focuses on different aspects of delivering information visually. He's a stickler for details, so the books are beautifully designed and the construction is high quality from the paper to the binding. The titles are:
The Visual Display of Quantitative Information
Envisioning Information
Visual Explanations

These three books, or at least the first one, should be on every data warehouse professional's bookshelf if involved with user-interface design. We spend a lot of time with business intelligence products, but it's often not focused on communicating the information effectively.

Machines 20, Mosquitoes 0

Our Terminator future has already started. After one full day of operation there were 20 mosquitoes in the machine's little jail. If it keeps up this pace I'll soon be able to go outside for more than ten minutes without suffering from anemia.

Outsourcing vs. Telecommuting

It's an interesting idea, comparing telecommuting to offshore (or even onshore) outsourcing. It's not ok to have an IT department and spend $50M per year, but it is ok to have an outsourcing contract for $45M per year and no IT department. If you could save $5M per year by getting rid of 80% of the office infrastructure in favor of telecommuting and outsourcing some of the basic infrastructure services, why go to an outsourcer? In theory you could control your destiny and save money.
An anonymous reader asks: "Corporations and management resisted telecommuting for years, now jobs flow to distant nations. Did telecommuting become acceptable because of the greater distance? Because some form of on-site management persists? Because labor laws are favorable? Because a well paid middle class is a political threat? Is it really as simple as money? I'll work cheaper if I can choose where I live and work. Must I leave my country to do so?"
I've seen enough outsourcings and insourcings and deals gone bad to know that outsourcing specific items or infrastructure works well, but trying to outsource the custom development and maintenance and the management of IT that goes with it is hard enough that it usually costs more and leaves the company hobbled. Read more at /.

California's New Over-hyped Privacy Law Has Giant Loophole

There's been a lot of press lately about California's new law requiring companies to notify consumers when personal information is stolen. This would be a great leap forward if not for the fact that the only information that appears to be protected is the social security number, driver's license number or credit card number.

That means your online brokerage account username and password do not fall into this category. So if someone hacks in and transfers all your money to their account, it's unclear whether they need to tell you since the only information that was lost to the intrusion was a username and password. Other data like medical records or insurance information is likewise not covered.

This widely applauded bill is supposed to be the basis for a broader federal statute to be introduced later this year. Read the text and see if you come to the same conclusion: full text of CA SB 1386. I had to point to the Google cache copy because their link went dead.

Followup: rereading this, I think the brokerage account would be covered, although the wording is ambiguous. The other personal information is probably not covered. Good intent. Needs to be more strongly worded or it will not be effective.

Gear Fetish: Open Source PVR

Departing from the DW/DSS theme, there's some neat hardware coming on the market. I've been looking into building my own Linux-based Tivo clone. Then I found the Telly thanks to this article at Wired News.

Open source. Open hardware. User modifiable. What could be better? The $899 price is a little steep but it saves the effort of buying and integrating a bunch of hardware and software. They get kudos for their home page too: there's a penguin on the Telly!

Gear Fetish 2: A Network-based MP3 Player
The Telly may obviate the need for the SLIMP3. I like the idea of a networked mp3 player that can search through and build playlists out of my collection, and the user interface on this device is clean and simple. It also doesn't require that the TV act as a monitor, since it has the bright LCD display. The only drawback is that it requires a network cable - no Wi-Fi. That's the one thing I like better about the Homepod. The display for the Homepod is pretty lame.

The FBI's Approved Reading List

You better watch what you read, because Uncle Ashcroft has a database with your name in it. I wonder how many different databases this poor guy's reading list will show up in...
Then they ask if I carried anything into the shop -- and we're back to me.

My mind races. I think: a bomb? A knife? A balloon filled with narcotics? But no. I don't own any of those things. "Sunglasses," I say. "Maybe my cell phone?"

Not the right answer. I'm nervous now, wondering how I must look: average, mid-20s, unassuming retail employee. What could I have possibly been carrying?

Trippi's partner speaks up: "Any reading material? Papers?" I don't think so. Then Trippi decides to level with me: "I'll tell you what, Marc. Someone in the shop that day saw you reading something, and thought it looked suspicious enough to call us about. So that's why we're here, just checking it out. Like I said, there's no problem. We'd just like to get to the bottom of this. Now if we can't, then you may have a problem. And you don't want that."

What was the young man above reading? Why, it was this terribly dangerous article. Combine two parts bad legislation with one part paranoia, mix in some technology, and everyone can enjoy the same freedoms as David Nelson.

What next, RIAA building a database of people on file trading networks so they can tie up our legal system with frivolous lawsuits based on misuse of the DMCA?

MySQL Hits the Data Warehousing Mainstream?

You know the time is coming for open source databases when a big company goes public with a large-scale implementation and it hits the Wall Street Journal. Media coverage in the mainstream press generally means that a technology has matured enough that it's safe for the corporate masses who follow in the shadow of forward thinking companies.

In this case, the "big company" is Cox Communications and the database is MySQL:
"One user of an open-source database is Cox Communications Inc. The Atlanta-based cable-TV operator is using the software to monitor the performance of more than 1.5 million cable modems providing customers with high-speed Internet access. Mark Cotner, manager of network application development, originally got the system up and running on spare hardware and free software he downloaded from the Web site of MySQL AB, based in Sweden. The database now has 2.4 billion rows of information, totaling about 600 gigabytes of data."

The primary driver for Cox is cost. According to their statements, it was a choice between $300K for Oracle (not including support) or about $1K for MySQL and $12K for support. In the Big Understatement department, the CEO of Sybase is quoted as saying "It's very hard to compete with free. It lowers the price point."

I have some skepticism about the spin surrounding the system described. I have not talked to anyone yet for more details, but what I've turned up in online research does not point to a traditional data warehouse. In this press release they talk about 4 distributed copies of the database being maintained through replication, "4 million inserts every two hours" and "27 collection servers with over 3,600 MySQL tables."

When someone mentions a data warehouse you normally think of the query schema, and their 2 billion row count appears to come from all of the rows across all the databases. I can come up with 2 billion rows if I include all of my ETL infrastructure and ignore the query tables completely. The 4 million inserts in two hours might be considered large until you work out that it's about 555 rows inserted per second. I'm doing 1200 inserts per second for a single dimension load, and loading 20 different dimensions. Their data loads are probably done in short bursts, so the loads are likely spiking much higher than this modest number. Without further details it's hard to say. They name 600GB as the database size, which is large even if it's spread across several servers.
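The per-second figure is easy to check. Here is a minimal sketch of the arithmetic, using the "4 million inserts every two hours" number from the press release. The ten-minute burst window is purely my own assumption to show how much the peak rate could differ from the average; nothing in the source says how the loads are actually scheduled.

```python
def rows_per_second(rows: int, seconds: int) -> float:
    """Average insert rate over a load window."""
    return rows / seconds


# Figure quoted in the press release: 4 million inserts every two hours.
INSERTS = 4_000_000
TWO_HOURS = 2 * 60 * 60  # 7,200 seconds

average_rate = rows_per_second(INSERTS, TWO_HOURS)
print(f"average over the full window: {average_rate:.1f} rows/sec")

# Hypothetical burst (an assumption, not from the source): the same
# volume squeezed into a 10-minute slice of each two-hour window.
TEN_MINUTES = 10 * 60
burst_rate = rows_per_second(INSERTS, TEN_MINUTES)
print(f"same volume in a 10-minute burst: {burst_rate:.0f} rows/sec")
```

The average comes out to roughly 555 rows per second, which is why the headline number sounds bigger than it is. The burst number only illustrates why the average alone doesn't tell you how hard the servers are working.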

I believe open source platforms are already reasonable for smaller (traditional) data warehousing projects. To test this out I just installed a Linux system at home with several hundred GB of storage, with DB2 UDB 7.2, Oracle 9i, Postgres and MySQL and will be comparing equivalent performance and functionality between these databases.

What open source databases need in order to be taken seriously for data warehousing is less hype and real capabilities. The primary needs are the ability to manage large tables with many tens or hundreds of millions of rows, fast data load speeds, easy data integration with other databases and decent query support for commercial business intelligence tools. We'll see how MySQL and Postgres compare with the other databases. I don't expect any problems from Linux or ETL code built with open source tools.

The TSA Owes Me Lunch

Tourists. Terrorists. It's too hard to tell the difference.

The TSA has been popping up a lot lately in my blog. The latest occurrence is once again at Portland International Airport, the origin of the David Nelson saga and show me your tits baggage screening.

So how did the TSA cost me lunch? I'm glad you asked...

We were supposed to meet two reps from a software company in Seattle for lunch today. They called and told us they wouldn't be able to make it because they were stuck in Portland. After leaving the Seattle-Portland plane, they were detained at the gate for the Medford-bound leg of their trip because one person's driver's license had expired. That license is certainly less secure than before it expired, a few days earlier.

I can (almost) understand not allowing someone to board a flight with an expired license, but after they're on the second leg of their flight? I suppose I can see a reason. After all, who knows what terrible mischief they could have caused in a small commuter plane flying over central Oregon wilderness. Instead, the rocket scientists at the Portland airport allowed them to board a jet back home to Seattle.

The TSA wins and I'm out a free lunch.

This is why I'm grateful I no longer travel for work. The innocent traveling public suffers the cost and outrages of ham-handed federalized airport security while the tragic stupidity of government-mandated security practices leaves us worse off than we were two years ago. Maybe we can all help the TSA by pointing out when their metal detectors aren't plugged in.

Ping-Pong in The Matrix

Low-tech special effects make this Japanese kung-fu ping-pong match one you won't want to miss.

It's too bad American media companies are so stuck on polls and the herd instinct that we rarely get a program half as interesting as what's on Japanese TV.

Nice Short Miyazaki Review in Salon

There's a nice short article on Hayao Miyazaki at Salon. It offers some critique and summaries of a number of Miyazaki films. Worth clicking through the single ad page to get to if you aren't a member.

If you haven't become a fan of Miyazaki yet, go see Spirited Away. It's by far the best one of those I've seen.

TSA Uses Data Warehousing to Further Terrorism: A Lesson in How Not to do Security

From the last post, I was asked "how is the TSA's no-fly list less secure than not having a list?" Because maintaining a predetermined list that is uniform (uniformly wrong if you're David Nelson) rather than performing random checks of passengers means there are ways to use the system against itself.

The government has tipped its hand as to who is interesting. If a group of people wants to infiltrate flights or travel incognito, all that is necessary is to obtain fake IDs and travel to see which IDs attract attention and which don't. Spending some time and money, a group can try different combinations of activity as well. For example, buy one-way tickets, pay cash for a round trip, book from rural to metropolitan airports and see which segments are flagged.

Knowing what the TSA considers important, it's possible to obtain safe fake IDs and fly under the TSA's radar.

To make matters worse, if you want to divert security's attention and resources, simply send someone with ID that is guaranteed to be flagged to the airport when you go. You can bring your favorite plague-and-bomb kit while the person with the "bad" ID has nothing but nail clippers.

Contrast this with random searches. If it's random, the system can't be gamed. It is more difficult to anticipate who will be targeted for searches or what activities, if any, will attract attention. As an added benefit, we avoid the false sense of security caused by the belief that the bad ones have been weeded out in the trusty TSA data warehouse.

We need security based on sound principles, not on the assumption that terrorists are unsophisticated goofs or that computers and databases can't contain mistakes. This is one data warehouse I'd like to see decommissioned.

Homeland Security's Poor Data Quality May Land You in Jail

The various security laws like the USA Patriot Act can show a darker side to data quality. Following up on my prior post about data quality and how bad data makes its way through a chain of computer systems, imagine the impacts of flawed data when that data is used for law enforcement purposes. This is an area where an error can destroy someone's life.

Because Homeland Security and the TSA (and the FBI and the CIA and local and state police departments) are all collecting data - and thanks to the Patriot Act are free to spy on citizens who are not suspects in any criminal investigation - there is real danger when inaccuracies in one agency's data get widely dispersed. Imagine if bad data is propagated from one system to other databases, and from there to even more databases. Even if the originating agency corrects their data, the corrections may not make their way to all the replicated copies or may show up too late to do any good. This gives me the screaming heebie-jeebies.

The case of the David Nelsons is a perfect example of what can happen when bad data gets into a system. Anyone named David Nelson was targeted for special scrutiny to the point of missed flights and hours-long detention until the enforcers realized this was not the David Nelson they were interested in. Amazingly, the database did not bother to identify specifics, like a middle name or physical address or driver's license number. This meant that anyone named David Nelson from Oregon was suspect.
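To make the matching problem concrete, here is a small sketch of what happens when a watch list is keyed on name alone versus name plus one more identifier. I have no idea how the TSA's system is actually built, so every field name, record and function below is hypothetical and all the data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traveler:
    name: str
    middle_name: str
    license_number: str  # made-up values, for illustration only


# One hypothetical watch list entry and four hypothetical passengers.
WATCH_ENTRY = Traveler("David Nelson", "Alan", "OR-1111111")
PASSENGERS = [
    Traveler("David Nelson", "Alan", "OR-1111111"),
    Traveler("David Nelson", "Lee", "OR-2222222"),
    Traveler("David Nelson", "James", "OR-3333333"),
    Traveler("Mary Smith", "Ann", "OR-4444444"),
]


def flag_by_name(entry, travelers):
    """Name-only match: every traveler sharing the name is flagged."""
    return [t for t in travelers if t.name == entry.name]


def flag_by_name_and_license(entry, travelers):
    """Match on name plus a second identifier to cut false positives."""
    return [
        t for t in travelers
        if t.name == entry.name and t.license_number == entry.license_number
    ]


print(len(flag_by_name(WATCH_ENTRY, PASSENGERS)), "flagged by name only")
print(len(flag_by_name_and_license(WATCH_ENTRY, PASSENGERS)), "flagged by name + license")
```

With name-only matching, all three David Nelsons get pulled aside. Adding a single extra attribute narrows it to the one record that matches, assuming that attribute was captured correctly in the first place.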

To make matters worse, there was no way to alert airports further down the line that the current person is not an "interesting" David Nelson and has already been screened. Some people were subject to repeated detentions as they changed planes on subsequent legs of the same flight.

This highlights one of the key problems: the multiple databases and the disconnected nature of systems. Even if one airport cleared David Nelson, the next airport did not know this other than via the obvious fact that he just got off a plane and was booked on another segment. Each airport is an island, and each island has its own copy of the bad data.

We have multiple airports all using data from the same source: the TSA. Nobody knows where the TSA got their data because they keep their "can't fly" criteria and lists secret, even though this practice makes flying less secure by making it easier for someone to subvert the system. This is typical of most government attempts to increase security over the past two years.

If this happens to you, assuming you can get the information corrected, you will probably still be fighting the bad data sitting in the backwaters of some regional TSA office and find yourself unexpectedly detained.

Now magnify this annoying but minor incident with some of the other federal efforts, like the "Total Information Awareness" program that was renamed the "Terrorist Information Awareness" program in an effort to make everyone feel better about the Orwellian goals. The feds will build a database on everyone in the US, just in case the data might be useful, and share that with other shadowy federal organizations. Now the bad data could be feeding into police surveillance programs, suspect questioning or even detentions. And there is no mechanism in place to review the data for accuracy or correct it and any downstream uses.

Each error in the data has the potential for a huge cost in misdirected law enforcement, diverting security efforts and making us less secure. And bad data is a very difficult, almost intractable problem, particularly when secretive government agencies are involved. Let's hope these massive surveillance data warehouse projects are stopped before too many people are sent to Cuba because of bad data.

Increasing Data Integration Exposes X-rated Data Quality Problems

As more and more systems are moved from batch to online designs, or replaced entirely by product suites and ERP systems, the problem of dirty data is becoming more apparent. In the past this was rarely exposed to the outside. Now that systems are so interconnected, the polluted data multiplies quickly. My last experience with this was when Clickstream Data Warehousing came out.

At that time, the information that shows up on the Amazon and Barnes & Noble web pages for the book had errors in it. We got the errors corrected quickly enough, but they reappeared a week later. It turns out the publisher made mistakes entering the information, and that data was propagated from their system to the distributors' systems, and from there to several online booksellers. Any correction downstream was overwritten whenever new updates came in. Even fixing it at the distributor did not guarantee it wouldn't reoccur.
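The overwrite behavior is simple to simulate. The sketch below is only an illustration under my own assumptions: three systems that each hold a copy of one record, and a feed that blindly copies upstream values downstream. The misspelled title is invented for the example (the actual errors were in other details), and none of the names correspond to real publisher or bookseller systems.

```python
CORRECT = "Clickstream Data Warehousing"
WRONG = "Clickstream Data Warehousng"  # invented error for illustration


def run_feed(source: dict, target: dict) -> None:
    """Copy every upstream field downstream, overwriting local values."""
    target.update(source)


def simulate(fix_at_source: bool) -> str:
    """Return the title the bookseller shows after a fix and one more refresh."""
    publisher = {"title": WRONG}  # the error enters at the source
    distributor: dict = {}
    bookseller: dict = {}

    def refresh() -> None:
        run_feed(publisher, distributor)
        run_feed(distributor, bookseller)

    refresh()  # initial load spreads the error everywhere
    if fix_at_source:
        publisher["title"] = CORRECT   # fix where the data originates
    else:
        bookseller["title"] = CORRECT  # fix only the downstream copy
    refresh()  # the next scheduled update arrives
    return bookseller["title"]


print("fixed downstream only:", simulate(fix_at_source=False))
print("fixed at the source:  ", simulate(fix_at_source=True))
```

Fixing the bookseller's copy lasts exactly until the next refresh. Only the fix at the publisher survives, which matches what we saw with the book listings.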

This kind of problem happens all the time with product information, and in many different industries. One of my favorite examples of propagating bogus product data is also from Amazon. The problem: the image references for a bunch of Shaolin Kung-fu movies were not for the video covers, but for softcore porn videos. A query for, say, "Fighting of the Shaolin Monks" turned up this page as a result [screen capture, 366K]. Here's a much smaller closeup of the video box for Fighting of the Shaolin Monks. This is a style of Kung-fu I'm not familiar with.

Studies in the direct mail industry say the industry-wide costs of erroneous mailings caused by bad data run into the hundreds of millions of dollars. Not too long ago the state of Alaska had a single problem with a mailing that ended up costing them somewhere between $65,000 and $95,000 to correct.

Even though bad data can lead to commercial loss (or possibly unexpected sales growth to the teenage male segment in the movie example), most companies don't have a data quality program in place. This is not necessarily bad, since data quality is only as good as the processes that handle it. Putting a program and people in place to correct data quality problems is not normally one person's job. It's a process problem that requires changing software development, systems integration, and data management practices.

Without this perspective, it's easy to fall into the trap of buying data cleansing tools to clean up data so that it's in good shape, never fixing the source of the problems. This is a lot like building a filtration plant for your water supply instead of stopping the upstream pollution. It costs more and it deals with the symptom rather than the cause.

Misleading Real Time Product Literature

After reading some real-time integration literature for a few products, I'm left feeling like the vendors are missing the point. There was a lot of focus on moving data from point to point in real time, and on how low the latencies are for their products. This would be fine if we were only worried about how long it takes data to get from point A to point B.

The problem is that this is not the real measure of what's important. The business need that drives the real-time infrastructure dictates what sort of latency is important. It's a rare enterprise that needs subsecond interconnections, or data precisely 30 seconds after an event occurs. The focus should be on meeting the information latency needs of the users.

Most products will do this. Getting them into benchmark-like shootouts is counterproductive. It's not so much the fault of the vendor marketing as it is the whole IT buying cycle. Everyone wants neat little checklists that they can run down to do a product comparison, and the product marketers like to focus on things that are easy to differentiate.

It's easy to show the difference between 5 second and subsecond information delivery. It's hard to show the difference between one product's mechanism for integrating with a custom system and another's, and why one is better. It's easy to show how well a system scales under various workloads, but not how performance is affected if one link in a chain of systems has a problem.

This is one of the reasons I wrote my last article on real-time data integration for Intelligent Enterprise. What we're usually trying to buy with integration products is infrastructure, and infrastructure doesn't come in a box. IT infrastructure is a set of technologies that need to fit into a technology plan, and need to supply a relatively stable and regular set of services.


Data warehousing, business intelligence, IT strategy and architecture, and occasional interesting bits.



Creative Commons License
This work is licensed under this Creative Commons License except where indicated.