Clickstream


Readers Berating CIO Magazine for Layout Equivalent of Spamming

I had more fun reading the comments on this CIO Magazine article than the article itself. In fact, I didn't finish the article because it was the thing that pissed all the readers off.

Lesson to learn: sure, you may be trying to maximize ad revenue, but pushing the ad-to-content ratio to 1 ad per 100 words is taking it too far. They have 19 (!) pages of article, one paragraph per page. The side effects are damn funny, though. The comment paragraph count runs 5:1 over the article paragraph count.

RSS Sucks For Mashups, But It's Ubiquitous

Another thing we learned while banging away at web sites to get data was that RSS stinks for slinging web data, yet the tools available don't deal well with Atom feeds. Atom is great for this purpose, but hard to locate tools for (other than code libraries). Being both ignorant and lazy, I'd prefer to integrate tools rather than write 3x that much code in PHP and Python.

Our first attempt to grab data from generated an Atom feed. We did that because we wanted to get a bunch of data and feed different elements to other extractors to collect data. Sadly, none of the tools we were using could consume Atom. We had a choice of programming a custom widget for parsing Atom to get the data we wanted or switching to RSS. We did the latter because it was expedient given the time constraints.
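For what it's worth, the custom Atom parsing we passed on isn't a lot of code if you're willing to write it. Here's a minimal sketch in Python using only the standard library. The sample feed, the four fields pulled out and the `parse_atom_entries` helper are all made up for illustration; this is not what we ran at the contest.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, standing in for whatever a scraper would emit.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example data feed</title>
  <entry>
    <title>Acme Corp</title>
    <link href="http://example.com/acme"/>
    <updated>2007-07-18T12:00:00Z</updated>
    <summary>Widgets</summary>
  </entry>
  <entry>
    <title>Globex</title>
    <link href="http://example.com/globex"/>
    <updated>2007-07-18T13:00:00Z</updated>
    <summary>Gadgets</summary>
  </entry>
</feed>
"""


def parse_atom_entries(xml_text):
    """Return one dict per <entry> with the handful of fields we care about."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "link": link.get("href", "") if link is not None else "",
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            "summary": entry.findtext("atom:summary", default="", namespaces=ATOM_NS),
        })
    return entries


for row in parse_atom_entries(SAMPLE_FEED):
    print(row["title"], row["link"])
```

The catch is the one described above: once you parse the feed yourself, you also own the job of handing the data to every downstream tool.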

The problem with using RSS for data is that it only has a few fields to store data in. We ended up using the title, description, link, and even the date fields. This meant having to take 8 fields of extracted data and publish two separate RSS feeds, or stuffing multiple values into a single field like description. Luckily, we didn't need some fields broken out, so we stuffed things into fewer fields. Still, I had something like nine RSS feeds going at one point.
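To make the field stuffing concrete, here's a rough Python sketch of packing leftover values into the description of an RSS item and unpacking them on the other end. The record, the `|` and `=` delimiters and both helper functions are my own assumptions for illustration, not the format we used.

```python
import xml.etree.ElementTree as ET

# Hypothetical delimiters; pick ones that can't show up in your data.
FIELD_SEP = "|"
KV_SEP = "="


def build_rss_item(record, title_key, link_key):
    """Build an RSS <item>, packing every leftover field into <description>."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = record[title_key]
    ET.SubElement(item, "link").text = record[link_key]
    extras = {k: v for k, v in record.items() if k not in (title_key, link_key)}
    ET.SubElement(item, "description").text = FIELD_SEP.join(
        f"{k}{KV_SEP}{v}" for k, v in extras.items()
    )
    return item


def unpack_description(item):
    """Reverse the packing on the consuming side."""
    text = item.findtext("description", default="")
    if not text:
        return {}
    return dict(pair.split(KV_SEP, 1) for pair in text.split(FIELD_SEP))


# Made-up record with more fields than RSS has slots for.
record = {
    "name": "Acme Corp",
    "url": "http://example.com/acme",
    "ticker": "ACME",
    "price": "12.50",
    "city": "Denver",
}
item = build_rss_item(record, title_key="name", link_key="url")
print(ET.tostring(item, encoding="unicode"))
print(unpack_description(item))
```

It works, but every consumer has to know the packing convention, which is exactly the kind of private agreement a feed format is supposed to save you from.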

The other side was getting feeds. Lots of sites generate RSS feeds, but no Atom. So we're stuck with crappy feeds or we do it ourselves, but then have to DIY the consuming side as well. Let's hope Atom usage picks up in the future.


Web Data Sourcing Tools We Used for the Mashup Contest

We just finished recording a podcast for IBM DeveloperWorks, which will be up in the next few days, so I went looking to see what else happened at Mashup Camp while we were writing code. We got a mention at Programmable Web (the first place I go to see what's new in the mashup space). I wish we had been able to spend more time in sessions, but being heads-down in new tools was still worthwhile.

Apart from the possibility of winning prizes, this is the best learning environment I've found for this space. Unless you spend a lot of time reading through blogs, you aren't going to find many resources. Besides, nothing beats learning from other people while doing.

Here's the rundown of tools Renat and I worked with:
QEDwiki (not an official product or downloadable, yet)
Yahoo! Pipes
and lots of Google sources

One thing that hasn't been mentioned in most of the news is that all these companies had people at Camp. To be honest, if it weren't for Dan Gisolfi and Meg Sorber from IBM, we never would have finished work by the time the event closed. They stayed up late to help us with problems, bugs and techniques using QEDwiki.

I used Dapper a lot to scrape pages and make RSS feeds. I ran into problems with Dapper and the bad HTML practices of some web sites. Fortunately, Eran Shir and Jon Aizen (the CEO and CTO of Dapper) were there to help out.

Out of all the things I've worked with, theirs is the most impressive because of its simplicity. Unlike Pipes, which manipulates RSS feeds, Dapper scrapes pages and turns them into feeds in many different formats. Dapper + Pipes is a great combination. Kapow is an industrial strength scraper, so Dapper is not as powerful as Kapow's tools, but it's a lot easier to use for something quick and dirty.
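If you haven't seen a page-to-feed scraper before, the basic idea can be sketched in a few lines of Python with the standard library. To be clear, this is not how Dapper works internally (I don't know its internals); the sample HTML, the `story` class name and both helpers are invented for the example, and the output is a bare-bones feed rather than a complete RSS 2.0 channel.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical page fragment; real pages are a lot messier than this.
SAMPLE_HTML = """
<ul>
  <li><a class="story" href="http://example.com/1">First story</a></li>
  <li><a class="nav" href="http://example.com/about">About</a></li>
  <li><a class="story" href="http://example.com/2">Second story</a></li>
</ul>
"""


class StoryLinkScraper(HTMLParser):
    """Collect (title, href) pairs from <a class="story"> elements."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._href = None  # set only while inside a matching <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "story":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.stories.append(("".join(self._text).strip(), self._href))
            self._href = None


def html_to_rss(html_text, feed_title):
    """Scrape story links and wrap them in a minimal RSS document."""
    scraper = StoryLinkScraper()
    scraper.feed(html_text)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    for title, href in scraper.stories:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


print(html_to_rss(SAMPLE_HTML, "Example stories"))
```

The hand-written selector logic is the part a point-and-click tool saves you from, and it's also the part that breaks first when a site's HTML is sloppy.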

We used Apatar because it was the only way short of directly coding to APIs to get data from And it's open source. And Renat knows how to use it. It's not a page scraper, it's a data integration tool, so it does things these other tools can't. Overall, the combination of a DI tool, a scraper, and a manipulation and delivery formatting tool is what you need to get data for mashups if you're doing it inside an IT shop.

QEDwiki is the assembly hub, so it doesn't provide data sourcing or manipulation features. IBM is going to include another tool in the kit for that. They did a demo of this during Mashup U, but didn't have it available for us.


Watch a Walkthrough of Our Mashup Camp Contest Winner

Here's a link to coverage of the mashup contest winner (me and Renat Khasanshyn - not often I get to see "me" and "winner" in the same sentence) over at ZDNet: Mashup culture shatters crusty, stodgy old approach to business app dev

They did a nice job with the video of David Berlind interviewing me showing a walkthrough of the mashup and a little about what we had to do to build it. I'd comment on how intelligent that person doing the demo looks, but I'm too modest.

One thing that didn't get mentioned is that Apatar - which Renat used to pull a data feed from - is open source. It's free as in "freedom" and free as in "beer". Hard to beat that.

Sadly, no walkthrough of my answer to the Starbucks finders on Programmable Web, the Starbucks Anti-Finder (on the right). Given how many stores there are, I think the real challenge is showing a map of coffeehouses that aren't Starbucks, so I created a Starbucks locator that can't find Starbucks, even if you were to feed the lat/long position to the map. You're welcome, Brian - now you don't need to see corporate coffeehouses. Version 2 will exclude all public corporations from the list, making cafe hunting even easier for the anti-corporate crowd.
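The heart of an anti-finder is just a filter. A toy Python sketch, with made-up sample data and a hypothetical `anti_finder` function (the real mashup was assembled from feeds and widgets, not written like this):

```python
# Hypothetical sample data; a real version would come from a local search feed.
COFFEEHOUSES = [
    {"name": "Starbucks Coffee", "lat": 40.0150, "lon": -105.2705},
    {"name": "Joe's Beans", "lat": 40.0172, "lon": -105.2790},
    {"name": "STARBUCKS #4521", "lat": 40.0201, "lon": -105.2511},
    {"name": "Corner Cup", "lat": 40.0099, "lon": -105.2660},
]


def anti_finder(places, banned=("starbucks",)):
    """Keep only places whose name contains none of the banned words."""
    return [
        place for place in places
        if not any(word in place["name"].lower() for word in banned)
    ]


independents = anti_finder(COFFEEHOUSES)
print([p["name"] for p in independents])
```

Version 2 would amount to passing a much longer `banned` tuple.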

And no, there is no such thing as Jose Cuervo cereal.


Getting Data for Mashups

Getting data is still the hardest part, even with great tools like Dapper and Kapow. They aren't perfect, but neither are the web sites they go after (see previous post about Yahoo!). One thing we learned building the main mashup for the competition was that it's helpful to have an ETL tool around. We used page scrapers and RSS mungers and the like, but the only way to get data from is to write to their API.

That's no fun. Luckily, we had access to Apatar, which has prebuilt connectors for Salesforce, so we were able to go in and fetch the data we wanted without having to resort to client-side JavaScript calling Salesforce APIs. To do that would have meant coding custom widgets in order to address communication between UI controls, and there was no way to do that in the time we had.

I love open source. Amazing how quickly OSS data integration has been coming to market. Aside from Apatar there's Talend, Kettle (Pentaho Data Integration) and SnapLogic (though they aren't technically an ETL tool). I saw a mention of JitterBit recently but know nothing about them yet.


Unexpected Lessons From the Mashup Camp 4 Contest

Mashup Camp is done and we came away with 4 of the prizes from the IBM (and Dapper, Kapow, StrikeIron, Accuweather) mashup building competition. Not a clean sweep but we did our best.

While sourcing information for the mashups, one thing really surprised me. Yahoo! generates really poor HTML that's hard to parse. We had identified data for which there were no feeds in several areas (like Yahoo! Finance) and built scrapers. We couldn't get any of them to work (I tried from five different sets of pages). We ended up going to several other sites like Google Finance to get our data. Google's sites were simple to scrape, with no strange things going on.

Lesson learned: if you want to use Yahoo! go via one of their APIs or RSS feeds, otherwise find another source because you'll be pulling your hair out.

A related lesson is just how web-unfriendly Microsoft can be. I hadn't realized the difficulty of going after ASP-generated pages. If you use url-driven tools you are simply SOL. If you use a tool that can do directed spidering it's still not easy but works.

I'll be ready for the next contest. IBM gave two months of advance notice, but neither we nor the #2/3 winner took advantage of the time. We didn't start until the day before the deadline, when the wine and beer provided by IBM encouraged us to help each other learn the tools to do the work.


Mashup Camp 4, Day 1

I'm at Mashup Camp 4 this week. Tired. Who starts a conference with breakfast at 7:30 AM and runs it until 6:00 PM? Based on today, someone who's wrangled lots of smart people to talk about leading edge technology.

Bottom line: lots of sexy UI stuff like Zude going on, less focus on data integration, but more than I expected. Enjoyed the SnapLogic presentation, not so much because they are innovative (they are, but there are a half-dozen other entries in the general web integration market) but because they have a freely downloadable, server-based open source offering. This gets around one of my complaints about Yahoo Pipes, Dapper and OpenKapow: with SnapLogic you can run your own server, whereas the others are hosted services with no option of running your own. (Closed Kapow costs a lot, does more and runs inside your firewall.)

Saw Google Gears demo of an application running online and off. That's some interesting stuff and it's still very early. Given the flakiness of the conference wifi, more people could have benefited from offline browsing.

Also excited because I just got my invite for the Google Mashup Editor so I can play around. So many things to try out, so little time. I'd love to integrate LignUp with one of the open source BI tools and build some fun voice-enabled BI alerter demos.

Overall quality of demos and talks was high, with interesting things from both small startups and big companies like Yahoo, Google and Microsoft. Looking forward to tomorrow.


List of Data Integration Webcast Links for This Year

Here are links to replays for all the webcasts I've done during the past year.

The Hybrid Data Warehouse: Extending the BI Environment with Data Federation
Webcast: 3/21/07

Challenges and Techniques When Hand-coding ETL
Webcast: 6/26/07

Extract-Transform-Load (ETL) Market Overview and Directions
Webcast: 6/13/07

Enterprise Information Integration Technologies: Common Uses and Abuses
Webcast: 2/14/07

Enterprise Integration Series: Evaluation Criteria for Selecting ETL Tools
Webcast: 8/2/06



Data warehousing, business intelligence, IT strategy and architecture, and occasional interesting bits.


Creative Commons License
This work is licensed under this Creative Commons License except where indicated.