Readers Berating CIO Magazine for Layout Equivalent of Spamming
I had more fun reading the comments on this CIO Magazine article than the article itself. In fact, I never finished the article, because its layout was the very thing that pissed all the readers off.
Lesson to learn: sure, you may be trying to maximize ad revenue, but pushing the ad-to-content ratio to 1 ad per 100 words is taking it too far. They have 19 (!) pages of article, one paragraph per page. The side effects are damn funny, though. The comment paragraph count runs 5:1 over the article paragraph count.
Posted by Mark Thursday, July 26, 2007 3:14:00 PM |
RSS Sucks For Mashups, But It's Ubiquitous
Another thing we learned while banging away at web sites to get data: RSS stinks for slinging web data, yet the available tools don't deal well with Atom, the obvious alternative. Atom is great for this purpose, but it's hard to find tools that support it (other than code libraries). Being both ignorant and lazy, I'd rather integrate tools than write 3x that much code in PHP and Python.
Our first attempt to grab data from Salesforce.com generated an Atom feed. We did that because we wanted to get a bunch of data and feed different elements to other extractors to collect data. Sadly, none of the tools we were using could consume Atom. We had a choice of programming a custom widget for parsing Atom to get the data we wanted or switching to RSS. We did the latter because it was expedient given the time constraints.
The problem with using RSS for data is that it only has a few fields to store data in. We ended up using the title, description, link, and even the date fields. That meant either taking 8 fields of extracted data and publishing two separate RSS feeds, or stuffing multiple values into a single field like description. Luckily, we didn't need some fields broken out, so we stuffed things into fewer fields. Still, I had something like nine RSS feeds going at one point.
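A minimal sketch of the field-stuffing trick in Python (the record, the field names, and the "|" delimiter are all my invention here, not the actual feeds we built):

```python
import xml.etree.ElementTree as ET

# Hypothetical extracted record: more values than RSS gives us slots for.
record = {
    "company": "Acme Corp",
    "ticker": "ACME",
    "price": "42.10",
    "change": "+0.35",
    "volume": "1200300",
    "url": "http://example.com/quote/ACME",
}

def rss_item(rec, stuffed_keys, delim="|"):
    """Build one RSS <item>, packing the leftover fields into <description>."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = rec["company"]
    ET.SubElement(item, "link").text = rec["url"]
    # Everything without a field of its own gets stuffed into description.
    ET.SubElement(item, "description").text = delim.join(
        rec[k] for k in stuffed_keys
    )
    return item

item = rss_item(record, ["ticker", "price", "change", "volume"])
desc = item.find("description").text
# desc == "ACME|42.10|+0.35|1200300"
```

The consumer splits description on the same delimiter, in the same order, to get the fields back. Fragile, but it works when you control both ends of the pipe.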
The other side was consuming feeds. Lots of sites generate RSS feeds, but no Atom. So we're stuck with crappy feeds, or we do it ourselves but then have to DIY the consuming side as well. Let's hope Atom usage picks up in the future.
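For contrast, here's why Atom is a better fit for this kind of data slinging: its extension model lets each value travel as its own element instead of being crammed into description. A rough sketch (the extension namespace and element names are made up for illustration):

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Invented extension namespace for our own data fields.
DATA = "http://example.com/ns/quote"

ET.register_namespace("", ATOM)
ET.register_namespace("q", DATA)

entry = ET.Element("{%s}entry" % ATOM)
ET.SubElement(entry, "{%s}title" % ATOM).text = "Acme Corp"
# Each extracted value gets its own named element; no delimiter games.
for name, value in [("ticker", "ACME"), ("price", "42.10"),
                    ("change", "+0.35"), ("volume", "1200300")]:
    ET.SubElement(entry, "{%s}%s" % (DATA, name)).text = value

xml = ET.tostring(entry, encoding="unicode")
```

A consumer can then pull each value by name instead of splitting strings, which is exactly what the RSS-only tools couldn't give us.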
Labels: Atom, integration, mashup, mashup camp, mashupcamp4, RSS, web2.0
Posted by Mark Wednesday, July 25, 2007 12:09:00 AM |
Web Data Sourcing Tools We Used for the Mashup Contest
We just finished recording a podcast for IBM DeveloperWorks, which will be up in the next few days, so I was looking to see what else happened at Mashup Camp while we were writing code. We got a mention at Programmable Web (first place I go to see what's new in the mashup space). I wish we had been able to spend more time in sessions, but being heads-down in new tools was still worthwhile.
Apart from the possibility of winning prizes, this is the best learning environment I've found for this space. Unless you spend a lot of time reading through blogs, you aren't going to find many resources. Besides, nothing beats learning from other people while doing.
Here's the rundown of tools Renat and I worked with:
QEDwiki (not an official product or downloadable, yet)
Dapper
Yahoo! Pipes
Apatar
Kapow
and lots of Google sources
One thing that hasn't been mentioned in most of the news is that all these companies had people at Camp. To be honest, if it weren't for Dan Gisolfi and Meg Sorber from IBM, we never would have finished work by the time the event closed. They stayed up late to help us with problems, bugs and techniques using QEDwiki.
I used Dapper a lot to scrape pages and make RSS feeds. I ran into problems with Dapper and the bad HTML practices of some web sites. Fortunately, Eran Shir and Jon Aizen (the CEO and CTO of Dapper) were there to help out.
Out of all the things I've worked with, theirs is the most impressive because of its simplicity. Unlike Pipes, which manipulates RSS feeds, Dapper scrapes pages and turns them into feeds in many different formats. Dapper + Pipes is a great combination. Kapow is an industrial-strength scraper, so Dapper isn't as powerful as Kapow's tools, but it's a lot easier to use for something quick and dirty.
We used Apatar because it was the only way short of directly coding to APIs to get data from Salesforce.com. And it's open source. And Renat knows how to use it. It's not a page scraper, it's a data integration tool, so it does things these other tools can't. Overall, the combination of a DI tool, a scraper, and a manipulation-and-delivery formatting tool is what you need to get data for mashups if you're doing it inside an IT shop.
QEDwiki is the assembly hub, so it doesn't provide data sourcing or manipulation features. IBM is going to include another tool in the kit for that. They did a demo of this during Mashup U, but didn't have it available for us.
Labels: apatar, dapper, ibm, kapow, mashup, mashup camp, pipes, qedwiki
Posted by Mark Tuesday, July 24, 2007 9:02:00 AM |
Watch a Walkthrough of Our Mashup Camp Contest Winner
Here's a link to coverage of the mashup contest winner (me and Renat Khasanshyn - not often I get to see "me" and "winner" in the same sentence) over at ZDNet: Mashup culture shatters crusty, stodgy old approach to business app dev
They did a nice job with the video of David Berlind interviewing me as I show a walkthrough of the mashup and talk a little about what we had to do to build it. I'd comment on how intelligent the person doing the demo looks, but I'm too modest.
One thing that didn't get mentioned is that Apatar - which Renat used to pull a data feed from Salesforce.com - is open source. It's free as "freedom" and free as in "beer". Hard to beat that.
Sadly, no walkthrough of my answer to the Starbucks finders on Programmable Web, the Starbucks Anti-Finder (on the right). Given how many stores there are, I think the real challenge is showing a map of coffeehouses that aren't Starbucks, so I created a Starbucks locator that can't find Starbucks, even if you were to feed it the lat/long position of one. You're welcome, Brian - now you don't need to see corporate coffeehouses. Version 2 will exclude all public corporations from the list, making cafe hunting even easier for the anti-corporate crowd.
And no, there is no such thing as Jose Cuervo cereal.
Labels: apatar, data integration, mashup, mashup camp, mashupcamp4, open source
Posted by Mark Monday, July 23, 2007 11:08:00 PM |
Getting Data for Mashups
Getting data is still the hardest part, even with great tools like Dapper and Kapow. They aren't perfect, but neither are the web sites they go after (see previous post about Yahoo!). One thing we learned building the main mashup for the competition was that it's helpful to have an ETL tool around. We used page scrapers and RSS mungers and the like, but the only way to get data from Salesforce.com is to write to their API.
I love open source. Amazing how quickly OSS data integration has been coming to market. Aside from Apatar there's Talend, Kettle (Pentaho Data Integration) and SnapLogic (though they aren't technically an ETL tool). I saw a mention of JitterBit recently but know nothing about them yet.
Labels: apatar, etl, integration, mashup, mashup camp, mashupcamp4, web2.0
Posted by Mark Saturday, July 21, 2007 10:31:00 PM |
Unexpected Lessons From the Mashup Camp 4 Contest
Mashup Camp is done and we came away with 4 of the prizes from the IBM (and Dapper, Kapow, StrikeIron, Accuweather) mashup building competition. Not a clean sweep but we did our best.
While sourcing information for the mashups, one thing really surprised me: Yahoo! generates really poor HTML that's hard to parse. We had identified data for which there were no feeds in several areas (like Yahoo! Finance) and built scrapers. We couldn't get any of them to work (I tried five different sets of pages). We ended up going to several other sites, like Google Finance, to get our data. Google's sites were simple to scrape, with nothing strange going on.
Lesson learned: if you want to use Yahoo!, go via one of their APIs or RSS feeds; otherwise find another source, because you'll be pulling your hair out.
A related lesson is just how web-unfriendly Microsoft can be. I hadn't realized how difficult ASP-generated pages are to go after: the interesting state lives in POSTed form fields rather than in the URL. If you use URL-driven tools you are simply SOL. If you use a tool that can do directed spidering, it's still not easy, but it works.
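To see why URL-driven tools fail here, this is roughly what a directed spider has to do with an ASP.NET-style page: harvest the hidden postback fields (like __VIEWSTATE) and carry them along in the next POST. The sample page and field values below are made up for illustration:

```python
import re

# Made-up example of the HTML an ASP.NET page serves: the navigation
# state is carried in hidden form fields, not in the URL.
SAMPLE_PAGE = """
<form method="post" action="default.aspx" id="form1">
<input type="hidden" name="__VIEWSTATE" value="dDwtMTI3OTMzNDM4NDs7Pg==" />
<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL+7o4=" />
<input type="submit" name="btnNext" value="Next" />
</form>
"""

def hidden_fields(html):
    """Collect the hidden inputs the postback expects to get back."""
    pattern = r'<input type="hidden" name="(\w+)" value="([^"]*)"'
    return dict(re.findall(pattern, html))

def postback_payload(html, extra):
    """Merge the hidden state with the fields we actually want to submit."""
    payload = hidden_fields(html)
    payload.update(extra)
    return payload

payload = postback_payload(SAMPLE_PAGE, {"btnNext": "Next"})
# payload now carries __VIEWSTATE and __EVENTVALIDATION plus our button,
# ready to be urlencoded and POSTed back to the same URL.
```

None of that state is visible in a URL, so a tool that only follows links never sees the next page; a spider that replays the form does.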
I'll be ready for the next contest. IBM gave two months of advance notice, but neither we nor the #2/#3 winners took advantage of the time. We didn't start until the day before the deadline, when the wine and beer provided by IBM encouraged us to help each other learn the tools to do the work.
Labels: mashup, mashup camp, mashupcamp4, web2.0
Posted by Mark 9:09:00 PM |
Mashup Camp 4, Day 1
I'm at Mashup Camp 4 this week. Tired. Who starts a conference with breakfast at 7:30 AM and runs it until 6:00 PM? Based on today, someone who's wrangled lots of smart people to talk about leading edge technology.
Bottom line: lots of sexy UI stuff like Zude going on, less focus on data integration, but more than I expected. Enjoyed the SnapLogic presentation, not so much because they are innovative (they are, but there are a half-dozen other entries in the general web integration market) but because they have a freely downloadable, server-based open source offering. That circumvents one of my complaints about Yahoo Pipes, Dapper and OpenKapow: SnapLogic lets you run your own server, while the others are hosted services with no self-hosting option. (Closed-source Kapow costs a lot, does more, and runs inside your firewall.)
Saw a Google Gears demo of an application running online and off. That's some interesting stuff, and it's still very early. Given the flakiness of the conference wifi, more people could have benefited from offline browsing.
Also excited because I just got my invite for the Google Mashup Editor, so I can play around. So many things to try out, so little time. I'd love to integrate LignUp with one of the open source BI tools and build some fun voice-enabled BI alerter demos.
Overall quality of demos and talks was high, with interesting things from both small startups and big companies like Yahoo, Google and Microsoft. Looking forward to tomorrow.
Labels: events, mashup camp
Posted by Mark Tuesday, July 17, 2007 12:21:00 AM |
List of Data Integration Webcast Links for This Year
Here are links to replays for all the webcasts I've done during the past year.
The Hybrid Data Warehouse: Extending the BI Environment with Data Federation
Challenges and Techniques When Hand-coding ETL
Extract-Transform-Load (ETL) Market Overview and Directions
Enterprise Information Integration Technologies: Common Uses and Abuses
Enterprise Integration Series: Evaluation Criteria for Selecting ETL Tools
Labels: data federation, data integration, EII, etl, events, webcasts
Posted by Mark Monday, July 09, 2007 11:58:00 AM |