Helpful Resources for Scaling Web Sites and Applications
There's a good collection of presentations about high-end scaling of web applications at Peter Van Dijck's blog. Most are hosted on slideshare, but there are some mp3s and PDFs as well. It's hard to find quality information on scaling applications of any sort. I used to deal with scalability a lot in the early days of the web. Technologies have changed quite a bit since then so I focus on scalability mostly in the data warehouse and business intelligence world now. I always pay attention when someone goes through the trouble of collecting a set of good quality material like this. I've only seen two of the presentations he's got listed, and never read the O'Reilly book he mentioned, which means more summer reading in the works.
Posted by Mark Thursday, May 31, 2007 1:28:00 PM |
Real Estate Data Visualization
Trulia, a real estate search service, has an interactive map of US homes animated based on when the homes were built. It's fun to play with. The search service (the first link) has a well-designed navigation and display interface, better than others I've used. Worth looking at to see how information displays can be made interactive.
Labels: real estate, search, visualization
Posted by Mark Wednesday, May 30, 2007 4:22:00 PM |
LOLTrek: This Meme Has Jumped the Shark
If you don't know what this means, you probably need to stay in more. In this case, go find the "I can has cheezburger?" image, then visit a tribute to tribbles, LOLTrek. If you don't know what "jump the shark" means then you're definitely having too much life and not enough computer and need to spend some time on Wikipedia.
Even funnier is Google correcting "I can has cheezburger" to "I can has cheezeborger" when I tried to track down one of the original images:
To finish off my news reading for today, "I'm in ur cellz killin ur mitochondrias" (Fanta screws with your mitochondria?) Good thing I don't drink those fizzy drinks.
Labels: fanta, lolcats, loltrek
Posted by Mark Tuesday, May 29, 2007 9:31:00 PM |
Story About the First Computer
The New Yorker has a story about the Antikythera Mechanism, possibly the first mechanical computer, built by hand in ancient Greece. I liked the conclusion best, that the device embodied the world view of an era, in much the way that our current world view is embodied in our own technology.
"One day in the spring of 1900, a party of Greek sponge divers returning from North Africa was forced by a storm to take shelter in the lee of the small island of Antikythera, which lies between Crete and Kythera. After the storm passed, one of the divers, Elias Stadiatis, put on a weighted suit and an airtight helmet that was connected by an air hose to a compressor on the boat, and went looking for giant clams, with which to make a feast that evening."
continue reading story
Posted by Mark 9:52:00 AM |
US Terrorism Data: Worst Practices for Data Quality and Governance
The Department of Homeland Security is a perfect counter-example of how to deal with data. A way to learn good practices is to do the opposite of what the government does. Here's a list of worst practices to not follow from the DHS.
Worst practice #1: load all the data you can lay your hands on. Terror Database Has Quadrupled In Four Years discusses the "Terrorist Identities Datamart Environment" or TIDE, the source of data for airline, police, border and consulate watch lists. Their policy is to shovel everything possible in there, needed or not, because it might be useful some day. This runs counter to years of data warehouse practices.
"Each day, waves of new information are fed into terrorism-suspect databases... Ballooning from fewer than 100,000 files in 2003 to about 435,000, the growing database threatens to overwhelm the people who manage it."
Aside from this problem, do we really believe there are 435,000 suspected terrorists in the US? That's almost four times larger than our military in Iraq. What are they waiting for? They could take over the US now. Obviously something is wrong with all that data.
Worst practice #2: don't do anything about data quality. "The single biggest worry that I have is long-term quality control," said Russ Travers, in charge of TIDE at the National Counterterrorism Center in McLean. Yet he's doing nothing to QA the data before dumping it into the system. To make things worse, there's no way to remove or correct data once it's loaded.
"The bar for inclusion is low, and once someone is on the list, it is virtually impossible to get off it."
"In 2004 and 2005, mis-identifications accounted for about half of the tens of thousands of times a traveler's name triggered a watch-list hit."
"Sen. Ted Stevens (R-Alaska) said last year that his wife had been delayed repeatedly while airlines queried whether Catherine Stevens was the watch-listed Cat Stevens. The listing referred to the Britain-based pop singer who converted to Islam and changed his name to Yusuf Islam. The reason Islam is not allowed to fly to the United States is secret."
TSA can't tell the difference between a 70's musician and a senator's wife? That's a serious data quality problem. Given the 50% miss rate, it would be better for the police to toss a coin each time they arrest someone instead of consulting the watch lists.
The data management practices go one step further. They actually have a process to include names on the list that are no longer valid. Their idea is that if someone is dead (for example), then a terrorist might use that name since it won't be on the list. Perhaps all the dead people account for the growth of the database. Why not go one step further and put the names of everyone who died this year onto the list?
Worst practice #3: ignore the problems of your users.
"TIDE is a vacuum cleaner for both proven and unproven information, and its managers disclaim responsibility for how other agencies use the data. "What's the alternative?" Travers said."
Multiple agencies are complaining about wasted man-hours due to mis-identification. Airlines are routinely stopping people in airports, like Ted Stevens' wife, even though the data is obviously wrong. God forbid that we should have any data management processes in place. Travers should probably get out to a data warehousing conference once in a while, particularly since he said earlier that data quality is a problem. His current alternative of doing nothing isn't feasible.
But the problems don't stop with TIDE...
Worst practice #4: if the users aren't sold on the concept, build it and they will come.
Remember TIA? Congress killed the program, so the people involved did it for the government of Singapore instead. Son of TIA: Pentagon Surveillance System Is Reborn in Asia tells how Poindexter and company built a system for tracking people in a totalitarian state. Now they want to sell the system they developed back to the US government. It's like doing an IT-driven data warehouse project, only with Orwellian overtones.
Worst practice #5: ignore the users. DHS has a monumental mess on their hands. Different government departments need different data, just like the finance department has different needs than the marketing department. Yet the DHS systems are being centrally mandated by (mostly) intelligence people with no idea of how other groups like the police or the border patrol need information. Combine this with poorly managed data integration and no governance and you have data being copied and misapplied all over the place.
A number of the articles I linked to came up via Bruce Schneier's crypto-gram mailing list. Always interesting reading, even if you aren't a security professional. Choice items from this month's issue:
"...tips on preventing terrorism" indeed. (Tip #7: When transporting nuclear wastes, always be sure to padlock your truck.)
YOU are big brother: Control and track your car from the 'net Or, as Schneier says, have someone hack into their web site and control it for you.
AMEX is Watching You AMEX has a patent application titled "Method and System for Facilitating a Shopping Experience," that:
"describes a Minority Report style blueprint for monitoring consumers through RFID-enabled objects, like the American Express Blue Card.
According to the patent, RFID readers called "consumer trackers" would be placed in store shelving to pick up "consumer identification signals" emitted by RFID-embedded objects carried by shoppers. These would be used to identify people, track their movements, and observe their behavior."
Breaches of personal data: blaming the myth and punishing the victim Can we stop blaming hackers for theft of information already?
The report states that "60 percent of the incidents involve missing or stolen hardware, insider abuse or theft, administrative error, or accidentally exposing data online."
Given that its data suggests that a significant portion of the blame should go to those who hold the data, the report argues forcefully for legislation that requires they meet minimum data safety standards.
Windows Vista code-signing to keep out evil spyware? I don't think so: VBootkit bypasses Windows Vista's code-signing mechanisms Microsoft spent a lot trying to secure Vista. So far, no dice.
Local Sheriff Suspects Al-Qaeda Or Teens
"This activity matches up with the M.O. of a terrorist casing a potential target," Steinhorst said. "It also matches the M.O. of a group of teens drinking beer and fooling around."
I don't read The Onion enough.
Labels: data quality, data warehouse, DHS, TSA
Posted by Mark Saturday, May 26, 2007 1:52:00 PM |
Feel Free to Carry Explosives on Planes, You Won't Get Caught
I just got done with 6 weeks of continuous travel and I wasn't surprised to read that airport security missed 90% of improvised and hidden explosives during security tests, proving our system doesn't work. Yet we still have to take off our shoes, fork over our mouthwash and listen to "the threat alert has been raised to orange - report your neighbors". My favorite quote is this from a former TSA inspector:
"There's very little substance to security," said former Red Team leader Bogdan Dzakovic. "It literally is all window dressing that we're doing. It's big theater on TV and when you go to the airport. It's just security theater."
Dzakovic, who testified that the FAA ordered the Red Team to "not write up our findings," said the TSA is also trying to hide its results.
"The last thing TSA wants to do is look bad in front of congress and in front of the public, so rather than fix the problem, they'd rather just keep them quiet," said Dzakovic.
Sure would be nice if the idiots running homeland "security" would stop trying to keep their evil overlords in power by pretending we're under imminent attack and instead focus on doing their jobs. I know that's a lot to ask of this administration.
Labels: DHS, security, TSA
Posted by Mark 12:13:00 PM |
Web Integration Talk in Las Vegas
I'm off to Las Vegas for the Portals, Collaboration and Content Conference to give a talk on data integration for the web. Title of the talk is "Web Data Integration: Methods to Extract and Deliver Data for Portals and Web Applications." I'll be doing a run-through of integration architecture and technology choices to get data from where it is to where you want it. I had to cut back on the part that I find most interesting, getting data off web pages. Scraping, scrAPIs, RSS and the rest of the fun web stuff gets about 10-15 minutes towards the end. I'll probably repurpose the unused information into posts. The slides will make their way online some time in the next couple weeks. I can say one thing - they're pretty, relative to some of the other presentations.
Labels: BI, data integration, events, portals
Posted by Mark Monday, May 21, 2007 9:03:00 AM |
Microsoft Patent FUD Appears to be Exactly That
Since I just finished running a day on open source at TDWI, I thought it would be worthwhile to comment on this. It's always hard to tell what's going on when they toss out FUD, but the authors of the study about potential Linux infringements say that Microsoft is misrepresenting their conclusions:
"The point of the study was actually to eliminate the FUD about Linux's alleged legal problems by attaching a quantifiable measure versus the speculation," he said. "And the number we found, to anyone familiar with this issue, is so average as to be boring; almost any piece of software potentially infringes at least that many patents."
This looks like another case of bogus interpretation similar to the Linux TCO study, where they concluded that Linux was ten times more expensive to run than Windows. It's true, when you look at what they compared: an Intel PC running Windows vs. an IBM mainframe running Linux.
Labels: microsoft, open source, patents
Posted by Mark Saturday, May 19, 2007 1:03:00 PM |
Open Source BI Case Study Slides Are Posted
We had a good open source session at the May TDWI conference. After I spent some time reviewing history, projects and adoption practices we got to the good part: short case studies, demos and a panel session with representatives from BIRT, JasperSoft, Pentaho and SpagoBI. I particularly enjoyed the panel session where we had some great insights from the panel on what's happening in this space and some of the problems people are facing. I also liked Cindi Howson (the most knowledgeable person on the subject of BI products I know) asking some basic but pointed questions that were on the minds of the mostly non-OSS audience.
Special thanks to Paul Clenahan and Jason Weathersby from Actuate/BIRT, Nick Halsey and Beth Mazur of JasperSoft, Lance Walter and Nicholas Goodman of Pentaho, and Grazia Cazzin and Daniela Tura representing SpagoBI. They did a lot of work and came at their own expense to represent their projects at this conference.
All of their overview and case study slides are available at Third Nature.
Labels: BI, business intelligence, open source
Posted by Mark Friday, May 18, 2007 8:06:00 PM |