Open Source Search: Zvents Marries Heritrix with Hadoop
The rise of significant open-source search technology projects is one of the key reasons that startups like Zvents can compete with the major portals. Doug Cutting began this trend with the release of Lucene, and the public release of the Google MapReduce paper has now led to Hadoop, an open-source implementation of Google’s GFS and MapReduce systems led by Doug and the folks over at Yahoo.
The engineering team at the Internet Archive has built an open-source web crawler called Heritrix, which provides a full set of features for running an Internet crawl. The current state of the art among startups for high-volume web crawling is to combine Heritrix with Hadoop… the only problem being that the two aren’t really on speaking terms.
What are you going to do with all the Heritrix-crawled data once you’ve pulled it down? To do large-scale data mining on the crawled content, you must go through the inelegant and non-scalable process of writing it to local disk and then copying it into HDFS (the Hadoop Distributed File System). The Zvents engineering staff has developed an extension to Heritrix that allows it to crawl directly into HDFS, speeding up the process and making it much more reliable. Source code and binary distributions of this extension can be found at http://www.zvents.com/labs
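To make the difference concrete, here is a rough sketch of the two data flows. It is not the Zvents extension, and it uses neither the Heritrix API nor a real Hadoop client: `FakeDfs`, `crawl_via_local_disk`, and `crawl_direct` are hypothetical names, and the "file system" is just an in-memory dictionary. It only illustrates that the two-step flow writes every byte twice, while the direct flow writes it once.

```python
import io
import os
import shutil
import tempfile


class FakeDfs:
    """In-memory stand-in for a distributed file system (NOT a real HDFS client)."""

    def __init__(self):
        self.files = {}

    def create(self, path):
        # Return a writable stream whose contents are stored when it is closed.
        dfs = self

        class _Stream(io.BytesIO):
            def close(self):
                if not self.closed:
                    dfs.files[path] = self.getvalue()
                super().close()

        return _Stream()


def crawl_via_local_disk(records, dfs, dfs_path):
    """Two-step flow: write everything to local disk, then copy it into the DFS."""
    bytes_written = 0
    fd, local_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as local:
            for rec in records:
                local.write(rec)
                bytes_written += len(rec)
        # Second pass over the same data: the copy step.
        with open(local_path, "rb") as src, dfs.create(dfs_path) as dst:
            shutil.copyfileobj(src, dst)
        bytes_written += os.path.getsize(local_path)
    finally:
        os.remove(local_path)
    return bytes_written


def crawl_direct(records, dfs, dfs_path):
    """Direct flow: stream each crawled record straight into the DFS."""
    bytes_written = 0
    with dfs.create(dfs_path) as out:
        for rec in records:
            out.write(rec)
            bytes_written += len(rec)
    return bytes_written
```

Both functions leave identical content in the target store. The two-step version also needs enough local disk to hold the whole crawl output before the copy can start, which is where the scaling trouble comes from.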
We’re pleased to contribute this component to the community, and look forward to giving back more useful pieces of our effort to build the best local events search engine.