Since we’re getting closer to the public unveiling of the first vertical search engine built on the Hyperix search platform, I think it’s time to release a couple of tidbits about the platform. Hints have been out there for a while, and some people within the search community are already aware of this, but one of the primary components of Hyperix is Hadoop.

Here’s a quick intro to what Hadoop does, from the open source project’s site:

“Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.

Here’s what makes Hadoop especially useful:

* Scalable: Hadoop can reliably store and process petabytes.
* Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
* Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
* Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.”
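
To make the map and reduce steps in that description a bit more concrete, here’s a minimal sketch in plain Python. It is not Hadoop’s actual API and it is not Hyperix code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process rather than across a cluster. It only shows the general shape of a MapReduce job, using the classic word count example.

```python
from collections import defaultdict


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all the values for one key into a single count."""
    return key, sum(values)


def word_count(blocks):
    """Run map, shuffle and reduce over a list of text blocks.

    Assumption: each string stands in for one data block. On a real
    cluster the map calls would run in parallel on the nodes that hold
    the blocks; here they simply run one after another.
    """
    mapped = []
    for block in blocks:
        mapped.extend(map_phase(block))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    blocks = ["hadoop stores data", "hadoop processes data"]
    print(word_count(blocks))
```

The point of splitting the work this way is that each map call only needs its own block, which is what lets a framework like Hadoop push the computation out to wherever the data already lives.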

And recently Business Week wrote a piece about Hadoop, “The Two Flavors of Google”:

“A battle could be shaping up between the two leading software platforms for cloud computing, one proprietary and the other open-source.”

“This means that the two leading software platforms for cloud computing could end up being two flavors of Google, one proprietary and the other—Hadoop—open source. And their battle for dominance could occur even within Google’s own clouds. Here’s why: MapReduce is so effective because it works exclusively inside Google, and it handles a limited menu of chores. Its versatility is a question. If Hadoop attracts a large community of developers, it could develop into a more versatile tool, handling a wide variety of work, from scientific data-crunching to consumer marketing analytics. And as it becomes a standard in university labs, young computer scientists will emerge into the job market with Hadoop skills.”

We’ve been using Hadoop since day one. It gives us a platform for scaling our data processing, but it is just one piece of the puzzle. The intellectual property we’re developing lies primarily within the rest of the Hyperix platform, parts of which I’ll discuss in future posts.
