A lot has been written about cloud computing in the last year, and each day seems to bring news of a new player in the cloud arena. So what does the cloud have to offer search engine companies like Hyperix? Well, that depends on how deep our pockets are. After all, we need a lot of bandwidth, processing power and data storage to run any real search engine. And as we don’t have deep pockets, nor an angel or venture firm backing us, we’ve had to find creative solutions and innovate where possible.

Up to this point we’ve been focusing solely on the technology that will differentiate us from the other vertical search platforms out there. We’ve got our own small web crawling cluster set up, which we’ve used for some time to test different web crawlers, collect and parse data, and measure a variety of crawler metrics that determine how many CPU cycles and how much RAM, bandwidth and storage are necessary to create the vertical search indexes we want. We’ve also been focusing on the quality of the data we’re crawling, the algorithm which ranks the pages crawled, the parsing engines, and the results pages.

Having determined a baseline for our costs to crawl on our own, we’re now comparing that with web crawling using Amazon’s Elastic Compute Cloud (Amazon EC2). After we’ve compared the two we’ll decide which to use as we move forward with our production web crawls. We would prefer to use our own hardware, but the cost can be prohibitive. You would think that at some point it would make financial sense to run the crawls on your own hardware, but until we actually test the crawl on Amazon EC2 we won’t know the true costs. And while we could just crunch numbers in Amazon’s calculator, anyone who’s ever done crawling knows that there are many variables that determine how long a crawl will take, the RAM it will use, and how many CPUs and nodes are required to achieve an efficient crawl.
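
To make the comparison a little more concrete, here is a rough sketch of the break-even arithmetic we have in mind. The function names and every number in it are placeholders made up for illustration; they are not our measured crawl costs or Amazon’s actual prices, and the sketch ignores things like power, hosting and our own time.

```python
import math


def cloud_crawl_cost(instance_hours, hourly_rate, gb_transferred, rate_per_gb):
    """Cost of one crawl on rented instances: compute time plus data transfer."""
    return instance_hours * hourly_rate + gb_transferred * rate_per_gb


def break_even_crawls(hardware_cost, own_cost_per_crawl, cloud_cost_per_crawl):
    """Number of crawls after which buying hardware beats renting.

    Returns None when the cloud is no more expensive per crawl,
    in which case the hardware never pays for itself.
    """
    saving_per_crawl = cloud_cost_per_crawl - own_cost_per_crawl
    if saving_per_crawl <= 0:
        return None
    return math.ceil(hardware_cost / saving_per_crawl)


# Hypothetical inputs only: swap in measured values from a real test crawl.
cloud = cloud_crawl_cost(instance_hours=400, hourly_rate=0.10,
                         gb_transferred=500, rate_per_gb=0.10)
crawls = break_even_crawls(hardware_cost=6000, own_cost_per_crawl=30,
                           cloud_cost_per_crawl=cloud)
print(f"cloud cost per crawl: {cloud:.2f}, break-even after: {crawls} crawls")
```

The catch is that the instance hours and bandwidth going into a calculation like this are exactly the variables we can’t know until we run a real crawl on EC2.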

Aside from web crawling, there’s the search side of the equation. There are some search engines which use Amazon’s web services not only to crawl for data, but also to serve up their searches. We’ve determined that Amazon’s services as offered aren’t a cost-effective solution for our search needs. This primarily has to do with our search indexes. When users search our vertical search niches they’ll be querying our indexes, which are held in memory. We’ve come up with an innovative solution that a) dramatically reduces our memory costs, b) is faster than current index searching, and c) is cheaper for us to run on our own hardware. The innovation, which is theoretical at this point, is going to be tested for the first time later this year, but we are confident it will work.

For us, at this time, cloud computing on Amazon EC2 may be useful for web crawling, but not for our search servers.
