Archive for the “Search Engine Technology” Category


Cuil Home Page Screen Shot

It’s cool to be Cuil today. Cuil Inc. launched its new search alternative to Google today. Cuil, pronounced “cool”, has received lots of press, and it helps when that press is in the right places. If it were not for the fact that the principals have a history of adding value to existing search products such as Google search, this rollout would hardly be noticed.

But the fact that they have a track record, worked at Google, and boast an index bigger than Google’s is newsworthy. Cuil is led by Anna Patterson, a former engineer at Google. Along with her husband Tom Costello, a search expert in his own right, she aims to take on Google. No small feat.

But having a bigger index doesn’t mean you’re better, and only time will tell if they have what it takes to carve out a piece of the big search pie. They claim to be able to search across 120 billion web pages, compared to an estimated 40 billion for Google. Google does not officially reveal how many pages it indexes, but other sources suggest that it keeps an index of around 60 billion pages. Google also says that not all of the pages it crawls are indexed because many are duplicates. Working in this industry, I can confirm that there is a lot of duplicate content out there.
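To make the duplicate point concrete, here is a minimal sketch of how a crawler might skip exact duplicates before indexing. It is an illustration under my own assumptions, not how Google or Cuil actually do it: the function names, the normalization rule (lowercase, collapse whitespace) and the example URLs are all hypothetical, and real engines also have to catch near-duplicates, which this does not.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash page text after lowercasing and collapsing whitespace (assumed rule)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages: dict[str, str]) -> dict[str, str]:
    """Keep only the first URL seen for each distinct fingerprint."""
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for url, text in pages.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            unique[url] = text
    return unique


# Hypothetical crawl: the first two pages differ only in case and spacing.
crawled = {
    "http://example.com/a": "Hello   World",
    "http://example.com/b": "hello world",
    "http://example.com/c": "Something else entirely",
}
indexed = drop_exact_duplicates(crawled)
print(f"crawled {len(crawled)} pages, indexed {len(indexed)}")
```

With this toy input, three pages are crawled but only two are indexed, which is why a crawl count and an index count can legitimately differ.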

For Cuil to take some market share away from Google, it will take more than boasting of a bigger index. The reality is that, with enough hardware and money, a startup can build an index that is big, even huge, as Cuil has. The test of whether Cuil can succeed will be whether the public and business users find more relevant search results through Cuil. Being as big or as fast as Google is not enough. You have to be able to change people’s search preferences. And that’s not easy.

What is noteworthy is that Cuil says they’ve developed a faster, better way to index pages and, just as important, one that uses less hardware. Less hardware matters because the cost to index, store and serve up results can be prohibitive. The steadily falling cost of hard drives, CPUs, etc. helps. However, even though RAM prices have come down, RAM is still one of the most expensive aspects of creating a searchable index.

In my initial tests of Cuil I was both pleased and disappointed with the results. Some common searches returned no results. I’ll attribute that to first-day bugs. But I also found that sources like Wikipedia were heavily weighted, sometimes ranking ahead of the actual site that I was looking for.

It’s public day 1 for Cuil and they have people’s attention. Let’s see if they can keep it and build some momentum. In the meantime I’ll give them a try and report back with my thoughts in the near future.


Twitter Logo
I have a secret: for the last couple of months, as a side project, we’ve been crawling Twitter with the idea of creating a small niche vertical search of tweets. But the more I come across cool applications like Twitterholic, Tweetstats, Twubble, Tweet Scan, twemes etc., the more I think we can do more with our data. So my question to anyone caring to answer is: if you had a rockin’ application you’d like to see built for Twitter, what would it be?

You never know, we might just build it.


Yahoo Search Blog
In his latest entry on the Yahoo Search Blog, Vish Makhijani discusses “Yahoo! Search: An Open Approach to Search”. This post builds on last week’s announcement of the largest Hadoop production application, and I love it. It’s innovative, especially for content producers. They (we) finally get a say in the output of Yahoo’s search results like never before. Whether you’re a content producer or a searcher, you can sign up for more information here.

“Because the platform is open it gives all Web site owners — big or small — an opportunity to present more useful information on the Yahoo! Search page as compared to what is presented on other search engines. Site owners will be able to provide all types of additional information about their site directly to Yahoo! Search. So instead of a simple title, abstract and URL, for the first time users will see rich results that incorporate the massive amount of data buried in websites — ratings and reviews, images, deep links, and all kinds of other useful data — directly on the Yahoo! Search results page.”


Hadoop
Some exciting news today from Eric Baldeschwieler, Senior Director, Grid Computing, on the Yahoo Developer Network: “Yahoo! Launches World’s Largest Hadoop Production Application”. I’ll note that my company, Hyperix, is using Hadoop for our vertical search platform.

Here are some of the stats:

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes


ReadWriteWeb has a good article on cloud computing today.

The article, “Reaching for the Sky Through The Compute Clouds”, is written with the Amazon Web Services outage last Friday fresh in our minds. I’m a big proponent of cloud computing because, in my opinion, it’s the only way to truly scale large data-driven applications such as search, which is what I’m working on.

“So is it really true - is cloud computing a bad idea? Of course not. It is a wonderful, powerful idea. In this post, we explore the ideas behind cloud computing and argue that it will be an integral part of our future.”

“Do Clouds Really Work?

You bet! The best example is Google. The king of the web is reigning with a farm of hundreds of thousands, if not millions of boxes. To race along with the web, Google constantly increases the size of its cloud, incorporating new web sites, and expanding its index.

Of course, Google isn’t the only one operating in a cloud. All major web players including Amazon, eBay, Yahoo! and Facebook are running some sort of massive computing cloud.”
