Archive for the “Search Engine Technology” Category



On Friday at the Open Source conference, Jimmy Wales, founder of Wikipedia and of Wikia, the open source search engine project, announced the release of an open source web crawler called Grub. Grub crawls the web, indexing pages for the Wikia search engine.

It’s a clever idea that builds on other distributed projects such as SETI@home. Crawling the web is costly, so having thousands of clients do it for you saves money and could make crawling cost-effective. However, I have to wonder what percentage of an actual crawl will be performed by Grub’s distributed clients. Also, if my computer is contributing to this project, which, although open source, is still a for-profit venture, shouldn’t I profit from it as well?
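To make the idea concrete, here is a minimal sketch of how a coordinator might split a list of URLs among volunteer clients. This is not how Grub actually works; the function names and the choice to hash on host name (so that each site is fetched by a single client) are my own assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def assign_client(url: str, num_clients: int) -> int:
    """Map a URL to a client index by hashing its host name (hypothetical scheme)."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap to the client count.
    return int.from_bytes(digest[:8], "big") % num_clients


def partition_frontier(urls, num_clients):
    """Group the URLs waiting to be crawled by the client that should fetch them."""
    batches = defaultdict(list)
    for url in urls:
        batches[assign_client(url, num_clients)].append(url)
    return dict(batches)
```

Because every URL on the same host lands with the same client, that client can throttle its own requests to the site. A real system would also have to deal with clients that join, leave, or return bad data.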

Grub aims to compete with Google. If they can get enough computing power behind them, they might be able to build an index as large as Google’s, maybe even larger, but the key to gaining and keeping market share will be the results returned. And that is all in the algorithms of the search engine.

A distributed web crawling client for Project Phoenix is something to consider, and offering people a portion of the revenue stream could make it attractive to users.

Comments No Comments »

When dealing with vast amounts of data you need a scalable distributed storage system. All of my database-driven web sites use MySQL, and for smaller databases you can add a MySQL-backed search to the web site. But you soon find out that to deliver fast search on a site or, as in my case, to offer a search service over hundreds of millions of items of crawled niche data, you need storage that scales across many machines.

Recently Google hosted a Conference on Scalability in Seattle, where they talked about MapReduce, BigTable, and other distributed systems for large datasets. Listed here are the talks that are now available on Google Video:

(Kudos to Greg Linden for compiling the list of videos.)

The videos provide some technical detail, while Marissa Mayer’s talk provides some insight into Google’s big-picture plans.

Google’s technology, however, is closed, so if you’re interested in a solution that you can actually use, turning to open source projects is the way to go. This is where Hadoop and HBase come in.
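For readers new to MapReduce, the programming model itself is simple enough to sketch in a few lines. The example below builds a tiny inverted index in a single Python process. It is only an illustration of the map, shuffle, and reduce steps; it is not Hadoop’s API, and all of the function names are my own.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Collapse the doc ids for one word into a sorted posting list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())
```

Calling `build_index({"d1": "open source search", "d2": "search engine"})` maps each word to the documents that contain it. In a real cluster the map and reduce calls run on different machines and the framework handles the shuffle, which is where the scalability comes from.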

(more…)


The answer, of course, is search, although most people who read this will never have heard of Nitin Karandikar.

Search is in my blood these days, more specifically vertical search and Project Phoenix, which I’m currently working on. Everywhere I turn, people are talking about search. Today I came across a trio of interesting blog posts related to search and offer them here as worth reading.

In the first, Tim O’Reilly talks about “What Would Google Do?” It is more about innovation and Web 2.0 than specifically about search, but the core of the article comes from data gathered from Google services, including search. Web 2.0 is more than just a buzzword; it’s a fundamental shift in how we interact online. Quoting Tim:

… it goes right to the heart of what makes Web 2.0 applications so interesting: they are alive, or as close to it as you can get with a computer. They learn from and interact directly with their users (and more specifically, provide services to individual users that benefit from the aggregate interaction of the system with all of its users.)

(more…)


Greg Linden, formerly of Amazon and founder of Findory, wrote an interesting blog post over the weekend about how Google looks set to reject federated search and instead keep a local copy of all the data it collects on its own cluster. For those who don’t know, federated search is when a query is sent to many search engines and the results are aggregated and reranked; in more general terms it is called metasearch.
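The aggregate-and-rerank step can be sketched briefly. The example below merges ranked lists from several engines using reciprocal rank fusion, which is one common merging technique and not necessarily what any particular metasearch engine uses. The function name and the constant `k=60` are assumptions for illustration, and the actual querying of each engine is left out.

```python
from collections import defaultdict


def fuse_rankings(result_lists, k=60):
    """Merge several ranked result lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # A result near the top of any list contributes more to its fused score.
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

A result that several engines rank highly rises to the top of the merged list, while a result returned by only one engine sinks. The catch with federated search is that you wait on the slowest engine for every query, which is part of why keeping a local copy of the data is appealing.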

(more…)


The afternoon sessions started with the presentation of the first Everett Brenner Award for the Best Contribution to Knowledge at the 2006 Search Engine Meeting. The winner was Stavros Macrakis, formerly of Lycos and now with FAST, who ironically is scheduled to be the last speaker this afternoon and of the conference.

The sessions this afternoon dealt with web and intelligent tools. The first speaker was Paul Thompson of Dartmouth College. His talk, “Search and Misinformation in Intelligence and Security Informatics”, was quite interesting in that it deals with a relatively little-researched area. He said that what was needed was a new science along the lines of bioinformatics. Fraud was increasing, he said, and he cited one prominent journal which reported that at least 20% of accepted manuscripts, let alone those not accepted, contained at least one occurrence of fraud. He went into some detail about the research he has done over the last few years. His paper will be online Friday at the conference web site for those interested in this subject.

(more…)
