Why locust?

Gregory Kozlovsky

Every software project has a reason behind it. It may be an opportunity to provide new functionality to users, to advance an academic career, or to make a move in office politics. What are the reasons behind locust?

In the shadow of Google's mega-success, any new search engine is unavoidably compared to it. So where does locust stand relative to Google? Unlike Google, locust does not aspire to search the whole internet. Our goal is to provide software for vertical search applications that focus on a specific area of activity: a business niche such as travel to a particular country or a trade area, enterprise search, or a knowledge area such as a branch of science. In all these cases, search is restricted to a collection of selected sites. This difference in application has profound consequences, of which the smaller number of documents to be searched is not the most important. As Thomas W. Malone, the Director of the MIT Center for Collective Intelligence, observed:

[T]here's a tremendous amount of structure in the Web itself that people create every time they add a link to one of their pages. That collective intelligence is what Google leverages so effectively. Ironically, most internal webs - intranets - don't have that same kind of cross-linking. As a result, the same algorithms are radically less effective internally than externally.

Dr. Malone's conclusion applies, in most cases, to specialized institutional and corporate websites as well and, more broadly, to vertical search. Fortunately, vertical search has specific features that can be used to great advantage.

Users search for documents, i.e. self-contained, coherent collections of words (possibly illustrated with pictures) written to give readers information on a specific subject. A web server, however, serves pages, and there is a big difference between the two. First, on a modern website, most pages are not documents but what we call gateway pages. Gateway pages have no information value in themselves; they are designed to help users find real documents by following hyperlinks. Second, one document can be presented as many different pages: formatted for reading, for printing, for viewing on a mobile device, and so on. Finally, even on a page that displays a document, the document itself usually occupies only the central part, what we normally call content; the margins of the page are used for the header, the footer, and navigation columns.

As a result, the searchable index created by a search engine spidering a website, even Google's, is bloated with meaningless gateway pages and polluted with words that appear on document pages only for navigation and other purposes. In certain quite frequent situations this greatly degrades search quality. For example, a query with a specific pair of keywords that, in a clean index, would find the necessary documents may instead match the two keywords separately in the titles of unrelated documents listed on gateway pages; the result is a torrent of gateway pages with high link popularity. Another frequent situation is when a query keyword is part of a navigation panel that is the same on all pages of a site. In this case, search results will contain every page of the site, making it impossible to find the few pages where the keyword appears in the document itself.

A generic internet search engine cannot deal with this situation efficiently: there is no way to differentiate algorithmically between gateway pages and documents, or to separate navigation from content on a page. It is, however, very easy to do in the cooperative environment of area search. With the help of the site administrator, a search engine can be configured to distinguish the URLs of gateway pages from those of documents, and structured comments can be used to separate page content from navigation code and other adornments. What is needed is a search engine that possesses an appropriate set of configuration options.
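As an illustration, structured comments might mark up a page roughly as follows. The marker names here are purely hypothetical and do not reflect any fixed locust syntax; the point is only that a comment pair can fence off navigation markup so the indexer skips it:

```html
<!-- noindex-begin -->          <!-- hypothetical marker: indexer ignores this region -->
<div class="navigation">
  <a href="/home">Home</a> <a href="/sitemap">Site map</a>
</div>
<!-- noindex-end -->

<div class="content">
  The document text between the markers above and below this region
  is what actually gets indexed.
</div>
```

Since HTML comments are invisible to browsers, such markers cost the site nothing while giving a cooperating search engine a clean view of the content.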

locust is designed to do exactly this. It can easily be configured to skip indexing pages while still following their links, or vice versa. It can include or exclude specific folders, URLs matching a regular expression, or single documents. It can skip page navigation and adornments using structured comments. There is also a rich set of configuration options for indexing external pages hyperlinked from an indexed site.
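A configuration of this kind could look something like the sketch below. The directive names are invented for illustration and are not locust's actual syntax; they only show the shape of the include/exclude and follow-without-indexing rules described above:

```
# Hypothetical crawler configuration sketch (directive names are illustrative)
Server      http://www.example.com/

Exclude     /cgi-bin/              # never visit this folder
FollowOnly  /catalog/              # gateway pages: follow links, do not index
IndexRegex  /articles/.*\.html$    # documents: index pages matching this regex
```

The essential capability is that indexing and link-following are controlled independently, so gateway pages can feed the crawl without polluting the index.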

Another area where the capabilities of existing search engines can be improved is taxonomy-assisted search. As practiced now by Yahoo or the Open Directory, taxonomy-assisted search has whole web sites as its smallest unit. This is a substantial limitation even for an internet-wide search engine, because sites often combine several subjects. For vertical or area search, this limitation makes taxonomy-assisted search useless.

locust can group documents into docsets. A docset (document set) contains a subset of the documents from one or several sites. To be efficient, this capability requires deep changes in the structure of the reverse (inverted) index files. Docsets can then be used as leaves of a multi-hierarchical tree of categories.
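The idea can be sketched in a few lines of Python: an inverted index maps words to document ids, a second map records docset membership, and a search optionally intersects the two. This is a toy in-memory model for illustration only; locust's actual on-disk index structures are far more elaborate, and the docset names below are invented:

```python
from collections import defaultdict

class DocsetIndex:
    """Toy inverted index with docset (document set) filtering."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document ids
        self.docsets = defaultdict(set)   # docset name -> set of document ids

    def add_document(self, doc_id, words, docsets):
        # Index every word of the document and record its docset membership.
        for word in words:
            self.postings[word.lower()].add(doc_id)
        for name in docsets:
            self.docsets[name].add(doc_id)

    def search(self, word, docset=None):
        # Return ids of documents containing the word,
        # optionally restricted to one docset (a taxonomy leaf).
        hits = self.postings.get(word.lower(), set())
        if docset is None:
            return hits
        return hits & self.docsets.get(docset, set())

# Example: two documents mentioning "tibet", in different docsets.
idx = DocsetIndex()
idx.add_document(1, ["tibet", "travel"], ["asia-travel"])
idx.add_document(2, ["tibet", "history"], ["history"])
```

Here `idx.search("tibet")` finds both documents, while `idx.search("tibet", docset="asia-travel")` finds only the travel one, which is exactly the restriction a category tree built on docsets would apply.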