Treatment of clones in locust

Gregory Kozlovsky

Clones are documents that have identical bodies but distinct URLs (the body includes the HTML header, but not the HTTP header). Clones are very common on modern web sites. If they go undetected, they are all shown in the search results, to the great annoyance of the user.

In locust, clones are detected using a CRC checksum of the body. Among documents with the same checksum, one is designated the original and the rest are called clones. The original document has the attribute origin set to 1; the clones have this attribute set to 0. For the original document, the content and content attributes are stored in the corresponding table urlwordsXX, and its keywords, links, and modification times are written into the spider log (delta files). For a clone, only an entry in the table urlword is created.

During the first indexing, the first document found with a given CRC is designated the original. Documents with the same CRC found later are considered clones of that document. Note that during the first indexing, the later a document is found, the greater its handle. Therefore, the original always has a smaller handle than its clones.
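The first-indexing rule can be sketched as follows. This is only an illustration, not locust's actual code: the function name assign_origins is hypothetical, handles are simply assigned in discovery order, and CRC-32 from Python's zlib module stands in for whichever CRC locust really uses.

```python
import zlib


def assign_origins(bodies):
    """Assign the origin flag to documents in the order they are found.

    bodies: list of document bodies (bytes), in discovery order.
    Returns a list of (handle, crc, origin) tuples.
    Assumption: handles start at 1 and grow with discovery order.
    """
    first_seen = {}  # crc -> handle of the document designated as original
    rows = []
    for handle, body in enumerate(bodies, start=1):
        crc = zlib.crc32(body)  # CRC-32 is an assumption, see the note above
        if crc in first_seen:
            origin = 0  # a document with this CRC was already found: clone
        else:
            origin = 1  # first document with this CRC: original
            first_seen[crc] = handle
        rows.append((handle, crc, origin))
    return rows


rows = assign_origins([b"<html>a</html>", b"<html>b</html>", b"<html>a</html>"])
```

In this example the first and third bodies are identical, so handle 1 becomes the original and handle 3 its clone, which matches the property that the original has the smaller handle.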

Things get more complicated during reindexing and orphan elimination. When a clone has to be deleted, only its entry in the table urlword has to be deleted. When an original is deleted, the process has several steps. First, we find the list of its clones. Second, we designate the clone with the smallest handle to be the new original. Third, we copy the old original's content into the new original's entry in the corresponding table urlwordsXX. Fourth, we add the old original's handle to the deleted documents bitmap and write the new original's keywords, links, and modification times into the spider log. Finally, we delete the old original's entries from the table urlword and the corresponding table urlwordsXX.
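The deletion steps can be sketched with an in-memory model. Everything here is a stand-in chosen for illustration: dictionaries replace the tables urlword and urlwordsXX, a set replaces the deleted documents bitmap, a list replaces the spider log, and the function name delete_document is hypothetical. The behaviour when an original has no clones is not described in the text; the sketch assumes the document is simply removed.

```python
def delete_document(handle, urlword, urlwords, deleted, spider_log):
    """Delete a document, promoting a clone if the document was an original.

    urlword:    dict handle -> {"crc": int, "origin": 0 or 1}
    urlwords:   dict handle -> stored content (originals only)
    deleted:    set of handles, stand-in for the deleted documents bitmap
    spider_log: list of handles whose keywords, links, and modification
                times would be written to the spider log
    Returns the handle of the new original, or None if nobody was promoted.
    """
    row = urlword[handle]
    if row["origin"] == 0:
        # A clone only has an entry in urlword.
        del urlword[handle]
        return None

    # Step 1: find the clones of this original.
    clones = sorted(
        h for h, r in urlword.items() if r["crc"] == row["crc"] and h != handle
    )
    new_original = None
    if clones:
        # Step 2: the clone with the smallest handle becomes the original.
        new_original = clones[0]
        urlword[new_original]["origin"] = 1
        # Step 3: copy the old original's content to the new original.
        urlwords[new_original] = urlwords[handle]
        # Step 4 (second half): log the new original's data.
        spider_log.append(new_original)
    # Step 4 (first half): mark the old original as deleted.
    deleted.add(handle)
    # Step 5: remove the old original's entries from both tables.
    del urlword[handle]
    del urlwords[handle]
    return new_original


urlword = {
    1: {"crc": 7, "origin": 1},
    2: {"crc": 7, "origin": 0},
    3: {"crc": 7, "origin": 0},
}
urlwords = {1: "content"}
deleted, spider_log = set(), []
promoted = delete_document(1, urlword, urlwords, deleted, spider_log)
```

After the call, handle 2 is the new original and holds the content that used to belong to handle 1, while handle 3 remains a clone.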

We detect clones only among documents from the same site or, in docset mode, from the same docset. Detecting clones across sites or docsets would cause erroneous results in site-restricted or category-restricted searches, because the single stored original might lie outside the restriction.
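One way to picture this restriction is to make the site (or docset) part of the key used for clone lookup. The sketch below is an assumption about how such scoping could be modelled, not a description of locust's schema; the names clone_key and find_originals, the scope identifiers, and the use of CRC-32 are all hypothetical.

```python
import zlib


def clone_key(scope_id, body):
    """Key under which clones are grouped: the scope plus the body CRC.

    scope_id is the site, or the docset when running in docset mode.
    """
    return (scope_id, zlib.crc32(body))


def find_originals(docs):
    """docs: iterable of (handle, scope_id, body) in increasing handle order.

    Returns a dict handle -> origin flag (1 for original, 0 for clone).
    """
    seen = set()
    origin = {}
    for handle, scope_id, body in docs:
        key = clone_key(scope_id, body)
        origin[handle] = 0 if key in seen else 1
        seen.add(key)
    return origin


flags = find_originals([
    (1, "a.example", b"<html>same</html>"),
    (2, "b.example", b"<html>same</html>"),
    (3, "a.example", b"<html>same</html>"),
])
```

Here handles 1 and 2 have identical bodies but belong to different sites, so both stay originals and each site keeps a searchable copy; only handle 3, a duplicate within the first site, is flagged as a clone.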