Smart Reindexing - Technical side

Gregory Kozlovsky

This document complements the White Paper "Smart Reindexing", adding technical details.

General

The following cases require special treatment during reindexing.

Lost documents

Lost documents are documents that had status 200 during one of the previous reindexings and now have an error status, for example 404. The content of these documents is not deleted immediately because the error may be temporary. Their status is marked with a lost count in the upper bits (the lost count is the number of reindexings in which the document was found to be lost), and these documents still appear in search results. Lost documents cannot be deleted, because links to them from other documents still exist. The command "deltamerge -D" must therefore remove the content of the relevant documents and clear the lost count marking, so that a lost 404 document becomes a normal 404 document.

Problem: some documents can be lost and later become orphans. Therefore, deleting them as orphans must take into account the sum of the lost and orphan periods.
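The lost count marking can be illustrated with a small sketch. The exact bit layout is not given here, so the code assumes a 32-bit status word whose low 16 bits hold the extended HTTP status and whose high 16 bits hold the lost count. The names mark_lost, lost_count, http_status and clear_lost are hypothetical and are not taken from the spider sources.

```python
# Assumed layout (not confirmed by this document):
#   bits 0..15  - extended HTTP status, e.g. 404
#   bits 16..31 - lost count
STATUS_BITS = 16
STATUS_MASK = (1 << STATUS_BITS) - 1


def lost_count(status_word: int) -> int:
    """Number of reindexings in which the document was found to be lost."""
    return status_word >> STATUS_BITS


def http_status(status_word: int) -> int:
    """Status with the lost count marking stripped."""
    return status_word & STATUS_MASK


def mark_lost(status_word: int, new_status: int) -> int:
    """Store the new error status and increment the lost count."""
    count = lost_count(status_word) + 1
    return (count << STATUS_BITS) | (new_status & STATUS_MASK)


def clear_lost(status_word: int) -> int:
    """Turn a lost 404 document into a normal 404 document."""
    return http_status(status_word)


word = mark_lost(200, 404)    # first reindexing that finds the document lost
word = mark_lost(word, 404)   # still lost at the next reindexing
print(http_status(word), lost_count(word), clear_lost(word))
```

Under these assumptions, clearing the marking is a single mask operation, which corresponds to the cleanup step described for "deltamerge -D".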

Orphans

Orphaned documents are those that were found and indexed during one of the previous reindexings, but are no longer reachable through links or are no longer in the document set because of a configuration change. Reindexing always follows links; therefore orphans are not reindexed, and consequently their last indexing time is not updated. Orphans are recognized by comparing the stored reindexing start times with the documents' last indexing times.
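The timestamp comparison can be sketched as follows. This is a minimal illustration, assuming times are stored as Unix seconds and that a single reindexing start time is compared at once. The Doc class and find_orphans function are hypothetical names; only last_index_time corresponds to an attribute named in this document.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Doc:
    url: str
    last_index_time: int  # Unix seconds; updated only when the spider reaches the document


def find_orphans(docs: Iterable[Doc], reindex_start: int) -> List[str]:
    """Return URLs whose last indexing time predates the reindexing start.

    Such documents were not reached during the reindexing that began at
    reindex_start, so they are orphan candidates.
    """
    return [d.url for d in docs if d.last_index_time < reindex_start]


docs = [
    Doc("http://example.com/a", last_index_time=1_000),  # not reached this time
    Doc("http://example.com/b", last_index_time=2_500),  # reindexed
]
print(find_orphans(docs, reindex_start=2_000))
```

A real tool would probably require a document to be missed in several consecutive reindexings before deleting it, which is why the reindexing start times are stored.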

Modified documents

The problem with a modified document is that it may have been the original document with clones, and after modification it may become a clone of another document or even the original among already existing clones.

Link handles stored in the database

If the sets.cnf and/or spider.cnf configuration files or the robots.txt file were changed between reindexings, document link handles stored in the database may become obsolete even if the document itself was not modified. This can happen because the set of linked documents is restricted by the configuration and may therefore change when the configuration changes. For this reason, during every reindexing even unchanged documents must be parsed for links and the link URLs processed. Storing link handles in the database therefore seems to serve no purpose and should be removed.

Possible changes in indexing mode

It is possible that, due to changes in the sets.cnf configuration file, the indexing mode changes for some documents while the documents themselves remain unchanged. This may require reparsing the document and writing keywords and links into the spider log (delta files).

Database structure

The database consists of the following main parts.

Table urlword contains the document's internet spidering-related attributes: its internet location (url, site), extended HTTP status and spidering attributes (last_index_time, next_index_time, crc, last_modified, etag, hops, redir, origin).

Table urlwordsXX contains the document's content-related attributes (wordcount, totalcount, content_type, charset, title, txt, docsize, keywords, description, lang, words, hrefs). An entry is created only for original documents (not clones) with HTTP status 200, and also for documents with HTTP status 406 if they contain links belonging to a docset.

The spider log is where keyword information, links, redirections and modification times are written. The spider log is written only for documents with HTTP codes 200, 301, 302 and 406.

Database modification actions

registerUrl - Create initial record for a new URL. Record internet spidering-related attributes in the table urlword.

updateInetAttrib - Update internet spidering-related attributes of the document, leaving the content unchanged. Used for updating unchanged documents.

modifyLost - Modify spidering-related attributes of a document that previously had status 200 and whose last download and processing attempt returned an error status code. The new status is marked as belonging to a "lost" document (markLost function) and the last modification time is left unchanged. The document content and reverse index data are left unchanged because the error code may have been the result of a temporary failure. Such documents can later be removed by the orphan-removing tool.

updateDocument - Completely replace the old document with the new one. If the old document had status 200 or 406 with links and the new one does not, delete the corresponding entry in the table urlwordsXX.

markDeleted - Mark old document entries in the reverse index to be deleted during merging.

Actions depending on current and previous document status

When a URL is first discovered (either as a starting page specified in the sets.cnf configuration file or as a link from a previously downloaded page), an initial record is created in the table urlword, and its handle is obtained and placed in the server queue. The initial record is created before download because a URL is typically linked from many pages, and we need to prevent multiple copies of the same URL from appearing in the server queue.

During the first indexing, after every download request the spider simply records the document status and its attributes in the table urlword. For documents with OK (200) status and documents with Redirect (301 and 302) status, an entry in the corresponding table urlwordsXX is also created. This entry contains the compressed document body (content), document fields, links and document content attributes. The document's keywords, links and modification times are written into the spider journal. The links discovered in the downloaded document body (200) or in the HTTP header Location field (301 and 302) are put into the server queue only if they have not been found before.
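The reason for registering a URL before downloading it can be shown with a small in-memory sketch. Here a dictionary stands in for the table urlword and a deque stands in for the server queue. The class UrlRegistry and its method register_url are hypothetical and only mirror the role of the registerUrl action.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class UrlRegistry:
    """In-memory stand-in for table urlword plus the server queue (illustrative only)."""

    def __init__(self) -> None:
        self._handles: Dict[str, int] = {}  # url -> handle
        self.queue: Deque[int] = deque()    # handles waiting for download

    def register_url(self, url: str) -> Tuple[int, bool]:
        """Return (handle, is_new); enqueue the handle only on first discovery."""
        handle = self._handles.get(url)
        if handle is not None:
            return handle, False            # already known: do not enqueue again
        handle = len(self._handles) + 1     # assumed handle allocation scheme
        self._handles[url] = handle
        self.queue.append(handle)
        return handle, True


registry = UrlRegistry()
registry.register_url("http://example.com/")  # found as a starting page
registry.register_url("http://example.com/")  # same URL linked from another page
print(len(registry.queue))
```

Because the record exists as soon as the URL is discovered, every later link to the same URL resolves to the existing handle and the queue holds at most one entry per URL.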

During reindexing (respidering) things get much more complicated. We consider the following cases, depending on the new extended HTTP status of the document and the status during the previous spidering.

  1. New URL that is not in the old index. Processed as during the first indexing.
  2. Status 200 or 304. The document did not change since the previous spidering (code 304, or the CRC did not change). We get its links from the database and put into the server queue those that have not been respidered yet and are not already in the server queue. Finally, we update the last index time and next index time of the document.
  3. Status 200. The document did change. Independent of its previous status, we process it as new, replace its content and attributes in the tables urlword and urlwordsXX, write the document's keywords, links and modification times into the spider journal, and process the links as before. In addition, we place the document's handle in the delete set (stored in the file markdel), so that old document keywords, links and modification times will be deleted in the reverse index during merging.
  4. Statuses 301, 302 or 406. Independent of its previous status, we replace the old document and place the document's handle in the delete set.
  5. HTTP status other than 200, 301, 302 or 406. If the previous status is 200, we do not modify the document, but put its new status into the upper 2 bytes of its status variable.
  6. Extended status belonging to HTTP header error group.
  7. Extended status belonging to the networking or DNS error group. Chances are this is a temporary failure. Do not change the document and do not change its last index time.
  8. Extended status belonging to processing error group.
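The case analysis above can be summarized as a dispatch function. This is only a sketch of one reading of the list. The mapping of cases to the actions updateInetAttrib, updateDocument, markDeleted and modifyLost is an interpretation, the function choose_action and its parameters are hypothetical, and cases for which no action is stated here are returned as "unspecified" rather than guessed.

```python
from typing import Optional


def choose_action(new_status: int,
                  prev_status: Optional[int],
                  unchanged: bool = False,
                  error_group: Optional[str] = None) -> str:
    """Pick a database action for a respidered URL.

    prev_status is None when the URL is not in the old index.
    error_group is None, "http_header", "network" or "processing"
    (assumed labels for the extended status groups).
    """
    if prev_status is None:
        return "index_as_new"                  # case 1
    if error_group == "network":
        return "leave_untouched"               # case 7: likely a temporary failure
    if error_group in ("http_header", "processing"):
        return "unspecified"                   # cases 6 and 8: no action stated
    if new_status == 304 or (new_status == 200 and unchanged):
        return "updateInetAttrib"              # case 2
    if new_status in (200, 301, 302, 406):
        return "updateDocument+markDeleted"    # cases 3 and 4
    if prev_status == 200:
        return "modifyLost"                    # case 5
    return "unspecified"                       # case 5, previous status not 200


print(choose_action(404, prev_status=200))
```

Keeping the decision in one pure function makes it easy to check each case in isolation before wiring it to the real database actions.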