This document complements the White Paper "Smart Reindexing", adding technical details.
The following cases require special treatment during reindexing.
Lost documents are documents that had status 200 during one of the previous reindexings and now have an error status, for example 404. The content of these documents is not deleted immediately, because the error may be temporary. Their status is marked with a lost count in its upper bits (the lost count is the number of reindexings during which the document was found to be lost) and these documents still appear in search results. Lost documents cannot simply be deleted, because links to them from other documents still exist. The command "deltamerge -D" must therefore be used to remove the content of the relevant documents and to remove the lost count marking, so that a lost 404 document becomes a normal 404 document.
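The lost-count marking can be sketched as follows. This is an illustrative sketch only: it assumes the extended HTTP status is stored as an integer with the real status in the low 16 bits and the lost count packed into the upper bits; the actual bit layout and function names are assumptions, not taken from the implementation.

```python
# Sketch of lost-count marking (assumed bit layout: status in the low
# 16 bits, lost count in the upper bits).

STATUS_BITS = 16
STATUS_MASK = (1 << STATUS_BITS) - 1

def mark_lost(status: int, lost_count_value: int) -> int:
    """Pack a lost count into the upper bits of an extended status."""
    return (lost_count_value << STATUS_BITS) | (status & STATUS_MASK)

def lost_count(ext_status: int) -> int:
    """How many reindexings the document has been lost for."""
    return ext_status >> STATUS_BITS

def plain_status(ext_status: int) -> int:
    """Strip the lost marking, turning a lost 404 into a normal 404."""
    return ext_status & STATUS_MASK
```

For example, a document lost for three reindexings with status 404 carries `mark_lost(404, 3)`; stripping the marking with `plain_status` recovers the plain 404.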
Problem: some documents can first become lost and only later become orphans. Therefore, deleting them as orphans must take the sum of the lost and orphan periods.
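The combined grace period can be sketched as follows. Measuring both periods in reindexing cycles, and the function shape itself, are assumptions for illustration.

```python
# Sketch: a document that first spent time as "lost" must not have that
# time counted against its orphan period, so deletion waits for the sum
# of both periods (units here are assumed to be reindexing cycles).

def ripe_for_deletion(cycles_lost: int, cycles_orphaned: int,
                      lost_period: int, orphan_period: int) -> bool:
    """Delete only after the sum of both grace periods has elapsed."""
    return cycles_lost + cycles_orphaned >= lost_period + orphan_period
```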
Orphaned documents are those that were found and indexed during one of the previous reindexings, but are no longer reachable through links, or are no longer in the document set because of a configuration change. Reindexing always follows links; therefore, orphans are not reindexed and consequently their last indexing time is not updated. Orphans are recognized by comparing the stored reindexing start times with each document's last indexing time.
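The timestamp comparison above can be sketched as follows, assuming timestamps are plain integers (e.g. Unix time); the names are illustrative, not taken from the actual schema.

```python
# Sketch of orphan detection: a document whose last indexing predates the
# start of the current reindexing run was never reached by following links
# during that run.

def is_orphan(last_index_time: int, reindexing_start_time: int) -> bool:
    return last_index_time < reindexing_start_time

def find_orphans(docs: dict, reindexing_start_time: int) -> list:
    """docs maps url -> last_index_time; returns the orphaned urls."""
    return [url for url, t in docs.items()
            if is_orphan(t, reindexing_start_time)]
```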
The problem with a modified document is that it may have been an original document with clones, and after modification it may become a clone of another document, or even the original among already existing clones.
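The bookkeeping behind this problem can be sketched as follows. The sketch assumes clone groups are keyed by a content checksum and that the first member of a group is treated as the original; both the tie-breaking rule and the helper names are assumptions for illustration, not the real implementation.

```python
# Sketch of clone-group bookkeeping: groups maps checksum -> ordered list
# of urls, where the first url is the original and the rest are clones.

from zlib import crc32

def clone_key(content: bytes) -> int:
    return crc32(content)

def reassign_after_modification(groups: dict, url: str,
                                old_key: int, new_content: bytes) -> int:
    """Move a modified document to the clone group of its new content."""
    groups[old_key].remove(url)
    if not groups[old_key]:          # last member gone: drop the group
        del groups[old_key]
    new_key = clone_key(new_content)
    groups.setdefault(new_key, []).append(url)
    return new_key
```

Note the two tricky cases the text describes: if the url was first in its old group, a former clone must be promoted to original; if the new key already exists, the modified document becomes a clone of that group's original.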
If the sets.cnf and/or spider.cnf configuration files, or the robots.txt file, were changed between reindexings, the document link handles stored in the database may become obsolete even if the document itself was not modified. This can happen because the set of linked documents is restricted by the configuration and therefore may change when the configuration changes. For this reason, during every reindexing even unchanged documents must be parsed for links, and their link URLs processed. Storing link handles in the database thus seems to serve no purpose and should be removed.
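The reason stored link handles go stale can be sketched as follows: whether a link belongs to the document set depends on the current configuration, so the filter must be re-applied on every reindexing. The pattern-matching rule here (simple prefix allow/deny lists) is an assumption standing in for the real sets.cnf/spider.cnf/robots.txt logic.

```python
# Sketch: the same document body can yield a different link set after a
# configuration change, so link filtering is re-run on every reindexing.

def in_docset(url: str, allow_prefixes, deny_prefixes) -> bool:
    if any(url.startswith(p) for p in deny_prefixes):
        return False
    return any(url.startswith(p) for p in allow_prefixes)

def filter_links(links, allow_prefixes, deny_prefixes):
    return [u for u in links if in_docset(u, allow_prefixes, deny_prefixes)]
```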
The database consists of the following main parts.
The table urlword contains the document's internet spidering-related attributes: its internet location (url, site), extended HTTP status, and spidering attributes (last_index_time, next_index_time, crc, last_modified, etag, hops, redir, origin).
The table urlwordsXX contains the document's content-related attributes (wordcount, totalcount, content_type, charset, title, txt, docsize, keywords, description, lang, words, hrefs). A document entry is created only for original documents (not clones) with HTTP status 200, and also for documents with HTTP status 406 if they contain links belonging to a docset.
The spider log, where keyword information, links, redirections and modification times are written. The spider log is written only for documents with HTTP codes 200, 301, 302 and 406.
registerUrl - Create the initial record for a new URL. Records internet spidering-related attributes in the table urlword.
updateInetAttrib - Update the internet spidering-related attributes of a document, leaving the content unchanged. Used for updating unchanged documents.
modifyLost - Modify the spidering-related attributes of a document that previously had status 200 and whose last download and processing attempt returned an error status code. The new status is marked as belonging to a "lost" document (markLost function) and the last modification time is left unchanged. The document content and reverse index data are left unchanged because the error code may be the result of a temporary failure. Such documents can later be removed by the orphan-removing tool.
updateDocument - Completely replace the old document with the new one. If the old document had status 200, or 406 with links, and the new one does not, delete the corresponding entry in the table urlwordsXX.
markDeleted - Mark old document entries in the reverse index to be deleted during merging.
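The choice among the operations above can be sketched as a dispatch on the old and new document state. This is an illustrative outline only: the predicates and the `changed`/`is_new` flags are assumptions, not the real decision logic.

```python
# Sketch: which database operation applies, given the document's old and
# new state (simplified; real logic also covers redirects, clones, etc.).

def choose_operation(old_status, new_status, changed: bool, is_new: bool) -> str:
    if is_new:
        return "registerUrl"         # first discovery: create initial record
    if old_status == 200 and new_status >= 400:
        return "modifyLost"          # keep content, mark status as lost
    if not changed:
        return "updateInetAttrib"    # refresh spidering attributes only
    return "updateDocument"          # full replacement (may drop urlwordsXX entry)
```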
When a URL is first discovered (either as a starting page specified in the sets.cnf configuration file or as a link from a previously downloaded page), an initial record is created in the table urlword, a handle is obtained, and the handle is placed in the server queue. The initial record is created before download because a URL is typically linked from many pages, and we need to prevent multiple copies of the same URL from appearing in the server queue.
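The discovery-time deduplication described above can be sketched as follows: the initial urlword record doubles as a "seen" marker, so a URL linked from many pages is queued only once. The in-memory dict here stands in for the real urlword table.

```python
# Sketch of duplicate-free URL discovery: register-before-download ensures
# a URL appears in the server queue at most once.

class Spider:
    def __init__(self):
        self.urlword = {}        # url -> handle (initial record)
        self.queue = []          # server queue of handles
        self._next_handle = 1

    def discover(self, url: str) -> int:
        """Create an initial record and enqueue the URL, unless it is
        already known; always return its handle."""
        if url in self.urlword:
            return self.urlword[url]      # already registered: do not requeue
        handle = self._next_handle
        self._next_handle += 1
        self.urlword[url] = handle
        self.queue.append(handle)
        return handle
```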
During the first indexing, after every download request the spider simply records the document status and its attributes in the table urlword. For documents with OK (200) status and documents with Redirect (301 and 302) status, an entry in the corresponding table urlwordsXX is also created. This entry contains the compressed document body (content), document fields, links and document content attributes. The document's keywords, links and modification times are written into the spider journal. The links discovered in the downloaded document body (200) or in the HTTP Location header field (301 and 302) are put into the server queue only if they have not been found before.
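The per-status bookkeeping during the first indexing can be summarized in a small table-like sketch. The status sets follow the description above (urlwordsXX entries for 200/301/302, the spider log for 200/301/302/406); the function shape itself is an assumption for illustration.

```python
# Sketch: what the spider records for a given HTTP status during the
# first indexing (simplified to the statuses discussed in the text).

JOURNALED = {200, 301, 302, 406}      # statuses written to the spider log

def first_index_actions(status: int) -> dict:
    return {
        "record_urlword": True,                              # always
        "create_urlwords_entry": status in (200, 301, 302),
        "write_journal": status in JOURNALED,
        # links come from the body (200) or the Location header (301/302)
        "extract_links": status in (200, 301, 302),
    }
```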
During reindexing (respidering), things become considerably more complicated. We consider the following cases, depending on the document's new extended HTTP status and its status during the previous spidering.