DocId (sequence number)

The sequence number (called DocId below) is an important concept in Lucene. Just as a database uniquely identifies a row by its primary key, a Lucene Index uniquely identifies a Doc by its DocId. However, the following points should be kept in mind:

  • A DocId is unique within a Segment, not across the whole Index. Lucene does this mainly to optimize writing and compression. To identify a Doc uniquely at the Index level, Lucene relies on the Segments being ordered: the DocIds of each Segment are offset by the number of Docs in the Segments before it. As a simple example, if an Index has two Segments of 100 Docs each, the DocIds inside each Segment are 0-99, and at the Index level the DocIds of the second Segment are converted to the range 100-199.
  • Within a Segment, DocIds are assigned in increasing order starting from zero. This does not mean that the DocIds in use are contiguous: when a Doc is deleted, it leaves a gap.
  • The DocId corresponding to a document can change, usually when Segments are merged.
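
The three points above can be sketched in a few lines of Python. This is only an illustrative model, not Lucene's code: the function names (doc_bases, to_global, to_local, merge) are made up for this example, and a deleted Doc is simply represented as None.

```python
from bisect import bisect_right
from itertools import accumulate

def doc_bases(segment_sizes):
    """Starting Index-level DocId of each Segment (hypothetical helper)."""
    # The first Segment starts at 0; each later one starts where the previous ended.
    return [0] + list(accumulate(segment_sizes))[:-1]

def to_global(bases, segment, local_id):
    """Convert a Segment-level DocId to an Index-level DocId."""
    return bases[segment] + local_id

def to_local(bases, global_id):
    """Convert an Index-level DocId back to (segment, Segment-level DocId)."""
    # Pick the last Segment whose base is <= global_id.
    segment = bisect_right(bases, global_id) - 1
    return segment, global_id - bases[segment]

def merge(segments):
    """Merge Segments into one, dropping deleted Docs (None).

    Returns the merged Segment and a map from the old
    (segment, local DocId) to the new local DocId.
    """
    merged, remap = [], {}
    for seg_no, docs in enumerate(segments):
        for local_id, doc in enumerate(docs):
            if doc is not None:
                remap[(seg_no, local_id)] = len(merged)
                merged.append(doc)
    return merged, remap

bases = doc_bases([100, 100])
print(bases)                   # [0, 100]
print(to_global(bases, 1, 0))  # 100
print(to_local(bases, 199))    # (1, 99)

# The second Doc of the first Segment was deleted, leaving a gap at DocId 1.
merged, remap = merge([["a", None, "b"], ["c"]])
print(merged)                  # ['a', 'b', 'c']
print(remap[(0, 2)])           # 1: the DocId of "b" changed from 2 to 1
```

In this model the gap left by the deleted Doc disappears only at merge time, which is also the moment the surviving Docs receive new DocIds.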

The inverted index, the very core of Lucene, is essentially a mapping from each Term to the list of DocIds of the documents containing that Term. A search inside Lucene is therefore a two-phase query: the first phase lists the DocIds of the documents that contain the given Term, and the second phase fetches each Doc by its DocId. Lucene provides functionality to search by Term as well as to look up a Doc by its DocId.
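
The two-phase query can be illustrated with a toy in-memory index. This is a minimal sketch under simplifying assumptions (whitespace tokenization, a single Segment, a plain dict instead of Lucene's on-disk structures); build_index and search are hypothetical names, not Lucene APIs.

```python
from collections import defaultdict

def build_index(docs):
    """Map each Term to the sorted list of DocIds of the Docs containing it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):  # the DocId is the position in the list
        for term in set(text.split()):
            postings[term].append(doc_id)
    return dict(postings)

def search(postings, stored_docs, term):
    """Two-phase query: Term -> DocIds, then DocId -> Doc."""
    doc_ids = postings.get(term, [])                      # phase 1
    return [(doc_id, stored_docs[doc_id]) for doc_id in doc_ids]  # phase 2

stored_docs = [
    "lucene stores docs in segments",
    "a segment is immutable",
    "lucene merges segments",
]
postings = build_index(stored_docs)

print(postings["lucene"])  # [0, 2]
print(search(postings, stored_docs, "segment"))
# [(1, 'a segment is immutable')]
```

Because Docs are visited in increasing DocId order, every posting list comes out sorted, which is the property the compression techniques mentioned later rely on.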

A DocId is essentially an int32 value starting from 0. This is an important design choice that pays off in both data compression and query efficiency: it enables optimizations such as delta encoding and ZigZag encoding for data compression, and the SkipList used in inverted index lists. These are described in detail later on.
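
To give a feel for why small, sorted integer DocIds compress well, here is a minimal sketch of delta encoding combined with a variable-length byte encoding, plus a ZigZag helper for signed values. It illustrates the general techniques only and is not Lucene's actual file format; all function names are assumptions made for this example, and the SkipList is not shown.

```python
def delta_encode(doc_ids):
    """Replace each sorted DocId by its gap from the previous one."""
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    """Rebuild the DocIds by summing the gaps."""
    total, doc_ids = 0, []
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def zigzag_encode(n):
    """Map a signed 32-bit int to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)

def varint_encode(n):
    """Encode a non-negative int in 7-bit groups, low group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)

doc_ids = [1000, 1003, 1010, 1300]
gaps = delta_encode(doc_ids)
encoded = b"".join(varint_encode(g) for g in gaps)

print(gaps)          # [1000, 3, 7, 290]
print(len(encoded))  # 6 bytes, versus 16 bytes for four raw int32 values
```

The gaps in a sorted posting list are never negative, so ZigZag is not needed for them; it becomes useful when a value that can be negative has to go through the same variable-length encoding.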

