


classDiagram
    class IndexWriter {
        +writes files
        +uses Directory
    }
    class IndexSearcher {
        +reads from Index
    }
    class IndexReader {
        +reads from Directory
    }
    class Directory {
        +stores index files
    }
    class Index {
        +made up of Segments
    }
    class Segment {
        +sub-index of Index
        +contains Documents
    }
    class Document {
        +contains Fields
    }
    class Field {
        +made up of Terms
    }
    class Term {
        +smallest search unit
    }
    IndexWriter "1" --> "1" Directory : writes files to
    IndexReader "1" --> "1" Directory : reads files from
    IndexSearcher "1" --> "1" IndexReader : uses
    Directory "1" --> "1" Index : stores files for
    Index "1" --> "*" Segment : consists of
    Segment "1" --> "*" Document : contains
    Document "1" --> "*" Field : made up of
    Field "1" --> "*" Term : consists of

A Document is similar to a row in a relational database or a document in a document database, and an Index can contain multiple Documents. Each Document written into an Index is assigned a unique ID, the DocId (sequence number).



Sequence Number (called DocId below) is an important concept in Lucene. A database uniquely identifies a row using a primary key while Lucene's Index uniquely identifies a Doc by its DocId. However, the following points should be kept in mind:

  • A DocId is not actually unique across the Index; it is unique only within a Segment. Lucene does this mainly to optimize writing and compression. Since a DocId is unique only within a Segment, how can a Doc be uniquely identified at the Index level? The solution is simple: the Segments are ordered. As a simple example, suppose an Index has two Segments with 100 Docs each. Within each Segment the DocIds are 0-99, but when they are converted to the Index level, the DocIds of the second Segment are shifted to the range 100-199.
  • Within a Segment, DocIds are assigned in increasing order starting from zero. This does not mean the live DocIds are contiguous: when a Doc is deleted, it leaves a gap.
  • The DocId corresponding to a document can change, usually when Segments are merged.
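
The conversion between Segment-level and Index-level DocIds can be illustrated with a small sketch. This is not Lucene code: the helper names `doc_bases`, `to_global` and `to_local` are made up for illustration, and the sketch assumes only that Segments are ordered and that each one starts numbering at zero.

```python
from bisect import bisect_right

def doc_bases(segment_sizes):
    """Return the first index-level DocId (the "base") of each segment."""
    bases, total = [], 0
    for size in segment_sizes:
        bases.append(total)
        total += size
    return bases

def to_global(bases, segment_index, local_doc_id):
    """Segment-local DocId -> index-level DocId."""
    return bases[segment_index] + local_doc_id

def to_local(bases, global_doc_id):
    """Index-level DocId -> (segment index, segment-local DocId)."""
    # The owning segment is the last one whose base is <= the global DocId.
    segment_index = bisect_right(bases, global_doc_id) - 1
    return segment_index, global_doc_id - bases[segment_index]

# Two segments with 100 docs each: local DocIds are 0..99 in both.
bases = doc_bases([100, 100])
print(bases)                    # [0, 100]
print(to_global(bases, 1, 0))   # 100
print(to_local(bases, 199))     # (1, 99)
```

Because the bases depend on how many Docs each earlier Segment holds, merging Segments changes them, which is one way to see why a document's DocId is not stable.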

The inverted index, the very core of Lucene, is essentially a mapping from each Term to the list of DocIds of the documents containing that Term. So when Lucene searches, it internally performs a two-phase query: the first phase looks up the list of DocIds that contain the given Term, and the second phase fetches each Doc by its DocId. Lucene provides functionality to search by Term as well as to look up a document by DocId.
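
A toy version of the two-phase lookup might look like the following. It is a minimal sketch that assumes whitespace tokenization and an in-memory list of documents; `build_inverted_index` and `search` are illustrative names, not Lucene APIs.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the ascending list of DocIds that contain it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a term repeated in one doc is recorded only once.
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    return postings

def search(postings, docs, term):
    # Phase 1: term -> list of DocIds.
    doc_ids = postings.get(term, [])
    # Phase 2: DocId -> document.
    return [(doc_id, docs[doc_id]) for doc_id in doc_ids]

docs = [
    "lucene builds an inverted index",
    "a segment is a sub-index",
    "lucene merges segments",
]
postings = build_inverted_index(docs)
print(search(postings, docs, "lucene"))
```

Because documents are visited in DocId order, each postings list comes out sorted, which is what makes the compression tricks mentioned later applicable.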

A DocId is basically an int32 value starting from 0. This is an important optimization that benefits both data compression and query efficiency: it enables techniques such as delta encoding for compression, ZigZag encoding, and the skip lists used in inverted index lists. These are described in detail later on.
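
To make the compression idea concrete, here is a rough sketch of delta encoding combined with a variable-length byte encoding, plus ZigZag encoding for signed values. It only illustrates the general techniques; it is not Lucene's actual on-disk format, and all function names are invented for this example.

```python
def delta_encode(sorted_doc_ids):
    """Store the gap to the previous DocId instead of the DocId itself."""
    prev, gaps = 0, []
    for doc_id in sorted_doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def varint_encode(values):
    """7 payload bits per byte; the high bit means "more bytes follow"."""
    out = bytearray()
    for value in values:
        while value >= 0x80:
            out.append((value & 0x7F) | 0x80)
            value >>= 7
        out.append(value)
    return bytes(out)

def varint_decode(data):
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values

def zigzag_encode(n):
    """Map a signed 32-bit int to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)

doc_ids = [3, 7, 8, 300, 1000]
gaps = delta_encode(doc_ids)       # [3, 4, 1, 292, 700]
encoded = varint_encode(gaps)      # 7 bytes instead of 5 * 4 = 20
assert delta_decode(varint_decode(encoded)) == doc_ids
```

Small non-negative integers are what make this work: sorted DocIds yield small gaps, and small gaps fit in one or two bytes.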





A Document is made up of one or more Fields. A Field is the smallest definable unit of indexed data in Lucene. Lucene provides many different types of Field, including StringField, TextField and NumericDocValuesField. Depending on the type of Field (FieldType), Lucene determines which kind of index to build for the data: inverted index, stored fields, DocValues, or N-dimensional points.

Types of Fields


The figure shows the basic relationship between the different types of Field in Lucene.



A Term is the smallest unit of indexing and search in Lucene. A Field consists of one or more Terms. Terms are produced when a Field's value is passed through an Analyzer (tokenizer). The term dictionary is the basic index used to perform conditional searches on Terms.
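
As a rough illustration of how an Analyzer turns a Field value into Terms, consider the sketch below. The lowercase-and-split rule is an assumption chosen for simplicity (real Analyzers are configurable and do much more), and `analyze` and `term_dictionary` are hypothetical helpers, not Lucene APIs.

```python
import re

def analyze(field_name, text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters.

    Each term is modelled as a (field name, token) pair.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(field_name, token) for token in tokens]

def term_dictionary(documents):
    """Collect the sorted set of distinct terms across all documents."""
    terms = set()
    for doc in documents:
        for field_name, text in doc.items():
            terms.update(analyze(field_name, text))
    return sorted(terms)

documents = [
    {"title": "Lucene in Action", "body": "Segments are immutable."},
    {"title": "Inside Lucene", "body": "Segments are merged."},
]
print(analyze("title", "Lucene in Action"))
print(term_dictionary(documents))
```

Note that the same token in two different fields produces two different terms, since the field name is part of the term.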



An Index is conceptually similar to a database table, but differs in some important ways. A table in a traditional relational database or NoSQL database needs at least its schema defined when it is created, with clear constraints on the primary key, columns and the like. A Lucene index has no such constraints. It can be understood as a document folder: you can put a new document into the folder or take one out, but to modify a document you must first take it out, modify it, and then put it back. You can put all sorts of documents in the folder, and Lucene can index a document whatever its content.

Composed of Segments

An Index is composed of one or more sub-indexes. A sub-index is called a Segment.




Segment (Sub-index)



  • #immutable: "Segments are immutable; updates and deletions may only create new segments and do not modify existing ones." - ref-apache-doc


An Index is composed of one or more sub-indexes, each called a Segment. The conceptual design of Segments in Lucene is similar to an LSM (log-structured merge) tree, with some differences: Segments inherit the write-path advantages of LSM, but provide only near real-time queries rather than real-time ones.

When Lucene writes data, it first writes to an in-memory buffer (similar to the MemTable in LSM, but not readable). When the data in the buffer reaches a certain size, it is flushed to become a Segment. Every Segment has its own independent index and is independently searchable, but its data can never be changed. This scheme avoids random writes: data is written in batches or appended, which achieves high throughput.

Documents written to a Segment cannot be modified, but they can be deleted. Deletion does not touch the document in its original location; instead, the DocId of the deleted document is recorded in a separate file, so that the data files never need to be modified. An Index query therefore has to query multiple Segments, merge the results, and filter out deleted documents. To optimize queries, Lucene has a policy for merging multiple Segments, which is similar to LSM's merging of SSTables.
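
The write path described above can be modelled in a few dozen lines. The sketch below is a toy model, not Lucene's implementation: `ToyIndex`, `Segment`, and the `buffer_limit` parameter are invented names, the flush trigger is a simple document count, and a per-segment `deleted` set stands in for the separate file that records deleted DocIds.

```python
class Segment:
    """A batch of documents whose data and index never change after creation."""

    def __init__(self, docs):
        self.docs = tuple(docs)
        self.postings = {}
        for doc_id, text in enumerate(self.docs):
            for term in set(text.lower().split()):
                self.postings.setdefault(term, []).append(doc_id)
        # Stand-in for the separate deletions file: docs and postings
        # stay untouched, only this side structure grows.
        self.deleted = set()


class ToyIndex:
    def __init__(self, buffer_limit=2):
        self.buffer = []            # in-memory, not searchable
        self.buffer_limit = buffer_limit
        self.segments = []

    def add(self, text):
        self.buffer.append(text)
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        if self.buffer:
            self.segments.append(Segment(self.buffer))
            self.buffer = []

    def delete(self, term):
        """Mark every flushed doc containing `term` as deleted."""
        for seg in self.segments:
            seg.deleted.update(seg.postings.get(term, []))

    def search(self, term):
        """Query every segment, skip deleted docs, and merge the hits."""
        hits = []
        for seg in self.segments:
            for doc_id in seg.postings.get(term, []):
                if doc_id not in seg.deleted:
                    hits.append(seg.docs[doc_id])
        return hits

    def merge(self):
        """Rewrite all live docs into one new segment, dropping deleted ones."""
        live = [doc for seg in self.segments
                for doc_id, doc in enumerate(seg.docs)
                if doc_id not in seg.deleted]
        self.segments = [Segment(live)] if live else []


index = ToyIndex(buffer_limit=2)
index.add("lucene writes segments")
index.add("segments are immutable")     # buffer full: flushed as a segment
index.add("lucene merges segments")     # still buffered, not yet searchable
before_flush = index.search("lucene")
index.flush()
after_flush = index.search("lucene")
index.delete("immutable")
index.merge()
```

In this model `before_flush` contains only the first document, because the third is still sitting in the buffer; after the merge, the deleted document is physically gone and the surviving documents are renumbered within the new segment.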

Before a Segment is flushed or committed, its data is held in memory and is unsearchable. This is another reason why Lucene is said to provide near real-time rather than real-time queries. After reading the code, I found that it is not impossible to make data searchable as soon as it is written, just complicated to implement. The reason is that searching in Lucene relies on index structures (for example, the inverted index relies on the Term Dictionary), and these structures are built during a Segment flush, not in real time as data arrives. The purpose of this is to build the most efficient index possible. Lucene could of course introduce another mechanism that builds an index in real time as data is written, but such a mechanism would differ from the current in-Segment index: it would need an extra index at write time and an extra query mechanism, adding a fair amount of complexity.

How to Answer Questions Regarding Segment

