Index

This is similar conceptually to a database table, but differs in some important ways. The table in a traditional relational database or NoSQL database needs to at least have its scheme defined when it is created. There are some clear constraints on the definition of the primary key, columns and the like. However, there are no such constraints on a Lucene index. A Lucene index can be understood as a document folder. You can put a new document into the folder or you can take one out, but if you want to modify one of the documents, you must first take it out, modify it and then put it back in the folder. You can put all sorts of documents in the folder, and Lucene can index the document whatever the content.

Composed of Segments

An index is composed of one or more sub-indexes. A sub-index is called Segment.

Segment

Relationships

Metadata

  • #immutable: "Segments are immutable; updates and deletions may only create new segments and do not modify existing ones." - ref-apache-doc

Description

An Index is composed of one or more sub-indexes. A sub-index is called a segment. The conceptual design of Segments in Lucene is similar to LSM but there are some differences. They inherit the data writing advantages of LSM but only provide near real-time and not real-time queries.

When Lucene writes data it first writes to an In-Memory Buffer (similar to MemTable in LSM, but not readable). When the data in the Buffer reaches a certain amount, it will be flushed to become a Segment. Every segment has its own independent index and are independently searchable, but the data can never be changed. This scheme prevents random writes. Data is written as Batch or as an Append and achieves a high throughput. The documents written in the Segment cannot be modified, but they can be deleted. The deletion method does not change the file in its original, internal location, but the DocID of the document to be deleted is saved by another file to ensure that the data file cannot be modified. Index queries need to query multiple Segments and merge the results, as well as handling deleted documents. In order to optimize queries, Lucene has a policy to merge multiple segments and in this regard is similar to LSM's Merge of SSTable.

Before segments are flushed or committed, data is stored in memory and is unsearchable. This is another reason why Lucene is said to provide near-real time and not real-time queries. After reading the code, I found that it's not that it can't implement data writing that's searchable, just that it's complicated to implement. The reason for this is the searching of data in Lucene relies on the index setup (for example, an inverted index relies on Term Dictionary), and the data index in Lucene is set up during a Segment flush and not in real-time. The purpose of this is to set up the most efficient index. Naturally, it can introduce another index mechanism that sets up the index in real time as it is written, but the implementation of such a mechanism would differ from the current index in the segment, needing the introduction of an extra index at write time and an extra query mechanism, involving some degree of complexity.

How to Answer Questions Regarding Segment

Relationships


Backlinks