Elasticsearch Indexing and Retrieval Algorithm 📑

Elasticsearch can perform search operations with low latency because instead of searching documents it searches in an inverted index 📑. If you want to know more about indexes and the algorithm that elasticsearch follows, then this article is for you 💥

Inverted Index

An inverted index is a sorted list of all unique words that appear in any document and each word has a list of documents in which it appears. An inverted index is created using Analyzer. Analyzer is a process that wraps 3 functions that instruct elasticsearch how to index and search documents.

Character Filter is used to preprocess text by adding, removing or changing characters before it is passed to the tokenizer.
Tokenizer breaks a text into words.
Token Filters is used to add, remove or change tokens. Some of the common token filters are lowercase converts all tokens into lowercase, stop removes common words, synonym introduces synonyms into the token stream.

Example of an inverted index:
Doc1: Show me the code
Doc2: Write the code

The Inverted Index for the above two documents would look like:

code-> [Doc1, Doc2]
me-> [Doc1]
show-> [Doc1]
the-> [Doc1, Doc2]
write-> [Doc2]

Retrieval Algorithm

Elasticsearch generates a relevance_score for each document based on the user query. Documents are ordered based on the relevance_score so that documents which high scores are ranked highly.

Term Frequency (TF) is the frequency of a term in a given document.
Document Frequency (DF) is the frequency of a term in all documents.
Inverse Document Frequency (IDF) is the inverse of Document Frequency.
The formula to calculate relevance_score is

relevance_score = TF * IDF

Let's take an example:

Doc1: Show me the code
Doc2: Write the code and keep coding
Let the search query be to search for the term: code

Term Frequency for Doc1 (TF1) = 1
Term Frequency for Doc1 (TF1) = 2
Document Frequency (DF) = 3
Inverse Document Frequency (IDF) = 1/DF = 1/3
Relevance score for Doc1 (Rev1) = TF1 * IDF = 1/3
Relevance score for Doc2 (Rev2) = TF2 * IDF = 2/3
Since Rev2 > Rev1, Doc2 will be ranked higher than Doc1

Conclusion

This article was an introduction to the scoring algorithm and I hope the examples in this blog clarifies how the scoring works in Elasticsearch. If you want to know more about how scoring works in Elasticsearch then visit the link 👉 Link

Elasticsearch Indexing and Retrieval Algorithm 📑

A guide on Elasticsearch Scoring and Relevancy

Inverted Index

Retrieval Algorithm

Conclusion