Elasticsearch Indexing and Retrieval Algorithm ๐
A guide on Elasticsearch Scoring and Relevancy
Elasticsearch can perform search operations with low latency because instead of searching documents it searches in an inverted index ๐. If you want to know more about indexes and the algorithm that elasticsearch follows, then this article is for you ๐ฅ
Inverted Index
An inverted index is a sorted list of all unique words that appear in any document and each word has a list of documents in which it appears. An inverted index is created using Analyzer. Analyzer is a process that wraps 3 functions that instruct elasticsearch how to index and search documents.
- Character Filter is used to preprocess text by adding, removing or changing characters before it is passed to the tokenizer.
- Tokenizer breaks a text into words.
- Token Filters is used to add, remove or change tokens. Some of the common token filters are
lowercase
converts all tokens into lowercase,stop
removes common words,synonym
introduces synonyms into the token stream.
Example of an inverted index:
Doc1: Show me the code
Doc2: Write the code
The Inverted Index for the above two documents would look like:
code-> [Doc1, Doc2]
me-> [Doc1]
show-> [Doc1]
the-> [Doc1, Doc2]
write-> [Doc2]
Retrieval Algorithm
Elasticsearch generates a relevance_score
for each document based on the user query. Documents are ordered based on the relevance_score
so that documents which high scores are ranked highly.
Term Frequency (TF) is the frequency of a term in a given document.
Document Frequency (DF) is the frequency of a term in all documents.
Inverse Document Frequency (IDF) is the inverse of Document Frequency.
The formula to calculate relevance_score
is
relevance_score = TF * IDF
Let's take an example:
Doc1: Show me the code
Doc2: Write the code and keep coding
Let the search query be to search for the term: code
Term Frequency for Doc1 (TF1) = 1
Term Frequency for Doc1 (TF1) = 2
Document Frequency (DF) = 3
Inverse Document Frequency (IDF) = 1/DF = 1/3
Relevance score for Doc1
(Rev1) = TF1 * IDF = 1/3
Relevance score for Doc2
(Rev2) = TF2 * IDF = 2/3
Since Rev2 > Rev1, Doc2
will be ranked higher than Doc1
Conclusion
This article was an introduction to the scoring algorithm and I hope the examples in this blog clarifies how the scoring works in Elasticsearch. If you want to know more about how scoring works in Elasticsearch then visit the link ๐ Link