Full-text searching
Back in the days when full-text searching was a term known only to a small percentage of engineers, most of us used SQL databases to perform search operations. This works, at least to some extent, but the deeper you go, the more you run into the limits of such an approach: lack of scalability, limited flexibility, and no language analysis (although there were extensions that added full-text searching to SQL databases). These limitations are why Apache Lucene (http://lucene.apache.org) was created: to provide a library with full-text search capabilities. It is very fast, scalable, and provides analysis capabilities for different languages.
The Lucene glossary and architecture
Before going into the details of the analysis process, we would like to introduce you to the Apache Lucene glossary and overall architecture. The basic concepts of the library are as follows (a minimal sketch of these concepts follows the list):
- Document: This is the main data carrier used during indexing and searching, comprising one or more fields that contain the data we put into and get from Lucene.
- Field: This is a section of the document that is built of two parts: the name and the value.
- Term: This is a unit of search representing a word from the text.
- Token: This is an occurrence of a term in the text of the field. It consists of the term text, start and end offsets, and a type.
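To make these concepts a bit more tangible, here is a minimal sketch of how a document, its fields, and the tokens produced from a field's text could be modeled. This is plain Python for illustration only, not Lucene's actual Java classes, and all names are ours:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    """An occurrence of a term in the text of a field."""
    term: str           # the term text
    start_offset: int   # where the occurrence starts in the original text
    end_offset: int     # where the occurrence ends
    type: str = "word"  # the token type

@dataclass
class Field:
    """A named section of a document: a name and a value."""
    name: str
    value: str

@dataclass
class Document:
    """The main data carrier: one or more fields."""
    fields: List[Field]

doc = Document(fields=[Field("title", "Elasticsearch Server 1.0")])
print(doc.fields[0].name, "->", doc.fields[0].value)
```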
Apache Lucene writes all the information to a structure called the inverted index. It is a data structure that maps the terms in the index to the documents, and not the other way around as a relational database does in its tables. You can think of an inverted index as a data structure where data is term-oriented rather than document-oriented. Let's see how a simple inverted index looks. For example, let's assume that we have documents with only a title field to be indexed, and they look as follows:
- Elasticsearch Server 1.0 (document 1)
- Mastering Elasticsearch (document 2)
- Apache Solr 4 Cookbook (document 3)
So, the index (in a very simplified way) can be visualized as follows:
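A simplified sketch of such an index can be built with a few lines of Python; this is a conceptual model only (real Lucene index files store much more per term and per posting):

```python
from collections import defaultdict

docs = {
    1: "Elasticsearch Server 1.0",
    2: "Mastering Elasticsearch",
    3: "Apache Solr 4 Cookbook",
}

# term -> {document id: number of occurrences of the term in that document}
inverted_index = defaultdict(dict)

for doc_id, title in docs.items():
    for term in title.lower().split():  # naive analysis: whitespace split + lowercase
        inverted_index[term][doc_id] = inverted_index[term].get(doc_id, 0) + 1

for term in sorted(inverted_index):
    print(term, dict(inverted_index[term]))
# apache {3: 1}, cookbook {3: 1}, elasticsearch {1: 1, 2: 1}, mastering {2: 1}, ...
```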
Each term points to the documents it is present in. This allows very efficient and fast searching, such as term-based queries. In addition to this, each term has a count associated with it, telling Lucene how often the term occurs.
Of course, the actual index created by Lucene is much more complicated and advanced because of additional files that include information such as term vectors, doc values, and so on. However, all you need to know for now is how the data is organized, not exactly what is stored.
Each index is divided into multiple segments that are written once and read many times. Once a segment is written to disk during indexing, it can't be updated. Therefore, the information on deleted documents is stored in a separate file, but the segment itself is not updated.
However, multiple segments can be merged together through a process called segments merge. After forcing the segments to merge, or after Lucene decides that it is time to perform merging, the segments are merged by Lucene to create larger ones. This can be I/O-demanding; however, it is needed to clean up information that is no longer required (for example, the deleted documents). In addition to this, searching with one large segment is faster than searching with multiple smaller ones holding the same data. That's because, in general, to search means to just match the query terms to the ones that are indexed. You can imagine how searching through multiple small segments and merging those partial results is slower than having a single segment prepare the results, as sketched below.
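The following sketch, under the simplifying assumption that a segment is just an immutable set of documents plus a side list of deleted document IDs, illustrates why deletes only mark documents and why a merge is what physically removes them:

```python
class Segment:
    """A write-once segment: its documents are never modified in place."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> document, frozen after creation
        self.deleted = set()     # deletions are only recorded on the side

    def mark_deleted(self, doc_id):
        self.deleted.add(doc_id)  # the segment data itself stays untouched

    def live_docs(self):
        return {d: v for d, v in self.docs.items() if d not in self.deleted}

def merge(segments):
    """Merge several segments into one larger segment, dropping deleted docs."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return Segment(merged)

seg1 = Segment({1: "Elasticsearch Server 1.0", 2: "Mastering Elasticsearch"})
seg2 = Segment({3: "Apache Solr 4 Cookbook"})
seg1.mark_deleted(2)          # document 2 is only marked as deleted
big = merge([seg1, seg2])     # after the merge it is physically gone
print(sorted(big.docs))       # [1, 3]
```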
Input data analysis
Of course, the question arises: how is the data passed in the documents transformed into the inverted index, and how is the query text changed into terms to allow searching? The process of transforming this data is called analysis. You may want some of your fields to be processed by a language analyzer so that words such as car and cars are treated as the same in your index. On the other hand, you may want other fields to be only split on whitespace or only lowercased.
Analysis is done by the analyzer, which is built of a tokenizer and zero or more token filters, and it can also have zero or more character mappers.
A tokenizer in Lucene is used to split the text into tokens, which are basically terms with additional information, such as their position in the original text and their length. The result of the tokenizer's work is called a token stream, where the tokens are put one by one and are ready to be processed by the filters.
Apart from the tokenizer, the Lucene analyzer is built of zero or more token filters that are used to process tokens in the token stream. Some examples of filters are as follows:
- Lowercase filter: This makes all the tokens lowercase
- Synonyms filter: This is responsible for changing one token to another on the basis of synonym rules
- Multiple language stemming filters: These are responsible for reducing tokens (actually, the text part that they provide) into their root or base forms, the stem
Filters are processed one after another, so we have almost unlimited analysis possibilities by adding multiple filters to the chain.
Finally, the character mappers operate on non-analyzed text; they are used before the tokenizer. Therefore, we can easily remove HTML tags from whole parts of text without worrying about tokenization.
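To see how the pieces fit together, here is a minimal sketch of an analyzer built from a character mapper that strips HTML-like tags, a whitespace tokenizer, and a lowercase token filter. It is a rough approximation of how the components cooperate, not Lucene's actual API, and the function names are ours:

```python
import re

def html_strip_char_mapper(text):
    """Character mapper: runs before tokenization and removes HTML-like tags."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer: splits the text into tokens with start/end offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def lowercase_filter(tokens):
    """Token filter: lowercases every token in the token stream."""
    return [(term.lower(), start, end) for term, start, end in tokens]

def analyze(text, char_mappers, tokenizer, token_filters):
    for mapper in char_mappers:          # character mappers first
        text = mapper(text)
    tokens = tokenizer(text)             # then the tokenizer
    for token_filter in token_filters:   # then the token filters, in order
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>Mastering</b> Elasticsearch",
              [html_strip_char_mapper], whitespace_tokenizer, [lowercase_filter]))
# [('mastering', 1, 10), ('elasticsearch', 12, 25)]  (offsets refer to the mapped text)
```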
We may wonder how all the preceding functionalities affect indexing and querying when using Lucene and all the software that is built on top of it. During indexing, Lucene will use an analyzer of your choice to process the contents of your document; of course, different analyzers can be used for different fields, so the name field of your document can be analyzed differently compared to the summary field. Fields may not be analyzed at all, if we want.
During a query, your query text will be analyzed. However, you can also choose not to analyze your queries. This is crucial to remember because some of the Elasticsearch queries are analyzed and some are not. For example, the prefix and term queries are not analyzed, while the match query is analyzed. Having the possibility to choose between analyzed and non-analyzed queries is very useful; sometimes, you may want to query a field that is not analyzed, while at other times you may want a full-text search analysis. For example, if we search for the LightRed term and the query is analyzed by the standard analyzer, the term that will actually be searched for is lightred, because the standard analyzer lowercases its input (it does not split a word on case changes). If we use a query type that is not analyzed, we will explicitly search for the LightRed term.
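As an illustration, the following sketch contrasts an analyzed query with a literal, non-analyzed one against a toy set of indexed terms. The simple lowercase "analyzer" used here only stands in for the standard analyzer for the purpose of the example, and the field contents are invented:

```python
# terms that ended up in the index for a hypothetical "color" field
indexed_terms = {"lightred": [1], "darkblue": [2]}

def analyze_query(text):
    """Toy query analysis: lowercase the input (no splitting on case changes)."""
    return [text.lower()]

query = "LightRed"

# match-like (analyzed) query: the input is run through the analyzer first
print([indexed_terms.get(term, []) for term in analyze_query(query)])  # [[1]]

# term-like (non-analyzed) query: the literal input is looked up as-is
print(indexed_terms.get(query, []))                                    # []
```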
What you should remember about indexing and querying analysis is that the terms produced at index time must match the terms produced at query time. If they don't match, Lucene won't return the desired documents. For example, if you are using stemming and lowercasing during indexing, you need to ensure that the terms in the query are also lowercased and stemmed, or your queries won't return any results at all. It is also important to keep the token filters in the same order during indexing and query-time analysis so that the terms resulting from such an analysis are the same.
Scoring and query relevance
There is one additional thing we haven't mentioned until now: scoring. What is the score of a document? The score is the result of a scoring formula that describes how well the document matches the query. By default, Apache Lucene uses the TF/IDF (term frequency/inverse document frequency) scoring mechanism, an algorithm that calculates how relevant the document is in the context of our query. Of course, it is not the only algorithm available, and we will mention other algorithms in the Mappings configuration section of Chapter 2, Indexing Your Data.
Note
If you want to read more about the Apache Lucene TF/IDF scoring formula, please visit the Apache Lucene Javadocs for the TFIDFSimilarity class, available at http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html.
Remember, though, that the higher the score value calculated by Elasticsearch and Lucene, the more relevant the document is. The score calculation is affected by parameters such as boost, by different query types (we will discuss these query types in the Basic queries section of Chapter 3, Searching Your Data), and by using different scoring algorithms.
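As a rough illustration of the underlying idea only, the following sketch computes a classic TF-IDF weight for a term in a document. Lucene's practical scoring formula layers normalizations and boosts on top of this, so treat the numbers as conceptual, and note that the example documents are invented:

```python
import math

def tf_idf(term, doc_terms, all_docs_terms):
    """Classic TF-IDF: term frequency in the document times inverse document frequency."""
    tf = doc_terms.count(term)
    docs_with_term = sum(1 for d in all_docs_terms if term in d)
    # assumes the term occurs in at least one document
    idf = math.log(len(all_docs_terms) / docs_with_term)
    return tf * idf

docs = [
    ["mastering", "elasticsearch"],
    ["elasticsearch", "server"],
    ["apache", "solr", "cookbook"],
]

# "solr" is rarer than "elasticsearch", so it receives a higher weight (~1.10 vs ~0.41)
print(tf_idf("elasticsearch", docs[0], docs))
print(tf_idf("solr", docs[2], docs))
```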