Searching with the URI request query
Before going into the details of Elasticsearch querying, we will search using a simple URI request. Of course, we will extend our search knowledge in Chapter 3, Searching Your Data, but for now, we will stick to the simplest approach.
Sample data
For the purpose of this section of the book, we will create a simple index with two document types. To do this, we will run the following commands:
curl -XPOST 'localhost:9200/books/es/1' -d '{"title":"Elasticsearch Server", "published": 2013}'
curl -XPOST 'localhost:9200/books/es/2' -d '{"title":"Mastering Elasticsearch", "published": 2013}'
curl -XPOST 'localhost:9200/books/solr/1' -d '{"title":"Apache Solr 4 Cookbook", "published": 2012}'
Running the preceding commands will create the books index with two types: es and solr. The title and published fields will be indexed. If you want to check this, you can do so by running the mappings API call using the following command (we will talk about the mappings in the Mappings configuration section of Chapter 2, Indexing Your Data):
curl -XGET 'localhost:9200/books/_mapping?pretty'
This will result in Elasticsearch returning the mappings for the whole index.
The URI request
All the queries in Elasticsearch are sent to the _search endpoint. You can search a single index or multiple indices, and you can also narrow down your search to a given document type or multiple types. For example, in order to search our books index, we will run the following command:
curl -XGET 'localhost:9200/books/_search?pretty'
If we have another index called clients, we can also run a single query against these two indices as follows:
curl -XGET 'localhost:9200/books,clients/_search?pretty'
In the same manner, we can also choose the types we want to use during searching. For example, if we want to search only in the es type in the books index, we will run a command as follows:
curl -XGET 'localhost:9200/books/es/_search?pretty'
Note
Please remember that in order to search for a given type, we need to specify the index or indices. If we want to search across all indices, we can set * as the index name; omitting the index name entirely works only when we do not specify a type either. Elasticsearch allows quite rich semantics when it comes to choosing index names. If you are interested, please refer to http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/multi-index.html.
We can also search all the indices by omitting the indices and types. For example, the following command will result in a search through all the data in our cluster:
curl -XGET 'localhost:9200/_search?pretty'
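The path rules described above can be sketched in a few lines of Python. This is only an illustration: the search_path helper is a hypothetical name of ours, not part of Elasticsearch or any client library, and it only builds the path part of the URI without contacting a cluster.

```python
def search_path(indices=None, types=None):
    """Build the path part of a search URI from optional index and type names."""
    if types and not indices:
        # A type can only be addressed through an index, so fall back to
        # the * wildcard, which matches every index.
        indices = ["*"]
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path(["books"]))             # single index
print(search_path(["books", "clients"]))  # two indices
print(search_path(["books"], ["es"]))     # one type in one index
print(search_path())                      # the whole cluster
```

The helper falls back to * when a type is given without an index because the first path segment is always read as an index name.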
Let's assume that we want to find all the documents in our books index that contain the elasticsearch term in the title field. We can do this by running the following query:
curl -XGET 'localhost:9200/books/_search?pretty&q=title:elasticsearch'
The response returned by Elasticsearch for the preceding request will be as follows:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.625,
    "hits" : [ {
      "_index" : "books",
      "_type" : "es",
      "_id" : "1",
      "_score" : 0.625,
      "_source" : {"title":"Elasticsearch Server", "published": 2013}
    }, {
      "_index" : "books",
      "_type" : "es",
      "_id" : "2",
      "_score" : 0.19178301,
      "_source" : {"title":"Mastering Elasticsearch", "published": 2013}
    } ]
  }
}
The first section of the response tells us how much time the request took (the took property, specified in milliseconds), whether it timed out (the timed_out property), and which shards were queried during the request execution: the number of queried shards (the total property of the _shards object), the number of shards that returned results successfully (the successful property of the _shards object), and the number of failed shards (the failed property of the _shards object). The query may also time out if it runs longer than we allow (we can specify the maximum query execution time using the timeout parameter). A failed shard means that something went wrong on that shard or that it was not available during the search execution.
Of course, this information can be useful, but usually we are interested in the results that are returned in the hits object. We have the total number of documents matching the query (in the total property) and the maximum score calculated (in the max_score property). Finally, we have the hits array that contains the returned documents. In our case, each returned document contains its index name (the _index property), type (the _type property), identifier (the _id property), score (the _score property), and the _source field (usually, this is the JSON object sent for indexing; we will discuss this in the Extending your index structure with additional internal information section in Chapter 2, Indexing Your Data).
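To make the structure of the response more concrete, here is a minimal Python sketch that reads the response shown earlier and pulls out the properties we have just described. The summarize function is a hypothetical helper written for this illustration; the response is pasted in as a string, so no running cluster is needed.

```python
import json

# The response from the title:elasticsearch query, copied as a string.
response_text = """
{
  "took": 4,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 2,
    "max_score": 0.625,
    "hits": [
      {"_index": "books", "_type": "es", "_id": "1", "_score": 0.625,
       "_source": {"title": "Elasticsearch Server", "published": 2013}},
      {"_index": "books", "_type": "es", "_id": "2", "_score": 0.19178301,
       "_source": {"title": "Mastering Elasticsearch", "published": 2013}}
    ]
  }
}
"""


def summarize(response):
    """Return (healthy, total, rows) where rows holds (id, score, title)."""
    shards = response["_shards"]
    # Treat the response as healthy when no shard failed and nothing timed out.
    healthy = shards["failed"] == 0 and not response["timed_out"]
    rows = [
        (hit["_id"], hit["_score"], hit["_source"]["title"])
        for hit in response["hits"]["hits"]
    ]
    return healthy, response["hits"]["total"], rows


healthy, total, rows = summarize(json.loads(response_text))
print(healthy, total)
for doc_id, score, title in rows:
    print(doc_id, score, title)
```

Checking the _shards and timed_out properties before trusting the hits is a design choice of this sketch: a response with failed shards may still contain hits, but they can be incomplete.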
You may wonder why the query we ran in the previous section worked. We indexed the term Elasticsearch and ran a query for elasticsearch, and even though they differ in capitalization, the relevant documents were found. The reason for this is analysis. During indexing, the underlying Lucene library analyzes the documents and indexes the data according to the Elasticsearch configuration. By default, Elasticsearch will tell Lucene to index and analyze both string-based data and numbers. The same happens during querying because the URI request query maps to the query_string query (which will be discussed in Chapter 3, Searching Your Data), and this query is analyzed by Elasticsearch.
Let's use the indices analyze API (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html). It allows us to see how the analysis process is done. With it, we can see what happened to one of the documents during indexing and what happened to our query phrase during querying.
In order to see what was indexed in the title field for the Elasticsearch Server phrase, we will run the following command:
curl -XGET 'localhost:9200/books/_analyze?field=title' -d 'Elasticsearch Server'
The response will be as follows:
{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "server",
    "start_offset" : 14,
    "end_offset" : 20,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
We can see that Elasticsearch has divided the text into two terms: the first one has a token value of elasticsearch and the second one has a token value of server.
Now let's look at how the query text was analyzed. We can do that by running the following command:
curl -XGET 'localhost:9200/books/_analyze?pretty&field=title' -d 'elasticsearch'
The response of the request looks as follows:
{
  "tokens" : [ {
    "token" : "elasticsearch",
    "start_offset" : 0,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
We can see that the word is the same as the original one that we passed to the query. We won't get into Lucene query details and how the query parser constructed the query, but in general, the indexed term after analysis was the same as the one in the query after analysis; so, the document matched the query and the result was returned.
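The matching logic can be imitated with a rough Python sketch. Note that simple_analyze is only our approximation for plain English text (split on non-alphanumeric characters, then lowercase); it is not the real Lucene analysis chain and will differ from it on punctuation, non-ASCII text, and many other inputs.

```python
import re


def simple_analyze(text):
    """Approximate the analysis: split into alphanumeric tokens and lowercase them."""
    tokens = []
    matches = re.finditer(r"[A-Za-z0-9]+", text)
    for position, match in enumerate(matches, start=1):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


indexed = simple_analyze("Elasticsearch Server")  # what went into the index
queried = simple_analyze("elasticsearch")         # what the query asked for

indexed_terms = {t["token"] for t in indexed}
query_terms = {t["token"] for t in queried}

# The document matches because the analyzed query term is among the indexed terms.
print(query_terms <= indexed_terms)
```

For this input the sketch produces the same tokens, offsets, and positions as the two _analyze responses, which is why the lowercase query finds the capitalized title.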
There are a few parameters that we can use to control the URI query behavior, which we will discuss now. Each parameter in the query should be concatenated with the & character, as shown in the following example:
curl -XGET 'localhost:9200/books/_search?pretty&q=published:2013&df=title&explain=true&default_operator=AND'
Please also remember to enclose the URL in ' characters because, on Linux-based systems, the shell would otherwise interpret the & character itself.
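When the request is built from a program instead of a shell, the quoting problem disappears, but the parameters still have to be joined and encoded. Here is a minimal sketch using Python's standard urllib.parse module; the search_url helper and the localhost:9200 address are assumptions made for this example.

```python
from urllib.parse import urlencode


def search_url(index, **params):
    """Return a search URL for the given index with encoded query parameters."""
    # safe=":" keeps the field:value separator readable; other reserved
    # characters (spaces, &, and so on) are still encoded.
    query = urlencode(params, safe=":")
    return "http://localhost:9200/{}/_search?{}".format(index, query)


url = search_url("books", q="published:2013", df="title",
                 explain="true", default_operator="AND")
print(url)
```

Letting urlencode join the parameters means that a value containing & or a space cannot accidentally split the query string.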
The q parameter allows us to specify the query that we want our documents to match. It allows us to specify the query using the Lucene query syntax described in The Lucene query syntax section in this chapter. For example, a simple query could look like q=title:elasticsearch.
By using the df parameter, we can specify the default search field that should be used when no field indicator is used in the q parameter. By default, the _all field will be used. This is the field into which Elasticsearch copies the content of all the other fields; we will discuss it in greater depth in the Extending your index structure with additional internal information section in Chapter 2, Indexing Your Data. An example of the df parameter value is df=title.
The analyzer parameter allows us to define the name of the analyzer that should be used to analyze our query. By default, our query will be analyzed by the same analyzer that was used to analyze the field contents during indexing.
The default_operator parameter, which can be set to OR or AND, allows us to specify the default Boolean operator used for our query. By default, it is set to OR, which means that a single query term match will be enough for a document to be returned. Setting this parameter to AND will return only the documents that match all the query terms.
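The difference between the two operators can be shown with a toy Python model. This is not how Elasticsearch evaluates queries internally; the matches function and the hand-written term sets below are assumptions that only mimic the described behavior for already analyzed terms.

```python
def matches(doc_terms, query_terms, default_operator="OR"):
    """Return True if a document's terms satisfy the query terms."""
    present = [term in doc_terms for term in query_terms]
    if default_operator == "AND":
        return all(present)   # every query term must be found
    return any(present)       # a single matching term is enough


# Analyzed title terms of the two documents in the es type.
titles = {
    "1": {"elasticsearch", "server"},
    "2": {"mastering", "elasticsearch"},
}
query = ["elasticsearch", "server"]

for operator in ("OR", "AND"):
    found = sorted(doc_id for doc_id, terms in titles.items()
                   if matches(terms, query, operator))
    print(operator, found)
```

With OR both documents are found, while AND keeps only the document whose title contains both terms.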
If we set the explain parameter to true, Elasticsearch will include additional explain information with each document in the result, such as the shard from which the document was fetched and detailed information about the scoring calculation (we will talk more about it in the Understanding the explain information section in Chapter 5, Make Your Search Better). Remember not to fetch the explain information during normal search queries because it requires additional resources and degrades query performance. For example, a single result can look like the following code:
{
  "_shard" : 3,
  "_node" : "kyuzK62NQcGJyhc2gI1P2w",
  "_index" : "books",
  "_type" : "es",
  "_id" : "2",
  "_score" : 0.19178301,
  "_source" : {"title":"Mastering Elasticsearch", "published": 2013},
  "_explanation" : {
    "value" : 0.19178301,
    "description" : "weight(title:elasticsearch in 0) [PerFieldSimilarity], result of:",
    "details" : [ {
      "value" : 0.19178301,
      "description" : "fieldWeight in 0, product of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "tf(freq=1.0), with freq of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "termFreq=1.0"
        } ]
      }, {
        "value" : 0.30685282,
        "description" : "idf(docFreq=1, maxDocs=1)"
      }, {
        "value" : 0.625,
        "description" : "fieldNorm(doc=0)"
      } ]
    } ]
  }
}
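The numbers in the explanation can be checked by hand. The following Python sketch assumes that the classic Lucene TF-IDF formulas are in use (tf as the square root of the term frequency and idf as 1 + ln(maxDocs / (docFreq + 1))); the classic_idf name is ours, and the fieldNorm value is simply copied from the response because it is stored in the index in a lossy, encoded form.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency, assuming Lucene's classic TF-IDF formula."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


tf = math.sqrt(1.0)       # tf(freq=1.0)
idf = classic_idf(1, 1)   # idf(docFreq=1, maxDocs=1)
field_norm = 0.625        # fieldNorm(doc=0), copied from the explanation

# fieldWeight is the product of the three factors.
score = tf * idf * field_norm
print(idf, score)
```

Up to floating-point precision, the product agrees with the 0.19178301 score reported for the document.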
By default, for each document returned, Elasticsearch will include the index name, type name, document identifier, score, and the _source field. We can modify this behavior by adding the fields parameter and specifying a comma-separated list of field names. The fields will be retrieved from the stored fields (if they exist) or from the internal _source field. By default, the value of the fields parameter is _source. For example: fields=title.
By using the sort parameter, we can specify custom sorting. The default behavior of Elasticsearch is to sort the returned documents by their score in descending order. If we would like to sort our documents differently, we need to specify the sort parameter. For example, adding sort=published:desc will sort the documents by the published field in descending order. By adding the sort=published:asc parameter, we will tell Elasticsearch to sort the documents on the basis of the published field in ascending order.
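As an illustration of what a sort value expresses, here is a small Python sketch that parses a field:order pair and applies it to our three sample documents in memory. The parse_sort helper is hypothetical, and treating a missing order as ascending is an assumption of this sketch; the real sorting happens inside Elasticsearch.

```python
def parse_sort(value):
    """Split a sort value such as 'published:desc' into (field, reverse)."""
    field, _, order = value.partition(":")
    # Assumption for this sketch: a missing order means ascending.
    return field, order == "desc"


books = [
    {"title": "Elasticsearch Server", "published": 2013},
    {"title": "Mastering Elasticsearch", "published": 2013},
    {"title": "Apache Solr 4 Cookbook", "published": 2012},
]

field, reverse = parse_sort("published:asc")
ordered = sorted(books, key=lambda book: book[field], reverse=reverse)
for book in ordered:
    print(book[field], book["title"])
```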
If we specify custom sorting, Elasticsearch will skip the _score calculation for documents. This may not be the desired behavior in your case. If you still want to keep track of the scores for each document when using a custom sort, you should add the track_scores=true parameter to your query. Please note that tracking the scores when doing custom sorting will make the query a little bit slower (you may not even notice it) because of the processing power needed to calculate the scores.
By default, Elasticsearch doesn't set a timeout for queries, but you may want your queries to time out after a certain amount of time (for example, 5 seconds). Elasticsearch allows you to do this by exposing the timeout parameter. When the timeout parameter is specified, the query will be executed up to the given timeout value, and the results that were gathered up to that point will be returned. To specify a timeout of 5 seconds, add the timeout=5s parameter to your query.
Elasticsearch allows you to specify the results window (the range of documents in the results list that should be returned). We have two parameters that allow us to specify the results window: size and from. The size parameter defaults to 10 and defines the maximum number of results returned. The from parameter defaults to 0 and specifies from which document the results should be returned. In order to return five documents starting from the eleventh one, we will add the following parameters to the query: size=5&from=10.
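The results window behaves like a slice of the ranked results list. The following Python sketch shows the arithmetic on a made-up list of document names; results_window is a hypothetical helper, and from_ carries a trailing underscore only because from is a reserved word in Python.

```python
def results_window(ranked_ids, size=10, from_=0):
    """Return the part of the ranked results selected by size and from."""
    return ranked_ids[from_:from_ + size]


ranked = ["doc{}".format(n) for n in range(1, 31)]  # doc1 .. doc30

first_page = results_window(ranked)                 # defaults: size=10, from=0
window = results_window(ranked, size=5, from_=10)   # size=5&from=10
print(first_page)
print(window)
```

Because from is zero-based, from=10 skips the first ten documents, so the window starts at the eleventh one.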
The URI query allows us to specify the search type by using the search_type parameter, which defaults to query_then_fetch. There are six values that we can use: dfs_query_then_fetch, dfs_query_and_fetch, query_then_fetch, query_and_fetch, count, and scan. We'll learn more about search types in the Understanding the querying process section in Chapter 3, Searching Your Data.
Some of the queries use query expansion, such as the prefix query. We will discuss this in the Query rewrite section of Chapter 3, Searching Your Data. We are allowed to define whether the expanded terms should be lowercased or not by using the lowercase_expanded_terms parameter. By default, the lowercase_expanded_terms parameter is set to true, which means that the expanded terms will be lowercased.
The Lucene query syntax
It is good to know a bit more about the syntax that can be used in the q parameter passed in the URI query. Some of the queries in Elasticsearch (such as the one currently discussed) support the Lucene query parser syntax, the language that allows you to construct queries. Let's take a look at it and discuss some basic features. To read about the full Lucene query syntax, please go to the following web page: http://lucene.apache.org/core/4_6_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html.
A query that we pass to Lucene is divided into terms and operators by the query parser. Let's start with the terms, which come in two types: single terms and phrases. For example, to query for the term book in the title field, we will pass the following query:
title:book
To query for the phrase elasticsearch book in the title field, we will pass the following query:
title:"elasticsearch book"
You may have noticed that the field name comes first, followed by a colon and then the term or phrase.
As we already said, the Lucene query syntax supports operators. For example, the + operator tells Lucene that the given part must be matched in the document. The - operator is the opposite: such a part of the query must not be present in the document. A part of the query without the + or - operator is optional: it can be matched, but it is not mandatory. So, if we would like to find a document with the term book in the title field and without the term cat in the description field, we will pass the following query:
+title:book -description:cat
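The meaning of the + and - operators can be modeled with a short Python sketch. This is a deliberately simplified model and not the Lucene implementation: the evaluate function, the clause tuples, and the sample documents are all made up for illustration, terms are compared against lowercased, whitespace-split field values, and scoring is ignored.

```python
def evaluate(doc, clauses):
    """Check a document against (operator, field, term) clauses.

    operator is '+' (must match), '-' (must not match), or '' (optional).
    """
    required_seen = optional_seen = optional_hit = False
    for operator, field, term in clauses:
        hit = term in doc.get(field, "").lower().split()
        if operator == "+":
            required_seen = True
            if not hit:
                return False
        elif operator == "-":
            if hit:
                return False
        else:
            optional_seen = True
            optional_hit = optional_hit or hit
    # Without any required clause, at least one optional clause has to match.
    if optional_seen and not required_seen:
        return optional_hit
    return True


docs = [
    {"title": "A book about dogs", "description": "dog care"},
    {"title": "A book about pets", "description": "cat and dog care"},
    {"title": "Magazine", "description": "dog news"},
]
clauses = [("+", "title", "book"), ("-", "description", "cat")]

matching = [doc["title"] for doc in docs if evaluate(doc, clauses)]
print(matching)
```

Only the first document passes: the second is rejected by the - clause and the third fails the + clause.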
We can also group multiple terms with parentheses, as shown in the following query:
title:(crime punishment)
We can also boost parts of the query with the ^ operator and the boost value after it, as shown in the following query:
title:book^4
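The syntax elements covered in this section can be assembled programmatically. Here is a minimal Python sketch; the clause helper is a hypothetical name, it assumes non-empty input, and it does not escape characters that are special in the Lucene query syntax, so it should be read as an illustration and not as a safe query builder.

```python
def clause(field, text, boost=None, operator=""):
    """Compose one query clause, for example +title:"elasticsearch book"^2."""
    words = text.split()
    # A single word is a term; several words are quoted to form a phrase.
    if len(words) == 1:
        value = words[0]
    else:
        value = '"{}"'.format(" ".join(words))
    result = "{}{}:{}".format(operator, field, value)
    if boost is not None:
        result += "^{}".format(boost)
    return result


print(clause("title", "book"))                 # single term
print(clause("title", "elasticsearch book"))   # phrase
print(clause("title", "book", boost=4))        # boosted term
print(" ".join([clause("title", "book", operator="+"),
                clause("description", "cat", operator="-")]))
```

The resulting string can be passed as the value of the q parameter.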