Elasticsearch Server: Second Edition
上QQ阅读APP看书,第一时间看更新

Manipulating data with the REST API

The Elasticsearch REST API can be used for various tasks. Thanks to this, we can manage indices, change instance parameters, check nodes and cluster status, index data, search the data, or retrieve documents via the GET API. But for now, we will concentrate on using the CRUD (create-retrieve-update-delete) part of the API, which allows you to use Elasticsearch in a similar way to how you would use a NoSQL database.

Understanding the Elasticsearch RESTful API

In a REST-like architecture, every request is directed to a concrete object indicated by the path of the address. For example, if /books/ is a reference to a list of books in our library, /books/1 is the reference to the book with the identifier 1. Note that these objects can be nested. The /books/1/chapter/6 reference denotes the sixth chapter of the first book in the library, and so on. We have a subject for our API call. What about an operation that we would like to execute, such as GET or POST? To indicate this, request types are used. The HTTP protocol gives us quite a long list of types that can be used as verbs in the API calls. Logical choices are GET in order to obtain the current state of the requested object, POST to change the object state, PUT to create an object, and DELETE to destroy objects. There is also a HEAD request that is only used to fetch the base information of an object.

If we look at the following examples of the operations discussed in the Shutting down Elasticsearch section, everything should make more sense:

  • GET http://localhost:9000/: This command retrieves basic information about Elasticsearch
  • GET http://localhost:9200/_cluster/state/nodes/: This command retrieves the information about the nodes in the cluster
  • POST http://localhost:9200/_cluster/nodes/_shutdown: This command sends a shutdown request to all the nodes in the cluster

We now know what REST means, at least in general (you can read more about REST at http://en.wikipedia.org/wiki/Representational_state_transfer). Now, we can proceed and learn how to use the Elasticsearch API to store, fetch, alter, and delete data.

Storing data in Elasticsearch

As we have already discussed, in Elasticsearch, every piece of data—each document—has a defined index and type. Each document can contain one or more fields that will hold your data. We will start by showing you how to index a simple document using Elasticsearch.

Creating a new document

Now, we will try to index some of the documents. For example, let's imagine that we are building some kind of CMS system for our blog. One of the entities in this blog is articles (surprise!).

Using the JSON notation, a document can be presented as shown in the following example:

{
  "id": "1",
  "title": "New version of Elasticsearch released!",
  "content": "Version 1.0 released today!",
  "priority": 10,
  "tags": ["announce", "elasticsearch", "release"]
}

As we can see, the JSON document contains a set of fields, where each field can have a different form. In our example, we have a number (priority), text (title), and an array of strings (tags). In the following examples, we will show you the other types. As mentioned earlier in this chapter, Elasticsearch can guess these types (because JSON is semi-typed; for example, the numbers are not in quotation marks) and automatically customize how this data will be stored in its internal structures.

Of course, we would like to index our example document and make it available for searching. We will use an index named blog and a type named article. In order to index our example document to this index under the given type and with the identifier of 1, we will execute the following command:

curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New version of Elasticsearch released!", "content": "Version 1.0 released today!", "tags": ["announce", "elasticsearch", "release"] }'

Note a new option to the cURL command: the -d parameter. The value of this option is the text that will be used as a request payload—a request body. This way, we can send additional information such as document definition. Also, note that the unique identifier is placed in the URL and not in the body. If you omit this identifier (while using the HTTP PUT request), the indexing request will return the following error:

No handler found for uri [/blog/article/] and method [PUT]

If everything is correct, Elasticsearch will respond with a JSON response similar to the following output:

{
 "_index":"blog",
 "_type":"article",
 "_id":"1",
 "_version":1
}

In the preceding response, Elasticsearch includes the information about the status of the operation and shows where a new document was placed. There is information about the document's unique identifier and current version, which will be incremented automatically by Elasticsearch every time it is updated.

Automatic identifier creation

In the last example, we specified the document identifier ourselves. However, Elasticsearch can generate this automatically. This seems very handy, but only when index is the only source of data. If we use a database to store data and Elasticsearch for full-text searching, the synchronization of this data will be hindered unless the generated identifier is stored in the database as well. The generation of a unique identifier can be achieved by using the POST HTTP request type and by not specifying the identifier in the URL. For example, look at the following command:

curl -XPOST http://localhost:9200/blog/article/ -d '{"title": "New version of Elasticsearch released!", "content": "Version 1.0 released today!", "tags": ["announce", "elasticsearch", "release"] }'

Note the use of the POST HTTP request method instead of PUT in comparison to the previous example. Referring to the previous description of REST verbs, we wanted to change the list of documents in the index rather than create a new entity, and that's why we used POST instead of PUT. The server should respond with a response similar to the following output:

{
  "_index" : "blog",
  "_type" : "article",
 "_id" : "XQmdeSe_RVamFgRHMqcZQg",
  "_version" : 1
}

Note the highlighted line, which holds the unique identifier generated automatically by Elasticsearch.

Retrieving documents

We already have documents stored in our instance. Now let's try to retrieve them by using their identifiers. We will start by executing the following command:

curl -XGET http://localhost:9200/blog/article/1

Elasticsearch will return a response similar to the following output:

{
 "_index" : "blog",
 "_type" : "article",
 "_id" : "1",
 "_version" : 1,
 "exists" : true, 
 "_source" : {
 "title": "New version of Elasticsearch released!", 
 "content": "Version 1.0 released today!", 
 "tags": ["announce", "elasticsearch", "release"] 
 }

In the preceding response, besides the index, type, identifier, and version, we can also see the information that says that the document was found (the exists property) and the source of this document (in the _source field). If document is not found, we get a reply as follows:

{
 "_index" : "blog",
 "_type" : "article",
 "_id" : "9999",
 "exists" : false
}

Of course, there is no information about the version and source because no document was found.

Updating documents

Updating documents in the index is a more complicated task. Internally, Elasticsearch must first fetch the document, take its data from the _source field, remove the old document, apply changes to the _source field, and then index it as a new document. It is so complicated because we can't update the information once it is stored in the Lucene inverted index. Elasticsearch implements this through a script given as an update request parameter. This allows us to do more a sophisticated document transformation than simple field changes. Let's see how it works in a simple case.

Please recall the example blog article that we've indexed previously. We will try to change its content field from the old one to new content. To do this, we will run the following command:

curl -XPOST http://localhost:9200/blog/article/1/_update -d '{
 "script": "ctx._source.content = \"new content\""
}'

Elasticsearch will reply with the following response:

{"_index":"blog","_type":"article","_id":"1","_version":2}

It seems that the update operation was executed successfully. To be sure, let's retrieve the document by using its identifier. To do this, we will run the following command:

curl -XGET http://localhost:9200/blog/article/1

The response from Elasticsearch should include the changed content field, and indeed, it includes the following information:

{
 "_index" : "blog",
 "_type" : "article",
 "_id" : "1",
 "_version" : 2,
 "exists" : true, 
 "_source" : {
 "title":"New version of Elasticsearch released!",
 "content":"new content",
 "tags":["announce","elasticsearch","release"]
 }

Elasticsearch changed the contents of our article and the version number for this document. Note that we didn't have to send the whole document, only the changed parts. However, remember that to use the update functionality, we need to use the _source field—we will describe how to use the _source field in the Extending your index structure with additional internal information section in Chapter 2, Indexing Your Data.

There is one more thing about document updates; if your script uses a field value from a document that is to be updated, you can set a value that will be used if the document doesn't have that value present. For example, if you want to increment the counter field of the document and it is not present, you can use the upsert section in your request to provide the default value that will be used. For example, look at the following lines of command:

curl -XPOST http://localhost:9200/blog/article/1/_update -d '{
 "script": "ctx._source.counter += 1",
 "upsert": {
 "counter" : 0
 }
}'

If you execute the preceding example, Elasticsearch will add the counter field with the value of 0 to our example document. This is because our document does not have the counter field present and we've specified the upsert section in the update request.

Deleting documents

We have already seen how to create (PUT) and retrieve (GET) documents. We also know how to update them. It is not difficult to guess that the process to remove a document is similar; we need to send a proper HTTP request using the DELETE request type. For example, to delete our example document, we will run the following command:

curl -XDELETE http://localhost:9200/blog/article/1

The response from Elasticsearch will be as follows:

{"found":true,"_index":"blog","_type":"article","_id":"1","_version":3}

This means that our document was found and it was deleted.

Now we can use the CRUD operations. This lets us create applications using Elasticsearch as a simple key-value store. But this is only the beginning!

Versioning

In the examples provided, you might have seen information about the version of the document, which looked like the following:

"_version" : 1

If you look carefully, you will notice that after updating the document with the same identifier, this version is incremented. By default, Elasticsearch increments the version when a document is added, changed, or deleted. In addition to informing us about the number of changes made to the document, it also allows us to implement optimistic locking (http://en.wikipedia.org/wiki/Optimistic_concurrency_control). This allows us to avoid issues when processing the same document in parallel. For example, we read the same document in two different applications, modify it differently, and then try to update the one in Elasticsearch. Without versioning the version, we will see the one sent for indexation as the last version. Using optimistic locking, Elasticsearch guards the data accuracy—every attempt to write the document that has been already changed will fail.

An example of versioning

Let's look at an example that uses versioning. Let's assume that we want to delete a document with the identifier 1 with the book type from the library index. We also want to be sure that the delete operation is successful if the document was not updated. What we need to do is add the version parameter with the value of 1 as follows:

curl -XDELETE 'localhost:9200/library/book/1?version=1'

If the version of the document in the index is different from 1, the following error will be returned by Elasticsearch:

{
 "error": "VersionConflictEngineException[[library][4] [book][1]: version conflict, current [2], provided [1]]",
 "status": 409
}

In our example, Elasticsearch compared the version number declared by us and saw that this version is not the same in comparison to the version of the document in Elasticsearch. That's why the operation failed.

Using the version provided by an external system

Elasticsearch can also be based on the version number provided by us. It is necessary when the version is stored in the external system—in this case, when you index a new document, you should provide the version parameter as in the preceding example. In such cases, Elasticsearch will only check if the version provided with the operation is greater (it is not important how much) than the one saved in the index. If it is, the operation will be successful, and if not, it will fail. To inform Elasticsearch that we want to use external version tracking, we need to add the version_type=external parameter in addition to the version parameter.

For example, if we want to add a document that has a version 123456 in our system, we will run a command as follows:

curl -XPUT 'localhost:9200/library/book/1?version=123456' -d {...}

Note

Elasticsearch can check the version number even after the document is removed. That's because Elasticsearch keeps information about the version of the deleted document. By default, this information is available for 60 seconds after the deletion of the document. This time value can be changed by using the index.gc_deletes configuration parameter.