"Basic skills": lifecycle management of Elasticsearch indexes (with code)
An index is one of the most important concepts in Elasticsearch and the basis of almost every Elasticsearch operation. Elasticsearch provides index APIs for index lifecycle management, covering index creation, querying, deletion and settings, index freezing and unfreezing, and index splitting and shrinking. Mastering index management is a basic skill for Elasticsearch development and operations, and it also helps with later Elasticsearch optimization.

To create an index, you can use the API provided by Elasticsearch in the following format:
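PUT /<index>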

Where <index> is the name of the index to be created; it is a required parameter, and all letters must be lowercase. When creating an index, you can also provide related settings, as shown in the examples below.

The request to create an index without any settings is as follows:
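PUT /my-index-000001

A typical response (the index name is illustrative) looks like:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my-index-000001"
}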

"acknowledged" in the response body means that the index has been successfully created in the Elasticsearch cluster, and "shards_acknowledged" means that all copy fragments of the index fragment have been prepared before the request times out. Of course, even if "acknowledged" and "shards_acknowledged" are both false, it only means that the index is not completed before the timeout, and subsequent clusters may eventually create the index successfully.

The create index API also supports query parameters and a request body, both of which are optional.

The URL query parameters mainly include wait_for_active_shards, master_timeout, and timeout.

The request body supports three parameters: aliases, mappings, and settings.

The following code example specifies the query parameters and the request body:
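A sketch of such a request; the index name, shard counts, field, and alias are illustrative:

PUT /my-index-000001?wait_for_active_shards=2&timeout=30s
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "created_at": { "type": "date" }
    }
  },
  "aliases": {
    "my-alias": {}
  }
}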

Through the Get index API, you can query information about one or more indexes. The API format is as follows:
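GET /<target>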

The target can be a data stream, an index, or an alias; multiple targets are separated by commas. The target also supports wildcard matching (*), and you can use * or _all to query all indexes. If security permission control is enabled on the cluster, you need the view_index_metadata or manage index privilege to query index information. The following is a specific query example:
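GET /my-index-000001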

The get index API supports optional URL query parameters, mainly including allow_no_indices, expand_wildcards, ignore_unavailable, flat_settings, include_defaults, local, and master_timeout.

First of all, be clear that an index itself cannot be modified. When we talk about modifying an index, we actually mean modifying the index's aliases, field mappings, and settings.

First, a word about aliases. Elasticsearch provides an alias feature: multiple indexes can be added to one alias, making it convenient to operate on several concrete indexes by operating on a single alias. For example, if a new index is built every year, the four related indexes index-2019, index-2020, index-2021, and index-2022 can all be grouped under one alias.

Modifying an alias actually consists of removing the alias from one index and then adding it to another. The two actions are submitted together to guarantee atomicity, ensuring there is no moment at which the alias does not point to any index.

The following code unbinds the alias my-index from the index my-index-000001 and then rebinds it to the index my-index-000002.
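A sketch using the _aliases API:

POST /_aliases
{
  "actions": [
    { "remove": { "index": "my-index-000001", "alias": "my-index" } },
    { "add": { "index": "my-index-000002", "alias": "my-index" } }
  ]
}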

Mappings define the index's data structure and field types. Modifying an index's mapping means adding new fields; the name or type of an existing field cannot be changed in place (that requires reindexing). The following adds a new email field of type keyword.
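A minimal sketch using the put mapping API, reusing the index from the earlier examples:

PUT /my-index-000001/_mapping
{
  "properties": {
    "email": { "type": "keyword" }
  }
}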

The put mapping API supports multiple indexes, data streams, and aliases, as well as wildcard matching; * or _all can be used to target all data streams and indexes.

The following shows how to modify index settings. Note that the number of primary shards is fixed when an index is created; only dynamic settings can be updated on an existing index. The example below sets the number of replicas to 2.
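A sketch using the update index settings API:

PUT /my-index-000001/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}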

Deleting an index deletes its documents, shards, and the corresponding cluster metadata. It is a high-risk operation and should be performed with caution. If security permission control is enabled on the cluster, you need the delete_index or manage index privilege to delete an index.

You cannot delete the write index of a data stream. To delete the current write index, you must first roll over the data stream so that a new write index is created.

The API for deleting indexes is as follows:
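DELETE /<index>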

<index> is a required parameter specifying the index name; multiple indexes are separated by commas. Aliases are not supported, and wildcard matching is disabled by default; if you really need wildcards, you must set the cluster setting action.destructive_requires_name to false. The delete index API is similar to the get index API and also supports URL query parameters, including allow_no_indices, expand_wildcards, ignore_unavailable, master_timeout, and timeout; their meanings are similar to those above, so they are not repeated here. The following is example code for deleting an index:
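DELETE /my-index-000001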

By default, an index is open once created. In some cases, however, you may need to close indexes: for example, old, unneeded indexes still occupy space and incur maintenance overhead in the cluster. Closing an index blocks all read and write operations on it, and a closed index does not need to maintain internal data structures for indexing or searching documents, which reduces the overhead on the cluster.

Special note: a closed index still consumes significant disk space, so closing indexes is an operation that requires particular care in production. You can disable closing indexes by setting cluster.indices.close.enable to false through the cluster settings API; the default value is true.

When an index is opened or closed, the master node is responsible for restarting its shards, and those shards go through the recovery process. After opening or closing, shard data is automatically replicated to ensure there are enough replica shards for high availability.

By default, only explicit full index names are accepted, but if the parameter action.destructive_requires_name is set to false, you can use * or _all to match all indexes. In that case, however, an expression that fails to match any index causes an error, so enabling this is not recommended. The following closes the index from the earlier examples:
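POST /my-index-000001/_close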

The open API is used to reopen closed indexes. If the target is a data stream, all of the stream's backing indexes are opened.

As with closing, only explicit full index names are accepted by default; if action.destructive_requires_name is set to false, * or _all can be used, but an expression that fails to match any index causes an error. The following reopens the index closed above:
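POST /my-index-000001/_open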

Index shrinking and splitting refer to reducing or increasing the number of primary shards of an index. First, we need to understand why we would shrink or split an index at all.

In Elasticsearch, the master node carries a heavy workload managing shards. Reducing the total number of shards in the cluster shortens recovery time, shrinks the cluster state, and lowers cluster maintenance costs. In many cases, some cold indexes receive no new data after running for a while, and merging their small shards reduces the cluster's maintenance burden.

On the other hand, if during operation it turns out that business volume was underestimated and individual shards have grown too large, the index needs to be split to increase the number of primary shards.

1. Index shrinking

Elasticsearch has provided the shrink API since version 5.0 to reduce the number of shards of small indexes. The source index is not actually modified; instead, a new index is created with the same configuration as the source but fewer primary shards. Once the shrink is complete, the source index can be deleted.

To shrink an index with the shrink API, the index needs to meet the following three conditions: the index must be read-only; a copy of every shard must reside on the same node; and the index health status must be green.

To make shard allocation easier, you can first remove the index's replica shards and add them back after the shrink operation completes.

You can use the following code to remove all replica shards, relocate a copy of every primary shard to the same node, and set the index to read-only:
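A sketch following the usual preparation steps; shrink_node_name is a placeholder for an actual node name:

PUT /my-index-000001/_settings
{
  "settings": {
    "index.number_of_replicas": 0,
    "index.routing.allocation.require._name": "shrink_node_name",
    "index.blocks.write": true
  }
}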

Relocating the source index's shards may take some time. You can track progress with the _cat recovery API, or use the cluster health API with the wait_for_no_relocating_shards parameter to wait until all relocation has finished.

After the above steps are completed, the shrinking operation can be carried out. The following is the format of _shrink API:
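POST /<index>/_shrink/<target-index>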

In the following example, the index my-index-000001 is shrunk into shrinked-my-index-000001.
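A sketch; the target shard count of 1 is illustrative and must be a factor of the source index's primary shard count. The allocation requirement and write block are cleared so they are not copied to the target:

POST /my-index-000001/_shrink/shrinked-my-index-000001
{
  "settings": {
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null,
    "index.number_of_shards": 1
  }
}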

Special note: because the target shard for a new document is derived from the routing value modulo the number of primary shards, the number of primary shards in the target index must be a factor of the number in the source index. For example, an index with 8 primary shards can be shrunk to 4, 2, or 1, and an index with 15 primary shards can be shrunk to 5, 3, or 1. If the number of primary shards is a prime number, the index can only be shrunk to a single primary shard.

If the current index is the write index of a data stream, it cannot be shrunk; you must first roll over the data stream to create a new write index before shrinking the current one.

The whole shrink process works as follows: first, a new target index is created with the same definition as the source index but fewer primary shards; then segments from the source index are hard-linked into the target index (if the file system does not support hard links, the segments are copied, which takes longer); finally, the target index is recovered as if it were a closed index that had just been reopened.

2. Index splitting

Similarly, Elasticsearch provides a split API for splitting an index into a new index with more primary shards. The format of the split API is as follows:
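POST /<index>/_split/<target-index>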

To perform a split, the following conditions must be met: the source index must be read-only, and the cluster health status must be green.

The following API request makes the index read-only:
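PUT /my-index-000001/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}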

If the current index is the write index of a data stream, it cannot be split; you must first roll over the data stream to create a new write index before splitting the current one.

The following is an example request that splits an index with the split API; the request body supports settings and aliases.
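A sketch; the target name, shard count, and alias are illustrative (the target's primary shard count must be a multiple of the source's):

POST /my-index-000001/_split/split-my-index-000001
{
  "settings": {
    "index.number_of_shards": 2
  },
  "aliases": {
    "split-alias": {}
  }
}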

The number of primary shards specified by index.number_of_shards must be a multiple of the number of primary shards in the source index.

The number of shards an index can be split into is determined by the static parameter index.number_of_routing_shards. The number of routing shards specifies the hashing space used internally to distribute documents across shards with consistent hashing. For example, an index with 5 shards and number_of_routing_shards set to 30 (5 x 2 x 3) can be split by factors of 2 or 3, i.e., in the following ways:

- 5 → 10 → 30 (split by 2, then by 3)
- 5 → 15 → 30 (split by 3, then by 2)
- 5 → 30 (split by 6)

index.number_of_routing_shards is a static setting that can be specified when creating an index or set on a closed index. Its default value depends on the number of primary shards in the original index: by default, it allows splitting by factors of 2 up to a maximum of 1024 shards. The original number of primary shards must still be taken into account; for example, an index created with 5 primary shards can be split into 10, 20, 40, 80, 160, 320, or at most 640 shards.

If the source index has only one primary shard, it can be split into any number of primary shards.

2.2. How index splitting works

A split works as follows: first, a new target index is created with the same definition as the source index but more primary shards; then segments are hard-linked from the source index into the target index (if the file system does not support hard links, the segments are copied, which is more time-consuming); next, all documents are hashed again to delete documents that belong to a different shard; finally, the target index is recovered as if it were a closed index that had just been reopened.

2.3. Why doesn't Elasticsearch support incremental resharding?

Most key-value stores support automatic resharding as data grows. Why doesn't Elasticsearch?

The classic scheme is to add a shard and store new data in it. But this would create an indexing bottleneck in Elasticsearch and complicate the overall structure, because Elasticsearch would still need to determine which shard a document belongs to, which means rebalancing the existing data under a different hashing scheme.

Key-value stores typically solve this with consistent hashing: when the number of shards grows from N to N+1, only about 1/N of the keys need to be redistributed; Redis Cluster's hash-slot scheme, for example, follows a similar idea.

But an Elasticsearch shard is actually a Lucene index, and deleting a small fraction of data from a Lucene index is usually far more expensive than in a key-value store. So Elasticsearch chooses to split at the index level, using hard links to copy files efficiently and avoid moving documents between indexes.

For append-only scenarios without updates or deletes, you can gain more flexibility by creating a new index, writing new data to it, and adding an alias that covers both the old and new indexes for reads. Assuming the old and new indexes have M and N shards respectively, searching through the alias has no overhead compared to searching a single index with M+N shards.

2.4. How to monitor split progress

When you split an index with the split API, a normal API response does not mean the split has completed; it only means that the request to create the target index was accepted and added to the cluster state. At this point the primary shards may not yet be allocated, and the replica shards may not yet be created.

Once the primary shards are allocated, their state becomes initializing and the actual splitting begins; when the split completes, the shard state becomes active.

You can monitor the split with the _cat recovery API, or use the cluster health API with the wait_for_status parameter set to yellow to wait for all primary shards to be allocated, as shown below.
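A sketch of both approaches (the timeout value is illustrative):

GET /_cat/recovery?v

GET /_cluster/health?wait_for_status=yellow&timeout=30s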

The Elasticsearch clone API can be used to copy an index, for example to back up index data.

2. The index clone API

Index cloning does not clone the metadata of the source index, such as aliases, ILM phase definitions, and CCR follower information. The clone API copies all settings except index.number_of_replicas and index.auto_expand_replicas; these two can be explicitly specified in the clone request. The format of the clone API is as follows:
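POST /<index>/_clone/<target-index>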

An index can be cloned if it meets the following conditions: the index must be read-only (writes blocked), and the cluster health status must be green.

As mentioned earlier, setting index.blocks.write to true blocks writes while still keeping the index readable. The following is an example of using the clone API:
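A sketch, reusing the index from the earlier examples; cloned-my-index-000001 is an illustrative target name:

PUT /my-index-000001/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

POST /my-index-000001/_clone/cloned-my-index-000001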

Note: if index.number_of_shards is specified, its value must equal the number of primary shards of the source index.

3. The index cloning process

Cloning follows the same pattern as shrinking and splitting: a new target index is created with the same definition as the source; segments are hard-linked from the source index into the target (or copied, if the file system does not support hard links); finally, the target index is recovered as if it were a closed index that had just been reopened.

4. Monitoring clone progress

When cloning an index with the clone API, a normal API response does not mean the cloning process has completed; it only means that the request to create the target index was accepted and added to the cluster state. At this point the primary shards may not yet be allocated, and the replica shards may not yet be created.

Once the primary shards are allocated, their state becomes initializing and the cloning process begins; when cloning completes, the shard state becomes active.

You can monitor the cloning process with the _cat recovery API, or use the cluster health API with wait_for_status set to yellow to wait for all primary shards to be allocated, just as in the split monitoring example above.

The rollover API is a very useful feature provided by Elasticsearch. In MySQL, once data volume grows large, tables are often partitioned, for example one table per month. Rollover is similar: you first create an index with an alias and define certain rules (such as a time range condition); when the rules are met, Elasticsearch automatically creates a new index and the alias automatically switches to point to it, which effectively gives you automatic index partitioning at the physical level. A query over a given time range then hits a relatively small index.

The rollover API creates a new index for a data stream or an index alias. (Before Elasticsearch 7.9, time series data was generally managed with index aliases; since 7.9, data streams have replaced this pattern, requiring less maintenance and integrating automatically with data tiers.)

The behavior of the rollover API differs depending on whether the rollover target is a data stream or an index alias.

When using the rollover API, if you do not explicitly specify a new index name and the original index name ends with "-" and a number, the new index keeps the same name pattern with the number incremented. For example, if the original index is my-index-000001, the new index will be my-index-000002.

If you use an index alias for time series data, you can include a date in the index name to track the rollover date. For example, you can create an alias pointing to an index named my-index-2099.05.06-000001; if the alias is rolled over on May 7, 2099, the new index is named my-index-2099.05.07-000002.

The format of the rollover API is as follows:
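POST /<rollover-target>/_rollover/<target-index>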

The rollover API also supports query parameters and a request body. The query parameters are wait_for_active_shards, master_timeout, timeout, and dry_run. dry_run deserves special mention: if it is set to true, the request is not actually executed, but it checks whether the current index meets the specified conditions, which is very useful for pre-testing.

The request body supports aliases, mappings, and settings (these three parameters apply only to indexes, not data streams) as well as conditions.

The conditions parameter deserves attention. It is optional: if conditions are specified, the rollover is performed only when one or more of them are met; if none are specified, the rollover is unconditional. If you need automatic rollover, use ILM rollover.

Conditions support the following attributes: max_age, max_docs, max_size, and, in newer versions, max_primary_shard_size and max_primary_shard_docs.

The following is an example of rolling over a data stream:
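A sketch; the data stream name and conditions are illustrative:

POST /my-data-stream/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000
  }
}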

The response information is as follows:
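A response of roughly this shape is returned (values illustrative):

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "old_index": ".ds-my-data-stream-2099.05.06-000001",
  "new_index": ".ds-my-data-stream-2099.05.07-000002",
  "rolled_over": true,
  "dry_run": false,
  "conditions": {
    "[max_age: 7d]": false,
    "[max_docs: 1000]": true
  }
}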

The following is an example of rolling over an index alias:

1. Create an index and designate it as the alias's write index
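A sketch, following the date-based naming example above:

PUT /my-index-2099.05.06-000001
{
  "aliases": {
    "my-alias": {
      "is_write_index": true
    }
  }
}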

2. Request the rollover API
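A sketch; the conditions are illustrative:

POST /my-alias/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000
  }
}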

If the alias's index names use a date math expression and the index rolls over at regular intervals, you can use date math to narrow the search scope. For example, the following search targets the indexes created in the last three days.
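A sketch of this pattern; the index names are illustrative, and the date math expressions must be URL-encoded in the request line:

# Targets <my-index-{now/d-2d}>, <my-index-{now/d-1d}>, and <my-index-{now/d}>
GET /%3Cmy-index-%7Bnow%2Fd-2d%7D%3E%2C%3Cmy-index-%7Bnow%2Fd-1d%7D%3E%2C%3Cmy-index-%7Bnow%2Fd%7D%3E/_search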

Index freezing is an operation Elasticsearch provides to reduce memory overhead. It was marked deprecated in version 7.14; since version 8, heap memory usage has improved and freezing/unfreezing is no longer applicable.

The following is a brief demonstration; if you are on version 7.x, it can still serve as a reference.

1. Index freezing

After an index is frozen, it imposes almost no overhead on the cluster apart from the metadata kept in memory. A frozen index is read-only; all write operations, such as document indexing and segment merging, are blocked.

The API format is as follows:
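POST /<index>/_freeze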

The following is an example of freezing an index:
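A minimal example, reusing the index from earlier:

POST /my-index-000001/_freeze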

Note that freezing an index closes it and reopens it within the same API call; as a result, primary shards are briefly unallocated and the cluster turns red until reallocation completes.

2. Index unfreezing

The unfreeze API corresponds to the freeze API; its format is as follows:
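POST /<index>/_unfreeze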

The following is an example of unfreezing an index:
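POST /my-index-000001/_unfreeze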

Elasticsearch provides the resolve index API to assist with index resolution. Given the name or wildcard pattern of an index, alias, or data stream, it returns the matching indexes, aliases, and data streams in the current cluster. The format of the API is as follows:
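GET /_resolve/index/<name>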

An example is as follows:
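A sketch using an illustrative wildcard pattern:

GET /_resolve/index/my-index-*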

This API is mostly auxiliary and is not used much in practice; refer to the official documentation for its detailed parameters.

Follow along and become an Elasticsearch expert.