Elasticsearch, updating your mapping

Elasticsearch, updating your mapping

When I work on an elasticsearch based project, I have to deal with different steps.

Once I have done most of my job in assuring we have all the raw events we need, I have to start with deal with the various teams and their reporting tools that usualy require a fine grain mapping to enable them to properly sort / search / aggregate the information we have in the elasticsearc cluster.

A very common issue, while working on this task, is to face the errors because the fields are generic text or fail to aggregate because fielddata is not enabled:

Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.

First of all, DON’T enable fielddata, think about that twice and say “NO”. There is always a better way, they cost a lot. Read this for details: https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html

Retrieve the current mapping.

This is the first step, you have to understand what is wrong and how you intend to fix it. use

curl -XGET “http://host:9200/indexname/_mapping”

to retrieve the current mapping of your index.

Create or update the mapping for your index.

First of all, have a look at this URL: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

It will help to clarify which types you can use / specify in your mapping. It is more likely that you will need to add ip, dates and long, especially if you’re going to deal with nested data.

By generally speaking the template you’re going to apply will look like below

{
  "order": 1,
  "settings": {
    ... this is where you specify any settings ...
  },
  "index_patterns": ["yourpattern"]
  "mappings": {
    .. this can be exactly what you get from the _mapping API of the index ...

You may want to do that because some fields are wrongly mapped with a wrong type (e.g. number are currently mapped as text or IP addresses are mapped as text).

Once you have done, you can just invoke the _template API and set/update:

curl -H 'Content-Type: application/json' http://localhost:9200/_template/yourpattern -d@/path/to/yourpattern.template.json

the template that will be used for the next index that will be created, so if you specify

"index_patterns": ["yourpattern-*"]

any new index matching that pattern will inherit the mapping (e.g. yourpattern-2019.01.15)

Update an existing index

The above template won’t affect the already create index, so you might need to “migrate” the existing data to be mapped according to the new template. You can do that by telling elasticsearch to create a new index with the _reindex API.

Below example assume your existing data are stored with index yourpattern-1-2019.01.15. Before proceeding, make sure nothing is writing your index (it could be a good idea to set a version number to the index so you can route the traffic to a new index by just telling the clients, like logstash, to use a different number), then you can run the below curl

curl -XPOST "http://127.0.0.1:9200/_reindex?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "yourpattern-1-2019.01.15"
  },
  "dest": {
    "index": "yourpattern-1.1-2019.01.15"
  }
}'

By specifying ?wait_for_completion=false elasticsearch will return a token that can be used to monitor the process with the _tasks API

curl -XGET "http://127.0.0.1:19200/_tasks/MM-HD_-cRXeOF6QESzidSQ:1155556"

The Reindex API is an incredibly powerful tool, I recommend to read this for more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

Once you have done, you can safely remove the previous index with the wrongly mapped data.

Advertisements
Elasticsearch, some note about shards and replicas

Elasticsearch, some note about shards and replicas

An elasticsearch index is composed of one or more shards. Shards can be primary or replica, primary shards are the first involved in a write, replica are created or updated after the primary.

Shards.

Shards are how elasticsearch scales horizontally and distributes load among the nodes of a cluster. A shard is a lucene index, it can hold up to Integer.MAX_VALUE-128 documents (2,147,483,519). You must tune your index to make sure you won’t have shards greater than 20/30GB (or you will have out of memory issues very soon). Changing the number of shards requires to build/re-build a new index, you must set this while creating the index (or through a proper mapping configuration),

see https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-params.html

{ 
  "settings":
   {
    "index.number_of_shards": 10
   }
}

See template API for details: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html

Replicas.

They are important to scale reads and lower the pressure (replica are usually involved in the reads as primary target) and make you’re index resilient to failure. A replica can become a primary shard if needed.

The replicas can be set on-the-fly

curl -X PUT "localhost:9200/indexname/_settings" -H 'Content-Type: application/json' -d' {     "index" : {         "number_of_replicas" : 2     } } '

CentOS 7: Disabling OOMKiller for a process

site unreliability

In the latest version of the proc filesystem the OOMKiller has had some adjustments.  The valid range is now -1000 to +1000; previously it was -16 to +15 with a special score of -17 to outright disable it.  It also now uses /proc/<pid>/oom_score_adj instead of /proc/<pid>/oom_adj.  You can read the finer details here.

Given that, systemd now includes OOMScoreAdjust specifically for altering this.  To fully disable OOMKiller on a service simply add OOMScoreAdjust=-1000 directly underneath a [Service] definition, as follows.

...
[Service]
OOMScoreAdjust=-1000
...

This score can be adjusted if you want to ensure the parent PID lives, but children processes can be safely reaped by setting it to something like -999, then if “/bin/parent”, has “/bin/parent –memory-hungry-child,” it will be killed first.

If you have a third-party daemon (like Datadog, used in this example below) which manages itself and uses a sysvinit script you can still calm…

View original post 173 more words