Elasticsearch 8 NLP and Beyond

An introduction to the exciting Elasticsearch NLP features. We’ll cover language detection, NER and prediction.

Posted on Aug 15, 2022

Elasticsearch releases always include many features and improvements. You can review all of them in the official release guide. From our perspective, the most interesting additions are the new Natural Language Processing (NLP) features.

Elasticsearch NLP “The fun part!”

NLP as a term describes methods and techniques that allow software to understand natural language in text or audio. The Elasticsearch machine learning features are based on BERT and transformer models that conform to the standard BERT model interface. For practical purposes, in Elasticsearch we use ML models to facilitate NLP. These models allow us to perform text processing that enriches text-based content, making it more robust and useful. This enrichment helps search requests make better sense of strings of text, so that a search experience can, for example, interpret a user’s intent and return more relevant results.

Constructing and training models is a topic for a separate article, so for the sake of this one, we’ll assume that you already have an existing model in place. We’ll walk through how the new Elasticsearch interface capabilities can be used to store and leverage your model in the search solution.

The first key point is that most of these capabilities are applied at index time. This means the document is processed and enriched during ingestion, gaining additional metadata that was ‘inferred’ from the document’s own content by a pre-trained model. For our example enrichment, we’ll use some common NLP tasks in Elastic:

  • Language detection
  • Named entity recognition (NER)
  • Mask filling (phrase prediction)
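As a minimal sketch of index-time enrichment, the Python snippet below builds the body of an ingest pipeline that runs the built-in language identification model on each incoming document. The pipeline name `detect-language`, the `title` source field, and the `ml.language` target field are assumptions made for illustration; adapt them to your own index.

```python
import json


def build_language_pipeline(source_field: str = "title") -> dict:
    """Build an ingest pipeline body that tags each document with its language."""
    return {
        "description": "Detect the language of a field at index time",
        "processors": [
            {
                "inference": {
                    "model_id": "lang_ident_model_1",
                    # The model reads its input from a field named "text",
                    # so map the document's own field onto that name.
                    "field_map": {source_field: "text"},
                    # Assumed location for the inferred metadata.
                    "target_field": "ml.language",
                }
            }
        ],
    }


if __name__ == "__main__":
    # Body for: PUT _ingest/pipeline/detect-language (hypothetical pipeline name)
    print(json.dumps(build_language_pipeline(), indent=2))
```

Documents indexed with this pipeline would carry the inferred fields alongside their original content, which is what makes the enrichment available at search time.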

Language detection

In this simple example, before indexing a document, the language detection model enriches the document based on what the model inferred from the document – in this case, the content of the document title. Understanding the language of a document is a simple but powerful capability. You could, for example, use specific language mappings ‘automagically’ based on the output of this model to provide more precise and meaningful results to your users. You know what’s the best part? This model is available by default in Elasticsearch!
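One way to act on the detected language is to choose an analyzer per document. The sketch below is only an illustration of that idea: the mapping table is hypothetical and covers just four languages, although the analyzer names themselves (`english`, `spanish`, `german`, `french`, `standard`) are built into Elasticsearch.

```python
# Hypothetical lookup from a detected language code to a built-in analyzer.
ANALYZER_BY_LANGUAGE = {
    "en": "english",
    "es": "spanish",
    "de": "german",
    "fr": "french",
}


def pick_analyzer(language_code: str, default: str = "standard") -> str:
    """Return the analyzer for a language code, falling back to a default."""
    return ANALYZER_BY_LANGUAGE.get(language_code, default)


def localized_field(base_field: str, language_code: str) -> str:
    """Name a per-language field, e.g. 'title_es' (naming scheme is an assumption)."""
    return f"{base_field}_{language_code}"


if __name__ == "__main__":
    for code in ("en", "es", "ja"):
        print(code, localized_field("title", code), pick_analyzer(code))
```

Languages missing from the table fall back to the `standard` analyzer, so unexpected predictions degrade gracefully instead of failing.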

[Insert Screenshot Here]

Using Language detection in Elasticsearch

To use the model, navigate to Dev Tools in Kibana and run the following query:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "inference": {
          "model_id": "lang_ident_model_1"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "text": "hello, my name is Gustavo"
      }
    }
  ]
}

This POST runs the string through the model and, as you can see below, adds many fields to our ‘document’.

[Insert Screen Shot]
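If you call the simulate endpoint from code rather than Dev Tools, the added fields can be read out of the response. The sketch below assumes the inference output lands under `ml.inference` in each simulated document and that the predicted language is stored in `predicted_value`. The sample response is hand-written for illustration and its values are not real model output.

```python
def detected_languages(simulate_response: dict, target_field: str = "ml.inference") -> list:
    """Return the predicted language for each simulated document (None if absent)."""
    languages = []
    for entry in simulate_response.get("docs", []):
        node = entry.get("doc", {}).get("_source", {})
        # Walk the dotted path, e.g. "ml.inference" -> source["ml"]["inference"].
        for key in target_field.split("."):
            node = node.get(key, {}) if isinstance(node, dict) else {}
        languages.append(node.get("predicted_value") if isinstance(node, dict) else None)
    return languages


if __name__ == "__main__":
    # Hypothetical, trimmed response shape; the values are made up.
    sample = {
        "docs": [
            {
                "doc": {
                    "_source": {
                        "text": "hello, my name is Gustavo",
                        "ml": {"inference": {"predicted_value": "en"}},
                    }
                }
            }
        ]
    }
    print(detected_languages(sample))
```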

Extracting named entities (NER)

According to Wikipedia, named-entity recognition or NER “is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.” In other words, NER picks out words in a passage of text and classifies them as proper names or numerical entities.

Using NER in Elasticsearch

To use named-entity recognition in Elasticsearch, we need to load one of the many supported third-party models. The good news is that the process of loading models is straightforward.

1. Build the Eland Docker image. Eland is the client that loads models into Elasticsearch. Run this from the root of a checkout of the Eland repository:

docker build -t elastic/eland .

2. Push the model to Elasticsearch, replacing the URL with yours in the format user:password@url:

docker run -it --rm --network host \
    elastic/eland \
    eland_import_hub_model \
    --url https://user:password@example-instance-url:9243 \
    --hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
    --task-type ner \
    --start

Sit tight while the model is pushed to your cluster. Once the upload finishes, you can try the model right away.
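The model ID you use in Elasticsearch is derived from the Hugging Face model name. The sketch below assumes Eland's convention of replacing the slash with a double underscore and lowercasing the result; the helper functions are hypothetical, so check the ID reported by the import command if your model name is unusual.

```python
def elasticsearch_model_id(hub_model_id: str) -> str:
    """Derive the Elasticsearch model ID from a Hugging Face hub model name.

    Assumption: '/' becomes '__' and the ID is lowercased, since
    Elasticsearch model IDs may not contain slashes or uppercase letters.
    """
    return hub_model_id.replace("/", "__").lower()


def infer_path(hub_model_id: str) -> str:
    """Build the inference endpoint path used in this article for a model."""
    return f"/_ml/trained_models/{elasticsearch_model_id(hub_model_id)}/deployment/_infer"


if __name__ == "__main__":
    print(infer_path("elastic/distilbert-base-cased-finetuned-conll03-english"))
    print(infer_path("bert-base-uncased"))
```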

POST /_ml/trained_models/elastic__distilbert-base-cased-finetuned-conll03-english/deployment/_infer
{
  "docs": [
    {
      "text_field": "MC+A Is a search company located in Chicago, Michael Cizmar is the CEO"
    }
  ]
}

[Insert Screen Shot]

As you can see above, the model has again added fields to our document, this time describing the entities it found and how it classified them. Pretty cool!
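To use the classification downstream, for example to populate facet fields, you can group the returned entities by class. The sketch below assumes each entry in the response's `entities` list carries `entity`, `class_name`, and `class_probability` keys. The sample is hand-written for illustration and is not the model's actual output.

```python
from collections import defaultdict


def entities_by_class(inference_result: dict, min_probability: float = 0.0) -> dict:
    """Group entity strings by class, dropping low-confidence entries."""
    grouped = defaultdict(list)
    for item in inference_result.get("entities", []):
        if item.get("class_probability", 0.0) >= min_probability:
            grouped[item["class_name"]].append(item["entity"])
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical result, trimmed to the fields this function reads.
    sample = {
        "entities": [
            {"entity": "MC+A", "class_name": "ORG", "class_probability": 0.95},
            {"entity": "Chicago", "class_name": "LOC", "class_probability": 0.99},
            {"entity": "Michael Cizmar", "class_name": "PER", "class_probability": 0.98},
        ]
    }
    print(entities_by_class(sample, min_probability=0.9))
```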

Mask Filling

Mask filling, or masked language modeling, is an ML task of masking some words in a sentence and predicting which words should replace those masks. Mask filling can be very helpful when you need a statistical understanding of text-based data, and it can be applied to domain-specific content, such as a large corpus of research papers.

Using Mask Filling in Elasticsearch

Since we already have Eland in place, we just need to rerun it with a different model and task type.

eland_import_hub_model --url https://user:password@example-instance-url:9243 --hub-model-id bert-base-uncased --task-type fill_mask --start

Mask filling is personally one of my favorite techniques to use because you never know what the algorithm will predict. Let’s take a closer look by executing this line:

POST /_ml/trained_models/bert-base-uncased/deployment/_infer
{
  "docs": [
    {
      "text_field": "Michael Cizmar is a [MASK] person"
    }
  ]
}

What do you think the answer will be?

[Insert Screen Shot]

As you can see, the mask was replaced by ‘business.’ Not too bad out of the box!

Here is a second sentence to try:

POST /_ml/trained_models/bert-base-uncased/deployment/_infer
{
  "docs": [
    {
      "text_field": "The city of [MASK] is considered one of the best places to live"
    }
  ]
}

[Insert Screen Shot]
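To turn a prediction back into a readable sentence, you can substitute the predicted token for the mask and, if you want alternatives, rank the other candidates. The sketch below assumes the response exposes the best token as `predicted_value` and the alternatives as `top_classes` entries with `class_name` and `class_probability`. The probabilities and the alternative tokens in the sample are made up.

```python
def fill_mask(text: str, predicted_token: str, mask: str = "[MASK]") -> str:
    """Replace the single mask in the text with the predicted token."""
    if text.count(mask) != 1:
        raise ValueError("expected exactly one mask token in the text")
    return text.replace(mask, predicted_token)


def ranked_candidates(inference_result: dict) -> list:
    """Return (token, probability) pairs, most likely first."""
    candidates = inference_result.get("top_classes", [])
    ordered = sorted(candidates, key=lambda c: c["class_probability"], reverse=True)
    return [(c["class_name"], c["class_probability"]) for c in ordered]


if __name__ == "__main__":
    # Hypothetical response; only 'business' comes from the example above.
    sample = {
        "predicted_value": "business",
        "top_classes": [
            {"class_name": "private", "class_probability": 0.05},
            {"class_name": "business", "class_probability": 0.12},
        ],
    }
    print(fill_mask("Michael Cizmar is a [MASK] person", sample["predicted_value"]))
    print(ranked_candidates(sample))
```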

Conclusion

In this article, we covered some of the more interesting NLP capabilities in Elasticsearch, along with a few demonstrations of how you can put them to use.

Additional Information