Reverse Engineering Elasticsearch Highlights

Jack Hodkinson · 2020-12-07

Elasticsearch is a full-text search database built on top of Lucene. It's got some amazing features, including a built-in English language analyzer and a search term highlighter. Both of these features are incredibly useful; however, some information is lost when you use them together, which makes it difficult to figure out why your query matched a document. Fortunately, I have come up with a method to recover this information (in most circumstances). This may be of interest to you if you want to see full-text search in action or if you are struggling with a similar problem.

Language Analyzers & Highlights

Elasticsearch is wonderful because its English language analyzer lets you find documents that match your query even if the text does not match your query exactly. If I am interested in finding documents about "new technologies", Elasticsearch will also return documents that mention "new technology". The simple stemming behind the scenes saves you a lot of time thinking up all the possible variations of a query you need to find the content you want. The other wonderful thing about Elasticsearch is that you can get it to highlight the words that match your query string within the containing document. Just like you see on Google!

Highlighting is great because it allows users to see why the document matched their query. This is really important when the query string is very long and there are only partial matches. For example, a query string like fabulous new technologies emerge might return results as in Figure 2.

The annoying part about Elasticsearch is that it doesn't tell you which keywords matched the returned document, or which keyword each highlight belongs to. For example, the response from Elasticsearch for the snippet above will be as follows:

Many see <em>emerging technologies</em> as a solution vector for the global challenges of the twenty-first century. ... Distribution of Micro-Nano <em>Technology</em>

This result tells you NOTHING about which keywords match! It's very frustrating if you want to do clever things with the results depending on which keywords are present.

Mapping the highlight to the query

I came up with a half-baked solution here, starting with the fast vector highlighter and tags_schema set to styled. This will return the following snippet:

Many see <em class="hlt4">emerging</em> <em class="hlt3">technologies</em> as a solution vector for the global challenges of the twenty-first century. ... Distribution of Micro-Nano <em class="hlt3">Technology</em>
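For reference, a request along these lines might look like the sketch below. The field name `text` and the helper name `build_search_body` are assumptions made for illustration. Note that the fast vector highlighter also needs the field to be mapped with `term_vector` set to `with_positions_offsets`.

```python
import json


def build_search_body(query_string, field="text"):
    """Build a search body that asks for styled fast vector highlights.

    `field` is a hypothetical field name; use whatever your index defines.
    """
    return {
        "query": {
            "query_string": {"query": query_string, "default_field": field}
        },
        "highlight": {
            # "styled" replaces the plain <em> tags with
            # <em class="hlt1">, <em class="hlt2">, and so on.
            "tags_schema": "styled",
            # "fvh" selects the fast vector highlighter for this field.
            "fields": {field: {"type": "fvh"}},
        },
    }


body = build_search_body("fabulous new technologies emerge")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a search request with whichever client you normally use.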

What is going on here? Let's take a closer look at the tagged text to understand this better. The three tagged words (in order of appearance) are:

  1. <em class="hlt4">emerging</em>
  2. <em class="hlt3">technologies</em>
  3. <em class="hlt3">Technology</em>

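According to the docs, we should expect the class of the tag to follow hlt1, hlt2, hlt3, etc. These tags make sense if we look at the order of the keywords in our query string: fabulous is keyword 1 (hlt1), new is keyword 2 (hlt2), technologies is keyword 3 (hlt3) and emerge is keyword 4 (hlt4).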

So in our example above, we see that technologies matched two words: once near the beginning of the text and again at the end of the text. We now have a way to trace Technology back to the keyword technologies, and emerging back to the keyword emerge.
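The tracing step can be sketched in a few lines of Python. This is only an illustration of the idea, not a complete solution: the function name `map_highlights` is made up, and it assumes a query string of plain keywords separated by spaces (no quotes, operators or wildcards).

```python
import re

# Matches one styled highlight tag, capturing the tag number and the tagged text.
HLT_TAG = re.compile(r'<em class="hlt(\d+)">(.*?)</em>')


def map_highlights(query_string, snippet):
    """Pair each highlighted word with the query keyword at the same position."""
    keywords = query_string.split()  # assumption: plain, unquoted keywords only
    pairs = []
    for number, text in HLT_TAG.findall(snippet):
        index = int(number) - 1  # hlt1 refers to the first keyword
        keyword = keywords[index] if index < len(keywords) else None
        pairs.append((text, keyword))
    return pairs


snippet = (
    'Many see <em class="hlt4">emerging</em> <em class="hlt3">technologies</em> '
    "as a solution vector ... Distribution of Micro-Nano "
    '<em class="hlt3">Technology</em>'
)
for text, keyword in map_highlights("fabulous new technologies emerge", snippet):
    print(f"{text!r} matched keyword {keyword!r}")
```

For the snippet in this post, that pairs emerging with emerge, and both technologies and Technology with technologies.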

Things get a little bit more complicated if you are looking for exact matches of phrases, e.g. if you use quotation marks to specify an exact match. For example, let's look at the query string: fabulous "new technologies" emerge. This will not match the document above because it does not contain the exact match "new technologies", so let's pick a different example.

The new technology has emerged from darkness.

Our query string, fabulous "new technologies" emerge will result in

The <hlt2>new technology</hlt2> has <hlt3>emerged</hlt3> from darkness.

*Note I have abbreviated <em class="hlt1"> to <hlt1> for the sake of brevity.

We see that the quoted keyword "new technologies" has been treated as the 2nd phrase to tag, so it gets assigned hlt2. This time our query string maps to the following tag tokens: fabulous to hlt1, "new technologies" to hlt2 and emerge to hlt3.

If we introduce a proximity search into our query string, such as in the following query string: fabulous "new technologies"~4 emerge and match this to the following example:

The new hot technology has emerged from darkness.

This will map to the following highlight response

The <hlt2>new</hlt2> hot <hlt2>technology</hlt2> has <hlt3>emerged</hlt3> from darkness.

So we see the tag order was preserved from the previous example.
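To number the keywords in a query string that contains quoted phrases, the splitting has to keep each phrase (and any ~N proximity suffix) together as one token. Here is a rough sketch of that; the regular expression and the name `tag_numbers` are my own illustration, and it assumes the query string has no boolean operators, wildcards or field prefixes.

```python
import re

# A token is either a quoted phrase with an optional ~N proximity suffix,
# or a run of non-whitespace characters.
TOKEN = re.compile(r'"[^"]*"(?:~\d+)?|\S+')


def tag_numbers(query_string):
    """Map the expected hlt tag to each keyword or phrase, in query order."""
    tokens = TOKEN.findall(query_string)
    return {f"hlt{i}": token for i, token in enumerate(tokens, start=1)}


for tag, token in tag_numbers('fabulous "new technologies"~4 emerge').items():
    print(tag, "->", token)
```

Both the exact phrase and the proximity version end up as the second token, which is consistent with the hlt2 tags seen in the two examples above.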

Highlights & Boolean Queries

As any experienced Elasticsearch user will tell you, the query is rarely a simple query string. A query may take the form of a nested boolean query, such as the following example

{
    "query": {
        "bool": {
            "should": [
                {"match": {"text": "fabulous"}},
                {"match": {"text": "new"}}
            ],
            "must": [
                {"match": {"text": "technologies"}},
                {"match": {"text": "emerge"}}
            ]
        }
    }
}

The above query will match the following text

The new hot technology has emerged from darkness.

Mapping to the following highlight

The <hlt4>new</hlt4> hot <hlt1>technology</hlt1> has <hlt2>emerged</hlt2> from darkness.

What the?? The order of the tags (hlt1, hlt2, hlt3, ...) is following an interesting pattern:

{
    "query": {
        "bool": {
            "should": [
                {"match": {"text": "fabulous"}},      [hlt3]
                {"match": {"text": "new"}},           [hlt4]
            ],
            "must": [
                {"match": {"text": "technologies"}},  [hlt1]
                {"match": {"text": "emerge"}},        [hlt2]
            ]
        }
    }
}

We see here that the must field starts the indexing from hlt1 and then the tag index picks up again from the should field.

This gets slightly more complicated when you go to a nested boolean query, but the same rules apply. Using these rules, I've managed to create a parser that reads in an Elasticsearch query object and returns a mapping between the keywords and the expected hlt tag associated with them. This works equally well if you are using the query_string query within a boolean query. Just make sure you follow the rules we explored in the previous section. If you are struggling with this problem please get in touch; I have some code that might help.
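To give a flavour of what such a parser involves, here is a minimal sketch (not my actual parser). It only handles bool queries built from single-keyword match clauses, and the name `keyword_tags` is made up for this example. The "must before should" order comes from the observation above; walking nested bool queries depth-first is an assumption that you should check against your own results.

```python
def keyword_tags(query):
    """Return {tag: keyword} for a bool query made of match clauses."""
    keywords = []

    def walk(node):
        if "bool" in node:
            # Observed order: "must" clauses are numbered before "should".
            for section in ("must", "should"):
                clauses = node["bool"].get(section, [])
                if isinstance(clauses, dict):  # a single clause may be a bare object
                    clauses = [clauses]
                for clause in clauses:
                    walk(clause)  # assumption: nested bools are numbered depth-first
        elif "match" in node:
            for value in node["match"].values():
                # match accepts either a bare string or {"query": "..."}
                text = value["query"] if isinstance(value, dict) else value
                keywords.append(text)  # assumption: one keyword per match clause

    walk(query)
    return {f"hlt{i}": kw for i, kw in enumerate(keywords, start=1)}


query = {
    "bool": {
        "should": [
            {"match": {"text": "fabulous"}},
            {"match": {"text": "new"}},
        ],
        "must": [
            {"match": {"text": "technologies"}},
            {"match": {"text": "emerge"}},
        ],
    }
}
for tag, keyword in keyword_tags(query).items():
    print(tag, "->", keyword)
```

Pass in the object found under the top-level "query" key of the search body. For the example query this reproduces the numbering seen earlier: technologies and emerge take hlt1 and hlt2, then fabulous and new take hlt3 and hlt4.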

The Caveat

Unfortunately, all these nice rules go out the window if you use a wildcard within a query or query string. The query string New tech* will match the following text

I love technology. The various technologies of the 21st century.

Being highlighted as so

I love <hlt2>technology</hlt2>. The various <hlt3>technologies</hlt3> of the 21st century.

We see here that Elasticsearch has decided to give the word "technology" the tag hlt2 and the word "technologies" the tag hlt3. These are treated as separate highlight tokens. This may be all well and good when the wildcard is at the end of the query string (as above, where any tag index from hlt2 upwards should refer to the tech* token in our query string). However, when the wildcard appears in the middle of the query string (or in the middle of a boolean query), we have no way of knowing which tag belongs to which keyword. I am quite perplexed by this problem. If you have found a way around this please let me know!
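Until there is a proper fix, one defensive option is to at least detect when the tag-to-keyword mapping can no longer be trusted. The check below is a sketch of that idea; the name `mapping_is_reliable` is hypothetical, and it assumes a query string of plain keywords separated by spaces (no quoted phrases).

```python
def mapping_is_reliable(query_string):
    """Return True if tag numbers can still be traced back to keywords.

    The mapping is treated as reliable when there is no wildcard at all,
    or when the only wildcard keyword is the last one in the query string.
    """
    tokens = query_string.split()
    wildcard_positions = [
        i for i, token in enumerate(tokens)
        # "*" matches any run of characters, "?" a single character
        if "*" in token or "?" in token
    ]
    last = len(tokens) - 1
    return all(position == last for position in wildcard_positions)


for q in ("New tech*", "New tech* emerge", "fabulous new technologies"):
    print(q, "->", mapping_is_reliable(q))
```

When the check fails, you can fall back to treating the highlights as untraceable instead of reporting the wrong keyword.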

Future investigations

In a future investigation I will test out how this can be accomplished using Lucene directly; perhaps there is a more straightforward way of doing it. I will post again if I figure it out!

Edit: I got a suggestion to try out the annotated text highlighter to handle wildcard queries. I will try this out and report back.

We're hiring!

If you find working on search an interesting problem, then you might find our company a great place to work. We are solving some really hard problems around Search, QA, and NLG. If you want to get involved please let me know! You can reach out to me at jack{at}quantcopy{dot}com. We are hiring engineers based in GMT +/- 4.