
Finding Hashtags in Elasticsearch 1.7

Elasticsearch is a great tool for us, and I’ve talked a bit about setting up a cluster here. It’s particularly good at full-text search. But two things we commonly search for at Pixlee are specific hashtags and @-mentions, which unfortunately Elasticsearch isn’t so great at out of the box.

The reason Elasticsearch can’t do it out of the box is that it uses its standard analyzer by default. The standard analyzer breaks text up into individual words, based on how it expects “words” to be delimited. For example, spaces and most special characters are assumed to be irrelevant, and are therefore ignored.

As you can imagine, this means each language actually has its own analyzer. But in our case, neither the standard nor the english analyzer recognizes the # or @ symbols as part of a word; they are instead stripped out. An example of the english analyzer can be found here, whereas there is documentation for the standard analyzer here.
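To build an intuition for that behavior before touching Elasticsearch, here is a rough Python approximation of what a standard-style analyzer does. This is a simplification for illustration, not the actual Lucene tokenizer:

```python
import re

def standard_like_analyze(text):
    # Approximation of a standard-style analyzer: lowercase the text,
    # then keep only alphanumeric runs -- "#" and "@" act as separators.
    return re.findall(r"[a-z0-9]+", text.lower())

print(standard_like_analyze("a #test @user"))  # ['a', 'test', 'user']
```

Note how the # and @ symbols simply vanish, exactly the problem we hit below.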

So let's give it a shot. What happens when we try to analyze a sentence with hashtags?

curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'a #test'
{
    "tokens": [{
        "token": "a",
        "start_offset": 0,
        "end_offset": 1,
        "type": "",
        "position": 1
    }, {
        "token": "test",
        "start_offset": 3,
        "end_offset": 7,
        "type": "",
        "position": 2
    }]
}

curl -XGET 'localhost:9200/_analyze?analyzer=english' -d 'a #test'
{
    "tokens": [{
        "token": "test",
        "start_offset": 3,
        "end_offset": 7,
        "type": "",
        "position": 2
    }]
}

Unfortunately, both of those result in removing the hashtag. Instead of using these default analyzers, we clearly need to build a custom one for our use case. Our initial approach was to leave everything else as is, but also treat the # and @ symbols as alphanumeric characters. To do this, we did the following:

curl -XPUT 'http://localhost:9200/test_mapping' -d '{
"settings" : {
      "analysis" : {
        "filter" : {
            "hashtag_as_alphanum" : {
                "type" : "word_delimiter",
                "type_table": ["# => ALPHANUM", "@ => ALPHANUM"]
            }
        },
        "analyzer" : {
            "hashtag" : {
                "type" : "custom",
                "tokenizer" : "whitespace",
                "filter" : ["lowercase", "hashtag_as_alphanum"]
            }
        }
    }
}}'

In the above example, we are keeping things close to the defaults: the lowercase filter indexes everything in lowercase to avoid case sensitivity, and the whitespace tokenizer splits the text on whitespace only. Normally, the word_delimiter filter would then split tokens on non-alphanumeric characters (anything outside [a-z0-9]) and strip them out.

However, the twist we are throwing at Elasticsearch here is the type_table: by declaring # and @ to be ALPHANUM, the word_delimiter filter treats them like letters and digits, so the set of characters it keeps is effectively [a-z0-9#@] and the symbols survive in the token.
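Conceptually, the effect of this analyzer can be approximated in a few lines of Python — a sketch based on my reading of the settings, not Elasticsearch's actual implementation:

```python
import re

def analyze_hashtag(text):
    # whitespace tokenizer + lowercase filter, then a word_delimiter-style
    # pass that keeps runs of [a-z0-9#@] and drops everything else
    tokens = []
    for tok in text.lower().split():
        tokens.extend(re.findall(r"[a-z0-9#@]+", tok))
    return tokens

print(analyze_hashtag("a #test"))  # ['a', '#test']
```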

We can test it out, like so:

Note: The hashtag analyzer I've created only exists within the context of the test_mapping index, so I have to hit the _analyze endpoint through the test_mapping index.

curl -XGET 'localhost:9200/test_mapping/_analyze?analyzer=hashtag' -d 'a #test'
{
    "tokens": [{
        "token": "a",
        "start_offset": 0,
        "end_offset": 1,
        "type": "word",
        "position": 1
    }, {
        "token": "#test",
        "start_offset": 2,
        "end_offset": 7,
        "type": "word",
        "position": 2
    }]
}

Great, the hashtag is still there. But sometimes people don't put spaces in between hashtags, and our initial approach won't work so well:

curl -XGET 'localhost:9200/test_mapping/_analyze?analyzer=hashtag' -d 'a #test#some#other #testing'
{
    "tokens": [{
        "token": "a",
        "start_offset": 0,
        "end_offset": 1,
        "type": "word",
        "position": 1
    }, {
        "token": "#test#some#other",
        "start_offset": 2,
        "end_offset": 18,
        "type": "word",
        "position": 2
    }, {
        "token": "#testing",
        "start_offset": 19,
        "end_offset": 27,
        "type": "word",
        "position": 3
    }]
}

Well, that sucks. This gave us quite a headache, as the roles of tokenizers, analyzers, and filters in Elasticsearch were a bit unclear from the documentation. Ultimately, we realized we could rewrite each # into a sequence that the later stages would break on and discard, such as |#. So to work around this, we essentially “created” whitespace by adding a character filter for # like this:

curl -XPUT 'http://localhost:9200/test_mapping' -d '{
"settings" : {
      "analysis" : {
        "char_filter" : {
            "space_hashtags" : {
                "type" : "mapping",
                "mappings" : ["#=>|#"]
            }
        },
        "filter" : {
            "hashtag_as_alphanum" : {
                "type" : "word_delimiter",
                "type_table": ["# => ALPHANUM", "@ => ALPHANUM"]
            }
        },
        "analyzer" : {
            "hashtag" : {
                "type" : "custom",
                "char_filter" : "space_hashtags",
                "tokenizer" : "whitespace",
                "filter" : ["lowercase", "hashtag_as_alphanum"]
            }
        }
    }
}}'
curl -XGET 'localhost:9200/test_mapping/_analyze?analyzer=hashtag' -d 'a #test#some#other #testing'
{
    "tokens": [{
        "token": "a",
        "start_offset": 0,
        "end_offset": 1,
        "type": "word",
        "position": 1
    }, {
        "token": "#test",
        "start_offset": 2,
        "end_offset": 18,
        "type": "word",
        "position": 2
    }, {
        "token": "#some",
        "start_offset": 2,
        "end_offset": 18,
        "type": "word",
        "position": 3
    }, {
        "token": "#other",
        "start_offset": 2,
        "end_offset": 18,
        "type": "word",
        "position": 4
    }, {
        "token": "#testing",
        "start_offset": 20,
        "end_offset": 27,
        "type": "word",
        "position": 5
    }]
}

And it works!
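To summarize the whole pipeline, here is a rough Python simulation of the three stages (mapping char filter, whitespace tokenizer, word_delimiter-style filter). It mirrors the behavior we observed above, but it is only a sketch, not Elasticsearch's real code:

```python
import re

def char_filter(text):
    # mapping char filter: "#" => "|#" inserts an artificial boundary
    return text.replace("#", "|#")

def whitespace_tokenize(text):
    # whitespace tokenizer: split on whitespace only
    return text.split()

def word_delimiter(tokens):
    # word_delimiter with "#" and "@" typed as ALPHANUM:
    # keep runs of [a-z0-9#@], drop everything else (e.g. the "|")
    out = []
    for tok in tokens:
        out.extend(re.findall(r"[a-z0-9#@]+", tok))
    return out

def analyze(text):
    return word_delimiter(whitespace_tokenize(char_filter(text.lower())))

print(analyze("a #test#some#other #testing"))
# ['a', '#test', '#some', '#other', '#testing']
```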

Additional notes: You may be wondering why I used a pipe instead of a space. Strangely enough, if you give it a shot, a space doesn't work. My theory is that the space character is ignored when the analyzer settings themselves are parsed, so instead I had to use a character I knew would be stripped out during analysis, but not by whatever parses the analyzer settings.

Ultimately, I personally find Elasticsearch's analyzers to be a bit of a black box, and the best way for me to figure them out was to try a lot of different things and test the results. Hopefully it will become a priority for Elastic to document customization a bit more, but until then, the hope is that this helps someone trying to work with hashtags in Elasticsearch.


TAGS: Engineering
Dennis Yu

Engineer @ Pixlee. Reddit Lurker, (e)sports fan, and volleyball enthusiast.
