search

Stop Words

Hibernate Search lets you easily assign an @Analyzer on Fields, which are used to process terms before they are written to the index. An anlyzer can be used for instance for stemming and removing of words which are so frequent that they are insignificant for the results. These are examples for stop words:

[“a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”, “for”, “if”,
 “in”, “into”, “is”, “it”, “no”, “not”, “of”, “on”, “or”, “such”, “that”,
 “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to”, “was”, “will”, “with”]```


It is a common technique, to split input search terms into single keywords and use these keywords for combining a complex queries over several fields. The problem with this approach is that if a user provides such stop words as input and you manually split the input string into a list of keywords, for instance with a split method, it can occur that a stop word becomes a single keyword for Hibernate Search to process.

List keywordsList = Arrays.asList(searchKeywords.split(” “));```

Hibernate does not know what to search for an replies with this error message and also suggests a solution.

The query string 'of' applied on field 'title' has no meaningfull tokens to be matched. Validate the query input against the Analyzer applied on this field.```


In orderb to validate the string with the same analyzer as hibernate does when building the index, we can retrieve the analyzer from the class we want to search on and parse the string by providing it as a stream to the analyzer. First we get the analyser like this:

Validate.notNull(entityManager, “Entity manager can’t be null”); FullTextEntityManager fullTextEntityManager = org.hibernate.search.jpa.Search .getFullTextEntityManager(entityManager); QueryBuilder builder = fullTextEntityManager.getSearchFactory() .buildQueryBuilder().forEntity(MyClass.class).get();```

And then we can write a simple method, which takes the whole string as input and chops the whole string into a list of keywords, which is sent through the analyzer. Thus if your input string contains a stop word and it would not be added to the index, it will also not be included in the list. Thus Hibernate won’t even try to search for it, as it never sees the stemmed word in the first place.

/**
     * Validate input against the tokenizer and return a list of terms.
     * @param analyzer
     * @param string
     * @return
     */
    public static List<String> tokenizeString(Analyzer analyzer, String string)
    {
        List<String> result = new ArrayList&lt;String&gt;();
        try
        {
            TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
            stream.reset();
            while (stream.incrementToken())
            {
                result.add(stream.getAttribute(CharTermAttribute.class).toString());
            }
            stream.close();
        } catch (IOException e)
        {
            // not thrown b/c we're using a string reader...
            throw new RuntimeException(e);
        }
        return result;
    }
}```




<div class="twttr_buttons">
  <div class="twttr_twitter">
    <a href="http://twitter.com/share?text=Validate+Hibernate+Search+Input+with+an+Analyzer" class="twitter-share-button" data-via="" data-hashtags=""  data-size="default" data-url="https://blog.stefanproell.at/2017/04/13/validate-hibernate-search-input-with-an-analyzer/"  data-related="" target="_blank">Tweet</a>
  </div>
  
  <div class="twttr_followme">
    <a href="https://twitter.com/@stefanproell" class="twitter-follow-button" data-show-count="true" data-size="default"  data-show-screen-name="false"  target="_blank">Follow me</a>
  </div>
</div>