Better synonym handling in Solr

Update: Download the plugin on GitHub.

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

A Rover by any other name would taste just as sweet.

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

The SynonymFilterFactory

Solr provides a cool-sounding SynonymFilterFactory, which can be fed a simple text file containing comma-separated synonyms. You can even choose whether to expand your synonyms reciprocally or to specify a particular directionality.

For instance, you can make “dog,” “hound,” and “pooch” all expand to “dog | hound | pooch,” or you can specify that “dog” maps to “hound” but not vice-versa, or you can make them all collapse to “dog.” This part of the synonym handling is very flexible and works quite well.
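To make the three behaviors concrete, here is a minimal sketch in Python of how such rules could be interpreted. It is a simplified model of the synonyms file semantics, not Solr's implementation; the function name `parse_synonym_rules` is made up for illustration.

```python
from collections import defaultdict

def parse_synonym_rules(lines, expand=True):
    """Turn synonyms.txt-style lines into a {term: [replacement, ...]} mapping."""
    mapping = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit, one-directional rule: left-hand terms map to right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent terms: expand to all of them, or collapse to the first one.
            sources = [t.strip() for t in line.split(",")]
            targets = sources if expand else sources[:1]
        for source in sources:
            for target in targets:
                if target not in mapping[source]:
                    mapping[source].append(target)
    return dict(mapping)

reciprocal = parse_synonym_rules(["dog, hound, pooch"], expand=True)
collapsed = parse_synonym_rules(["dog, hound, pooch"], expand=False)
one_way = parse_synonym_rules(["dog => hound"])

print(reciprocal["hound"])  # ['dog', 'hound', 'pooch']
print(collapsed["pooch"])   # ['dog']
print("hound" in one_way)   # False: "hound" does not map back to "dog"
```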

Where it gets complicated is when you have to decide where to fit the SynonymFilterFactory: into the query analyzer or the index analyzer?

Index-time vs. query-time

The graphic below summarizes the basic differences between index-time and query-time expansion. Our problem is specific to Solr, but the choice between these two approaches can apply to any information retrieval system.

Index-time vs. query-time expansion.

Your first, intuitive choice might be to put the SynonymFilterFactory in the query analyzer. In theory, this should have several advantages:

  1. Your index stays the same size.
  2. Your synonyms can be swapped out at any time, without having to update the index.
  3. Synonyms work instantly; there’s no need to re-index.
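The trade-off can be illustrated with a toy inverted index. This is only a sketch in Python under simplifying assumptions (whitespace tokenization, single-word synonyms, a made-up three-document corpus); it is not how Solr stores anything, but it shows why index-time expansion grows the index while query-time expansion leaves it alone.

```python
from collections import defaultdict

GROUP = {"dog", "hound", "pooch"}
SYNONYMS = {term: GROUP for term in GROUP}

# Hypothetical corpus, for illustration only.
DOCS = {1: "my dog barks", 2: "the hound howls", 3: "cats purr"}

def build_index(docs, expand_at_index_time):
    """Build a toy inverted index: term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            terms = SYNONYMS.get(token, {token}) if expand_at_index_time else {token}
            for term in terms:
                index[term].add(doc_id)
    return index

def search(index, query, expand_at_query_time):
    """Return ids of docs matching any query term (or any of its synonyms)."""
    hits = set()
    for token in query.split():
        terms = SYNONYMS.get(token, {token}) if expand_at_query_time else {token}
        for term in terms:
            hits |= index.get(term, set())
    return hits

def count_postings(index):
    """Total number of (term, doc) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())

index_time = build_index(DOCS, expand_at_index_time=True)
query_time = build_index(DOCS, expand_at_index_time=False)

# Both strategies find the same documents for "pooch"...
print(search(index_time, "pooch", expand_at_query_time=False))  # {1, 2}
print(search(query_time, "pooch", expand_at_query_time=True))   # {1, 2}
# ...but only index-time expansion makes the index bigger.
print(count_postings(index_time), count_postings(query_time))   # 12 8
```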

However, according to the Solr docs, this is a Very Bad Thing to Do(™), and apparently you should put the SynonymFilterFactory into the index analyzer instead, despite what your instincts would tell you. They explain that query-time synonym expansion has three negative side effects:

  1. Multi-word synonyms won’t work as phrase queries.
  2. The IDF of rare synonyms will be boosted, causing unintuitive results.
  3. Multi-word synonyms won’t be matched in queries.

This is kind of complicated, so it’s worth stepping through each of these problems in turn.

Multi-word synonyms won’t work as phrase queries

At Health On the Net, our search engine uses MeSH terms for query expansion. MeSH is a medical ontology that works pretty well to provide some sensible synonyms for the health domain. Consider, for example, the synonyms for “breast cancer”:

breast neoplasm
breast neoplasms
breast tumor
breast tumors
cancer of breast
cancer of the breast


So in a normal SynonymFilterFactory setup with expand=”true”, a query for “breast cancer” becomes:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)


…which matches documents containing “breast neoplasms,” “cancer of the breast,” etc.

However, this also means that, if you’re doing a phrase query (i.e. “breast cancer” with the quotes), your document must literally match something like “breast cancer breast breast” in order to work.

Huh? What’s going on here? Well, it turns out that the SynonymFilterFactory isn’t expanding your multi-word synonyms the way you might think. Intuitively, if we were to represent this as a finite-state automaton, you might think that Solr is building up something like this (ignoring plurals):

What you reasonably expect.

But really it’s building up this:

The spaghetti you actually get.

And your poor, unlikely document must match all four terms in sequence. Yikes.
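The stacking can be reproduced with a few lines of Python. This is a toy model, assuming each synonym's tokens are simply overlaid on the original query by position and that “of” and “the” are dropped as stopwords; the helper names are invented for the example.

```python
STOPWORDS = {"of", "the"}

def stack_by_position(phrases, stopwords=STOPWORDS):
    """Overlay tokenized phrases so that the i-th tokens all share position i."""
    positions = []
    for phrase in phrases:
        for i, token in enumerate(phrase.split()):
            if i == len(positions):
                positions.append([])  # first phrase long enough to reach this position
            if token not in stopwords:
                positions[i].append(token)
    return positions

def to_query(positions):
    """Render the stacked positions in the parsed-query notation used in this post."""
    clauses = []
    for tokens in positions:
        if not tokens:
            continue  # a position that held only stopwords
        clauses.append(tokens[0] if len(tokens) == 1 else "(" + " ".join(tokens) + ")")
    return "+(" + " ".join(clauses) + ")"

phrases = [
    "breast cancer",  # the original query
    "breast neoplasm", "breast neoplasms",
    "breast tumor", "breast tumors",
    "cancer of breast", "cancer of the breast",
]

stacked = stack_by_position(phrases)
print(len(stacked))       # 4 positions, hence the four required terms
print(to_query(stacked))
# +((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)
```

Nothing in the stacked structure remembers that “neoplasm” belongs after “breast” but not after “cancer”, which is exactly the information the spaghetti graph has lost.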

Similarly, the mm parameter (minimum “should” match) in the DisMax and EDisMax query parsers will not work as expected. In the example above, setting mm=100% will require that all four terms be matched:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)~4


The IDF of rare synonyms will be boosted

Even if you don’t have multi-word synonyms, the Solr docs mention a second good reason to avoid query-time expansion: unintuitive IDF boosting. Consider our “dog,” “hound,” and “pooch” example. In this case, a query for any one of the three will be expanded into:

+(dog hound pooch)


Since “hound” and “pooch” are much less common words, though, this means that documents containing them will always be artificially high in the search results, regardless of the query. This could create havoc for your poor users, who may be wondering why weird documents about hounds and pooches are appearing so high in their search for “dog.”

Index-time expansion supposedly fixes this problem by giving the same IDF values for “dog,” “hound,” and “pooch,” regardless of what the document originally said.
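A quick calculation shows the size of the effect. The sketch below uses the IDF formula from Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)); the corpus size and document frequencies are invented numbers, and the index-time case assumes no document contains more than one of the three terms.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 100_000                                    # hypothetical corpus size
DOC_FREQ = {"dog": 5000, "hound": 200, "pooch": 20}   # hypothetical document counts

# Query-time expansion: each term keeps its own document frequency.
query_time = {term: classic_idf(df, NUM_DOCS) for term, df in DOC_FREQ.items()}

# Index-time expansion: every document with one term now contains all three,
# so they all share the same document frequency.
shared_df = sum(DOC_FREQ.values())
index_time = {term: classic_idf(shared_df, NUM_DOCS) for term in DOC_FREQ}

for term in DOC_FREQ:
    print(f"{term:6} query-time idf={query_time[term]:.2f}  index-time idf={index_time[term]:.2f}")
```

With these made-up numbers, “pooch” gets a much larger query-time IDF than “dog”, while the index-time values are identical for all three terms.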

Multi-word synonyms won’t be matched in queries

Finally, and most seriously, the SynonymFilterFactory will simply not match multi-word synonyms in user queries if you do any kind of tokenization. This is because the tokenizer breaks up the input before the SynonymFilterFactory can transform it.

For instance, the query “cancer of the breast” will be tokenized by the StandardTokenizerFactory into [“cancer”, “of”, “the”, “breast”], and only the individual terms will pass through the SynonymFilterFactory. So in this case no expansion will take place at all, assuming there are no synonyms for the individual terms “cancer” and “breast.”

Edit: I’ve been corrected on this. Apparently, the bug is in the Lucene query parser (LUCENE-2605) rather than the SynonymFilterFactory.
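Either way, the effect is the same: the synonym lookup never gets to see the multi-word phrase in one piece. Here is a rough Python sketch of that behavior, assuming the query is split on whitespace and each chunk is analyzed separately; the functions are stand-ins, not Lucene APIs.

```python
SYNONYMS = {"cancer of the breast": ["breast cancer", "breast neoplasm"]}

def expand(text):
    """Stand-in for a synonym filter: look up the text it is handed, as a whole."""
    return [text] + SYNONYMS.get(text, [])

def parse_splitting_on_whitespace(query):
    """Analyze each whitespace-separated chunk on its own."""
    return [expand(chunk) for chunk in query.split()]

def parse_whole_query(query):
    """Hand the entire query string to the synonym lookup in one piece."""
    return expand(query)

print(parse_splitting_on_whitespace("cancer of the breast"))
# [['cancer'], ['of'], ['the'], ['breast']]  (no synonyms found)
print(parse_whole_query("cancer of the breast"))
# ['cancer of the breast', 'breast cancer', 'breast neoplasm']
```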

Other problems

I initially followed Solr’s suggestions, but I found that index-time synonym expansion created its own issues. Obviously there’s the problem of ballooning index sizes, but besides that, I also discovered an interesting bug in the highlighting system.

When I searched for “breast cancer,” I found that the highlighter would mysteriously highlight “breast cancer X Y,” where “X” and “Y” could be any two words that followed “breast cancer” in the document. For instance, it might highlight “breast cancer frauds are” or “breast cancer is to.”

Highlighting bug.

After reading through this Solr bug, I discovered it’s because of the same issue above concerning how Solr expands multi-word synonyms.

With query-time expansion, it’s weird enough that your query is logically transformed into the spaghettified graph above. But picture what happens with index-time expansion, if your document contains e.g. “breast cancer treatment options”:

Your mangled document.

This is literally what Lucene thinks your document looks like. Synonym expansion has bought you more than you bargained for, with some Dada-esque results! “Breast tumor the options” indeed.

Essentially, Lucene now believes that a query for “cancer of the breast” (4 tokens) is the same as “breast cancer treatment options” (4 tokens) in your original document. This is because the tokens are just stacked one on top of the other, losing any information about which term should be followed by which other term.
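Here is a small Python model of that mangled document. It assumes synonyms are stacked position by position starting where “breast cancer” begins, and that a phrase matches when each of its tokens is found at consecutive positions; tokens that would fall past the end of the document are simply dropped in this sketch.

```python
def index_with_synonyms(doc, phrase, synonyms):
    """Return one set of tokens per position, stacking synonyms where `phrase` starts."""
    tokens = doc.split()
    positions = [{token} for token in tokens]
    phrase_tokens = phrase.split()
    for start in range(len(tokens) - len(phrase_tokens) + 1):
        if tokens[start:start + len(phrase_tokens)] == phrase_tokens:
            for synonym in synonyms:
                for offset, token in enumerate(synonym.split()):
                    if start + offset < len(positions):  # simplification: ignore overflow
                        positions[start + offset].add(token)
    return positions

def phrase_matches(positions, query):
    """True if the query tokens line up with consecutive positions somewhere."""
    q = query.split()
    return any(
        all(q[i] in positions[start + i] for i in range(len(q)))
        for start in range(len(positions) - len(q) + 1)
    )

SYNONYMS = ["breast neoplasm", "breast tumor", "cancer of breast", "cancer of the breast"]
DOC = "breast cancer treatment options"

plain = index_with_synonyms(DOC, "breast cancer", [])
stacked = index_with_synonyms(DOC, "breast cancer", SYNONYMS)

print(phrase_matches(plain, "cancer of the breast"))        # False
print(phrase_matches(stacked, "cancer of the breast"))      # True, spanning all four words
print(phrase_matches(stacked, "breast tumor the options"))  # True
```

Because the match for “cancer of the breast” covers four positions, a highlighter working from those positions would mark all of “breast cancer treatment options”.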

Query-time expansion does not trigger this bug, because Solr is only expanding the query, not the document. So Lucene still thinks “cancer of the breast” in the query only matches “breast cancer” in the document.

Update: there’s a name for this phenomenon! It’s called “sausagization.”

Back to the drawing board

All of this wackiness led me to the conclusion that Solr’s built-in mechanism for synonym expansion was seriously flawed. I had to figure out a better way to get Solr to do what I wanted.

In summary, index-time expansion and query-time expansion were both infeasible using the standard SynonymFilterFactory, since they each had separate problems:

Index-time expansion

  • Index size balloons.
  • Synonyms don’t work instantly; documents must be re-indexed.
  • Synonyms cannot be instantly replaced.
  • Multi-word synonyms cause arbitrary words to be highlighted.

Query-time expansion

  • Phrase queries do not work.
  • IDF values for rare synonyms are artificially boosted.
  • Multi-word synonyms won’t be matched in queries.

I began with the assumption that the ideal synonym-expansion system should be query-based, due to the inherent downsides of index-based expansion listed above. I also realized there’s a more fundamental problem with how Solr has implemented synonym expansion that should be addressed first.

Going back to the “dog”/”hound”/”pooch” example, there’s a big issue usability-wise with treating all three terms as equivalent. A “dog” is not exactly the same thing as a “pooch” or a “hound,” and certain queries might really be looking for that exact term (e.g. “The Hound of the Baskervilles,” “The Itchy & Scratchy & Poochy Show”). Treating all three as equivalent feels wrong.

Also, even with the recommended approach of index-time expansion, IDF weights are thrown out of whack. Every document that contains “dog” now also contains “pooch”, which means we have permanently lost information about the true IDF value for “pooch”.

In an ideal system, a search for “dog” should include documents containing “hound” and “pooch,” but it should still prefer documents containing the actual query term, which is “dog.” Similarly, searches for “hound” should prefer “hound,” and searches for “pooch” should prefer “pooch.” (I hope I’m not saying anything controversial here.) All three should match the same document set, but deliver the results in a different order.


My solution was to move the synonym expansion from the analyzer’s tokenizer chain to the query parser. So instead of expanding queries into the crazy intercrossing graphs shown above, I split it into two parts: the main query and the synonym query. Then I combine the two with separate, configurable weights, specify each one as “should occur,” and then wrap them both in a “must occur” boolean query.

So a search for “dog” is parsed as:

+((dog)^1.2 (hound pooch)^1.1)


The 1.2 and the 1.1 are the independent boosts, which can be configured as input parameters. The document must contain one of “dog”, “hound,” or “pooch”, but “dog” is preferred.
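As a rough illustration, the shape of that query can be produced by a function like the one below. This is a Python sketch of the string being built, not the plugin's actual code; the function name and the fallback for queries without synonyms are assumptions.

```python
def build_synonym_query(original, synonyms, original_boost=1.2, synonym_boost=1.1):
    """Render '+((original)^boost (syn1 syn2 ...)^boost)' as a query string."""
    if not synonyms:
        # Assumption for this sketch: with no synonyms, fall back to the plain query.
        return f"+({original})"
    synonym_clause = " ".join(synonyms)
    return f"+(({original})^{original_boost} ({synonym_clause})^{synonym_boost})"

print(build_synonym_query("dog", ["hound", "pooch"]))
# +((dog)^1.2 (hound pooch)^1.1)
```

Both inner clauses are optional (“should occur”), but the outer `+` makes the pair mandatory, so a document needs at least one of the three terms and scores higher when it contains the one the user typed.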

Handling synonyms in this way also has another interesting side effect: it eliminates the problem of phrase queries not working. In the case of “breast cancer” (with the quotes), the query is parsed as:

+(("breast cancer")^1.2 (("breast neoplasm") ("breast tumor") ("cancer ? breast") ("cancer ? ? breast"))^1.1)


(The question marks appear because of the stopwords “of” and “the.”)

This means that a query for “breast cancer” (with the quotes) will also match documents containing the exact sequences “breast neoplasm,” “breast tumor,” “cancer of the breast,” and “cancer of breast.”

I also went one step beyond the original SynonymFilterFactory and built up all possible synonym combinations for a given query. So, for instance, if the query is “dog bite” and the synonyms file contains:

dog,hound,pooch
bite,nibble

… then the query will be expanded into:

dog bite
hound bite
pooch bite
dog nibble
hound nibble
pooch nibble
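Building these combinations is essentially a Cartesian product over the alternatives for each word. A minimal Python sketch, assuming single-word synonyms only and using a made-up helper name (the order of the results may differ from the plugin's):

```python
from itertools import product

def expand_combinations(query, synonym_groups):
    """Return every query variant obtained by swapping each word for its synonyms."""
    alternatives = []
    for word in query.split():
        # Use the word's synonym group if it has one, otherwise just the word itself.
        group = next((g for g in synonym_groups if word in g), [word])
        alternatives.append(group)
    return [" ".join(combo) for combo in product(*alternatives)]

groups = [["dog", "hound", "pooch"], ["bite", "nibble"]]
variants = expand_combinations("dog bite", groups)
for variant in variants:
    print(variant)
```

The number of variants is the product of the group sizes (3 × 2 = 6 here), so long queries with many synonyms can grow quickly.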


Try it yourself!

The code I wrote is a simple extension of the ExtendedDismaxQParserPlugin, called the SynonymExpandingExtendedDismaxQParserPlugin (long enough name?). I’ve only tested it with Solr 3.5.0, but it ought to work with any version that has EDisMax.

Edit: the instructions below are deprecated. Please follow the “Getting Started” guide on the GitHub page instead.

Here’s how you can use the parser:

  1. Drop this jar into your Solr’s lib/ directory.
  2. Add this definition to your solrconfig.xml:

    <queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
      <!-- TODO: figure out how we wouldn't have to define this twice -->
      <str name="luceneMatchVersion">LUCENE_34</str>
      <lst name="synonymAnalyzers">
        <lst name="myCoolAnalyzer">
          <lst name="tokenizer">
            <str name="class">solr.StandardTokenizerFactory</str>
          </lst>
          <lst name="filter">
            <str name="class">solr.ShingleFilterFactory</str>
            <str name="outputUnigramsIfNoShingles">true</str>
            <str name="outputUnigrams">true</str>
            <str name="minShingleSize">2</str>
            <str name="maxShingleSize">4</str>
          </lst>
          <lst name="filter">
            <str name="class">solr.SynonymFilterFactory</str>
            <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
            <str name="synonyms">my_synonyms_file.txt</str>
            <str name="expand">true</str>
            <str name="ignoreCase">true</str>
          </lst>
        </lst>
        <!-- add more analyzers here, if you want -->
      </lst>
    </queryParser>

    The analyzer you see defined above is the one used to split the query into all possible alternative synonyms. Synonyms that are exactly the same as the original query will be ignored, so feel free to use expand=true if you like.

    This particular configuration (StandardTokenizerFactory + ShingleFilterFactory + SynonymFilterFactory) is just the one that I found worked the best for me. Feel free to try a different configuration, but something really fancy might break the code, so I don’t recommend going too far.

    For instance, you can configure the ShingleFilterFactory to output shingles (i.e. word N-grams) of any size you want, but I chose shingles of size 1-4 because my synonyms typically aren’t longer than 4 words. If you don’t have any multi-word synonyms, you can get rid of the ShingleFilterFactory entirely.

    (I know that this XML format is different from the typical one found in schema.xml, since it uses lst and str tags to configure the tokenizer and filters. Also, you must define the luceneMatchVersion a second time. I’ll try to find a way to fix these problems in a future release.)

  3. Add defType=synonym_edismax to your query URL parameters, or set it as the default in solrconfig.xml.
  4. Add the following query parameters. The first one is required:

    Param | Type | Default | Summary
    synonyms | boolean | false | Enable or disable synonym expansion entirely. Enabled if true.
    synonyms.analyzer | String | null | Name of the analyzer defined in solrconfig.xml to use. (E.g. in the example above, it’s myCoolAnalyzer.) This must be non-null if you define more than one analyzer.
    synonyms.originalBoost | float | 1.0 | Boost value applied to the original (non-synonym) part of the query.
    synonyms.synonymBoost | float | 1.0 | Boost value applied to the synonym part of the query.
    synonyms.disablePhraseQueries | boolean | false | Enable or disable synonym expansion when the user input contains a phrase query (i.e. a quoted query).
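To see why the shingle filter is in the analyzer chain, here is a rough Python model of what the configured analyzer does with a query: it produces word N-grams of size 1 to 4, then looks each one up as a single key, ignoring case. The function names and the synonym group are made up for the example; this is not the plugin's code.

```python
def shingles(tokens, min_size=2, max_size=4, output_unigrams=True):
    """Return word n-grams of the given sizes, plus single words if requested."""
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams:
        sizes = [1] + sizes
    result = []
    for start in range(len(tokens)):
        for size in sizes:
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result

def find_synonyms(query, synonym_groups):
    """Look up every shingle as one key and collect the other members of its group."""
    found = {}
    for shingle in shingles(query.lower().split()):
        for group in synonym_groups:
            if shingle in group:
                found[shingle] = [term for term in group if term != shingle]
    return found

# Hypothetical synonym group, for illustration only.
GROUPS = [["breast cancer", "breast neoplasm", "breast tumor", "cancer of the breast"]]

print(shingles("cancer of the breast".split()))
print(find_synonyms("Cancer of the breast treatment", GROUPS))
# {'cancer of the breast': ['breast cancer', 'breast neoplasm', 'breast tumor']}
```

Because the four-word shingle “cancer of the breast” is tried as a whole, the multi-word synonym is found even though the tokenizer has already split the query into single words.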

Future work

Note that the parser does not currently expand synonyms if the user input contains complex query operators (i.e. AND, OR, +, and -). This is a TODO for a future release.

I also plan on getting in contact with the Solr/Lucene folks to see if they would be interested in including my changes in an upcoming version of Solr. So hopefully patching won’t be necessary in the future.

In general, I think my approach to synonyms is more principled and less error-prone than the built-in solution. If nothing else, though, I hope I’ve demonstrated that making synonyms work in Solr isn’t as cut-and-dried as one might think.

As usual, you can fork this code on GitHub!

65 responses to this post.

  1. Posted by lulucas on December 10, 2012 at 3:39 PM


    Thank you very much for your nice work !

    I try to run your class but I get the following error:
    GRAVE: java.lang.IllegalAccessError: class cannot access its superclass
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(

    This message is the same on a solr server 3.5.0 and 3.6.1.

    Do you have any idea about the problem?

    Thanks in advance.


    • Hi there,

      Could you provide some more info about your situation? Full stacktrace, servlet container you’re running (e.g. Tomcat, Jetty), your Java version, etc.

      I found this page, which says it may be a problem with Tomcat 7. Let me know if modifying your web.xml fixed the problem for you.

      For the record, I used Java 6, Tomcat 6, and Solr 3.5.0.



      • Posted by lulucas on December 11, 2012 at 9:10 AM

        Thank you for your reply,

        My environment :
        JAVA : java version “1.6.0_18″ (OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1))
        TOMCAT : 7.0.23
        SOLR : 3.6.1

        I then added the tag metadata in the /opt/tomcat-master-dev/conf/web.xml file :

        The full Java stack Trace :
        11 déc. 2012 08:54:31 org.apache.solr.common.SolrException log
        GRAVE: java.lang.IllegalAccessError: class cannot access its superclass
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(
        at Method)
        at java.lang.ClassLoader.loadClass(
        at java.lang.ClassLoader.loadClass(
        at java.lang.ClassLoader.loadClassInternal(
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(
        at org.apache.solr.core.SolrResourceLoader.findClass(
        at org.apache.solr.core.SolrCore.createInstance(
        at org.apache.solr.core.SolrCore.createInitInstance(
        at org.apache.solr.core.SolrCore.initPlugins(
        at org.apache.solr.core.SolrCore.initPlugins(
        at org.apache.solr.core.SolrCore.initPlugins(
        at org.apache.solr.core.SolrCore.initQParsers(
        at org.apache.solr.core.SolrCore.(
        at org.apache.solr.core.CoreContainer.create(
        at org.apache.solr.core.CoreContainer.load(
        at org.apache.solr.core.CoreContainer.load(
        at org.apache.solr.core.CoreContainer$Initializer.initialize(
        at org.apache.solr.servlet.SolrDispatchFilter.init(
        at org.apache.catalina.core.ApplicationFilterConfig.initFilter(
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(
        at org.apache.catalina.core.ApplicationFilterConfig.(
        at org.apache.catalina.core.StandardContext.filterStart(
        at org.apache.catalina.core.StandardContext.startInternal(
        at org.apache.catalina.util.LifecycleBase.start(
        at org.apache.catalina.core.ContainerBase.addChildInternal(
        at org.apache.catalina.core.ContainerBase.addChild(
        at org.apache.catalina.core.StandardHost.addChild(
        at org.apache.catalina.startup.HostConfig.deployDescriptor(
        at org.apache.catalina.startup.HostConfig$
        at java.util.concurrent.Executors$
        at java.util.concurrent.FutureTask$Sync.innerRun(
        at java.util.concurrent.ThreadPoolExecutor.runWorker(
        at java.util.concurrent.ThreadPoolExecutor$

        The problem is still there…

      • I think I’ve found the culprit. It looks like in Solr 3.5.0, ExtendedDismaxQParserPlugin defines “queryFields” to be package-private, whereas in Solr 3.6.1, it becomes private. So my code fails because I try to access the superclass’s “queryFields” (here).

        I’ve filed this as a bug on the GitHub page, and I’ll try to address it as soon as I can. Unfortunately this seems like a pretty nasty problem, and I don’t know if there’s any way I can make it work without gutting the offending lines of code. But I’ll look into it.

  2. Posted by lulucas on December 11, 2012 at 12:25 PM

    I also think the problem is this private access to the field “queryFields”.
    I hope you can find a solution (my Java skills stop here, sorry).
    Thank you very much for your promptness !


  3. I was just about to start experimenting with this when I found your blog via Google (small world? well maybe not that many people are playing with Solr and MeSH after all). Very helpful post!

    – Matthias (working in the same project as Nolan)


  4. Posted by Kannan on January 4, 2013 at 10:30 PM

    Thanks much for the nice work.
    We also were not happy with the solr synonym handling and were thinking along the lines of writing a query parser and glad that we found this post.

    We are using Solr 4.0. I got the source code for the SynonymExpandingExtendedDismaxQParserPlugin from GitHub, fixed a few package imports (from solr to lucene, modeled after ExtendedDismaxQParserPlugin), and fixed a couple of issues caused by the new ResourceLoaderAware class, but then ran into the private queryFields error.

    I would appreciate any update you have on this issue.

    Also curious to find out if you contacted solr/lucene folks and
    their reaction.


    • I actually was able to fix the queryFields problem, but I just hadn’t merged my changes into the master branch yet, because during my testing of Solr 3.6.1 I ran into a (probably) unrelated issue. It’s committed to the master branch now.

      If you have other changes for Solr 4.0, though, then please merge your code with mine (up to the latest commit), test it, and if it works in Solr 4.0, then please send me a pull request in GitHub. Hopefully we can make this work for Solr 3.5, 3.6, 3.6.1, and 4.0 in one fell swoop!

      I did contact the Lucene/Solr folks. It remains to be seen if they will integrate my changes, but if not, then I’m also happy to just consider this as a separate plug-in.


      • Posted by Kannan on January 8, 2013 at 9:31 PM

        Thanks. Once we have a working version of the code with Solr 4.0, we will send it to you.

  5. Posted by AB on January 9, 2013 at 5:23 AM

    hi Nolan

    I set out with great anticipation to use your jar since we have clients of various industries that run into the need for exactly this kind of parser. I built the jar from your github source with this environment:

    Apache Maven 3.0.4 (r1232337; 2012-01-17 00:44:56-0800)
    Maven home: C:\apache\apache-maven-3.0.4
    Java version: 1.7.0_07, vendor: Oracle Corporation
    Java home: C:\Program Files\Java\jdk1.7.0_07\jre
    Default locale: en_US, platform encoding: Cp1252
    OS name: “windows 7″, version: “6.1”, arch: “amd64″, family: “windows”

    Solr 3.5

    but I keep hitting that error running with Jetty:
    Jan 08, 2013 4:43:47 PM org.apache.solr.common.SolrException log
    SEVERE: java.lang.IllegalAccessError: class
    ingExtendedDismaxQParser cannot access its superclass

    And now for the good news, it runs fine with Tomcat 6 apparently
    Jan 08, 2013 4:50:54 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader
    INFO: Adding ‘file:/C:/apache/solr350/contrib/querytimesynonymparser/hon-lucene-synonyms-1.0.jar’ to classloader
    Jan 08, 2013 4:50:54 PM org.apache.solr.core.SolrConfig

    Thank you for creating this little gem :)



  6. Posted by AB on January 9, 2013 at 10:56 PM

    Sadly, I rejoiced a little too soon. Tomcat loads it really nicely, but as soon as you open up your solr in the browser, back to square one with a fat HTTP500 error

    HTTP Status 500 – Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml
    ————————— java.lang.IllegalAccessError: class cannot access its superclass at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(Unknown Source) at Source) at Source) at$100(Unknown Source) at

    etc …


  7. @Nolan:

    > I did contact the Lucene/Solr folks. It remains to be seen if they will integrate my
    > changes, but if not, then I’m also happy to just consider this as a separate plug-in.

    Please do file the JIRA issue with your patch and set Fix Version to 4.2. Thanks.


  8. Posted by tandula on January 15, 2013 at 12:01 AM

    By placing the files apache-solr-solrj-3.5.0.jar & apache-solr-core-3.5.0.jar in /example/lib , I was able to get it past the original complaints.

    Now the compile error is the following.

    SEVERE: org.apache.solr.common.SolrException: Error Instantiating QParserPlugin,
    solr.SynonymExpandingExtendedDismaxQParserPlugin is not a
    at org.apache.solr.core.SolrCore.createInstance(
    at org.apache.solr.core.SolrCore.createInitInstance(
    at org.apache.solr.core.SolrCore.initPlugins(
    at org.apache.solr.core.SolrCore.initPlugins(
    at org.apache.solr.core.SolrCore.initPlugins(
    at org.apache.solr.core.SolrCore.initQParsers(
    at org.apache.solr.core.SolrCore.(

    QUESTION: Nolan, which exact solr version did you write this jar against? Perhaps I’m using a version of QParser that is different than yours :) I’ll get this thing to work



  9. Hi guys,

    I think this discussion is starting to outgrow WordPress. I’d prefer for us to document this in GitHub, so I created a new GitHub issue to track all the compatibility problems with Solr 3.6.0, 3.6.1, and 4.0.

    I can confirm seeing the same errors myself. And I would greatly appreciate any Solr guru who could help shed some light on these problems. :)

    – Nolan


  10. Posted by Okke Klein on January 18, 2013 at 12:47 PM

    If you need help, I suggest you create a JIRA issue in the Solr project, as Otis also suggested. You can upload non-working patches so others can try to fix them.

    Good luck. Looking forward to testing this feature.


  11. Posted by Kevin Schaper on January 18, 2013 at 8:26 PM

    Thanks for the blog post! It’s nice to have a confirmation of the strange behavior I’m getting with edismax, q.op=AND and synonym expansion of multi-word ontology terms.

    An option I’ve considered is to move the term expansion out of Solr and into the application layer – that way I can carry forward your idea of a reduced boost for synonyms and also reduce the boost score based on how far down the DAG tree a child ontology term is.

    I’m working with Solr for a model organism database – always nice to find more bio/med search people!


  12. Posted by AB on January 29, 2013 at 1:37 AM

    I got it to work in run-time, and I want to just compliment you on how stunningly it works. I tested it with 3 word phrases and it broke it into single words, and then to ensure it finds the phrases exactly, I added quotes in the synonyms_extended.txt file. It worked like a charm.

    How you ask?
    I put the code for your synonym parser in with the rest of the actual Solr 3.5 source, and compiled the entire Solr source. Then I took that solr.war file and replaced my old one.

    The issues that we keep seeing here have to do with how Jetty (and Tomcat, for that matter) loads the solr.war file, which is the envelope for the jars. Since the war file had no previous knowledge of the synonym parser being an extension of the QParser, well, you get the picture.

    My next project is how to write it without needing to recompile solr.war, so that it’s just a synonymexpander.jar file I can drop into the lib folder, restart solr and be done and ready to go with the stock solr.war.

    Oh yes, I also added a requestHandler to do the work, so my solr queries are clean like this localhost:8983/solr/ab/?q=shoes

    Hon Synonyms

    2<-1 5<-2 6<90%
    text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4


    hope this helps


    • Hi there,

      Yes, this helps a ton. Unfortunately, until we find a solution using a drop-in JAR file, it sounds like your “compile along with Solr” solution is the only reasonable one. Since this is the case, though, and since I’ve gotten so much positive feedback on this code, I will file an issue in the Solr JIRA itself to add my code as a patch.

      I’m also taking the liberty of adding your comments to the GitHub issue. Please, folks, restrict your comments/bugfixes/me-too’s to GitHub! :) Thanks.


  13. All right, compatibility bug fixed, documentation improved, and JIRA issue filed. Let’s see if this code can make it into an upcoming version of Solr.


  14. Interesting post.
    How does Solr handle synonyms when the term is ambiguous? Let’s take “book” which can be both a noun and a verb. Is Solr going to return synonyms of both meanings?

    Unless Solr fully disambiguates in context, which, to the best of my knowledge, it doesn’t, expanding query terms to synonyms will, in this case return even more irrelevant results than just using the query term.


    • Dear Philippe

      Do you know of an algorithm that deals with slang and words in context like you describe? As an example, Google returns results for Books when searching for Book, and Booking sites when you search for “Book It” or Booking.

      Solr has a well-oiled Protected Words factory to deal with cases such as “Book It” and Booking. Enter what you need into the protected words file, ensure the type of the field is using the Protected Words factory, and voilà, Solr won’t stem and dismember the terms, but they have to appear as-is in your data for this to work. Then you can make synonyms between booking, book it, booking it, making a booking, etc.



  15. Our own semantic platform (Inbenta) handles disambiguation in context and it does it for several languages. Your example of “booking”, “book it”, “booking it” is fine but it is, pardon my French, trivial. In all these instances “book” has only one meaning, the meaning of “reserve”, so there’s no need to disambiguate. What if the Content has documents containing expressions like “ship a book” and “book a ship”?


    • Posted by nimnio on February 27, 2013 at 12:27 AM

      Neither Solr nor Nolan’s open-source improvement deals with synonym semantics, nor do they claim to. Your “constructive criticism” seems no more than advertising.


  16. ??? I am sorry but I responded to a post by AB asking for an algorithm doing what I describe. It’s not advertising, it’s answering a question.


  17. Have you been able to do any work on the below statement?

    “Note that the parser does not currently expand synonyms if the user input contains complex query operators (i.e. AND, OR, +, and -). This is a TODO for a future release.”

    I am interested in using the Synonym handler you have created, but I need to add some additional information to the query to get the correct results. I have your code setup and it works as described thanks… just need a little more.


    • I currently don’t have any intention to solve the complex query operator problem (if you’re using complex operators, you’re probably not a naïve end-user who needs synonym expansion in the first place!), but you are welcome to submit a patch on GitHub if you’d like. :)


  18. Thanks a lot for the great work!
    Is it safe to use your lib with Solr 4.2?


  19. Could you explain what the effect of back pack=>backpack is?


  20. Posted by jhsuh on June 25, 2013 at 2:08 AM

    This is exactly what I have been looking for for my search system.
    But I have one question about a problem I found.
    This query parser uses the raw query itself twice.
    For example, I search for the query “ny”, which has synonyms. When I check the parsed query, I find “ny” in it twice.
    The same happens with the query “new york”. When I check the scores of the matched documents, they get an advantage from this.
    I think the raw query doesn’t need to be searched again in the synonym part of the query. Please consider this.

    $ tail synonyms.txt
    Television, Televisions, TV, TVs
    #notice we use “gib” instead of “GiB” so any WordDelimiterFilter coming
    #after us won’t split it into two words.

    #Synonym mappings can be used for spelling correction too
    pixima => pixma
    new york, nyc,ny, new york city
    dog => hound, pooch, canis familiaris, man’s best friend
    fc => football club
    ml => major league

    +((((Title_t:ny) (Title_t:beaches))~2) ((+(((Title_t:"new york city") (Title_t:beaches))~2)) (+(((Title_t:"new york") (Title_t:beaches))~2)) (+(((Title_t:ny) (Title_t:beaches))~2)) (+(((Title_t:nyc) (Title_t:beaches))~2))))

    +((((Title_t:new) (Title_t:york) (Title_t:beaches))~3^1.1) (((+(((Title_t:"new york city") (Title_t:beaches))~2)) (+(((Title_t:"new york") (Title_t:beaches))~2)) (+(((Title_t:ny) (Title_t:beaches))~2)) (+(((Title_t:nyc) (Title_t:beaches))~2)))^0.9))

    7.8 = (MATCH) sum of:
      3.3000002 = (MATCH) sum of:
        1.1 = (MATCH) weight(Title_t:new in 10044) [MinimalScoreDefaultSimilarity], result of:
          1.1 = score(doc=10044,freq=1.0 = termFreq=1.0), product of:
            1.1 = queryWeight, product of:
              1.0 = idf(docFreq=83, maxDocs=144370)
              1.1 = queryNorm
            1.0 = fieldWeight in 10044, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              1.0 = idf(docFreq=83, maxDocs=144370)
              1.0 = fieldNorm(doc=10044)
        1.1 = (MATCH) weight(Title_t:york in 10044) [MinimalScoreDefaultSimilarity], result of:
          1.1 = score(doc=10044,freq=1.0 = termFreq=1.0), product of:
            1.1 = queryWeight, product of:
              1.0 = idf(docFreq=46, maxDocs=144370)
              1.1 = queryNorm
            1.0 = fieldWeight in 10044, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              1.0 = idf(docFreq=46, maxDocs=144370)
              1.0 = fieldNorm(doc=10044)
        1.1 = (MATCH) weight(Title_t:beaches in 10044) [MinimalScoreDefaultSimilarity], result of:
          1.1 = score(doc=10044,freq=1.0 = termFreq=1.0), product of:
            1.1 = queryWeight, product of:
              1.0 = idf(docFreq=5, maxDocs=144370)
              1.1 = queryNorm
            1.0 = fieldWeight in 10044, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              1.0 = idf(docFreq=5, maxDocs=144370)
              1.0 = fieldNorm(doc=10044)
      4.5 = (MATCH) sum of:
        4.5 = (MATCH) sum of:
          3.6 = (MATCH) weight(Title_t:"new york" in 10044) [MinimalScoreDefaultSimilarity], result of:
            3.6 = score(doc=10044,freq=1.0 = phraseFreq=1.0), product of:
              1.8 = queryWeight, product of:
                2.0 = idf(), sum of:
                  1.0 = idf(docFreq=83, maxDocs=144370)
                  1.0 = idf(docFreq=46, maxDocs=144370)
                0.9 = queryNorm
              2.0 = fieldWeight in 10044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = phraseFreq=1.0
                2.0 = idf(), sum of:
                  1.0 = idf(docFreq=83, maxDocs=144370)
                  1.0 = idf(docFreq=46, maxDocs=144370)
                1.0 = fieldNorm(doc=10044)
          0.9 = (MATCH) weight(Title_t:beaches in 10044) [MinimalScoreDefaultSimilarity], result of:
            0.9 = score(doc=10044,freq=1.0 = termFreq=1.0), product of:
              0.9 = queryWeight, product of:
                1.0 = idf(docFreq=5, maxDocs=144370)
                0.9 = queryNorm
              1.0 = fieldWeight in 10044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                1.0 = idf(docFreq=5, maxDocs=144370)
                1.0 = fieldNorm(doc=10044)


  21. Posted by jhsuh on June 25, 2013 at 6:45 AM

    I have one more question about synonym_edismax.
    We set the field types in schema.xml like below, and when we index or query each field, Solr works with those settings.

    But for synonym_edismax, I can set only one tokenizer. How can I do a synonym search across fields that have different types (and are tokenized differently)?


  22. Reblogged this on My Blog.


  23. Great post. Have you read the post from Mike McCandless, ‘Lucene’s TokenStreams are actually graphs!’? I think the problem you are having with highlighting extra terms is an effect of a problem he called sausagization! LOL. I have a similar situation: we are adding semantic tags that can span terms, and, like you, I came to almost the exact same solution, although I didn’t think to play around with the boost. That’s a great idea.


    • Thanks for the link. “Sausagization” perfectly describes the highlighting problem I mentioned above.

      Since the author says that fixing this problem would require some fairly low-level changes in Lucene’s indexer, it sounds like my solution is still pretty useful for the time being!


  24. Posted by aowen on September 30, 2013 at 4:13 PM

    i’m using solr 4.3.1 and want to use your solution. unfortunately i get the following output with http://localhost:8983/solr/select/?q=dog&debugQuery=on&qf=text&defType=synonym_edismax&synonyms=true


    (+(() (((+())/no_coord) ((+())/no_coord) ((+())/no_coord) ((+())/no_coord))))/no_coord

    +(() ((+()) (+()) (+()) (+())))


    why isn’t it something like: +(DisjunctionMaxQuery((text:dog))…….


  25. Posted by aowen on September 30, 2013 at 4:45 PM

    sorry, everything is fine. it was a typo in the request


  26. Posted by Aaron on October 10, 2013 at 9:21 PM

    Hi Nolan,

    First of all, thanks for the great job done! Your synonyms expanding parser is very helpful and definitely a big step forward for the synonym handling in Solr. Were you able to make stemming work with your parser? What if you want “dogs” to be expanded without explicitly specifying plural form in the synonyms list?


    • Unfortunately, since the synonym expansion occurs before the query is processed by the query analyzer, plurals aren’t handled automatically. You’d have to either:

      1) manually include plurals in your synonyms file, or

      2) tweak the “synonym analyzer” and add a stemmer in the tokenization/analysis chain. (Your mileage may vary; I’ve never experimented with it myself.)

      Hope that helps!


  27. Posted by Peter Robsen on October 17, 2013 at 8:44 PM

    Hi, great job; this is very useful for us. I’m now trying to get the synonyms from a web service that we already have. Can you help me achieve this?


  28. Posted by Steve on December 20, 2013 at 8:06 AM

    Nice work, thanks. Do you know whether anyone’s tried applying a similar design for a raw Lucene environment/repository (i.e. subclassing QueryParser)?


    • No, although if you wanted to replicate what I did, the code itself would be pretty straightforward. Basically, I just built up a lattice, expanding the query into every possible synonym combination (e.g. dog bites -> dog nibbles, pooch bites, pooch nibbles).
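      The lattice idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the plugin's code: the `SYNONYMS` table and the `expand_query` helper are hypothetical, and the sketch handles single-word terms only, whereas real synonym files also contain multi-word entries such as "new york".

```python
from itertools import product

# Hypothetical synonym table for illustration only; the plugin reads its
# synonyms from a Solr synonyms file rather than from a dict like this.
SYNONYMS = {
    "dog": ["pooch"],
    "bites": ["nibbles"],
}


def expand_query(query, synonyms):
    """Return every combination of the original terms and their synonyms."""
    # Each position in the query holds the original term plus its alternatives.
    alternatives = [[term] + synonyms.get(term, []) for term in query.split()]
    # The Cartesian product enumerates every path through the lattice.
    return [" ".join(path) for path in product(*alternatives)]


expansions = expand_query("dog bites", SYNONYMS)
```

      The first entry in `expansions` is the original query itself, followed by the combinations from the example above. The number of combinations is the product of the alternatives at each position, so it grows quickly for long queries with large synonym lists.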


  29. Hi Nolan,
    Very nice tool, thanks. Is it possible to supply synonyms in some way other than saving them as a text file, as described in the SynonymFilterFactory documentation? For example, what if the synonyms or other related concepts are in XML files (as a thesaurus)? If it is possible, where could such a synonym source be read in?


  30. Posted by tandula on January 23, 2014 at 4:20 PM

    hi Nolan

    It’s Jan 2014, Solr 4.6 is out, and in the Wiki, yours is the first mentioned and recognized 3rd party Query parser extension.

    Way to go Nolan!
    AB – Anria Billavara


  31. Posted by Alexander on February 10, 2014 at 3:45 PM


    This plugin gives the following error
    TooManyClauses: maxClauseCount is set to 5100
    at org.apache.solr.handler.component.QueryComponent.prepare(
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(
    at org.apache.solr.core.SolrCore.execute(
    at org.apache.solr.servlet.SolrDispatchFilter.execute(
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(
    at org.eclipse.jetty.servlet.ServletHandler.doScope(
    at org.eclipse.jetty.server.session.SessionHandler.doScope(
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(
    at org.eclipse.jetty.server.Server.handle(
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(
    at org.eclipse.jetty.http.HttpParser.parseNext(
    at org.eclipse.jetty.http.HttpParser.parseAvailable(
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
    at org.eclipse.jetty.util.thread.QueuedThreadPool$

    I think this is because the synonym list is too big. Is it possible to use this kind of expansion at index time to avoid this error?


  32. Posted by Manuel Le Normand on February 18, 2014 at 8:46 AM

    Hi Nolan,
    I’m thinking of adapting your query parser to deal with word similarities. These are quasi-synonyms with different similarity scores (which serve as their boosts), determined by how many hierarchy levels separate the terms. The data has a hierarchical structure built of bags of words (marked by {}) and terms, for example:

    {countries} => {countries in europe}, {countries in asia}, australia, usa
    {countries in europe} => france, england
    {countries in asia} => china, japan, israel
    {celebrity} => britney spears, madonna

    We expect a query to expand as follows: q={!synonym_edismax}(countries or celebrity) ==> q=max( (australia usa)^2 OR (france, england)^1 OR (china, japan, israel)^1) OR max(britney spears or madonna)

    Before I dive deep into the code, I wanted to know whether this query parser could be adapted for this need.

    Second, what do you think of contributing the code to the Solr project, so you wouldn’t have to worry about maintaining it?



    • Unfortunately, the synonym plugin wasn’t really designed for hierarchical synonyms or meronyms/holonyms as you describe. You can attach different synonym files to different fields in order to achieve what you want, but it’d be kinda hacky.

      As for Solr, apparently it’ll be included in 4.8!


      • Posted by Rimas on April 16, 2014 at 6:39 AM

        Can you explain how different synonym files can be attached to different fields using your synonym plugin?

      • Check out the example config. Where it says “myCoolAnalyzer”, you can add multiple tags with whatever analyzers you want, e.g. “myCoolAnalyzer2”, “myCoolAnalyzer3”, etc. Then when you query, you just specify the analyzer you want to use with the synonyms.analyzer option. Unfortunately you’ll have to do a separate query for each analyzer, though.
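        As a rough sketch of the one-query-per-analyzer approach, the request URLs could be built like this in Python. The field names ("title", "body"), the field-to-analyzer mapping, and the `build_synonym_queries` helper are hypothetical; the endpoint is the local one quoted in an earlier comment, and the request parameters are the ones mentioned in this thread.

```python
from urllib.parse import urlencode

# Assumed local Solr endpoint, as quoted in an earlier comment in this thread.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_synonym_queries(q, field_to_analyzer):
    """Build one request URL per (field, analyzer) pair."""
    urls = []
    for field, analyzer in field_to_analyzer.items():
        params = {
            "q": q,
            "qf": field,
            "defType": "synonym_edismax",
            "synonyms": "true",
            # Selects which configured synonym analyzer this request uses.
            "synonyms.analyzer": analyzer,
        }
        urls.append(SOLR_SELECT + "?" + urlencode(params))
    return urls


# Hypothetical mapping from fields to analyzer names defined in solrconfig.xml.
urls = build_synonym_queries(
    "dog", {"title": "myCoolAnalyzer", "body": "myCoolAnalyzer2"}
)
```

        Each URL would then be requested separately and the results merged on the client side, which is the extra work the reply above warns about.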

  33. Posted by Bernd Wölfel on July 28, 2014 at 9:06 AM

    Hi Nolan,

    this is an awesome plugin, exactly what I need for my project. Unfortunately I am not able to get it to run with the example config from GitHub. I’m using solr-4.6.1 with your latest plugin version (on a Sun-Java6 VM).

    The rest of the configuration is pretty simple and the plugin itself works nicely, but it cannot find the one and only “MyCoolAnalyzer” with its defined values from solrconfig. It always gives me NoAnalyzerSpecified/AnalyzerNotFound no matter what I do (I even got the Java source and statically set the name to look for to “MyCoolAnalyzer”). The collection just stays empty.

    I guess it is a pretty dumb mistake, but if anyone could give me a pointer in the right direction I’d appreciate it very much.

    Thank you!



    • Posted by Bernd Wölfel on July 31, 2014 at 7:56 AM

      Sorry, I figured my issue out as well (misallocated synonyms file) – Thank you so much for the great Plugin!


    • Hi Bernd, if you check out the code you will see that there are Python scripts to set up a little Solr server and run the tests automatically. It even downloads the Solr binaries for you, so you don’t have to do anything except have Python and Java installed. If you compare that setup to yours, I’m sure you’ll see what the issue is! Cheers.


  34. Hi Nolan,

    Thanks for the great plugin. I tried your parser. It’s expanding the queries as expected, but I’m getting zero results. I should also mention that I’m just starting out with Solr and Lucene.

    Here’s a part of the result for the query

    responseHeader: {
    status: 0,
    QTime: 18,
    params: {
    debugQuery: “on”,
    synonyms.synonymBoost: “1.1”,
    q: “crowd finance”,
    qf: “text”,
    synonyms: “true”,
    wt: “json”,
    synonyms.originalBoost: “1.2”,
    defType: “synonym_edismax”
    response: {
    numFound: 0,
    start: 0,
    docs: [ ]
    debug: {
    rawquerystring: “crowd finance”,
    querystring: “crowd finance”,
    parsedquery: “SynonymExpandingExtendedDismaxQuery(custom(boost(+(((text:crowd) (text:finance))~2) (title:”crowd finance”~10^25.0) (concept_tags:”crowd finance”~10^25.0) (content:”crowd finance”~10),product(pow(int(share_count),const(1.5)),0.08/(3.16E-8*float(ms(const(1406797200000),date(earliest_known_date)))+0.05)))))”,
    parsedquery_toString: “custom(boost(+(((text:crowd) (text:finance))~2) (title:”crowd finance”~10^25.0) (concept_tags:”crowd finance”~10^25.0) (content:”crowd finance”~10),product(pow(int(share_count),const(1.5)),0.08/(3.16E-8
    explain: { },
    queryToHighlight: [
    “ (text:finance))~2^1.2″,
    “ +(text:business) +(text:finance))~1) (title:”crowd business and finance”~10^25.0) (concept_tags:”crowd business and finance”~10^25.0) (content:”crowd business and finance”~10)))^1.1″,
    “ (text:funding))~2) (title:”crowd funding”~10^25.0) (concept_tags:”crowd funding”~10^25.0) (content:”crowd funding”~10)))^1.1″,
    “ (text:business) (text:finance))~3) (title:”crowd business finance”~10^25.0) (concept_tags:”crowd business finance”~10^25.0) (content:”crowd business finance”~10)))^1.1″,
    “ (text:financial))~2) (title:”crowd financial”~10^25.0) (concept_tags:”crowd financial”~10^25.0) (content:”crowd financial”~10)))^1.1″
    expandedSynonyms: [
    “crowd business and finance”,
    “crowd business finance”,
    “crowd finance”,
    “crowd finance”,
    “crowd financial”,
    “crowd funding”
    mainQueryParser: [
    [ ],
    synonymQueryParser: [
    [ ],

    Am I missing something obvious?



  35. Posted by Frédéric on August 29, 2014 at 10:00 AM

    Hey Nolan, long time no see!!!

    Guess what… I’m currently working on my medical thesis, which concerns a kind of search engine for patient documentation. The thing is supposed to be well suited to in-hospital physicians (whose effectiveness with EHRs is quite poor in general, I must admit…).

    Anyway, I was running a Google query for “mesh lucene semantic” (I think) and I ended up here! And I very much enjoyed it, I must say!!! ;)

    I’m gonna delve into it right away, so thanks for the work!

    I hope we’ll see each other again! Cheers

    Frédéric (HUG)


    • Salut Fréd! That’s awesome; glad to know you’re still doing well at the HUG. :)

      The synonyms project is still going strong, and still the most popular project on the HON’s GitHub page. So yeah, it filled a neat little void in the Solr ecosystem.

      Take care; hope your thesis goes well!

