NLP | Read the Tea Leaves

Archive for the ‘NLP’ Category

2 Apr

AI ambivalence

Posted by Nolan Lawson in Machine Learning, NLP. 28 comments

I’ve avoided writing this post for a long time, partly because I try to avoid controversial topics these days, and partly because I was waiting to make my mind up about the current, all-consuming, conversation-dominating topic of generative AI. But Steve Yegge’s “Revenge of the junior developer” awakened something in me, so let’s go for it.

I don’t come to AI from nowhere. Longtime readers may be surprised to learn that I have a master’s in computational linguistics, i.e. I studied this kind of stuff 20-odd years ago. In fact, two of the authors of the famous “stochastic parrot” paper were folks I knew at the time – Emily Bender was one of my professors, and Margaret Mitchell was my lab partner in one of our toughest classes (sorry my Python sucked at the time, Meg).

That said, I got bored of working in AI after grad school, and quickly switched to general coding. I just found that “feature engineering” (which is what we called training models at the time) was not really my jam. I much preferred to put on some good coding tunes, crank up the IDE, and bust out code all day. Plus, I had developed a dim view of natural-language processing technologies largely informed by my background in (non-computational) linguistics as an undergrad.

In linguistics, we were taught that the human mind is a wondrous thing, and that Chomsky had conclusively shown that humans have a natural language instinct. The job of the linguist is to uncover the hidden rules in the human mind that govern things like syntax, semantics, and phonology (i.e. why the “s” in “beds” is pronounced like a “z” unlike in “bets,” due to the voicing of the final consonant).

Then when I switched to computational linguistics, suddenly the overriding sensation I got was that everything was actually about number-crunching, and in fact you could throw all your linguistics textbooks in the trash and just let gobs of training data and statistics do the job for you. “Every time I fire a linguist, the performance goes up,” as a famous computational linguist said.

I found this perspective belittling and insulting to the human mind, and more importantly, it didn’t really seem to work. Natural-language processing technology seemed stuck at the level of support vector machines and conditional random fields, hardly better than the Markov models in your iPhone 2’s autocomplete. So I got bored and disillusioned and left the field of AI.

Boy, that AI thing sure came back with a vengeance, didn’t it?

Still skeptical

That said, while everybody else was either reacting with horror or delight at the tidal wave of gen-AI hype, I maintained my skepticism. At the end of the day, all of this technology was still just number-crunching – brute force trying to approximate the hidden logic that Chomsky had discovered. I acknowledged that there was some room for statistics – Peter Norvig’s essay mentioning the story of an Englishman ordering an “ale” and getting served an “eel” due to the Great Vowel Shift still sticks in my brain – but overall I doubted that mere stats could ever approach anything close to human intelligence.

Today, though, philosophical questions of what AI says about human cognition seem beside the point – these things can get stuff done. Especially in the field of coding (my cherished refuge from computational linguistics), AIs now dominate: every IDE assumes I want AI autocomplete by default, and I actively have to hunt around in the settings to turn it off.

And for several years, that’s what I’ve been doing: studiously avoiding generative AI. Not just because I doubted how close to “AGI” these things actually were, but also because I just found them annoying. I’m a fast typist, and I know JavaScript like the back of my hand, so the last thing I want is some overeager junior coder grabbing my keyboard to mess with the flow of my typing. Every inline-coding AI assistant I’ve tried made me want to gnash my teeth together – suddenly instead of writing code, I’m being asked to constantly read code (which as everyone knows, is less fun). And plus, the suggestions were rarely good enough to justify the aggravation. So I abstained.

Later I read Baldur Bjarnason’s excellent book The Intelligence Illusion, and this further hardened me against generative AI. Why use a technology that 1) dumbs down the human using it, 2) generates hard-to-spot bugs, and 3) doesn’t really make you much more productive anyway, when you consider the extra time reading, reviewing, and correcting its output? So I put in my earbuds and kept coding.

Meanwhile, as I was blissfully coding away like it was ~2020, I looked outside my window and suddenly realized that the tidal wave was approaching. It was 2025, and I was (seemingly) the last developer on the planet not using gen-AI in their regular workflow.

Opening up

I try to keep an open mind about things. If you’ve read this blog for a while, you know that I’ve sometimes espoused opinions that I later completely backtracked on – my post from 10 years ago about progressive enhancement is a good example, because I’ve almost completely swung over to the progressive enhancement side of things since then. My more recent “Why I’m skeptical of rewriting JavaScript tools in ‘faster’ languages” also seems destined to age like fine milk. Maybe I’m relieved I didn’t write a big bombastic takedown of generative AI a few years ago, because hoo boy.

I started using Claude and Claude Code a bit in my regular workflow. I’ll skip the suspense and just say that the tool is way more capable than I would ever have expected. The way I can use it to interrogate a large codebase, or generate unit tests, or even “refactor every callsite to use such-and-such pattern” is utterly gobsmacking. It also nearly replaces StackOverflow, in the sense of “it can give me answers that I’m highly skeptical of,” i.e. it’s not that different from StackOverflow, but boy is it faster.

Here’s the main problem I’ve found with generative AI, and with “vibe coding” in general: it completely sucks out the joy of software development for me.

Imagine you’re a Studio Ghibli artist. You’ve spent years perfecting your craft, you love the feeling of the brush/pencil in your hand, and your life’s joy is to make beautiful artwork to share with the world. And then someone tells you gen-AI can just spit out My Neighbor Totoro for you. Would you feel grateful? Would you rush to drop your art supplies and jump head-first into the role of AI babysitter?

This is how I feel using gen-AI: like a babysitter. It spits out reams of code, I read through it and try to spot the bugs, and then we repeat. Although of course, as Cory Doctorow points out, the temptation is to not even try to spot the bugs, and instead just let your eyes glaze over and let the machine do the thinking for you – the full dream of vibe coding.

I do believe that this is the end state of this kind of development: “giving into the vibes,” not even trying to use your feeble primate brain to understand the code that the AI is barfing out, and instead to let other barf-generating “agents” evaluate its output for you. I’ll accept that maybe, maybe, if you have the right orchestra of agents that you’re conducting, then maybe you can cut down on the bugs, hallucinations, and repetitive boilerplate that gen-AI seems prone to. But whatever you’re doing at that point, it’s not software development, at least not the kind that I’ve known for the past ~20 years.

Conclusion

I don’t have a conclusion. Really, that’s my current state: ambivalence. I acknowledge that these tools are incredibly powerful, I’ve even started incorporating them into my work in certain limited ways (low-stakes code like POCs and unit tests seem like an ideal use case), but I absolutely hate them. I hate the way they’ve taken over the software industry, I hate how they make me feel while I’m using them, and I hate the human-intelligence-insulting postulation that a glorified Excel spreadsheet can do what I can but better.

In one of his podcasts, Ezra Klein said that he thinks the “message” of generative AI (in the McLuhan sense) is this: “You are derivative.” In other words: all your creativity, all your “craft,” all of that intense emotional spark inside of you that drives you to dance, to sing, to paint, to write, or to code, can be replicated by the robot equivalent of 1,000 monkeys typing at 1,000 typewriters. Even if it’s true, it’s a pretty dim view of humanity and a miserable message to keep pounding into your brain during 8 hours of daily software development.

So this is where I’ve landed: I’m using generative AI, probably just “dipping my toes in” compared to what maximalists like Steve Yegge promote, but even that little bit has made me feel less excited than defeated. I am defeated in the sense that I can’t argue strongly against using these tools (they bust out unit tests way faster than I can, and can I really say that I was ever lovingly-crafting my unit tests?), and I’m defeated in the sense that I can no longer confidently assert that brute-force statistics can never approach the ineffable beauty of the human mind that Chomsky described. (If they can’t, they’re sure doing a good imitation of it.)

I’m also defeated in the sense that this very blog post is just more food for the AI god. Everything I’ve ever written on the internet (including here and on GitHub) has been eagerly gobbled up into the giant AI katamari and is being used to happily undermine me and my fellow bloggers and programmers. (If you ask Claude to generate a “blog post title in the style of Nolan Lawson,” it can actually do a pretty decent job of mimicking my shtick.) The fact that I wrote this entire post without the aid of generative AI is cold comfort – nobody cares, and likely few have gotten to the end of this diatribe anyway other than the robots.

So there’s my overwhelming feeling at the end of this post: ambivalence. I feel besieged and horrified by what gen-AI has wrought on my industry, but I can no longer keep my ears plugged while the tsunami roars outside. Maybe, like a lot of other middle-aged professionals suddenly finding their careers upended at the peak of their creative power, I will have to adapt or face replacement. Or maybe my best bet is to continue to zig while others are zagging, and to try to keep my coding skills sharp while everyone else is “vibe coding” a monstrosity that I will have to debug when it crashes in production someday.

I honestly don’t know, and I find that terrifying. But there is some comfort in the fact that I don’t think anyone else knows what’s going to happen either.

14 May

Using distributed search handlers in Solr 3.6.2

Posted by Nolan Lawson in NLP. Tagged: lucene, solr. Leave a comment

TL;DR: disable lazy-loading on the /spell handler if you’re using Solr 3.6.2 and distributed (i.e. sharded) search.

I discovered an interesting bug in Solr 3.6.2 the other day, so I thought I’d share it here.

While upgrading a distributed Solr system from version 3.5.0 to 3.6.2, everything worked as expected with minimal changes to the configuration, except for the /spell search handler.

The other search handlers I’d defined (the standard /select and a /suggest for autosuggestions) responded just fine, but a distributed /spell failed with a mysterious error in the logs:

Server returned HTTP response code: 400 for URL: 
http://localhost:8080/solr-search/shard2/spell
    ?shards=localhost%3A8080%2Fsolr-search%2Fshard1%2Clocalhost%3A8080%2Fsolr-search%2Fshard2
    &shards.qt=%2Fspell[...]

The logs from the server side weren’t any more helpful:

SEVERE: org.apache.solr.common.SolrException: Bad Request

Bad Request

request: http://localhost:8080/solr-search/shard1/spell
 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
 at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:129)
 at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:103)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:680)

Running Wireshark to capture HTTP requests on localhost dug up the following silent error message from Tomcat:

HTTP Status 400 - isShard is only acceptable with search handlers
    type: Status report
    message: isShard is only acceptable with search handlers
    description: The request sent by the client was syntactically 
        incorrect (isShard is only acceptable with search handlers).

Huh! I could have sworn it was a SearchHandler. Checking the configuration for the /spell handler in solrconfig.xml, I indeed found the following:

<requestHandler name="/spell" class="solr.SearchHandler" lazy="true">
...
</requestHandler>

After some fruitless fiddling with the configuration, I finally gave up and launched a remote debugger in Eclipse. Grepping the Solr 3.6.2 source code showed that the exception was thrown from SolrCore.java lines 1373-1374:

if (req.getParams().getBool(ShardParams.IS_SHARD,false) 
        && !(handler instanceof SearchHandler))
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "isShard is only acceptable with search handlers");

In the Eclipse debugger, I saw that when this line was invoked, the handler object was actually a LazyRequestHandlerWrapper (an internal class in 3.6.2) rather than a SearchHandler, causing the if statement to evaluate to false. According to the LazyRequestHandlerWrapper documentation:

  /**
   * The <code>LazyRequestHandlerWrapper</core> wraps any {@link SolrRequestHandler}.  
   * Rather then instanciate and initalize the handler on startup, this wrapper waits
   * until it is actually called.  This should only be used for handlers that are
   * unlikely to be used in the normal lifecycle.
   * 
   * You can enable lazy loading in solrconfig.xml using:
   * 
   * <pre>
   *  &lt;requestHandler name="..." class="..." startup="lazy"&gt;
   *    ...
   *  &lt;/requestHandler&gt;
   * </pre>
   * 
   * This is a private class - if there is a real need for it to be public, it could
   * move
   * 
   * @version $Id: RequestHandlers.java 1306137 2012-03-28 03:30:52Z dsmiley $
   * @since solr 1.2
   */

Aha! So that’s why my /spell handler was behaving strangely – it was the only one using lazy loading.

Well, I never needed the lazy loading that badly anyway. Setting lazy="false" in the XML configuration immediately corrected the problem.

Interestingly, it appears that the offending code has been commented out in Solr 4.3.0 (SolrCore.java:1812-1814):

// TODO: this doesn't seem to be working correctly and causes problems with the example server and distrib (for example /spell)
// if (req.getParams().getBool(ShardParams.IS_SHARD,false) && !(handler instanceof SearchHandler))
//   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,"isShard is only acceptable with search handlers");

I suspect that the reason this code backfires is that, in example/solr/collection1/conf/solrconfig.xml, the /spell handler is defined as lazy="true". This would mean that the commented code could be fixed by simply publicizing the LazyRequestHandlerWrapper class and accounting for handler wrapping in the if statement:

SolrRequestHandler trueHandler = handler instanceof LazyRequestHandlerWrapper 
        ? ((LazyRequestHandlerWrapper)handler).getWrappedHandler()
        : handler;

if (req.getParams().getBool(ShardParams.IS_SHARD,false) && !(trueHandler instanceof SearchHandler))
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,"isShard is only acceptable with search handlers");

I’ve already submitted this as a patch to Solr. We’ll see if the developers agree with my hunch.

31 Oct

Better synonym handling in Solr

Posted by Nolan Lawson in NLP. Tagged: information retrieval, lucene, nlp, query expansion, solr, synonyms. 88 comments

Update: Download the plugin on Github.

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

A Rover by any other name would taste just as sweet.

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

The SynonymFilterFactory

Solr provides a cool-sounding SynonymFilterFactory, which can be a fed a simple text file containing comma-separated synonyms. You can even choose whether to expand your synonyms reciprocally or to specify a particular directionality.

For instance, you can make “dog,” “hound,” and “pooch” all expand to “dog | hound | pooch,” or you can specify that “dog” maps to “hound” but not vice-versa, or you can make them all collapse to “dog.” This part of the synonym handling is very flexible and works quite well.

Where it gets complicated is when you have to decide where to fit the SynonymFilterFactory: into the query analyzer or the index analyzer?

Index-time vs. query-time

The graphic below summarizes the basic differences between index-time and query-time expansion. Our problem is specific to Solr, but the choice between these two approaches can apply to any information retrieval system.

Index-time vs. query-time expansion.

Your first, intuitive choice might be to put the SynonymFilterFactory in the query analyzer. In theory, this should have several advantages:

Your index stays the same size.
Your synonyms can be swapped out at any time, without having to update the index.
Synonyms work instantly; there’s no need to re-index.

However, according to the Solr docs, this is a Very Bad Thing to Do(™), and apparently you should put the SynonymFilterFactory into the index analyzer instead, despite what your instincts would tell you. They explain that query-time synonym expansion has two negative side effects:

Multi-word synonyms won’t work as phrase queries.
The IDF of rare synonyms will be boosted, causing unintuitive results.
Multi-word synonyms won’t be matched in queries.

This is kind of complicated, so it’s worth stepping through each of these problems in turn.

Multi-word synonyms won’t work as phrase queries

At Health On the Net, our search engine uses MeSH terms for query expansion. MeSH is a medical ontology that works pretty well to provide some sensible synonyms for the health domain. Consider, for example, the synonyms for “breast cancer”:

breast neoplasm
breast neoplasms
breast tumor
breast tumors
cancer of breast
cancer of the breast

So in a normal SynonymFilterFactory setup with expand=”true”, a query for “breast cancer” becomes:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)

…which matches documents containing “breast neoplasms,” “cancer of the breast,” etc.

However, this also means that, if you’re doing a phrase query (i.e. “breast cancer” with the quotes), your document must literally match something like “breast cancer breast breast” in order to work.

Huh? What’s going on here? Well, it turns out that the SynonymFilterFactory isn’t expanding your multi-word synonyms the way you might think. Intuitively, if we were to represent this as a finite-state automaton, you might think that Solr is building up something like this (ignoring plurals):

What you reasonably expect.

But really it’s building up this:

The spaghetti you actually get.

And your poor, unlikely document must match all four terms in sequence. Yikes.

Similarly, the mm parameter (minimum “should” match) in the DisMax and EDisMax query parsers will not work as expected. In the example above, setting mm=100% will require that all four terms be matched:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)~4

The IDF of rare synonyms will be boosted

Even if you don’t have multi-word synonyms, the Solr docs mention a second good reason to avoid query-time expansion: unintuitive IDF boosting. Consider our “dog,” “hound,” and “pooch” example. In this case, a query for any one of the three will be expanded into:

+(dog hound pooch)

Since “hound” and “pooch” are much less common words, though, this means that documents containing them will always be artificially high in the search results, regardless of the query. This could create havoc for your poor users, who may be wondering why weird documents about hounds and pooches are appearing so high in their search for “dog.”

Index-time expansion supposedly fixes this problem by giving the same IDF values for “dog,” “hound,” and “pooch,” regardless of what the document originally said.

Multi-word synonyms won’t be matched in queries

Finally, and most seriously, the SynonymFilterFactory will simply not match multi-word synonyms in user queries if you do any kind of tokenization. This is because the tokenizer breaks up the input before the SynonymFilterFactory can transform it.

For instance, the query “cancer of the breast” will be tokenized by the StandardTokenizationFactory into [“cancer”, “of”, “the”, “breast”], and only the individual terms will pass through the SynonymFilterFactory. So in this case no expansion will take place at all, assuming there are no synonyms for the individual terms “cancer” and “breast.”

Edit: I’ve been corrected on this. Apparently, the bug is in the Lucene query parser (LUCENE-2605) rather than the SynonymFilterFactory.

Back to the drawing board

All of this wackiness led me to the conclusion that Solr’s built-in mechanism for synonym expansion was seriously flawed. I had to figure out a better way to get Solr to do what I wanted.

In summary, index-time expansion and query-time expansion were both unfeasible using the standard SynonymFilterFactory, since they each had separate problems:

Index-time

Index size balloons.
Synonyms don’t work instantly; documents must be re-indexed.
Synonyms cannot be instantly replaced.
Multi-word synonyms cause arbitrary words to be highlighted.

Query-time

Phrase queries do not work.
IDF values for rare synonyms are artificially boosted.
Multi-word synonyms won’t be matched in queries.

I began with the assumption that the ideal synonym-expansion system should be query-based, due to the inherent downsides of index-based expansion listed above. I also realized there’s a more fundamental problem with how Solr has implemented synonym expansion that should be addressed first.

Going back to the “dog”/”hound”/”pooch” example, there’s a big issue usability-wise with treating all three terms as equivalent. A “dog” is not exactly the same thing as a “pooch” or a “hound,” and certain queries might really be looking for that exact term (e.g. “The Hound of the Baskervilles,” “The Itchy & Scratchy & Poochy Show”). Treating all three as equivalent feels wrong.

Also, even with the recommended approach of index-time expansion, IDF weights are thrown out of whack. Every document that contains “dog” now also contains “pooch”, which means we have permanently lost information about the true IDF value for “pooch”.

In an ideal system, a search for “dog” should include documents containing “hound” and “pooch,” but it should still prefer documents containing the actual query term, which is “dog.” Similarly, searches for “hound” should prefer “hound,” and searches for “pooch” should prefer “pooch.” (I hope I’m not saying anything controversial here.) All three should match the same document set, but deliver the results in a different order.

Solution

My solution was to move the synonym expansion from the analyzer’s tokenizer chain to the query parser. So instead of expanding queries into the crazy intercrossing graphs shown above, I split it into two parts: the main query and the synonym query. Then I combine the two with separate, configurable weights, specify each one as “should occur,” and then wrap them both in a “must occur” boolean query.

So a search for “dog” is parsed as:

+((dog)^1.2 (hound pooch)^1.1)

The 1.2 and the 1.1 are the independent boosts, which can be configured as input parameters. The document must contain one of “dog”, “hound,” or “pooch”, but “dog” is preferred.

Handling synonyms in this way also has another interesting side effect: it eliminates the problem of phrase queries not working. In the case of “breast cancer” (with the quotes), the query is parsed as:

+(("breast cancer")^1.2 (("breast neoplasm") ("breast tumor") ("cancer ? breast") ("cancer ? ? breast"))^1.1)

(The question marks appear because of the stopwords “of” and “the.”)

This means that a query for “breast cancer” (with the quotes) will also match documents containing the exact sequence “breast neoplasm,” “breast tumor,” “cancer of the breast,” and “cancer of breast.”

I also went one step beyond the original SynonymFilterFactory and built up all possible synonym combinations for a given query. So, for instance, if the query is “dog bite” and the synonyms file contains:

dog,hound,pooch
bite,nibble

… then the query will be expanded into:

dog bite
hound bite
pooch bite
dog nibble
hound nibble
pooch nibble

Try it yourself!

The code I wrote is a simple extension of the ExtendedDisMaxQueryParserPlugin, called the SynonymExpandingExtendedDisMaxQueryParserPlugin (long enough name?). I’ve only tested it to work with Solr 3.5.0, but it ought to work with any version that has EDisMax.

Edit: the instructions below are deprecated. Please follow the “Getting Started” guide on the Github page instead.

Here’s how you can use the parser:

Drop this jar into your Solr’s lib/ directory.
Add this definition to your solrconfig.xml:

<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <!-- TODO: figure out how we wouldn't have to define this twice -->
  <str name="luceneMatchVersion">LUCENE_34</str>
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">solr.StandardTokenizerFactory</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.ShingleFilterFactory</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.SynonymFilterFactory</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">my_synonyms_file.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
    <!-- add more analyzers here, if you want -->
  </lst>
</queryParser>

The analyzer you see defined above is the one used to split the query into all possible alternative synonyms. Synonyms that are exactly the same as the original query will be ignored, so feel free to use expand=true if you like.

This particular configuration (StandardTokenizerFactory + ShingleFilterFactory + SynonymFilterFactory) is just the one that I found worked the best for me. Feel free to try a different configuration, but something really fancy might break the code, so I don’t recommend going too far.

For instance, you can configure the ShingleFilterFactory to output shingles (i.e. word N-grams) of any size you want, but I chose shingles of size 1-4 because my synonyms typically aren’t longer than 4 words. If you don’t have any multi-word synonyms, you can get rid of the ShingleFilterFactory entirely.

(I know that this XML format is different from the typical one found in schema.xml, since it uses lst and str tags to configure the tokenizer and filters. Also, you must define the luceneMatchVersion a second time. I’ll try to find a way to fix these problems in a future release.)

Add defType=synonym_edismax to your query URL parameters, or set it as the default in solrconfig.xml.
Add the following query parameters. The first one is required:

Param	Type	Default	Summary
synonyms	boolean	false	Enable or disable synonym expansion entirely. Enabled if true.
synonyms.analyzer	String	null	Name of the analyzer defined in solrconfig.xml to use. (E.g. in the example above, it’s myCoolAnalyzer). This must be non-null, if you define more than one analyzer.
synonyms.originalBoost	float	1.0	Boost value applied to the original (non-synonym) part of the query.
synonyms.synonymBoost	float	1.0	Boost value applied to the synonym part of the query.
synonyms.disablePhraseQueries	boolean	false	Enable or disable synonym expansion when the user input contains a phrase query (i.e. a quoted query).

Future work

Note that the parser does not currently expand synonyms if the user input contains complex query operators (i.e. AND, OR, +, and –). This is a TODO for a future release.

I also plan on getting in contact with the Solr/Lucene folks to see if they would be interested in including my changes in an upcoming version of Solr. So hopefully patching won’t be necessary in the future.

In general, I think my approach to synonyms is more principled and less error-prone than the built-in solution. If nothing else, though, I hope I’ve demonstrated that making synonyms work in Solr isn’t as cut-and-dried as one might think.

As usual, you can fork this code on GitHub!

8 Sep

Relatedness Calculator: Part Deux

Posted by Nolan Lawson in NLP, Webapps. Tagged: grails, relatedness calculator, ui design, webapps. 2 comments

Recently I added a ton of features and improvements to the Relatedness Calculator. Not only is the new app faster, lighter, prettier, and lower-cholesterol, but it’s also made me a savvier web developer in the process. How about that! I find myself actually learning a lot about web development, and slowly correcting my shameful n00b errors from the first version.

Version One

Now, I think the first version of the Relatedness Calculator was pretty cool. Neato, even. It solved the original problem that got me excited about this project in the first place, which was:

How closely related am I to my grandma’s cousin’s daughter?

— A dude on a message board

Answering this question in an automated way required writing a parser, defining a hell of a lot of English relationships, and writing an algorithm to calculate the relatedness coefficient for arbitrarily chained relationships (e.g. “X’s X’s X’s (…) X”).

Nothing could be simpler.

As it turns out, there’s a lot you have to know about genealogy in order to write such a system. At the minimum, you have to read something like Richard Dawkins’ essay on genesmanship from The Selfish Gene. But then there are also a lot of non-obvious things you learn that go beyond genealogy 101.

For instance, I learned that there’s a whole class of relationships that can be expressed with “half” – it’s not just half-brothers and half-sisters. There are half-cousins, half-uncles, half-second cousins, and even half-great-nieces too. What they all have in common is having two common ancestors in the family tree, which get reduced to just one in the “half” versions. Neat, huh?

There’s also a sort of “relatedness addition” that goes into calculating the relatedness coefficients for chained relations. If, for instance, I was parsing “uncle’s daughter,” I discovered I could take the common ancestors for “uncle” and the common ancestors for “daughter,” then use what the math geeks call matrix addition to get the right set of common ancestors. And it worked! I have no idea if it’s justified mathematically, but it made my unit tests pass, so that’s all I care about. (Man, am I a computer programmer, or what?)

There are also some relations that don’t make sense when you chain them together – for instance, “cousin’s brother” or “father’s son.” Examples like these kept screwing up the relatedness addition, because in fact they were redundant ways of expressing a much simpler concept. I.e., “cousin’s brother” is really just a cousin, and “father’s son” is really just you, or possibly your brother. The app had to handle all these errors to avoid barfing up nonsensical results, such as arbitrarily reduced relatedness coefficients for “father’s son’s father’s son’s father’s son’s…” etc.

So essentially, version one was an academic exercise. I barely paid any attention to the UI, and instead just made sure that the core logic was bulletproof. But all that changed with the second version.

User Confusion

From looking at the logs, I could see that users were typing a lot of unexpected queries that completely broke the system:

step-cousin twice removed on my mother’s side

john’s uncle’s wife

identical twin

father’s 2nd wife’s son

4th great-grandfather’s granddaughter

The worst part about a user typing something the system doesn’t understand is that they get a big fat error page. Most users will probably leave the site the first time something like that happens. After all, they went to all the trouble of typing a query out, and the system puked. Who would want to waste their time on a site like that?

Yes, my anonymous Internet application knows all about your friend John.

Obviously some of these queries are things I can fix – “identical twin” and “twice removed” are all features I added in version two. But other things don’t really make any sense in terms of calculating relatedness. Who is “John”? How am I supposed to know if you’re biologically related to his wife? And you’re not biologically related to your step-cousins, either!

I decided the biggest problem here was not that the system was throwing error messages, but rather that users didn’t know what kind of input the system expected. This is a classic autocompletion problem. People hate staring at a blank page. So starting with version two, I added a nice little autocomplete box.

Showing the space of possible queries goes a long way to help increase the user’s comprehension of the system. “Do I have to start with the word ‘my’?” “Do I have to put a space after the word ‘great’?” “Do I have to type ‘grandfather’ or can I just put ‘gramps’?” All these questions are answered in real time, as you type. It also makes playing around with the system rewarding and fun.

Memory usage

One of the biggest problems I ran into after adding autocompletion was OutOfMemoryErrors. I had used a simple trie object to suggest autocompletions, but this was really taxing the available memory on my Amazon EC2 Micro instance (613 MB).

So the first thing I did was add a 1G swap file on the OS. This had the effect of degrading the performance, while avoiding a complete application crash. I knew I still had some work to do.

I deduced that the new trie object was the culprit, given that the system didn’t crash until after the first autocomplete request was made. So using NetBeans to profile the memory usage, I slowly started slimming down the trie.

The original trie used a classic node-based data structure, similar to a linked list:

public class Trie<T> {
    private TrieNode root = new TrieNode();

    private class TrieNode {

        T value;
        Map<Character, TrieNode> next = Maps.newHashMap();

    }

A parameterized Trie<T> could hold any object in the T, and out of laziness I was storing the entire String that led to a leaf TrieNode. So first off I axed that, since the String itself could just be reconstructed from the breadcrumb trail of Characters anyway.

Next, I cut the memory usage of the Trie roughly in half by replacing all instances of HashMap<Character, TrieNode> with a custom SparseCharArray<TrieNode>. My SparseCharArray is basically just an Object array that has some of the behavior of a Map when I need it. I had already used a similar data structure for my Japanese Name Converter.

In the end, the new trie greatly improved the memory usage of the application, to the point where I didn’t even really need the swap space anymore. (But it’s nice to keep, just in case!)

Browser optimization

Having very little experience in building webapps, I didn’t have the faintest clue how to optimize one. Luckily there are a lot of smart folks who have already figured most of this stuff out, and who publish nice, condensed best-practice guides. Also, the debugging tools are incredibly slick these days, and between Firebug and Chrome’s Developer Tools, I had everything I needed to get under the hood and start tinkering around.

As it turned out, the biggest drain on the application’s performance was just resource management. Grails automates a lot of that, and it’s nice when it works, but when it doesn’t, you have to get your hands dirty. I eventually noticed I had a misbehaving plugin that was requesting Javascript resources that didn’t exist, so I was getting 302 (“moved temporarily”) redirects. On my teensy Micro instance running in Oregon, halfway across the globe from me, this was adding a nearly half-second roundtrip for each request. Once I fixed the links, though, the resources could be cached in the browser and therefore loaded in almost 0 time after the first request.

Another basic problem with redirects came from the root of the app itself. Linking to “/relatedness-calculator” instead of “/relatedness-calculator/” (i.e. without the slash), amazingly, also added about half a second to the request, which would occur anytime the user clicked the logo. So the addition of a single character saved me half a second of response time.

De-uglification

I don’t think there’s much that needs to be said here. Just take a look at the before and after pictures:

Before.

After.

Yep, I finally broke down and learned how to use Gimp. Bubbly, glossy Web 2.0 text for the win!

Another subtle visual change I made was to the family tree graph. To make the relationships in the graph clearer, I added emphasis to the nodes that were most relevant to the user’s query. For instance, in the case of “grandpa’s cousin,” the system draws a green border around “you,” “your grandpa,” and “your grandpa’s cousin.”

I don’t think I’ve ever met a single one of my grandpa’s cousins.

Conclusion

In the end, I’m pretty pleased with the changes I made to the Relatedness Calculator, and I’m proud of the end product. The only thing I find that I’m missing is the thrill I always got from Android development, from getting my code immediately into users’ hands and finding out whether an app was a winner or a dud. With Android development, users always seemed to quickly find my apps and give me feedback on them, even if I never did any marketing.

For instance, even my all-time least popular app, App Tracker, has 4,000 downloads, 50 ratings, 30 comments, and a handful of emails I’ve received from interested users. But with the Relatedness Calculator, it’s been out for three months and I haven’t heard a peep about it. It doesn’t seem to rank highly in Google’s search results, and according to the logs it’s only gotten about 250 query hits.

So I suppose the next step with the Relatedness Calculator is to figure out how to get people to use the damn thing. “Search engine optimization” and all that. My first instinct is to start emailing around to genealogy sites to see if any of them would be interested in linking to me, or maybe even running a standalone version on their own servers.

Alternatively, I could just make an Android version and see if my clout on the Play Store helps push it up in the search results. But I’m hesitant to do that, because this isn’t a product that makes a whole lot of sense as a mobile app. And plus, the mobile site works just fine.

In any case, I don’t plan on abandoning the Relatedness Calculator anytime soon. It’s a nice, simple project that gives me an opportunity to write some gnarly regexes while learning a thing or two about web development. And anyway, I’ve got an app that can figure out your relatedness to your identical twin’s children. How cool is that?

2 Jun

Comparing boost methods in Solr

Posted by Nolan Lawson in NLP. Tagged: boost, information retrieval, lucene, solr. 11 comments

Note: I decided to put the summary and conclusion first, for the benefit of people stumbling across this article from a search engine. You guys might not want to read a wall of text. For everyone else who’s interested in the justification for these conclusions, keep reading.

Summary of boost methods

Boost Method, with Example	Type	Input	Works With
`{!boost b}`	Multiplicative	Function	`lucene` `dismax` `edismax`
q={!boost b=myBoostFunction()}myQuery	Multiplicative	Function	`lucene` `dismax` `edismax`
`{!boost b}` with variables	Multiplicative	Function	`lucene` `dismax` `edismax`
q={!boost b=$myboost v=$qq} &myboost=myBoostFunction() &qq=myQuery	Multiplicative	Function	`lucene` `dismax` `edismax`
`bq` (boost query)	Additive	Query	`dismax` `edismax`
q=myQuery &bq=_val_:”myBoostFunction()“	Additive	Query	`dismax` `edismax`
`bf` (boost function)	Additive	Function	`dismax` `edismax`
q=myQuery &bf=myBoostFunction()	Additive	Function	`dismax` `edismax`
`boost`	Multiplicative	Function	`edismax`
q=myQuery &boost=myBoostFunction()	Multiplicative	Function	`edismax`

Conclusions (TL;DR)

Prefer multiplicative boosting to additive boosting.
Be careful not to confuse queries with functions.

Recently I inherited a Solr project. Having never used Solr or Lucene before, but being well-versed in the dark arts of computational linguistics (from ye olde university days, anyway), I was eager to roll up my sleeves and get acquainted with it. I’d seen the formulas and proofs and squiggly stuff before – now I wanted to get my hands on something that really works.

And as I turns out, Lucene/Solr is a pretty slick piece of software. After over 10 years of development, it’s basically become a Swiss army knife for anything related to information retrieval. It’s got a bazillion different methods for parsing your queries, caching search results, tokenizing your stored text… It slices, it dices. But like any mature open-source project, it’s also got some inconsistencies and odd bits of historical baggage. Some of this is clear from the documentation, some of it isn’t.

One area that was especially unclear to me was “query boosting.” It’s a common scenario when building a search engine: you want to apply a boost function based on some static document attribute. For instance, maybe you want to give more preference to recent documents, or maybe you want to apply a PageRank score. The goal is to give your query results a gentle “nudge” in a certain direction, without completely throwing the TF-IDF score out with the bathwater.

As it turns out, there’s a good way of doing this in Solr. In fact, there’s more than one way. Let me explain.

In the Solr FAQs, the primary means for boosting queries is given as the following:

q={!boost b=myBoostFunction()}myQuery

It would be straightforward enough if this were the only method. But the DisMax query parser docs also mention bq, the “boost query” parameter, and bf, the “boost function” parameter. Furthermore, the ExtendedDisMax parser docs mention a third parameter, simply called boost, which they boast is “a multiplier rather than an addend, improving your boost results.” They also assert backwards compatibility with bq and bf.

At this point, my head was spinning. The Javadoc for Lucene’s Similarity.java describes just one simple boost function. The formulas in that document make for pretty thick reading, but if you have some experience in IR, it’s at least something you can wrap your head around. But now it looks like we’ve got 4 different boost functions. Which one should you pick?

Well, in the code base I inherited, we wanted to boost the logarithm of a static attribute called “relevancy score,” which was a precomputed, query-independent value attached to each document. To boost this value, the previous developer had decided to use the {!boost b} syntax. So for the query “foo,” our parameter q would be:

{!boost b=log(relevancy_score)}foo

This seemed to work reasonably well, but I wanted to experiment with the other methods. In particular, I wanted to see if I could abstract away the boost and keep it in a separate parameter, rather than doing ugly string manipulation of the q variable.

So I set up a simple test to compare all the different ways of applying boosts in Solr. These tests were run on Solr 3.5.0, using an index with about 4 million documents crawled from the web. I tested the three most popular query parsers – lucene, dismax, and edismax – and tried all four boost methods. For good measure, I also threw in a slightly different formulation of the {!boost b} method, which looks like this:

q={!boost b=$boostParam v=$qq} &boostParam=... &qq=...

… where boostParam and qq can be any string; they’re just variable references.

For each boost method, I queried 1000 documents and took the MD5 sum of each result set, in order to figure out which queries were identical. I tested several queries to ensure that my findings were consistent. The script I wrote is on GitHub if you want to check my work.

Below are my results for the query “diabetes” (my documents were healthcare-related), plus color-coding to show which result sets were identical. I also tried to give meaningful names to the result sets, based on what I could gleam from the Solr documentation.

Boost Method	Lucene Parser	DisMax Parser	EDisMax Parser
Basic (no boost)	No change	No change	No change
q=diabetes	No change	No change	No change
`{!boost b}`	Multiplicative boost	Multiplicative boost	Multiplicative boost
q={!boost b=log(relevancy_score)}diabetes	Multiplicative boost	Multiplicative boost	Multiplicative boost
`{!boost b}` with variables	Multiplicative boost	Multiplicative boost	Multiplicative boost
q={!boost b=$myboost v=$qq} &myboost=log(relevancy_score) &qq=diabetes	Multiplicative boost	Multiplicative boost	Multiplicative boost
`bq` (boost query)	No change	Additive boost	Some other additive boost?
q=diabetes &bq=log(relevancy_score)	No change	Additive boost	Some other additive boost?
`bf` (boost function)	No change	Boost function, additive	Boost function, additive
q=diabetes &bf=log(relevancy_score)	No change	Boost function, additive	Boost function, additive
`boost`	No change	No change	Multiplicative boost
q=diabetes &boost=log(relevancy_score)	No change	No change	Multiplicative boost

(Don’t worry about the “multiplicative” and “additive” stuff – we’ll get to that later.) Using debugQuery=on, we can see how Solr is parsing these queries. This helps make a lot more sense out of the results pattern:

Boost Method	Parsed Query
Basic	`text:diabetes`
`{!boost b}` or `boost`	`BoostedQuery(boost( text:diabetes, log(double(relevancy_score))))`
`bq` with DisMax	`+DisjunctionMaxQuery( (text:diabetes)) () text:log text:relevancy_score`
`bq` with EDisMax	`+DisjunctionMaxQuery( (text:diabetes)) (text:log text:relevancy_score)`
`bf` with DisMax/EDisMax	`+DisjunctionMaxQuery( (text:diabetes)) FunctionQuery(log(double(relevancy_score)))`

A few insights leap out from looking at these tables. First off, it’s a relief to see that {!boost b} does indeed work the same with or without the variables. I think the variables are nice, because they abstract away the boost function from the query. The syntax is a little verbose, though.

Second, I was obviously barking up the wrong tree with bq (“boost query”), because it parses my function like a query. I.e., it’s literally looking for text containing “log” and “relevancy_score.” I realized later that this is because bq takes a query, not a function. Now, bq may be useful for cases where you’d want to boost a particular query – for instance, say you’ve got a sweetheart deal with Sony, so you want to add bq=manufacturer:sony^2. But it’s not useful for boosting static attributes.

Also, according to this thread on the Solr mailing list, bq and bf are essentially two sides of the same coin. Any query can be expressed as a function (using _val_:"..."), and any function can be expressed as a query (using query({!v=...})). So bq and bf are functionally equivalent, and historically one was just a shortcut to the other. Chris Hostetter, an original Solr contributor, fills us in on the story:

[T]he existence is entirely historic. I added bq because i needed it, and then i added bf because the _val_:”…” syntax was anoying [sic].

Third, it’s interesting to note that bq actually behaves differently with the DisMax parser vs. the EDisMax parser. The Lucid Imagination documentation suggests that they should be the same:

the additive boost functions of DisMax (bf and bq) are also supported

… but apparently, EDisMax behaves slightly differently from DisMax, because it automatically conjoins the “log” and “relevancy_score” tokens, which changes the results. That’s something worth considering if you’re already making use of bq.

So finally, that just leaves a proper analysis of the “multiplicative boost” (shown in green) and the “boost function, additive” (shown in blue). Both seem reasonable, so which one is the right solution?

From looking at the parsed queries, it seems that here we’ve finally found the multiplicative/additive split alluded to in the documentation. The bf (“boost function”) simply runs two separate queries – the main query and the boost query – and then takes the disjunction of the two using DisjunctionMaxQuery. That is, it just adds the scores together.

The {!boost b} and boost methods, on the other hand, apply a true multiplicative boost, using BoostedQuery. That is, they multiply the boost function’s score by whatever score would normally be spit out. This method is more faithful to the Lucene Javadoc for Similarity.java, and it seems to be the recommended choice, given how dismissively the word “additive” is tossed around in the documentation.

So basically, this is the boost you’re looking for. If you’re using the default lucene parser or the dismax parser, go with the {!boost b} method. If you’re using edismax, though, take advantage of the nice boost parameter and use that instead.

Read the Tea Leaves Software and other dark arts, by Nolan Lawson

Archive for the ‘NLP’ Category

AI ambivalence

Still skeptical

Opening up

Conclusion

Using distributed search handlers in Solr 3.6.2

Better synonym handling in Solr

The SynonymFilterFactory

Index-time vs. query-time

Multi-word synonyms won’t work as phrase queries

The IDF of rare synonyms will be boosted

Multi-word synonyms won’t be matched in queries

Other problems

Back to the drawing board

Solution

Try it yourself!

Future work

Relatedness Calculator: Part Deux

Version One

User Confusion

Memory usage

Browser optimization

De-uglification

Conclusion

Comparing boost methods in Solr

Conclusions (TL;DR)

Recent Posts

About Me

Archives

Tags

Links