Archive for June, 2012

Introducing the Relatedness Calculator

I’m releasing a new open-source app today. It’s called the Relatedness Calculator. It’s a pretty simple app: what it does is take the name of a relative, written in plain English, and calculate the relatedness coefficient for that relative. It even accepts complex English descriptions, like “father’s half-brother’s granddaughter.” Plus, it draws a family tree, so that you can visualize the relationship. It’s pretty fun if you’re into genealogy!

Why did I write this app? Well, the inspiration actually came from a strange place: a message board. On this message board, a guy was wondering if anyone though it would be weird to date a distant relative of his. He said she was his grandma’s cousin’s daughter, and although that’s a pretty big distance in the family tree, he was worried about the social stigma if the two of them were to get married. Having a layman’s interest in genetics, I did the math for him and told him that he was only 1.5% related to her. And if he compared that to cousins (12.5%), which is kind of the borderline of acceptability for most cultures, he could see that it’s not really a big deal.

Doing the math for such a far-flung relation, though, was kind of tough. I had to go back and re-read some of Richard Dawkin’s excellent The Selfish Gene to brush up on the calculations. And then I wanted to make sure that my math was right, so I wrote some Java classes to help me out. Then I started writing unit tests. Then I started writing a parser, and… well, sometimes when you’re moderately autistic, as many programmers are, you start to get carried away. So I did, and that became a standalone Java library.

At this point I knew I wanted to make an app out of this project, because I thought it was neat enough to interest a lot of people (and not just those who want to date their relatives). I could have easily built an Android app out of it, but I’ve already written 8 Android apps, and anyway I didn’t think it was the right platform. Nobody’s going to download such a trivial app to their phone. Also, there didn’t seem to be any general-purpose relatedness calculators on the web, so I thought: why not make my own?

This is my first published webapp, so in many ways it’s kind of a “hello world” for me. I used Grails, because it seemed like a nice framework, and Groovy is close to Java. Astute Grails fans will notice that I didn’t even bother to change the default theming, but that’s typically what I do when I write apps. I don’t have a head for design, so even with my Android apps I’ve always just used the default theme. I prefer focusing on functionality over flash.

To deploy the app was a bit complicated. I hesitated between choosing Google App Engine and Amazon EC2, but in the end I decided to go with the latter, because as it turned out I needed full OS access in order to run Graphviz on the command line in a separate process, which I don’t believe is supported by GAE.

I registered for an Amazon EC2 Micro instance because it’s free for one year, so we’ll see how it holds up. Even though the app is pretty lightweight – no persistence, no heavy computation, most everything cached – I’ve already seen some significant slowdown. Since the server is free, though, I can’t really complain.

The biggest difference I find with developing webapps versus Android apps is just the barrier to entry. When it comes to costs, being an Android developer will only run you the $25 signup fee, and then after that Google takes care of the rest. For a webapp, though, I need to pay $10-15/year for the domain name, and then my micro EC2 instance (after the first free year) will cost at least $15/month, plus more if the app is popular. That’s pretty steep for hobby apps like mine, which don’t even bring in any money.

As for deployment, you also don’t need to know how to set up a LAMP stack on Android, or how to build a WAR file, or even how to use the command line. You just submit your app to Google using the web interface, and you’re done.

Plus, of course, the project structure itself is much simpler. Android apps are just Java and XML, whereas with a modern webapp, you need to make sure your Javascript plays nice with your CSS, and that your HTML plays nice with whatever’s generating it on the server side – Java, Groovy, Perl, Python, Ruby, etc. That’s a lot to take in at once.

When it comes to Android, I think an undergrad, with one or two intro Java courses under their belt, could probably build a decent app just by reading the Android tutorials. Building a webapp, on the other hand, requires a lot of additional specialized knowledge. Having come from a server-side background, and having had to figure this stuff out myself by trial and error, I can definitely say that web development is not for the faint of heart. It’s a global standard, and the W3C has tried to please everybody, so much of it feels like a Frankenstein’s monster of barely compatible compromises, rather than a holistic framework built from the ground up, like what Google did with Android.

And yet, web development is so… satisfying. I love that I can build this app, and now anybody with a browser can just point themselves to my site and make use of my code. It even works on mobile! Just play around with the CSS a bit, and you can turn your webapp into a perfectly functional, cross-platform mobile application. I suppose these observations seem pretty banal to folks who have been doing web development for forever, but for me, it’s still kind of amazing.

Anyway, the Relatedness Calculator is live now, so go check it out. And if you have improvements to make, fork it on GitHub!

Comparing boost methods in Solr

Note: I decided to put the summary and conclusion first, for the benefit of people stumbling across this article from a search engine. You guys might not want to read a wall of text. For everyone else who’s interested in the justification for these conclusions, keep reading.


Summary of boost methods

Boost Method, with Example Type Input Works With
{!boost b} Multiplicative Function  lucene
 dismax
 edismax
q={!boost b=myBoostFunction()}myQuery
{!boost b} with variables Multiplicative Function  lucene
 dismax
 edismax
q={!boost b=$myboost v=$qq}
  &myboost=myBoostFunction()
  &qq=myQuery
bq (boost query) Additive Query  dismax
 edismax
q=myQuery
  &bq=_val_:”myBoostFunction()
bf (boost function) Additive Function  dismax
 edismax
q=myQuery
  &bf=myBoostFunction()

boost
Multiplicative Function  edismax
q=myQuery
  &boost=myBoostFunction()

Conclusions (TL;DR)

  1. Prefer multiplicative boosting to additive boosting.
  2. Be careful not to confuse queries with functions.


Recently I inherited a Solr project.  Having never used Solr or Lucene before, but being well-versed in the dark arts of computational linguistics (from ye olde university days, anyway), I was eager to roll up my sleeves and get acquainted with it.  I’d seen the formulas and proofs and squiggly stuff before – now I wanted to get my hands on something that really works.

And as I turns out, Lucene/Solr is a pretty slick piece of software.  After over 10 years of development, it’s basically become a Swiss army knife for anything related to information retrieval. It’s got a bazillion different methods for parsing your queries, caching search results, tokenizing your stored text…  It slices, it dices.  But like any mature open-source project, it’s also got some inconsistencies and odd bits of historical baggage. Some of this is clear from the documentation, some of it isn’t.

One area that was especially unclear to me was “query boosting.”  It’s a common scenario when building a search engine: you want to apply a boost function based on some static document attribute.  For instance, maybe you want to give more preference to recent documents, or maybe you want to apply a PageRank score.  The goal is to give your query results a gentle “nudge” in a certain direction, without completely throwing the TF-IDF score out with the bathwater.

As it turns out, there’s a good way of doing this in Solr.  In fact, there’s more than one way.  Let me explain.

In the Solr FAQs, the primary means for boosting queries is given as the following:

q={!boost b=myBoostFunction()}myQuery

It would be straightforward enough if this were the only method. But the DisMax query parser docs also mention bq, the “boost query” parameter, and bf, the “boost function” parameter. Furthermore, the ExtendedDisMax parser docs mention a third parameter, simply called boost, which they boast is “a multiplier rather than an addend, improving your boost results.” They also assert backwards compatibility with bq and bf.

At this point, my head was spinning. The Javadoc for Lucene’s Similarity.java describes just one simple boost function. The formulas in that document make for pretty thick reading, but if you have some experience in IR, it’s at least something you can wrap your head around. But now it looks like we’ve got 4 different boost functions. Which one should you pick?

Well, in the code base I inherited, we wanted to boost the logarithm of a static attribute called “relevancy score,” which was a precomputed, query-independent value attached to each document. To boost this value, the previous developer had decided to use the {!boost b} syntax.  So for the query “foo,” our parameter q would be:

{!boost b=log(relevancy_score)}foo

This seemed to work reasonably well, but I wanted to experiment with the other methods. In particular, I wanted to see if I could abstract away the boost and keep it in a separate parameter, rather than doing ugly string manipulation of the q variable.

So I set up a simple test to compare all the different ways of applying boosts in Solr. These tests were run on Solr 3.5.0, using an index with about 4 million documents crawled from the web. I tested the three most popular query parsers – lucene, dismax, and edismax – and tried all four boost methods. For good measure, I also threw in a slightly different formulation of the {!boost b} method, which looks like this:

q={!boost b=$boostParam v=$qq}
&boostParam=...
&qq=...

… where boostParam and qq can be any string; they’re just variable references.

For each boost method, I queried 1000 documents and took the MD5 sum of each result set, in order to figure out which queries were identical. I tested several queries to ensure that my findings were consistent. The script I wrote is on GitHub if you want to check my work.

Below are my results for the query “diabetes” (my documents were healthcare-related), plus color-coding to show which result sets were identical. I also tried to give meaningful names to the result sets, based on what I could gleam from the Solr documentation.

Boost Method Lucene
Parser
DisMax
Parser
EDisMax
Parser
Basic (no boost) No change No change No change
q=diabetes
{!boost b} Multiplicative
boost
Multiplicative
boost
Multiplicative
boost
q={!boost b=log(relevancy_score)}diabetes
{!boost b} with variables Multiplicative
boost
Multiplicative
boost
Multiplicative
boost
q={!boost b=$myboost v=$qq}
  &myboost=log(relevancy_score)
  &qq=diabetes
bq (boost query) No change Additive boost Some other
additive boost?
q=diabetes
  &bq=log(relevancy_score)
bf (boost function) No change Boost function,
additive
Boost function,
additive
q=diabetes
  &bf=log(relevancy_score)
boost No change No change Multiplicative
boost
q=diabetes
  &boost=log(relevancy_score)

 

(Don’t worry about the “multiplicative” and “additive” stuff – we’ll get to that later.) Using debugQuery=on, we can see how Solr is parsing these queries. This helps make a lot more sense out of the results pattern:

Boost Method Parsed Query
Basic text:diabetes
{!boost b} or boost BoostedQuery(boost( text:diabetes, log(double(relevancy_score))))
bq with DisMax +DisjunctionMaxQuery( (text:diabetes)) () text:log text:relevancy_score
bq with EDisMax +DisjunctionMaxQuery( (text:diabetes)) (text:log text:relevancy_score)
bf with DisMax/EDisMax +DisjunctionMaxQuery( (text:diabetes)) FunctionQuery(log(double(relevancy_score)))

 

A few insights leap out from looking at these tables. First off, it’s a relief to see that {!boost b} does indeed work the same with or without the variables. I think the variables are nice, because they abstract away the boost function from the query. The syntax is a little verbose, though.

Second, I was obviously barking up the wrong tree with bq (“boost query”), because it parses my function like a query. I.e., it’s literally looking for text containing “log” and “relevancy_score.” I realized later that this is because bq takes a query, not a function. Now, bq may be useful for cases where you’d want to boost a particular query – for instance, say you’ve got a sweetheart deal with Sony, so you want to add bq=manufacturer:sony^2. But it’s not useful for boosting static attributes.

Also, according to this thread on the Solr mailing list, bq and bf are essentially two sides of the same coin. Any query can be expressed as a function (using _val_:"..."), and any function can be expressed as a query (using query({!v=...})). So bq and bf are functionally equivalent, and historically one was just a shortcut to the other. Chris Hostetter, an original Solr contributor, fills us in on the story:

[T]he existence is entirely historic. I added bq because i needed it, and then i added bf because the _val_:”…” syntax was anoying [sic].

Third, it’s interesting to note that bq actually behaves differently with the DisMax parser vs. the EDisMax parser. The Lucid Imagination documentation suggests that they should be the same:

the additive boost functions of DisMax (bf and bq) are also supported

… but apparently, EDisMax behaves slightly differently from DisMax, because it automatically conjoins the “log” and “relevancy_score” tokens, which changes the results. That’s something worth considering if you’re already making use of bq.

So finally, that just leaves a proper analysis of the “multiplicative boost” (shown in green) and the “boost function, additive” (shown in blue). Both seem reasonable, so which one is the right solution?

From looking at the parsed queries, it seems that here we’ve finally found the multiplicative/additive split alluded to in the documentation. The bf (“boost function”) simply runs two separate queries – the main query and the boost query – and then takes the disjunction of the two using DisjunctionMaxQuery. That is, it just adds the scores together.

The {!boost b} and boost methods, on the other hand, apply a true multiplicative boost, using BoostedQuery. That is, they multiply the boost function’s score by whatever score would normally be spit out. This method is more faithful to the Lucene Javadoc for Similarity.java, and it seems to be the recommended choice, given how dismissively the word “additive” is tossed around in the documentation.

So basically, this is the boost you’re looking for. If you’re using the default lucene parser or the dismax parser, go with the {!boost b} method. If you’re using edismax, though, take advantage of the nice boost parameter and use that instead.