S3 bucket listing that’s easier on the eyes

Update: I learned that Shrub exists. It’s much nicer than what I hacked up in an hour!

This is just a quick one.

I host a lot of public files on an Amazon S3 bucket. It’s my main mechanism for publishing releases of my open-source software.

So I was amazed to discover recently that S3 doesn’t have an easy way to just… show all the files. Like, not even a basic directory listing, which you could easily get with an Apache server. Just nothing.

Directory listing in Apache

This is all I wanted.

Well, that’s not entirely true. There is this ancient sample code from Amazon, made in 2008, that I found frozen in ice. But it looks like crap.

Amazon's standard S3 directory listing

“$folder$”? Seriously?

So I made a better one, using Bootstrap for styling. Below is a screenshot, and here it is in action.

My pretty Bootstrap S3 index.html

Much better.

To use it, just download the index.html file from the GitHub page and drop it into the root of your public S3 bucket. That’s it!

As an aside, isn’t it awesome how easy web development has become, thanks to modern tools like Bootstrap, jQuery, and Handlebars? That file from Amazon used 174 lines of JavaScript, whereas mine is only 99. Of course I have three external dependencies, but I use CDNs, so you probably won’t notice a difference in performance. How cool is that?
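For the curious, the general idea behind a listing page like this is simple: assuming the bucket allows public listing, a GET on the bucket root returns an XML document (a ListBucketResult), and the page just turns that into something readable. The real index.html does this in browser-side JavaScript; the following is only a minimal Python sketch of the parsing step, not the code from the GitHub page. The sample XML, bucket name, and file names are made up for illustration, and skipping keys that end in “_$folder$” is an assumption about how those placeholder entries are named.

```python
import xml.etree.ElementTree as ET

# Namespace used by S3's ListBucketResult documents.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Hypothetical listing, shaped like what a public bucket returns.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>releases/app-1.0.zip</Key><Size>2048</Size></Contents>
  <Contents><Key>releases_$folder$</Key><Size>0</Size></Contents>
  <Contents><Key>index.html</Key><Size>512</Size></Contents>
</ListBucketResult>"""


def list_files(listing_xml):
    """Return sorted (key, size) pairs, skipping folder placeholder keys."""
    root = ET.fromstring(listing_xml)
    files = []
    for item in root.findall("s3:Contents", S3_NS):
        key = item.findtext("s3:Key", namespaces=S3_NS)
        size = int(item.findtext("s3:Size", namespaces=S3_NS))
        if key.endswith("_$folder$"):
            continue  # assumed naming of the placeholder entries
        files.append((key, size))
    return sorted(files)


for key, size in list_files(SAMPLE_LISTING):
    print(f"{size:>8}  {key}")
```

Everything after that is presentation, which is where Bootstrap and Handlebars come in.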

CouchDB doesn’t want to be your database. It wants to be your web site.

I’d like to talk to you today about Couch apps. No, not CouchApps. No, not necessarily CouchApps either. The phrase has been bandied around a lot, so it’s worth explaining what I mean: I’m talking about webapps that exclusively use CouchDB for their backend, whether or not they’re actually hosted within CouchDB and regardless of how they’re built.

Yes, this is a thing people are actually trying to do, and no, it’s not crazy. The purpose of this article is to explain why.

First off, some background: CouchDB is a NoSQL database (or document store, as the cool kids say) written in Erlang. It is probably the origin of this joke. Nobody who uses CouchDB cares that it is written in Erlang, though, because the big selling point is that you can interact with it using JavaScript, JSON, and plain ol’ HTTP. It is “a database for the web,” the first of its kind.

CouchDB: it’s a database, right?

When I first started using CouchDB, I tried to treat it like any other database. I looked for connectors based on the language I was using: Ektorp for Java, AnyEvent::CouchDB for Perl, Nano for Node. And I used the web interface (bewilderingly called “Futon”) as I would a query browser – neat for debugging, but not much else. The fact that it ran in a web browser just kinda seemed like a gimmick.

Recently, though, when I was working on a Node app that didn’t go anywhere but was a fun diversion, I came across this quote by Couch apostle J. Chris Anderson:

Because CouchDB is a web server, you can serve applications directly [to] the browser without any middle tier. When I’m feeling punchy, I like to call the traditional application server stack “extra code to make CouchDB uglier and slower.”

Suddenly, I realized what CouchDB was all about.

No wait, CouchDB is a miracle

See, here I was, using client-side JavaScript to talk to Express to talk to Node to talk to Nano to talk to Couch, and at each step I was converting parameter names from underscores to camel case (or whatever my petty hangups are), all the while introducing bugs as I tried to make each layer fit nicely with the next one. And I had a working web server right in front of me! CouchDB! Why not just call it directly, you fool?! (I shout at myself in hindsight.)

I think the reason a lot of developers, like myself, might have missed this epiphany is that we’re used to treating databases as, well, databases. Whether it’s MongoDB or MySQL or Oracle, you gotta have your JDBC connector for Java and perhaps an ORM layer or maybe you just give up on Hibernate and write all the database objects yourself, so half of your code is getters and setters, but that’s OK, because that’s how we abstract the database.

You see, you can’t just have your peanut butter and jelly sandwich! You need an interface between the bread and the peanut butter, and an abstraction layer between the peanut butter and the jelly, and don’t even get me started on the jelly and the bread! What, you want your bread to get soggy?

As a programmer, I’m so used to treating databases as this other, alien thing that needs to be handled with latex gloves, separately from my application code, that reaching for the nearest library has become a reflex.

But you don’t need that with CouchDB. Because… it’s just HTTP. Any extra layers just give you another API to learn.

CouchDB is the web done right

And in fact, CouchDB is better than HTTP, because CouchDB actually fulfills the promise of what RESTful services were supposed to be, instead of the kludges we’ve come to expect. Look! DELETE actually deletes things! POST isn’t just what you use when you need to send more data than a GET allows! And HEAD and PUT are actually useful, instead of just being trivia to impress your friends at dinner parties — “Oh, did you know that there are actually more HTTP commands than just GET and POST?” “Oh, how fascinating!”

You see, once you set aside your preconceived notion of what a database is supposed to be, you can actually get rid of all your fancy connectors and just use a standard HTTP library. (I like Requests for Python.) You can even use the network debugger in a browser window to see how CouchDB does everything. It’s all just AJAX!
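To make the “it’s just HTTP” point concrete, here is a minimal sketch using only Python’s standard library (rather than Requests, to keep it dependency-free). It builds the requests without sending them, so it runs without a CouchDB server. The server address is CouchDB’s default port on localhost, and the database name, document ID, and revision string are invented for the example; the helper `couch_request` is hypothetical, not part of any library.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

COUCH_URL = "http://localhost:5984"  # assumption: a local CouchDB on its default port


def couch_request(method, db, doc_id=None, doc=None, rev=None):
    """Build (but do not send) an HTTP request for a CouchDB database or document."""
    url = f"{COUCH_URL}/{quote(db, safe='')}"
    if doc_id is not None:
        url += "/" + quote(doc_id, safe="")
    if rev is not None:
        url += "?" + urlencode({"rev": rev})
    data = json.dumps(doc).encode("utf-8") if doc is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(url, data=data, headers=headers, method=method)


# Each verb does what its name says:
create = couch_request("PUT", "crosswords", "puzzle-1", doc={"title": "Monday"})
read = couch_request("GET", "crosswords", "puzzle-1")
exists = couch_request("HEAD", "crosswords", "puzzle-1")
remove = couch_request("DELETE", "crosswords", "puzzle-1", rev="1-abc")

for req in (create, read, exists, remove):
    print(req.get_method(), req.full_url)
```

To actually send one of these, pass it to `urllib.request.urlopen`. There is no connector and no driver, just a URL, a verb, and some JSON.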

And then, if you make it this far down the rabbit hole, you might notice that CouchDB actually has a user authentication database, with password hashing. You might also notice that it’s even got roles and privileges and administrator controls. And that’s when you realize, with fascinated horror, the most insidious thing about CouchDB:

CouchDB doesn’t want to be your database; it wants to be your web site.

And finally, this is where we come back to the subject of Couch apps. A Couch app is just a pure HTML/CSS/JavaScript application, with only CouchDB as its backend, and this is the intended use case for CouchDB.

Now, think about what this proposition means to you as a developer. The web is moving more and more towards rich, client-side applications — we’ve had jQuery for years, and now we even have MVC with platforms like Ember, Knockout, and AngularJS. If CouchDB does user authentication (it’s got a “signup” button right on the home page, for crying out loud), paging, indexing, full-text search, geo data, and it all speaks HTTP, well… what does that actually leave us to do on the server?

Take a long look in the mirror, and really ask yourself! And yes, for those of you who do machine learning and scientific computing and business intelligence, I can already see you raising your hands, but for the rest of us who get paid to write Twitter clones, the answer is: not much. Your average CRUD app can magically transform into a PGPD app (PUT, GET, POST, DELETE), you can throw it up on CouchDB with some nice HTML and CSS to style it, and be at your local brewpub by 3. Or maybe you could just send the default Futon interface to the client and tell them you wrote it.

Futon interface in CouchDB

“See, it’s a collaborative document editor, and the dude on the Couch is a lazy writer…”

Now, this is the dream. And CouchDB, as it stands in 2013, actually gets us pretty damn far toward that dream. The app I’m releasing this week, Ultimate Crossword, is a testament to that. It’s a pure Couch app that only cheats by using Solr for full-text search (because I was too lazy to learn the Lucene plugin). It’s got user accounts, data aggregation, and even continuous syncing between the client and server thanks to the wonderful PouchDB.

Building this site gave me a lot of insight into what’s possible with a Couch app. However, I also got a reality check about where CouchDB still falls short of achieving the dream. I’ve got four big complaints:

1) No per-document read privileges.

This is a big one. CouchDB has three basic security modes:

  • Everyone can do everything.
  • Some people can write (some documents), everyone can read (all documents).
  • Some people can write (some documents), some people can read (all documents).

If you want to give users exclusive read access to certain documents, you have to create a separate database for each user. And unfortunately, CouchDB has no feature to do this automatically. So you need a process on the server with administrative privileges to do it, breaking the pure “Couch app” ideal. Then, if you want to aggregate the data, you actually need another process to sync to a separate database, and… well, it just gets messy. I’m strongly rooting for this feature to show up in a future CouchDB release.
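As a rough illustration of what that admin-side process has to do, here is a minimal sketch that only plans the two HTTP calls involved: create the database, then restrict its members to one user. Nothing is sent over the network. The `userdb-` prefix with a hex-encoded username is a hypothetical naming scheme chosen because database names are restricted to a small character set, and the function name is made up for the example.

```python
def per_user_db_plan(username):
    """Return the (method, path, json_body) calls an admin process would make."""
    # Hypothetical naming scheme: hex-encode the name so any username is a valid db name.
    db = "userdb-" + username.encode("utf-8").hex()
    security = {
        "admins": {"names": [], "roles": []},
        "members": {"names": [username], "roles": []},  # only this user may read
    }
    return [
        ("PUT", f"/{db}", None),                # create the database
        ("PUT", f"/{db}/_security", security),  # lock it down to the one user
    ]


for method, path, body in per_user_db_plan("joe"):
    print(method, path, body)
```

Both calls need administrator credentials, which is exactly why this can’t live in the browser.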

2) No password recovery.

This is a feature that users have come to expect from modern web sites. And despite all its security flaws (in that it makes your email a single point of failure), it seems here to stay.

Now, CouchDB can store arbitrary data in the _users database (like email addresses), and you can even do custom validation. But for the whole “give us your email, and we’ll send you a new password” thing, you’re on your own.

On the bright side, the passwords are all salted and PBKDF2-hashed, so no attacker has much to gain from cracking your Couch.
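If “salted and PBKDF2-hashed” is unfamiliar, here is a minimal sketch of the technique using Python’s standard library. It is not CouchDB’s own code: the iteration count, the salt length, the choice of SHA-1 as the underlying hash, and the field names in the returned record are all assumptions made for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 10_000  # assumption: pick a count appropriate for your hardware


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Derive a key from the password with PBKDF2 and a per-user random salt."""
    if salt is None:
        salt = os.urandom(16)
    derived = hashlib.pbkdf2_hmac("sha1", password.encode("utf-8"), salt, iterations)
    # Hypothetical record layout; only the salt and derived key are stored.
    return {"salt": salt.hex(), "iterations": iterations, "derived_key": derived.hex()}


def check_password(password, record):
    """Re-derive the key with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha1",
        password.encode("utf-8"),
        bytes.fromhex(record["salt"]),
        record["iterations"],
    )
    return hmac.compare_digest(candidate.hex(), record["derived_key"])


record = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", record))
print(check_password("wrong guess", record))
```

The salt means two users with the same password get different records, and the iteration count makes each guess expensive for an attacker who steals the database.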

3) No database migration.

This is a big one for me, although I wonder if I’m the only one. Since my early days of Java development, I’ve appreciated having Liquibase so I could track my database schema changes in version control.

In theory, CouchDB should be ideal for something like this, since it versions everything, and even its views (aka indexes) are their own documents. But I haven’t found a good recipe for managing this yet. For the time being, I just keep a series of Python scripts that create the databases.

4) Views are not indexes, and documents are not tables.

One of the nice things about SQL databases as a development paradigm is the flexibility of the SQL language itself. Decided you wanna sort by dogsLastName instead of favoritePokemon? No problem, we’ll just add an index. Too much data getting sent across the wire? No big deal, we’ll just SELECT the fields we need, instead of SELECT *.

In CouchDB, you can’t do a WHERE and you can’t just SELECT the fields you want. Any query that’s not simply fetching a whole document by its ID requires a view, and those are costly to create. I’ve worked with Couch databases containing millions of documents, and rebuilding a view would often take days. I’d have a coworker ask me to add a new filter criterion for a view, and on Friday I’d say, “Okay, it’ll be ready by Monday.” For the Ultimate Crossword app, I stupidly decided to use CouchDB to crunch the data itself, and I ended up needing five separate Couch servers running on solid state drives in order to process it in a reasonable amount of time. (CouchDB is best thought of as a single-process application. It’s append-only, so it uses one process per database file.)
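To see why a new view is so expensive, it helps to picture what one is: a map function run over every document, with the emitted keys kept in sorted order. Real CouchDB map functions are written in JavaScript and stored in design documents; the following is only a small Python simulation of the idea, and the function names, sample documents, and field values are invented for the example.

```python
from bisect import bisect_left, bisect_right


def build_view(docs, map_fn):
    """Run map_fn over EVERY document and keep the emitted rows sorted by key."""
    rows = []
    for doc in docs:
        for key, value in map_fn(doc):
            rows.append((key, doc["_id"], value))
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


def by_dogs_last_name(doc):
    """Map function: emit (key, value) only for documents that have the field."""
    if "dogsLastName" in doc:
        yield doc["dogsLastName"], doc["favoritePokemon"]


def query(rows, key):
    """Look up one key in the prebuilt, sorted view."""
    keys = [row[0] for row in rows]
    return rows[bisect_left(keys, key):bisect_right(keys, key)]


docs = [
    {"_id": "a", "dogsLastName": "Barkley", "favoritePokemon": "Eevee"},
    {"_id": "b", "dogsLastName": "Abbott", "favoritePokemon": "Mew"},
    {"_id": "c", "favoritePokemon": "Ditto"},
]
view = build_view(docs, by_dogs_last_name)
print(query(view, "Barkley"))
```

Lookups against the finished view are cheap, but wanting a different key or a different filter means a different map function, and that means another full pass over every document in the database.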

Also, the fact that you can’t SELECT arbitrary fields means you need to start thinking about how much data you want to send over the wire with each document, and how to threshold it. I found myself structuring my database into a summary/detail format early on, and modeling the documents very tightly to the user interface, in ways that just made me feel icky.

Database purists, of course, would say that this is where the latex gloves are supposed to come out. But I think that if CouchDB simply had a better system for managing migrations (see #3) and/or faster view creation, this would be a non-issue. I’d also love it if the output of a view could be put into its own database, so I could have endlessly kaleidoscoping views of my data. One more for the wishlist!

Conclusion

Despite these drawbacks, I still think CouchDB has a lot of potential to revolutionize the way people write webapps. I certainly still plan to use it for quick hacking (hell, the crossword app only took me ten days to write), and Couch’s append-only design means I’ll never have to worry about my data getting corrupted. (It’s been proudly touted as “the Honda Accord of databases.”)

But for all its developers’ humility, CouchDB is a really exciting technology. When you step back and look at it, it’s a daring, crazy proposition, a bold statement about how awesome web development would be if we could just let it be the web. It’s a raving streetside lunatic, grabbing random people by the shoulders and screaming at them with frantic urgency: “We don’t need the server anymore! We only need the database! The database is the server!”

In short, CouchDB is an expression of an ideal, a fantastical tale of science fiction told by wide-eyed dreamers. And if there’s one truth about wide-eyed dreamers, it’s this: with hindsight, their predictions either seem delusional, or inevitable.

(Psssst! Go check out my Ultimate Crossword app! It’ll make you feel bad about your user authentication!)

Update: I decided to remove the CouchDB user authentication from the Ultimate Crossword app (I realized it was irresponsible to let people collaboratively “solve” the puzzle), but it’s still a pure Couch app!

Creating a contact with multiple fields in Android

For the impatient: skip the article, download the code.

Recently, when writing a physician directory for the Canton of Geneva, I wanted to include a feature for adding a new contact. That is, I wanted a button that would pop up the “Add a new contact” screen, with various fields (such as phone number, postal address, and email address) already filled in. Piece of cake, right?

Adding a contact in the physicians app.

Unfortunately, it turns out that the Android docs and Stack Overflow are pretty bereft of clear, concise instructions for creating a contact with multiple fields of various types, e.g. work phone, mobile phone, or home fax (if such a thing still exists).

Plus, the entire ContactsContract changed in API level 11 (Honeycomb), meaning that anything written for ICS or Jelly Bean wouldn’t work in Gingerbread, and vice-versa. Oh joy.

Luckily for you — assuming you stumbled across this post after a frustrated trip to Google — I’ve written a helper class to do all the heavy lifting. It provides a simple, fluent API that works for Android version 2.1 (Eclair) through 4.2 (Jelly Bean), and it’s open source.

You create a contact like this:

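import android.content.Intent;
import android.provider.ContactsContract.CommonDataKinds.Email;
import android.provider.ContactsContract.CommonDataKinds.Phone;
import android.provider.ContactsContract.CommonDataKinds.StructuredPostal;
// plus an import for AddContactIntentBuilder, from wherever you put the helper class

// Build an "add contact" intent pre-filled with one address, three phones, and two emails
Intent intent = new AddContactIntentBuilder("Joe Blow")
    .addFormattedAddress("123 Fake Street, Springfield USA",
        StructuredPostal.TYPE_HOME)
    .addPhone("555-867-5309", Phone.TYPE_HOME)
    .addPhone("555-123-4567", Phone.TYPE_WORK)
    .addPhone("555-987-6543", Phone.TYPE_FAX_WORK)
    .addEmail("joe.blow@gmail.com", Email.TYPE_HOME)
    .addEmail("joe@blow.com", Email.TYPE_WORK)
    .build();

// Pop up the system's "Add a new contact" screen (call this from an Activity)
startActivity(intent);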

And here’s what this code produces, in both Jelly Bean and Gingerbread:

Adding a contact in Jelly Bean and Gingerbread.

Happy contact creating!

Download or fork the code from GitHub.

Web sockets with Socket.io, Node.js, and Nginx: port 80 considered harmful

TL;DR: web sockets are more widely supported on port 443 (via SSL) than port 80. Check that your ISP actually supports the port you’re using at WebSocketsTest.com.

I just wasted a good half-day trying to debug a problem with web sockets in the Socket.io plugin for Node.js running on Nginx 1.4.1.

Running the Node app locally on port 3000, everything worked swell. However, as soon as I deployed it to my production server, I was seeing about a 10-second startup delay in my app. This occurred in four different browsers – Chrome, Chromium, Firefox, and Safari.

Each one was reporting a 502 Bad Gateway response from the server. Since the client assumed web sockets weren’t supported, Socket.io was falling back on XHR polling, which is slower.

The Node logs said:

warn: websocket connection invalid
info: transport end (undefined)

The Nginx logs said:

2013/05/21 10:41:00 [error] 6117#0: *8 upstream prematurely closed 
    connection while reading response header from upstream [...]

Neither was very helpful. So after reading several forums and blog posts online, and adding lots of tweaks and hacks that didn’t work, my Nginx site configuration bloated up to this:

# node.js running locally on port 3000
upstream node {
    server 127.0.0.1:3000;
    keepalive 256; # not necessary
}


# the public nginx server instance running on port 80
server {
    listen 80;
    server_name <mysite.com>; # domain of my site
    access_log /var/log/nginx/<mysite>.access.log;
    error_log /var/log/nginx/<mysite>.error.log;
    
    # supposedly prevents 502 bad gateway error;
    # ultimately not necessary in my case
    large_client_header_buffers 8 32k;

    # run the app on the root directory
    location / {

        # the following is required for WebSockets
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_set_header X-NginX-Proxy true;
 
        # supposedly prevents 502 bad gateway error;
        # ultimately not necessary in my case
        proxy_buffers 8 32k;
        proxy_buffer_size 64k;
        
        # the following is required
        proxy_pass http://node;
        proxy_redirect off;
 
        # the following is required as well for WebSockets
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        tcp_nodelay on; # not necessary
    }
}

Since web sockets worked fine when running Node locally, I assumed it was a problem with Nginx. What eventually clued me in was that even running Node directly on port 80 yielded the same error.

Finally I realized that the problem was with my Internet provider – they were disallowing web socket connections on port 80! WebSocketsTest.com turned out to be an invaluable resource in debugging this.
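For context on what can go wrong in transit: a web socket connection starts life as an ordinary HTTP request carrying Upgrade and Connection headers (the same ones the Nginx config above passes through), and the server proves it understood by hashing the client’s key. On plain port 80, anything sitting between browser and server can read those headers and mangle or drop them. Here is a minimal sketch of the server side of the standard WebSocket handshake (RFC 6455); it is not Socket.io’s code, and the function names are made up for the example. The sample key and its expected answer are the ones given in the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def is_upgrade_request(headers):
    """True if the request headers ask to switch the connection to WebSocket."""
    lowered = {name.lower(): value for name, value in headers.items()}
    tokens = [t.strip().lower() for t in lowered.get("connection", "").split(",")]
    return lowered.get("upgrade", "").lower() == "websocket" and "upgrade" in tokens


def accept_key(client_key):
    """Compute the Sec-WebSocket-Accept value for a client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((client_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


request_headers = {
    "Upgrade": "websocket",
    "Connection": "Upgrade",
    "Sec-WebSocket-Key": "dGhlIHNhbXBsZSBub25jZQ==",
}
if is_upgrade_request(request_headers):
    print(accept_key(request_headers["Sec-WebSocket-Key"]))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

If an intermediary strips the Upgrade header, `is_upgrade_request` is false by the time the request arrives, and the handshake never completes. Over SSL the headers are encrypted, so there is nothing for a middlebox to tamper with.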

Incidentally, it would have worked if I had used SSL on port 443. According to WebSocketsTest’s aggregate data, port 443 is supported about 89% of the time, compared to 78% for port 80.

I tested a few public Wifi spots around Geneva and can confirm that this figure seems about right. I would be curious to know if support varies by geographic region, but WebSocketsTest doesn’t publish that data.

Lesson #1: even if you’re using a modern browser, your Internet provider may not be up to snuff. Check it first, before you wreck your brain!

Lesson #2: If you plan on using web sockets in a production environment, SSL is apparently the way to go. Not only is it secure, but it’s also better supported (as of 2013).

Using distributed search handlers in Solr 3.6.2

TL;DR: disable lazy-loading on the /spell handler if you’re using Solr 3.6.2 and distributed (i.e. sharded) search.

I discovered an interesting bug in Solr 3.6.2 the other day, so I thought I’d share it here.

While upgrading a distributed Solr system from version 3.5.0 to 3.6.2, everything worked as expected with minimal changes to the configuration, except for the /spell search handler.

The other search handlers I’d defined (the standard /select and a /suggest for autosuggestions) responded just fine, but a distributed /spell failed with a mysterious error in the logs:

Server returned HTTP response code: 400 for URL: 
http://localhost:8080/solr-search/shard2/spell
    ?shards=localhost%3A8080%2Fsolr-search%2Fshard1%2Clocalhost%3A8080%2Fsolr-search%2Fshard2
    &shards.qt=%2Fspell[...]

The logs from the server side weren’t any more helpful:

SEVERE: org.apache.solr.common.SolrException: Bad Request

Bad Request

request: http://localhost:8080/solr-search/shard1/spell
 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
 at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:129)
 at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:103)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:680)

Running Wireshark to capture HTTP requests on localhost dug up the following silent error message from Tomcat:

HTTP Status 400 - isShard is only acceptable with search handlers
    type: Status report
    message: isShard is only acceptable with search handlers
    description: The request sent by the client was syntactically 
        incorrect (isShard is only acceptable with search handlers).

Huh!  I could have sworn it was a SearchHandler.  Checking the configuration for the /spell handler in solrconfig.xml, I indeed found the following:

<requestHandler name="/spell" class="solr.SearchHandler" lazy="true">
...
</requestHandler>

After some fruitless fiddling with the configuration, I finally gave up and launched a remote debugger in Eclipse.  Grepping the Solr 3.6.2 source code showed that the exception was thrown from SolrCore.java lines 1373-1374:

if (req.getParams().getBool(ShardParams.IS_SHARD,false) 
        && !(handler instanceof SearchHandler))
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "isShard is only acceptable with search handlers");

In the Eclipse debugger, I saw that when this line was invoked, the handler object was actually a LazyRequestHandlerWrapper (an internal class in 3.6.2) rather than a SearchHandler. That made the instanceof check fail, so the if condition evaluated to true and the exception was thrown. According to the LazyRequestHandlerWrapper documentation:

  /**
   * The <code>LazyRequestHandlerWrapper</core> wraps any {@link SolrRequestHandler}.  
   * Rather then instanciate and initalize the handler on startup, this wrapper waits
   * until it is actually called.  This should only be used for handlers that are
   * unlikely to be used in the normal lifecycle.
   * 
   * You can enable lazy loading in solrconfig.xml using:
   * 
   * <pre>
   *  &lt;requestHandler name="..." class="..." startup="lazy"&gt;
   *    ...
   *  &lt;/requestHandler&gt;
   * </pre>
   * 
   * This is a private class - if there is a real need for it to be public, it could
   * move
   * 
   * @version $Id: RequestHandlers.java 1306137 2012-03-28 03:30:52Z dsmiley $
   * @since solr 1.2
   */

Aha! So that’s why my /spell handler was behaving strangely – it was the only one using lazy loading.

Well, I never needed the lazy loading that badly anyway. Setting lazy="false" in the XML configuration immediately corrected the problem.

Interestingly, it appears that the offending code has been commented out in Solr 4.3.0 (SolrCore.java:1812-1814):

// TODO: this doesn't seem to be working correctly and causes problems with the example server and distrib (for example /spell)
// if (req.getParams().getBool(ShardParams.IS_SHARD,false) && !(handler instanceof SearchHandler))
//   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,"isShard is only acceptable with search handlers");

I suspect that the reason this code backfires is that, in example/solr/collection1/conf/solrconfig.xml, the /spell handler is defined as lazy="true". This would mean that the commented code could be fixed by simply making the LazyRequestHandlerWrapper class public and accounting for handler wrapping in the if statement:

SolrRequestHandler trueHandler = handler instanceof LazyRequestHandlerWrapper 
        ? ((LazyRequestHandlerWrapper)handler).getWrappedHandler()
        : handler;

if (req.getParams().getBool(ShardParams.IS_SHARD,false) && !(trueHandler instanceof SearchHandler))
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,"isShard is only acceptable with search handlers");

I’ve already submitted this as a patch to Solr. We’ll see if the developers agree with my hunch.
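The underlying trap is not specific to Solr: any type check against a lazy wrapper tests the wrapper, not the thing it wraps. Here is a language-neutral illustration in Python. The class and function names are simplified stand-ins invented for the example, not Solr’s API, and the two check functions mirror the original condition and the unwrapping variant described above.

```python
class SearchHandler:
    def handle(self, query):
        return f"results for {query}"


class LazyHandlerWrapper:
    """Defers creating the real handler until it is first needed."""

    def __init__(self, factory):
        self._factory = factory
        self._wrapped = None

    def get_wrapped_handler(self):
        if self._wrapped is None:
            self._wrapped = self._factory()
        return self._wrapped

    def handle(self, query):
        return self.get_wrapped_handler().handle(query)


def check_naive(handler, is_shard):
    """Mirrors the original check: rejects a wrapped SearchHandler by mistake."""
    if is_shard and not isinstance(handler, SearchHandler):
        raise ValueError("isShard is only acceptable with search handlers")


def check_unwrapping(handler, is_shard):
    """Looks through the wrapper before testing the handler's type."""
    true_handler = (
        handler.get_wrapped_handler()
        if isinstance(handler, LazyHandlerWrapper)
        else handler
    )
    if is_shard and not isinstance(true_handler, SearchHandler):
        raise ValueError("isShard is only acceptable with search handlers")


lazy_spell = LazyHandlerWrapper(SearchHandler)
check_unwrapping(lazy_spell, is_shard=True)  # passes
try:
    check_naive(lazy_spell, is_shard=True)
except ValueError as error:
    print("naive check failed:", error)
```

One side effect worth noting: unwrapping forces the lazy handler to be instantiated, which slightly defeats the point of lazy loading for sharded requests.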

JavaScript development and the paradox of choice

There are a lot of folks trying to sell their miracle cure for the problem of writing efficient, testable, maintainable JavaScript. And there’s an equal number of folks decrying the proliferation of almost-there libraries and flash-in-the-pan frameworks.

Bootstrap. Backbone. Handlebars. Angular. I’ve spent so much time hearing snatches of conversation about these tools, and trying to make sense of them, that after a while it all starts to sound like some crazy beat poetry.

Listen:

angular backbone bootstrap cordova handlebars lawnchair underscore jasmine karma testacular grunt yeoman blueprint ember bower require sencha dojo mootools phonegap modernizr prototype meteor…

If you shouted that on a street corner while wielding a bottle of bourbon, you wouldn’t look out of place. I’ve seen the best minds of my generation destroyed trying to understand this mess.

Police Chief Wiggum and a raving derelict, from Simpsons episode 3F02.

Pictured: a seasoned JavaScript developer.

A good JavaScript is hard to find

Part of the reason there are so many snake-oil salesmen is that the cure is so badly needed. Web development is both 1) hard and 2) absolutely crucial. Facebook and Gmail have set the bar high enough that nowadays everyone expects beautiful, responsive, browser-based applications that take milliseconds to download and work on every rectangular device you can throw at them. It’s a tall order.

And the reason it feels like snake oil is that none of these tools solves the entire problem. I’ve tried many of them, hoping that I had finally found the JavaScript silver bullet, and I’ve always felt vaguely disappointed afterwards. The medicine tastes good going down, I get excited watching YouTube tutorials and reading GitHub pages and coding in a new paradigm, and then afterwards I still end up sweating feverishly over the Chrome Developer Tools, trying to center a disobedient div or figure out why my event isn’t firing. I exchange one set of problems for another.

And then I lie awake at night wondering, “Well, maybe instead of jQuery UI, I should have used YUI or Bootstrap or…?” Then it’s back for another dose of the same old medicine.

Grandpa Simpson selling some 'revitalizing tonic,' from Simpsons episode 2F07.

Step right up and put some fury in your jQuery, some zest in your CSS!

Another world is possible

This situation really frustrates me, primarily because I come from a Java background. And in Java Land, the platform is mature enough that there’s a basic suite of components that have emerged as the brain-dead, obvious solutions to common problems.

  • Need to test your app? Duh, use JUnit.
  • Need basic HTTP operations? Double duh, use Apache HTTP Client.
  • Need ORM? What are you, stupid? Use Hibernate.
  • Need a package manager? Cripes, it’s the 21st century: use Maven. Or Ivy, if you want something even simpler.

And if you use modern Java-based frameworks like Android or Grails, you’ll see that a lot of these third-party tools are already baked in: e.g. JUnit and HTTP Client for Android; Ivy, Hibernate, and JUnit for Grails. New Java developers pick up stuff like JUnit without thinking about it, as if it were just part of the language. And it practically is.

Even Java itself is mature enough that I’ve honestly felt satisfied since Java 6, and haven’t seen much need to upgrade. String switches in Java 7? Yawn, I’ve been using Enums since Java 5. Lambdas in Java 8? No need, Google Guava has me covered.

No silver JS bullet

JavaScript, on the other hand, is anything but mature. There is no “obvious” choice for third-party components – with the exception of jQuery, which is so omnipresent nowadays that it almost is JavaScript.

But aside from jQuery, there’s no one-stop solution that everyone rallies behind. For each of my “easy” questions for Java above, you get a forest of forked decision trees in JavaScript:

Mr. Burns contemplates Ketchup vs. Catsup, from Simpsons episode 2F07.

Ketchup or catsup? Karma or Selenium?

The paradox of choice

When you’ve got dozens of popular frameworks, many of them with overlapping or even conflicting goals, the choices can be overwhelming. And even after you choose one, it’s easy to end up second-guessing yourself and fretting endlessly over your decision. It’s a familiar case of the phenomenon popularly known as the paradox of choice.

So what’s a poor JavaScript developer to do?

Let’s say, for instance, that your boss tells your team that you need to write a mobile webapp. Do you choose jQuery Mobile, Sencha Touch, or Dojo Mobile? And what if you need to write a regular data-driven Ajax app? Do you choose Angular, Ember, or Backbone? Each of them has a snazzy self-laudatory website and fierce partisans on Stack Overflow. Looks like you’ve got some reading to do!

I’m new to web development, but I’ve come to believe that the only surefire solution to the problem of competing frameworks is to try them all. Not for a mission-critical project, of course – instead, you should just write a stub app. That way, you’ll discover each framework’s strengths and shortcomings, you’ll understand the problems it’s trying to solve, and you’ll be able to make an informed decision when it really counts.

In my opinion, it’s better to have three developers on your team take a week to write stub apps in three different frameworks, rather than blindly embark down a single path based on the attractiveness of a documentation page or the charisma of a YouTube evangelist.

My own stub app

I decided to try this approach recently with three frameworks I was curious about – Angular, Bootstrap, and PhoneGap. They seemed to have orthogonal goals, so in theory they should play nicely together.

My objective was to write a webapp with nice MVC features (Angular) that would look pretty (Bootstrap) and could work as a native Android or iOS app (PhoneGap). For the task itself, I chose to write an end-of-game score calculator for one of my favorite Euro-style board games, Imperial. This had the benefit of being a well-defined problem that scratched a personal itch, plus it gave me something to show off to my board gamer buddies.

For the feature specifications, the usual suspects applied. I needed to persist user data, because presumably users would want to see their saved games, or resume a game if they accidentally closed the tab. The design had to be responsive in order to accommodate multiple screen sizes, because you could imagine using this app in your browser as well as on a smartphone. It had to support deep-linking, because what if you wanted to share the game results with your friends? And of course, the UI had to present the data in a useful way: who’s in first place, who came in second, are there any ties, etc.

When I first described this project to one of my coworkers, his reaction was “that sounds like way more than a stub app.” Which is true – as soon as you exceed a certain level of complexity, you run into interesting problems, for which the frameworks are supposed to provide useful solutions. This is exactly the point of writing the app.

The end result of this experiment is the Imperial Score Calculator. It’s available as both a mobile-friendly webapp and an Android app (iOS version coming soon). And of course the source code is on GitHub.

Imperial Score Calculator desktop-sized screenshot

Imperial Score Calculator

I’ve learned something today

In the end, I’m very satisfied with the project. Not because the app itself is the best I’ve ever written (it’s not), but because it taught me some hard lessons that I’ll take with me to my next web project. For instance, here are some of the lessons learned:

  • Bootstrap does not magically make everything responsive. Do not design for the desktop and then hope that when you resize the viewport everything will “just work.” Some assembly required.
  • Angular is a godsend. It’s as if someone stepped out of a time machine and showed us what HTML6 will look like, today. I initially wrote the app in JQuery; a naïve Angular rewrite resulted in about 20% less code.
  • That being said, Angular does not instantly replace JQuery, unless you really grok directives. I still had to fall back on the good ol’ $ from time to time.
  • Lawnchair is a cool idea, but it’s poorly documented, and the asynchronous approach means you can’t save user data in the onbeforeunload event. In the end, I just went with LocalStorage.
  • PhoneGap is awesome. But man oh man, do not try debugging it without Weinre, unless you like pulling your hair out.

These are all opinions that I hold after working on this app. And I don’t expect you, dear reader, to swallow any of them just based on my say-so. The only way you can learn these lessons is to build a stub app yourself.

And perhaps you’ll have a totally different experience and come to totally different conclusions. Your mileage may vary. But you won’t know until you take the car out for a test drive.

Imperial Score Calculator mobile-sized screenshot

I ended up using a completely different layout for the mobile version.

Conclusion

JavaScript development is hard. The community is going through some growing pains, with everyone defending their cherished framework. The only solution to this problem of fragmentation and “There’s More Than One Million Ways To Do It” is time.

I do see some rays of hope in projects like Meteor and Yeoman, which are very opinionated meta-frameworks that attempt to combine multiple “best of class” JavaScript solutions into one easy package for web developers. In a sense, they’re trying to solve the problem that’s already been solved in Java Land.

But since Java Land is an increasingly irrelevant, fading power next to the ascendant hegemony that is the People’s Republic of JavaScript, the solution can’t come soon enough. In the meantime, I’ll keep writing stub apps.

KeepScore version 1.2.2: you asked for it, you got it

My favorite part about working on a software project with real-world users is the feedback I get. It’s often said by industry veterans that you don’t know what kind of app you’re building until your users actually get their hands on it, and the wisdom of this statement has proven itself to me over and over.

With KeepScore, the app itself is pretty simple – it just keeps score. And each time I write an update, I tell myself, “Welp, that’s about all it’ll ever need.” Then I get an email from an interested user with a cool new use case, and I just can’t help but code it up.

So the app keeps growing and growing, but at each step I have to be extra-careful to keep the UI itself streamlined, simple, and dead-easy to use. With KeepScore version 1.2.2 (released today), I think I’ve managed to strike a good balance between functionality and usability.

Here are the new features:

Share

As many of you requested, you can now share your KeepScore games with a friend. You can send a single game, specific games, or all your games.

Just choose the games you want, hit the “Share” button up top, and KeepScore will create a special XML file that a friend can open with KeepScore on their own device.

The “Share” feature.

This feature also allows you to back up your saved games to Dropbox, Google Drive, or your favorite cloud storage service.

Automatic backups

Speaking of backups, there’s no more do-it-yourself! KeepScore automatically saves a backup whenever you start a new game. Look for them in the “Restore” popup.

All your games are automatically backed up.

These files are gzipped, so they take up a minimal amount of space on your external storage.

Export spreadsheet

As board gamers, we’re geeks. And as geeks, we love analyzing our board game habits in a number of different ways. Who wins the most games? Who’s scored the most points? What games do we play the most often?

The “Export Spreadsheet” feature.

Rather than create a separate screen to answer each of these questions, KeepScore now offers an “Export Spreadsheet” feature. The spreadsheet may be imported into Excel, LibreOffice, Google Docs, or any document editor that accepts CSV files.

Data nerds rejoice.

Once you’ve opened up the spreadsheet, you can slice and dice the data to your geeky heart’s content.

More Holo goodness

KeepScore 1.2.2 expands support for the Android “Holo” theme, which means it will look more beautiful and more consistent across different Android devices.

Holo everywhere.

Additionally, I’ve revamped the default “Light” theme to be more clean and minimalist. It’s inspired by the “card” interface from Google Now, which I adore.

The new, Holo-style look.

And if you’re scared by change, the old look is still available in the settings under Color Scheme -> Classic Light.

The classic look.

Whose turn is it?

A perennial complaint from users is that it’s hard to know if you’ve forgotten to add a player’s score. For round-based games (like Hearts) or games where the scoring order is important (like cribbage) this can be a real nuisance.

KeepScore 1.2.2 solves this problem using a clever suggestion from my buddy Alex Lougheed: add a little bullet icon to show which player was updated last. This means you can go player-by-player, totaling up the individual scores, without ever losing your place.

The blue bullet indicates who was scored last.

And if you’re playing a game where the player order doesn’t matter, you can disable the bullet in the settings.

Zoom in on the chart

On many devices, the history chart doesn’t show up very well, because it either gets cut off or it’s too small to see. Rather than fiddle with the presets for every possible screen size, I’ve added some handy zoom in/zoom out buttons.

Zoom in, out, and all around.

Of course, pinch-to-zoom would be even nicer, but this works in a pinch (no pun intended!).

Internationalization

As always, KeepScore is localized into French and Japanese by yours truly. The German translation is out of date, though, and no other languages are currently supported.

Parlez-vous nippon?

According to the Play Store statistics, the top languages of KeepScore users are:

  1. English (United States)
  2. English (United Kingdom)
  3. French (France)
  4. English (Canada)
  5. German (Germany)
  6. English (Australia)
  7. Japanese (Japan)
  8. Dutch (Netherlands)
  9. Italian (Italy)

So if you speak German, Dutch, or Italian, and if you have some free time, please offer a translation!

Donate

This isn’t really a new feature, but I’ve added a Donate version of KeepScore to the Google Play store for $2.99.

Since I started work on this app, many people have asked where they could throw some change in my jar. But I resisted adding a Donate button, because after all, it’s just a counting app.

Recently, though, I noticed that the number of code commits to the KeepScore repository has actually surpassed any other Android app I’ve written (even CatLog and Pokédroid!). So I’ve had to admit to myself that this little counting app has morphed into quite the serious project.

So if you’d like to support KeepScore, you can download the Donate version from the Google Play Store, or just donate via PayPal.

Rest assured, though, that I will continue working on KeepScore regardless of your donations. For me, it’s just a fun app to write, and plus there’s still a lot of work to do. Next up: colors per player and battery-saving enhancements.

Personal password security that actually works

I’ve been thinking a lot about password security recently. Not because I’m paranoid, but because I’m a geek, and geeks love to optimize their lives.

Passwords are interesting, because they’re absolutely essential in the 21st century world, and almost all of us are doing it wrong. I’m not even talking about the fact that 91% of people use the 1000 most common passwords. Savvier folks using common substitutions like “MyP4ssw0rd1” are vulnerable too. And don’t even get me started on how parents always choose their kids’ names as their password. (I’m looking at you, Mom!)

As pointed out in this Wired article, even strong passwords can be insecure, because all an attacker needs is access to your email, and then they can use password recovery systems to get everything else – bank accounts, photos, blogs, you name it. It’s a scary prospect.

However, I disagree with the author’s conclusion, that passwords themselves are inherently flawed. There is always a tension between security and accessibility, and his alternatives are too impractical to ever actually gain currency. (Have the web site snap a photo of me, email the photo to three friends, and then ask at least one of them to confirm my identity? Excuse my editorial laugh.)

The inherent vulnerability in most people’s password systems is just that they use the same password everywhere. Which is understandable, because unless you’ve got a superhuman memory, you can’t remember your user name and password for the dozens of sites you use on a regular basis. Hence the convention of providing your email address as your universal login, and using the same password everywhere, which gives an attacker who cracks any rinky-dink site immediate access to your email. And thus, everything.

Think about that for a second. When you sign up with, say, the Buffalo Youth Hockey Association, you give them your email and probably the same password to access that email. And then you do the same thing for dozens of other sites. Does that make you feel secure?

Buffalo Youth Hockey Association site.  Totally secure.

The Buffalo Youth Hockey Association. No doubt a fine organization, but what about all the other ones?

My solution

So here’s my method, which I think is a nice compromise between security and convenience. It’s based on Joel Spolsky’s approach, although I use GitHub instead of Dropbox.

Essentially, I keep a PasswdSafe database file with one strong master password stored in GitHub, and my GitHub account is protected with another strong password. I don’t use either of these two passwords anywhere else, and I haven’t written them down or anything. They only exist in my head.

I sync the database file across my Linux laptop, MacBook, and Android phone using Git, and then I use PasswordGorilla and PasswdSafe for Android to read and edit the file. That way I can keep all my devices up to date as I add and change passwords.

My password file.

All of my passwords are in this single file. I use a different password across each of the sites I commonly access (email, shopping, banking, etc.), and each one is a random 16-character alphanumeric string generated by PasswordGorilla. When I need a password, I just open PasswordGorilla, enter my master password, double-click the password I want to copy, and then paste it into the web site.
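
If you’re curious what “a random 16-character alphanumeric string” amounts to, here’s a minimal Python sketch using the standard secrets module. It is only an illustration: it is not PasswordGorilla’s generator, and the function name is made up.

```python
import secrets
import string

# 26 lowercase + 26 uppercase + 10 digits = 62 possible characters
ALPHABET = string.ascii_letters + string.digits

def random_password(length=16):
    """Return a random alphanumeric password drawn from the OS's secure RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

if __name__ == "__main__":
    print(random_password())
```

The point of secrets over the plain random module is that it is intended for security-sensitive values like this one.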

Of course, this creates a single point of failure, which may seem insecure at first glance. But let’s see what a potential attacker would actually have to do to gain access to these passwords.

Smart method (unlikely)

First off, an attacker would need to get their hands on my database file. And to do this, they’d have to hack into my GitHub account.

This is just basic password guessing, and although I’m using what I consider a strong password (based on the “correct horse battery staple” system), let’s assume for the sake of argument that the attacker manages to guess it, and that they manage to guess it before GitHub starts locking them out.

Now that they have my database file, they’d have to guess my second, equally strong master password. In this case, they won’t ever be locked out, and since the file is local, they can use whatever hardware they want. So a brute-force attack seems like a good approach.

But considering that PasswordGorilla uses key stretching, this means that a brute-force attacker would have to patiently wait one or two seconds to test each password in sequence. And according to HowSecureIsMyPassword, even a weak password like “MyPassword” would take 4 billion years to crack this way (assuming 1 calculation per second).
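
The order of magnitude is easy to sanity-check. Here’s a rough Python sketch, assuming the attacker has to walk through every 10-letter mixed-case string (52 possibilities per character) at one guess per second. How HowSecureIsMyPassword really arrives at its estimate isn’t stated here, so treat the model as an assumption.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def years_to_exhaust(alphabet_size, length, guesses_per_second=1.0):
    """Worst-case years needed to try every password of the given length."""
    keyspace = alphabet_size ** length
    return keyspace / guesses_per_second / SECONDS_PER_YEAR

if __name__ == "__main__":
    # "MyPassword": 10 characters, assumed drawn from 52 upper/lowercase letters.
    print(f"{years_to_exhaust(52, 10):.2e} years")
```

Under those assumptions the answer comes out at a few billion years, which is the same ballpark as the figure quoted above.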

So even if our attacker somehow made it to my second gated entrance, it’s unlikely they would ever get past it.

Dumb method (more likely)

Unfortunately, the blind spot in all this is that there’s a much easier way for an attacker to get my passwords: just get my computer. An attacker doesn’t need access to my GitHub account if the files are already on all my laptops and smartphone. They just need to steal one of those devices.

As for the master password, once they have physical access to my computer, they can simply install a keylogger. Then, as soon as I type out my master password in PasswordGorilla, all my secrets are as good as cracked.

Permit me to postulate that this second scenario is much more likely than the first. It’s easy, it’s simple, and it doesn’t require much technical knowledge – only physical access to my devices.

Extra security measures

So in truth, the weakest link in this entire password system is just my devices themselves. Upon realizing this, I took some extra measures to ensure that my laptops and phone were sufficiently secured.

For my laptops, it was as easy as choosing a strong password and setting a timeout lock on the screensaver. OS X didn’t do this by default, but Ubuntu did.

As for my Android phone, I switched from a pattern-based lock screen to a PIN-based lock screen. This is because, as we now know, patterns can easily be cracked by just holding a phone up to the light and looking at the smudges.

You may think it’s safe, but your oily thumbs betray you.

I now use a 6-digit numeric code, with lock-out after 5 tries. At first, it was quite a bit slower than my pattern-based lock screen, but once I got used to it, I could punch in the code almost as fast as before. And now my phone is much more secure.

What’s the point?

At the end of the day, password security is only as important as the stuff you’re securing. Unlike the author of the Wired article above, I’m a pretty unknown small-time developer, so my WordPress and Twitter accounts are not terribly interesting. And as a fresh-out-of-debt college grad, there’s not a whole lot in my bank accounts, either. So my convoluted security system is protecting a largely empty treasure chest.

But passwords are essential to Internet services, and increasingly our lives are lived online in the cloud. In twenty years, I can’t imagine how many hundreds more passwords I’ll have, and what kinds of stuff they’ll be protecting. The habits I build now could really help me out in the future.

Plus, I find that my new system is actually easier to use than the old one. How could I remember whether my user name was “nlawson” or “nolanlawson” or “NolanLawson” on all the forums where I was signed up? I’d inevitably have to try a few different combinations, or use the password recovery system, which took time.

Now I just open up PasswordGorilla, double-click, and paste. Nothing could be simpler. This is a system that even Mom could use. (And you should, Mom!)

Most importantly, now I can sleep soundly knowing that I’ve done the utmost to protect my online identity. Because I would hate to imagine that my account at the Buffalo Youth Hockey Association could ever compromise my account at the bank.

Note: sorry to pick on Buffalo hockey fans. You guys are all right.

Make your workplace more fun with a Jenkins alarm system

At every development team I’ve worked with, we’ve used Jenkins to notify us when the build broke. (Or Hudson, as it was called before Kohsuke Kawaguchi nailed a proclamation to the church wall.) Everyone on the team would receive an email when a unit test failed, or when someone forgot to commit a file, or when some other random blunder occurred.

Jenkins is really invaluable for finding problems early. And of course, publicly shaming the guilty party is always a great source of fun. The Continuous Integration Game plugin is even better for this.

But at my current employer, my colleague Alexandre Masselot had a singularly brilliant idea: instead of just firing off an email, why not add a visual cue as well? So he set up a physical flag system, attached to a USB servo device, with a cron script that would raise the flag whenever a Jenkins build failed. It looked like this:

Everyone on the dev team loved it. If the flag was raised when we walked into the office in the morning, the first question at the scrum meeting would be, “Who broke the build?” And as soon as the flag started to rise, you could hear the servo cranking, and the guilty developer would announce, “That was me!” A good time was had by all.

But could we do better?

The developers on the far side of the room couldn’t always hear the cranking sound, so they often didn’t notice when the flag was raised. So we decided to add an audio cue as well. Every time the build was broken, the machine would play the Star Wars “Imperial March” (aka Darth Vader’s theme), using the Unix beep command, since this particular machine had no speakers. It sounded like this:

And every time the build was fixed, it would play the main theme of Star Wars, to celebrate the joyous occasion:

This solved the immediate problem. The cacophonous beeping that signaled a build failure could be heard on the far side of the room. And beyond, where even the non-devs in the office could enjoy the sweet sound of pure geekiness.
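
A tune like this boils down to one long beep invocation. Here’s a minimal Python sketch, assuming the common Linux beep utility with its -f (frequency in Hz), -l (length in milliseconds), and -n (start the next note) options. The note list is a hand-transcribed approximation of the opening bars, not the script used in the office.

```python
# (frequency in Hz, duration in ms): rough opening of the "Imperial March"
IMPERIAL_MARCH = [
    (392, 500), (392, 500), (392, 500),   # G4 G4 G4
    (311, 350), (466, 150), (392, 500),   # Eb4 Bb4 G4
    (311, 350), (466, 150), (392, 1000),  # Eb4 Bb4 G4
]

def beep_command(notes):
    """Build the argument list for one `beep` call that plays all notes in order."""
    argv = ["beep"]
    for i, (freq, ms) in enumerate(notes):
        if i > 0:
            argv.append("-n")  # -n separates one note from the next
        argv += ["-f", str(freq), "-l", str(ms)]
    return argv

if __name__ == "__main__":
    # Print the command rather than running it, so this works without a PC speaker.
    print(" ".join(beep_command(IMPERIAL_MARCH)))
```

To actually make noise, the returned list can be handed to subprocess.run on a machine where beep is installed.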

But could we do better?

Whenever Darth Vader thundered his beeping fanfare, all the developers would immediately stop and check Jenkins to see which component broke. We didn’t have a quick way to know “whodunnit.”

So we added a new feature: an Android phone attached via USB, plus a simple text-to-speech app that would announce the name of the guilty party and the component that he or she broke.

The end result is that whenever the build is broken, the following events occur:

  1. The flag goes up
  2. The dreaded “Imperial March” sounds
  3. A robotic voice says “So-and-so broke the build, in the project such-and-such.”

The final system looks like this:

 

Then, when the build is back to normal, the system announces who fixed it, and all is forgiven:

 

So nowadays, whenever we start to hear the infernal beeping from our Jenkins machine, everyone takes off their headphones and patiently waits to hear who broke the build. Possibly with some apologies/excuses/complaints from the accused individual. (“It’s because the downstream project built before I was ready!”)

In any case, it makes our office much more melodious and much more fun. And luckily, the nearest non-dev is a Star Wars geek, so she doesn’t mind our antics.

Do it yourself

If you’d like to recreate our setup, it will only cost you about 40 bucks and a little bit of development time. You’ll need:

  1. A Yocto-Servo device ($25)
  2. A micro USB cable to connect it ($6)
  3. A micro servo to work the flag ($8)
  4. The flag itself (we used a Canadian flag, because it’s what I had, eh)
  5. The cron script to call Jenkins and raise the alarm
  6. The SimpleTalker app, if you have an Android device available. We use an old HTC Magic.
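
For item 5, the core of the polling logic might look like the Python sketch below. It is not the actual cron script. It assumes you have already fetched and parsed the JSON that Jenkins serves for a job’s latest build (typically under lastBuild/api/json), and that the response carries the result, culprits, and fullDisplayName fields. The action names raise_flag and lower_flag are hypothetical hooks for the servo, beep, and text-to-speech steps.

```python
def decide_action(build, was_broken):
    """Compare the latest build with the previous poll and pick an action.

    `build` is the parsed JSON for a Jenkins build; `was_broken` is what the
    previous poll saw. Returns (action, message), or (None, None) if nothing
    changed or the build is still running.
    """
    result = build.get("result")  # None while a build is in progress
    if result is None:
        return None, None
    is_broken = result != "SUCCESS"  # FAILURE, UNSTABLE, ABORTED all count
    if is_broken == was_broken:
        return None, None
    names = [c.get("fullName", "Someone") for c in build.get("culprits", [])]
    who = " and ".join(names) or "Someone"
    project = build.get("fullDisplayName", "an unknown project")
    if is_broken:
        return "raise_flag", f"{who} broke the build, in the project {project}."
    return "lower_flag", f"{who} fixed the build, in the project {project}."

if __name__ == "__main__":
    sample = {"result": "FAILURE",
              "culprits": [{"fullName": "So-and-so"}],
              "fullDisplayName": "such-and-such #42"}
    print(decide_action(sample, was_broken=False))
```

Keeping the decision separate from the HTTP call and the hardware makes it easy to test without a Jenkins server or a flag on your desk.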

Bonus features

We also experimented with a few other features. If you use this excellent Python script to convert MIDI files into beep format, for instance, you can find any MIDI you like and make it into euphonious beep music. Here’s what this Super Mario Bros. theme from VGMusic.com sounds like:

And you could use the Mario “game over” theme for a build failure:

Alternatively, the Android app I wrote supports specifying an MP3 file on the device’s storage, in addition to the text-to-speech. Maybe each of your developers prefers a personalized “failure” and “success” theme? The sky’s the limit.

And if the beeping sound isn’t annoying enough, I will humbly point out that Yoctopuce also offers USB modules to operate an emergency rotating light. I take no responsibility for the mental health of your office-mates if you actually install such a device.

Summary

Programming is fun. And in our office, we’ve found that turning programming into an audiovisual experience makes it even more fun. I hope you’ll get a kick out of the code we’ve written, and that your boss won’t think it’s a waste of time. (Ours didn’t, although she keeps the door to her office firmly closed now.)

Better synonym handling in Solr

Update: Download the plugin on GitHub.

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

A Rover by any other name would taste just as sweet.

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

The SynonymFilterFactory

Solr provides a cool-sounding SynonymFilterFactory, which can be fed a simple text file containing comma-separated synonyms. You can even choose whether to expand your synonyms reciprocally or to specify a particular directionality.

For instance, you can make “dog,” “hound,” and “pooch” all expand to “dog | hound | pooch,” or you can specify that “dog” maps to “hound” but not vice-versa, or you can make them all collapse to “dog.” This part of the synonym handling is very flexible and works quite well.
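
To make the three behaviors concrete, here is a simplified Python model of how such rules could be interpreted. It is a sketch, not Solr’s implementation: it assumes comma-separated lines for equivalent terms and an explicit “=>” arrow for one-way mappings, and it ignores multi-word subtleties entirely.

```python
def parse_synonym_rules(lines, expand=True):
    """Turn synonym rules into a dict mapping each input term to its outputs."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=>" in line:
            # One-way rule: terms on the left are replaced by terms on the right.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalence rule: expand to all terms, or collapse to the first.
            sources = [t.strip() for t in line.split(",")]
            targets = sources if expand else sources[:1]
        for term in sources:
            outputs = mapping.setdefault(term, [])
            for target in targets:
                if target not in outputs:
                    outputs.append(target)
    return mapping

if __name__ == "__main__":
    print(parse_synonym_rules(["dog, hound, pooch"], expand=True))
    print(parse_synonym_rules(["dog, hound, pooch"], expand=False))
    print(parse_synonym_rules(["dog => hound"]))
```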

Where it gets complicated is when you have to decide where to fit the SynonymFilterFactory: into the query analyzer or the index analyzer?

Index-time vs. query-time

The graphic below summarizes the basic differences between index-time and query-time expansion. Our problem is specific to Solr, but the choice between these two approaches can apply to any information retrieval system.

Index-time vs. query-time expansion.

Your first, intuitive choice might be to put the SynonymFilterFactory in the query analyzer. In theory, this should have several advantages:

  1. Your index stays the same size.
  2. Your synonyms can be swapped out at any time, without having to update the index.
  3. Synonyms work instantly; there’s no need to re-index.

However, according to the Solr docs, this is a Very Bad Thing to Do(™), and apparently you should put the SynonymFilterFactory into the index analyzer instead, despite what your instincts would tell you. They explain that query-time synonym expansion has three negative side effects:

  1. Multi-word synonyms won’t work as phrase queries.
  2. The IDF of rare synonyms will be boosted, causing unintuitive results.
  3. Multi-word synonyms won’t be matched in queries.

This is kind of complicated, so it’s worth stepping through each of these problems in turn.

Multi-word synonyms won’t work as phrase queries

At Health On the Net, our search engine uses MeSH terms for query expansion. MeSH is a medical ontology that works pretty well to provide some sensible synonyms for the health domain. Consider, for example, the synonyms for “breast cancer”:

breast neoplasm
breast neoplasms
breast tumor
breast tumors
cancer of breast
cancer of the breast

 

So in a normal SynonymFilterFactory setup with expand="true", a query for “breast cancer” becomes:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)

 

…which matches documents containing “breast neoplasms,” “cancer of the breast,” etc.

However, this also means that, if you’re doing a phrase query (i.e. “breast cancer” with the quotes), your document must literally match something like “breast cancer breast breast” in order to work.

Huh? What’s going on here? Well, it turns out that the SynonymFilterFactory isn’t expanding your multi-word synonyms the way you might think. Intuitively, if we were to represent this as a finite-state automaton, you might think that Solr is building up something like this (ignoring plurals):

What you reasonably expect.

But really it’s building up this:

The spaghetti you actually get.

And your poor, unlikely document must match all four terms in sequence. Yikes.

Similarly, the mm parameter (minimum “should” match) in the DisMax and EDisMax query parsers will not work as expected. In the example above, setting mm=100% will require that all four terms be matched:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)~4

 

The IDF of rare synonyms will be boosted

Even if you don’t have multi-word synonyms, the Solr docs mention a second good reason to avoid query-time expansion: unintuitive IDF boosting. Consider our “dog,” “hound,” and “pooch” example. In this case, a query for any one of the three will be expanded into:

+(dog hound pooch)

 

Since “hound” and “pooch” are much less common words, though, this means that documents containing them will always be artificially high in the search results, regardless of the query. This could create havoc for your poor users, who may be wondering why weird documents about hounds and pooches are appearing so high in their search for “dog.”

Index-time expansion supposedly fixes this problem by giving the same IDF values for “dog,” “hound,” and “pooch,” regardless of what the document originally said.
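
A little arithmetic shows how lopsided this can get. The sketch below uses the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the index size and document frequencies are invented purely for illustration.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as in Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

if __name__ == "__main__":
    num_docs = 100_000                                     # hypothetical index size
    doc_freqs = {"dog": 5_000, "hound": 200, "pooch": 20}  # hypothetical counts
    for term, df in doc_freqs.items():
        print(f"{term:6s} idf = {classic_idf(df, num_docs):.2f}")
```

With numbers like these, the rarer a term is, the larger its IDF, so a match on “pooch” outweighs a match on “dog” even when the user typed “dog.”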

Multi-word synonyms won’t be matched in queries

Finally, and most seriously, the SynonymFilterFactory will simply not match multi-word synonyms in user queries if you do any kind of tokenization. This is because the tokenizer breaks up the input before the SynonymFilterFactory can transform it.

For instance, the query “cancer of the breast” will be tokenized by the StandardTokenizerFactory into [“cancer”, “of”, “the”, “breast”], and only the individual terms will pass through the SynonymFilterFactory. So in this case no expansion will take place at all, assuming there are no synonyms for the individual terms “cancer” and “breast.”

Edit: I’ve been corrected on this. Apparently, the bug is in the Lucene query parser (LUCENE-2605) rather than the SynonymFilterFactory.

Other problems

I initially followed Solr’s suggestions, but I found that index-time synonym expansion created its own issues. Obviously there’s the problem of ballooning index sizes, but besides that, I also discovered an interesting bug in the highlighting system.

When I searched for “breast cancer,” I found that the highlighter would mysteriously highlight “breast cancer X Y,” where “X” and “Y” could be any two words that followed “breast cancer” in the document. For instance, it might highlight “breast cancer frauds are” or “breast cancer is to.”

Highlighting bug.

After reading through this Solr bug, I discovered it’s because of the same issue above concerning how Solr expands multi-word synonyms.

With query-time expansion, it’s weird enough that your query is logically transformed into the spaghettified graph above. But picture what happens with index-time expansion, if your document contains e.g. “breast cancer treatment options”:

Your mangled document.

This is literally what Lucene thinks your document looks like. Synonym expansion has bought you more than you bargained for, with some Dada-esque results! “Breast tumor the options” indeed.

Essentially, Lucene now believes that a query for “cancer of the breast” (4 tokens) is the same as “breast cancer treatment options” (4 tokens) in your original document. This is because the tokens are just stacked one on top of the other, losing any information about which term should be followed by which other term.
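
The stacking is easy to reproduce in a toy model. The Python sketch below is not Lucene code; it just assumes each synonym is overlaid on the document one token per position, starting where the matched phrase starts, and that a phrase query only checks for consecutive positions.

```python
def stack_tokens(doc_tokens, start, synonyms):
    """Overlay each synonym onto the document, one token per position."""
    positions = [{tok} for tok in doc_tokens]
    for phrase in synonyms:
        for offset, tok in enumerate(phrase.split()):
            pos = start + offset
            if pos < len(positions):  # simplification: drop overflow tokens
                positions[pos].add(tok)
    return positions

def phrase_matches(positions, phrase):
    """True if the words of `phrase` can be found at consecutive positions."""
    words = phrase.split()
    return any(
        all(w in positions[i + j] for j, w in enumerate(words))
        for i in range(len(positions) - len(words) + 1)
    )

if __name__ == "__main__":
    synonyms = ["breast neoplasm", "breast neoplasms", "breast tumor",
                "breast tumors", "cancer of breast", "cancer of the breast"]
    doc = "breast cancer treatment options".split()
    positions = stack_tokens(doc, 0, synonyms)
    for phrase in ["cancer of the breast", "breast tumor the options"]:
        print(phrase, "->", phrase_matches(positions, phrase))
```

Both phrases match, even though neither appears in the document, because nothing records which stacked token is supposed to follow which.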

Query-time expansion does not trigger this bug, because Solr is only expanding the query, not the document. So Lucene still thinks “cancer of the breast” in the query only matches “breast cancer” in the document.

Update: there’s a name for this phenomenon! It’s called “sausagization.”

Back to the drawing board

All of this wackiness led me to the conclusion that Solr’s built-in mechanism for synonym expansion was seriously flawed. I had to figure out a better way to get Solr to do what I wanted.

In summary, index-time expansion and query-time expansion were both unfeasible using the standard SynonymFilterFactory, since they each had separate problems:

Index-time

  • Index size balloons.
  • Synonyms don’t work instantly; documents must be re-indexed.
  • Synonyms cannot be instantly replaced.
  • Multi-word synonyms cause arbitrary words to be highlighted.

Query-time

  • Phrase queries do not work.
  • IDF values for rare synonyms are artificially boosted.
  • Multi-word synonyms won’t be matched in queries.

I began with the assumption that the ideal synonym-expansion system should be query-based, due to the inherent downsides of index-based expansion listed above. I also realized there’s a more fundamental problem with how Solr has implemented synonym expansion that should be addressed first.

Going back to the “dog”/“hound”/“pooch” example, there’s a big issue usability-wise with treating all three terms as equivalent. A “dog” is not exactly the same thing as a “pooch” or a “hound,” and certain queries might really be looking for that exact term (e.g. “The Hound of the Baskervilles,” “The Itchy & Scratchy & Poochy Show”). Treating all three as equivalent feels wrong.

Also, even with the recommended approach of index-time expansion, IDF weights are thrown out of whack. Every document that contains “dog” now also contains “pooch”, which means we have permanently lost information about the true IDF value for “pooch”.

In an ideal system, a search for “dog” should include documents containing “hound” and “pooch,” but it should still prefer documents containing the actual query term, which is “dog.” Similarly, searches for “hound” should prefer “hound,” and searches for “pooch” should prefer “pooch.” (I hope I’m not saying anything controversial here.) All three should match the same document set, but deliver the results in a different order.

Solution

My solution was to move the synonym expansion from the analyzer’s tokenizer chain to the query parser. So instead of expanding queries into the crazy intercrossing graphs shown above, I split it into two parts: the main query and the synonym query. Then I combine the two with separate, configurable weights, specify each one as “should occur,” and then wrap them both in a “must occur” boolean query.

So a search for “dog” is parsed as:

+((dog)^1.2 (hound pooch)^1.1)

 

The 1.2 and the 1.1 are the independent boosts, which can be configured as input parameters. The document must contain one of “dog”, “hound,” or “pooch”, but “dog” is preferred.
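
As a rough illustration of the shape of that query, here is a Python sketch that assembles the same string for single-word terms. It is only string building with made-up function and parameter names; it is not the plugin’s code and it does not handle phrases.

```python
def build_synonym_query(term, synonyms, main_boost=1.2, synonym_boost=1.1):
    """Build a Lucene-style query: the original term and its synonyms are each
    optional ("should occur"), separately boosted, inside one required clause."""
    main = f"({term})^{main_boost}"
    others = [s for s in synonyms if s != term]
    if not others:
        return f"+({main})"
    syn = "(" + " ".join(others) + f")^{synonym_boost}"
    return f"+({main} {syn})"

if __name__ == "__main__":
    print(build_synonym_query("dog", ["dog", "hound", "pooch"]))
    print(build_synonym_query("hound", ["dog", "hound", "pooch"]))
```

Because the main boost is higher than the synonym boost, all three searches hit the same documents but rank their own term first.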

Handling synonyms in this way also has another interesting side effect: it eliminates the problem of phrase queries not working. In the case of “breast cancer” (with the quotes), the query is parsed as:

+(("breast cancer")^1.2 (("breast neoplasm") ("breast tumor") ("cancer ? breast") ("cancer ? ? breast"))^1.1)

 

(The question marks appear because of the stopwords “of” and “the.”)

This means that a query for “breast cancer” (with the quotes) will also match documents containing the exact sequence “breast neoplasm,” “breast tumor,” “cancer of the breast,” and “cancer of breast.”

I also went one step beyond the original SynonymFilterFactory and built up all possible synonym combinations for a given query. So, for instance, if the query is “dog bite” and the synonyms file contains:

dog,hound,pooch
bite,nibble

 

… then the query will be expanded into:

dog bite
hound bite
pooch bite
dog nibble
hound nibble
pooch nibble
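The combination step is essentially a Cartesian product over each word’s synonym group. Here is a minimal sketch, assuming single-word synonyms only (the real parser also handles multi-word ones via shingles); the helper names are made up for this example. It yields the same six variants as above, though in a different order.

```python
from itertools import product


def parse_synonyms(lines: list[str]) -> dict[str, list[str]]:
    """Map every term to its full comma-separated synonym group."""
    groups = {}
    for line in lines:
        terms = [t.strip() for t in line.split(",") if t.strip()]
        for term in terms:
            groups[term] = terms
    return groups


def expand_query(query: str, groups: dict[str, list[str]]) -> list[str]:
    """Return every variant obtained by swapping each word for its synonyms."""
    # Words without a synonym group are kept as-is.
    alternatives = [groups.get(word, [word]) for word in query.split()]
    return [" ".join(combo) for combo in product(*alternatives)]


groups = parse_synonyms(["dog,hound,pooch", "bite,nibble"])
for variant in expand_query("dog bite", groups):
    print(variant)
```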

 

Try it yourself!

The code I wrote is a simple extension of Solr’s ExtendedDismaxQParserPlugin, called the SynonymExpandingExtendedDismaxQParserPlugin (long enough name?). I’ve only tested it with Solr 3.5.0, but it ought to work with any version that has EDisMax.

Edit: the instructions below are deprecated. Please follow the “Getting Started” guide on the GitHub page instead.

Here’s how you can use the parser:

  1. Drop this jar into your Solr’s lib/ directory.
  2. Add this definition to your solrconfig.xml:
    <queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
      <!-- TODO: figure out how we wouldn't have to define this twice -->
      <str name="luceneMatchVersion">LUCENE_34</str>
      <lst name="synonymAnalyzers">
        <lst name="myCoolAnalyzer">
          <lst name="tokenizer">
            <str name="class">solr.StandardTokenizerFactory</str>
          </lst>
          <lst name="filter">
            <str name="class">solr.ShingleFilterFactory</str>
            <str name="outputUnigramsIfNoShingles">true</str>
            <str name="outputUnigrams">true</str>
            <str name="minShingleSize">2</str>
            <str name="maxShingleSize">4</str>
          </lst>
          <lst name="filter">
            <str name="class">solr.SynonymFilterFactory</str>
            <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
            <str name="synonyms">my_synonyms_file.txt</str>
            <str name="expand">true</str>
            <str name="ignoreCase">true</str>
          </lst>
        </lst>
        <!-- add more analyzers here, if you want -->
      </lst>
    </queryParser>

    The analyzer defined above is the one used to split the query into all possible alternative synonyms. Synonyms that are exactly the same as the original query will be ignored, so feel free to use expand=true if you like.

    This particular configuration (StandardTokenizerFactory + ShingleFilterFactory + SynonymFilterFactory) is just the one that worked best for me. Feel free to try a different configuration, but something really fancy might break the code, so I don’t recommend going too far.

    For instance, you can configure the ShingleFilterFactory to output shingles (i.e. word N-grams) of any size you want, but I chose shingles of size 1-4 because my synonyms typically aren’t longer than 4 words. If you don’t have any multi-word synonyms, you can get rid of the ShingleFilterFactory entirely.

    (I know that this XML format is different from the typical one found in schema.xml, since it uses lst and str tags to configure the tokenizer and filters. Also, you must define the luceneMatchVersion a second time. I’ll try to find a way to fix these problems in a future release.)

  3. Add defType=synonym_edismax to your query URL parameters, or set it as the default in solrconfig.xml.
  4. Add the following query parameters. The first one is required:

    | Param | Type | Default | Summary |
    |---|---|---|---|
    | synonyms | boolean | false | Enable or disable synonym expansion entirely. Enabled if true. |
    | synonyms.analyzer | String | null | Name of the analyzer defined in solrconfig.xml to use (e.g. myCoolAnalyzer in the example above). This must be non-null if you define more than one analyzer. |
    | synonyms.originalBoost | float | 1.0 | Boost value applied to the original (non-synonym) part of the query. |
    | synonyms.synonymBoost | float | 1.0 | Boost value applied to the synonym part of the query. |
    | synonyms.disablePhraseQueries | boolean | false | If true, disables synonym expansion when the user input contains a phrase query (i.e. a quoted query). |
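Putting the parameters together, a request URL might be assembled like this. The host, port, and path are assumptions (a default local Solr install), and you would normally also pass your usual EDisMax parameters such as qf.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with the default select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

params = {
    "q": "dog bite",
    "defType": "synonym_edismax",           # use the synonym-expanding parser
    "synonyms": "true",                     # required to turn expansion on
    "synonyms.analyzer": "myCoolAnalyzer",  # analyzer name from solrconfig.xml
    "synonyms.originalBoost": "1.2",
    "synonyms.synonymBoost": "1.1",
}

request_url = f"{SOLR_SELECT_URL}?{urlencode(params)}"
print(request_url)
```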

Future work

Note that the parser does not currently expand synonyms if the user input contains complex query operators (i.e. AND, OR, +, and -). This is a TODO for a future release.
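If you want to predict on the client side whether a query will skip expansion, a check along these lines could work. This is only a guess at a reasonable heuristic, not how the plugin itself detects operators; the function name and regular expression are made up for this example.

```python
import re

# Heuristic: a standalone AND/OR keyword, or a '+' / '-' prefix at the
# start of a term. Hyphens inside a word (e.g. "well-known") don't count.
COMPLEX_OPERATOR = re.compile(r"(^|\s)(AND|OR)(\s|$)|(^|\s)[+-]\S")


def has_complex_operators(query: str) -> bool:
    """Return True if the query appears to use AND, OR, + or -."""
    return bool(COMPLEX_OPERATOR.search(query))


for q in ["dog bite", "dog AND bite", "+dog -cat", "well-known hound"]:
    print(q, "->", has_complex_operators(q))
```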

I also plan on getting in contact with the Solr/Lucene folks to see if they would be interested in including my changes in an upcoming version of Solr. So hopefully patching won’t be necessary in the future.

In general, I think my approach to synonyms is more principled and less error-prone than the built-in solution. If nothing else, though, I hope I’ve demonstrated that making synonyms work in Solr isn’t as cut-and-dried as one might think.

As usual, you can fork this code on GitHub!