relatedness calculator | Read the Tea Leaves

Posts Tagged ‘relatedness calculator’

8 Sep

Relatedness Calculator: Part Deux

Posted by Nolan Lawson in NLP, Webapps. Tagged: grails, relatedness calculator, ui design, webapps. 2 comments

Recently I added a ton of features and improvements to the Relatedness Calculator. Not only is the new app faster, lighter, prettier, and lower-cholesterol, but it’s also made me a savvier web developer in the process. How about that! I find myself actually learning a lot about web development, and slowly correcting my shameful n00b errors from the first version.

Version One

Now, I think the first version of the Relatedness Calculator was pretty cool. Neato, even. It solved the original problem that got me excited about this project in the first place, which was:

How closely related am I to my grandma’s cousin’s daughter?

— A dude on a message board

Answering this question in an automated way required writing a parser, defining a hell of a lot of English relationships, and writing an algorithm to calculate the relatedness coefficient for arbitrarily chained relationships (e.g. “X’s X’s X’s (…) X”).

Nothing could be simpler.

As it turns out, there’s a lot you have to know about genealogy in order to write such a system. At the minimum, you have to read something like Richard Dawkins’ essay on genesmanship from The Selfish Gene. But then there are also a lot of non-obvious things you learn that go beyond genealogy 101.

For instance, I learned that there’s a whole class of relationships that can be expressed with “half” – it’s not just half-brothers and half-sisters. There are half-cousins, half-uncles, half-second cousins, and even half-great-nieces too. What they all have in common is having two common ancestors in the family tree, which get reduced to just one in the “half” versions. Neat, huh?

There’s also a sort of “relatedness addition” that goes into calculating the relatedness coefficients for chained relations. If, for instance, I was parsing “uncle’s daughter,” I discovered I could take the common ancestors for “uncle” and the common ancestors for “daughter,” then use what the math geeks call matrix addition to get the right set of common ancestors. And it worked! I have no idea if it’s justified mathematically, but it made my unit tests pass, so that’s all I care about. (Man, am I a computer programmer, or what?)

There are also some relations that don’t make sense when you chain them together – for instance, “cousin’s brother” or “father’s son.” Examples like these kept screwing up the relatedness addition, because in fact they were redundant ways of expressing a much simpler concept. I.e., “cousin’s brother” is really just a cousin, and “father’s son” is really just you, or possibly your brother. The app had to handle all these errors to avoid barfing up nonsensical results, such as arbitrarily reduced relatedness coefficients for “father’s son’s father’s son’s father’s son’s…” etc.

So essentially, version one was an academic exercise. I barely paid any attention to the UI, and instead just made sure that the core logic was bulletproof. But all that changed with the second version.

User Confusion

From looking at the logs, I could see that users were typing a lot of unexpected queries that completely broke the system:

step-cousin twice removed on my mother’s side

john’s uncle’s wife

identical twin

father’s 2nd wife’s son

4th great-grandfather’s granddaughter

The worst part about a user typing something the system doesn’t understand is that they get a big fat error page. Most users will probably leave the site the first time something like that happens. After all, they went to all the trouble of typing a query out, and the system puked. Who would want to waste their time on a site like that?

Yes, my anonymous Internet application knows all about your friend John.

Obviously some of these queries are things I can fix – “identical twin” and “twice removed” are all features I added in version two. But other things don’t really make any sense in terms of calculating relatedness. Who is “John”? How am I supposed to know if you’re biologically related to his wife? And you’re not biologically related to your step-cousins, either!

I decided the biggest problem here was not that the system was throwing error messages, but rather that users didn’t know what kind of input the system expected. This is a classic autocompletion problem. People hate staring at a blank page. So starting with version two, I added a nice little autocomplete box.

Showing the space of possible queries goes a long way to help increase the user’s comprehension of the system. “Do I have to start with the word ‘my’?” “Do I have to put a space after the word ‘great’?” “Do I have to type ‘grandfather’ or can I just put ‘gramps’?” All these questions are answered in real time, as you type. It also makes playing around with the system rewarding and fun.

Memory usage

One of the biggest problems I ran into after adding autocompletion was OutOfMemoryErrors. I had used a simple trie object to suggest autocompletions, but this was really taxing the available memory on my Amazon EC2 Micro instance (613 MB).

So the first thing I did was add a 1G swap file on the OS. This had the effect of degrading the performance, while avoiding a complete application crash. I knew I still had some work to do.

I deduced that the new trie object was the culprit, given that the system didn’t crash until after the first autocomplete request was made. So using NetBeans to profile the memory usage, I slowly started slimming down the trie.

The original trie used a classic node-based data structure, similar to a linked list:

public class Trie<T> {
    private TrieNode root = new TrieNode();

    private class TrieNode {

        T value;
        Map<Character, TrieNode> next = Maps.newHashMap();

    }

A parameterized Trie<T> could hold any object in the T, and out of laziness I was storing the entire String that led to a leaf TrieNode. So first off I axed that, since the String itself could just be reconstructed from the breadcrumb trail of Characters anyway.

Next, I cut the memory usage of the Trie roughly in half by replacing all instances of HashMap<Character, TrieNode> with a custom SparseCharArray<TrieNode>. My SparseCharArray is basically just an Object array that has some of the behavior of a Map when I need it. I had already used a similar data structure for my Japanese Name Converter.

In the end, the new trie greatly improved the memory usage of the application, to the point where I didn’t even really need the swap space anymore. (But it’s nice to keep, just in case!)

Browser optimization

Having very little experience in building webapps, I didn’t have the faintest clue how to optimize one. Luckily there are a lot of smart folks who have already figured most of this stuff out, and who publish nice, condensed best-practice guides. Also, the debugging tools are incredibly slick these days, and between Firebug and Chrome’s Developer Tools, I had everything I needed to get under the hood and start tinkering around.

As it turned out, the biggest drain on the application’s performance was just resource management. Grails automates a lot of that, and it’s nice when it works, but when it doesn’t, you have to get your hands dirty. I eventually noticed I had a misbehaving plugin that was requesting Javascript resources that didn’t exist, so I was getting 302 (“moved temporarily”) redirects. On my teensy Micro instance running in Oregon, halfway across the globe from me, this was adding a nearly half-second roundtrip for each request. Once I fixed the links, though, the resources could be cached in the browser and therefore loaded in almost 0 time after the first request.

Another basic problem with redirects came from the root of the app itself. Linking to “/relatedness-calculator” instead of “/relatedness-calculator/” (i.e. without the slash), amazingly, also added about half a second to the request, which would occur anytime the user clicked the logo. So the addition of a single character saved me half a second of response time.

De-uglification

I don’t think there’s much that needs to be said here. Just take a look at the before and after pictures:

Before.

After.

Yep, I finally broke down and learned how to use Gimp. Bubbly, glossy Web 2.0 text for the win!

Another subtle visual change I made was to the family tree graph. To make the relationships in the graph clearer, I added emphasis to the nodes that were most relevant to the user’s query. For instance, in the case of “grandpa’s cousin,” the system draws a green border around “you,” “your grandpa,” and “your grandpa’s cousin.”

I don’t think I’ve ever met a single one of my grandpa’s cousins.

Conclusion

In the end, I’m pretty pleased with the changes I made to the Relatedness Calculator, and I’m proud of the end product. The only thing I find that I’m missing is the thrill I always got from Android development, from getting my code immediately into users’ hands and finding out whether an app was a winner or a dud. With Android development, users always seemed to quickly find my apps and give me feedback on them, even if I never did any marketing.

For instance, even my all-time least popular app, App Tracker, has 4,000 downloads, 50 ratings, 30 comments, and a handful of emails I’ve received from interested users. But with the Relatedness Calculator, it’s been out for three months and I haven’t heard a peep about it. It doesn’t seem to rank highly in Google’s search results, and according to the logs it’s only gotten about 250 query hits.

So I suppose the next step with the Relatedness Calculator is to figure out how to get people to use the damn thing. “Search engine optimization” and all that. My first instinct is to start emailing around to genealogy sites to see if any of them would be interested in linking to me, or maybe even running a standalone version on their own servers.

Alternatively, I could just make an Android version and see if my clout on the Play Store helps push it up in the search results. But I’m hesitant to do that, because this isn’t a product that makes a whole lot of sense as a mobile app. And plus, the mobile site works just fine.

In any case, I don’t plan on abandoning the Relatedness Calculator anytime soon. It’s a nice, simple project that gives me an opportunity to write some gnarly regexes while learning a thing or two about web development. And anyway, I’ve got an app that can figure out your relatedness to your identical twin’s children. How cool is that?

Read the Tea Leaves Software and other dark arts, by Nolan Lawson

Posts Tagged ‘relatedness calculator’

Relatedness Calculator: Part Deux

Version One

User Confusion

Memory usage

Browser optimization

De-uglification

Conclusion

Recent Posts

About Me

Archives

Tags

Links