Better synonym handling in Solr

Update: Download the plugin on Github.

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

A Rover by any other name would taste just as sweet.

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

The SynonymFilterFactory

Solr provides a cool-sounding SynonymFilterFactory, which can be fed a simple text file containing comma-separated synonyms. You can even choose whether to expand your synonyms reciprocally or to specify a particular directionality.

For instance, you can make “dog,” “hound,” and “pooch” all expand to “dog | hound | pooch,” or you can specify that “dog” maps to “hound” but not vice-versa, or you can make them all collapse to “dog.” This part of the synonym handling is very flexible and works quite well.
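To make those three behaviors concrete, here is a rough Python sketch of how such a synonyms file might be interpreted. This is not Solr's code: `parse_synonyms` is a hypothetical helper, and its handling of the `expand` flag is my reading of the usual conventions (comma-separated groups are equivalent terms, and `=>` marks a one-way mapping).

```python
def parse_synonyms(lines, expand=True):
    """Build a term -> replacement-terms mapping from Solr-style synonym lines.

    Hypothetical helper for illustration only; it is not part of Solr.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit one-way mapping: left-hand terms become right-hand terms.
            left, right = line.split("=>")
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalence group: expand to every term, or collapse to the first.
            sources = [t.strip() for t in line.split(",")]
            targets = sources if expand else sources[:1]
        for source in sources:
            replacements = mapping.setdefault(source, [])
            for target in targets:
                if target not in replacements:
                    replacements.append(target)
    return mapping


reciprocal = parse_synonyms(["dog, hound, pooch"], expand=True)
collapsed = parse_synonyms(["dog, hound, pooch"], expand=False)
one_way = parse_synonyms(["dog => hound"])
```

With these inputs, `reciprocal` maps each of the three words to all three, `collapsed` maps each of them to just "dog," and `one_way` maps "dog" to "hound" with no entry at all for "hound."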

Where it gets complicated is when you have to decide where to fit the SynonymFilterFactory: into the query analyzer or the index analyzer?

Index-time vs. query-time

The graphic below summarizes the basic differences between index-time and query-time expansion. Our problem is specific to Solr, but the choice between these two approaches can apply to any information retrieval system.

Index-time vs. query-time expansion.

Your first, intuitive choice might be to put the SynonymFilterFactory in the query analyzer. In theory, this should have several advantages:

  1. Your index stays the same size.
  2. Your synonyms can be swapped out at any time, without having to update the index.
  3. Synonyms work instantly; there’s no need to re-index.

However, according to the Solr docs, this is a Very Bad Thing to Do(™), and apparently you should put the SynonymFilterFactory into the index analyzer instead, despite what your instincts would tell you. They explain that query-time synonym expansion has three negative side effects:

  1. Multi-word synonyms won’t work as phrase queries.
  2. The IDF of rare synonyms will be boosted, causing unintuitive results.
  3. Multi-word synonyms won’t be matched in queries.

This is kind of complicated, so it’s worth stepping through each of these problems in turn.

Multi-word synonyms won’t work as phrase queries

At Health On the Net, our search engine uses MeSH terms for query expansion. MeSH is a medical ontology that works pretty well to provide some sensible synonyms for the health domain. Consider, for example, the synonyms for “breast cancer”:

breast neoplasm
breast neoplasms
breast tumor
breast tumors
cancer of breast
cancer of the breast


So in a normal SynonymFilterFactory setup with expand=”true”, a query for “breast cancer” becomes:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)


…which matches documents containing “breast neoplasms,” “cancer of the breast,” etc.

However, this also means that, if you’re doing a phrase query (i.e. “breast cancer” with the quotes), your document must literally match something like “breast cancer breast breast” in order to work.

Huh? What’s going on here? Well, it turns out that the SynonymFilterFactory isn’t expanding your multi-word synonyms the way you might think. Intuitively, if we were to represent this as a finite-state automaton, you might think that Solr is building up something like this (ignoring plurals):

What you reasonably expect.

But really it’s building up this:

The spaghetti you actually get.

And your poor, unlikely document must match all four terms in sequence. Yikes.

Similarly, the mm parameter (minimum “should” match) in the DisMax and EDisMax query parsers will not work as expected. In the example above, setting mm=100% will require that all four terms be matched:

+((breast breast breast breast breast cancer cancer) (cancer neoplasm neoplasms tumor tumors) breast breast)~4


The IDF of rare synonyms will be boosted

Even if you don’t have multi-word synonyms, the Solr docs mention a second good reason to avoid query-time expansion: unintuitive IDF boosting. Consider our “dog,” “hound,” and “pooch” example. In this case, a query for any one of the three will be expanded into:

+(dog hound pooch)


Since “hound” and “pooch” are much less common words, though, this means that documents containing them will always be artificially high in the search results, regardless of the query. This could create havoc for your poor users, who may be wondering why weird documents about hounds and pooches are appearing so high in their search for “dog.”

Index-time expansion supposedly fixes this problem by giving the same IDF values for “dog,” “hound,” and “pooch,” regardless of what the document originally said.
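A quick back-of-the-envelope sketch shows the size of the effect. The document counts below are made up, and the formula is the classic Lucene one (1 + ln(numDocs / (docFreq + 1))). For the index-time case I assume, for simplicity, that no document contains more than one of the three words, so the merged document frequency is just the sum.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1_000_000
# Hypothetical document frequencies, invented for this example.
DOC_FREQ = {"dog": 50_000, "hound": 2_000, "pooch": 300}

# Query-time expansion: each term keeps its own document frequency,
# so the rare synonyms get a much larger IDF than "dog".
query_time_idf = {term: classic_idf(df, NUM_DOCS) for term, df in DOC_FREQ.items()}

# Index-time expansion: every document containing one term now contains
# all three, so they share one document frequency and therefore one IDF.
merged_df = sum(DOC_FREQ.values())  # assumes the three sets of documents do not overlap
index_time_idf = {term: classic_idf(merged_df, NUM_DOCS) for term in DOC_FREQ}

for term in DOC_FREQ:
    print(f"{term:6s} query-time={query_time_idf[term]:.2f} index-time={index_time_idf[term]:.2f}")
```

Under query-time expansion the rarest term ("pooch") gets the highest IDF, which is exactly why documents containing it float to the top.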

Multi-word synonyms won’t be matched in queries

Finally, and most seriously, the SynonymFilterFactory will simply not match multi-word synonyms in user queries if you do any kind of tokenization. This is because the tokenizer breaks up the input before the SynonymFilterFactory can transform it.

For instance, the query “cancer of the breast” will be tokenized by the StandardTokenizerFactory into [“cancer”, “of”, “the”, “breast”], and only the individual terms will pass through the SynonymFilterFactory. So in this case no expansion will take place at all, assuming there are no synonyms for the individual terms “cancer” and “breast.”

Edit: I’ve been corrected on this. Apparently, the bug is in the Lucene query parser (LUCENE-2605) rather than the SynonymFilterFactory.

Other problems

I initially followed Solr’s suggestions, but I found that index-time synonym expansion created its own issues. Obviously there’s the problem of ballooning index sizes, but besides that, I also discovered an interesting bug in the highlighting system.

When I searched for “breast cancer,” I found that the highlighter would mysteriously highlight “breast cancer X Y,” where “X” and “Y” could be any two words that followed “breast cancer” in the document. For instance, it might highlight “breast cancer frauds are” or “breast cancer is to.”

Highlighting bug.

After reading through this Solr bug, I discovered it’s because of the same issue above concerning how Solr expands multi-word synonyms.

With query-time expansion, it’s weird enough that your query is logically transformed into the spaghettified graph above. But picture what happens with index-time expansion, if your document contains e.g. “breast cancer treatment options”:

Your mangled document.

This is literally what Lucene thinks your document looks like. Synonym expansion has bought you more than you bargained for, with some Dada-esque results! “Breast tumor the options” indeed.

Essentially, Lucene now believes that a query for “cancer of the breast” (4 tokens) is the same as “breast cancer treatment options” (4 tokens) in your original document. This is because the tokens are just stacked one on top of the other, losing any information about which term should be followed by which other term.
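Here is a toy model of that stacking, in case it helps. It is a deliberate simplification and not Lucene's actual data structure: each position is just a set of tokens, synonym tokens are piled onto consecutive positions starting where the matched phrase begins, and a phrase "matches" if its words can be found at consecutive positions. The function names are my own.

```python
def index_with_synonyms(tokens, synonyms):
    """Stack each synonym's tokens onto consecutive positions (toy model)."""
    positions = [{tok} for tok in tokens]
    for phrase, alternatives in synonyms.items():
        words = phrase.split()
        for start in range(len(tokens) - len(words) + 1):
            if tokens[start:start + len(words)] != words:
                continue
            for alt in alternatives:
                for offset, word in enumerate(alt.split()):
                    # Grow the document if a synonym runs past its end.
                    while start + offset >= len(positions):
                        positions.append(set())
                    positions[start + offset].add(word)
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase's words occur at consecutive positions."""
    words = phrase.split()
    return any(
        all(word in positions[start + k] for k, word in enumerate(words))
        for start in range(len(positions) - len(words) + 1)
    )


doc = "breast cancer treatment options".split()
synonyms = {"breast cancer": ["breast tumor", "cancer of the breast"]}
stacked = index_with_synonyms(doc, synonyms)

print(phrase_matches(stacked, "cancer of the breast"))      # matches, as intended
print(phrase_matches(stacked, "breast tumor the options"))  # also matches, which is the bug
```

Because nothing records which token at one position is allowed to follow which token at the previous one, any path through the stack counts as a phrase.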

Query-time expansion does not trigger this bug, because Solr is only expanding the query, not the document. So Lucene still thinks “cancer of the breast” in the query only matches “breast cancer” in the document.

Update: there’s a name for this phenomenon! It’s called “sausagization.”

Back to the drawing board

All of this wackiness led me to the conclusion that Solr’s built-in mechanism for synonym expansion was seriously flawed. I had to figure out a better way to get Solr to do what I wanted.

In summary, index-time expansion and query-time expansion were both unfeasible using the standard SynonymFilterFactory, since they each had separate problems:

Index-time expansion:

  • Index size balloons.
  • Synonyms don’t work instantly; documents must be re-indexed.
  • Synonyms cannot be instantly replaced.
  • Multi-word synonyms cause arbitrary words to be highlighted.

Query-time expansion:

  • Phrase queries do not work.
  • IDF values for rare synonyms are artificially boosted.
  • Multi-word synonyms won’t be matched in queries.

I began with the assumption that the ideal synonym-expansion system should be query-based, due to the inherent downsides of index-based expansion listed above. I also realized there’s a more fundamental problem with how Solr has implemented synonym expansion that should be addressed first.

Going back to the “dog”/”hound”/”pooch” example, there’s a big issue usability-wise with treating all three terms as equivalent. A “dog” is not exactly the same thing as a “pooch” or a “hound,” and certain queries might really be looking for that exact term (e.g. “The Hound of the Baskervilles,” “The Itchy & Scratchy & Poochy Show”). Treating all three as equivalent feels wrong.

Also, even with the recommended approach of index-time expansion, IDF weights are thrown out of whack. Every document that contains “dog” now also contains “pooch”, which means we have permanently lost information about the true IDF value for “pooch”.

In an ideal system, a search for “dog” should include documents containing “hound” and “pooch,” but it should still prefer documents containing the actual query term, which is “dog.” Similarly, searches for “hound” should prefer “hound,” and searches for “pooch” should prefer “pooch.” (I hope I’m not saying anything controversial here.) All three should match the same document set, but deliver the results in a different order.


My solution was to move the synonym expansion from the analyzer’s tokenizer chain to the query parser. So instead of expanding queries into the crazy intercrossing graphs shown above, I split it into two parts: the main query and the synonym query. Then I combine the two with separate, configurable weights, specify each one as “should occur,” and then wrap them both in a “must occur” boolean query.

So a search for “dog” is parsed as:

+((dog)^1.2 (hound pooch)^1.1)


The 1.2 and the 1.1 are the independent boosts, which can be configured as input parameters. The document must contain one of “dog”, “hound,” or “pooch”, but “dog” is preferred.
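As a sketch of the idea (not the plugin's actual code), the combined query string can be assembled like this. The function name is hypothetical, and this version only handles single-word terms; the boosts default to 1.0, as in the plugin's parameters.

```python
def build_synonym_query(original, synonyms, original_boost=1.0, synonym_boost=1.0):
    """Combine the original term and its synonyms into one boolean query string.

    Both parts are "should occur" clauses with their own boosts; the outer
    "+" makes the combined clause "must occur".
    """
    main_clause = f"({original})^{original_boost}"
    if not synonyms:
        return f"+({main_clause})"
    synonym_clause = f"({' '.join(synonyms)})^{synonym_boost}"
    return f"+({main_clause} {synonym_clause})"


print(build_synonym_query("dog", ["hound", "pooch"], original_boost=1.2, synonym_boost=1.1))
# prints: +((dog)^1.2 (hound pooch)^1.1)
```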

Handling synonyms in this way also has another interesting side effect: it eliminates the problem of phrase queries not working. In the case of “breast cancer” (with the quotes), the query is parsed as:

+(("breast cancer")^1.2 (("breast neoplasm") ("breast tumor") ("cancer ? breast") ("cancer ? ? breast"))^1.1)


(The question marks appear because of the stopwords “of” and “the.”)

This means that a query for “breast cancer” (with the quotes) will also match documents containing the exact sequence “breast neoplasm,” “breast tumor,” “cancer of the breast,” and “cancer of breast.”

I also went one step beyond the original SynonymFilterFactory and built up all possible synonym combinations for a given query. So, for instance, if the query is “dog bite” and the synonyms file contains:

dog,hound,pooch
bite,nibble

… then the query will be expanded into:

dog bite
hound bite
pooch bite
dog nibble
hound nibble
pooch nibble
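
Building all the combinations is essentially a Cartesian product over each word's alternatives. A minimal sketch, assuming two synonym groups (dog/hound/pooch and bite/nibble) and single-word synonyms only; the real plugin relies on the analyzer chain to find multi-word synonyms, which this hypothetical `expand_query` helper does not attempt:

```python
from itertools import product


def expand_query(query, synonym_groups):
    """Return every combination of the query's words and their synonyms."""
    # Map each term to the group it belongs to.
    lookup = {term: group for group in synonym_groups for term in group}
    # Words without synonyms stand alone.
    alternatives = [lookup.get(word, [word]) for word in query.split()]
    return [" ".join(combo) for combo in product(*alternatives)]


groups = [["dog", "hound", "pooch"], ["bite", "nibble"]]
for expanded in expand_query("dog bite", groups):
    print(expanded)
```

This prints the same six queries as above, though not necessarily in the same order.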


Try it yourself!

The code I wrote is a simple extension of the ExtendedDisMaxQueryParserPlugin, called the SynonymExpandingExtendedDisMaxQueryParserPlugin (long enough name?). I’ve only tested it to work with Solr 3.5.0, but it ought to work with any version that has EDisMax.

Edit: the instructions below are deprecated. Please follow the “Getting Started” guide on the Github page instead.

Here’s how you can use the parser:

  1. Drop this jar into your Solr’s lib/ directory.
  2. Add this definition to your solrconfig.xml:

    <queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
      <!-- TODO: figure out how we wouldn't have to define this twice -->
      <str name="luceneMatchVersion">LUCENE_34</str>
      <lst name="synonymAnalyzers">
        <lst name="myCoolAnalyzer">
          <lst name="tokenizer">
            <str name="class">solr.StandardTokenizerFactory</str>
          </lst>
          <lst name="filter">
            <str name="class">solr.ShingleFilterFactory</str>
            <str name="outputUnigramsIfNoShingles">true</str>
            <str name="outputUnigrams">true</str>
            <str name="minShingleSize">2</str>
            <str name="maxShingleSize">4</str>
          </lst>
          <lst name="filter">
            <str name="class">solr.SynonymFilterFactory</str>
            <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
            <str name="synonyms">my_synonyms_file.txt</str>
            <str name="expand">true</str>
            <str name="ignoreCase">true</str>
          </lst>
        </lst>
        <!-- add more analyzers here, if you want -->
      </lst>
    </queryParser>

    The analyzer you see defined above is the one used to split the query into all possible alternative synonyms. Synonyms that are exactly the same as the original query will be ignored, so feel free to use expand=true if you like.

    This particular configuration (StandardTokenizerFactory + ShingleFilterFactory + SynonymFilterFactory) is just the one that I found worked the best for me. Feel free to try a different configuration, but something really fancy might break the code, so I don’t recommend going too far.

    For instance, you can configure the ShingleFilterFactory to output shingles (i.e. word N-grams) of any size you want, but I chose shingles of size 1-4 because my synonyms typically aren’t longer than 4 words. If you don’t have any multi-word synonyms, you can get rid of the ShingleFilterFactory entirely.

    (I know that this XML format is different from the typical one found in schema.xml, since it uses lst and str tags to configure the tokenizer and filters. Also, you must define the luceneMatchVersion a second time. I’ll try to find a way to fix these problems in a future release.)

  3. Add defType=synonym_edismax to your query URL parameters, or set it as the default in solrconfig.xml.
  4. Add the following query parameters. The first one is required:

    | Param | Type | Default | Summary |
    |---|---|---|---|
    | synonyms | boolean | false | Enable or disable synonym expansion entirely. Enabled if true. |
    | synonyms.analyzer | String | null | Name of the analyzer defined in solrconfig.xml to use. (E.g. in the example above, it’s myCoolAnalyzer). This must be non-null, if you define more than one analyzer. |
    | synonyms.originalBoost | float | 1.0 | Boost value applied to the original (non-synonym) part of the query. |
    | synonyms.synonymBoost | float | 1.0 | Boost value applied to the synonym part of the query. |
    | synonyms.disablePhraseQueries | boolean | false | Enable or disable synonym expansion when the user input contains a phrase query (i.e. a quoted query). |
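
Put together, a request might look something like the following sketch. The host, port, and `/solr/select` path are assumptions about a default local Solr install, and `build_search_url` is a hypothetical helper; the parameter names are the ones listed above.

```python
from urllib.parse import urlencode


def build_search_url(query, base_url="http://localhost:8983/solr/select"):
    """Build a Solr query URL that turns on synonym expansion.

    base_url is an assumed default; adjust it for your own install.
    """
    params = {
        "q": query,
        "defType": "synonym_edismax",
        "synonyms": "true",
        "synonyms.analyzer": "myCoolAnalyzer",
        "synonyms.originalBoost": "1.2",
        "synonyms.synonymBoost": "1.1",
    }
    return base_url + "?" + urlencode(params)


print(build_search_url("dog"))
```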

Future work

Note that the parser does not currently expand synonyms if the user input contains complex query operators (i.e. AND, OR, +, and -). This is a TODO for a future release.

I also plan on getting in contact with the Solr/Lucene folks to see if they would be interested in including my changes in an upcoming version of Solr. So hopefully patching won’t be necessary in the future.

In general, I think my approach to synonyms is more principled and less error-prone than the built-in solution. If nothing else, though, I hope I’ve demonstrated that making synonyms work in Solr isn’t as cut-and-dried as one might think.

As usual, you can fork this code on GitHub!

Building an English-to-Japanese name converter

Update: I made a Japanese Name Converter web site!

The Japanese Name Converter was the first Android app I ever wrote.  So for me, it was kind of a “hello world” app, but in retrospect it was a doozy of a “hello world.”

The motivation for the app was pretty simple: what was something I could build to run on an Android phone that 1) lots of people would be interested in and 2) required some of my unique NLP expertise?  Well, people love their own names, and if they’re geeks like me, they probably think Japanese is cool.  So is there some way, I wondered, of writing a program that could automatically transliterate any English name into Japanese characters?

The task

The problem is not trivial.  Japanese phonemics and phonotactics are both very restrictive, and as a result any loanword gets thoroughly mangled as it passes through the gauntlet of Japanese sound rules.  Some examples are below:

beer = biiru (/bi:ru/)
heart = haato (/ha:to/)
hamburger = hanbaagaa (/hanba:ga:/)
strike (i.e. in baseball) = sutoraiku (/sutoraiku/)
volleyball = bareebooru (/bare:bo:ru/)
helicopter = herikoputaa (/herikoputa:/)

English names go through the same process:

Nolan = nooran (/no:ran/)
Michael = maikeru (/maikeru/)
Stan = sutan (/sutan/)

(Note for IPA purists: the Japanese /r/ is technically an alveolar flap, and therefore would be represented phonetically as [ɾ].  The /u/ is an unrounded [ɯ].)

Whole lotta changes going on here.  To just pick out some of the highlights, notice that:

  1. “l” becomes “r” – Japanese, like many languages outside the Indo-European family, makes no distinction between the two.
  2. Japanese phonotactics only allow one coda – “n.”  So no syllables can end on any consonant other than “n,” and no consonant clusters are allowed except for those starting with “n.”  All English consonant clusters have to be epenthesized with vowels, usually “u” but sometimes “i.”
  3. English syllabic “r” (aka the rhotacized schwa, sometimes written [ɚ]) becomes a double vowel /a:/.  Yep, they use the British, r-less pronunciation.  Guess they didn’t concede everything to us Americans just because we occupied ’em.

All this is just what I’d have to do to convert the English names into romanized Japanese (roomaji).  I still haven’t even mentioned having to convert this all into katakana, i.e. the syllabic alphabet Japanese uses for foreign words!  Clearly I had my work cut out for me.

Initial ideas

The first solution that popped into my head was to use Transformation-Based Learning (aka the Brill tagger).  My idea was that you could treat each individual letter in the English input as the observation and the corresponding sequence in the Japanese output as the class label, and then build up rules to transform them based on the context.  It seemed reasonable enough.  Plus, I would benefit from the fact that the output labels come from the same set as the input labels (if I used English letters, anyway).  So for instance, “nolan” and “nooran” could be aligned as:

n -> n
o -> oo
l -> r
a -> a
n -> n

Three of the above pairs are already correct before I even do anything.  Off to a good start!

Plus, once the TBL is built, executing it would be dead simple.  All of the rules just need to be applied in order, amounting to a series of string replacements.  Even the limited phone hardware could handle it, unlike what I would be getting with a Markov model.  Sweet!  Now what?

Well, the first thing I needed was training data.  After some searching, I eventually found a calligraphy web site that listed about 4,000 English-Japanese name pairs, presumably so that people could get tattoos they’d regret later.  After a little wget action and some data massaging, I had my training data.

By the way, let’s take a moment to give a big hand to those unsung heroes of machine learning – the people who take the time to build up huge, painstaking corpora like these.  Without them, nothing in machine learning would be possible.

First Attempt

My first attempt started out well.  I began by writing a training algorithm that would generate rules (such as “convert X to Y when preceded by Z” or “convert A to B when followed by C”) from each of the training pairs.  Each rule was structured as follows:

Antecedent: a single character in the English string
Consequence: any substring in the Japanese string (with some limit on max substring length)
Condition(s): none and/or following letter and/or preceding letter and/or is a vowel etc.

Then I calculated the gain (in terms of total Levenshtein, or edit distance, improvement across the training data) for each rule.  Finally, à la Brill, it was just a matter of taking the best rule at each iteration, applying it to all the strings, and continuing until some breaking point.  The finished model would just be the list of rules, applied in order.
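
Roughly, that greedy loop looks like the sketch below. To keep it short I've made some simplifying assumptions that differ from the real thing: rules here are context-free (old substring, new substring) pairs with no conditions, the candidate rules are supplied by hand rather than generated from the training pairs, and the stopping point is simply "no rule helps anymore." All names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rule_gain(pairs, old, new):
    """How much total edit distance drops if every `old` becomes `new`."""
    before = sum(levenshtein(cur, gold) for cur, gold in pairs)
    after = sum(levenshtein(cur.replace(old, new), gold) for cur, gold in pairs)
    return before - after


def train(pairs, candidate_rules, max_rules=10):
    """Greedily pick the best rule, apply it everywhere, and repeat."""
    pairs = list(pairs)
    learned = []
    for _ in range(max_rules):
        best = max(candidate_rules, key=lambda rule: rule_gain(pairs, *rule))
        if rule_gain(pairs, *best) <= 0:
            break  # no remaining rule improves the training data
        learned.append(best)
        # Overwrite the working strings, as in my first attempt.
        pairs = [(cur.replace(best[0], best[1]), gold) for cur, gold in pairs]
    return learned


training_pairs = [("nolan", "nooran"), ("stan", "sutan")]
candidates = [("l", "r"), ("o", "oo"), ("st", "sut"), ("a", "e")]
print(train(training_pairs, candidates))
```

Note the line that overwrites the working strings after each rule is chosen.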

Unfortunately, this ended up failing because the rules kept mangling the input data to the point where the model was unable to recover, since I was overwriting the string with each rule.  So, for instance, the first rule the model learned was “l” -> “r”.  Great!  That makes perfect sense, since Japanese has no “l.”  However, this caused problems later on, because the model now had no way of distinguishing syllable-final “l” from “r,” which makes a huge difference in the transliteration.  Ending English “er” usually becomes “aa” in Japanese (e.g. “spencer” -> “supensaa”), but ending “el” becomes “eru” (e.g. “mabel” -> “meeberu”).  Since the model had overwritten all l’s with r’s, it couldn’t tell the difference. So I scrapped that idea.

Second Attempt

My Brill-based converter was lightweight, but maybe I needed to step things up a bit?  I wondered if the right approach here would be to use something like a sequential classifier or HMM.  Ignoring the question of whether or not that could even run on a phone (which was unlikely), I tried to run an experiment to see if it was even a feasible solution.

The first problem I ran into here was that of alignment.  With the Brill-based model, I could simply generate rules where the antecedent was any character in the English input and the consequence was any substring of the Japanese input.  Here, though, you’d need the output to be aligned with the input, since the HMM (or whatever) has to emit a particular class label at each observation.  So, for instance, rather than just let the Brill algorithm discover on its own that “o” –> “oo” was a good rule for transliterating “nolan” to “nooran” (because it improved edit distance), I’d need to write the alignment algorithm myself before inputting it to the sequential learner.

I realized that what I was trying to do was similar to parallel corpus alignment (as in machine translation), except that in my case I was aligning letters rather than words.  I tried to brush up on the machine translation literature, but it mostly went over my head.  (Hey, we never covered it in my program.)  So I tried a few different approaches.

I started by thinking of it like an HMM, in which case I’m trying to predict the output Japanese sequence (j) given the input English sequence (e), where I could model the relationship like so:

P(j|e) = \frac{P(e|j) P(j)}{P(e)} (by Bayes’ Law)

And, since we’re just trying to maximize P(j|e), we can simplify this to:

\arg\max_j P(j|e) = \arg\max_j P(e|j) P(j) \hspace{3 mm}\text{since}\hspace{3 mm} P(j|e) \propto P(e|j) P(j)

Or, in English (because I hate looking at formulas too): The probability of a Japanese string given an English string is proportional to the probability of the English string given the Japanese string multiplied by the probability of the Japanese string.

But I’m not building a full HMM – I’m just trying to figure out the partitioning of the sequence, i.e. the P(e|j) part.  So I modeled that as:

P(e|j) = P(e_0|j_0) P(e_1|j_1) ... P(e_n|j_n)

Or, in English: The probability of the English string given the Japanese string equals the product of the probabilities of each English character given its corresponding Japanese substring.

Makes sense so far, right?  All I’m doing is assuming that I can multiply the probabilities of the individual substrings together to get the total probability. This is pretty much the exact same thing you do with Naive Bayes, where you assume that all the words in a document are conditionally independent and just multiply their probabilities together.

And since I didn’t know j_0 through j_n (i.e. the Japanese substring partitionings, e.g n|oo|r|a|n), my task boiled down to just generating every possible partitioning, calculating the probability for each one, and then taking the max.

But how to model P(e_n|j_n), i.e. the probability of an English letter given a Japanese substring?  Co-occurrence counts seemed like the most intuitive choice here – just answering the question “how likely am I to see this English character, given the Japanese substring I’m aligning it with?”  Then I could just take the product of all of those probabilities.  So, for instance, in the case of “nolan” -> “nooran”, the ideal partitioning would be n|oo|r|a|n, and to figure that out I would calculate count(n,n)/count(n) * count(o,oo)/count(o) * count(l,r)/count(l) * count(a,a)/count(a) * count(n,n)/count(n), which should be the highest-scoring partitioning for that pair.
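
In code, the brute-force version of this might look like the sketch below. It is a reconstruction under assumptions, not my original implementation: co-occurrence counts are gathered over every possible partitioning of every training pair, each Japanese piece must be non-empty, and the score uses count(e, j) / count(e) exactly as written above. The function names are made up.

```python
from collections import Counter
from itertools import combinations
from math import prod


def partitions(japanese, k):
    """Yield every way to cut `japanese` into k non-empty contiguous pieces."""
    for cuts in combinations(range(1, len(japanese)), k - 1):
        bounds = (0, *cuts, len(japanese))
        yield [japanese[a:b] for a, b in zip(bounds, bounds[1:])]


def count_cooccurrences(pairs):
    """Count (English letter, Japanese piece) pairings over all partitionings."""
    pair_counts, letter_counts = Counter(), Counter()
    for english, japanese in pairs:
        for pieces in partitions(japanese, len(english)):
            for letter, piece in zip(english, pieces):
                pair_counts[letter, piece] += 1
                letter_counts[letter] += 1
    return pair_counts, letter_counts


def best_partition(english, japanese, pair_counts, letter_counts):
    """Pick the partitioning with the highest product of count(e, j) / count(e).

    Assumes the pair was part of the counted data, so no count is zero.
    Returns None when the Japanese string is shorter than the English one.
    """
    def score(pieces):
        return prod(pair_counts[l, p] / letter_counts[l] for l, p in zip(english, pieces))

    return max(partitions(japanese, len(english)), key=score, default=None)


training_pairs = [("nolan", "nooran"), ("stan", "sutan"), ("nora", "noora")]
pair_counts, letter_counts = count_cooccurrences(training_pairs)
print(best_partition("nolan", "nooran", pair_counts, letter_counts))
```

With only three training pairs the winner isn't guaranteed to be n|oo|r|a|n; the counts only become meaningful with a few thousand names.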

But since this formula had a tendency to favor longer Japanese substrings (because they are rarer), I leveled the playing field a bit by also multiplying the conditional probabilities of all the substrings of those substrings.  (Edit: only after reading this do I realize my error was in putting count(e) in the denominator, rather than count(j).  D’oh.) There!  Now I finally had my beautiful converter, right?

Well, the pairings of substrings were fine – my co-occurrence heuristic seemed to find reasonable inputs and outputs.  The final model, though, failed horribly.  I used Minorthird to build up a Maximum Entropy Markov Model (MEMM) trained on the input 4,000 name pairs (with Minorthird’s default Feature Extractor), and the model performed even worse than the Brill one!  The output just looked like random garbage, and didn’t seem to correspond to any of the letters in the input.  The main problem appeared to be that there were just too many class labels, since an English letter in the input could correspond to many Japanese letters in the output.

For instance, the most extreme case I found is the name “Alex,” which transliterates to “arekkusu.”  The letter “x” here corresponds to no less than five letters in the output – “kkusu.”  Now imagine how many class labels there must have been, if “kkusu” was one of them.  Yeah, it was ridiculous. Classification tends to get dicey when you have more than ten labels. I’d argue that even three is pushing it, since the sweet spot is really two (binary classification).

Also, it was at this point that I realized that trying to do MEMM decoding on the underpowered hardware of a phone was pretty absurd as it is.  Was I really going to bundle the entire Minorthird JAR with my app and just hope it would work without throwing an OutOfMemoryError?

Third Attempt

So for my third attempt, I went back to the drawing board with the Brill tagger.  But this time, I had an insight.  Wasn’t my whole problem before that the training algorithm was destroying the string at each step?  Why not simply add a condition to the rule that referenced the original character in the English string?  For instance, even if the first rule converts all l’s to r’s, the model could still “see” the original “l,” and thus later on down the road it could discover useful rules like “convert ‘er’ to ‘eru’ when the original string was ‘el’, but convert ‘er’ to ‘aa’ when the original string was ‘er’.”  I immediately noticed a huge difference in the performance after adding this condition to the generated rules.
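
One way to picture this is to keep each original letter paired with whatever output it currently maps to, so a rule can test both. The sketch below is a hypothetical simplification: the `Cell` and `Rule` classes are my own invention for this example, rules rewrite one cell at a time, and the three toy rules only deal with the final consonant (so the results are nowhere near complete transliterations).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    original: str  # the English letter, never overwritten
    output: str    # the current (partially transliterated) output


@dataclass
class Rule:
    match_output: str
    new_output: str
    original_is: Optional[str] = None  # condition on the original letter
    word_final: bool = False           # condition on position

    def applies(self, cells, i):
        cell = cells[i]
        if cell.output != self.match_output:
            return False
        if self.original_is is not None and cell.original != self.original_is:
            return False
        if self.word_final and i != len(cells) - 1:
            return False
        return True


def apply_rules(name, rules):
    """Apply each rule in order, keeping the original letters visible."""
    cells = [Cell(ch, ch) for ch in name]
    for rule in rules:
        for i in range(len(cells)):
            if rule.applies(cells, i):
                cells[i].output = rule.new_output
    return "".join(cell.output for cell in cells)


rules = [
    Rule("l", "r"),                                      # all l's become r's
    Rule("r", "ru", original_is="l", word_final=True),   # final "l": add a vowel
    Rule("r", "a", original_is="r", word_final=True),    # final "r": becomes a vowel
]
print(apply_rules("mabel", rules))    # maberu
print(apply_rules("spencer", rules))  # spencea
```

After the first rule both names end in an "r," but the later rules can still tell them apart because the original letter is preserved.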

That was basically the model that led me all the way to my final, finished product.  There were a few snafus – like how the training algorithm takes up an ungodly amount of memory, so I had to optimize since I was running it on my laptop with only 2GB of memory. I also only used a few rule templates and I even cut the training data from 4,000 to a little over 1,000 entries, based on which names were more popular in US census data.  But ultimately, I think the final model was pretty good.  Below are my test results, using a test set of 48 first and last names that were not in the training data (and which I mostly borrowed from people I know).

holly -> horii (gold: hoorii)
anderson -> andaason
damon -> damon (gold: deemon)
clinton -> kurinton
lambert -> ranbaato
king -> kingu
maynard -> meinaado (gold: meenaado)
lawson -> rooson
bellow -> beroo
butler -> butoraa (gold: batoraa)
vorwaller -> boowaraa
parker -> paakaa
thompson -> somupson (gold: tompuson)
potter -> pottaa
hermann -> haaman
stacia -> suteishia
maevis -> maebisu (gold: meebisu)
gerald -> jerarudo
hartleben -> haatoreben
hanson -> hannson (gold: hanson)
brubeck -> buruubekku
ferrel -> fereru
poolman -> puoruman (gold: puuruman)
bart -> baato
smith -> sumisu
larson -> raason
perkowitz -> paakooitsu (gold: paakowitsu)
boyd -> boido
nancy -> nanshii
meliha -> meria (gold: meriha)
berzins -> baazinsu (gold: baazinzu)
manning -> maningu
sanders -> sandaasu (gold: sandaazu)
durup -> duruppu (gold: durupu)
thea -> sia
walker -> waokaa (gold: wookaa)
johnson -> jonson
bardock -> barudokku (gold: baadokku)
beal -> beru (gold: biiru)
lovitz -> robitsu
picard -> pikaado
melville -> merubiru
pittman -> pitman (gold: pittoman)
west -> wesuto
eaton -> iaton (gold: iiton)
pound -> pondo
eustice -> iasutisu (gold: yuusutisu)
pope -> popu (gold: poopu)

Baseline (i.e. just using the English strings without applying the model at all):
Accuracy: 0.00
Total edit distance: 145

Model score:
Accuracy: 0.5833333333333334
Total edit distance: 28

(I print out “gold” and the correct answer only for the incorrect ones.)

The accuracy’s not very impressive, but as I kept tweaking the features, what I was really aiming for was low edit distance, and 28 was the lowest I was able to achieve on the test set.  So this means that, even when it makes mistakes, the mistakes are usually very small, so the results are still reasonable.  “Meinaado,” for instance, isn’t even a mistake – it’s just two ways of writing the same long vowel (“mei” vs. “mee”).

Anyway, many of the mistakes can be corrected by just using postprocessing heuristics (e.g. final “nn” doesn’t make any sense in Japanese, and “tm” is not a valid consonant cluster).  I decided I was satisfied enough with this model to leave it as it is for now – especially given I had already spent weeks on this whole process.

This is the model that I ultimately included with the Japanese Name Converter app.  The app processes any name that is not found in the built-in dictionary of 4,000 names, spits out the resulting roomaji, applies some postprocessing heuristics to obey the phonotactics of Japanese (like in the “nn” example above), converts the roomaji to katakana, and displays the result on the screen.
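
The last step, roomaji to katakana, is mostly a table lookup. Here is a minimal sketch of how it could work; it is not the app's code. The table covers only the basic syllables (none of the special combinations used for foreign sounds), a repeated vowel becomes the long-vowel mark ー, and a doubled consonant becomes the small ッ. The greedy longest-match strategy is my own assumption.

```python
VOWELS = "aiueo"
ROWS = {
    "": "アイウエオ", "k": "カキクケコ", "g": "ガギグゲゴ",
    "s": "サシスセソ", "z": "ザジズゼゾ", "t": "タチツテト",
    "d": "ダヂヅデド", "n": "ナニヌネノ", "h": "ハヒフヘホ",
    "b": "バビブベボ", "p": "パピプペポ", "m": "マミムメモ",
    "r": "ラリルレロ",
}
# Regular consonant + vowel syllables, e.g. "ka" -> カ.
KANA = {c + v: kana for c, row in ROWS.items() for v, kana in zip(VOWELS, row)}
# Irregular spellings and a few extras.
KANA.update({"shi": "シ", "chi": "チ", "tsu": "ツ", "fu": "フ", "ji": "ジ",
             "ya": "ヤ", "yu": "ユ", "yo": "ヨ", "wa": "ワ", "n": "ン"})


def to_katakana(roomaji):
    """Convert simple roomaji to katakana using greedy longest-match lookup."""
    out = []
    i = 0
    while i < len(roomaji):
        ch = roomaji[i]
        # A vowel repeating the previous letter is written as a long-vowel mark.
        if i > 0 and ch in VOWELS and ch == roomaji[i - 1]:
            out.append("ー")
            i += 1
            continue
        # A doubled consonant (other than "n") is written as a small tsu.
        if i + 1 < len(roomaji) and ch == roomaji[i + 1] and ch not in VOWELS + "n":
            out.append("ッ")
            i += 1
            continue
        for length in (3, 2, 1):
            chunk = roomaji[i:i + length]
            if chunk in KANA:
                out.append(KANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"cannot convert {roomaji[i:]!r}")
    return "".join(out)


for name in ["nooran", "maikeru", "sumisu", "pottaa"]:
    print(name, "->", to_katakana(name))
```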

Of course, because it only fires when a name is outside the set of 4,000 relatively common names, the average user may actually never see the output from my TBL model. However, I like having it in the app because I think it adds something unique.  I looked around at other “your name in Japanese” apps and websites, but none of them are capable of transliterating any old arbitrary string.  They always give an error when the name doesn’t happen to be in their database.  At least with my app, you’ll always get some transliteration, even if it’s not a perfect one.

The Japanese Name Converter is currently my third most popular Android app, after Pokédroid and Chord Reader, which I think is pretty impressive given that I never updated it.  The source code is available at Github.