I've been working on a personal project that needs to accept an arbitrarily inflected Russian word and convert it to a form that can be found in a dictionary.
кота -> кот
At first I was trying to write Python code that would do this directly from grammar rules. That was going OK, but it was hard. I had reached the point where I could generate several candidate forms and then look them all up in the dictionary to see which ones were real and which ones were silly.
кото -> silly
кот -> real
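Here is a minimal sketch of that generate-and-check idea. It is not the code from the project: the `ENDINGS` and `REPLACEMENTS` lists, the function names, and the tiny `DICTIONARY` set are all assumptions made up for illustration, and real Russian grammar needs far more rules than this.

```python
# Hypothetical, very incomplete list of endings to strip (longest first).
ENDINGS = ["ами", "ов", "ам", "ах", "ом", "а", "у", "е", "ы"]
# Hypothetical endings to try putting back after stripping.
REPLACEMENTS = ["", "о", "а"]


def candidates(word):
    """Return guesses at the dictionary form, in the order they were generated."""
    found = [word]  # the word may already be in dictionary form
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            for replacement in REPLACEMENTS:
                candidate = stem + replacement
                if candidate not in found:
                    found.append(candidate)
    return found


def real_candidates(word, dictionary):
    """Keep only the guesses that actually appear in the dictionary."""
    return [c for c in candidates(word) if c in dictionary]


# Stand-in for a real dictionary (assumption for this example).
DICTIONARY = {"кот", "окно", "книга"}

print(candidates("кота"))                   # every guess, silly ones included
print(real_candidates("кота", DICTIONARY))  # only the guesses that are real words
```

For кота this produces кото among the guesses, and the dictionary lookup is what throws it away.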
Gratuitous Soviet Tank Photo
Recently, I had the idea that maybe I could simplify things, and speed them up, by reversing my tactic. For a word list, I'm using the "frequency dictionary of Russian words." My code generates a table that maps every possible inflection of each word in the list back to a form I can look up in a translation dictionary. After just a couple of hours of hacking, I was able to find words to look up in a dictionary for about 90% of the words in the sample lenta.ru article I was using to test. I had probably spent 10 times that much effort on the previous version of my code.
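The reversed approach can be sketched roughly like this. Again, this is an illustration under assumptions, not the project's code: `MASCULINE_ENDINGS` is a simplified, hypothetical paradigm that only fits hard-stem masculine nouns like кот, and `inflect` and `build_table` are made-up names. Each inflected form maps to a set of dictionary forms, because one spelling can belong to more than one word.

```python
from collections import defaultdict

# Hypothetical, simplified paradigm for a hard-stem masculine noun such as "кот".
MASCULINE_ENDINGS = ["", "а", "у", "ом", "е", "ы", "ов", "ам", "ами", "ах"]


def inflect(lemma):
    """Generate every form of a lemma under the simplified paradigm above."""
    return [lemma + ending for ending in MASCULINE_ENDINGS]


def build_table(lemmas):
    """Map each generated inflected form back to the lemma(s) that produce it."""
    table = defaultdict(set)
    for lemma in lemmas:
        for form in inflect(lemma):
            table[form].add(lemma)
    return dict(table)  # plain dict, so a failed lookup does not create an entry


# Stand-in for the frequency word list (assumption for this example).
table = build_table(["кот", "стол"])

print(table.get("кота"))     # lemma(s) for a known inflected form
print(table.get("столами"))
print(table.get("кото"))     # None: a form that was never generated
```

With the table built once up front, handling a word from an article is a single dictionary lookup instead of a round of guessing and checking.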
The wise man, Paul Quist, once told me, while he was helping me bleed the clutch on a 1975 Peugeot station wagon: "If it's not working one way, try it the other way around."