Search on romanized Indic text is broken. This is a billion dollar opportunity [Dec. 11th, 2008|10:31 am]
Arvind Narayanan

I was trying to google the lyrics to totakashtakam (properly spelled tOTakAshTakam, but let's ignore that). I didn't know the name of the song, but I knew some words from it. I was lucky—it took only a dozen or so different spelling combinations to find what I wanted.

Sanskrit is an extreme example, but search is a huge problem for all Indian languages. The reason is that Indic words are romanized according to an idiosyncratic set of rules (hence the odd capitalization above). Naturally, people make spelling mistakes, and they do so in fairly predictable ways. Predictable if you're Indian, that is. Search engines currently don't take this into account, so searching for Indic words is a disaster.
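
To make the "predictable mistakes" idea concrete, here is a rough sketch of one way a search index might fold spelling variants together: reduce every romanized word to a lossy key, then match queries on the key instead of the exact spelling. The folding rules and the names `romanization_key`, `build_index` and `search` are assumptions made up for illustration, not a description of how any real search engine works, and a real system would need far more careful, language-specific rules.

```python
import re
from collections import defaultdict

# Hypothetical folding rules, chosen only for illustration.
# Each one erases a distinction that people commonly spell inconsistently.
_RULES = [
    (r"aa", "a"),                  # long vowel written as a doubled letter
    (r"ee|ii", "i"),
    (r"oo|uu", "u"),
    (r"([bcdgjkpst])h", r"\1"),    # optional "h" after a consonant (th/t, sh/s, ...)
    (r"(.)\1+", r"\1"),            # any remaining doubled letters
]


def romanization_key(word):
    """Reduce a romanized word to a lossy key shared by its spelling variants."""
    key = word.lower()             # drops the capitalization used for T vs t, A vs a
    for pattern, replacement in _RULES:
        key = re.sub(pattern, replacement, key)
    return key


def build_index(titles):
    """Map each key to the set of original titles that fold to it."""
    index = defaultdict(set)
    for title in titles:
        index[romanization_key(title)].add(title)
    return index


def search(index, query):
    """Return the indexed titles whose key matches the query's key."""
    return sorted(index.get(romanization_key(query), ()))


index = build_index(["tOTakAshTakam", "bhaja gOvindam"])
print(search(index, "thotakaashtakam"))
```

The obvious trade-off is that folding this aggressively will also merge words that really are different, so a key like this is better suited to generating candidates for ranking than to deciding a match on its own.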

That's not all: Indic words are inflected much more heavily than English words, so recognizing inflections matters even more for returning relevant results. There's also sandhi, the merging of sounds where words or word parts join, which is extremely common in Sanskrit-derived (i.e., most Indian) languages.

It's not hard to incorporate this knowledge into search. The problem is that nobody's even trying, because they haven't realized there's money in it. In Russia, a search engine called Yandex is apparently eating Google's lunch because Google can't do Russian inflections. A billion Indian people are rapidly getting online; it's about time someone made the world better and enriched themselves at the same time.