Nearest Neighbors using Beam Search
[Sep. 3rd, 2008|02:55 am]
Every website with a catalog of some sort (YouTube, Amazon, Netflix, Google News, and about a zillion others) needs an algorithm to decide which items are similar to each other. When you're watching a video on YouTube, for instance, the system needs to decide which videos to show in the sidebar. Mathematically, this is the problem of finding the "k nearest neighbors" of each item in your catalog.
This is considered a hard problem. The obvious way of doing it (comparing every item to every other item) has quadratic complexity, and just won't work when you have 10 million items no matter how much hardware you throw at it. There's a paper on how Amazon does it, and one on how Google News does it. They both use very complex algorithms. The problem is difficult enough that there's a whole company dedicated to solving it.
Although I've read the papers mentioned above, I approached the problem with a blank slate, and it paid off: I hit upon the idea of viewing the problem of finding k nearest neighbors as a state-space search problem. The novelty here is in picking an abstraction that is normally used when the state space is exponential and applying it to a situation where it is only linear. Of the dozens of state-space search algorithms available, I went with my intuition that "beam search", a slightly cleverer version of greedy hill climbing, would work. And it did (I've tested it on real data).
Beam search, as applied to this problem, maintains a "beam" of k candidates at every step, starting from k randomly selected items. At each step, it examines the k current neighbors of each of the k candidates in the beam and keeps the best k of the resulting k^2 choices. It repeats this a logarithmic (in the catalog size) number of times, and returns the beam at the end of the process.
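To make that step concrete, here's a minimal sketch of a single beam search in Python. Everything in it is illustrative, not from my actual system: items are stand-in 2-D points, `similarity` is just negative Euclidean distance, and `neighbors` holds the current (possibly still random) nearest-neighbor lists.

```python
import math
import random

def beam_search(query, neighbors, points, k, depth):
    """Search the neighbor graph for the k items most similar to `query`.

    neighbors: dict, item id -> current list of candidate neighbor ids
    points:    dict, item id -> 2-D coordinates (toy stand-in for item features)
    """
    def similarity(a, b):
        return -math.dist(points[a], points[b])  # toy similarity measure

    # start the beam with k randomly selected items
    beam = random.sample([i for i in points if i != query], k)
    for _ in range(depth):
        # expand: the k neighbors of each of the k beam members (up to k^2 choices)
        candidates = set(beam)
        for item in beam:
            candidates.update(neighbors[item])
        candidates.discard(query)
        # contract: keep only the best k candidates
        beam = sorted(candidates, key=lambda i: similarity(query, i), reverse=True)[:k]
    return beam
```

Swap in whatever similarity measure your catalog actually uses; the search itself doesn't care.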
Here's the bigger picture: I start by initializing the nearest-neighbor list of every item to k randomly selected items. Then I pick an arbitrary item, run beam search, and update its nearest-neighbor list. I iterate this process indefinitely. In a log-linear number of iterations, the nearest-neighbor lists converge to the optima. In my tests, I find the global optimum for 80-90% of the items (which is more than enough for my application), and close-enough neighbors for the rest. I can trade off the success ratio against the beam width and search depth.
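Putting the loop together, here's a self-contained sketch of the whole scheme under the same toy assumptions (2-D points, negative-distance similarity; the name `knn_beam` is mine, not from any library):

```python
import math
import random

def knn_beam(points, k, depth, iterations):
    """Refine all nearest-neighbor lists by repeatedly running beam
    search from randomly picked items, starting from random lists."""
    def sim(a, b):
        return -math.dist(points[a], points[b])  # toy similarity

    ids = list(points)
    # every item starts with k randomly selected "neighbors"
    neighbors = {i: random.sample([j for j in ids if j != i], k) for i in ids}

    for _ in range(iterations):
        q = random.choice(ids)                   # pick an arbitrary item...
        beam = random.sample([j for j in ids if j != q], k)
        for _ in range(depth):                   # ...run beam search on it...
            cand = set(beam)
            for item in beam:
                cand.update(neighbors[item])
            cand.discard(q)
            beam = sorted(cand, key=lambda j: sim(q, j), reverse=True)[:k]
        neighbors[q] = beam                      # ...and update its list
    return neighbors
```

A new item slots into the same loop: add it to `points`, give it a random neighbor list, and let the process pick it up.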
There are several neat things about this solution. First, it is dead simple conceptually and the basic version is about 20 lines of code. Second, it is ridiculously fast (logarithmic time per item), and consumes no extra memory beyond maintaining the nearest-neighbor lists. Third, it is an online algorithm: when a new item comes in, I just run beam search on that item and then go right back to running beam search on randomly picked items. I plan to run it as a permanent, low priority process on the server. Fourth, the online-ness maps neatly to what is known as "random-restart" in hill climbing. Fifth, and get this, the accuracy improves as more items come in. This is because more "points on the hill" get populated, and therefore the algorithm is less likely to get stuck in local maxima.
I'm pretty excited about this. If you've managed to read this far, two questions: Is the idea novel? Is it useful/interesting enough to merit a more formal writeup? The relationship between beam width and success rate is one interesting technical question that I haven't explored yet, and it would be worth looking into.