
Nearest Neighbors using Beam Search [Sep. 3rd, 2008|02:55 am]
Arvind Narayanan

Every website with a catalog of some sort (YouTube, Amazon, Netflix, Google News, and about a zillion others) needs an algorithm to decide which items are similar to each other. When you're watching a video on YouTube, for instance, the system needs to decide which videos to show in the sidebar. Mathematically, this is the problem of finding the "k nearest neighbors" of each item in your catalog.

This is considered a hard problem. The obvious way of doing it (comparing every item to every other item) has quadratic complexity, and just won't work when you have 10 million items no matter how much hardware you throw at it. There's a paper on how Amazon does it, and one on how Google News does it. They both use very complex algorithms. The problem is difficult enough that there's a whole company dedicated to solving it.
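For concreteness, here's the obvious quadratic baseline in Python (a sketch; `dist` stands in for whatever similarity metric the catalog uses):

```python
def knn_brute_force(items, dist, k):
    # The obvious O(n^2) approach: compare every item to every other item
    # and keep the k closest. Fine for small catalogs, hopeless at 10 million.
    return {
        a: sorted((b for b in items if b != a), key=lambda b: dist(a, b))[:k]
        for a in items
    }
```

At n = 10 million this is ~10^14 distance computations, which is why the papers above go to such lengths to avoid it.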

Although I've read the papers mentioned above, I approached the problem with a blank slate, and it paid off: I hit upon the idea of viewing k-nearest-neighbor search as a state space search problem. The novelty here is in taking an abstraction that is meant for exponential state spaces and applying it to a situation where the state space is only linear. Of the dozens of state space search algorithms available, I went with my intuition that "beam search", a slightly cleverer version of greedy hill climbing, would work. And it did (I've tested it on real data).

Beam search, as applied to this problem, maintains a "beam" of k possible candidates at any step, starting from k randomly selected candidates. During the next step, it examines the k neighbors of each of these k candidates and picks the best k out of the k^2 choices. It repeats this a logarithmic number of times, and returns the beam at the end of the process.
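A minimal sketch of one such lookup in Python (the names `neighbors`, `dist`, and so on are my own illustration, not from any real codebase; `neighbors` holds each item's current nearest-neighbor list):

```python
import random

def beam_search(query, neighbors, dist, k, depth):
    # `neighbors` maps each item to its current k-nearest-neighbor list;
    # `dist(a, b)` is the distance metric.
    beam = random.sample(list(neighbors), k)    # k random starting candidates
    for _ in range(depth):                      # a logarithmic number of steps
        candidates = set(beam)                  # keep the current beam in play
        for item in beam:
            candidates.update(neighbors[item])  # the k neighbors of each beam member
        candidates.discard(query)               # never return the item itself
        # keep the best k of the ~k^2 choices
        beam = sorted(candidates, key=lambda x: dist(query, x))[:k]
    return beam
```

Because the new beam is chosen from a set that includes the old beam, the best candidate seen so far is never lost, so each step can only improve.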

Here's the bigger picture: I start by initializing the nearest-neighbor list of every item to k randomly selected items. Then I pick an arbitrary item, run beam search, and update its nearest-neighbor list. I iterate this process indefinitely. In a log-linear number of iterations, the nearest-neighbor lists converge to the optima. In my tests, I find the global optimum for 80-90% of the items (which is more than enough for my application), and close-enough neighbors for the rest. I can trade off the success ratio against the beam width and search depth.

There are several neat things about this solution. First, it is dead simple conceptually and the basic version is about 20 lines of code. Second, it is ridiculously fast (logarithmic time per item), and consumes no extra memory beyond maintaining the nearest-neighbor lists. Third, it is an online algorithm: when a new item comes in, I just run beam search on that item and then go right back to running beam search on randomly picked items. I plan to run it as a permanent, low priority process on the server. Fourth, the online-ness maps neatly to what is known as "random-restart" in hill climbing. Fifth, and get this, the accuracy improves as more items come in. This is because more "points on the hill" get populated, and therefore the algorithm is less likely to get stuck in local maxima.

I'm pretty excited about this. If you've managed to read this far, two questions: is the idea novel? is it useful/interesting enough to merit a more formal writeup? The relationship between beam width and success rate is one interesting technical question that I haven't explored yet, and would be worth looking into.

From: patrickwonders
2008-09-03 10:37 am (UTC)
I don't really have the breadth to say how objectively novel this is.
It was totally new to me though. And, I like it a lot.
From: (Anonymous)
2008-09-03 05:26 pm (UTC)
Ditto. I have only read a small number of papers in this area, but I have never seen this idea before.

I think the idea is very elegant!
From: mmk
2008-09-03 06:45 pm (UTC)

Let's talk more

A couple of issues:
- is initial randomization optimal?

- what can you do to ensure coverage (or some approximation thereof)?

- how are you accounting for hubs or authorities differently?
(think LoTR/Star Wars/HP for movies as a hub, for example)
asked differently, does a uniform k make sense for all nodes?

Interesting though. I think that this might be worth pursuing on multiple data sets for broader understanding.
From: arvindn
2008-09-03 08:19 pm (UTC)

Re: Let's talk more

initial randomization -- i simplified it a bit. what i actually did was to pick 100 random items and keep the best k of those.

but the real answer is that the "initial" condition doesn't really exist: from now on, as i get new items, i'll update the lists in an online fashion.

if i were to start from scratch again, i would do it by starting with only k+1 items, each item being a neighbor of every other item, and then add items one by one.

coverage -- it improves with beam width, mainly. i'm not sure you can have any provable guarantees on coverage.

hubs -- nothing special now, but the plan is to compute the popularity of each item, and bias the similarity scores in favor of more popular items. that way you don't end up recommending a bunch of obscure crap.

if you're talking about the recommendations *for* hubs, not sure why i should do anything differently.

multiple data sets -- definitely. the main thing is that my metric is content-based, and i don't know if it will work if the metric is coming from user data, i.e., collaborative filtering. i can't see a reason why it wouldn't, though. hmm, maybe i should try it on the netflix data.

thanks for the comments.
From: iliada
2008-09-05 09:49 pm (UTC)
You may want to check out Kleinberg's paper: http://www.cs.cornell.edu/home/kleinber/stoc97-nn.pdf

In particular, it says that it is helpful to store pointers to objects that are far away, in addition to the closest neighbors.

A more recent paper with a more combinatorial slant is this one: http://yury.name/papers/goyal2008disorder.pdf
From: arvindn
2008-09-06 06:35 am (UTC)
thanks. the kleinberg paper (and many others like it) looks at the euclidean version. nevertheless, the idea of looking at objects that are far away is a good one. it reminds me of kleinberg's social networks paper with (i believe) tomkins, where they show that a small fraction of edges to far-away nodes reduces the graph diameter. i did think of this, but i haven't figured out a good way to incorporate it into my algorithm yet.

the combinatorial abstraction is really the only way to go when you're looking at the complexity of real world data. the notion of disorder constant is intriguing. i need to test it on my data. i'll read that paper in more detail. thanks!