Thoughts on privacy and anonymity [Dec. 13th, 2007|10:26 am]
Today there is an article about the Netflix work on the front page of Wired. There's been a blitz of publicity lately. I thought this would be a good time to address one of the most common issues that's raised in conjunction with our work, namely "what's the remedy?"

The scary story.
Vitaly and I have given a couple of talks on this topic, putting our Netflix paper in the context of a much larger body of anonymity-breaching papers that have come out in the last half-decade. The common thread is that in most types of published databases describing people, as long as there is enough entropy per-record, anonymity is toast.

This encompasses most interesting databases about people. The Netflix dataset, for instance, is half a gig compressed, and there are fewer than half a million people, which works out to almost 10,000 bits per user. You can do a more rigorous analysis, but the conclusion is the same.
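
As a rough sanity check on that figure, here is the back-of-the-envelope arithmetic as a small Python sketch. The constants are round numbers I'm assuming from the sentence above ("half a gig" read as 0.5 GB, and half a million users as an upper bound), and compressed size is only a crude stand-in for entropy, so treat the result as an order-of-magnitude estimate and not as the rigorous analysis.

```python
import math

# Assumed round figures taken from the text above; the exact file size
# and user count are not given here, so these are illustrative only.
COMPRESSED_BYTES = 0.5 * 10**9   # "half a gig", read as 0.5 GB
NUM_USERS = 500_000              # "fewer than half a million", so an upper bound


def bits_per_user(total_bytes: float, users: int) -> float:
    """Average number of compressed bits available per record."""
    return total_bytes * 8 / users


def bits_to_single_out(population: int) -> float:
    """Bits needed to pick one individual out of a population."""
    return math.log2(population)


per_user = bits_per_user(COMPRESSED_BYTES, NUM_USERS)
needed = bits_to_single_out(NUM_USERS)
print(f"{per_user:.0f} bits per user, {needed:.1f} bits needed to single one out")
```

With these round numbers you get about 8,000 bits per user (more if you plug in a smaller user count), against roughly 19 bits needed to single out one person among half a million. That gap of several hundred times is the point.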

At present we have no tools to remedy this situation. You can try to perturb the data all you want; it's not going to help. And this doesn't apply only when you're making the database publicly available -- it's perhaps even more pertinent to scenarios like securing your data from the prying eyes of your own employees, or sharing data with business partners.

What can be done?
1. Stop looking for legislative remedies. Here's an example of a stupid law: "in any released sample of your medical records, for each patient there should be at least 10 other patients with the same name." (What if you also know the patient's date of birth?) Figuring out exactly when privacy is breached is a hard math problem even for computer scientists. (example pdf) Attempting to come up with a legal definition is like attempting to legislate the value of pi.

2. It's not about covering your ass. This is the unfortunate attitude almost every company takes: privacy is an annoying legal requirement and you should do just enough so that it doesn't become a PR problem. Ironically, that's a good way to ensure that it will become a PR problem.

3. Be upfront about what you can and cannot protect. Guess what -- if Netflix had told users all their ratings would be public, more than 90% of users wouldn't have changed their behavior in any way. From the perspective of getting data for collaborative filtering, this isn't a significant drawback. On the other hand, the small percentage of users who do care would have chosen to stay away from submitting ratings on sensitive titles online, and everybody's happy.

4. Start putting some money into privacy research/development. There is an active research area on private computation on databases -- where you don't air your linen for the whole world to see, but instead run smart computational protocols by which you keep your data at all times, but someone can do research/statistics/whatever on it by making database queries, while you make sure they don't learn anything sensitive. (example)

More research needs to be done before it can be applied on the scale you need for modern datasets. And turning the research into actual tools is uncharted territory that will take money to explore. If corporations started taking actual privacy seriously, they'd realize that this is where they need to be spending money, not on lawyers and PR people.
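
To make point 4 a little more concrete, here is a toy Python sketch of the "you keep the data, they send queries" setup. It uses one technique from this research area, adding calibrated random noise to counting queries (usually called differential privacy), which is not necessarily what the example linked above describes. The class name, the budget scheme, and the records are all made up for illustration; a real deployment needs far more care than this.

```python
import random
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


class PrivateQueryServer:
    """Hypothetical data holder that only answers noisy counting queries."""

    def __init__(self, records: List[Record], epsilon_budget: float,
                 seed: Optional[int] = None) -> None:
        self._records = records            # the raw data never leaves this object
        self._remaining = epsilon_budget   # total privacy budget left to spend
        self._rng = random.Random(seed)

    def noisy_count(self, predicate: Callable[[Record], bool],
                    epsilon: float) -> float:
        """Return how many records match, plus Laplace noise of scale 1/epsilon."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # Adding or removing one person changes a count by at most 1, so
        # Laplace noise with scale 1/epsilon is the standard calibration.
        # The difference of two exponentials with rate epsilon is such a sample.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise


# Made-up toy records, purely for illustration.
toy_records: List[Record] = [
    {"user": i, "rated_sensitive_title": i % 7 == 0} for i in range(1000)
]
server = PrivateQueryServer(toy_records, epsilon_budget=1.0, seed=0)
answer = server.noisy_count(lambda r: bool(r["rated_sensitive_title"]), epsilon=0.5)
print(f"noisy count: {answer:.1f}")
```

The researcher gets a usable aggregate, the answer about any single person is blurred, and the budget stops someone from averaging the noise away by asking the same question over and over. Scaling this kind of thing up to realistic workloads is exactly the part that still needs research and money.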

