Arvind Narayanan's journal

How to protect your password from keyloggers [Feb. 8th, 2010|11:49 am]

Yesterday the Interwebs got to point and laugh at a hilarious customer service e-mail from American Express on password security. Every single paragraph in it was either wrong, or worse, not even wrong — it didn't rise to the level of coherence where the words 'correct' and 'wrong' are applicable.

There is one particular sentence that I want to talk about:
We discourage the use of special characters because hacking softwares can recognize them very easily.
Presumably, this means that keyloggers can detect that you're typing a password by observing that the sequence of keypresses has high entropy. I believe this is an actual technique that's used to identify password-like strings from a disk dump (although I'm unable to find the reference right now). However, I didn't think it made sense in the keylogging context, and indeed someone who says he's looked at a lot of keylogger data confirms that detecting when a password is typed is fairly trivial, regardless of what kind of characters your password uses.
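
To make the entropy idea concrete, here is a rough Python sketch of how one might flag password-like tokens in a dump of text. It is not taken from any real tool: the function names, the minimum length of 8 and the threshold of 3.0 bits per character are all arbitrary choices made up for illustration.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Empirical Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_like_password(token: str, threshold: float = 3.0) -> bool:
    """Crude heuristic: long enough and with few repeated characters.

    Both the length cutoff and the threshold are illustrative assumptions.
    """
    return len(token) >= 8 and shannon_entropy(token) >= threshold


# A token with 8 distinct characters has exactly 3 bits per character;
# an English word with a repeated letter scores lower.
print(looks_like_password("x7#Qp!9z"), looks_like_password("password"))
```

A real scanner would need something smarter than per-token character frequencies, but the sketch shows why a string of unusual characters stands out in a sea of ordinary text.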

So is there any hope for those of us who need to log in in a context where we suspect there might be a keylogger afoot, such as an Internet cafe or grandma's computer? Dinei Florencio and Cormac Herley of Microsoft Research discovered a simple trick to do just that. The idea is not to make it hard to detect when you're typing your password, but rather to make it hard to tell what the password is. I will let them describe it (with minor edits for brevity):
The trick lies in the fact that keyloggers employ very low level OS calls. The keylogger sees everything, but it doesn’t understand what it sees. The browser also sees everything, but it doesn’t use everything that it sees: it does not know what to do with keys that are typed anywhere other than the text entry fields, and lets them fall on the floor. The keylogger has no easy way to determine which keys are used by the browser and which fall on the floor.

Between successive keys of the password we will enter random keys. The string that the keylogger receives will contain the password, but embedded in so much random junk that discovering it is infeasible. We are exploiting the difficulty from the OS layer of determining how the GUI of an application handles events.
Got it? All you need to do is: after typing each password character, click to focus away from the password field, type a random character or two, then focus back to the password field.
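
Here is a toy Python simulation of the trick, just to show the two views side by side. It is only a model: the event representation, the function names and the `max_junk` parameter are invented for this sketch, and it assumes the keylogger records keystrokes only (no mouse or focus information).

```python
import random
import string


def interleave(password, rng, max_junk=2):
    """Return a list of (key, typed_in_password_field) events."""
    events = []
    for ch in password:
        # Typed while the password field has focus.
        events.append((ch, True))
        # Click elsewhere, type one or more junk keys, then click back.
        for _ in range(rng.randint(1, max_junk)):
            junk = rng.choice(string.ascii_letters + string.digits)
            events.append((junk, False))
    return events


def keylogger_view(events):
    """The keylogger sees every key but not where it landed."""
    return "".join(key for key, _ in events)


def browser_view(events):
    """The browser keeps only keys typed into the password field."""
    return "".join(key for key, in_field in events if in_field)


events = interleave("hunter2", random.Random(0))
print("keylogger sees:", keylogger_view(events))
print("browser gets:  ", browser_view(events))
```

The browser ends up with the password exactly as typed, while the logged string is at least twice as long and gives no marker for which characters count.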

The authors point out that: 1. all current keyloggers fail against this technique. 2. if everybody started doing this (which they won't), then keyloggers would find a way around it. I can think of many workarounds, but they are all difficult to implement (and have counter-workarounds). The strategy here is not to outrun the bear, but to outrun the slowest person. It simply isn't worth malware authors' time to go after the 1% (or less) of people who will use this. I have benefited from this technique myself when logging in from other people's computers (not that I don't trust them, I don't trust their OS :-)

Let me end by pointing out that this kind of research is essentially unpublishable in the scientific community (indeed, it was 'published' as a poster). Ironically, it is more useful to society than the vast majority of published papers. Of course, I'm not saying that journals and conferences should start accepting 'tricks' in place of deep research. Rather, there should be a way of measuring and rewarding impact outside of the monomaniacal publication/citation-count system. (For those of you who are already tired of me talking about this, I'm going to be harping on it for a long time to come.)

On middlemen [Jan. 18th, 2010|08:28 pm]

Here are a couple of comments I posted elsewhere, and I'm reposting them here because I noticed there's a common theme, which I will highlight.
Up until a few months ago I had the same reaction as most tech people on the subject of URL shorteners: they are messing with the Web and I wish they would die. Now I have a very different opinion. But first let me deal with the two common objections:

1. A URL shortener is an added point of failure. I wonder if people had a similar complaint back in the day when DNS started to become popular. When you add one more level of indirection you always add one more point of failure. I think it’s pretty likely that as the technology matures, the probability of failure will essentially vanish. We’ll see.

Linkrot in general used to be a big problem in the early days of the web, but doesn’t happen nearly as much anymore.

2. Short URLs hide information from users. If you’re building a consumer-centric website, the additional effort in expanding all the short urls is tiny, so I don’t see what the problem is here.

Now I get to the real point of this post: the one huge advantage of URL shorteners is that due to the extra layer of redirection, interesting things can happen at that layer, like publicly available click stats. Bit.ly already does this. Instant, public stats are going to be an incredible enabler of the real-time web. The significance seems to have gone more or less unnoticed so far, but I think it will become apparent in the months to come.
I've since noticed that goo.gl also plans public stats; whether or not the average web developer is aware of this trend, the big G is right on top of it.

On the topic of the Australian government forcing Google to censor search results there:
This issue has been popping up in different countries all over the world in the last few months. Most notably in China, of course, but also in India, France and Italy, where Google executives are in fact facing the possibility of jail time. See here and here.

The role of Internet middlemen in enforcing copyright was a key legal issue in the last decade. It appears that censorship will be the analogous issue for the next decade. We seem to have reached a relatively happy middle ground with copyright—middlemen have some responsibility, but a strong form of copyright protection has proved unenforceable, forcing many industries to innovate or die. We can hope that a similar thing will happen with censorship, with oppressive governments either collapsing or being forced to allow free speech.
I think I can distill several general ideas from these comments, most of which I hope will be controversial :-)
  • While the Internet has destroyed some middlemen (e.g., real estate agents), it has in fact created more middlemen.
  • One reason there are more middlemen is that these middlemen can be stacked one on top of the other, as in the bit.ly example.
  • Although Internet middlemen have haters both in the tech world and in the industries they replaced, I think middlemen are a good thing in general. They are good because they skim relatively little off the top, and because they can interface with each other in an automated way, without humans feeling the pain of dealing with multiple levels of middlemen.
  • Needless to say, they also tend to be vastly more efficient and cheaper than the pre-Internet middlemen they replace. In some cases, they might even play a role in fundamentally reshaping society and making it freer.
I certainly appreciate the dangers of middlemen, especially that they might become gatekeepers or monopolies. Indeed, I've attacked Craigslist on these grounds, and I don't like Apple's app store either. But the dangers need to be considered in balance against the benefit they provide, and opposing digital middlemen a priori because of the possibility of future abuse of power is silly.

Harvesting email addresses surreptitiously [Jan. 7th, 2010|08:42 pm]

I realized there's a simple way to harvest the Gmail address of anyone who visits your web page, assuming they're logged in to Google:

Embed a world-visible iframe on your page pointing to a document on Google docs. In a separate backend process, poll the doc every few seconds (while logged in yourself) to retrieve the list of people viewing it. (This list is displayed in the Google docs UI, so it has to be available; I haven't yet figured out the appropriate URL to query, which probably involves executing a bunch of Javascript.)

Have I missed anything? Is this widely known? I wonder if anyone's doing it.

Edit. I looked at a document in Firebug and the URLs are of the form
http://spreadsheets.google.com/fm/bind?hl=en&fmcmd=80&
id=tFMKXV2J2J5xVFq9dRpRrfg.12613261761483303999.1726472472965475770
&VER=6&lsq=1262925136228000&tfe=jc_78&gsessionid=eFoiXLiGEmA&RID=rpc&
SID=D50CBF4594E07F04&CI=0&AID=24&TYPE=xmlhttp&zx=w6yldwukwptv&t=1
(That link has now expired; there's a session ID in there.)

The result seems to be a JSON list that encodes all the operations that need to be performed on the front-end. I presume this behavior is part of GWT (the Google Web Toolkit). I've verified that email addresses are sent as part of the result of that query. Now all I need to figure out is how to construct that URL given a document. A simpler alternative would be to write a browser plugin. Anyone interested in helping me demonstrate this?

There are APIs that allow you to harvest a bunch of information about a person given their email address. I think the most powerful (malicious) use of this hack would be to identify a visitor within a few seconds, and exploit the fact that social engineering attacks are much more likely to succeed if you address the person by name and/or know some details about them.

The socialite game [Nov. 26th, 2009|11:09 pm]

Socialites are often famous for being famous. Although this description is used pejoratively, a variety of human activities center around social games where one gains power by associating with those who are already powerful. Power is defined purely in terms of the network of relationships, with no inherent measure of skill, quality or worth of an individual.

I will try to formulate this game mathematically. The hope is that the analysis will have some ability to predict and/or explain human social structures and behavior.

Consider a graph with N nodes or players. Players compete by creating or destroying edges in the graph. Mutual agreement is required for an edge to exist. Players have limited social capital, i.e., there is a bound on the degree of each node. All players have the same degree bound, which captures the fact that there is no intrinsic measure of strength of a node.

The goal of each player is to maximize their (relative) centrality in the network. This can be captured in various ways, but a natural measure is Eigenvector centrality, which is basically PageRank.

Unfortunately, as presented, the game is trivial because of an algebraic fact: for a connected regular graph, the only eigenvector of the adjacency matrix with no negative entries is (1, 1, ..., 1), up to scaling. Therefore socialites cannot exist; everyone is equally popular.
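
A quick numerical check of this, as a sketch in plain Python. The helper names and the two example graphs (a 6-cycle and a 4-node star) are my own choices for illustration. The iteration multiplies by A + I instead of A; the shift leaves the eigenvectors unchanged and only serves to stop the iteration from oscillating on bipartite graphs.

```python
def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I; returns scores normalized to sum to 1."""
    n = len(adj)
    x = [1.0 + i for i in range(n)]  # deliberately non-uniform start
    for _ in range(iterations):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(y)
        x = [v / total for v in y]
    return x


def cycle_graph(n):
    """Adjacency matrix of an n-cycle: connected, every node has degree 2."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][(i + 1) % n] = 1
        adj[(i + 1) % n][i] = 1
    return adj


def star_graph(n):
    """Adjacency matrix of a star: node 0 is linked to all the others."""
    adj = [[0] * n for _ in range(n)]
    for i in range(1, n):
        adj[0][i] = 1
        adj[i][0] = 1
    return adj


print(eigenvector_centrality(cycle_graph(6)))  # all scores equal
print(eigenvector_centrality(star_graph(4)))   # the hub scores highest
```

On the regular graph every player ends up with the same score no matter where the iteration starts, while the star, which is not regular, produces a clear winner.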

I see two main ways to make the model more realistic and make socialites possible: one is to make edges directional (and weighted.) This captures asymmetric power relationships between individuals, and enables hierarchical tree structures, among other things. The other is to remove the restriction that every player has the same degree bound. This would let us ask, "if some people have more social capital than others, can they leverage it to become vastly more influential?"

I played around with these alternatives for a few hours, but couldn't find what I wanted—a model that is halfway between the studies of game theorists such as a network creation game (which is too abstract to imply anything about human networks) and the techniques of social anthropologists, such as the well known 1977 Karate Club study by Zachary (which is not abstract enough to prove anything of general importance).

So I'm just going to leave this here on the off chance that it might inspire someone to take it forward.

[P.S. Two concepts that I think are important are stability and connectedness. A graph is k-stable if no k players can collude to improve each of their scores. By connectedness I refer to the standard graph-theoretic concept; I think it is important to study which types of initial conditions result in connected vs. disconnected social graphs, i.e., co-operative vs. competing power structures.]

The death of the printed book is closer than you think [Nov. 26th, 2009|04:21 pm]

I've been saying for a while that e-books are going to take over soon. Let me elaborate on that, now that I have some data to back up my claims.

First, let's get the obvious stuff out of the way. The Kindle seems to be following roughly the same adoption curve as the iPod. Barely two years after it was first released, everyone my age has at least played with one or knows someone who has one. Amazon has been pushing it massively and adoption is only going to accelerate with the recent price cut, international availability, and the emergence of serious competition.

Bezos announced back in May that 35% of book sales are on the Kindle when a Kindle version is available. How can that be, when penetration is still small in absolute terms? It's because Kindle owners are disproportionately voracious readers.

But my real point is about digital-only books. Let's ponder the consequence of the above 35% figure. An author reveals her numbers from being on the NYTimes bestseller list. The most striking number to me is the fact that her royalties are only 6-8%. I assume that the number is roughly the same for sales of the Kindle version. (If anyone knows otherwise, please let me know.)

On the other hand, Amazon shares 35% of revenue with the author for self-published books. In one sense that's unfairly low: Apple for instance shares 70% of revenue with app publishers. Still, it is five times higher than royalties from a traditional book publisher.

Let's do some math. Consider a typical book that sells for $14.99 in print and $9.99 on the Kindle. (The only thing that matters here is the ratio of 3:2; if the ratio is closer to 1, then my argument is even stronger.) Let's assume that half of all sales are via Amazon, and the rest are through physical bookstores. That means that Kindle sales are 35% × ½ = 17.5% of the total. Let's say an author were to self-publish digitally with Amazon, and thus forgo all non-Kindle sales, but maintain the same volume of Kindle sales as they would get with a publisher. If N is the total number of copies sold, their revenue with the self-publishing route would be 0.175 N × 9.99 × 0.35 = $0.61 N. With the traditional route, the revenue would be (0.175 N × 9.99 + 0.825 N × 14.99) × 0.07 = $0.99 N.
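
The same arithmetic as a small Python sketch, per copy sold (so N = 1). The prices, the 35% revenue share and the 7% royalty are the figures used above; the function names and the break-even calculation are added for illustration and inherit all the simplifications of the model.

```python
PRINT_PRICE = 14.99
KINDLE_PRICE = 9.99
SELF_PUB_SHARE = 0.35     # Amazon's revenue share for self-published books
PUBLISHER_ROYALTY = 0.07  # midpoint of the 6-8% royalty range


def self_published(kindle_fraction):
    """Author revenue per copy when only the Kindle sales are kept."""
    return kindle_fraction * KINDLE_PRICE * SELF_PUB_SHARE


def traditional(kindle_fraction):
    """Author revenue per copy with a publisher, print plus Kindle."""
    print_fraction = 1 - kindle_fraction
    gross = kindle_fraction * KINDLE_PRICE + print_fraction * PRINT_PRICE
    return gross * PUBLISHER_ROYALTY


def tipping_point():
    """Kindle fraction s at which self_published(s) == traditional(s)."""
    a = KINDLE_PRICE * SELF_PUB_SHARE
    b = KINDLE_PRICE * PUBLISHER_ROYALTY
    c = PRINT_PRICE * PUBLISHER_ROYALTY
    # Both sides are linear in s: a*s = b*s + c*(1 - s).
    return c / (a - b + c)


kindle_fraction = 0.35 * 0.5
print(round(self_published(kindle_fraction), 2))  # 0.61
print(round(traditional(kindle_fraction), 2))     # 0.99
print(round(tipping_point(), 3))                  # 0.273
```

Under these assumptions the break-even point is when Kindle sales reach roughly 27% of all copies sold, compared with 17.5% today.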

Conclusion: Kindle penetration is already three-fifths of the way to the crucial tipping point, where kicking out your publisher generates more royalties. This is no doubt a simplified model, and ignores several factors:
  • The publisher gives you an advance, which might make it attractive to authors without a financial cushion.
  • The publisher has an advertising budget to spend on your book.
  • On the other hand, without a publisher, web-savvy authors would be better able to leverage social media to get the word out about their books.
  • Amazon will probably start sharing more revenue with authors, making the self-publishing route even more attractive.
  • On the other hand, publishers will probably increase royalties for Kindle sales, due to the same pressures.
  • I've ignored the fact that you can self-publish on multiple platforms, although none are as yet competitive with the Kindle.
In spite of the shortcomings and shortcuts, I think my model provides a good ballpark estimate, and I trust the prediction that we are close to a tipping point. Just as Radiohead generated a lot of free publicity (and hence extra sales) by breaking ground with their "pay what you want" model, the first major author to self-publish will generate a lot of publicity. I predict that we're no more than a couple of years away from this happening.

Far more interesting than what digital books will do to the head of publishing is what they will do to the tail. The idea of the struggling author trying to land a deal is a cultural staple, but one that exists purely because publishers have had a monopoly on distribution channels. With ebooks, someone who thinks they are a great writer doesn't have to wait and beg to be discovered—they can find out for themselves by self-publishing, promoting their work on Facebook and Twitter, and seeing what kind of response they get.

People will continue to read printed books for a long time, just as some people still watch movies on VHS. But the printed book will be "dead" in a few short years in the sense that the bulk of the adoption curve, the pragmatic majority, will have moved on. For the first time in history, the discovery of writing talent will depend more on skill and persistence than on luck. And the notion of the book itself will morph to occupy an entire spectrum—traditional linear, textual narrative on the one end and videogame-like interactive, graphical narrative on the other.

I can't wait.

Update. Here is another author who reveals his numbers, including Kindle royalties and self-publishing revenues. His calculations are similar to mine, as are his conclusions:
I don't think I'll ever take a print contract for less than $30,000 per book, because I'm confident I could make more money [with ebooks] over the course of six years than I could with a publisher over six years.

Isn't that bizarre?

For the bestselling author, this is all still very trivial. These numbers are chump change compared to the advances they get.

But for the midlist author, I'm beginning to think it's possible to make a living without print contracts.

I've struggled mightily to break into print. And I've made a nice chunk of change on my print novels.

Now I'm hoping those novels go out of print, so I can get my rights back.

I never would have guessed my mindset would change so dramatically in so short a time.

Putting your money where your mouth is [Nov. 15th, 2009|01:22 pm]

I often marvel at the human capacity for self-deception, one aspect of which is the fact that people often haven't a clue what they really believe. There's no better way to demonstrate this than to get them to bet on something they claim to be a sure thing.

Dick Lipton tells the depressing story of trying to get another prominent computer scientist (Ken Steiglitz) to bet on P != NP. Steiglitz started out claiming the odds were a million to one, but when forced to bet, he wouldn't take odds longer than two to one.

Two to one. So a respected scientist was 500,000 times less certain of his opinion than he claimed to be — on a question that was within his sphere of competence. What does that tell us about the certitude that we each feel on issues relating to the economy or the environment? We're simply lying to ourselves.

A few years ago, when intrade was popular, I went around asking people why they weren't cashing in by betting on their 'sure' beliefs. If they were right, they could double their money in mere months. It was amusing to listen to people's rationalizations and excuses.

I think it would be a fun exercise to bet on one's views, regardless of whether one feels certain of them or not.

Detecting a tech bubble [Nov. 9th, 2009|12:28 pm]

The graph below compares the NASDAQ with the Dow over the last decade and a half. The tech bubble sticks out like a sore thumb, which makes sense since the NASDAQ is technology-dominated. On the other hand, the current recession affected both indices in exactly the same way, since it wasn't tech-related.


Back in 2007, the Richter Scales made a famous video which made the point that we were in another tech bubble that was about to pop. It was scary, and at the time I didn't know whether to believe it or not. But now I wonder if the NASDAQ — DJIA spread might be a simple and reasonably reliable indicator.

I'm not very knowledgeable about finance and the economy; any thoughts?

Funny bug [Nov. 9th, 2009|09:26 am]

It's winter, and naturally, my bathroom floor is cold in the mornings. When I step into the shower, and warm water hits me, it apparently confuses the hell out of the temperature sensors in my feet, and I feel like I'm standing on ice and fire at the same time.

Bookmarklet to bypass NYT registration [Nov. 8th, 2009|09:12 pm]

Bugmenot stopped working for me yesterday for no discernible reason. The New York Times is the only website I use it for, so I whipped up a little piece of Javascript to bypass the registration on NYT.
 javascript: ( function()
 {
   window.location.href =
     "http://www.google.com/url?sa=t&url="
     + window.location.href.split('URI=')[1]
 })()
The way to use it is to create a bookmarklet and paste the code above as the URL. When you get to a NYT "registration required" page, click the bookmarklet and it will let you in.

Even though the NYT seems by far the most web-savvy of the major newspapers, they fundamentally fail to understand the web and won't survive for long in their current form. The same goes for the rest of the newspaper industry. Numbers don't lie. In the meantime, enjoy the bookmarklet.

We are still a society of nature-worshippers [Oct. 9th, 2009|08:43 pm]

Evolution has come up with many, many clever designs over the eons, and engineers have a lot to learn from studying nature. On the other hand, it is equally true that on average the designs in nature are riddled with inefficiencies, bugs and tortuous mechanisms at every level of complexity. Most people don't realize this, and are in fact repelled by the idea.

Since I've been learning about the human genome as part of my current research project, I often find myself explaining to other computer scientists how some aspect of genetics works. At some point they interrupt me to ask, "but wouldn't it be way more efficient to instead do... ". When I tell them it certainly would, but that evolution has never managed to figure it out, they are usually surprised. But it's true: evolution has explored only a tiny, tiny part of the design space. I find it ironic that if only scientists did a better job of pointing out all the ways in which nature has failed to find good designs, the man on the street would have an easier time believing that there is no intelligent designer.

In particular, co-operative strategies never occur in nature unless they form a game-theoretic "stable equilibrium"; that is, even from a purely selfish perspective, it must be advantageous to follow the strategy. (This is a slight oversimplification.) Worse, the strategy needs to be beneficial not to the individual, but rather to the genes. This is a highly unintuitive idea to wrap one's head around, and it is very easy to fall back into fallacious ways of reasoning even after learning it. (The Selfish Gene is still the best and most enjoyable text on this, 33 years after publication.)

I could go on talking in the abstract, but I'm not going to change anyone's mind. Let me instead leave you with a fun excerpt to chew on from Lions: A Photo Essay that provides insight into the reasons why we idolize nature:
Anybody who has seen a documentary "knows" that lions hunt cooperatively to bring down prey. Unfortunately, nobody seems to have told the lions this. Indeed, for many years field biologists who study lions have realized that cooperative hunting is an illusion. ... So, how come the Discovery Channel says they are cooperative? Partly it is because these figures are buried in Appendix B of Shaller's book, or in dense academic papers. Mostly it is because the story of cooperative strategy in hunting is so endearing to people. Especially to film editors, which means we are destined to seeing cooperation in every nature documentary. The cases where the hunting fails due to lack of cooperation end up on the cutting room floor.

Watching lions hunt, the trends are quite obvious. The primary reason that groups of lions are no more effective than two by themselves is that typically only two lions do the actual hunting. They all make a show of hunting, but in the cases I watched, in several different prides, there were always a couple females that were the most aggressive and took the lead. The others hang back for the hard part then rush up at the end after the worst danger is over. Their primary goal is to be at the kill early so they can eat, not to actually help. Field studies have confirmed that lions do not seem to keep track of this and punish slackers.

Lions can seem quite inept at hunting, because they have no way to communicate complicated information. The Discovery Channel case happens when one lion flushes prey in past another for a perfect catch. More often, what happens is that one lion blows it and scares the game too early, or flushes it in the opposite direction. After watching hunt after hunt fail, you soon decide that lions are not very coordinated. Indeed their only saving grace is that the buffalo can't communicate very well either.
And this is just delicious:
Lions don't wait to kill the animal before starting the process of eating it—as soon as the buffalo stops thrashing, lions start to eat. This is much harder than it sounds however, because the hide is very thick. The prime spot to start is always claimed by the dominant female, or if the male is there, he takes the prime spot. The prime spot is not what you might think—it is the rectum. Believe it or not, the king of beasts starts his dinner by carefully licking the rectum clean. Since the buffalo defecates while dying this is a bit messy. The lion then works very hard to gnaw through the skin and get an incision open.
Sorry, couldn't help it :-P

Haskell [Sep. 16th, 2009|03:07 am]

I started learning Haskell today. After a couple of hours of learning the syntax, I decided to dive in and write a function to shuffle a list, because linear-time permutation is non-trivial in pure functional languages. (The standard idiom translates to a quadratic algorithm because it is fundamentally destructive.)
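
For reference, the "standard idiom" is presumably the in-place Fisher-Yates shuffle (that identification is my assumption). Here is a minimal Python sketch of it; the function name is made up for illustration. The swap on each step is exactly the destructive update that has no constant-time counterpart on an immutable list.

```python
import random


def fisher_yates_shuffle(items, rng=random):
    """Shuffle `items` in place in O(n) time using n - 1 swaps."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        # Destructive update: overwrite two slots of the same array.
        items[i], items[j] = items[j], items[i]
    return items


deck = list(range(10))
fisher_yates_shuffle(deck, random.Random(42))
print(deck)
```

Replacing the array with a linked list makes each "swap" cost O(n) for the traversal, which is where the quadratic behavior comes from.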

I think I can get linear time by assuming that the associative array operations in Data.Map are O(1), but in reality they have a logarithmic cost, since they are implemented as balanced binary trees, unlike Python's dictionaries, which use hash tables. Besides, this was literally day 1 so I wasn't comfortable jumping into Data.Map.

So I ended up with something not particularly satisfactory, but I did learn some Haskell in the process. Any comments/criticism greatly appreciated.

Edit. Does anyone know if linear-time permutation is even possible in a functional setting? I thought I found a paper that did that but it turned out to be a simpler problem.

Bigger delicious.com dataset -- 1.25 million entries [Sep. 14th, 2009|03:55 pm]

As promised, here's a much bigger dataset of delicious.com bookmarks with around 1.25 million entries, weighing about 170 megs gzipped. I'm no longer collecting data, so this will probably be the final release. It includes every bookmark in a more-or-less contiguous period spanning the last 10 days or so.

Folksonomy dataset for NLP (delicious.com bookmarks) [Sep. 7th, 2009|04:03 am]

Several of my current projects have to do with Natural Language Processing. Over the weekend, I built a topic categorization engine; one of the datasets that I used was bookmarks from delicious.com. I was mainly interested in the tags and the relationships between them.

Delicious doesn't make this data available for bulk download, but they do have an RSS feed of site-wide bookmarking activity. The average rate is slightly over 1 new (public) bookmark per second. I've been slurping up the feed for the last couple of days, so I have a dataset of around 200k bookmarks. You can grab it here. The format is JSON, one entry per line — trivial to parse.
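
As a sketch of what "trivial to parse" means, here is how one might count tag frequencies in Python. The field names `url` and `tags`, and the two sample lines, are assumptions made up for this example; check an actual entry in the file for the real keys.

```python
import json
from collections import Counter


def count_tags(lines):
    """Count tag occurrences in an iterable of JSON-encoded lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        # "tags" is a hypothetical field name, see the note above.
        counts.update(entry.get("tags", []))
    return counts


sample_lines = [
    '{"url": "http://example.com/a", "tags": ["python", "nlp"]}',
    '{"url": "http://example.com/b", "tags": ["nlp"]}',
]
print(count_tags(sample_lines).most_common(1))  # [('nlp', 2)]
```

Since the function takes any iterable of lines, an open file object can be passed in directly, so the whole dataset never has to sit in memory at once.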

I plan to leave my bot running until I have at least a million entries, at which point I will do another release. If anyone wants the source, I'd be happy to share, although it shouldn't take more than half an hour to write it yourself.

To learn more about the nifty data-mining possibilities of social-bookmarking data, see folksonomy. If you find the data useful, drop me a note and let me know. Enjoy.

How to sample in constant time [Jul. 30th, 2009|01:24 am]

I can't remember the last time I felt like a bigger idiot than I do right now.

I spent like forever writing up my ugly solution to the constant time sampling puzzle. With exercises for the reader, no less. Dear god, the pomposity. Just as I was about to post it, I found this extremely elegant solution that can be explained in 30 seconds (slides 41-48).

Of course, I never thought I'd discovered something new; that would have been downright delusional. On the contrary, I thought this must be folklore knowledge, and therefore perhaps never actually published. I did some Googling before I started writing and even asked a couple of statisticians, but got no answer. I guess I just didn't look hard enough.

On another note, this is also a sad reminder of how much my puzzle solving skills have degenerated since my IMO days.

At any rate, I encourage you to have a hearty laugh at my expense.

Edit. Fixed broken link.

Programming puzzle [Jul. 24th, 2009|10:40 am]

If you're a programmer, I suggest giving this puzzle a shot because you're likely to encounter this problem a few times in your programming career :-)

You're given N objects and a probability distribution function over these objects specified as an array [x_1, x_2 ... x_N] of arbitrary nonnegative real numbers. Write a function that generates independent samples from this distribution. An O(N) pre-processing step is allowed, but after that, each new sample should be generated in expected constant time.

Any takers? Too easy?

Edit. Assume that you can generate a uniform real number in the interval [0, 1] at unit cost.
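
Spoiler alert. One well-known technique that meets these bounds is the alias method (in Vose's formulation). The Python sketch below is my own illustration of it, not necessarily the solution anyone had in mind; the function names are made up. Preprocessing is O(N), and each draw costs O(1) in the worst case, which is stronger than the expected-time bound asked for.

```python
import random


def build_alias_table(weights):
    """O(N) preprocessing: returns the (prob, alias) tables."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive weight")
    # Rescale so that the average weight is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        lo = small.pop()
        hi = large.pop()
        prob[lo] = scaled[lo]  # column lo keeps this much for itself
        alias[lo] = hi         # the rest of the column is donated by hi
        scaled[hi] -= 1.0 - scaled[lo]
        (small if scaled[hi] < 1.0 else large).append(hi)
    # Leftover columns keep prob 1.0; they differ from 1 only by rounding.
    return prob, alias


def draw(prob, alias, rng=random):
    """O(1) sampling: pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


prob, alias = build_alias_table([1, 2, 3, 4])
print([draw(prob, alias) for _ in range(10)])
```

The two random calls in `draw` could be replaced by a single uniform real u in [0, 1): the integer part of u × N picks the column and the fractional part serves as the coin flip.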

Theory blog aggregator: stats and directions [Jul. 20th, 2009|12:08 am]

Traffic to the Theory of Computing Blog Aggregator has more than doubled over the past year, and is now at 16,000 visits/month.


[Chart: visits per month]


Incidentally, the readership is tech-savvy, as you might expect — more than half the visits come from Firefox, Safari is the second most popular, MSIE barely breaks 10% and Chrome is almost tied with MSIE. Opera traffic is a rounding error, and bots aren't shown on the chart.


[Chart: visits by browser, June 2009]


Given that Theory of Computing is a fairly niche community, the amount of traffic is surprising for a site that was slapped together with a weekend's worth of effort. Combined with the fact that the RSS reader is dying, it seems clear to me that topic aggregators have a great future.

So I'm wondering if it's time to update the aggregator with some new features. The problem with this is that I already worked on a snazzy Web 2.0 UI (hit '?' for keyboard shortcuts), which turned out to be much worse for usability than the current avatar.

Without going that route, here are some of the things I'm thinking of doing:
  • Write special-purpose code to import comments from WordPress blogs. Currently the aggregator imports Blogger comments but not WordPress comments, because the latter works in a different way. Blogger and WordPress cover the vast majority of the blogs, so that's all I really care about.
  • Polish up the recent comments feature (see the Web 2.0 UI link above for a demo) and implement it on the main site.
  • Import TOC-related tweets. I will use a special hashtag (#cstheory?) to import them. This will be combined with either a blacklist or a whitelist of Twitter users, in case anyone tries to abuse it.
  • Set up a bunch of other aggregators.
Thoughts?

Amazon is not the enemy [Jul. 17th, 2009|05:40 pm]

Everyone who's flipping out about the Orwellian Kindle fiasco needs to relax.

Yes, Amazon did something stupid. Yes, DRM is bad for consumers and bad for society. But Amazon is not the enemy here, the publishers are. Right now, the balance of power is still with the publishing houses, so Amazon needs to play with them. When it comes to DRM, people at Amazon "get it." They're techies, after all. If they do evil things, it's because their hands are tied. It is the publishers who are pushing for DRM. Market forces are going to push them out of existence, and they will do all they can to prolong their misery.

Dramatic as the current incident may be, the same story has already played out with Apple and the iTunes store. Apple bent over backwards for the labels in order to sign them up to sell their wares online, but once they came on board, the power shifted to the technology companies, and Jobs turned around and attacked DRM (the "Thoughts on Music" letter.) By now, just two years later, everyone agrees that music DRM is on its way out and the future of recorded music is Free.

So please do express your outrage, but make sure you have the right target :-) Do complain to Amazon about how much DRM sucks, go write reviews of the product on their site and others, but don't "boycott" the Kindle. That's just plain counterproductive — books aren't magically going to become DRM-free without first becoming digital, and there's no way that's going to happen except under whatever terms the publishers choose to impose.

There's still quite a ways to go before all the major publishers succumb to the pressure of growing Kindle sales and start offering their content digitally. For that to happen, the Kindle userbase needs to grow a lot. Paradoxically, now, more than any other time, the Kindle needs your support.

Oh, and while I'm at it, can we stop making silly comparisons between the Kindle and Sony's or whoever's two-bit book reader that no one's ever heard of? Without the EVDO network and Amazon's catalog behind it, it's not even the same category of device as far as the average person is concerned. Thanks.

Dreams and the self [Jun. 22nd, 2009|09:58 am]

In last night's dream, a T-Rex was trying to eat me. It went on for the longest time; it was horrible. I wish the T-Rex had cornered me at some point and said "game over, motherfucker!" so that I could have gone on to the next dream or whatever. But that's the funny thing: apparently you can never die in your own dream. I certainly never have.

I frequently fly in my dreams (this has been by far the most common recurrent motif); I hear that a significant proportion of the population does this as well. Do you? I wonder what it means for humanity as a whole.

Another thing that fascinates me about dreams is that since your conscious processing is subverted, you can pretty much simulate multiple entities in a way that you can't when you're awake. For example, this dude I was once talking to in a dream sat with his hands firmly planted in his pockets, and I was wondering why, until he pulled out a gun on me much later. Of course, my brain knew all along why he had his hands in his pockets -- both my character and the adversary are being simulated inside my head -- but it chose to keep that fact from me, just for shits and giggles. Or sometimes my character would need to sum a sequence of numbers, and find to his surprise that the sum is exactly (say) 100. While I can of course sum numbers in my head, quickly coming up with a sequence of random-looking numbers that sum to 100 seems beyond my wakeful arithmetical ability. The only conclusion is that my brain was working on the sum for a prolonged period of time -- and chose not to tell me.

The amazing, amazing thing is that even though almost everyone goes through these experiences when asleep, almost no one realizes that their waking consciousness is also a similarly fragile illusion. Even seeing the breakdown of coherent consciousness in other people -- such as patients with a severed corpus callosum, who develop two distinct personalities -- doesn't seem to help. Nor do studies showing that our consciousness is merely "informed" of the decisions we make, said decisions actually having been made in the subconscious several seconds before we are even aware of making them. People have a remarkable ability to ignore any evidence that contradicts their model of the self.

If it isn't enough that most of the problems in the world are caused by our outsized egos and our little worlds that revolve around ourselves, chew on the fact that the basis for ego rests fundamentally in a fallacy :-)

Bioinformatics needs to be democratized [Jun. 15th, 2009|12:39 pm]

I've started working with genome data, so I've been giving myself a crash course in genetics, especially the human genome, over the past week or two. I'm almost up to speed — I can now understand paragraphs like this without needing to look anything up:
We identified a 47 Kb interval containing an Alu insertion polymorphism (DXS225) and four microsatellites in complete linkage disequilibrium in a low recombination rate region of the long arm of the human X chromosome. This haplotype block was studied in 667 males from the HGDP-CEPH Human Genome Diversity Panel.
Just two weeks ago I would have been completely lost. That isn't your basic Mendelian genetics either, it's a research paper from 2007.

Learning a new set of concepts over a short period of time by immersing yourself in it is an intense experience, and one that I thoroughly enjoy. That's why privacy research has been so rewarding for me — it has given me a chance to read tons of papers in law, economics, sociology and now genetics, not to mention many, many areas of computer science.

My main aim was to understand the math behind meiosis. While it isn't very hard, it is still an area of active research, and our knowledge of it is incomplete. Consequently, there doesn't seem to be a way to get to it without cutting through layers of biology. Even the really basic stuff, like chromosomal crossover, is generally explained in a tedious way using observations of traits in plants and fruit flies. This is because of an accident of scientific history: until DNA sequencing became a reality, the only way to learn what happens during meiosis was to observe the results when animals mate and make inferences about the genome based on that. It's like the story of the blind men and the elephant.

More generally, bioinformatics seems to be populated by people whose background is primarily in genetics rather than in computer science. When computer scientists do study the genome, they end up spending their time on something inane like proving the NP-hardness of some computation on genomic sequences, apparently oblivious to the fact that human DNA does not consist of worst-case strings constructed by malicious adversaries! Even on basic information-theoretic questions such as the amount of entropy in the human genome, the best I could find is a marginally related, speculative blog post.

Someone needs to write "The Facts of Life, for Computer Scientists." I'm confident that any competent Python hacker with a solid knowledge of algorithms and statistics could read that document, learn the basics in a few hours, download some data from HapMap, install a library of useful tools with a single apt-get command, and start producing useful code and generating interesting hypotheses, all in the course of an evening!

Bioinformatics needs to be democratized. There is a gigantic amount of data available, but the people who are producing the data aren't necessarily the ones best equipped to play with it. On the other hand, there is a huge community of hackers who would like to do just that, but don't realize how easy it is. If you could get these two groups to talk to each other, amazing things could happen.

Why are there so many theory bloggers? [Jun. 9th, 2009|09:35 am]

Theory seems to have by far the most bloggers of any subfield of Computer Science; I'm sure I'm not the only one wondering if this is more than coincidence. Here are a few possible reasons:

Theory is cohesive. Most pairs of theorists find each other's work somewhat interesting. At the same time, the field is unlike most of Computer Science, with its emphasis on proof and disregard of experiment. I doubt that this level of cohesion can be found elsewhere in CS: "systems" is too vague and broad, while most other areas — AI, data mining, information retrieval, semantic web, databases, logic, formal methods, programming languages, compilers, architecture and graphics — fall on a massive spectrum.

Missing from that list are crypto, security and privacy. Cryptographers, in my experience, are generally horrified by the prospect of saying anything publicly that isn't heavily peer-reviewed, so that's out. Security and privacy are highly interdisciplinary, so that's not a cohesive subfield either.

In a small way, my blog aggregator may have contributed to the cohesiveness of theory bloggers by fostering a sense of community.

Theory doesn't get press. Except for the occasional journalist making an amusingly feeble attempt to explain P =? NP to the lay public in the context of the Clay Math Institute prize, theory stays out of the press because it doesn't generate pageviews.

In most other fields, important papers have a non-zero probability of getting written about (in the case of graphics papers announcing new techniques, it is virtually guaranteed since they are accompanied by jaw-dropping animations.) Consequently, theorists need a way of spreading the word about papers that are important/interesting. Word of mouth and best paper awards only go so far in the 21st century.

Theory is hard. Don't get me wrong — I'm sure research in other fields is just as hard to perform, but in my opinion, theory papers are particularly hard to read, especially for newcomers, simply because of the high degree of abstraction. This gives theory authors a strong incentive to explain and motivate their papers in more readily understood terms, and blogging is a great way to do that.

All that said, a big part of it is probably pure chance. Specifically, Lance Fortnow's pioneering blog may have convinced many theorists, by setting an example, to shed the belief that blogging is a frivolous, vulgar activity indulged in solely by the unwashed masses, far too undignified for solemn researchers such as themselves :-)
