Arvind Narayanan's journal

StumbleUpon Considered Harmful [Jun. 24th, 2011|10:03 pm]

About a week ago I noticed a large, anomalous traffic spike on one of my articles. These visitors seemed to bounce immediately, not viewing any other pages, and were much less likely to engage with the page in any way, such as commenting. Numerically, this traffic source contributed about 75% of the total for that article, but only 2 of the 64 tweets came during the time window of the spike, which means that these visitors were about 100 times less likely to engage with the article than others. An admittedly crude measurement, but even if it's within an order of magnitude, it means that this is a "poor quality traffic source" in SEO parlance -- an extraordinarily poor one.
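
For anyone who wants to check the "about 100 times" figure, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and assumes (crudely) that tweets posted during the spike window came from spike visitors and that tweets are a fair proxy for engagement.

```python
# Figures taken from the paragraph above.
spike_traffic_share = 0.75   # share of pageviews that came from the spike
total_tweets = 64
spike_tweets = 2             # tweets posted during the spike window

other_traffic_share = 1 - spike_traffic_share
other_tweets = total_tweets - spike_tweets

# Tweets per unit of traffic for each group (arbitrary units; only the ratio matters).
spike_rate = spike_tweets / spike_traffic_share
other_rate = other_tweets / other_traffic_share

ratio = other_rate / spike_rate
print(f"Other visitors tweeted about {ratio:.0f}x as often per pageview")
```

The ratio comes out at roughly 93, which is where "about 100 times" comes from.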

Then I glanced at the referer chart and noticed that the source was StumbleUpon. At once it all made sense.

Let me explain. The average StumbleUpon user turns to the service when they're bored, so bored they can't even go to the trouble of endlessly clicking on links on web pages like most of us do. Instead they click repeatedly on the "Stumble" button which takes them to random web pages supposedly somewhat tailored to their interests. They're not in it to read the articles (any more than someone who's flipping through the pages of Playboy is in it to read the articles). Instead they're in it for the tiny dopamine spike that they get each time they land on a new page.[1]

Nine times out of ten, such a user will bounce immediately after looking at the title of your article, deciding that it's not something they're interested in. If they do start reading, a further nine times out of ten they'll bounce somewhere into the second paragraph. If you don't believe me, try using the product, and see how quickly you find yourself doing the same thing.

Before I go on to make my point, I should say that this is nothing more than a minor annoyance to me personally. I'm an academic; I'm not trying to monetize my site. And 33bits is a blog, so I don't pay hosting costs. The only reason I'm annoyed is that when I look at my stats page to see what sorts of articles my readers are most interested in, I have to mentally discount the articles that got StumbleUpon traffic. But anyone who pays hosting costs for their blog and is trying to make money (or spread an idea, or whatever) might want to take note of the following.

The architecture of StumbleUpon is fundamentally exploitative of the quid-pro-quo nature of free websites. A pageview from a StumbleUpon visitor costs just as much in bandwidth as any other, but is a couple of orders of magnitude less likely to result in any sort of engagement. Your website wasn't meant to be viewed in a frame, so don't let it.

Even though StumbleUpon has only 10 million users, this is a bigger problem than it might seem at first sight. The recommendations that the system makes are voting-based, so the mechanics of popularity and the resulting traffic patterns are essentially the same as with Digg and Reddit, although the engagement numbers are very, very different. This means that most days you'll probably see no StumbleUpon traffic, but one day you'll get unlucky and the resulting spike will dominate your traffic and costs for that month, and you'll have nothing to show for it.

I would recommend some framebusting and User-Agent sniffing code to politely tell StumbleUpon users to go somewhere else, but whatever you do, don't put a Stumble button on your pages!
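
Here is a minimal server-side sketch of the idea in Python. To be clear about what is assumed: it swaps client-side framebusting for the `X-Frame-Options` and `Content-Security-Policy: frame-ancestors` response headers, and it guesses at StumbleUpon visitors from the Referer header rather than the User-Agent (I am assuming the referer points at stumbleupon.com, as in my stats). The function names and the `"polite_notice"` page are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: StumbleUpon visits can be recognised by a Referer on this domain.
STUMBLE_DOMAIN = "stumbleupon.com"


def is_stumbleupon_referer(referer):
    """Return True if the Referer header points at stumbleupon.com or a subdomain."""
    if not referer:
        return False
    host = (urlparse(referer).hostname or "").lower()
    return host == STUMBLE_DOMAIN or host.endswith("." + STUMBLE_DOMAIN)


def anti_framing_headers():
    """Headers asking browsers not to render the page inside any frame."""
    return {
        "X-Frame-Options": "DENY",
        "Content-Security-Policy": "frame-ancestors 'none'",
    }


def choose_response(request_headers):
    """Hypothetical hook: pick a page variant and headers for one request."""
    # Simplification: assumes the header key is spelled exactly "Referer".
    referer = request_headers.get("Referer", "")
    page = "polite_notice" if is_stumbleupon_referer(referer) else "article"
    return page, anti_framing_headers()
```

For example, `choose_response({"Referer": "http://www.stumbleupon.com/"})` picks the polite notice, while a request with no referer gets the article. Either way the anti-framing headers are sent.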

[1] I'm sure there's a sizeable fraction of users for whom the collaborative filtering aspect works well, and who consequently actually read the articles and engage with the sites. But even if half the users fall into this group (although I doubt it's anywhere near that high), most of the traffic generated by StumbleUpon users to any given site is going to be low quality because the dopamine junkies make 100x more clicks.

Thoughts on the future of the real-time web [Aug. 12th, 2010|03:03 am]

I've been doing a lot of thinking over the last few months about the real-time web as part of a project I'm working on. Here are a few things about it that I think are underappreciated:

0. There's an item 0 that I wanted to get out of the way first just so we're on the same page. Real-time doesn't automatically mean "Twitter."

1. Real-time content is going to look a lot prettier once good filtering tools are in place. Publish-then-filter has become the standard paradigm for new web communication tools, and the real-time web is no exception. Just as the web was full of random crap before we had Google to filter it, real-time content is full of spam these days, but the filtering tools are soon going to catch up.

2. Related to the previous point: the pagerank model doesn't work for real-time content for obvious reasons -- there isn't enough time to wait to see what other people say about a piece of content. Instead, what is needed is to measure the reputation of sources, so that when a source emits some content you immediately have a way to rank it. A new class of algorithms is going to become as important as pagerank or more so, and user identities and reputation are going to be key components of it.
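
To make the idea concrete, here is a toy Python sketch of ranking by source reputation. Everything in it is hypothetical: the source names, the scores, and the default prior for unknown sources are made up for illustration, and it says nothing about how reputations would actually be learned.

```python
# Hypothetical reputation scores in [0, 1]; how they are learned is out of scope here.
reputation = {"alice": 0.9, "newsbot": 0.6, "spammer42": 0.05}
DEFAULT_REPUTATION = 0.1  # assumed prior for sources never seen before


def score(item):
    """Score an item the moment it arrives, using only what is known about its source."""
    return reputation.get(item["source"], DEFAULT_REPUTATION)


def rank(items):
    """Order items from most to least reputable source (stable for ties)."""
    return sorted(items, key=score, reverse=True)


incoming = [
    {"source": "spammer42", "text": "CHEAP PILLS"},
    {"source": "stranger", "text": "hello world"},
    {"source": "alice", "text": "Earthquake just now downtown"},
]
for item in rank(incoming):
    print(f"{score(item):.2f}  {item['source']}: {item['text']}")
```

The point is that the score needs nothing from other readers: it is available the instant the item is emitted.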

3. There are two distinct reasons why low latency is valuable to users. Some types of events/information are inherently time-sensitive. The street-food-location-via-Twitter phenomenon is a good example, as are various other time-sensitive offers.

For other types of information, the value of timeliness is relative. For example, I follow Susan Polgar's chess blog, where she posts puzzles. To have a prayer of being the first to post a solution you have to be among the first to be notified of new posts (especially since the puzzles are kinda lame ;-).

4. A more serious and better-known example of the value of relative timeliness is stock prices. The competitive advantage of low latency in high-frequency trading is so high that incredibly complex and powerful infrastructures have been set up in which the speed of light is a bottleneck. The important thing to realize is that this is driven not by the speed of events in the external world but purely by competition within the trading network. A similar phenomenon is happening for information on the web, although less dramatically.

5. From the system design point of view, the biggest benefit of real-time isn't so much the low latency it delivers as the fact that it uses a publish-subscribe model as opposed to polling periodically for updates. For a large-scale system, the efficiency gains are incredible.
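
A rough Python sketch of why the gains are so large. The numbers are illustrative assumptions, not measurements: 10,000 subscribers, a five-minute polling interval, and three updates in a day.

```python
def polling_requests(subscribers, poll_interval_s, period_s):
    """Requests the publisher serves when every subscriber polls on a timer."""
    return subscribers * (period_s // poll_interval_s)


def push_messages(subscribers, updates):
    """Messages sent when the publisher notifies each subscriber once per update."""
    return subscribers * updates


# Illustrative numbers only (assumptions, not measurements).
subscribers = 10_000
day = 24 * 60 * 60
polls = polling_requests(subscribers, poll_interval_s=300, period_s=day)  # poll every 5 min
pushes = push_messages(subscribers, updates=3)                            # 3 posts that day

print(polls, pushes, polls // pushes)
```

With these figures polling costs 2,880,000 requests against 30,000 pushed messages, a factor of 96, and polling subscribers can still learn of an update up to five minutes late.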

6. The above two factors -- the competitive value of low latency and the efficiency gains of publish-subscribe -- are together leading to the real-time-ification of update propagation on the web at an accelerating rate. Within a few years we will be able to build an "uberhose" -- a real-time stream that aggregates essentially all human activity on the web. Combined with effective routing and filtering tools, the applications will be limited only by imagination.

Exciting times.

7. My regular readers and/or those who know me IRL may have noticed that I am as usual childishly optimistic about the future of technology. I happened to write something yesterday about the problem of hoaxes exacerbated by real-time meme propagation, so I thought I'd throw in a link to that "for balance."

De-anonymization, network neutrality [Nov. 13th, 2008|03:53 am]

A couple of things that have nothing to do with each other:

Lending Club is a peer-to-peer loans company that has been publishing "anonymized" financial data about its customers' loans. For each customer (there are nearly 5,000 in the dataset), their income, credit rating, and a variety of other sensitive data are posted. There's also the text from the loan application for each customer; here's a representative example:
My husband’s lawyer has told us that we need $5000 up front to pay for his child custody case. We are going to file for primary custody. Right now he has no visitation rights according to their divorce agreement. His ex-wife has been evicted twice in the four months and is living with 2 of their 3 daughters in a two bedroom apartment with her boyfriend. She has no job or car and the only money they have is what we give them in child support and she blows all of it on junk. We have a 2000+ square foot house, both have stable jobs, and our own cars. Both girls(12 and 15 years old) are allowed to go and do whatever they please even though they are failing classes at school. We are clearly the better situation for them to be raised in but we simply do not have that much money all at once. We would be able to pay around $200 per month for repayment.
Users thought they were providing this information anonymously. I've published an analysis of how to de-anonymize the data in a variety of different ways.

The other thing is Network Neutrality, which I'm opposed to, as you can probably guess. Or rather, I'm opposed to Net Neutrality regulation, which I don't think will make the Internet any more neutral and has a huge potential to end in disaster. Most people who support regulation know very little about the issue beyond "OMG evil companies are going to take over the Internets!!" You might want to take a look at this detailed Cato Institute study and introductory blog post by Tim Lee arguing that there's nothing wrong with the Internet and that regulation is likely to lead to a huge mess. Particularly interesting to me is his analysis of how the Interstate Commerce Commission had the opposite of the intended effect on the railroads a hundred years ago.

Perhaps the most sensible thing anyone has ever said on Net Neutrality comes from an essay by Ed Felten:
The present situation, with the network neutrality issue on the table in Washington but no rules yet adopted, is in many ways ideal. ISPs, knowing that discriminating now would make regulation seem more necessary, are on their best behavior; and with no rules yet adopted we don’t have to face the difficult issues of line-drawing and enforcement. Enacting strong regulation now would risk side-effects, and passing toothless regulation now would remove the threat of regulation. If it is possible to maintain the threat of regulation while leaving the issue unresolved, time will teach us more about what regulation, if any, is needed.
Bias disclosure: both Tim Lee and Ed Felten are friends of mine. The Cato Institute is a libertarian think-tank.

Traditional media is going to die [Sep. 11th, 2008|11:54 pm]

The BBC's online news coverage is a perfect example of a news organization doing pretty much everything wrong. For a start, look at their headlines. Here are three that I'm looking at right now in my RSS reader:

"Royal alarm."
"Fair play."
"Birthday state."

What the hell does any of that mean?

It's clear what's going on: the writer is someone who is used to being constrained by the space limitations of the print medium, and made no effort whatsoever to adapt their headline-writing strategy to the Internet. As an online publisher, you're competing desperately for people's attention; no living human being is going to click on those idiotic meaningless words.

We're just getting warmed up. Next up, story selection. Here's a typical kind of story that the BBC never fails to report:

"Six dead as bus overturns in Bangladesh."

This is the perfect example of a completely useless article. No offense to Bangladeshis, but what would be news is if no buses overturned for a whole week in Bangladesh. Reporting a random bus accident is of absolutely no use to anyone. Except perhaps the relatives of those dead, but there are far better ways of getting out that information than broadcasting it to the entire world. Usually the article doesn't even include the names of the dead! Nor do they even try to get a human angle, they just print the "facts." Ugh.

CNN might be biased and shallow in comparison, but that's still way better than being completely irrelevant. Bias isn't even all that bad -- readers understand that pretty much everything you read on the Internet is someone's opinion. Instead of seeing contrasting viewpoints reported side-by-side in the same article, they just end up getting it from multiple news sources.

At any rate, let's see how the BBC handles bias. Their solution: a near-total reliance on quotes. They won't print a damn thing on their own no matter how obvious it is. In a recent article on the Tata Nano mess in India, this is roughly what they said a couple of days ago, in quintessential British fashion: "a spokesperson for the company said there was a possibility that planned October rollout would not be met."

No shit, Sherlock. The Tata company hadn't even gotten started building the factory that they were going to use to assemble their cars, because the government took back the land it gave them. Everyone in India knew a week before the article ran that there was no way in hell the Nano was going to roll out in October.

American TV news has adapted to the times by becoming a center of entertainment. That's perfectly fine as a business model, although it's sad that most people haven't realized that what they're watching doesn't really have much to do with the news. Print media, on the other hand, has had a particularly hard time adapting to technological change.

The situation is really simple, and everyone who is native to the Internet media business understands it quite well. The free flow of information has meant that news has shifted from an "information economy" to an "attention economy." In other words, news used to be a seller's market with readers clamoring for information; now they're inundated with it and publishers are the ones clamoring for attention -- a buyer's market.

Traditional media has a set of values and best practices that are completely unviable in the new world. Furthermore, they have become ossified and seem unable to adapt. Numbers don't lie: nothing seems to be able to stem the free fall of advertising revenues of newspapers.

If I have time, I will do a followup post analyzing how blogs are getting it right, with TechCrunch as the perfect example of a publication that stands in dramatic contrast with BBC News. It was started just three years ago by one guy, and while still technically a blog, has grown into something far more useful, lucrative, and powerful, and now entertains ambitions of becoming a media empire of its own.

(According to the article linked in the picture, even online ad revenues for newspapers have gone down, for the first time. I don't think there's going to be any sort of turnaround.)

TV = 10,000 entire wikipedias per year! [May. 2nd, 2008|02:00 am]

Short and brilliant talk by Clay Shirky on New Media earlier this week, titled "Where do people find the time?", that's been making its way through the Internets. My favorite quotes:
Desperate Housewives essentially functioned as a kind of cognitive heat sink, dissipating thinking that might otherwise have built up and caused society to overheat.
In response to a TV producer being contemptuous of World of Warcraft:
However lousy it is to sit in your basement and pretend to be an elf, I can tell you from personal experience it's worse to sit in your basement and try to figure if Ginger or Mary Ann is cuter.

Five levels of writing [Dec. 1st, 2007|09:06 pm]
[Current Mood | amused]

The level of formality of my writing falls into one of five categories depending on the context.

1. Technical writing. Super-boring, stilted, abstract. No cleverness allowed. Any references to popular culture are immediately shot down (yes, I've tried). In The Mathematical Experience, Davis and Hersh conjecture that the purpose of mathematical writing is to "hide any sign that either the author or the intended reader is a human being". I think they hit the nail on the head! Not all academic disciplines have norms as extreme as that, but they come pretty close.

I also have to say "we" instead of "I" even when I'm the only author. WHAT IS UP WITH THAT? I've never been able to figure it out. And I just can't bring myself to do it. Fortunately I've never had to write an actual paper as the only author, so the real test is going to be when I write my thesis.

2. "Normal" writing, such as this blog post or an email to a group of people. This is the level that I really enjoy; I have the freedom to write whatever I want but the responsibility to put some thought into what I write so as not to bore or affront the reader.

3. Informal writing, such as 1-on-1 emails or comments to blog posts. I don't expect more than one or two people to read these, so I do no more than a cursory review of what I wrote.

At this and the following levels of writing, I also dispense with capital letters. Using them makes text slightly easier to read, but it's harder on my wrists (especially with my current tendinitis), and when the intended audience is O(1) the effort doesn't seem justified.

4. Text messages (and, to an extent, emails composed from my phone). I freely resort to omg and lol, but fortunately I don't have to strip all the vowels from my words because my Treo has decent autocomplete :)

5. YouTube comments fall into a category all on their own. It is imperative in this medium of communication to establish that you have an IQ of no more than 60. Gratuitous insult-flinging is also encouraged. Needless to say, capital letters are out of the question. Spelling any word with eight or more letters correctly is frowned upon.

<lol>I haven't quite mastered the art yet. I believe my comments stick out a little bit to the experienced youtuber. But I think I'm getting better. For instance, I no longer cite papers from PubMed, DBLP or the arXiv in support of my conclusions. When someone makes a statement about the sexual orientation of my favorite political candidate, I do not point out that it does not follow from his stated political positions. Nor do I bother to question the relevance of the argument to the issue at hand. I have learned that the correct response is instead to question the morality of my opponent's mother. A retort that somehow involved the Jews would also be appropriate.</lol>

It's a pretty interesting phenomenon. I'm sure anthropologists of the future will find the YouTube comments of our time an object of great curiosity, worthy of study and full of insights into human behavior.

Of course, I have to link to xkcd on youtube.

TV is dead. About time! [Nov. 12th, 2007|11:58 pm]

I'm not alone in thinking that the writers' strike is just the latest symptom of the crumbling power of the studios. Marc Andreessen takes things to the logical extreme and suggests that an extended strike would be the best thing that could happen to consumers in the long run: Rebuilding Hollywood in Silicon Valley's image.

Now couple this with Chris Soghoian's eye-opener TV Torrents: When 'piracy' is easier than legal purchase. (Chris is the guy who got a lot of publicity for the fake Northwest boarding pass generator and other security breaches.) The main thrust of the article is that
  • virtually all TV content is illegally available online as BitTorrent RSS feeds
  • there are players like Miro which aggregate all these sources and slap an elegant GUI on top.
The result is that watching TV on your computer is enormously more convenient than turning on the idiot-box, and if you've tried it you can never go back again.

I've been trying Miro for a while. It's been slightly unstable on Ubuntu Feisty, but works like a breeze on Gutsy. The interface is heaven. Forget the illegal stuff. I find there's enough legal content available that I have enough for several hours of happy TV watching per day. Combined with occasionally renting TV shows from Netflix, I don't miss cable at all.

Of course, I'm slightly ahead of the curve in ditching TV; I think mainstream news media is a propaganda machine, I'd rather have a lobotomy than sit through commercials, and the Discovery Channel is my idea of engrossing TV. On the other hand, the average person needs to watch Jack Bauer and they need to watch him NOW. So TV's dead for me, but not yet quite dead for everyone.

But we're getting there. Streaming content from your computer to your TV is getting ever easier; if the writers' strike continues, most people's favorite TV shows are going to get canceled; and mobile video is starting to come into its own. By the end of next year, there's a good chance we're going to see hordes of deserters. Of course, there'll still be people watching TV the old-fashioned way, just like people still use dialup, but the business model is already starting to topple.

A whole lotta buzzwords [Sep. 19th, 2007|07:03 pm]

I've been busy writing a web app. For real. Which is great, but after a night or two of nonstop web programming I can't stop seeing DOM elements everywhere even when I'm not staring at the computer. I look around my room and it's full of DIV's. If I see objects lying around on a table, I briefly go "oops, need to decrease the CELLPADDING on that," before pulling myself together. So you can imagine my shock when I walked into the office kitchen and saw this:

Fortunately it turned out to be just a bottle of dishwashing liquid and not a hallucination.

It amuses me when people still talk of AJAX as being "the wave of the future" or whatever. AJAX has been the wave of everything for 3 years now. 90% of today's desktop apps have no reason to exist, and if they haven't migrated to the web it's pure inertia and incompetence.

That said, JavaScript is a pile of dung. The language is terribly limited for the kind of things you're trying to do on the Internet. (No builtin to walk a DOM tree? Seriously?) And if you try to use more than one library, you're going to get namespace collisions! Holy crap, what century are we living in? Add to that the flaky implementations, the CSS incompatibilities and the horror of debugging* and you're looking at a system that makes BASIC programming seem elegant.

I can't wait to get done with the frontend already and retreat back to my server-side Python nirvana.

P.S: there's an excellent rant from yesterday by Joel Spolsky on the messed-up AJAX situation.

*What the hell did people even do before Firebug existed?

What's wrong with IM? [Apr. 2nd, 2007|02:55 am]
[Current Mood | bewildered]

Instant messaging is incredibly convenient -- unlike email, it's immediate, and unlike phone calls, you can ignore a message if you're busy. Yet I find myself talking to only a dozen or so people by IM with any regularity, in spite of the fact that I'm almost always signed on and, between my various accounts, have something like 150 people on my friend lists. Most of my friends just never seem to sign on.


In particular, my 60-or-so mutual LJ friends, why are you not using your LJ Talk service? (Well, other than the 2 or 3 of you that do.) It's free, it's open and interoperable, and it auto-imports your friends list. What more could you ask for?

LJ OpenID loophole [Mar. 25th, 2007|03:24 am]
[Current Mood | surprised]

I get several anonymous spam comments on this blog every day, but they don't show up because they're screened by default. If you ask LJ to screen anonymous comments, it doesn't screen comments from OpenIDs (although the comment posting form claims it does). This is silly, because anyone, including a spammer, can set up an OpenID identity server. That's sorta the whole point of OpenID.

Even better, there are anonymous OpenID servers around, which provide disposable IDs with no authentication. Go ahead, try it out, post a reply to this page by selecting OpenID in the From: field, and "" as the URL. It won't ask you for any sort of password, and the comment will show up even though I'm screening anonymous comments. Kinda silly, isn't it?

Shows you how stupid spammers must be if they haven't figured this out yet.
