Folksonomy dataset for NLP (delicious.com bookmarks) [Sep. 7th, 2009|04:03 am]
Arvind Narayanan
Several of my current projects have to do with Natural Language Processing. Over the weekend, I built a topic categorization engine; one of the datasets that I used was bookmarks from delicious.com. I was mainly interested in the tags and the relationships between them.

Delicious doesn't make this data available for bulk download, but they do have an RSS feed of site-wide bookmarking activity. The average rate is slightly over 1 new (public) bookmark per second. I've been slurping up the feed for the last couple of days, so I have a dataset of around 200k bookmarks. You can grab it here. The format is JSON, one entry per line, which makes it trivial to parse.
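
For anyone who wants a starting point, here is a minimal sketch of reading a one-entry-per-line JSON file and counting which tags appear together on the same bookmark. The post doesn't spell out the field names, so the "tags" key (assumed to hold a list of strings), the helper names, and the sample lines are all assumptions for illustration; adjust them to whatever the file actually contains.

```python
import json
from collections import Counter
from itertools import combinations


def load_bookmarks(lines):
    """Yield one dict per non-empty line of JSON-lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def tag_cooccurrence(bookmarks, tag_field="tags"):
    """Count how often each unordered pair of tags shares a bookmark.

    ASSUMPTION: each entry stores its tags as a list of strings under
    `tag_field`. The real dataset may use a different key or layout.
    """
    pairs = Counter()
    for entry in bookmarks:
        # Lowercase and de-duplicate so a pair counts once per bookmark.
        tags = sorted({t.lower() for t in entry.get(tag_field, [])})
        pairs.update(combinations(tags, 2))
    return pairs


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real dataset.
    sample = [
        '{"url": "http://example.com/a", "tags": ["python", "nlp"]}',
        '{"url": "http://example.com/b", "tags": ["NLP", "python", "tutorial"]}',
    ]
    counts = tag_cooccurrence(load_bookmarks(sample))
    for pair, n in counts.most_common(3):
        print(pair, n)
```

To run it on a real file, pass an open file handle to `load_bookmarks` instead of the sample list. Sorting the tags before pairing means ("nlp", "python") and ("python", "nlp") land in the same bucket.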

I plan to leave my bot running until I have at least a million entries, at which point I will do another release. If anyone wants the source, I'd be happy to share, although it shouldn't take more than half an hour to write it yourself.
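
This is not the actual bot, just a rough sketch of the core of such a feed slurper: parse one snapshot of an RSS document, skip items already seen, and emit the rest as JSON lines. It assumes plain RSS 2.0 elements (`item`, `title`, `link`, `guid`, `category`); the real delicious feed may lay things out differently, and the function names and output keys are made up for illustration. Fetching the feed on a timer is left out.

```python
import json
import xml.etree.ElementTree as ET


def parse_feed(xml_text):
    """Yield a dict for each <item> in an RSS 2.0 document.

    ASSUMPTION: tags are carried in <category> elements and each item
    has a <guid> or at least a <link>. The real feed may differ.
    """
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        yield {
            # Fall back to the link when no guid is present.
            "id": item.findtext("guid", default="") or link,
            "url": link,
            "title": item.findtext("title", default=""),
            "tags": [c.text for c in item.findall("category") if c.text],
        }


def new_entries(xml_text, seen):
    """Return JSON lines for items not seen before; updates `seen` in place."""
    lines = []
    for entry in parse_feed(xml_text):
        if entry["id"] and entry["id"] not in seen:
            seen.add(entry["id"])
            lines.append(json.dumps(entry))
    return lines
```

A polling loop would call `new_entries` on each fetched snapshot with the same `seen` set and append the returned lines to a file. Keying on the item identifier rather than the URL matters here, since many users can bookmark the same page.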

To learn more about the nifty data-mining possibilities of social-bookmarking data, see folksonomy. If you find the data useful, drop me a note and let me know. Enjoy.

From: fixious
2009-09-11 05:51 am (UTC)
Cool, thanks. I did a class project using delicious data some years back, and it'll be fun to visit that kind of stuff again.
From: (Anonymous)
2009-09-11 06:40 am (UTC)
did you have to slurp it as well or was there a better way to get it?
From: arvindn
2009-09-11 06:41 am (UTC)
crap, that was me above.
From: fixious
2009-09-11 07:32 am (UTC)
Yup, slurped.
