Folksonomy dataset for NLP (delicious.com bookmarks)
[Sep. 7th, 2009|04:03 am]
Several of my current projects involve natural language processing. Over the weekend, I built a topic categorization engine, and one of the datasets I used was a collection of bookmarks from delicious.com. I was mainly interested in the tags and the relationships between them.
Delicious doesn't make this data available for bulk download, but it does have an RSS feed of site-wide bookmarking activity. The average rate is slightly over one new (public) bookmark per second. I've been slurping up the feed for the last couple of days, so I now have a dataset of around 200k bookmarks. You can grab it here. The format is JSON, one entry per line, so it is trivial to parse.
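To give a feel for how little code the one-entry-per-line format needs, here is a rough Python sketch that reads such a file and counts how often pairs of tags appear on the same bookmark. I'm assuming each entry is a JSON object with a list of strings under a "tags" key; that field name and the sample lines are made up for illustration, so check them against the actual dump.

```python
import json
from collections import Counter
from itertools import combinations


def iter_bookmarks(lines):
    """Yield one dict per line of JSON-lines input, skipping blank or broken lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate the occasional malformed line
        if isinstance(entry, dict):
            yield entry


def tag_cooccurrence(bookmarks, tag_field="tags"):
    """Count single tags and unordered tag pairs that share a bookmark.

    tag_field is an assumed field name, not something taken from the dataset.
    """
    tag_counts = Counter()
    pair_counts = Counter()
    for entry in bookmarks:
        # Lowercase and de-duplicate, then sort so each pair has one canonical order.
        tags = sorted({t.lower() for t in entry.get(tag_field, [])})
        tag_counts.update(tags)
        pair_counts.update(combinations(tags, 2))
    return tag_counts, pair_counts


# Made-up sample lines standing in for the real file.
sample = [
    '{"url": "http://example.com/a", "tags": ["python", "nlp"]}',
    '{"url": "http://example.com/b", "tags": ["nlp", "Python", "datasets"]}',
    '',
    'not json',
]

tag_counts, pair_counts = tag_cooccurrence(iter_bookmarks(sample))
print(pair_counts.most_common(3))
```

For the real thing, pass an open file handle instead of the sample list, since a file object iterates line by line and the whole dataset never has to sit in memory.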
I plan to leave my bot running until I have at least a million entries, at which point I will do another release. If anyone wants the source, I'd be happy to share, although it shouldn't take more than half an hour to write it yourself.
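If you'd rather write it yourself, the core of such a bot is small. The sketch below is not my actual source, just one way to do it: it parses a generic RSS 2.0 document, skips items it has already seen (successive polls of a feed overlap), and appends the rest as one JSON object per line. The element names (item, guid, link, title, category) are plain RSS 2.0 and an assumption about the feed; the sample feed is invented, and fetching the feed on a timer is left out.

```python
import io
import json
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Pull id, url, title and category tags out of each <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue
        # Prefer <guid> as the identity of a bookmark; fall back to the link.
        guid = (item.findtext("guid") or "").strip() or link
        entries.append({
            "id": guid,
            "url": link,
            "title": (item.findtext("title") or "").strip(),
            "tags": [c.text.strip() for c in item.findall("category") if c.text],
        })
    return entries


def collect_new(xml_text, seen, out):
    """Write unseen entries to out as JSON lines; return how many were written."""
    added = 0
    for entry in parse_rss_items(xml_text):
        if entry["id"] in seen:
            continue
        seen.add(entry["id"])
        out.write(json.dumps(entry) + "\n")
        added += 1
    return added


# Invented sample feed, standing in for one poll of the real one.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>A</title><link>http://example.com/a</link><guid>1</guid>
<category>python</category><category>nlp</category></item>
<item><title>B</title><link>http://example.com/b</link><guid>2</guid>
<category>datasets</category></item>
</channel></rss>"""

seen = set()
out = io.StringIO()  # a real bot would append to a file instead
first = collect_new(SAMPLE_FEED, seen, out)
second = collect_new(SAMPLE_FEED, seen, out)  # same poll again: nothing new
print(first, second)
```

A long-running version would call collect_new in a loop with a sleep between polls, and would need to cap or persist the seen set so it survives restarts.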
To learn more about the nifty data-mining possibilities of social-bookmarking data, see folksonomy. If you find the data useful, drop me a note and let me know. Enjoy.