News Curation Datasets

This page contains supplementary material for the paper:
"Aggregating Content and Network Information to Curate Twitter User Lists". Proc. 4th ACM RecSys Workshop on Recommender Systems and the Social Web (RSWeb'12), 29-36. [PDF] [BibTeX]

Summary

User content curation is becoming an important source of preference data, and it also provides information about the items being curated. One popular approach involves the creation of lists. On Twitter, these lists might contain user accounts relevant to a particular topic, whereas on a community site such as the Internet Movie Database (IMDb), they might take the form of lists of movies sharing common characteristics. While list curation implicitly involves substantial combined effort on the part of users, researchers have rarely looked at mining the outputs of this kind of crowdsourcing activity. Here we study a large collection of movie lists from IMDb. We apply network analysis methods to a graph that reflects the degree to which pairs of movies are "co-listed", that is, assigned to the same lists. This allows us to uncover a more nuanced categorisation of movies that goes beyond simple metadata, such as genre or era.

Data

To examine the information provided by user-curated movie lists, we constructed a new dataset from IMDb during July 2013. Collection was restricted to lists covering items such as feature films, documentaries, and TV shows/episodes. From the initial set of 121k lists and 249k movies, we constructed a co-listed graph (i.e. a graph in which two movies are connected if they are assigned to the same list). We subsequently normalised and thresholded this graph to produce a normalised co-listed graph. Details of the normalisation process are described in the paper cited above.
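As an illustration of what "co-listed" means, the following minimal Python sketch counts, for each pair of movies, how many lists contain both. These counts correspond to the edge weights of the unnormalised graph. This is not the code used to build the dataset: the function name build_colisted_edges, the toy input, and all IDs other than tt1375666 are hypothetical, and the normalisation step is not shown.

```python
from collections import Counter
from itertools import combinations


def build_colisted_edges(lists):
    """Count, for every unordered pair of movies, the number of lists containing both."""
    weights = Counter()
    for movie_ids in lists:
        # De-duplicate within a list so a repeated entry is counted once,
        # and sort so that each pair has a single canonical (a, b) ordering.
        unique_ids = sorted(set(movie_ids))
        for a, b in combinations(unique_ids, 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy input: three user lists of IMDb IDs (placeholder IDs).
example_lists = [
    ["tt1375666", "tt0000001", "tt0000002"],
    ["tt1375666", "tt0000001"],
    ["tt0000002", "tt0000003"],
]
edges = build_colisted_edges(example_lists)

for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

Enumerating all pairs is quadratic in the length of each list, so for a collection of this size a real pipeline would probably need to cap or sample very long lists; that choice is an assumption here, not something taken from the paper.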

Download

We make the pre-processed versions of our graph data available here. The data is provided for non-commercial and research purposes only:

>> Movielists dataset (2013-08-21) [14MB]

The ZIP archive contains two weighted undirected graphs in GraphML format:

imdb-colisted.graphml: Complete co-listed graph, with no normalisation. Each node corresponds to a movie. An edge exists between two movies if they are assigned to one or more lists together. The weight on each edge indicates the number of lists that they share.
imdb-normalised.graphml: Normalised co-listed graph, thresholded at 0.1. This graph was used in the analysis described in our paper.

In both graphs, movie nodes are identified by their unique IMDb IDs ttXXXXXXX (e.g. tt1375666 = "Inception"). Each node also has a "title" attribute, indicating the movie's title.
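A minimal sketch of how the graphs might be read using only the Python standard library is given below. It assumes that the edge weight is stored in a GraphML data attribute named "weight"; that name is an assumption, so check the key declarations at the top of the files. The function read_graphml and the embedded sample document are hypothetical and exist only to make the example self-contained.

```python
import io
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def read_graphml(source):
    """Return (titles, edges) from a GraphML file path or binary file-like object."""
    root = ET.parse(source).getroot()
    # Map GraphML key ids (e.g. "d0") to attribute names (e.g. "title").
    key_names = {k.get("id"): k.get("attr.name") for k in root.findall("g:key", NS)}

    titles = {}
    for node in root.iterfind(".//g:node", NS):
        attrs = {key_names.get(d.get("key")): d.text for d in node.findall("g:data", NS)}
        titles[node.get("id")] = attrs.get("title")

    edges = []
    for edge in root.iterfind(".//g:edge", NS):
        attrs = {key_names.get(d.get("key")): d.text for d in edge.findall("g:data", NS)}
        raw = attrs.get("weight")  # assumed attribute name
        weight = float(raw) if raw is not None else 1.0
        edges.append((edge.get("source"), edge.get("target"), weight))
    return titles, edges


# Tiny hypothetical sample in the same general shape as the released files.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="title" attr.type="string"/>
  <key id="d1" for="edge" attr.name="weight" attr.type="double"/>
  <graph edgedefault="undirected">
    <node id="tt1375666"><data key="d0">Inception</data></node>
    <node id="tt0000001"><data key="d0">Hypothetical Film</data></node>
    <edge source="tt1375666" target="tt0000001"><data key="d1">0.25</data></edge>
  </graph>
</graphml>"""

titles, edges = read_graphml(io.BytesIO(SAMPLE.encode("utf-8")))
print(titles["tt1375666"])
print(edges)
```

To read the released data, pass a file path instead, for example read_graphml("imdb-normalised.graphml") after extracting the ZIP archive. Graph libraries that support GraphML, such as NetworkX with its read_graphml function, should also be able to load the files directly.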

Contact

For further information please contact Derek Greene.