Stability Analysis + Topic Modeling Data

This page contains supplementary material for the paper:
D. Greene, D. O'Callaghan, P. Cunningham (2014). "How Many Topics? Stability Analysis for Topic Models". Proc. European Conference on Machine Learning (ECML'14).


Summary

Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will over-cluster the corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue. The intuition is that a model with an appropriate number of topics will be more robust to perturbations in the data; a simplified sketch of this kind of analysis is given below.
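To make the idea concrete, the following is a simplified sketch of this kind of stability analysis, not the paper's exact procedure: it assumes scikit-learn's NMF as the topic model, plain Jaccard similarity between top-t term sets with an optimal one-to-one topic matching, and illustrative parameter values.

```python
# A minimal sketch of term-centric stability analysis, assuming a
# non-negative document-term matrix X (e.g. TF-IDF weighted).
# Simplifications: NMF topic models, Jaccard similarity over top-t
# term sets, and Hungarian matching between topics.

import numpy as np
from sklearn.decomposition import NMF
from scipy.optimize import linear_sum_assignment

def top_terms(H, t=20):
    """Top-t term indices for each topic (rows of the NMF factor H)."""
    return [set(np.argsort(row)[::-1][:t]) for row in H]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def agreement(topics_a, topics_b):
    """Mean Jaccard similarity under an optimal 1-to-1 topic matching."""
    sim = np.array([[jaccard(a, b) for b in topics_b] for a in topics_a])
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return sim[rows, cols].mean()

def stability(X, k, n_samples=10, sample_ratio=0.8, t=20, seed=0):
    """Average agreement between a reference model fitted to all of X
    and models fitted to random document subsamples."""
    rng = np.random.default_rng(seed)
    ref = top_terms(NMF(n_components=k, init="nndsvd").fit(X).components_, t)
    scores = []
    for _ in range(n_samples):
        idx = rng.choice(X.shape[0], int(sample_ratio * X.shape[0]), replace=False)
        H = NMF(n_components=k, init="nndsvd").fit(X[idx]).components_
        scores.append(agreement(ref, top_terms(H, t)))
    return float(np.mean(scores))

# Basic model-selection recipe: evaluate a range of k values and
# prefer the most stable model, e.g.
#   best_k = max(range(2, 12), key=lambda k: stability(X, k))
```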


Data

For evaluation purposes, we created a number of text corpora that have annotated "ground truth" category labels for documents. Details of these corpora are as follows:

Corpus            Documents   Terms   Labels   Description
bbc                   2,225   3,121        5   General news articles from the BBC. See here for more details.
bbc-sport               737     969        5   Sports news articles from the BBC. See here for more details.
guardian-2013         6,520  10,801        6   New corpus of news articles published by The Guardian during 2013.
irishtimes-2013       3,246   4,832        7   New corpus of news articles published by The Irish Times during 2013.
nytimes-1999          9,551  12,987        4   A subset of the New York Times Annotated Corpus from 1999.
nytimes-2003         11,527  15,001        7   As above, with articles from 2003.
wikipedia-high        5,738  17,311        6   Subset of a Wikipedia dump from January 2014, where articles are assigned labels based on their high-level WikiProject.
wikipedia-low         4,986  15,441       10   Another Wikipedia subset from January 2014, where articles are labeled with fine-grained WikiProject sub-groups.


Download

Pre-processed versions of six of the corpora are made available here for research purposes only.

>> Download: Pre-processed text corpora (35MB)

Unfortunately, due to licensing restrictions, we are unable to make the New York Times corpora available. The complete corpus is available from here. To recreate our corpora, the subsets of document IDs that we used for nytimes-1999 and nytimes-2003 are provided here, where each ID is prefixed by its ground-truth category label; a sketch for parsing this file is given below.
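As a minimal sketch of how such an ID file might be parsed: the exact delimiter between the label prefix and the document ID is an assumption here, as is the file name, so both should be adjusted to match the actual file.

```python
# Parse the document ID file for the NYT subsets. Assumptions: one
# entry per line, with the category label and the NYT document ID
# separated by whitespace; the file name used below is hypothetical.

def load_ground_truth(path):
    """Map each NYT document ID to its ground-truth category label."""
    labels = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            label, doc_id = line.split(None, 1)  # split on the first whitespace run
            labels[doc_id] = label
    return labels

# e.g. labels = load_ground_truth("nytimes-1999-ids.txt")
# then select the matching articles from the full NYT Annotated Corpus.
```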


File formats

The datasets have been pre-processed as follows: stop-word removal and low term frequency filtering (count < 20) were applied to the data, followed by log-based TF-IDF term weighting and L2 document length normalization; a sketch of these steps is given below. The files contained in the archive above have the following formats:
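A rough approximation of this pre-processing pipeline can be written with scikit-learn. In the sketch below, the use of scikit-learn's built-in English stop-word list and the reading of "count < 20" as a threshold on a term's total corpus frequency are assumptions.

```python
# Approximate the pre-processing described above. Assumptions: the
# built-in English stop-word list, and "count < 20" interpreted as a
# threshold on a term's total frequency across the whole corpus.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def preprocess(docs, min_total_count=20):
    # 1. Tokenize the documents and remove stop-words
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)
    # 2. Filter out terms whose total corpus count is below the threshold
    keep = np.where(np.asarray(counts.sum(axis=0)).ravel() >= min_total_count)[0]
    counts = counts[:, keep]
    terms = vectorizer.get_feature_names_out()[keep]
    # 3. Apply log-based TF-IDF weighting (1 + log tf) and L2 row normalization
    X = TfidfTransformer(sublinear_tf=True, norm="l2").fit_transform(counts)
    return X, terms
```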


Contact

For further information, please contact Derek Greene.