Stability Analysis + Topic Modeling Data

This page contains supplementary material for the paper:
D. Greene, D. O'Callaghan, P. Cunningham (2014). "How Many Topics? Stability Analysis for Topic Models". Proc. European Conference on Machine Learning (ECML'14).


Summary

Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will over-cluster the corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue. The intuition is that a model with an appropriate number of topics will be more robust to perturbations in the data; a simplified sketch of this kind of analysis is given below.
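To make the idea concrete, the following is a simplified sketch of this kind of stability analysis, not the paper's exact procedure: it assumes scikit-learn's NMF as the topic model, plain Jaccard similarity between top-t term sets with an optimal one-to-one topic matching, and illustrative parameter values.

```python
# A minimal sketch of term-centric stability analysis, assuming a
# non-negative document-term matrix X (e.g. TF-IDF weighted).
# Simplifications: NMF topic models, Jaccard similarity over top-t
# term sets, and Hungarian matching between topics.

import numpy as np
from sklearn.decomposition import NMF
from scipy.optimize import linear_sum_assignment

def top_terms(H, t=20):
    """Top-t term indices for each topic (rows of the NMF factor H)."""
    return [set(np.argsort(row)[::-1][:t]) for row in H]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def agreement(topics_a, topics_b):
    """Mean Jaccard similarity under an optimal 1-to-1 topic matching."""
    sim = np.array([[jaccard(a, b) for b in topics_b] for a in topics_a])
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    return sim[rows, cols].mean()

def stability(X, k, n_samples=10, sample_ratio=0.8, t=20, seed=0):
    """Average agreement between a reference model fitted to all of X
    and models fitted to random document subsamples."""
    rng = np.random.default_rng(seed)
    ref = top_terms(NMF(n_components=k, init="nndsvd").fit(X).components_, t)
    scores = []
    for _ in range(n_samples):
        idx = rng.choice(X.shape[0], int(sample_ratio * X.shape[0]), replace=False)
        H = NMF(n_components=k, init="nndsvd").fit(X[idx]).components_
        scores.append(agreement(ref, top_terms(H, t)))
    return float(np.mean(scores))

# Basic model-selection recipe: evaluate a range of k values and
# prefer the most stable model, e.g.
#   best_k = max(range(2, 12), key=lambda k: stability(X, k))
```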


Data

For evaluation purposes, we created a number of text corpora that have annotated "ground truth" category labels for documents. Details of these corpora are as follows:

Corpus            Documents   Terms   Labels   Description
bbc                   2,225   3,121        5   General news articles from the BBC. See here for more details.
bbc-sport               737     969        5   Sports news articles from the BBC. See here for more details.
guardian-2013         6,520  10,801        6   New corpus of news articles published by The Guardian during 2013.
irishtimes-2013       3,246   4,832        7   New corpus of news articles published by The Irish Times during 2013.
nytimes-1999          9,551  12,987        4   A subset of the New York Times Annotated Corpus from 1999.
nytimes-2003         11,527  15,001        7   As above, with articles from 2003.
wikipedia-high        5,738  17,311        6   Subset of a Wikipedia dump from January 2014, where articles are assigned labels based on their high-level WikiProject.
wikipedia-low         4,986  15,441       10   Another Wikipedia subset from January 2014, where articles are labeled with fine-grained WikiProject sub-groups.


Download

Pre-processed versions of six of the corpora are made available here for research purposes only.

>> Download: Pre-processed text corpora (35MB)

Unfortunately, due to licensing restrictions, we are unable to make the New York Times corpora available. The complete corpus is available from here. To recreate our corpora, the subsets of document IDs that we used for nytimes-1999 and nytimes-2003 are provided here, where each ID is prefixed by its ground-truth category label; a sketch for parsing this file is given below.
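As a minimal sketch of how such an ID file might be parsed: the exact delimiter between the label prefix and the document ID is an assumption here, as is the file name, so both should be adjusted to match the actual file.

```python
# Parse the document ID file for the NYT subsets. Assumptions: one
# entry per line, with the category label and the NYT document ID
# separated by whitespace; the file name used below is hypothetical.

def load_ground_truth(path):
    """Map each NYT document ID to its ground-truth category label."""
    labels = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            label, doc_id = line.split(None, 1)  # split on the first whitespace run
            labels[doc_id] = label
    return labels

# e.g. labels = load_ground_truth("nytimes-1999-ids.txt")
# then select the matching articles from the full NYT Annotated Corpus.
```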


File formats

The datasets have been pre-processed as follows: stop-word removal and low term frequency filtering (count < 20) were applied to the data, followed by log-based TF-IDF term weighting and L2 document length normalization; a sketch of these steps is given below. The files contained in the archive above have the following formats:
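A rough approximation of this pre-processing pipeline can be written with scikit-learn. In the sketch below, the use of scikit-learn's built-in English stop-word list and the reading of "count < 20" as a threshold on a term's total corpus frequency are assumptions.

```python
# Approximate the pre-processing described above. Assumptions: the
# built-in English stop-word list, and "count < 20" interpreted as a
# threshold on a term's total frequency across the whole corpus.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def preprocess(docs, min_total_count=20):
    # 1. Tokenize the documents and remove stop-words
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)
    # 2. Filter out terms whose total corpus count is below the threshold
    keep = np.where(np.asarray(counts.sum(axis=0)).ravel() >= min_total_count)[0]
    counts = counts[:, keep]
    terms = vectorizer.get_feature_names_out()[keep]
    # 3. Apply log-based TF-IDF weighting (1 + log tf) and L2 row normalization
    X = TfidfTransformer(sublinear_tf=True, norm="l2").fit_transform(counts)
    return X, terms
```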


Contact

For further information, please contact Derek Greene.