European Parliament Speeches

Replication Materials

Data and code are provided here to replicate the results from the paper:
D. Greene, J. P. Cross. "Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach". Political Analysis, 2016. [Paper]  [BibTeX]

Data

All data is provided for personal use or for further non-commercial use, and all rights, including copyright, are © European Union, 2016 (Source: European Parliament). The EP releases this data under the following terms:

As a general rule, the reuse (reproduction or use) of textual data and multimedia items which are the property of the European Union (identified by the words '© European Union, [year(s)] – Source: European Parliament' or '© European Union, [year(s)] – EP') or of third parties (© External source, [year(s)]), and for which the European Union holds the rights of use, is authorised, for personal use or for further non-commercial or commercial dissemination, provided that the entire item is reproduced and the source is acknowledged.

The following data downloads are made available under the ODbL v1.0 license:

  • europarl-data-speeches.zip: Archive of 211,302 English-language European Parliament speeches in plain text format, one file per speech, where the filename is the unique speech identifier from the Europarl website. Speeches are arranged in sub-directories based on their date.
  • europarl-metadata.zip: Contains (a) a tab-separated file containing metadata for all speeches, linked by the unique speech identifiers; (b) a tab-separated file containing metadata for the MEPs who delivered the speeches.
  • europarl-word2vec-model.zip: A pre-trained Word2vec word embedding model, built from the set of all speech text files using the Python Gensim package with its default parameters.
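The layout described above (one plain-text file per speech, grouped into date sub-directories, with metadata linked by speech identifier) can be read with a few lines of standard-library Python. The following is only a minimal sketch under stated assumptions: the `.txt` extension, the UTF-8 encoding, the presence of a header row in the metadata file, and the column name passed as `id_column` are all hypothetical and should be checked against the extracted archives.

```python
import io
import os


def load_speeches(root_dir, extension=".txt", encoding="utf-8"):
    """Return {speech_id: text} for every speech file found under root_dir.

    Assumption: speech files end in `extension` and are UTF-8 encoded.
    The speech identifier is taken to be the filename without its extension.
    """
    speeches = {}
    # os.walk descends into the per-date sub-directories.
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in sorted(filenames):
            speech_id, ext = os.path.splitext(filename)
            if ext.lower() != extension:
                continue  # skip anything that is not a speech file
            path = os.path.join(dirpath, filename)
            with io.open(path, "r", encoding=encoding, errors="replace") as handle:
                speeches[speech_id] = handle.read()
    return speeches


def load_metadata(tsv_path, id_column, encoding="utf-8"):
    """Return {identifier: row_dict} from a tab-separated metadata file.

    Assumption: the first line is a header row, and `id_column` names the
    column holding the unique speech identifier (hypothetical name).
    """
    rows = {}
    with io.open(tsv_path, "r", encoding=encoding) as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
        for line in handle:
            values = line.rstrip("\r\n").split("\t")
            record = dict(zip(header, values))
            rows[record[id_column]] = record
    return rows
```

With both dictionaries keyed by the same identifier, the metadata for a given speech can be looked up directly, for example `metadata[speech_id]` for each `speech_id` in `speeches`.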

Software

Python code specifically for applying topic modeling to the above European Parliament speech data is provided below, made available under the Apache 2.0 license. The README file in the archive describes the steps required to replicate our results.

This code has been tested with Python 2 (version 2.7.11), with the following third-party modules installed. These can be installed via pip or Anaconda:

Please note that the scikit-learn implementation of NMF was re-implemented in version 0.17 of the package, which can lead to marginally different results for the topic models described in our paper. We recommend using version 0.16 if seeking to reproduce our results exactly.
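A replication script could guard against this by checking the installed scikit-learn version before fitting any models. The sketch below is illustrative only and is not part of the provided code: the helper names `major_minor` and `uses_reimplemented_nmf` are hypothetical, and only the 0.17 threshold comes from the note above.

```python
import re


def major_minor(version_string):
    """Extract (major, minor) as integers from a version string like '0.16.1'."""
    numbers = []
    for piece in version_string.split(".")[:2]:
        match = re.match(r"\d+", piece)  # leading digits only, e.g. '17rc1' -> 17
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def uses_reimplemented_nmf(version_string):
    """True if the version is 0.17 or later, where NMF was re-implemented."""
    return major_minor(version_string) >= (0, 17)


if __name__ == "__main__":
    try:
        import sklearn
    except ImportError:
        print("scikit-learn is not installed.")
    else:
        if uses_reimplemented_nmf(sklearn.__version__):
            print("Warning: scikit-learn %s uses the re-implemented NMF; "
                  "results may differ marginally from the paper." % sklearn.__version__)
        else:
            print("scikit-learn %s predates the NMF re-implementation." % sklearn.__version__)
```

Comparing tuples of integers rather than raw strings avoids the pitfall that `"0.9" > "0.17"` when compared as text.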

For more general purpose dynamic topic modeling on other text datasets, the dynamic-nmf package is also available, which is compatible with both Python 2.x and Python 3.x.

Results & Analysis

Figure 1 in the paper and the figures found in Appendix A can be replicated from the following file:

Figure 2 in the paper was produced using the following data:

Stata code for replicating the analysis in Sections 6.2 and 6.3, along with the relevant data derived from our topic model, can be found in the following file: