Curatr is a new online platform that provides access to the British Library Digital Collection. It was developed by the ERC-funded VICTEUR project at the UCD School of English, Drama and Film, in collaboration with researchers at the SFI Insight Centre for Data Analytics, as part of its Cultural Analytics research initiative. The platform hosts digitised versions of all English-language books from the British Library collection, corresponding to over thirty-five thousand unique titles, both fiction and non-fiction, from 1700 to 1899. When multi-volume works are taken into account, this amounts to over forty-six thousand unique volumes of text. The platform also incorporates the first digitised version of the topical classification index of books used by the British Library from 1823 to 1985.
The system includes a searchable index of the equivalent of over 12 million individual pages of text, which can be searched and sorted by author, title, year, and the full text of the volumes themselves. This allows researchers to identify content relating to specific themes within little-known or very long, unwieldy texts. Additional functionality based on modern natural language processing techniques supports this further, including content-based recommendation methods and the visualisation of relationships between concepts in the collection through semantic networks.
Curatr also supports the creation and export of smaller sub-corpora, defined thematically, chronologically, and by classification. This addresses the common requirement for humanities scholars to engage in online document curation, without the need for extensive technical training. Since creating an appropriate lexicon of words for curation can often be a tedious and time-consuming process, we expedite this by using a custom word embedding model to identify other potentially relevant words which are semantically similar to the original "seed" words provided by the user. The resulting lexicon can be used to filter the entire collection to produce a much smaller set of texts for closer inspection.
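The lexicon expansion and filtering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Curatr's actual implementation: the tiny three-dimensional vectors stand in for a trained word embedding model, the similarity threshold is arbitrary, and the names `TOY_EMBEDDINGS`, `expand_lexicon` and `filter_texts` are hypothetical.

```python
import math

# Invented 3-dimensional vectors standing in for a trained embedding model.
# A real model would have hundreds of dimensions and a large vocabulary.
TOY_EMBEDDINGS = {
    "disease": (0.9, 0.1, 0.0),
    "illness": (0.85, 0.15, 0.05),
    "fever": (0.8, 0.2, 0.1),
    "railway": (0.0, 0.9, 0.2),
    "harvest": (0.1, 0.1, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_lexicon(seeds, embeddings, threshold=0.95):
    """Return the seed words plus every word whose vector is close to a seed."""
    lexicon = set(seeds)
    for seed in seeds:
        if seed not in embeddings:
            continue  # seed is outside the model's vocabulary
        for word, vector in embeddings.items():
            if cosine(embeddings[seed], vector) >= threshold:
                lexicon.add(word)
    return lexicon


def filter_texts(texts, lexicon, min_hits=1):
    """Keep the titles whose text mentions at least min_hits lexicon words."""
    selected = []
    for title, body in texts.items():
        tokens = (token.strip(".,;:!?\"'") for token in body.lower().split())
        hits = sum(1 for token in tokens if token in lexicon)
        if hits >= min_hits:
            selected.append(title)
    return selected


# Hypothetical two-volume "collection" used only to exercise the functions.
sample_texts = {
    "Volume A": "An account of the fever in London.",
    "Volume B": "The railway timetable for 1850.",
}
lexicon = expand_lexicon(["disease"], TOY_EMBEDDINGS)
sub_corpus = filter_texts(sample_texts, lexicon)
```

Here a single seed word, "disease", pulls in "illness" and "fever", and only the volume that mentions one of those words survives the filter. In practice a user would likely review the suggested words before accepting them into the lexicon, since nearest neighbours in an embedding space are not always relevant.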
There are inevitable variations in the legibility of earlier texts and those in non-standard formats. Therefore, a key use of Curatr is to help researchers identify original texts relevant to their work for consultation in situ in the library. The next phase of the project will seek to integrate Curatr with other relevant online cultural resources, such as records originating from popular lending libraries in the nineteenth century.
This work is funded by the European Research Council (ERC), and is being undertaken by members of the UCD School of English, Drama and Film, in collaboration with researchers from the SFI Insight Centre for Data Analytics at the UCD School of Computer Science. For more details, please contact us via e-mail or follow us on Twitter. Curatr by UCD Centre for Cultural Analytics is licensed under a Creative Commons BY-NC-ND 4.0 Licence. Background image is by Steven Cadman.