Cendari: Leveraging Natural Language Processing for Research in Historical Archives

Alexander O'Connor (TCD), Natasa Bulatovic (MPDL), Patrice Lopez (INRIA), Nadia Boukhelifa (INRIA), Carsten Thiel (UGoe)

Alex.OConnor@scss.tcd.ie

Acknowledgments: The research leading to these results has received funding from the European Union's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 284432. This work is partially supported by Science Foundation Ireland through CNGL: The Centre for Global Intelligent Content (Grant 12/CE/I2267).

Cendari aims to sit at the crossroads between archivists, historians and technologists. The focus is on resource discovery and use in less-resourced archives.

Use Cases

Broadly, the medieval research is more vertical, while the modern (WWI) research casts a very wide net. However, many of these surface differences translate into a similar need: the ability to find the right document and the things connected to it. Paradoxically, the medieval work was more digital.

To integrate digital archival resources for medieval and modern history, leveraging extant networks and projects to enhance the discoverability and usability of the resources.

There is a tremendous amount of support for what the project is trying to do, in terms of content and knowledge but, perhaps more importantly, in terms of service-level infrastructure.

Recognising what is desirable, what is useful and what is practical.

The overriding goal of Cendari is to work from what users need and from the needs users express. A big challenge is recognising that different constituencies (historians, archivists, technologists), both within the project and outside it, often have competing goals and often poor awareness of both what is possible and what is desirable.

From the workshop report: "The major difference was that archivists and information professionals have a very different relationship with the materials and with users, and they must work to mediate between the two."

The Not- and Never-Will-Be- Digitised

A key realisation was that the materials available digitally were a very small proportion of the overall materials available. Another interesting aspect was the desire to consider the archive or collection itself as a subject of research.

Multilingual Named Entity Recognition

The data in this case are digitised paper finding aids for the Bulgarian state archives. How do you locate fonds relevant to WWI from these documents without using extensive hand-built background knowledge?

Named Entity Recognition for the Digital Humanities

Entity Identification from existing datasets (Random Tree Forest)
Customisation API
Large number of classes (currently 26)
Inter-language known entity detection using Freebase, Wikipedia
Commonness, Global vs. Local Context, NER-features, concept-features
User Defined entities (Further disambiguation challenges)
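
As an illustration of the "commonness" feature listed above, here is a minimal sketch under one common reading of the term: the prior probability that a surface form refers to a given entity, estimated from how often that form links to the entity in a resource such as Wikipedia. The counts, entity identifiers and function names below are hypothetical, and the project's actual feature definition is not given here.

```python
from collections import Counter

# Hypothetical anchor-text counts: how often a surface form links to each entity.
# These numbers are invented for illustration only.
ANCHOR_COUNTS = {
    "Sofia": Counter({"Sofia_(city)": 90, "Sofia_(given_name)": 10}),
}


def commonness(mention, entity, counts=ANCHOR_COUNTS):
    """Share of the links for `mention` that point to `entity` (0.0 if unseen)."""
    links = counts.get(mention)
    if not links:
        return 0.0
    return links[entity] / sum(links.values())


def rank_candidates(mention, counts=ANCHOR_COUNTS):
    """Candidate entities for `mention`, most common first."""
    links = counts.get(mention, Counter())
    total = sum(links.values())
    scored = [(entity, n / total) for entity, n in links.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(rank_candidates("Sofia"))
```

Commonness alone always prefers the globally dominant sense, which is presumably why the list pairs it with global and local context features.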

Dialogue about strengths and (more importantly) weaknesses

Performance: currently 88% F-score on the 4-class CoNLL NER corpus
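
For readers unfamiliar with the metric, the F-score is the harmonic mean of precision and recall over predicted entities. The sketch below shows the arithmetic only; the counts are invented for illustration and are not the project's results, and the function name is hypothetical.

```python
def f_score(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score from entity-level counts; beta=1 gives the usual F1."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Illustrative counts only: 8 correct entities, 2 spurious, 2 missed.
    print(round(f_score(8, 2, 2), 3))
```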

Unify the complex questions of access control, entity versioning and manuscript access into a uniform API. Control identifiers for entities, users, fonds.
A key aspect is to

EAD, EAG, XML, PDF, TXT, TIFF, DOC

Paper records and finding aids

Language & Culture

A complex ingestion workflow is required to render the data uniformly accessible, while retaining integrity.
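
One small piece of such a workflow can be sketched as follows, assuming that "retaining integrity" includes being able to check that an ingested file is byte-identical to what the archive supplied. The record fields, the extension table and the function names are hypothetical and are not the project's actual pipeline.

```python
import hashlib
from pathlib import PurePath

# Hypothetical mapping from file extension to a format label.
# EAD and EAG files are XML-based, so they would arrive as .xml here.
KNOWN_FORMATS = {
    ".xml": "XML", ".pdf": "PDF", ".txt": "TXT",
    ".tif": "TIFF", ".tiff": "TIFF", ".doc": "DOC",
}


def ingest(filename, payload):
    """Build a uniform record for one incoming file, keeping a checksum."""
    suffix = PurePath(filename).suffix.lower()
    return {
        "source": filename,
        "format": KNOWN_FORMATS.get(suffix, "UNKNOWN"),
        "size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def verify(record, payload):
    """True if `payload` still matches the checksum taken at ingestion."""
    return hashlib.sha256(payload).hexdigest() == record["sha256"]


if __name__ == "__main__":
    rec = ingest("finding_aid.PDF", b"example bytes")
    print(rec["format"], verify(rec, b"example bytes"))
```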

Facilitate crowd-sourcing by sharing the concepts and entities, while keeping the actual work private.

Archival Research Guide

Methodology
Theme
Content
Sources
People

https://wiki.cendari.dariah.eu/wiki/Private_Memory_of_the_WW1 (Requires project access)

Cendari & the Europeana API

We abstract some of the complexity of data management away, so as to allow curation of knowledge from different specific tools. Personal data and annotations are stored in separate named graphs from group or global data. After an editorial process, data can migrate from the personal, to the general, and ultimately to background knowledge. Each version of the data is saved.
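
The migration described above can be sketched with plain in-memory sets standing in for named graphs. This is only an illustration of the idea: the graph names, the single-step promotion rule and the snapshot-per-change versioning are assumptions, and a real deployment would presumably sit on a triple store.

```python
from dataclasses import dataclass, field

# Assumed promotion order, following the text: personal, general, background.
LEVELS = ["personal", "general", "background"]


@dataclass
class GraphStore:
    graphs: dict = field(default_factory=lambda: {lvl: set() for lvl in LEVELS})
    versions: list = field(default_factory=list)

    def _snapshot(self):
        # Save an immutable copy of every graph after each change.
        self.versions.append(
            {name: frozenset(triples) for name, triples in self.graphs.items()}
        )

    def add(self, triple, graph="personal"):
        self.graphs[graph].add(triple)
        self._snapshot()

    def promote(self, triple, approved):
        """Move `triple` up one level if the editorial step approved it.

        Returns the name of the graph that holds the triple afterwards.
        """
        for i, level in enumerate(LEVELS):
            if triple in self.graphs[level]:
                if not approved or level == LEVELS[-1]:
                    return level
                target = LEVELS[i + 1]
                self.graphs[level].discard(triple)
                self.graphs[target].add(triple)
                self._snapshot()
                return target
        raise KeyError(triple)


if __name__ == "__main__":
    store = GraphStore()
    note = ("fonds:123", "mentions", "entity:Sofia")  # hypothetical identifiers
    store.add(note)
    print(store.promote(note, approved=True))
```

Keeping personal annotations in their own graph means that nothing becomes visible to others until the editorial step approves it, while the snapshots allow any earlier state to be recovered.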

Users need tools that fit their existing workflows
Cannot assume the availability of full text
Users are happy for others to contribute, but may not be so keen to share their own work
PDFs are everywhere
XML is a variable creature