Cendari: Leveraging Natural Language Processing for Research in Historical Archives
Alexander O'Connor (TCD), Natasa Bulatovic (MPDL), Patrice Lopez (INRIA), Nadia Boukhelifa (INRIA), Carsten Thiel (UGoe)
@uberalex
Alex.OConnor@scss.tcd.ie
Acknowledgments: The research leading to these results has received funding from the European Union's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 284432. This work is partially supported by Science Foundation Ireland through CNGL: The Centre for Global Intelligent Content (Grant 12/CE/I2267)
Cendari aims to sit at the crossroads between archivists, historians and technologists. The focus is on the discovery and use of resources from less-resourced archives.
Use Cases
Broadly, the medieval research is more vertical, while the modern (WWI) research casts a very wide net. However, many of these surface differences translate into a similar need: the ability to find the right document, and the things connected to it. Paradoxically, the medieval work was more digital.
To integrate digital archival resources for medieval and modern history, leveraging extant networks and projects to enhance the discoverability and usability of the resources.
There is a tremendous amount of support for what the project is trying to do, both in terms of content and knowledge, but perhaps more importantly in terms of service-level infrastructure.
Recognising what is desirable, what is useful and what is practical.
The overriding goal of Cendari is to work from what users need and the needs users express. It is a big challenge to realise that different constituencies (historians, archivists, technologists), both within the project and outside of it, have often-competing goals and often poor awareness of both what is possible and what is desirable.
From the workshop report: "The major difference was that archivists and information professionals have a very different relationship with the materials and with users, and they must work to mediate between the two."
The Not- and Never-Will-Be-Digitised
A key realisation was that digitally available materials are only a very small proportion of the overall holdings. Another interesting aspect was the desire to treat the archive or collection itself as a subject of research.
Multilingual Named Entity Recognition
The data in this case is digitised paper finding aids for the Bulgarian state archives. How do you locate fonds relevant to WW1 from these documents, without using extensive hand-built background knowledge?
Named Entity Recognition for the Digital Humanities
- Entity Identification from existing datasets (Random Tree Forest)
- Customisation API
- Large number of classes (currently 26)
- Inter-language known entity detection using Freebase, Wikipedia
- Commonness, Global vs. Local Context, NER-features, concept-features
- User-defined entities (further disambiguation challenges; see the sketch below)
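As a rough illustration of how commonness and local context can be combined for disambiguation, here is a minimal Python sketch. The link counts, context vocabularies and weighting are invented for the example and are not Cendari's actual models or data.

```python
from collections import Counter

# Toy link statistics: how often a surface form points to each candidate entity
# in a background corpus (e.g. Wikipedia anchor texts). Numbers are invented.
LINK_COUNTS = {
    "Wilson": Counter({"Woodrow_Wilson": 120, "Harold_Wilson": 40, "Wilson,_Arkansas": 3}),
}

# Toy context vocabularies for each candidate entity, also invented.
CONTEXT_TERMS = {
    "Woodrow_Wilson": {"president", "treaty", "versailles", "1918"},
    "Harold_Wilson": {"prime", "minister", "labour", "1964"},
    "Wilson,_Arkansas": {"town", "arkansas", "mississippi"},
}

def commonness(surface, entity):
    """Prior probability that this surface form links to this entity."""
    counts = LINK_COUNTS[surface]
    return counts[entity] / sum(counts.values())

def context_score(entity, context_tokens):
    """Overlap between the local context and the entity's known vocabulary."""
    terms = CONTEXT_TERMS.get(entity, set())
    return len(terms & context_tokens) / (len(terms) or 1)

def disambiguate(surface, context_tokens, alpha=0.6):
    """Weighted mix of commonness and local context; alpha is arbitrary here."""
    candidates = LINK_COUNTS.get(surface, Counter())
    if not candidates:
        return None
    return max(candidates, key=lambda e: alpha * commonness(surface, e)
                                         + (1 - alpha) * context_score(e, context_tokens))

print(disambiguate("Wilson", {"treaty", "of", "versailles", "1918"}))  # Woodrow_Wilson
```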
Dialogue about strengths & (more importantly) weaknesses
Performance: currently an 88% F-score on the 4-class CoNLL NER corpus
Unify the complex questions of access control, entity versioning and manuscript access into a uniform API. Control identifiers for entities, users, fonds.
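A minimal sketch of what such a unified API could look like, assuming a simple in-memory store: identifiers, versions and access checks all pass through one repository object. The names and structure are illustrative assumptions, not Cendari's actual interface.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Version:
    data: dict
    author: str

@dataclass
class Entity:
    # Controlled identifier, minted by the repository rather than by client tools.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    versions: list = field(default_factory=list)

class Repository:
    """Single entry point for identifiers, versioning and access control."""

    def __init__(self):
        self.entities = {}   # id -> Entity (could equally hold users or fonds)
        self.readers = {}    # id -> set of users allowed to read

    def create(self, data, author):
        entity = Entity()
        entity.versions.append(Version(data, author))
        self.entities[entity.id] = entity
        self.readers[entity.id] = {author}
        return entity.id

    def update(self, entity_id, data, author):
        # Every change is kept as a new version; nothing is overwritten.
        self.entities[entity_id].versions.append(Version(data, author))

    def read(self, entity_id, user):
        if user not in self.readers[entity_id]:
            raise PermissionError(f"{user} may not read {entity_id}")
        return self.entities[entity_id].versions[-1].data
```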
A key aspect is to cope with the heterogeneity of source formats:
EAD, EAG, XML, PDF, TXT, TIFF, DOC
Paper records and finding aids
Language & Culture
A complex ingestion workflow is required to render the data uniformly accessible, while retaining integrity.
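One possible shape for that workflow, sketched in Python: dispatch on the source format, normalise everything to a common record, and never modify the original. The file extensions, converters and record fields are assumptions for illustration only.

```python
from pathlib import Path

def extract_text(path: Path) -> str:
    """Dispatch on format; real converters (EAD/EAG parsing, PDF extraction,
    OCR for TIFF scans, DOC conversion) would be plugged in here."""
    suffix = path.suffix.lower()
    if suffix in {".xml", ".txt"}:          # EAD/EAG finding aids are usually XML
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        raise NotImplementedError("hand off to a PDF text extractor")
    if suffix in {".tif", ".tiff"}:
        raise NotImplementedError("hand off to an OCR engine")
    if suffix in {".doc", ".docx"}:
        raise NotImplementedError("hand off to a document converter")
    raise ValueError(f"unsupported format: {suffix}")

def ingest(path: Path) -> dict:
    """Produce one uniform record per source file, preserving provenance."""
    return {
        "source": str(path),                       # the original is kept untouched
        "format": path.suffix.lstrip(".").upper(),
        "text": extract_text(path),
    }
```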
Facilitate crowd-sourcing by sharing the concepts and entities, while keeping the actual work private.
Archival Research Guide
- Methodology
- Theme
- Content
- Sources
- People
Cendari & the Europeana API
We abstract some of the complexity of data management away, to allow curation of knowledge across different tools. Personal data and annotations are stored in named graphs separate from group or global data. After an editorial process, data can migrate from the personal to the general, and ultimately to background knowledge. Each version of the data is saved.
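A sketch of the named-graph idea using rdflib, with invented URIs: each user gets a personal graph, and an editorial step copies accepted statements into the shared graph. This illustrates the pattern only; it is not the project's actual data model or vocabulary.

```python
from rdflib import Dataset, Namespace, URIRef

# Invented namespace and URIs for illustration.
CEND = Namespace("http://example.org/cendari/")
ds = Dataset()

# One named graph per user for personal notes, one shared graph for vetted data.
personal = ds.graph(URIRef("http://example.org/graphs/user/alice"))
shared = ds.graph(URIRef("http://example.org/graphs/global"))

note = URIRef("http://example.org/notes/42")
personal.add((note, CEND.mentions, URIRef("http://example.org/entities/Gavrilo_Princip")))

def promote(source, target):
    """Editorial step: copy accepted triples from a personal graph to the shared one.
    A real system would also record provenance and keep every prior version."""
    for triple in source:
        target.add(triple)

promote(personal, shared)
```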
- Users need tools that fit their existing workflows
- Cannot assume the availability of full text
- Users are happy for others to contribute, but may not be so keen to share themselves
- PDFs are everywhere
- XML is a variable creature