Volume 11 Number 4
Semantic Enrichment of a Multilingual Archive with Linked Open Data
Abstract
This paper introduces MERCKX, a Multilingual Entity/Resource Combiner & Knowledge eXtractor. A case study involving the semantic enrichment of a multilingual archive is presented with the aim of assessing the relevance of natural language processing techniques such as named-entity recognition and entity linking for cultural heritage material. In order to improve the indexing of historical collections, we map entities to the Linked Open Data cloud using a language-independent method. Our evaluation shows that MERCKX outperforms similar tools on the task of place disambiguation and linking, achieving over 80% precision despite lower recall scores. These results are encouraging for small and medium-size cultural institutions since they demonstrate that semantic enrichment can be achieved with limited resources.
1. Introduction
2. Related Work
3. Case study
# | Term | Hits | Category |
1. | Zillebeke | 398 | Location |
2. | Passendale | 351 | Location |
3. | Westouter | 259 | Location |
4. | Ieper | 197 | Location |
5. | oorlog | 178 | Concept |
6. | Reninghelst | 163 | Location |
7. | Bikschote | 149 | Location |
8. | Merkem | 127 | Location |
9. | Geluveld | 125 | Location |
10. | Wijtschate | 121 | Location |
4. MERCKX: A Knowledge Extractor
4.1. Downloading resources
<http://dbpedia.org/resource/Autism> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Disease>
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Philosopher>
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/AdministrativeRegion>
<http://dbpedia.org/resource/Alabama> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Place>
4.2. Mapping labels to URIs
<http://dbpedia.org/resource/South_Africa> <http://www.w3.org/2000/01/rdf-schema#label> "Afrique du Sud"@fr
<http://dbpedia.org/resource/Andorra> <http://www.w3.org/2000/01/rdf-schema#label> "Andorre"@fr
<http://dbpedia.org/resource/Angola> <http://www.w3.org/2000/01/rdf-schema#label> "Angola"@fr
<http://dbpedia.org/resource/Saudi_Arabia> <http://www.w3.org/2000/01/rdf-schema#label> "Arabie saoudite"@fr
Afrique du Sud dbr:South_Africa
Andorre dbr:Andorra
Angola dbr:Angola
Arabie saoudite dbr:Saudi_Arabia
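The step from label triples to the compact two-column mapping above can be sketched in Python. This is a minimal illustration under stated assumptions, not MERCKX's actual code: the function name `labels_to_mapping`, the regular expression, and the rule that shortens the DBpedia resource namespace to `dbr:` are all introduced here for the example.

```python
import re

# Matches one rdfs:label triple with a language-tagged literal.
# Assumed line layout; literals containing escaped quotes are not handled.
LABEL_RE = re.compile(
    r'<(?P<uri>[^>]+)>\s+'
    r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
    r'"(?P<label>[^"]*)"@(?P<lang>[A-Za-z-]+)'
)
DBR = "http://dbpedia.org/resource/"


def labels_to_mapping(lines, lang):
    """Return {label: shortened URI} for triples tagged with the given language."""
    mapping = {}
    for line in lines:
        match = LABEL_RE.search(line)
        if match is None or match.group("lang") != lang:
            continue  # skip non-label lines and other languages
        uri = match.group("uri")
        if uri.startswith(DBR):
            uri = "dbr:" + uri[len(DBR):]
        mapping[match.group("label")] = uri
    return mapping


sample = [
    '<http://dbpedia.org/resource/South_Africa> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Afrique du Sud"@fr',
    '<http://dbpedia.org/resource/Andorra> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Andorre"@fr',
]
for label, uri in labels_to_mapping(sample, "fr").items():
    print(f"{label}\t{uri}")
```

A plain dictionary keyed by label keeps the later lookup of a spotted string a constant-time operation, which matters when the label files contain hundreds of thousands of entries.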
- Load the label files for each language, one by one (EN > NL > FR).
- Check for each label if it corresponds to the chosen type (dbo:Place).
- If the label already exists, check if the type remains the same ("Avant"@nl is already listed as a place, but is "Avant"@fr also a place?).
- If the type is the same, update the URI (yes > URI FR replaces URI NL).
- If the type is different – i.e. multilingually ambiguous – remove the label (no > suppress “Avant” from the file).[16]
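The steps above can be sketched as a small Python function. It follows the list literally and is only an illustration: the name `merge_labels`, the in-memory shapes of its arguments, and the `ex:` URIs in the sample data are hypothetical, and the real type assignments for "Avant" are not taken from DBpedia.

```python
PLACE = "dbo:Place"


def merge_labels(per_language, types, target=PLACE):
    """Merge label -> URI maps in the given language order.

    per_language: list of (language code, {label: uri}) pairs, e.g. EN, NL, FR.
    types: {uri: set of type names}, an assumed in-memory form of the type file.
    """
    merged = {}
    for _lang, labels in per_language:
        for label, uri in labels.items():
            is_target = target in types.get(uri, set())
            if label not in merged:
                if is_target:
                    merged[label] = uri   # new label of the chosen type
            elif is_target:
                merged[label] = uri       # same type: the later language wins
            else:
                del merged[label]         # multilingually ambiguous: suppress
    return merged


# Made-up sample data: "Avant" is a place in NL but not in FR.
types = {
    "dbr:Angola": {"dbo:Place"},
    "ex:Avant_nl": {"dbo:Place"},
    "ex:Avant_fr": {"dbo:Work"},
}
per_language = [
    ("en", {"Angola": "dbr:Angola"}),
    ("nl", {"Angola": "dbr:Angola", "Avant": "ex:Avant_nl"}),
    ("fr", {"Angola": "dbr:Angola", "Avant": "ex:Avant_fr"}),
]
print(merge_labels(per_language, types))
```

Because the sketch keeps no record of suppressed labels, a label removed for one language could be added again by a later language in which it is a place; whether that is the intended behaviour is not stated in the steps.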
URIs | EN | NL | FR | ALL |
735,062 | 709,357 | 194,208 | 186,483 | 857,911 |
{
"Afrique du Sud" : "dbr:South_Africa",
"Andorre" : "dbr:Andorra",
"Angola" : "dbr:Angola",
"Arabie saoudite" : "dbr:Saudi_Arabia",
}
4.3. Tokenizing, spotting, and annotating
5. Evaluation
5.1. Gold-standard corpus
5.1.1. Sample selection
187 198 Bouvancourt
199 205 Fismes
561 565 Pévy
626 640 East Yorkshire
1076 1082 Trigny
1145 1151 Muizon
1200 1205 Vesle
5.1.2. Cohen’s kappa
Lang. | Both | A | B | None | Tot | Pr(a) | Pr(e) | K |
EN | 20 | 2 | 2 | 678 | 702 | .994 | .939 | .906 |
FR | 197 | 46 | 8 | 13422 | 13673 | .996 | .968 | .877 |
NL | 384 | 13 | 27 | 15387 | 15811 | .997 | .950 | .949 |
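The agreement figures in this table can be recomputed with a few lines of Python. The sketch assumes that "Both", "A", "B", and "None" count the tokens annotated by both annotators, by annotator A only, by annotator B only, and by neither; the function name `cohen_kappa` is chosen for the example.

```python
def cohen_kappa(both, only_a, only_b, none):
    """Return (Pr(a), Pr(e), kappa) for two annotators and a binary decision."""
    total = both + only_a + only_b + none
    pr_a = (both + none) / total            # observed agreement
    a_yes = (both + only_a) / total         # share annotated by A
    b_yes = (both + only_b) / total         # share annotated by B
    # Chance agreement: both say yes, or both say no.
    pr_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return pr_a, pr_e, (pr_a - pr_e) / (1 - pr_e)


counts = {
    "EN": (20, 2, 2, 678),
    "FR": (197, 46, 8, 13422),
    "NL": (384, 13, 27, 15387),
}
for lang, cells in counts.items():
    pr_a, pr_e, kappa = cohen_kappa(*cells)
    print(f"{lang}: Pr(a)={pr_a:.3f} Pr(e)={pr_e:.3f} K={kappa:.3f}")
```

Under this reading of the columns, the computed values agree with the table to three decimals. Kappa stays well below Pr(a) because the "None" cell dominates: most tokens are not place names, so chance agreement is already high.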
gsc3.txt 187 198 Bouvancourt
gsc3.txt 199 205 Fismes
gsc3.txt 561 565 Pévy
gsc3.txt 626 640 East Yorkshire
gsc3.txt 1076 1082 Trigny
gsc3.txt 1145 1151 Muizon
gsc3.txt 1200 1205 Vesle
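Annotation lines in this layout can be read back with a short parser. The sketch below assumes whitespace-separated fields (file name, start offset, end offset, mention) and an end offset that is exclusive, which is consistent with the sample since each end minus start equals the mention length; the names `Annotation`, `parse_annotation`, and `offsets_consistent` are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    doc: str
    start: int
    end: int
    mention: str


def parse_annotation(line):
    """Split one line into its four fields; the mention may contain spaces."""
    doc, start, end, mention = line.strip().split(None, 3)
    return Annotation(doc, int(start), int(end), mention)


def offsets_consistent(ann):
    """Check that the offset span is exactly as long as the mention."""
    return ann.end - ann.start == len(ann.mention)


for line in ["gsc3.txt 187 198 Bouvancourt", "gsc3.txt 626 640 East Yorkshire"]:
    ann = parse_annotation(line)
    print(ann, offsets_consistent(ann))
```

A check of this kind is a cheap way to catch offsets that drifted after a file was re-encoded or re-tokenized.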
5.2. Benchmarking
SAM requires an annotation’s position to exactly match the reference, besides requiring the entity annotated to match the reference entity. ENT ignores positions and only evaluates whether the entity proposed by the system matches the reference. [Cornolti et al. 2013]
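The difference between the two measures can be illustrated with a minimal Python sketch. It is not the benchmarking framework's implementation: annotations are modelled as hypothetical `(document, start, end, URI)` tuples, ENT is approximated by comparing sets of `(document, URI)` pairs, and the URIs in the sample are assumed for the example.

```python
def prf(system, gold):
    """Precision, recall, and F-score for two sets of items."""
    tp = len(system & gold)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def sam_scores(system, gold):
    """Strong annotation match: positions and entity must both be identical."""
    return prf(set(system), set(gold))


def ent_scores(system, gold):
    """Entity match: positions are dropped before comparing."""
    def entities(annotations):
        return {(doc, uri) for doc, _start, _end, uri in annotations}
    return prf(entities(system), entities(gold))


gold = [
    ("gsc3.txt", 187, 198, "dbr:Bouvancourt"),
    ("gsc3.txt", 199, 205, "dbr:Fismes"),
]
system = [
    ("gsc3.txt", 187, 198, "dbr:Bouvancourt"),
    ("gsc3.txt", 200, 205, "dbr:Fismes"),   # right entity, start offset off by one
]
print("SAM:", sam_scores(system, gold))
print("ENT:", ent_scores(system, gold))
```

In this sample the misaligned second annotation counts as an error under SAM but as a hit under ENT, so ENT scores are never lower than SAM scores for the same output.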
5.2.1. DBpedia Spotlight
The spotting stage recognizes in a sentence the phrases that may indicate a mention of a DBpedia resource. Candidate selection is subsequently employed to map the spotted phrase to resources that are candidate disambiguations for that phrase. The disambiguation stage, in turn, uses the context around the spotted phrase to decide for the best choice amongst the candidates. The annotation can be customized by users to their specific needs through configuration parameters [ ...]. [Mendes et al. 2011]
5.2.2. Zemanta
5.2.3. Babelfy
5.3. Results
System | Precision (Raw) | Precision (Corr) | Recall (Raw) | Recall (Corr) | F-score (Raw) | F-score (Corr) |
Spotlight | .466 | .468 | .192 | .207 | .272 | .287 |
Zemanta | .887 | .898 | .333 | .371 | .485 | .525 |
Babelfy | .656 | .688 | .376 | .446 | .478 | .541 |
MERCKX | .712 | .744 | .488 | .559 | .579 | .638 |
System | Precision (Raw) | Precision (Corr) | Recall (Raw) | Recall (Corr) | F-score (Raw) | F-score (Corr) |
Spotlight | .235 | .287 | .190 | .251 | .210 | .268 |
Zemanta | .867 | .888 | .278 | .362 | .421 | .515 |
Babelfy | .662 | .711 | .321 | .399 | .433 | .511 |
MERCKX | .782 | .805 | .443 | .517 | .566 | .629 |
5.3.1. Quantitative analysis
5.3.2. Qualitative analysis
5.3.3. Impact of OCR
6. Conclusion and Future Work
Acknowledgements
Notes
Works Cited