National Library of Finland · Centre for Digitization and Preservation
Assistant Professor in Digital Humanities, University of Helsinki
Language technology consultant for Digital Collections project
University of Helsinki, Department of Finnish, Nordic and Finno-Ugric Studies
Department of Linguistics and English Language, Lancaster University, UK
Named Entity Recognition (NER), the search, classification and tagging of names and
name-like informational elements in texts, has become a standard information
extraction procedure for textual data. NER has been applied to many types of
texts and to different types of entities: newspapers, fiction, historical records;
persons, locations, chemical compounds, protein families, animals, etc. In
general, the performance of a NER system is genre- and domain-dependent, and the
entity categories used also vary.
This paper reports the first large-scale results of NER on a historical Finnish OCRed newspaper collection. The results supplement NER results for other languages with similarly noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective.
Named-entity recognition in Finnish newspapers from 1771–1910
The National Library of Finland has digitized a large proportion of the
historical newspapers published in Finland between 1771 and 1910.
The web service digi.kansalliskirjasto.fi is used, for example, by genealogists, heritage societies, researchers, and history-enthusiast laymen. There is also an increasing desire to offer the material more widely for educational use. In 2016 the service had about 18 million page loads. User statistics from 2014 showed that about 88.5% of the usage of Digi came from Finland, while 11.5% came from outside Finland.
Digi is part of the growing global network of digitized newspapers and journals,
and historical newspapers are increasingly considered an important source of
historical knowledge. As the amount of digitized data accumulates, tools are
needed to harvest the data, gather information, and add structure to the
unstructured mass.
Our goal in using NER is to provide the users of Digi with better means for
searching and browsing the historical newspapers (i.e. new ways to structure,
access, and possibly enhance information). Different types of names, especially
person names and names of locations, are frequently used as search terms in
newspaper collections.
Named Entity Recognition is a tool that needs to serve some useful purpose.
In our case, extraction of person and place names is primarily a means for
improving access to the Digi collection. After getting the recognition rate of
the NER tool to an acceptable level, we need to decide how we are going to use
the extracted names in Digi. Some exemplary suggestions are provided by the
archives of La Stampa and Trove Names.
If we consider possible uses of the presently evaluated NER tools (FiNER, the
FST, and Connexor's tagger) for our newspaper collection, they only perform
basic recognition and classification of names, which is the first stage of
entity handling.
One more possible use for NER involves tagging and classifying images published in the newspapers. Most of the images (photos) have short title texts, and many of the images represent locations and persons, with the names of the depicted objects mentioned in the image title. As image recognition and classification of low-quality print images may not be very feasible, image texts may offer a way to classify at least a reasonable share of the images. Along with NER, topic detection could also be applied to the image titles. Image content tagging could thus be one clear application for NER.
Our main research question in this article is how well or poorly names can be
recognized in an OCRed historical Finnish newspaper collection with readily
available software tools. The task has many pitfalls that affect the results.
First, the word-level quality of the material is quite low.
We use five readily available tagging tools for our task. By using a set of different types of tools we are able to pinpoint common failures in NE tagging of our type of material. Observations from the error analysis of the tools may help us further improve NE tagging of historical OCRed Finnish. Differences in tagging between the tools also help us to analyze further the nature of our material.
We will not provide a review of the basic NER literature; those who are
interested in an overall picture of the topic can start with Nadeau and Sekine.
The structure of the paper is as follows: first we introduce our NER tools, our evaluation data, and the tag set. Then we show the results of the evaluations, analyze the errors of the different tools, and finally discuss the results and our plans for using NER with the online newspaper collection.
For recognizing and labelling named entities in our evaluation we use the FiNER
software as a baseline NER tool. Our second main tool, SeCo's ARPA, is a
different type of tool, mainly used for Semantic Web tagging and for linking
entities.
All our taggers have been implemented as analyzers of modern Finnish, although
ARPA's morphological engine is able to deal with 19th-century Finnish, too. As
far as we know, there is no NE tagger available for historical Finnish. Before
choosing FiNER and ARPA we also tried a commonly used, freely available
trainable statistical tagger, Stanford NER, but were not able to get reasonable
performance out of it for our purposes, although the software has been used
successfully for languages other than English: Dutch, French, and German named
entity recognition with the Stanford NER tool has been reported in the
Europeana historical newspaper project, and the results have been good.
As far as we know, besides the five tools evaluated in this paper, there are not
many other existing tools that can do NER analysis for Finnish.
FiNER
The focus of FiNER is on recognizing different types of proper names. Additionally, it can identify the majority of Finnish expressions of time and, for example, sums of money. FiNER uses multiple strategies in its recognition task:
The pattern-matching engine that FiNER uses, HFST Pmatch, marks the leftmost
longest non-overlapping matches satisfying the rule set (basically a large
set of disjuncted patterns).
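To make the matching strategy concrete, the following minimal Python sketch emulates leftmost-longest non-overlapping matching over a handful of made-up patterns; FiNER's real Pmatch rule set is far larger and is compiled into finite-state transducers rather than regular expressions:

```python
import re

# Made-up stand-in patterns; FiNER's real rule set is far larger.
PATTERNS = [
    (re.compile(r"[A-ZÅÄÖ]\.\s+[A-ZÅÄÖ][a-zåäö]+"), "EnamexPrsHum"),  # I. Lastname
    (re.compile(r"[A-ZÅÄÖ][a-zåäö]+katu"), "EnamexLocStr"),           # ...katu, 'street'
    (re.compile(r"Helsingissä|Helsinki"), "EnamexLocPpl"),
]

def leftmost_longest(text):
    """Scan left to right; at each position take the longest match over
    all patterns, emit it, and continue after its end (non-overlapping)."""
    pos, hits = 0, []
    while pos < len(text):
        best = None
        for pattern, tag in PATTERNS:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[0].end()):
                best = (m, tag)
        if best:
            hits.append((best[0].group(), best[1]))
            pos = best[0].end()
        else:
            pos += 1
    return hits

print(leftmost_longest("J. Snellman asui Aleksanterinkatu 7:ssä Helsingissä."))
# [('J. Snellman', 'EnamexPrsHum'), ('Aleksanterinkatu', 'EnamexLocStr'),
#  ('Helsingissä', 'EnamexLocPpl')]
```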
SeCo’s ARPA
The core benefits of the ARPA system lie in its dynamic, configurable nature.
In processing, ARPA combines a separate lexical processing step with a
configurable SPARQL-query-based lookup against an entity lexicon stored at a
Linked Data endpoint. Lexical processing for Finnish is done with a modified
version of Omorfi.
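This two-step idea can be sketched roughly as follows; the endpoint URL, query shape, and the trivial lemmatizer stand-in are our own illustrative assumptions, not ARPA's actual configuration:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative endpoint; ARPA's actual services and queries differ.
ENDPOINT = "http://ldf.fi/example/sparql"

def lemma_candidates(token):
    """Stand-in for the Omorfi-based lexical processing step: return
    possible base forms for an (inflected) surface token, e.g.
    'Helsingissä' -> ['Helsinki']. A real system calls Omorfi here."""
    return [token]

def lookup(lemmas):
    """Query a Linked Data endpoint for entities whose labels match
    any of the lemma candidates."""
    sparql = SPARQLWrapper(ENDPOINT)
    values = " ".join(f'"{lemma}"@fi' for lemma in lemmas)
    sparql.setQuery(f"""
        SELECT ?entity ?label WHERE {{
          VALUES ?label {{ {values} }}
          ?entity <http://www.w3.org/2004/02/skos/core#prefLabel> ?label .
        }}""")
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

# e.g. lookup(lemma_candidates("Helsinki")) -> matching entity bindings
```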
As there was no evaluation collection for Named Entity Recognition of
19th-century Finnish, we first needed to create one. As evaluation data we used
samples from different decades of the Digi collection. Kettunen and Pääkkönen
calculated, among other things, the number of words in the data for different
decades.
We aimed for an evaluation collection of 150,000 words. To emphasize the
importance of the 1870–1910 material, we took 50K of data from the period
1900–1910, 10K from 1890–1899, 10K from 1880–1889, and 10K from 1870–1879.
The remaining 70K of the material was picked from the period 1820–1869
(50K + 3 × 10K + 70K = 150K). Thus the collection reflects most of the data
from the century but is also weighted toward the end of the 19th century and
the beginning of the 20th century. Decade-by-decade word recognition rates in
Kettunen and Pääkkönen show that the word recognition rate over the whole
19th century is quite even, with a maximum variation of about 10 percentage
points.
The final manually tagged evaluation data consists of 75,931 lines, each line
containing one word or other character data. By character data we mean lines
containing misrecognized words with a variable number of OCR errors. The word
accuracy of the evaluation sample is on the same level as the whole newspaper
collection's word-level quality: about 73% of the words in the evaluation
collection are recognized by a modern Finnish morphological analyzer, while
the recognition rate over the whole index of the newspaper collection is
estimated to be in the range of 70–75%.
FiNER uses fifteen tags for different types of entities, which is too fine a
distinction for our purposes. Our first aim was to concentrate only on
locations and person names, because they are the search terms most used in the
Digi collection: an earlier log analysis showed that 80% of the ca. 149,000
occurrences of the top 1,000 search term types consisted of first and last
names of persons and place names.
After reviewing some of the FiNER-tagged material, we also included three other tags, as they seemed important and occurred frequently enough in the material. The eight final tags are shown and explained below.
<EnamexPrsHum>    person
<EnamexLocXxx>    general location
<EnamexLocGpl>    geographical location
<EnamexLocPpl>    political location (state, city etc.)
<EnamexLocStr>    street, road, street address
<EnamexOrgEdu>    educational organization
<EnamexOrgCrp>    company, society, union etc.
<TimexTmeDat>     expression of time

The final tag set shows that our interest is mainly in the three most
commonly used semantic NE categories: persons, locations, and organizations.
For locations we have four different categories and for organizations two.
Temporal expressions were included in the tag set due to their general
interest in the newspaper material. Persons and locations in particular
fulfill the content validity condition of an experimental unit.
Manual tagging of the evaluation corpus was done by the third author, who had
previous experience in tagging modern Finnish with the tags of the FiNER
tagger. Tagging took one month, and the quality and principles of the tagging
were discussed before starting, based on a sample of 2,000 lines of evaluation
data. It was agreed, for example, that words that are misspelled but
recognizable as named entities to the human tagger would be tagged (cf. the
50% character correctness rule).
All the evaluation runs were performed with the tagged 75K evaluation set. This set was not used in the configuration of the tools.
To get an idea of how well FiNER recognizes names in general, we evaluated
it with a separate list of 75,980 names of locations and persons. We
included in the list modern first names and surnames, old first names
from the 19th century, names of municipalities, and names of villages
and houses. The list also contains names in Swedish, as Swedish was the
dominant language of Finland during most of the 19th century.
FiNER recognized 55,430 names on the list, i.e. 72.96%. Out of these,
8,904 were tagged as persons and 35,733 as locations; the rest fell into
other categories.
Among the names that FiNER does not recognize are foreign names (mostly
Swedish, but also Sami), names that can also be common nouns, various
compound names, and old names. Variation in the spelling of names also
causes misses.
We evaluated the performance of the NER tools using strict criteria: a result
is considered correct only if both the boundaries and the classification are
exactly as annotated.
We also performed a looser evaluation for all the taggers, in which any correct marking of an entity, regardless of its boundaries, was considered a hit.
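For reference, one way to operationalize the two scoring modes in a few lines of Python; the span format and the overlap-based loose criterion are simplifying assumptions of this sketch, not our actual evaluation scripts:

```python
def prf(gold, pred, loose=False):
    """Precision/recall/F1 over entity spans (start, end, tag).
    Strict: boundaries and tag must match exactly.
    Loose: a prediction counts if it overlaps a gold entity
    carrying the same tag, regardless of boundaries."""
    if loose:
        tp = sum(1 for p in pred
                 if any(g[2] == p[2] and g[0] < p[1] and p[0] < g[1]
                        for g in gold))
    else:
        tp = len(set(gold) & set(pred))
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 2, "EnamexPrsHum"), (5, 6, "EnamexLocPpl")]
pred = [(1, 2, "EnamexPrsHum")]          # right tag, wrong boundary
print(prf(gold, pred))                   # strict: no credit
print(prf(gold, pred, loose=True))       # loose: counted as a hit
```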
Detailed results of the evaluation of FiNER are shown in Table 1. Entities
tagged <ent/> consist of one word token, tokens tagged <ent> are non-final
parts of a multiword entity, and tokens tagged </ent> are the last parts of
multiword entities. An example of a multipart name would be
<EnamexPrsHum>E. Jansson</EnamexPrsHum>, where E. opens the entity and
Jansson closes it.
Results of the evaluation show that named entities are recognized quite poorly
by FiNER, which is not surprising given that the quality of the text data is
quite low. Recognition of multipart entities is mostly very low: part of an
entity may be recognized while the rest is not. Among multiword entities,
person names and educational organizations are recognized best. Names of
persons are the most frequent category. Recall for one-part person names is
the highest, but their precision is low. Multipart person names have a more
balanced recall and precision, and their F-score is 40–45. If the three
different location tags (<EnamexLocXxx>, <EnamexLocGpl>, and <EnamexLocPpl>)
are considered as one category, the results for locations improve accordingly.
In the looser evaluation, the four location categories were joined into two:
general location (<EnamexLocXxx>, here subsuming <EnamexLocGpl> and
<EnamexLocPpl>) and street names (<EnamexLocStr>).
Our third evaluation was performed on a limited tag set with the tools of SeCo's ARPA. We first analyzed ARPA's lexical coverage with the same word list that was used with FiNER. Of the 75,980 tokens in the recognition word list, ARPA recognized 74,068 as either locations or persons (97.4%): 67,046 were recognized as locations and 37,456 as persons, with 30,434 names tagged as both. Among the 1,912 names not recognized by ARPA were the same kinds of foreign names that FiNER also left unrecognized. 13% of the unrecognized names were hyphenated compounds, such as Esa-Juha and Esa-Juhani; this type could easily be handled by ARPA with minor modifications to its configuration. In general, the test showed that ARPA's lexicons are more comprehensive than those of FiNER.
First, only places were identified, with a single location category covering all place names.
Table 4 describes the results of location recognition with ARPA; with one
exception, the results are on a good level.
A second improvement to the ARPA process arose from the observation that
while recall in the first test run was high, precision was low. Analysis
revealed this to be due to many names being both person names and places.
Thus, a filtering step was added that removed 1) hits identified as person
names by the morphological analyzer and 2) hits that matched regular
expressions catching common person name patterns found in the data
(I. Lastname and FirstName LastName). However, this was sometimes too
aggressive, ending up filtering out even names of big cities.
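The filter described above might look roughly like the following sketch; the regular expressions and the hit format are illustrative assumptions, not ARPA's actual filter code:

```python
import re

# Illustrative approximations of the two person-name patterns the
# filter targeted: "I. Lastname" and "FirstName LastName".
INITIAL_SURNAME = re.compile(r"[A-ZÅÄÖ]\.\s*[A-ZÅÄÖ][a-zåäö]+$")
FIRST_LAST      = re.compile(r"[A-ZÅÄÖ][a-zåäö]+\s+[A-ZÅÄÖ][a-zåäö]+$")

def filter_place_hits(hits):
    """Drop place-name candidates whose immediate left context makes
    them look like the surname part of a person name. Each hit is
    (name, context), e.g. ('Jansson', 'E. Jansson')."""
    return [name for name, ctx in hits
            if not (INITIAL_SURNAME.search(ctx) or FIRST_LAST.search(ctx))]

# 'Mikkeli' preceded by a first name gets (over-)filtered, illustrating
# how genuine city names could be lost.
print(filter_place_hits([("Jansson", "E. Jansson"),
                         ("Mikkeli", "Matti Mikkeli"),
                         ("Helsinki", "saapui Helsinki")]))
# ['Helsinki']
```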
Finally, as the amount of OCR errors in the target dataset was identified as a major hurdle to accurate recognition, we experimented with sacrificing precision in favor of recall by enabling various levels of Levenshtein distance matching against the place name registries. In this test, the fuzzy matching was done in the query phase after lexical processing. This was easy to do, but doing the fuzzy matching during lexical processing would probably be better, as lemma guessing (which is needed because OCR errors are out of the lemmatizer's vocabulary) is currently extremely sensitive to OCR errors, particularly in the suffix parts of words.
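In spirit, the query-phase fuzzy step resembles this sketch, with a toy gazetteer standing in for the real place name registries:

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

GAZETTEER = ["Helsinki", "Turku", "Wiipuri"]   # toy stand-in registry

def fuzzy_match(token, max_dist=1):
    """Trade precision for recall: accept registry names within a small
    edit distance, catching OCR errors such as 'Hclsinki'."""
    return [name for name in GAZETTEER
            if levenshtein(token.lower(), name.lower()) <= max_dist]

print(fuzzy_match("Hclsinki"))   # ['Helsinki']
```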
After the place recognition pipeline was finalized, a further test was done to see if the ARPA pipeline could also be used for person name recognition. The Virtual International Authority File (VIAF) was used as a lexicon of names, as it contains 33 million names for 20 million people. In the first run, the query simply matched all uppercase words against both first and last names in this database, while also allowing any number of initials to precede the matched names. This way the found names cannot always be linked to strong identifiers, but for a pure NER task, recall is improved.
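The first-run matching rule can be approximated as follows; the tiny name set is a stand-in for the 33-million-name VIAF lexicon, and the regular expression is our own illustration:

```python
import re

# Toy stand-in for the VIAF name lexicon (first and last names pooled).
VIAF_NAMES = {"Jansson", "Snellman", "Aino"}

# Any number of initials may precede a capitalized candidate word.
CANDIDATE = re.compile(r"((?:[A-ZÅÄÖ]\.\s*)*)([A-ZÅÄÖ][a-zåäö]+)")

def person_candidates(text):
    """Return matched (initials, name) pairs whose name part occurs
    in the lexicon as a first or last name."""
    return [(m.group(1).strip(), m.group(2))
            for m in CANDIDATE.finditer(text)
            if m.group(2) in VIAF_NAMES]

print(person_candidates("Kirjeen lähetti J. V. Snellman Kuopiosta."))
# [('J. V.', 'Snellman')]
```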
Table 5 shows the results of this evaluation without fuzzy matching of names and Table 6 with fuzzy matching. Table 7 shows the evaluation results with loose criteria without fuzzy matching, and Table 8 the loose evaluation with fuzzy matching.
Recall increases markedly with fuzzy matching, but precision deteriorates. More multipart location names are also recognized with fuzzy matching. In the loose evaluation more tags are found, but precision is not very good, and thus the overall F-score is a bit lower than in the strict evaluation.
To sum up the results of our two main tools, we show one more table where the main comparable results of FiNER and ARPA appear in parallel. These are the results of the loose evaluations from Tables 2 and 7.
As one can see, FiNER performs slightly better with locations and persons than ARPA; the difference in F-scores is about five percentage points.
Here we briefly report the results of the three other systems that we evaluated:
Polyglot, the Finnish Semantic Tagger (FST), and Connexor's NE tagger.
Polyglot
Results of Polyglot's performance in a loose evaluation with three categories are shown in Table 10.
As can be seen from the figures, Polyglot has high precision with persons and locations but quite poor recall, and its F-scores are thus about 10 percentage points below FiNER's performance and clearly below ARPA's. With corporations Polyglot performs very poorly. The reason for this is probably that names of companies have changed, and the organization names taken from Wikipedia do not contain old company names.
Our fourth tool is a general semantic tagger for Finnish. The Finnish Semantic Tagger (FST) has its origins in Benedict, an EU-funded language technology project whose aim was to discover an optimal way of catering to the needs of dictionary users in modern electronic dictionaries by utilizing state-of-the-art language technology. Semantic tagging in its rule-oriented form (vs. statistical learning) can be briefly defined as a dictionary-based process of identifying and labeling the meaning of words in a given text according to some classification. FST is not a NER tool as such; it has first and foremost been developed for the analysis of full text.
The Finnish Semantic Tagger was developed using the English Semantic
Tagger as a model. The latter was developed at the University Centre for
Computer Corpus Research on Language (UCREL) at Lancaster University as
part of the UCREL Semantic Analysis System (USAS) framework.
FST tags three different types of names: personal names, geographical
names, and other proper names. These are tagged Z1, Z2, and Z3,
respectively.
FST tagged the list of 75,980 names as follows: it marked 5,569 names with the tags Z1–Z3. Out of these, 3,473 were tagged as persons, 2,010 as locations, and the rest as other proper names. It tagged 47,218 words with the tag Z99, which marks lexically unknown words. The rest of the words, 23,193, were tagged with tags for common nouns. Thus FST's recall with the name list is quite low compared to FiNER and ARPA.
In Table 11 we show the results of FST's tagging of locations and persons in our evaluation data. As the tagger does not mark multipart names, only a loose evaluation was performed. We performed two evaluations: one with the words as they are, and the other with w-to-v substitution. Substituting v for w neutralizes the most prominent difference between 19th-century and modern Finnish orthography, in which w was regularly used where modern Finnish has v.
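The substitution itself is trivial; a minimal sketch of the normalization as we understand it:

```python
def normalize_wv(text):
    """19th-century Finnish regularly wrote w where modern orthography
    has v (e.g. 'Waasa' -> 'Vaasa'); neutralizing this helps
    lexicon-based taggers built for modern Finnish."""
    return text.replace("w", "v").replace("W", "V")

print(normalize_wv("Wanha kirkko Waasassa"))  # 'Vanha kirkko Vaasassa'
```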
Connexor Ltd. has provided various language technology tools, among them
name recognition: using linguistic and heuristic methods, the names in a
text can be tagged accurately. The software's name type repertoire is
large: at least 31 different types of names are recognized, grouped into
9 larger categories such as NAME.PER (persons), NAME.PRODUCT (products),
NAME.GROUP (organizations), and NAME.GPE (locations). Boundaries of names
are not tagged, so we perform only a loose evaluation.
As earlier, our interest is mainly in persons and locations. Connexor's
tags NAME.GPE, NAME.GPE.City, NAME.GPE.Nation, NAME.GEO.Land, and
NAME.GEO.Water were all treated as <EnamexLocXxx>. NAME.PER, NAME.PER.LAW,
NAME.PER.GPE, NAME.PER.Leader, NAME.PER.MED, NAME.PER.TEO, and
NAME.PER.Title were all treated as <EnamexPrsHum>. All other tags were
discarded.
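Written out as a lookup table, the mapping used in our evaluation is as follows (this is our own normalization step, not Connexor functionality):

```python
# Collapse Connexor's fine-grained name types onto the two evaluation
# categories; anything unlisted is discarded.
TAG_MAP = {
    "NAME.GPE": "EnamexLocXxx", "NAME.GPE.City": "EnamexLocXxx",
    "NAME.GPE.Nation": "EnamexLocXxx", "NAME.GEO.Land": "EnamexLocXxx",
    "NAME.GEO.Water": "EnamexLocXxx",
    "NAME.PER": "EnamexPrsHum", "NAME.PER.LAW": "EnamexPrsHum",
    "NAME.PER.GPE": "EnamexPrsHum", "NAME.PER.Leader": "EnamexPrsHum",
    "NAME.PER.MED": "EnamexPrsHum", "NAME.PER.TEO": "EnamexPrsHum",
    "NAME.PER.Title": "EnamexPrsHum",
}

def map_tag(connexor_tag):
    return TAG_MAP.get(connexor_tag)  # None means "discard"
```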
Results of Connexor’s tagger are shown in Table 12.
Results show that Connexor's NE tagger is best with locations, achieving the highest overall F-score of all the tools, and persons are also found well. Recall with persons is high, but low precision hurts overall performance. Inspection of the data shows that Connexor's tagger has a tendency to tag words beginning with upper case as persons. Locations are also frequently mixed with persons.
If we consider the results of FiNER and ARPA overall, we can make the following
observations. Both seem to find two-part person names best, most of which
consist of a first name and a last name. In the strict evaluation ARPA appears
better with locations than FiNER, but this is because FiNER has a more
fine-grained location tagging; with one location tag FiNER performs as well as
ARPA. In the loose evaluation both find locations and persons almost equally
well, but FiNER gets slightly better results. FiNER finds educational
organizations best, although they are scarce in the data. Corporations are
also found relatively well, even though this category is prone to historical
change. FiNER is precise in finding two-part street names, but recall in
street name tagging is low. The high precision is most likely due to the
common final component katu ('street') in street names.
Out of the other three tools we evaluated, the FST was able to recognize locations slightly better than FiNER or ARPA in the loose evaluation when w/v variation was neutralized. Connexor's tagger performed at the same level as FiNER and ARPA in the loose evaluation; its F-score with locations was the best overall. Polyglot performed worst of all the systems.
We evaluated the lexical coverage of three of our tools with a word list containing 75,980 names. ARPA's lexical coverage of the names was by far the best, as it recognized 97.4% of them. FiNER recognized 73% of the names on this list, and the FST recognized only about 7% of them as names, marking about 62% of them as unknown. Thus it seems that very high lexical coverage of names may not be the key issue in NER, as all three tools tagged locations at the same level. The FST performed worst with persons, although it had clearly more person names than locations in its lexicon.
One more caveat about performance is in order, especially with FiNER. After we
had obtained our evaluation results, we evaluated FiNER's context sensitivity
with a small test. Table 13 shows the effect of different contexts on FiNER's
tagging of 320 names of municipalities. The leftmost column shows results
where only a name list was given to FiNER. In the three remaining columns,
the name of the municipality was moved from the beginning of a clause to the
middle and to the end. The results imply that there is context sensitivity in
FiNER's tagging. With no context at all, results are worst, and when the
location is at the beginning of the sentence, FiNER also misses more tags than
in the other two positions. Overall it tags about two thirds of the
municipality names as locations.
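A test of this kind can be reproduced with simple carrier frames like the ones below; these example sentences are hypothetical, as the original test clauses are not reproduced here:

```python
# Hypothetical Finnish carrier frames for the three clause positions;
# {} is filled with each municipality name in turn (nominative case).
FRAMES = {
    "start":  "{} sijaitsee Suomessa.",              # "<Name> is located in Finland."
    "middle": "Kaupunki nimeltä {} mainittiin lehdessä.",  # "A city named <Name> was mentioned in the paper."
    "end":    "Lehti mainitsi kaupungin nimeltä {}.",       # "The paper mentioned a city named <Name>."
}

def build_test(names):
    """Yield (position, sentence) pairs to feed to the tagger, so
    tagging rates can be compared across clause positions."""
    for name in names:
        for position, frame in FRAMES.items():
            yield position, frame.format(name)

for position, sentence in build_test(["Mikkeli"]):
    print(position, "->", sentence)
```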
The same setting was tested further with 15,480 last names in three different clause positions. The positional effect on last name tagging was almost nonexistent, but the share of both untagged names and locative interpretations is high: 39% of last names are tagged as PrsHum, 19.5% are tagged as LocXxx, and about 34.6% get no tag at all; the remaining 7% fall into varying categories. Tagging of last names would probably be better if first names were given together with last names, as isolated last names are more ambiguous.
We did not test the effects of contextualization with the other taggers, but it may have had a minor effect on all our results, as the input text snippets were of different sizes (see section 2.3). Especially if first and last names are separated into different input snippets, the identification of person names may suffer.
Ehrmann et al. have reported related observations on NER for historical newspapers.
To be able to estimate the effect of bad OCR on the results, we made some
additional trials with improved OCR material. We ran tests with three versions
of a 500,000-word text that is different from our NER evaluation material but
also derives from the 19th-century newspapers. One version was manually
corrected OCR, another an old OCRed version, and the third a new OCRed
version. Besides character-level errors, word order errors have been corrected
in the two newer versions. For these texts we did not have a ground truth
NE-tagged version, and thus we could only count the number of NE tags in the
different texts. With FiNER, the total number of tags increased from 23,918 in
the old OCR to 26,674 (+11.5%) in the manually corrected version and to 26,424
(+10.5%) in the new OCRed version. The increases were most notable in certain
categories.
Another clear indication of the effect of OCR quality on the NER results is
the following observation: when the words in all the correctly tagged FiNER
entities were checked against a morphological analyzer, their unrecognition
rate was clearly lower than that of the words in incorrectly tagged entities.
We also performed tagger-specific error analysis for our tools. This analysis is not systematic, because the sources of the errors often remain unclear, but it shows some general tendencies of the tools. Besides general and tagger-specific errors, it also reveals some general characteristics of our data. The errors reported here can also be seen as common improvement goals for better NE tagging of our newspaper data.
Connexor’s NER tool seems to get misspelled person names many times right,
even though percentage of morphologically unrecognized words among the
locations is quite high in Fig. 1. Some of the rightly tagged but clearly
misspelled examples are
FiNER also analyzes some misspelled person names correctly, but the previously mentioned problems remain.
Closer examination of FiNER's street name results shows that the problems in
street name recognition are due to three main reasons: OCR errors in street
names, abbreviated street names, and multipart street names with numbers as
part of the name. In principle, streets are easy to recognize in Finnish, as
street names usually end in the common component katu ('street').
Another similar case is first-name initials, which are used a lot in
19th-century Finnish newspaper texts; names abbreviated to initials cause
similar problems for the taggers.
One common source of errors for all the NE taggers originates from the
ambiguity of some name types: many Finnish surnames can also be names of
locations, whether of municipalities, villages, or houses. Such ambiguous
names are frequent in the data.
We have shown in this paper evaluation results of NER for historical Finnish newspaper material from the 19th and early 20th century with two main tools, FiNER and SeCo's ARPA. Besides these two tools, we briefly evaluated three other tools: the Finnish Semantic Tagger, Polyglot's NER, and Connexor's NER. We were not able to train Stanford NER for Finnish. As far as we know, the tools we have evaluated constitute a comprehensive selection of tools that are capable of named entity recognition for Finnish, although not all of them are dedicated NER taggers.
Word-level correctness of the whole digitized newspaper archive is approximately
70–75%.
NER experiments with OCRed data in other languages usually show some improvement
in NER when the quality of the OCRed data is improved from very poor to
slightly better.
As the word accuracy of our material is low, it would be expected that better recognition results would be achieved if the word accuracy were around 80–90% instead of 70–75%. Our tests with texts of different quality suggest this too, as do the distinctly different morphological unrecognition rates of correctly and incorrectly tagged words.
Better quality for our texts may be achievable in the near future. Promising
results in the post-correction of the Finnish historical newspaper data have
been reported recently: two different correction algorithms developed in the
FIN-CLARIN consortium achieved correction rates of 20–35%.
Four of the five taggers employed in the experiments are rule-based systems
utilizing different kinds of morphological analysis, gazetteers, and pattern
and context rules. The only exception was Polyglot. However, while there has
been some recent work on rule-based systems for NER, most current NER research
relies on machine-learned models.
On a general level, there are a few lessons to be learned from our experiments
for those working with other small languages that do not have a
well-established repertoire of NER tools available. Some of them are well
known, but worth repeating. First and most self-evidently, bad OCR was once
again found to be the main obstacle to good-quality NER. This was shown
clearly in section 2.10. The implication is clear: better quality data is
needed to make NER work well enough to be useful. Second, somewhat
surprisingly, we noticed that differences in the lexical coverage of the tools
do not show that much in the NER results. ARPA had clearly the best lexical
coverage of the tools and FST the worst, but their NER performance with
locations is quite equal. This could imply that very large lexicons are not
necessary for a NE tagger and a lexicon with good basic coverage is enough,
but this could also be language-specific. Third, we showed that historical
language can be processed to a reasonable extent with tools made for the
modern language if nothing else is available. However, if the best possible
results are needed, tools oriented toward historical data must be used. It is
also possible that the rather short time frame of our material enhances the
performance of our tools. Fourth, the results of the Finnish Semantic Tagger
showed that NER need not be a task only for dedicated NER tools. This has also
been shown with modern Finnish.
Our main emphasis with NER will be to use the names in the newspaper collection as a means to improve structuring, browsing, and the general informational usability of the collection. Good enough coverage of the names also needs to be achieved for this use, of course. A reasonable precision/recall balance should be found for this purpose, but other capabilities of the software also need to be considered. These lingering questions must be addressed if we are to connect some type of functional NER to our historical newspaper collection's user interface.
The first and third authors were funded by the Academy of Finland, project Computational History and the Transformation of Public Discourse in Finland 1640–1910 (COMHIS), decision number 293 341.
Thanks to Heikki Kantola and Connexor Ltd. for providing the evaluation data of Connexor’s NE tagger.
Corresponding author: Kimmo Kettunen, ORCID: 0000-0003-2747-1382