Abstract
Digital humanities research that requires the digitization of medium-scale,
project-specific texts confronts a significant methodological and practical
question: is labour-intensive cleaning of the Optical Character Recognition
(OCR) output necessary to produce robust results through text mining analysis?
This paper traces the steps taken in a collaborative research project that aimed
to analyze newspaper coverage of a high-profile murder trial, which occurred in
New York City in 1873. A corpus of approximately one-half million words was
produced by converting original print sources and image files into digital
texts, a process that generated a substantial rate of OCR errors. We then
corrected the scans and added document-level genre metadata. This allowed us to
evaluate the impact of our quality upgrade procedures when we tested for
possible differences in word usage across two key phases in the trial's coverage
using log likelihood ratio [Dunning 1993]. The same tests were run
on each dataset – the original OCR scans, a subset of OCR scans selected through
the addition of genre metadata, and the metadata-enhanced scans corrected to 98%
accuracy. Our results revealed that error correction is desirable but not
essential. However, metadata to distinguish between different genres of trial
coverage, obtained during the correction process, had a substantial impact. This
was true both when investigating all words and when testing for a subset of
“judgment words” we created to explore the murder’s emotive elements
and its moral implications. Deeper analysis of this case, and others like it,
will require more sophisticated text mining techniques to disambiguate word
sense and context, which may be more sensitive to OCR-induced errors.
Introduction
Digitized historical newspapers have enriched scholars’ capacity to pose research
questions about the past, but the quality of these texts is rarely ideal, due
especially to OCR errors. Although scholarly concern about this problem in
relation to the use of large, web-based historical resources is growing [
Hitchcock 2013], research that involves the digitization of
smaller corpora of original print documents confronts the same challenge. The
perennial question in digital humanities research – how accurate must digitized
sources be to produce robust results – arose in the course of our attempt to
interrogate the meanings of a highly publicized murder case prosecuted in New
York in 1873. This paper traces how we addressed this challenge: firstly by
searching for the techniques best suited to digitize and to interpret news
coverage of the trial; and secondly, by applying a statistical tool to test for
possible shifts in popular appraisals of the case.
Our project transformed over several years, from an individual, archive-based
inquiry into a text mining collaboration that required the conversion of
original news accounts into searchable texts through OCR. Our initial scans had
an error rate of 20%, which raised the prospect that text mining might not
substantially enhance our capacity to analyze word usage in the case’s coverage.
Consequently, we proceeded to reduce noise through manual correction of the
scans and through the addition of genre metadata, thereby creating three
datasets – the original OCR scans, a subset of the OCR scans selected through
the addition of genre metadata, and the metadata-enhanced scans corrected to 98%
accuracy. By analysing each dataset with the log likelihood ratio statistical
tool, we determined that the labour-intensive work of cleaning the data modestly
improved the reliability of our test – to establish whether or not popular
judgment of the case altered after controversial evidence was introduced in the
course of the murder trial. However, the addition of genre-related metadata
proved to be considerably more significant. Most importantly, the digitization
of these previously unsearchable primary sources made it possible to pose a
research question that could not have been answered persuasively otherwise.
As Tim Hitchcock argues, the development of digital history into a discipline
requires that we expose and evaluate the research processes that allow us to
compose “subtle maps of
meaning” from piles of primary sources [
Hitchcock 2013, 20]. Accordingly, we begin with an account of the murder case in
Section 1, and discuss how the digitization of its news coverage opened up new
ways to mine its meanings. In Section 2, we discuss the nature of our corpus and
the ways in which we produced machine-readable text using Adobe Photoshop
Lightroom 3 and ABBYY FineReader 11. In Section 3, we outline the nature of the
errors produced in the scanning processes, while Section 4 details the steps we
took to correct them and to add genre metadata post OCR. Section 5 explains how
we used the log likelihood ratio tool to analyze word frequency and to test the
use of judgment words in trial coverage, using our three different datasets. The
results of our tests appear in Section 6, which is followed by a discussion of
the possible future directions of text mining research based on small- to
medium-sized corpora. We conclude that text mining can enrich historians’
capacity to analyze large bodies of text, even in the presence of OCR-induced
errors. Supplementing digitized text with genre metadata permits a finer-grained
and more reliable analysis of historical newspapers. Investing the time to
produce clean data and metadata improves performance, and our study suggests
that this is essential for more sophisticated analysis, such as language
parsing. Finally, our project underlines the need for interdisciplinary teams to
ensure the integrity of the digital tools used, as well as the reliability of
their outputs’ interpretation.
From Historical Analysis to Digital Historical Research
The Walworth murder project began in 2003 as a humanistic enterprise conducted by
an historian of gender and criminal justice who read over four hundred newspaper
accounts of the case on microfilm, reproduced as printed hard copies of the
images. This body of unsearchable records was augmented as the number
of digitized newspapers available through open-source and proprietary online
databases grew exponentially over the 2000s, although many of those texts were
unsearchable image files.
[1] A research grant made
it possible in 2012 to digitize the entire body of primary sources (the
paper-based prints and PDF images of newspapers) through OCR scans.
[2] This funding meant that hypotheses developed in the course of the
historian’s earlier close reading of the case could be tested in a collaboration
that included two hired computer scientists and a digital humanities
scholar.
The Walworth case’s extensive and sensational newspaper coverage indicated that
the murder of Mansfield Walworth, a second-rate novelist and
third-rate family man, stirred deep feelings, particularly because the killer
was his son, Frank. The murder provoked troubling questions: Was it legally or
morally excusable for a son to kill his father, no matter how despicable? And
why, in a family filled with lawyers and judges (including the murdered man’s
father, Judge Reuben Hyde Walworth) had the law not provided a
remedy [
O'Brien 2010]? The event occurred on 3 June 1873, when
Frank Walworth, a youth of nineteen, travelled to Manhattan to
confront his father, who was recently divorced from his mother. Mansfield
Walworth had sent a raft of letters to his ex-wife, full of murderous
threats and mad ravings. After intercepting these alarming letters, Frank
Walworth shot his father dead, then informed the police that he had
done so to save himself, his mother and his siblings. Was the shooter an
honourable son? A maudlin youth? Insane? Speculation swirled but the initial
response to the murder was one of shock: a refined young man from a highly
respectable white family had committed cold-blooded murder [
O'Brien 2010]. Scores of headlines announced that this was no
ordinary murder but a “PARRICIDAL
TRAGEDY.”
Our working hypothesis was that popular readings of the Walworth murder changed
as the trial progressed – from initial horror over the crime of parricide, to an
appreciation of domestic cruelty and the menacing nature of the victim. The
trial resembled a real-life domestic melodrama [
Powell 2004], and
its turning point occurred when the victim’s vile letters to his ex-wife were
read into evidence, as the defence attempted to verify the threat
Mansfield Walworth had presented to his son and ex-wife. At
this point in the trial, the dead man’s profane abuse was recorded by trial
reporters for the nation to read:
You have blasted my heart and think now as you always thought that you could
rob me of the sweet faces of my children and then gradually after a year or
two rob me of my little inheritance. You will see, you God damned bitch of
hell, I have always intended to murder you as a breaker of my heart. God
damn you, you will die and my poor broken heart will lie dead across your
God damned body. Hiss, hiss, I'm after you… I will kill you on
sight.
What was the impact of this obscene evidence on the public’s judgment of the
case? Did it change over the course of the trial, and if so, how? These were
research questions best addressed through the mining of the newspaper coverage,
a substantial corpus beyond the capacity of human assessment.
Although the volume of the Walworth case’s news coverage was modest compared to
large-scale, institutionally funded text mining projects, the standards set in
several benchmark historical newspaper text-mining research projects informed
our approach. Many, such as Mining the
Dispatch,
[3] use manually
double-keyed documentation, followed by comparison-based correction, to produce
datasets with 98+% accuracy. Mining the
Dispatch
used a large corpus of nineteenth-century U.S. newspapers to “explore – and encourage exploration of – the
dramatic and often traumatic changes as well as the sometimes surprising
continuities in the social and political life of Civil War Richmond.”
That project combined distant and close reading of every issue of one newspaper
(112,000 texts totalling almost 24 million words) “to uncover categories and discover
patterns in and among texts.”
[
Nelson 2010] As its director explained, high accuracy levels were necessary to combine
text mining with historical interpretation most productively: “the challenge is to toggle between
distant and close readings; not to rely solely on topic modelling and
visualizations.”
[
Nelson 2010] Studies that pursue similar objectives must first determine the best
methods to produce machine-readable text. The next section details the scanning
process preliminary to our analysis.
The Production of Machine-searchable Texts
Using a variety of sources, including photocopied prints of microfilmed
newspapers and PDF image files of stories sourced from several databases, we
gleaned 600 pages, comprising approximately 500,000 words of digitized text.
Figure 1 shows an example article ready for
scanning.
Since optimal machine-readable text was integral to our analysis of the murder
trial’s meanings, we reviewed the quality control methods used by ten large
public and private institutions, from scanning hardware and software through to
a diverse array of OCR software.
[4] We intended to use
non-proprietary software,
[5] but we ultimately selected ABBYY FineReader 11, since it allows for the
customisation of features.
[6] The first source of text
was photocopied pages printed in 2003 from microfilmed newspaper images. Since
online repositories of newspapers have subsequently become more numerous, we
were able to replace the poorest-quality photocopies with PDF image files of
the corresponding pages downloaded from those repositories. However, this strategy proved to be too
time-consuming, since each of these page-scan PDFs contained upwards of 60
paragraphs of text (an average of 5,000 words in small print), spread across
upwards of 8 columns, which required laborious searches for references to the
Walworth murder trial. Consequently, this replacement strategy was used only for
the worst 10% of the photocopies.
All viable files from the existing scanned microfilm were imported into Adobe
Photoshop Lightroom 3. This process involved batch scanning
pages in black and white into 300DPI TIFFs according to newspaper, and then
placing them in physical folders by newspaper, in the same order in which they
had been scanned, to facilitate cross-referencing of particular pages with their
corresponding files. Within Lightroom each file was then manually cropped
one-by-one to select only those columns related to the Walworth murder.
[7] Although Lightroom is designed for working with large catalogues of
photographs rather than “photographs of text”, it allowed us to batch
process select images iteratively through non-destructive image editing, so that
tests could be made to determine which combination of image processing was
likely to achieve the highest OCR accuracy.
Training software to recognize patterns of font and content is one of the most
challenging aspects of OCR, as it draws on Artificial Intelligence to
“recognize” multitudes of shapes as belonging to
corresponding letters. Our project revealed that ABBYY FineReader’s training
capacity is limited. After we exported files produced through Lightroom,
newspaper by newspaper,
[8] we trained ABBYY to recognize each newspaper’s
fonts, and each file was further “cleaned up” by
straightening text lines and by correcting for perspective distortion. Although
ABBYY appeared to “recognize” frequently occurring words,
like the surname Walworth, the OCRd results produced variations, such as
“Wolwarth” and “Warworth”. It became evident that ABBYY cannot
infer that such variants should be converted automatically to “Walworth”,
despite the high statistical likelihood that this is the word they represent.
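A post-OCR pass could approximate the statistical normalization that the software lacks. The following is a minimal sketch rather than the procedure we used: it assumes a short, hand-supplied list of principal names and maps any token within a small edit distance of one of them onto that name (a production version would need safeguards against false matches).

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance via dynamic programming."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,               # deletion
                                   current[j - 1] + 1,            # insertion
                                   previous[j - 1] + (ca != cb))) # substitution
            previous = current
        return previous[-1]

    PRINCIPALS = ["Walworth", "Mansfield"]   # assumed, abbreviated list of principal names

    def normalize(token: str, max_edits: int = 2) -> str:
        """Map probable OCR variants (e.g. 'Wolwarth', 'Warworth') onto a known name."""
        for name in PRINCIPALS:
            if token != name and edit_distance(token.lower(), name.lower()) <= max_edits:
                return name
        return token

    print(normalize("Wolwarth"), normalize("Warworth"))   # Walworth Walworth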
Our process exposed the sorts of image degradation common in the digitization of
historical newspapers, including: smudged, faded and warped text; ripped or
crumpled originals; image bleed from the reverse side of the paper; crooked and
curved text lines; and overexposed and underexposed microfilm scans. As a
result, ABBYY’s deficiencies required that customized automated corrections be
applied in the post-OCR phase of our project.
Nevertheless, the production of machine-readable text resulted in a uniform
dataset of newspaper articles that offers considerable granularity, including
the capacity to analyze a corpus of articles on the Walworth murder trial
according to date, a critical factor in our study, considering the admission of
Mansfield Walworth’s extraordinary letters into evidence. The
coverage of the corpus, disaggregated by newspaper, is shown in Figure 2.
OCR Errors and Quality Control for Text Mining
The variable quality of digitized historical newspapers has long been a challenge
for digital scholarship [
Arlitsch 2004]. Much of that variability
is associated with the historical and contemporary resources of publishing
houses, meaning that major metropolitan papers typically sit at one end of the
legibility spectrum and smaller, regional papers sit at the other. In our study,
articles from the
New York Times yielded accuracy
levels of 94.5%, and they are obtainable through the paper’s own search engine.
Furthermore, articles in the
Times repository are
cropped and cleaned, which means they are ready for OCRing with minimal image
manipulation. In contrast, OCR scans from an important upstate New York
newspaper based in the state capital, the
Albany
Argus, yielded results of only 65% accuracy. Unfortunately, due to
the substantial additional labour required to raise this and other smaller
papers’ level of accuracy, these scans were mostly too poor to incorporate. Thus
the variation in file quality from different newspapers presented a limitation
on the project’s initial ambitions. More broadly, this problem flags the
significant impact that OCR quality can make in the range of sources used for
text mining. OCR errors are part of a wider problem of dealing with “noise”
in text mining [
Knoblock 2007], which may also stem from other
sources such as historical spelling variations or language specific to different
media texts.
The impact of OCR errors varies depending on the task performed, however [
Eder 2013]. The tasks of sentence boundary detection,
tokenization, and part-of-speech tagging on text are all compromised by OCR
errors [
Lopresti 2008]. As Lopresti concludes: “While most such errors are localised,
in the worst case some have an amplifying effect that extends well beyond
the site of the original error, thereby degrading the performance of the
end-to-end system.” Another study performed document clustering and
topic modelling on text obtained from OCR [
Walker et al. 2010]. These
authors found that the errors had little impact on the clustering task but a
greater impact on the topic modelling task. A study involving the task of stylistic text
classification found that OCR errors had little impact on performance [
Stein et al. 2006]. In contrast, Eder advises that “tidily prepared corpora are integral
to tests of authorship attribution”
[
Eder 2013, 10]. Thus, the relevance of scanning errors remains a matter of debate.
Some studies of the effect of OCR errors [
Lopresti 2008]
[
Walker et al. 2010]
[
Stein et al. 2006] have conducted comparisons by analysing two corpora,
identical except for corrections of individual words. Our study was distinct in
two respects. First, it analyzed the effect of OCR corrections on corpora at the
word level, and it removed duplicate, irrelevant and very poorly scanned text.
We then added genre metadata and verified newspaper and date metadata. From the
point of view of a scientific experiment about the effect of OCR errors, these
extra steps may be considered “confounding variables”. For projects
such as ours, however, these corpus preparation steps are necessary, since
questions of content as well as subtleties of word use are both critical.
Second, rather than conducting “canonical” tests, such as document
classification tasks through supervised machine learning, we selected key word
analysis with log likelihood ratio significance testing. These decisions
situated our text analysis in a real-world digital humanities workflow.
The accuracy of character recognition at the word level is especially significant
in projects that involve the interpretation of sentiment [
Wiebe 2005]. Words that appear rarely, as opposed to ones that
appear most frequently, tend to convey deep meaning, particularly words
associated with intense emotions, such as anger or disgust [
Strapparava and Mihalcea 2008]. Because we attempted to determine the Walworth
case’s meanings for contemporaries, including their moral judgments of the
principals, we considered our initial scanning error rate of 20% to be
unacceptable. This assessment led us to invest the time required to clean the
text manually after the OCR process by correcting errors at the character level
as well as removing duplicate and irrelevant text. Additionally, because we
expected that opinion pieces such as editorials and letters to the editor would
provide the clearest indication of public perception of the Walworth case, we
added genre metadata to the corpus as a supplement to the cleaning process. We
then conducted the log likelihood ratio comparison of word frequency across two
phases of the case’s reportage, both to analyze the impact of the cleaning
process and the addition of genre metadata, and to test our historical
hypothesis through text mining.
Correcting OCR-induced Errors and Adding Genre Metadata
This section discusses the strategies we undertook to reach a level of accuracy
comparable to that achieved in benchmark historical newspaper text mining
projects. It also explains how and why we added genre-based metadata before we
performed analysis using log likelihood ratio [
Dunning 1993].
The accuracy of OCR scans can be measured at both the character and the word
level, by dividing the number of correct units by the total number of units [
Rice et al. 1993]. Calculating such
accuracy involves hand-labelling all characters and words with their correct
values and is very time consuming, however. To avoid this evaluation step, a
word accuracy approximation can be measured as a proportion of words appearing
in a standard dictionary.
[9] This approach does not
consider two opposing factors: those words which are correct but not in the
dictionary, and those that are incorrect but in the dictionary. Despite this
limitation, a reliable indication of the digitized text’s accuracy is possible.
Because the coverage of the Walworth case included proper names and archaic
terminology, it was unrealistic to anticipate 100 per cent accuracy. Words that
were split or joined through OCR errors were another confounding factor in this
estimation of accuracy. For example, in one instance the word “prosecution”
was split into two words (“prosec” and “ution”) by the OCR scan, while
in another case the words “was severely” were merged into one garbled word,
“wasAevertyy”. To assess this effect, we compared the average word
length of the uncorrected (5.84 letters) and corrected (5.68 letters) texts; the
difference of approximately 3% we deemed small enough to ignore for the
purposes of our study.
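A minimal sketch of the dictionary-based approximation appears below; the tokenisation and the toy word list are illustrative stand-ins rather than the exact resources we used.

    import re

    def approximate_word_accuracy(text: str, dictionary: set) -> float:
        """Proportion of tokens found in a dictionary. Words that are correct but
        absent (proper names, archaic terms) and words that are incorrect but
        present are both miscounted, as discussed above."""
        tokens = re.findall(r"[A-Za-z']+", text.lower())
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in dictionary) / len(tokens)

    # Toy dictionary; in practice a full word list is loaded from disk.
    dictionary = {"the", "prosecution", "was", "severely", "criticised"}
    print(approximate_word_accuracy("The prosec ution was wasAevertyy criticised", dictionary))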
Table 1 shows the approximate word accuracy
calculated according to this method, both before correction and after the
corrections, which we describe in the remainder of this section. The
pre-correction accuracy was comparable to the 78% achieved for the British
Library’s 19th Century Online Newspaper Archive [
Tanner et al. 2009]. The post-correction accuracy is near the target of
98% used by the National Library of Australia Newspaper Digitization
Program
[
Holley 2009].
|          | Words  | Words in dictionary | Words not in dictionary | Approximate Word Accuracy |
| Original | 478762 | 391384              | 87378                   | 81.7%                     |
| Clean    | 345181 | 336779              | 8402                    | 97.6%                     |
Table 1.
Effect of post-OCR correction on accuracy.
A process of manual correction was undertaken to remove the errors generated
through OCR, because we sought a clearer signal in the analysis of the texts.
Working with “noise”, whether induced by OCR or from other sources such as
spelling variations or language variants used on social media, is common in the
fields of text mining and corpus linguistics [
Knoblock 2007].[10] However, historical interpretation relies on data
sufficiently clean to boost
the credibility of the analysis at this scale. Automated techniques were used in
a limited way, but to achieve results at the high standard desired, we
determined that manual correction was essential.
Manual correction offered the benefit of removing duplicate and irrelevant
sections of text; in addition, it allowed us to add document-level metadata
tags, which is a critical step in complex text analysis. For practical reasons a
single corrector was used, but to achieve even greater accuracy, multiple
correctors could be used and their results compared.
The post-OCR correction process entailed five steps:
- Simple automatic corrections were made. These included: the removal of
hyphens at line breaks, which are mostly a product of words appearing across
lines; correcting some simple errors (such as “thb” → “the”); and
the correction of principal names in the text, such as “Walworth” or
“Mansfield”. Full stops not marking the end of sentences were also
removed to permit the documents to be broken into semantically meaningful
chunks using the full stop delimiter.
- Articles with an approximate word accuracy below a threshold of 80% were
generally discarded to speed the correction process; however, articles
hand-selected for their rich content were retained despite falling below the
threshold.
- The text was corrected by hand, comparing the original image file and the
post-OCR text version of the same articles.
- Duplicate and irrelevant text was removed.
- Metadata tags for article genre were added, broken into four categories:
“editorial”; “incidental reportage”; “trial proceedings”;
and “letter to the editor”. Previously added tags for the name of the
newspaper and date were also verified and corrected where required.
Automated correction using search and replace with regular expressions was
necessarily limited to avoid introducing new errors, since we considered a
garbled word preferable to a “correction” leading to a wrong
word. We anticipated that the clear patterns in the observed errors would lend
themselves to more sophisticated correction processes using supervised machine
learning techniques. However, this application proved beyond the scope of this
project.
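A rough sketch of the kind of conservative, pattern-based pass described in the first step above follows; the substitution table is illustrative only, since in practice every pattern was vetted so that a garbled word was never “corrected” into a different real word.

    import re

    # Illustrative substitutions only; each pattern must be unambiguous before use.
    SIMPLE_FIXES = {
        r"\bthb\b": "the",
        r"\bW[ao][lrm]w[ao]rth\b": "Walworth",   # conservative pattern for the surname
    }

    def automatic_corrections(text: str) -> str:
        # Re-join words hyphenated across line breaks: "prose-\ncution" -> "prosecution".
        text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
        # Apply the vetted one-to-one substitutions.
        for pattern, replacement in SIMPLE_FIXES.items():
            text = re.sub(pattern, replacement, text)
        # Drop full stops that do not end sentences (a real pass needs a vetted
        # abbreviation list; this line only illustrates the idea).
        text = re.sub(r"\b(Mr|Mrs|Dr)\.", r"\1", text)
        return text

    print(automatic_corrections("Mr. Wolwarth faced thb prose-\ncution."))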
Given the modest size of the corpus and the research funding available, it was
feasible to hand-tag genre, delivering accuracy benefits over automated
approaches. Although we considered automatic inclusion of metadata (for example,
within the TEI standard) as well as automatic part-of-speech tagging (valuable
for tasks such as document classification), we determined that plain text plus
article-level genre/date/newspaper metadata was sufficient for keyword analysis
in our project.
Manual correction is inescapably a time-consuming process, although it does offer
collaborative benefits, since it involves all team members in the close
examination of texts. Some projects opt to offshore OCR correction, but ethical
considerations concerning the exploitation of foreign labour as well as quality
control concerns ruled out this option in our study. The efficiency of inputting
corrections was improved by using spelling and grammar error highlighting in
Microsoft Word. This phase took approximately one hundred hours, at an average
rate of 57.5 words per minute, which is comparable to that of an efficient
typist. Although this procedure was efficient for moderately corrupted text, and
easier to sustain over long work sessions, highly inaccurate scans rendered
typing from scratch necessary, as it was quicker than correcting the garbled OCR
output. When added to the lengthy OCR scanning process, the labour required to
correct scanned text does raise the question of whether OCR is the most
efficient way to digitize a medium-size corpus of historical newspapers to a
high degree of accuracy.
As well as typing from original scans, we transcribed texts using a voice
recognition program (Dragon Naturally Speaking 12), another option
for the correction and input process. Typing was predominantly used, since it
tends to be quicker than dictation when making corrections.
inputting longer sections from scratch, dictation was slightly faster and more
convenient to use. However, it tends to fail “silently”, in
that it substitutes unrecognized words with other words, which a spell-checker
cannot detect. Typographical errors, on the other hand, are more likely to form
non-words that spell checkers can identify. Dictation is also more likely to
fail on uncommon words and proper nouns, precisely those words which
the study is most interested in identifying.
In summary, while OCR achieves relatively accurate results (around 80%) on
historical newspaper collections such as the one used in this study, manual
correction is required to achieve high accuracy (around 98%). Depending on the
corpus size and the resources at hand, this two-step process may be no more
efficient than directly inputting the original texts from scratch.
Measuring the Effect of Post-OCR Correction Using a Sample Task
Determining the tenor of the Walworth case’s newspaper coverage and testing for
possible shifts over the course of trial was the object of our text mining
analysis, but the methods we selected to do so are relevant to wider debates
over the utility of OCR and post-OCR correction processes. In order to evaluate
changes in the popular assessment of the case we created two subsets of the
digitized corpus: Phase I (news accounts before the introduction of
Mansfield Walworth’s shocking letters), and Phase II (trial
coverage subsequent to the letters' introduction, including Frank
Walworth's conviction and sentence of life in prison).
[11] We investigated which words varied at
statistically significant rates from Phase I to Phase II, particularly those
indicative of the sentiments stirred by the crime and the characters involved.
To undertake this analysis we used a list of “judgment
words”. Through a close reading of the texts and knowledge of common
words used in criminal trial reportage in this period, the historian produced a
preliminary list of words of moral judgment and character assessment, which we
supplemented through the addition of similar words selected with the aid of
topic modelling of the corpus. Finally, we further augmented our list by adding
other forms of the selected words that appeared in the 2011 edition of the
American English Spell Checker Oriented Word Lists.
[12] We chose the
statistical tool log likelihood ratio, since it is designed to measure variation
in the word frequency between two sections of a corpus [
Dunning 1993]. Most importantly, log likelihood ratio discerns
statistically significant word frequency variations which are highly likely to
appear as a result of true properties of the corpora, rather than by chance. By
calculating log likelihood ratio across the two phases of newspaper reportage,
we tested for changes in the popular judgment of the Walworth case; this test
also allowed us to analyze the effectiveness of post-OCR cleaning by comparing
the results of the task performed on the text before and after correction.
[13]
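For a single word, the statistic can be written over the 2×2 contingency table of observed counts O_ij (the word versus all other words, in Phase I versus Phase II) and the corresponding expected counts E_ij under the null hypothesis; the sum runs over all four cells of the table, the point taken up in note 14:

    G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} O_{ij} \ln \frac{O_{ij}}{E_{ij}},
    \qquad E_{ij} = \frac{R_i \, C_j}{N}

where R_i and C_j are the row and column totals of the table, N is the combined token count of the two corpora, and, under the null hypothesis, G^2 approximately follows a chi-squared distribution with one degree of freedom.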
Log likelihood ratio, which identifies meaningful variation in word frequency in
one corpus relative to another [
Dunning 1993], produces a p-value
on the corresponding test statistic, which can be interpreted as the probability
of the observed word frequencies, given the null hypothesis that there is no
difference between the two corpora. For words with a p-value below some
significance level (for example p≤0.05) the null hypothesis may be rejected; in
other words, the difference in word frequency between the two corpora may be
considered statistically significant when the variation is highly unlikely to be
a result of chance. However, it is worth noting that while this holds for any
given word, when a given significance level is used to select a set of words, it
remains likely that at least one of the selected words will appear
significant by chance alone. Multiple hypothesis testing provides a
rubric for managing this phenomenon, for example by reducing the p-value used
for individual words. We chose not to pursue this approach; instead we
qualitatively analyzed the words identified by log likelihood ratio, which helped
us to detect the minority of words incorrectly flagged as significant. Because
we worked as an interdisciplinary team, the historian contributed to this
critical examination of the output of a statistical technique.
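A minimal sketch of the calculation for a single word follows. The word counts echo the “maudlin” example discussed in the Results section, but the corpus sizes are placeholders rather than our actual token totals; the threshold of 3.84 is the chi-squared critical value for one degree of freedom at p ≤ 0.05.

    import math

    def log_likelihood_ratio(k1, n1, k2, n2):
        """Dunning's G2 for a word occurring k1 times among n1 tokens (Phase I)
        and k2 times among n2 tokens (Phase II). The sum runs over all four
        cells of the 2x2 contingency table, as discussed in note 14."""
        observed = [k1, n1 - k1, k2, n2 - k2]
        word_total, other_total, n = k1 + k2, (n1 - k1) + (n2 - k2), n1 + n2
        expected = [word_total * n1 / n, other_total * n1 / n,
                    word_total * n2 / n, other_total * n2 / n]
        return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    # Word counts echoing the "maudlin" example (3 vs 1); corpus sizes are placeholders.
    g2 = log_likelihood_ratio(3, 60000, 1, 90000)
    significant = g2 > 3.84   # chi-squared(1 df) critical value for p <= 0.05; an exact
                              # p-value could be taken from the chi-squared survival
                              # function (e.g. scipy.stats.chi2.sf(g2, 1)).
    print(round(g2, 2), significant)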
Dunning introduced the log likelihood ratio as a tool for word frequency analysis
that would be more robust than the previously prevalent chi-squared test for
small samples of text. It has been used in previous studies comparing corpora,
for example looking at the proceedings of a 19th century British murder trial
[
Archer forthcoming]; the distinctive lexicon used in a
professional environment [
Rayson and Garside 2000]
[14]; historical spelling variations [
Baron 2009]; and the lines of a particular character in a play
[
McIntyre 2010]. We did consider other tests, which have been
proposed as alternatives to the log likelihood ratio. For instance, Fisher’s
exact test [
Moore 2004] calculates exactly what the log likelihood
ratio approximates but requires greater computational resources, while the
Mann-Whitney Ranks test [
Kilgarriff 2001] considers the
distribution of word frequency within a corpus, as does the t-test [
Paquot and Bestgen 2009]. After reviewing these options we decided that log
likelihood ratio was the best option, due to its well-established use in the
comparison of corpora.
As described in Section 4, the correction process was enhanced through the
inclusion of metadata about article genre. The two phases differed in genre mix,
since the reportage from Phase II was dominated by coverage of the Walworth case
trial proceedings. While we were interested in detecting changing public
opinion, differences in genre could possibly have obscured the shift we
anticipated. Where genre metadata was available, we restricted our analysis to
opinion articles about the case, consisting of editorials and letters to the
editor, since this genre is most likely to capture words of interest.
Furthermore, technical judgment words appearing in trial proceedings –
particularly those used by lawyers and the judge in court – indicate legal
constructions that may not have reflected public opinion. This caveat was
another reason for the restriction of the corpus to opinion articles using genre
metadata. As
Figure 3 shows, we compared the
original text without metadata; the original text restricted using genre
metadata; and the cleaned text also restricted using genre metadata.
Pre-processing of the texts was performed to improve the quality of results
returned. The following four steps were taken:
- All text converted to lower case.
- Punctuation and the possessive form “’s” removed.[15]
- Stopwords, such as “the” and “of”, removed using a standard
stopwords list.[16]
- Words identified from the custom-built list of 357 judgment words,
consisting primarily of adjectives, adverbs and abstract nouns.
These steps, as well as the log likelihood ratio calculations, were performed
using several open source tools.
[17] The results of the
experiments are detailed in the following section.
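Strung together, the four pre-processing steps above amount to something like the following sketch; the stopword and judgment-word sets shown are small stand-ins for the NICTA stopword list and our 357-word judgment list.

    import re

    STOPWORDS = {"the", "of", "and", "a", "to"}                  # stand-in for the NICTA list
    JUDGMENT_WORDS = {"maudlin", "insane", "insanity",
                      "honor", "deliberate", "madman"}           # stand-in for the 357-word list

    def preprocess(text: str):
        text = text.lower()                                      # 1. convert to lower case
        text = re.sub(r"[’']s\b", "", text)                      # 2a. strip the possessive 's
        text = re.sub(r"[^\w\s]", " ", text)                     # 2b. strip punctuation
        tokens = [t for t in text.split() if t not in STOPWORDS] # 3. remove stopwords
        judgment = [t for t in tokens if t in JUDGMENT_WORDS]    # 4. flag judgment words
        return tokens, judgment

    print(preprocess("The maudlin sorrow of a drunkard's insanity."))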
Results
The tests we conducted involved comparing newspaper coverage of the Walworth case
over our two periods: Phase I (the crime, the arrest, the coroner’s inquest and
the trial’s opening); and Phase II (subsequent to the letters’ introduction up
to the verdict and sentencing). By using data with and without post-OCR
correction we were able to address our historical question and to evaluate the
effect of this correction.
Figure 4 shows the ten Phase I words that appeared
at most significant frequency, measured by log likelihood ratio, for the three
datasets presented in
Figure 3. The defendant is
the focus of early reportage, with terms such as “young” and “son”
appearing, as well as the negative word “murderer” (considering that his
conviction had not yet occurred). The opinion article genre focuses on the
defendant’s family background, including the word “chancellor” (Judge
Reuben Hyde Walworth, Mansfield’s father), “albany” (the state
capital, where the defendant’s uncle lived) and “literary”, the last of
which referred to Mansfield Walworth’s career as a gothic novelist,
rather than his negative character traits. Even without OCR correction, most of
these words of interest were identified. Although non-words caused by OCR
errors are present in the corpus, none appears among these top few words. Without
metadata, the words tend to focus on the minutiae of the murder scene, such as
“stairs”, “body” and “door”, rather than on more substantive
issues of character. This points to the need for historical researchers to
consider adding genre metadata prior to calculating log likelihood ratios.
The same approach for Phase II yielded primarily legal terms, which disclosed
little about changing opinion. Therefore, we show in Figure 5 those words from
our judgment word list that occurred more frequently in this second trial period
at a statistically significant rate (using a significance level of p≤0.05). The
words “insanity” and “insane” may refer to both Frank and Mansfield,
since the defence suggested that the son may have suffered from a form of
madness inherited from his disturbed father. The terms “threats” and
“madman” reflect a new focus on the condemnation of Mansfield, although
there is some possibility “madman” could also refer to Frank. The words
“deliberation” and “deliberate” may negatively describe Frank’s
actions, but they may equally be procedural legal terms relating to the jury.
The differences between the datasets are less pronounced in this experiment,
aside from the fact that the original dataset contained more statistically
significant words, since it includes substantially more words overall (see
Table 1 for details). Some of these words suggest a
condemnation of Mansfield (“demon”) and potential approval of Frank
(“honor”), since he claimed he had killed his father to protect his
mother. This pattern suggests that using judgment words may be an alternative to
adding genre metadata, since these words implicitly refer to genre.
Fortunately, ambiguities such as whether “insanity” refers to
Frank or Mansfield, or whether “deliberate” refers to Frank’s shooting, the
judge or the jury, may be resolved using more sophisticated techniques. For
example, words may be matched to characters in the case through sentence blocks,
word proximity, or full-scale parsing for semantic structure. Such techniques
typically require very high quality text to be effective. For words with
relatively low frequencies, manually investigating the contexts of occurrences
can also be used.
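A simple keyword-in-context listing can assist such manual investigation; the sketch below is a generic illustration rather than the tool we used.

    import re

    def keyword_in_context(text: str, keyword: str, window: int = 5):
        """List every occurrence of keyword with `window` words of context on each side."""
        tokens = re.findall(r"\w+", text.lower())
        hits = []
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append(f"... {left} [{keyword}] {right} ...")
        return hits

    for line in keyword_in_context(
            "We protest in advance against such resort to maudlin sentimentality",
            "maudlin"):
        print(line)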
|                        | Significant Words | Precision | Recall | Significant Judgment Words | Judgment Words Precision | Judgment Words Recall |
| Original with Metadata | 751               | 0.48      | 0.66   | 23                         | 0.65                     | 0.93                  |
| Original               | 2852              | 0.12      | 0.61   | 38                         | 0.34                     | 0.81                  |
Table 2.
Performance of Original with Metadata and Original Datasets compared
to Clean with Metadata dataset. The Clean with Metadata dataset returned
545 significant words of which 16 were judgment words. A significance
level of p≤0.05 was used.
Table 2 shows the precision and recall of the
results of words with a significance level of p≤0.05 for the original with genre
metadata and original datasets compared to the “gold standard”, that is,
the clean with metadata dataset. Recall refers to the proportion of words
significant in the clean with metadata dataset, which are also significant in
the original (with or without metadata) dataset. Precision is the proportion of
words significant in the original (with or without metadata) dataset, which are
also significant in the clean with metadata dataset. This evaluation methodology
allowed us to drill down deeper than we could by using the small shortlist of
words shown in
Figure 4 and
Figure 5, and it revealed a strong discrepancy
between the datasets.
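In set terms, treating the clean with metadata results as the gold standard, the two measures reduce to the following; the word sets shown are toy examples, not our actual results.

    def precision_recall(candidate: set, gold: set):
        """Precision: share of the candidate's significant words confirmed by the
        gold standard. Recall: share of the gold standard's significant words that
        the candidate dataset also finds."""
        true_positives = candidate & gold
        precision = len(true_positives) / len(candidate) if candidate else 0.0
        recall = len(true_positives) / len(gold) if gold else 0.0
        return precision, recall

    gold = {"insanity", "madman", "threats", "honor"}        # toy gold-standard set
    candidate = {"insanity", "madman", "ihe", "ol"}          # toy set including OCR non-words
    print(precision_recall(candidate, gold))                 # (0.5, 0.5)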
The precision scores of 0.48 and 0.12 indicate that many words were incorrectly
identified as significant, while the recall scores of 0.66 and 0.61 suggest that
a substantial portion of significant words was missed. The original dataset
approached the original with metadata dataset on recall, but it had much lower
precision, indicating that it returned many results with limited usefulness.
Some non-words induced by OCR error appear at a statistically significant level
in the original and original with metadata lists, such as “ol” (instead of
“of”) and “ihe” (instead of “the”). Using the judgment word
list, the precision scores of 0.65 and 0.34 suggest, again, that many “false
positive” words were identified, with the problem magnified without
adding in genre metadata. The recall results of 0.93 and 0.81 were stronger for
the judgment word list, however, which is of interest given the emotive nature
of the case and its coverage. Overall, there was a substantial discrepancy in
the words identified in the original datasets, with and without metadata,
compared to the clean with metadata dataset. This confirms that OCR errors can,
indeed, influence later analysis of this nature.
It is worth examining in detail one example in which an OCR error produced a
judgment word found to be significant using the original with metadata dataset,
but not by using the clean with metadata dataset. In the original with metadata
dataset, the word “maudlin” was identified as occurring significantly more
in Phase I, with a frequency of 3 compared to 0 in Phase II. However, there was
one instance of “maudlin” occurring in Phase II in the clean with metadata
dataset which was missed due to an OCR error – swapping “maudlin” for
“inaudlin”. In the clean with metadata dataset the frequency counts of
3 for Phase I versus 1 for Phase II were not significantly different. While
these frequencies may seem low, relative scarcity does not indicate low
significance. In fact, our test indicated the opposite to be the case.
The contexts of “maudlin” appear in
Table 3,
which indicates that the uses of the term in Phase I occurred in the context of
disapproval of Frank’s parricidal motive and cool demeanor. The use of the term
in Phase II was different, we discovered, because it referred to the state of
mind of another murderer in an earlier trial, in which a plea of insanity had
been successful. This shows that further work is required to identify the
implications of word use based on their contexts. Indeed, it seems that there
was a suggestive change in the frequency of “maudlin” between Phase I and
Phase II.
| Newspaper                           | Date       | Phase | Correct in Original? | Context |
| NY Tribune                          | 1873-06-04 | I     | Yes                  | “We protest in advance against such resort to maudlin sentimentality” |
| NY Tribune                          | 1873-06-05 | I     | Yes                  | “There’s a [sic] something indefinable about this maudlin sentimentalism that throws a glow of heroism round the murder” |
| The Saratogian (quoting NY Tribune) | 1873-06-12 | I     | Yes                  | “We protest in advance against such resort to maudlin sentimentality” |
| NY Tribune                          | 1873-07-04 | II    | No                   | “the maudlin sorrow of a drunkard” (referring to another case where insanity was successfully pled, an outcome the author critiques) |
Table 3.
Contexts of the judgment word “maudlin”.
Overall, the cleaning of the data was not essential to achieving results of
interest on the two-phase comparison task, since many significant words could
still be identified. Still, there were substantial differences between the
results of the clean and original datasets, as significant words were missed and
“false positives” were generated. The addition of genre metadata
permitted a focus on opinion articles and thereby a more meaningful set of
significant words, something that was not possible with the original dataset. Our list of
judgment words likely performed a similar filtering function to the genre
metadata, though it lacks the flexibility to detect unexpected words.
Future Work
Moving beyond the analysis presented in this paper, it would be desirable to
identify which words refer to which characters in the case. With the clean
corpus we now have at our disposal, this identification could be achieved
through the automated tagging of syntactic metadata. This analysis would allow
us to track public opinion at the level of the individual with greater
precision. Words which may refer to multiple individuals may be disambiguated
using techniques such as sentence blocks, word proximity and semantic parsing.
It is expected that this more complex task would show greater differences
between the raw OCR output and the corrected text, since it depends on the
presence of grammatically well-formed sentences rather than word counts alone.
An alternative approach that may be useful in similar projects would involve
identifying which terms are distinctive “hallmarks” of particular
subcorpora (for instance, selected on the basis of date or news source). A
feature selection metric such as mutual information could be used to identify
which terms are most predictive in classifying documents as belonging to
particular subcorpora. Turning from supervised to unsupervised learning, our
team anticipates producing results based on topic modelling, a common strategy
in the digital humanities which has been applied to US historical newspapers
[
Newman and Block 2006], [
Yang et al. 2011], and [
Nelson 2012]. The goal of such projects is to find topics in large
volumes of newspaper reportage, and to track changes as indices of shifts in
public discourse. The effect of OCR errors on such topic models is also an
active subject of research, and our results suggest that scholars consider this
issue thoroughly before undertaking large-scale projects [
Walker et al. 2010].
The broader ambition of this research project is to situate the Walworth case in
its wider historical context. Can it be shown through text mining that
prevailing understandings of masculine honour, morality and family values were
challenged by this dramatic incident? In future work we will compare our
digitized collection with larger newspaper corpora. Google n-grams
[18] is a common and accessible choice for researchers, but its contents are
different from our corpus in both format (the full text of its sources is
unavailable) and in genre (it covers non-fiction, arcane technical writings and
literary works). A more promising collection is Gale’s Nineteenth Century
Newspaper collection.
[19] If it becomes fully searchable it will provide a vast dataset from which
subsets of texts (such as editorials on domestic homicide) can be selected to
evaluate the distinct and shared features of the Walworth case’s coverage.
Nevertheless, there are reasons for caution. In large corpora such as these,
OCR-induced errors will remain an issue, since hand correction of texts on a vast
scale is infeasible.
Conclusion
Digital humanities scholars have been drawn to text mining as a technique well
suited to the analysis of historical newspapers, since it allows for meaning to
be drawn from volumes of text that would be unmanageable for an individual
researcher to absorb and analyze. It provides a tool that can test hypotheses
generated through traditional historical analysis, and ideally, generate new
possibilities for study that could not have been generated through close reading
alone. However, the digitization of historical texts is a complex and
time-consuming process which is worthy of consideration in itself. Through the
example of the Walworth murder case’s newspaper coverage, this paper has
outlined the two-step digitization process our team undertook: first, performing
OCR scans from original newspapers and image files; and second, cleaning and
post-processing to ensure that all text included is accurate, relevant, and
labelled with genre metadata. We have provided an original, detailed methodology
for conducting digitization of a medium-size corpus.
OCR, we determined, is effective in digitising historical newspapers to roughly
80% accuracy. However, to achieve high levels of accuracy (around 98%), the
labour-intensive cleaning required to remove OCR errors means the two-step
process may be no more efficient than manually inputting texts from scratch, a
procedure that suits small- to medium-scale projects. While our research
involved both scanning and cleaning texts, historical researchers more commonly
perform keyword searches on existing databases of historical documents to
conduct text mining analysis [
Hitchcock 2013]. Importantly, our
study shows that the result set may contain OCR errors, irrelevant and duplicate
content; similarly, insufficient metadata can generate spurious results that are
difficult to detect. The method we propose for the cleaning process, as well as
our appraisal of the value of this step, signals a way forward to overcome
this problem.
The value of correcting OCR output from around 80% accuracy to near 100% is an
important consideration for researchers, in view of the labour-intensive process
required. We demonstrated this empirically by performing a sample task of
interest on both the clean and original versions of the corpora. This task
involved finding words, including those from a list of pertinent judgment words,
which changed in frequency across two phases of the case’s reportage. Log
likelihood ratio was used as a test for statistical significance. With the
uncorrected OCR output it was possible to identify words appearing significantly
more frequently in one time period relative to another, but a substantial
proportion were missed and “false positives” were introduced. The cleaning
was thus desirable but not essential. The addition of genre metadata led to
results of greater interest, since it allowed a focus on articles more clearly
relevant to the research question. This paper is unique in situating OCR error
correction in a digitization workflow also involving content selection,
document-level metadata enhancement and practical time and cost constraints,
thereby evaluating the text cleaning phase holistically.
Like many digital humanities projects, this study underlines the value of input
from researchers across the disciplines of history and computer science to
design the project, select the methodology, implement the tasks and interpret
the results [
Ayers 2013]; [
Nelson 2012]. Without
this combination of skills and expertise, as well as facilitative research
funding, such studies are unfeasible. Our team’s scientific expertise allowed us
to customize software for text mining analysis rather than using off-the-shelf
solutions, which gave us full control over the integrity of the tools used,
while the historian posed the research question and critically examined test
results against the initial close reading of the case. This collaborative,
interdisciplinary model will continue to be critical to foster robust research
in the field of digital humanities.
Notes
[2]
Research funding for this project was provided by the Australian
Research Council.
[4] The 10 public and private institutions
are Improving Access to Text; Australian Newspapers Digitisation Program;
The Text Creation Partnership; British Newspapers 1800-1900; Early English
Books Online; American National Digital Newspaper Program; Project
Gutenberg; Universal Digital Library Million Book Collection; and Gale
Eighteenth Century Collections Online.
[5] Other commercial software that was evaluated,
but not found to be as suitable as ABBYY, includes ExperVision OCR,
Vividata, VelOCRaptor, Presto! OCR, OmniPage, Olive, and Prizmo. Open
source software evaluated for its suitability included OCRopus,
hocr-tools, isri-ocr-evaluation-tools, Tesseract, and GOCR. For a
comprehensive list of OCR software see http://en.wikipedia.org/wiki/List_of_optical_character_recognition_software.
[6] We are also grateful for advice provided by the
Digitisation Facility at the National Centre of Biography, Australian
National University (http://ncb.anu.edu.au/scanner).
[7] Since Lightroom cannot import PDFs, all files were sorted and processed in
the one application. Only the high quality PDFs of the New York Times were “clean” enough to be OCRed directly
from the downloaded PDF, so did not require processing through Lightroom.
[8] One unfortunate downside to this workflow is that
Lightroom cannot export greyscale images, so each 20MB TIFF could only be
exported as a 110MB TIFF.
[11]
Frank Walworth was pardoned four years after his conviction,
but this twist to the story attracted little attention from the press [Strange 2010].
[12] For information on the
Spell Checker Oriented Word List see http://wordlist.sourceforge.net/. The list, or
“dictionary”, is a concatenation of word lists
compiled for use in spell checkers. We are grateful for Loretta
Auville’s advice on this aspect of our study.
[13] A first principles approach to this question is also possible, but due to
the mathematical complexity of incorporating OCR errors into the calculations
used to find significant words with log likelihood ratio, we used an empirical
approach in this paper.
[14] Apparently by accident, this study uses an incorrect variant of the log
likelihood ratio. In the second equation the authors present on page 3, the
sum should run over all four cells of the contingency table (rather than
just those in the top row), and the observed and expected values for each of
these should be calculated. With a large corpus size relative to word
frequency, the ratio of observed to expected values for the bottom row cells
will be approximately 1 and hence the contribution of these cells will be
negligible. However, with a small corpus size relative to word frequency
these cells make a substantial contribution and should not be ignored.
Several open-source tools including Meandre (http://seasr.org/meandre), which
we used in this study, repeat this error. We modified the source code in
order to use the log likelihood ratio as it originally appeared in [Dunning 1993]. This confirms the need for humanities scholars
to work with experts in computer science and digital humanities, to ensure a
deep understanding of statistical techniques, rather than rely on
off-the-shelf tools which may occasionally have inaccuracies in their
implementation.
[15] A more
thorough stemming or lemmatisation approach was not performed but may be
useful in future.
[16] An in-house stopword list from NICTA (National
ICT Australia) was used.
Works Cited
Archer forthcoming .
Archer, Dawn. “Tracing the crime narratives within the
Palmer Trial (1856): From the lawyer’s opening speeches to the judge’s
summing up.”
Arlitsch 2004 Arlitsch, Kenning, and John Herbert. “Microfilm,
paper, and OCR: issues in newspaper
digitization”. Microform and Imaging Review 33: 2 (2004), pp. 58-67.
Ayers 2013 Ayers, Edward L. “Does Digital Scholarship have a
Future?”. EDUCAUSE Review 48: 4 (2013), pp. 24-34.
Baron 2009 Baron, Alistair, Paul Rayson and Dawn Archer. “Word
frequency and key word statistics in corpus
linguistics”. Anglistik 20: 1 (2009), pp. 41-67.
Dunning 1993 Dunning, Ted. “Accurate Methods for the Statistics of Surprise and Coincidence”. Computational Linguistics 19: 1 (1993), pp. 61-74.
Eder 2013 Eder, Maciej. “Mind your Corpus: Systematic Errors in
Authorship Attribution”. Literary and Linguistic Computing (2013).
Hitchcock 2013 Hitchcock, Tim. “Confronting the Digital, or How
Academic History Writing Lost the Plot”. Cultural and Social History 10: 1 (2013).
Holley 2009 Holley, Rose. “How Good Can It Get? Analysing and
Improving OCR Accuracy in Large Scale Historic Newspaper
Digitization Programs”. D-Lib Magazine 15: 3-4 (2009).
Kagan 2009 Kagan, Jerome. The Three Cultures: Natural Sciences, Social Sciences, and the Humanities in the 21st Century. Cambridge: Cambridge University Press, 2009.
Kilgarriff 2001 Kilgarriff, A. “Comparing Corpora”. International Journal of Corpus Linguistics 6 (2001), pp. 97-133.
Knoblock 2007 Knoblock, Craig, Daniel Lopresti, Shourya Roy and Venkata Subramaniam, eds. “Special Issue on
Noisy Text Analytics”. International Journal on Document Analysis and
Recognition 10: 3-4 (2007).
Lopresti 2008 Lopresti, Daniel. “Optical Character Recognition
Errors and their Effects on Natural Language
Processing”. Presented at The Second Workshop on Analytics for Noisy Unstructured
Text Data, sponsored by ACM (2008).
McIntyre 2010 McIntyre, Dan, and Dawn Archer. “A corpus-based
approach to mind style”. Journal of Literary
Semantics 39: 2 (2010), pp. 167-182.
Moore 2004 Moore, Robert C. “On log-likelihood-ratios and the
significance of rare events”. Presented at The 2004 Conference on Empirical Methods
in Natural Language Processing (2004).
Newman and Block 2006 Newman, David J., and Sharon Block. “Probabilistic topic decomposition of an
eighteenth‐century American newspaper”. Journal of the American Society for
Information Science and Technology 57: 6 (2006), pp. 753-767.
O'Brien 2010 O'Brien, Geoffrey. The Fall of the House of
Walworth: A Tale of Madness and Murder in Gilded Age
America. New York: Henry Holt and Company, 2010.
Paquot and Bestgen 2009 Paquot, Magali, and Yves Bestgen. “Distinctive words in academic writing: A comparison of
three statistical tests for keyword
extraction”. Language and
Computers 68: 1 (2009), pp. 247-269.
Powell 2004 Powell, Kerry. The Cambridge Companion to
Victorian and Edwardian Theatre. Cambridge: Cambridge University Press, 2004.
Rayson and Garside 2000 Rayson, Paul, and Roger Garside. “Comparing corpora using frequency
profiling”. Presented at Workshop on Comparing Corpora, sponsored by Association for Computational Linguistics (2000).
Rice et al. 1993 Rice, Stephen V., Junichi Kanai and Thomas A. Nartker. An Evaluation of OCR
Accuracy. Information Science Research Institute, 1993.
Stein et al. 2006 Stein, Sterling Stuart, Shlomo Argamon and Ophir Frieder. “The effect of OCR errors on stylistic text
classification”. Presented at The 29th annual international ACM SIGIR
conference on Research and development in information
retrieval, sponsored by ACM (2006).
Strange 2010 Strange, Carolyn. “The Unwritten Law of Executive
Justice: Pardoning Patricide in Reconstruction-Era New
York”. Law and History
Review 28: 4 (2010), pp. 891-930.
Strapparava and Mihalcea 2008 Strapparava, Carlo, and Rada Mihalcea. “Learning to Identify Emotions in Texts”. Presented at The 2008 ACM
symposium on Applied computing, sponsored by ACM (2008).
Tanner et al. 2009 Tanner, Simon, Trevor Muñoz and Pich Hemy Ros. “Measuring Mass Text Digitization Quality and
Usefulness”. D-Lib
Magazine 15: 7-8 (2009).
Walker et al. 2010 Walker, Daniel D., William B. Lund and Eric K. Ringger. “Evaluating Models of Latent Document
Semantics in the Presence of OCR Errors”. Presented at The 2010 Conference on
Empirical Methods in Natural Language
Processing, sponsored by Association for Computational Linguistics (2010).
Wiebe 2005 Wiebe, Janyce, Theresa Wilson and Claire Cardie. “Annotating Expressions of Opinions and Emotions in
Language”. Language
Resources and Evaluation 39: 2-3 (2005), pp. 165-210.
Williams 2011 Williams, Jeffrey J. “The Statistical Turn in
Literary Studies”. The
Chronicle Review 57: 18 (2011), pp. B14-B15.
Yang et al. 2011 Yang, Tze-I, Andrew J. Torget and Rada Mihalcea. “Topic modelling on historical newspapers”. Presented at The 5th ACL-HLT Workshop on Language Technology for
Cultural Heritage, Social Sciences, and
Humanities, sponsored by Association for Computational Linguistics (2011).