Jerry Bonnell is a senior PhD student in Computer Science at the University of Miami. His research focuses on data mining, machine learning, natural language processing, and digital humanities. His dissertation investigates domain adaptation for large historical corpora, with specific emphasis on modern historical Japanese text written during the Meiji and Taisho periods (1895-1925). Bonnell has given multiple talks on this work at the annual Japanese Association for Digital Humanities (JADH) conference. The tools he has developed have been made available online to the DH community.
Mitsunori Ogihara is a Professor of Computer Science at the University of Miami. Ogihara received his Ph.D. in Information Sciences from the Tokyo Institute of Technology in 1993. From 1994 to 2007, he was a faculty member in the Department of Computer Science at the University of Rochester, where he served as Department Chair from 1999 to 2007. At the University of Miami, Ogihara serves as the Director of Education and Workforce Development at the Institute for Data Science and Computing. Ogihara is an author, co-author, or co-editor of four books to date and has published more than 200 peer-reviewed articles covering a wide range of research areas, including computational complexity theory, music information retrieval, data mining, bioinformatics, artificial intelligence, and digital humanities. He serves as Editor-in-Chief of Theory of Computing Systems (Springer) and is on the editorial board of the International Journal of Foundations of Computer Science (World Scientific).
Historical materials are an indispensable resource for many scholarly workflows in the Digital Humanities. These workflows can benefit from the application of natural language processing (NLP) pipelines that offer support for tokenization, tagging, lemmatization, and dependency parsing. However, the application of these tools is not trivial, as they are typically trained on contemporary rather than historical materials. This paper introduces a rule-based workflow that can improve the function of pre-trained NLP tools on historical text.
Scholarly workflows in the Digital Humanities can benefit from the application of natural language processing (NLP) tools. However, these tools are typically trained on contemporary materials (e.g., newswire articles, microblog text, and general books), making any direct application to historical corpora prone to error.
Motivated by this issue, the present paper aims to address the following questions with respect to modern historical Japanese corpora: (1) can accurate UD annotations be developed from scratch using pre-trained tools while minimizing the amount of manual effort needed for correction, (2) can the generated annotations be used as model training data to achieve improved accuracy on a fundamental NLP task, e.g., word segmentation, (3) does the trained model adapted to historical materials have a substantial effect on the output parsings produced when compared to pre-trained tools, and (4) can the proposed workflow be carried out by non-experts in UD? The answers should be encouraging to DH scholars who work with historical corpora and are not subject experts in UD, but would like to make more frequent use of linguistic metadata in their scholarship.
To attempt an answer, this paper introduces a rule-based workflow for modern historical Japanese corpora that produces more accurate UD annotations directly from the raw text of the corpus, using the output of pre-trained tools as a starting point. The workflow consists of a rule-development phase, a text-normalization phase that applies the developed rules, and a parsing-and-alignment phase that maps the resulting annotations back to the historical word forms.
In this way, any parser that learns its model from this data also learns how to deal with historical text.
Prior work has used core data manually corrected by experts to predict sentence boundaries on unlabeled data. Our approach also corrects UD annotations, but aims to do so automatically and does not use any gold-standard or already corrected data as a starting point.
Text normalization is a well-known preprocessing problem in NLP with many proposed solutions, including several developed specifically for Japanese text normalization. We follow this line of work in treating normalization as a preprocessing step.
To evaluate our methods, we also incorporate three other magazines made available by NINJAL in its corpora of Modern Japanese: Josei (女性) (1894-1925), Meiroku Zasshi (明六雑誌) (1874-1875), and Kokumin (國民) (1887-1888). Documents in these corpora are classified as either colloquial or formal.
Because the goal of this work is to verify whether a rule-based approach can bring any improvement in UD annotations for historical text, the incorporation of colloquial works would make the efficacy of the approach more difficult to gauge. Therefore, we make exclusive use of formal materials in all experiments presented here. Nevertheless, the rules constructed here (e.g., character substitution rules) are applicable to the colloquial portion, and new rules can always be generated by manual analysis of the colloquial texts; these would not conflict with the current rule set developed from the formal texts.
GiNZA is a recent open-source NLP framework that is advertised as an easy one-stop solution providing tokenization, part-of-speech tagging, and dependency analysis for Japanese text, and it enjoys popularity among NLP researchers.
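To illustrate, the following minimal sketch shows how GiNZA can be invoked through its spaCy interface; the installation step and the ja_ginza model name follow GiNZA's documentation, and the example sentence is a generic placeholder rather than one drawn from the corpus.

```python
# Minimal sketch of parsing a sentence with GiNZA via spaCy.
# Assumes the ginza and ja_ginza packages are installed
# (e.g., pip install ginza ja_ginza).
import spacy

nlp = spacy.load("ja_ginza")
doc = nlp("日本語の文を解析する。")  # placeholder sentence

for sent in doc.sents:
    for token in sent:
        # columns roughly mirror the CoNLL-U ID, FORM, LEMMA, UPOS, HEAD, DEPREL fields
        head = 0 if token.head is token else token.head.i + 1
        print(token.i + 1, token.text, token.lemma_, token.pos_, head, token.dep_)
```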
We present the workflow used by our research to produce improved UD metadata directly from the raw sentences of the Taiyo corpus. Three steps organize the work: (1) a development phase where a set of handcrafted rules is generated to normalize portions of the historical text, (2) a text normalization phase that then applies the rule set to fulfill the needed transformation, and (3) application of GiNZA to the normalized text followed by an alignment step that assigns the UD metadata generated to word forms from the historical text. These steps are realized by means of a Python script, which we have made available through GitHub at https://github.com/jerrybonnell/Rules2UD. We have also provided a Binder link through this repository that launches a live Jupyter notebook in an executable environment so that users can interact with the tool without needing to install any packages on their machine.
Before proceeding to describe our workflow in detail, we offer a brief overview of the dichotomy between symbolic and non-symbolic approaches, and how both are combined into a single workflow in the proposed work. The developed collection of rules used for producing normalized text can be viewed as an expert system that applies domain knowledge directly to decision-making tasks. A salient aspect of this system is that rule application is a symbolic transformation: a sequence of symbols is substituted with another sequence, where the symbols compose written language as ASCII characters (e.g., a, z, +), UTF-8 Unicode characters (e.g., あ, タ, 勉), or numerals (e.g., 0, 2, 3), and the representations before and after the substitution support human comprehension. In contrast, non-symbolic systems in the form of pretrained language models like GiNZA transform raw text into an internal numerical representation (e.g., word embeddings, contextual word embeddings, etc.) by means of deep learning that, while necessary for their computation and effective for achieving state-of-the-art performance across NLP benchmarks, obstructs any meaningful interpretation under current methods.
However, if the expert system is allowed to carry out its work as a separate preprocessing step and the language model follows with its own independent computation, then the language model can take advantage of any symbolic transformations made by the expert system when receiving its input, thereby guiding its own computation. While the representations used during that computation are no longer symbolic, the output returns to a representation that is symbolic, which can again be used by the expert system for further postprocessing to complete the required annotation. By keeping these interactions indirect, the two systems can inform one another effectively. This approach forms the basis for the workflow presented here.
We define a rule as a mapping between two word forms: a historical usage and a normalized usage. A collection of rules is a set of these mappings which, in implementation, is a Python dictionary of key-value pairs. We generate rules by manual evaluation of the GiNZA output to identify errors in the parsing that occur primarily because of historical usages. For instance, a significant portion of the rule set consists of character substitutions, mapping historical kanji presently rare in use (called Kyukanji 旧漢字), e.g., 黨, to contemporary usages (called Shinkanji 新漢字), e.g., 党. Also included are substitutions of 「わ行」Hiragana and 「ワ行」Katakana to their modern forms, e.g., changing「ゐ」to「い」. A key feature of this evaluation is that only the FORM field is considered for correction and no review time is given to the HEAD and DEPREL fields that form the dependency tree, reducing the overall amount of manual effort needed.
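As a toy illustration of this data structure, the sketch below expresses a small rule collection as Python dictionaries. Only the 黨 → 党 and ゐ → い mappings come from the examples above; the remaining entries are generic Kyukanji/Shinkanji and historical-kana pairs added for illustration and are not drawn from the project's actual rule set.

```python
# Illustrative rule collection: historical form -> normalized form.
kanji_rules = {
    "黨": "党",   # Kyukanji (旧漢字) -> Shinkanji (新漢字), from the text
    "國": "国",   # additional illustrative substitutions
    "會": "会",
}

kana_rules = {
    "ゐ": "い",   # historical wa-row hiragana -> modern form, from the text
    "ヰ": "イ",   # katakana counterpart, added for illustration
}
```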
The fourth component consists of two characters. The first character of the pair is ordinarily pronounced hatsu, but when combined with the next character, which is pronounced tatsu, the pronunciation changes to its shortened version ha-, thereby yielding hattatsu instead of hatsutatsu.
The sixth component also contains a modification in pronunciation. The first character is normally pronounced shitaga, but when ふ is attached, the pronunciation changes to shitago and the pronunciation of ふ changes from hu to u.
The encoded dependency tree given by the HEAD and DEPREL fields is shown using the CoNLL-U viewer tool; the word 社會 in the example means society.
Because some rules can be general, a global pattern match-and-replace could be
too aggressive and prone to error. To overcome this, the rule set is
partitioned into disjoint sets so that some rules may apply only after a
condition is met. For instance, つた → った may trigger only when the word form つた appears after some kanji. Table 3 shows the different rule sets and their distribution.
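A minimal sketch of one such conditional rule is given below, assuming the condition "appears after some kanji" is expressed as a regular-expression lookbehind over the CJK unified ideograph range; the function name and example sentence are illustrative, not taken from the released code.

```python
# Sketch of a conditional rule: replace つた with った only when it
# directly follows a kanji character.
import re

KANJI = r"[\u4e00-\u9fff]"  # CJK unified ideographs

def apply_conditional_rule(sentence):
    # the lookbehind keeps the preceding kanji intact and rewrites only つた
    return re.sub(rf"(?<={KANJI})つた", "った", sentence)

print(apply_conditional_rule("彼は働いて居つた。"))  # -> 彼は働いて居った。
```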
This step receives as input a single sentence from the target corpus and
returns the sentence after normalization. Each rule set is visited in turn for
possible applications. If a match is found and meets the condition of the
group, the word form is replaced by the value in the rule’s key-value pair. The
procedure builds state about the match in a list of start
and end
index
pairs; this information is stored in a dictionary and is needed for successful
alignment of UD annotations to the historical word forms in the subsequent
step. Following is a breakdown showing an example sentence, two matches, and
the corresponding normalized output:
This sentence contains one historical word form, 黨, and an ambiguous character sequence がよい. The former is normalized to the form 党. The latter in this context can be spelled as が良い with the use of one kanji. The alternate form is「通い」, which is normally pronounced as 「かよい」, but in the case where it is preceded by a general noun representing a commercial location, pronounced as 「がよい」. The former is a global rule while the latter may only trigger if the word form is preceded by a hiragana. This normalized sentence is ready for submission to GiNZA for parsing, which returns UD metadata in CoNLL-U format.
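The simplified sketch below captures this step for a flat rule set, recording the start and end indices of each replacement in the normalized sentence; it omits the conditional rule groups described above and uses only the 黨 → 党 and がよい → が良い rules from the example. The short input sentence is likewise illustrative.

```python
# Simplified normalization: apply substitution rules to one sentence and
# record the (start, end) span of each replacement in the normalized
# text, which the later alignment step relies on.
def normalize(sentence, rules):
    matches = []
    out = sentence
    for old, new in rules.items():
        start = out.find(old)
        while start != -1:
            out = out[:start] + new + out[start + len(old):]
            matches.append({"rule": (old, new),
                            "start": start,
                            "end": start + len(new) - 1})
            start = out.find(old, start + len(new))
    return out, matches

normalized, spans = normalize("小政黨がよい", {"黨": "党", "がよい": "が良い"})
print(normalized)  # 小政党が良い
print(spans)
```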
The fundamental problem with the CoNLL-U output returned by GiNZA in the previous step is that the UD annotations supplied are for the normalized word forms, not the historical word forms that appear in the original text; the annotations must therefore be aligned back to those historical forms.
However, the alignment is complicated by the nature of tokenization: the normalized word form that needs to be replaced may not be contained within a single row of the FORM field. This yields two scenarios when doing the alignment:
We envisage each row as containing two parts: a part that is not influenced by
the normalization (part A) and a part that is (part B). The two scenarios are
demonstrated using the GiNZA output from the example sentence in the previous
step.
Scenario #1: the normalized form 党. The start-end pair (6,6) identifies 小 as
the part not influenced by normalization (part A) and 党 as the part that is
(part B). The normalized form is fully contained by the row and, therefore,
string replacing 党 with 黨 completes the alignment.
Scenario #2: the normalized form が良い. The start-end pair (17, 19) identifies part A as empty and part B as split across two rows, rows 11 and 12. To help make informed decisions, we use the normalized GiNZA output to guide the character lengths to maintain for each row. That is, the lengths of each normalized row should remain unchanged after the alignment is completed.
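A simplified sketch of the alignment idea is given below: characters of the historical form are distributed across the FORM rows spanned by its normalized counterpart so that each row keeps its normalized character length. The function name and row values are illustrative, and the sketch assumes the historical and normalized forms have the same length; the blank-form issue discussed below arises when they do not.

```python
# Distribute the historical form's characters over the FORM rows spanned
# by the normalized form, preserving each row's character length.
def align_rows(rows, normalized_form, historical_form):
    joined = "".join(rows)
    start = joined.find(normalized_form)
    if start == -1:
        return rows  # this rule did not touch these rows
    aligned, pos, hist = [], 0, list(historical_form)
    for row in rows:
        chars = []
        for ch in row:
            # substitute characters that fall inside the normalized span
            if start <= pos < start + len(normalized_form) and hist:
                chars.append(hist.pop(0))
            else:
                chars.append(ch)
            pos += 1
        aligned.append("".join(chars))
    return aligned

# Scenario #1: normalized form contained within a single row
print(align_rows(["小", "政党"], "党", "黨"))          # -> ['小', '政黨']
# Scenario #2: normalized form split across two rows
print(align_rows(["が", "良い"], "が良い", "がよい"))   # -> ['が', 'よい']
```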
There is a possibility for conflicts to arise during processing. Two main issues
are addressed here: (1) overlapping rules, and (2) rows with blank
forms.
There are scenarios where multiple rules can fire on the same historical word form or portions of it. These are usually due to the large number of kanji substitution rules in the rule set. This introduces undefined behavior as the alignment step is unable to determine which rule should have precedence and be applied first. The following gives an example of such a case with two rules that overlap:
The second rule is a simple substitution of a traditional kanji character with
a modern character. The first rule is a bit of a hack
where the second
character 「た」, pronounced ta,
is substituted with 「て」, pronounced te.
The
application of the first rule thus changes it to a more modern way of
communicating the same meaning at the cost of changing the pronunciation.
We define two or more rules to be overlapping
when the ranges of the indices
covered by the historical forms to be substituted overlap. When a rule is fully
covered by another rule, that is, its historical form is fully contained by the
historical form of another rule, the covered rule is jettisoned from
application as it is assumed that longer rules are more specific – and, hence,
more useful – than general
short rules like kanji substitution rules. The
above is an example of such a scenario where candidate rule #2 is a kanji
substitution rule and is removed from processing.
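The containment check itself can be sketched as follows, with each candidate match represented as a (start, end, historical form) span over the sentence; the representation and the example values are illustrative rather than taken from the actual implementation.

```python
# Drop candidate rule matches whose span is strictly contained within the
# span of a longer candidate; spans are (start, end, historical form).
def drop_covered(candidates):
    kept = []
    for i, (s1, e1, form1) in enumerate(candidates):
        covered = any(
            i != j and s2 <= s1 and e1 <= e2 and (e2 - s2) > (e1 - s1)
            for j, (s2, e2, _) in enumerate(candidates)
        )
        if not covered:
            kept.append((s1, e1, form1))
    return kept

# an illustrative single-character kanji rule fully inside a longer rule match
print(drop_covered([(10, 11, "黨た"), (10, 10, "黨")]))  # keeps only (10, 11, '黨た')
```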
Some scenarios can be more complex when the historical form is not covered by
another rule, as in the following example. The seventh character of this
sentence 「か」 is normally pronounced ka,
but the traditional (early modern
period) spelling, with the succeeding character 「う」(u), forces the
pronunciation 「こ」instead.
When the simple deletion technique above is no longer applicable, we form non-conflicting combinations of rules that also maximize the number of rules to include. Only two exist for this example: {かう} and {働か}. To determine which to use for processing, each combination is scored along four axes:
bad reading, that is, the reading given in the MISC field does not contain a katakana pronunciation (e.g., 日本 instead of ニホン).
agreement percentage in the BLEX, CLAS, and MLAS metrics, computed by comparing the aligned parsing from a combination against that with no rule application.
The combination with a minimum score is selected for processing. If multiple minima exist, the instance is flagged for inspection. However, this has not occurred during our experiments with the rule set applied as of this writing.
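The combination-forming part of this procedure can be sketched as below: subsets of candidate matches whose spans do not overlap are enumerated, and only those of maximum size are kept, with the scoring along the axes above applied afterwards as a separate step. Spans and names are illustrative, reusing the {かう} and {働か} example.

```python
# Enumerate non-conflicting combinations of candidate rule matches
# (no overlapping spans) and keep only the combinations of maximum size;
# each candidate is (start, end, historical form).
from itertools import combinations

def overlaps(a, b):
    return not (a[1] < b[0] or b[1] < a[0])

def maximal_combinations(candidates):
    for size in range(len(candidates), 0, -1):
        found = [c for c in combinations(candidates, size)
                 if all(not overlaps(a, b) for a, b in combinations(c, 2))]
        if found:
            return found
    return []

candidates = [(6, 7, "かう"), (5, 6, "働か")]
for combo in maximal_combinations(candidates):
    print([form for *_, form in combo])  # -> ['かう'] and ['働か']
```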
Situations can arise where there are not enough characters in the historical
word form to fill
the rows spanned by the normalized word form. In the
following example sentence, the normalized form 而かして (or, alternatively, しこうして) spans the first four rows of the FORM field in the CoNLL-U output returned by GiNZA.
Only two characters from the historical form are available to distribute among four rows, which results in the alignment step leaving the third and fourth rows of the FORM field empty upon completion. While the issue seems like an implementation error, it points to a problem with the rule itself: the normalized form is not helpful in guiding GiNZA to a more accurate parse that sees 而かして as a single word form, hence the tokenization into multiple rows. The solution is an adjustment of the normalized form in the rule, e.g., changing 而かして to 而して. This yields a parse where the normalized form spans a single row, is more accurate, and allows the alignment step to proceed without error.
This section evaluates the rule set introduced by this research along three criteria: the frequency with which rules are applied across corpora, the effect of rule application on the CoNLL-U output produced by GiNZA, and the improvement rule application brings to word segmentation with UDPipe.
Figure 2 shows the top 10 most frequently applied rules in the Taiyo corpus, and the frequency of application for said rules across the other three corpora.
Overall, we observe fair representation of the Taiyo rules in the other corpora.
On average, these rules make up roughly a quarter of rule applications with respect to the other corpora, with the other three contributing about evenly to the remaining 75% of applications. Indeed, some rules saw disproportionately more application in Taiyo than in other corpora; e.g., '會' saw over 30% application in Taiyo but only 10% in Meiroku. As a noun the character means group and as a verb it means to meet; it is used here mostly in the former sense. The difference in frequency is due to the fact that Taiyo speaks more frequently about groups (specifically, political parties and groups) than Meiroku.
If the proposed approach is to be successful in bringing improvement to a fundamental NLP task like word segmentation, it must first have an observable effect on the resulting parsings generated by GiNZA. This is especially critical when evaluating the method against unlabeled corpora like Taiyo, where predictions cannot be compared against ground truth labels. In the absence of ground truth, visualizing disagreements in CoNLL-U output between the proposed approach and what a pre-trained tool would normally generate can show whether the approach has any effect on the output and how large that difference is. If so, then this raises the possibility of improved performance on the historical materials.
To evaluate this, CoNLL-U output with and without rule application is compared across the four corpora using the BLEX, CLAS, and MLAS metrics as defined for the CoNLL 2018 shared task on UD parsing.
We observe a gradual increase in disagreement with the original parsing as more rules are introduced, with the maximum disagreement at full rule usage reaching 22.8% in the BLEX metric from the Taiyo corpus and the minimum 12.8% in the CLAS metric from Josei. BLEX results yield the highest disagreements because of the strict conditions it places on the two CoNLL-U files being compared, relative to the other two metrics. Despite frequent rules being introduced last, the amount of disagreement begins to plateau when rule usage approaches the complete rule set; this could be an artefact of character substitution rules that, while frequently applied, have minimal effect on the dependency structure.
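For reference, a comparison of this kind can be run with the evaluation script distributed for the CoNLL 2018 shared task (conll18_ud_eval.py), which reports CLAS, MLAS, and BLEX among its metrics. The sketch below is a hypothetical invocation with placeholder file names; it treats one parse as the reference so that the scores measure agreement rather than accuracy, and whether the experiments were run exactly this way is an assumption.

```python
# Hypothetical invocation of the CoNLL 2018 shared-task evaluation script
# to compare two CoNLL-U files produced with and without rule application.
import subprocess

result = subprocess.run(
    ["python", "conll18_ud_eval.py", "-v",
     "taiyo_with_rules.conllu",       # parse produced after rule application
     "taiyo_without_rules.conllu"],   # parse from the unmodified pipeline
    capture_output=True, text=True, check=True)
print(result.stdout)  # the verbose table includes CLAS, MLAS, and BLEX rows
```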
We evaluate whether the observed effect on the dependency structure of the CoNLL-U output can bring an improvement in a basic NLP task, word segmentation, using UDPipe. The test is performed by comparing three experimental set-ups: a model trained on data generated with rule application (the true model), a model trained on data generated without rule application (the false model), and the pretrained model.
The
training data is prepared using five-fold cross-validation over the documents in
the Taiyo corpus where the testing fold is not used due to lack of word-level
metadata. Instead, the trained models are tested against documents from the
Kokumin and Meiroku collections, which supply short word unit (SUW) annotations and can be used as ground truth. Figure 4 shows an example of the SUW tag for two
tokens in the Kokumin corpus, 陛下
and 及び.
The experiment is repeated 10 times
for the two setups with and without rule application. In total, 101 different
models are evaluated.
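To illustrate the tooling involved, the sketch below loads a trained UDPipe 1 model through the ufal.udpipe Python bindings and tokenizes raw text into CoNLL-U. The model path and input sentence are placeholders, and the model is assumed to have been trained separately (e.g., with UDPipe's command-line trainer) on the generated CoNLL-U data.

```python
# Hypothetical sketch: tokenize raw text with a UDPipe 1 model via the
# ufal.udpipe bindings (pip install ufal.udpipe). The model file name is
# a placeholder for a model trained on the rule-generated data.
from ufal.udpipe import Model, Pipeline

model = Model.load("taiyo_rules.udpipe")
pipeline = Pipeline(model, "tokenize",
                    Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
print(pipeline.process("歴史的資料を解析する。"))  # CoNLL-U with one row per token
```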
In keeping with methods proposed in prior work, we evaluate B-label estimation using the metrics precision, recall, and F1. However, we adjust the meaning of the B label to mean start of token and the I label to mean rest of token. Predictions using the tokenizer option in UDPipe are compared with the B and I truth labels derived from the SUW tags in Meiroku and Kokumin. Figure 5 reports the results.
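The scoring itself can be made concrete with the following sketch: a tokenization is converted into character-level B/I labels and the predicted B labels are scored against the gold B labels. The token lists are illustrative, with the gold side mimicking an SUW-style segmentation of 陛下 and 及び; note how the over-segmented prediction attains perfect recall but lower precision, the pattern discussed below.

```python
# Convert tokenizations into character-level B/I labels (B = start of
# token, I = rest of token) and score predicted B labels with precision,
# recall, and F1.
def bi_labels(tokens):
    labels = []
    for tok in tokens:
        labels.extend(["B"] + ["I"] * (len(tok) - 1))
    return labels

def score_b(gold_tokens, pred_tokens):
    gold, pred = bi_labels(gold_tokens), bi_labels(pred_tokens)
    tp = sum(g == p == "B" for g, p in zip(gold, pred))
    precision = tp / pred.count("B")
    recall = tp / gold.count("B")
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# gold segmentation vs. an over-segmented prediction of the same text
print(score_b(["陛下", "及び"], ["陛", "下", "及び"]))  # ≈ (0.67, 1.0, 0.8)
```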
Significant improvements obtained using rule application are in precision and F1.
For Kokumin, the true
model gives a 3.4% improvement in precision and a 1.7%
improvement in F1 over the false
model, and a 7.5% improvement in precision and
a 9.3% improvement in F1 over the pretrained model. For Meiroku, the true
model
gives a 3.4% improvement in precision and a 1.7% improvement in F1 over the
false
model, and a 7.0% improvement in precision and a 10.2% improvement in F1
over the pretrained model. We emphasize the improvements brought by precision as
being most significant as tokenizations that are inaccurate often produce many
tokens (i.e., B
labels) that result in high recall but low precision.
Indeed, some management of the rule set is needed to achieve an observable improvement. The proposed workflow is not totally automatic and care is needed to ensure rules that are introduced into the collection do in fact lead to more accurate parsings produced by GiNZA (or a pretrained tool of choice) and that overlapping rules do not produce a condition where multiple minima exist. Problems with the former usually present as rows with empty forms that can be detected with ease. Moreover, the amount of manual time needed for review is still reduced as the reviewer need only to concentrate review on the FORM field to obtain improved dependency parsings on historical materials.
While the proposed workflow has an effect on the CoNLL-U output produced by GiNZA, and said effect brings a significant improvement in precision and F1 for B-label word estimation, the results also point toward a need for mechanisms that facilitate expansion of the rule collection and, in turn, further the changes made to the dependency structure in the CoNLL-U output; this has the potential to bring more improvement in the performance of trainable NLP pipelines like UDPipe on fundamental NLP tasks for historical materials.
Perhaps one step in this direction is NLP methods that can flag instances of pretrained output with inaccurate parsings that need review, thereby allowing easier rule introduction. Alternatively, another approach is to orient the research towards automatic rule inferencing, which would pave the way for rapid expansion of the current rule collection. One possibility is to allow for rule chaining, that is, the application of one rule that triggers the application of one or more new rules that were not directly applicable to the original sentence. Methods from deep learning, such as encoder-decoder models, may also prove useful for this purpose.
In this work we introduced a rule-based workflow for providing improved UD annotations to historical Japanese corpora. The principal advantage of our approach is that no gold-standard data is required for training data development, only the availability of a pre-trained model in the target language. Moreover, the amount of time needed for post-editing pre-trained model output is significantly reduced, as the reviewer need only develop rules that address problems in the FORM field and the review does not require deep expertise in UD. We showed that this cheaper review strategy exhibits an effect on the dependency structure in the CoNLL-U output and, furthermore, brings an improvement in the performance of trainable language-agnostic NLP pipelines like UDPipe on word segmentation tasks.
These results are encouraging to DH scholars who would like to enhance their scholarship by annotating historical materials with linguistic metadata that is customizable and more reliable than what would be possible by the straightforward application of off-the-shelf tools. Future work would do well to further expedite the manual review needed to achieve good results on the target corpus by incorporating methods that flag inaccurate pretrained output and allow automatic rule inferencing. We also caution against the development of techniques that could be interesting for a venue in NLP but offer little for the scholar who would use said techniques on a target DH corpus.
We would like to thank the Department of Computer Science at the University of Miami for providing the computational resources necessary for running the experiments in this research. The work is in part supported by the National Science Foundation Grant CNS P2145800.