David Bamman is a senior researcher in computational linguistics for the Perseus Project, focusing especially on natural language processing for Latin and Greek, including treebank construction, computational lexicography, morphological tagging and word sense disambiguation. David received a BA in Classics from the University of Wisconsin-Madison and an MA in Applied Linguistics from Boston University. He is currently leading the development of the Latin Dependency Treebank and the Dynamic Lexicon Project.
Gregory Crane, Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship at Tufts University, is the editor in chief of the Perseus Project. He has a broad interest in and has published extensively on the interaction between intellectual practice and technological infrastructure in the humanities.
Manual lexicography has produced extraordinary results for Greek and Latin, but it cannot in the immediate future provide for all texts the same level of coverage available for the most heavily studied materials. As we build a cyberinfrastructure for Classics in the future, we must explore the role that automatic methods can play within it. Using technologies inherited from the disciplines of computational linguistics and computer science, we can create a complement to these traditional reference works: a dynamic lexicon that presents statistical information about a word’s usage in context, including its sense distribution within various authors, genres and eras, as well as syntactic information.
Automated methods for lexicography
Advertisement for the Lewis & Short Latin Dictionary, March 1, 1879: “…Great advances have been made in the sciences on which lexicography depends. Minute research in manuscript authorities has largely restored the texts of the classical writers, and even their orthography. Philology has traced the growth and history of thousands of words, and revealed meanings and shades of meaning which were long unknown. Syntax has been subjected to a profounder analysis. The history of ancient nations, the private life of the citizens, the thoughts and beliefs of their writers have been closely scrutinized in the light of accumulating information. Thus the student of to-day may justly demand of his Dictionary far more than the scholarship of thirty years ago could furnish.”
The “scholarship of thirty years ago” that Lewis and Short here distance themselves from is Andrews' 1850
Latin-English lexicon, itself largely a
translation of Freund’s German Wörterbuch published
only a decade before. As we design a cyberinfrastructure to support Classical Studies
in the future, we will soon cross a similar milestone: the Oxford
Latin Dictionary (1968-1982) has begun the slow process of becoming thirty
years old (several of the earlier fascicles have already done so) and by 2012 the
eclipse will be complete. Founded on the same lexicographic principles that produced
the juggernaut
Manual methods, however, cannot in the immediate future provide for all texts the
same level of coverage available for the most heavily studied materials, and as we
think toward Classics in the next ten years, we must think not only of desiderata,
but also of the means that would get us there. Like Lewis and Short, we can also say
that great advances have been made over the past thirty years in the sciences
underlying lexicography; but the sciences
that we group in that statement
include not only the traditional fields of paleography, philology, syntax and
history, but computational linguistics and computer science as well.
Lexicographers have long used computers as an aid in dictionary production, but the
recent rise of statistical language processing now lets us do far more: instead of
using computers to simply expedite our largely manual labor, we can now use them to
uncover knowledge that would otherwise lie hidden in expanses of text. Digital
methods also let us deal well with scale. For instance, while the
In deciding how we want to design a cyberinfrastructure for Classics over the next ten years, there is an important question that lurks between “where are we now?” and “where do we want to be?”: where are our colleagues already? Computational
linguistics and natural language processing generally perform best in high-resource
languages — languages like English, on which computational research has been focusing
for over sixty years, and for which expensive resources (such as treebanks,
ontologies and large, curated corpora) have long been developed. Many of the tools we
would want in the future are founded on technologies that already exist for English
and other languages; our task in designing a cyberinfrastructure may simply be to
transfer and customize them for Classical Studies. Classics arguably has the best-curated collection of texts in the world, and the uses its scholars demand from
that collection are unique. In the following I will document the technologies
available to us in creating a new kind of reference work for the future — one that
complements the traditional lexicography exemplified by the
In answering this question, I am mainly concerned with two issues: the production of reference works (i.e., the act of lexicography) and the use that scholars make of them.
All of the reference works available in Classics are the products of manual labor, in
which highly skilled individuals find examples of a word in context, cluster those
examples into distinguishable senses,
and label those senses with a word or
phrase in another language (like English) or in the source language (as with the
apt
sentences as they come across them (as with the
We can contrast this computer-assisted lexicography with a new variety — which we
might more properly call computational lexicography
— that has emerged with
the COBUILD project
This corpus-based approach has since been augmented in two dimensions. On the one
hand, dictionaries and lexicographic resources are being built on larger and larger
textual collections: the German
In their ability to include statistical information about a word’s actual use, these contemporary projects are exploiting advances in computational linguistics that have been made over the past thirty years. Before turning, however, to how we can adapt these technologies in the creation of a new and complementary reference work, we must first address the use of such lexica.
Like the
This is what we might consider a manual form of lemmatized searching.
The
Perseus Digital Library
The advantage of the Perseus and TLG lemmatized search is that it gives scholars the
opportunity to find all the instances of a given word form or lemma in the textual
collections they each contain. The
The
In order to accomplish this, we need to consider the role that automatic methods can
play within our emerging cyberinfrastructure. I distinguish cyberinfrastructure from
the vast corpora that exist for modern languages not only in the structure imposed
upon the texts that comprise it, but also in the very composition of those texts:
while modern reference corpora are typically of little interest in themselves (as
mainly newswire), Classical texts have been the focus of scholars’ attention for
millennia. The meaning of the word
We therefore must concentrate on two problems. First, how much can we automatically learn from a large textual collection using machine learning techniques that thrive on large corpora? And second, how can the vast labor already invested in handcrafted lexica help those techniques to learn?
What we can learn from such a corpus is actually quite significant. With a large
bilingual corpus, we can induce a word sense inventory to establish a baseline for
how frequently certain definitions of a word are manifested in actual use; we can
also use the context surrounding each word to establish which particular definition
is meant in any given instance. With the help of a treebank (a handcrafted collection
of syntactically parsed sentences), we can train an automatic parser to parse the
sentences in a monolingual corpus and extract information about a word’s
subcategorization frames (the common syntactic arguments it appears with — for
instance, that the verb
If we leverage all of these techniques to create a lexicon for both Latin and Greek, the lexical entries in each reference work could include the following:
In creating a lexicon with these features, we are exploring two strengths of automated methods: they can analyze not only very large bodies of data but also provide customized analysis for particular texts or collections. We can thus not only identify patterns in one hundred and fifty million words of later Latin but also compare which senses of which words appear in the one hundred and fifty thousand words of Thucydides. Figure 1 presents a mock-up of what a dictionary entry could look like in such a dynamic reference work. The first section (
We have already begun work on a dynamic lexicon like that shown in Figure 1.
Each of these technologies has a long history of development both within the Perseus Project and in the natural language processing community at large. In the following I will detail how we can leverage them all to uncover large-scale usage patterns in a text.
Our work on building a Latin sense inventory from a small collection of parallel
texts in our digital library is based on that of Brown et
al. 1991 and Gale et al. 1992, who suggest
that one way of objectively detecting the real senses of any given word is to
analyze its translations: if a word is translated as two semantically distinct
terms in another language, we have evidence of two distinct senses.
Finding all of the translation equivalents for any given word then becomes a task of aligning the source text with its translations, at the level of individual words. The Perseus Digital Library contains at least one English translation for most of its Latin and Greek prose and poetry source texts. Many of these translations are encoded under the same canonical citation scheme as their source, but must further be aligned at the sentence and word level before individual word translation probabilities can be calculated. The workflow for this process is shown in Figure 2.
Since the XML files of both the source text and its translations are marked up with the same reference points, we can first pair the two texts chunk by chunk and retain the sentences that stand in a 1-1 relationship across those chunks.
In step 3, we then align these 1-1 sentences using GIZA++
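The chunk- and sentence-level pairing that feeds this word alignment can be sketched in a few lines of code. The fragment below is only an illustration, not the Perseus pipeline itself: it assumes that each citable chunk carries its citation in an @n attribute and uses a deliberately naive sentence splitter in place of a real segmenter.

```python
# A sketch of the chunk- and sentence-level pairing that precedes word alignment.
# The XML handling is a simplification (each citable chunk is assumed to be an
# element carrying its citation in an @n attribute), and the naive split on
# periods stands in for a real sentence segmenter.
from collections import defaultdict
from xml.etree import ElementTree


def chunks_by_citation(path):
    """Map each canonical citation value to the text of its chunk."""
    chunks = defaultdict(list)
    for elem in ElementTree.parse(path).iter():
        cite = elem.get("n")
        if cite and elem.text and elem.text.strip():
            chunks[cite].append(elem.text.strip())
    return {cite: " ".join(parts) for cite, parts in chunks.items()}


def one_to_one_sentence_pairs(source_path, translation_path):
    """Yield (source_sentence, translation_sentence) pairs for citations whose
    chunks contain exactly one sentence on each side; only these unambiguous
    1-1 pairs are passed on to the word aligner."""
    source = chunks_by_citation(source_path)
    translation = chunks_by_citation(translation_path)
    for cite in source.keys() & translation.keys():
        src = [s.strip() for s in source[cite].split(".") if s.strip()]
        trg = [s.strip() for s in translation[cite].split(".") if s.strip()]
        if len(src) == 1 and len(trg) == 1:
            yield src[0], trg[0]
```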
Figure 3 shows the result of this word alignment (here
with English as the source language). The original, pre-lemmatized Latin is
From these alignments we can calculate overall translation probabilities, which we currently present as an ordered list, as in Figure 4.
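The tally behind an ordered list like that in Figure 4 is straightforward. The sketch below assumes a simplified input format (one pair of Latin lemma and aligned English word per alignment link) and uses invented counts for a headword such as oratio purely for illustration.

```python
# A sketch of turning word-alignment links into the ordered list of translation
# equivalents shown in Figure 4. Each link is assumed to be a (latin_lemma,
# english_word) pair emitted by the aligner; the data at the bottom is a toy example.
from collections import Counter, defaultdict


def translation_probabilities(alignment_links):
    """Return, for each lemma, its English equivalents ranked by relative frequency."""
    counts = defaultdict(Counter)
    for latin_lemma, english_word in alignment_links:
        counts[latin_lemma][english_word] += 1
    ranked = {}
    for lemma, counter in counts.items():
        total = sum(counter.values())
        ranked[lemma] = [(english, n / total) for english, n in counter.most_common()]
    return ranked


# Toy links, not real corpus counts.
links = [("oratio", "speech"), ("oratio", "speech"),
         ("oratio", "oration"), ("oratio", "prayer")]
print(translation_probabilities(links)["oratio"])
# [('speech', 0.5), ('oration', 0.25), ('prayer', 0.25)]
```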
The weighted list of translation equivalents we identify using this technique can
provide the foundation for our further lexical work. In the example above, we have
induced from our collection of parallel texts that the headword
The granularity of the definitions in such a dynamic lexicon cannot approach that
of human labor: the Lewis and Short entry ranges from “speech” to “formal language” to “the power of oratory” and beyond. Our approach, however, does have two clear advantages which complement
those of traditional lexica: first, this method allows us to include statistics
about actual word usage in the corpus we derive it from. The use of
Second, our word alignment also maps multi-word expressions, so we can include
significant collocations in our lexicon as well. This allows us to provide
translation equivalents for idioms and common phrases such as
Approaches to word sense disambiguation generally come in three varieties:
raw, unannotated text, either a monolingual corpus
Corpus methods (especially supervised methods) generally perform best in the
SENSEVAL competitions — at SENSEVAL-3, the best system achieved an accuracy of
72.9% in the English lexical sample task and 65.1% in the English all-words
task.
Since the Perseus Digital Library contains two large monolingual corpora (the
canon of Greek and Latin classical texts) and sizable parallel corpora as well, we
have investigated using parallel texts for word sense disambiguation. This method
uses the same techniques we used to create a sense inventory to disambiguate words
in context. After we have a list of possible translation equivalents for a word,
we can use the surrounding Latin or Greek context as an indicator for which sense
is meant in texts where we have no corresponding translation. There are several
techniques available for deciding which sense is most appropriate given the
context, and several different ways of defining what that context itself should be. One technique that we have experimented with is a naive
Bayesian classifier (following Gale et al. 1992),
with context defined as a sentence-level bag of words (all of the words in the
sentence containing the word to be disambiguated contribute equally to its
disambiguation).
Bayesian classification is perhaps most familiar from spam filtering. A filtering
program can decide whether or not any given email message is spam by looking at
the words that comprise it and comparing it to other messages that are already
known to be spam — some words generally only appear in spam messages (e.g.,
We can also use this principle to disambiguate word senses by building a classifier for every sense and training it on sentences where we do know the correct sense for a word. Just as a spam filter is trained by a user explicitly labeling a message as spam, this classifier can be trained simply by the presence of an aligned translation.
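Such a classifier is compact enough to sketch directly. The code below is an illustration of the technique rather than our production system: each training instance is a (sense label, bag of context words) pair, where in practice the label would be supplied by the English word aligned to the target, and add-one smoothing handles context words never seen with a given sense.

```python
# A minimal naive Bayes classifier in the spirit of Gale et al. 1992: the "sense"
# labels come from aligned translations, the context is a sentence-level bag of
# words, and add-one smoothing handles unseen context words. Illustrative only.
import math
from collections import Counter, defaultdict


class SenseClassifier:
    def __init__(self):
        self.sense_counts = Counter()            # how often each sense was seen
        self.word_counts = defaultdict(Counter)  # context-word counts per sense
        self.vocabulary = set()

    def train(self, labeled_instances):
        """labeled_instances: iterable of (sense_label, context_words) pairs."""
        for sense, words in labeled_instances:
            self.sense_counts[sense] += 1
            for word in words:
                self.word_counts[sense][word] += 1
                self.vocabulary.add(word)

    def classify(self, context_words):
        """Return the sense with the highest (smoothed) log posterior."""
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            score = math.log(count / total)  # prior: how common the sense is overall
            denominator = sum(self.word_counts[sense].values()) + len(self.vocabulary)
            for word in context_words:
                score += math.log((self.word_counts[sense][word] + 1) / denominator)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense
```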
For instance, the Latin word
Word sense disambiguation will be most helpful for the construction of a lexicon
when we are attempting to determine the sense for words in context for the large
body of later Latin literature for which there exists no English translation. By
training a classifier on texts for which we do have translations, we will be able
to determine the sense in texts for which we don’t: if the context of
Two of the features we would like to incorporate into a dynamic lexicon are based
on a word’s role in syntax: subcategorization and selectional preference. A verb’s
subcategorization frame is the set of possible combinations of surface syntactic
arguments it can appear with. In linear, unlabeled phrase structure grammars,
these frames take the form of, for example,
A predicate’s selectional preference specifies the type of argument it generally
appears with. The verb
In order to extract this kind of subcategorization and selectional information
from unstructured text, we first need to impose syntactic order on it. One option
for imposing this kind of order is through manual annotation, but this option is
not feasible here due to the sheer volume of data involved — even the more
resourceful of such endeavors (such as the Penn Treebank
A second, more practical option is to assign syntactic structure to a sentence using automatic methods. Great progress has been made in recent years in the area of syntactic parsing, both for phrase structure grammars (Charniak 2000, Collins 1999) and dependency grammars (Nivre et al. 2006, McDonald et al. 2005), with labeled dependency parsing achieving an accuracy rate approaching 90% for English (a high resource, fixed word order language) and 80% for Czech (a relatively free word order language like Latin and Greek). Automatic parsing generally requires the presence of a treebank — a large collection of manually annotated sentences — and a treebank’s size directly correlates with parsing accuracy: the larger the treebank, the better the automatic analysis.
We are currently in the process of creating a treebank for Latin, and have just begun work on a one-million-word treebank of Ancient Greek. The Latin Dependency Treebank is now in version 1.5; a sentence glossed as “that glory would know my old age” would look like the following:
While this treebank is still in its infancy, we can already use it to train a parser to analyze the volumes of unstructured Latin in our collection. The treebank is too small for that parser to achieve state-of-the-art accuracy, but we can nevertheless induce valuable lexical information from its output by using a large corpus and simple hypothesis-testing techniques to outweigh the noise of the occasional error.
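As a rough illustration of that induction step, the sketch below counts surface subcategorization frames and direct-object lemmas for each verb in automatically parsed sentences. The token format and the relation labels (SBJ, OBJ) are assumptions made for the example; in practice the raw counts would then be filtered with a hypothesis test so that patterns produced only by parse errors are discarded.

```python
# A sketch of harvesting subcategorization frames and selectional-preference
# counts from parser output. Each sentence is assumed to be a list of tokens of
# the form (form, lemma, pos, head_index, relation), with head_index pointing at
# the 0-based position of the governing token (-1 for the root); the relation
# labels used here (SBJ, OBJ) are illustrative.
from collections import Counter, defaultdict


def harvest(parsed_sentences):
    frames = defaultdict(Counter)   # verb lemma -> counts of argument-label sets
    objects = defaultdict(Counter)  # verb lemma -> counts of direct-object lemmas
    for sentence in parsed_sentences:
        for i, (form, lemma, pos, head, rel) in enumerate(sentence):
            if pos != "verb":
                continue
            # collect the lemma and relation of every token governed by this verb
            dependents = [(d_lemma, d_rel)
                          for (_, d_lemma, _, d_head, d_rel) in sentence if d_head == i]
            # surface frame: the set of core relations the verb appears with here
            frame = tuple(sorted({r for _, r in dependents if r in {"SBJ", "OBJ"}}))
            frames[lemma][frame] += 1
            for d_lemma, d_rel in dependents:
                if d_rel == "OBJ":
                    objects[lemma][d_lemma] += 1
    # In practice, frame and object counts would be subjected to simple hypothesis
    # testing so that patterns produced by occasional parse errors are discarded.
    return frames, objects
```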
These technologies, borrowed from computational linguistics, will give us the grounding to create a new kind of lexicon, one that presents information about a word’s actual usage. This lexicon resembles its more traditional print counterparts in that it is a work designed to be browsed: one looks up an individual headword and then reads its lexical entry. The technologies that will build this reference work, however, do so by processing a large Greek and Latin textual corpus. The results of this automatic processing go far beyond the construction of a single lexicon.
I noted earlier that all scholarly dictionaries include a list of citations illustrating a word’s exemplary use. As Figure 1 shows, each entry in this new, dynamic lexicon ultimately ends with a list of canonical citations to fixed passages in the text. These citations are again a natural index to a corpus, but since they are based in an electronic medium, they provide the foundation for truly advanced methods of textual searching — going beyond a search for individual word forms (as in typical search engines) to word sense.
The ability to search a Latin or Greek text by an English translation equivalent
is a close approximation to real cross-language information retrieval. Consider
scholars researching Roman slavery: they could compare all passages where any
number of Latin slave
words appear, but this would lead to separate
searches for
Searching by word sense also allows us to investigate problems of changing
orthography — both across authors and time: as Latin passes through the Middle
Ages, for instance, the spelling of words changes dramatically even while meaning
remains the same. So, for example, the diphthong
The ability to search by a predicate’s selectional preference is also a step
toward semantic searching — the ability to search a text based on what it
means.
In building the lexicon, we automatically assign an argument
structure to all of the verbs. Once this structure is in place, it can stay
attached to our texts and thereby be searchable in the future, allowing us to
search a text for the subjects and direct objects of any verb. Our scholar
researching Roman slavery can use this information to search not only for passages
where any slave has been freed (i.e., when any Latin variant of the English
translation
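In rough outline, the kind of query that such stored argument structure would support might look like the sketch below; the record format and the lemma lists are invented for illustration.

```python
# A sketch of querying stored argument-structure annotations: find the passages in
# which any verb from one set governs any lemma from another set as its direct
# object. The (verb_lemma, relation, argument_lemma, citation) record format and
# the toy records are assumptions for illustration, not actual corpus data.
def passages_with_object(records, verb_lemmas, object_lemmas):
    """Return the citations where any of the verbs takes any of the objects."""
    return sorted({citation for verb, relation, argument, citation in records
                   if relation == "OBJ" and verb in verb_lemmas and argument in object_lemmas})


records = [("libero", "OBJ", "servus", "passage-A"),
           ("manumitto", "OBJ", "ancilla", "passage-B"),
           ("libero", "OBJ", "urbs", "passage-C")]
print(passages_with_object(records, {"libero", "manumitto"}, {"servus", "ancilla", "verna"}))
# ['passage-A', 'passage-B']
```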
Manual lexicography has produced fantastic results for Classical languages, but as we design a cyberinfrastructure for Classics in the future, our aim must be to build a scaffolding that is essentially enabling: it must make historical languages more accessible not only on a functional level but on an intellectual one as well; it must give students the resources they need to understand a text while also providing scholars the tools to interact with it in whatever ways they see fit. In this a dynamic lexicon fills a gap left by traditional reference works. By creating a lexicon directly from a corpus of texts and then situating it within that corpus itself, we can let the two interact in ways that traditional lexica cannot.
Even driven by the scholarship of the past thirty years, however, a dynamic lexicon cannot yet compete with the fine sense distinctions that traditional dictionaries make, and in this the two works are complementary. Classics, however, is only one field among many concerned with the technologies underlying lexicography, and by relying on the techniques of other disciplines like computational linguistics and computer science, we can count on the future progress of disciplines far outside our own.