The three classes of digital project outlined above reflect three different sources
of energy: the industrialized processes of mass digitization and of general
algorithms, the specialized production of domain specific, machine actionable
knowledge, and the generalized ability for many different individuals to contribute,
in ways large and small. When these three sources of energy begin to interact with
one another, the resulting environment is qualitatively different not only from print
culture but from any of the three digital environments taken in isolation. Having
reviewed some developments in the previous generation, we can now begin to consider
the implications for ePhilology (primary and secondary sources relevant to classical
Greek and Latin), eClassics (ancient Greek and Latin as they work within human
minds), and Cyberinfrastructure (the material systems whereby we exchange the objects
of our intellectual labor and ourselves internalize these objects). The following
sections describe ePhilology and eClassics. The conclusion to this collection returns
to the Cyberinfrastructure towards which the individual articles point.
Producing new knowledge: ePhilology
Any one can discourse to you forever about the
advantages of a brave defence, which you know already. But instead of listening
to him I would have you day by day fix your eyes upon the greatness of Athens,
until you become filled with the love of her; and when you are impressed by the
spectacle of her glory, reflect that this empire has been acquired by men who
knew their duty and had the courage to do it, who in the hour of conflict had
the fear of dishonor always present to them, and who, if ever they failed in an
enterprise, would not allow their virtues to be lost to their country, but
freely gave their lives to her as the fairest offering which they could present
at her feast.
(Pericles’ Funeral Oration, Thuc. 2.43.1)
If we think only in terms of word searches, the production of camera-ready copy,
image management, the ability to generate basic maps, and manually produced formats
such as wikis and blogs, increased storage and computational power may seem
relatively unimportant. For anyone whose career extends back more than a decade, however,
current technologies are astonishingly powerful. In 1982, it cost the Harvard
Classics Computing Project $34,000 to purchase a 660 megabyte disk drive to store
early versions of the TLG: the disk was the size of a washing machine, arrived in
a wooden crate, needed a special disk controller, took two days for the
technicians to install and required modifications to the version of the Unix
operating system then available. The maintenance contract cost c. $4,000/year and
was essential. As this introduction is written, $100 buys a terabyte of storage:
more than 1000 times as much storage as its 1982 predecessor for more than 300 times
less money, a decrease in cost per byte by a factor of more than 300,000 in one quarter of a century. We can
now take for granted storage that was previously unimaginable, collecting huge
digital images as well as texts and datasets with little regard for the costs of
storage or computation. A generation ago, only a few of the wealthiest departments
could raise tens of thousands of dollars to provide the storage to search a few
million words of Greek and support the first generation of digital publishing. In
2008, many cell phones have more than enough storage and computational power to do
much more.
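The arithmetic behind this comparison can be checked in a few lines. The sketch below uses only the figures quoted above and assumes decimal units (1 megabyte = 10^6 bytes, 1 terabyte = 10^12 bytes); the variable names are ours and purely illustrative.

```python
# Figures quoted in the text. Decimal units are an assumption.
DISK_1982_BYTES = 660 * 10**6   # 660 megabyte drive
DISK_1982_COST = 34_000         # purchase price in dollars, excluding maintenance
DISK_2008_BYTES = 10**12        # one terabyte
DISK_2008_COST = 100            # dollars

# How much more storage, and how much less money.
capacity_ratio = DISK_2008_BYTES / DISK_1982_BYTES
price_ratio = DISK_1982_COST / DISK_2008_COST

# Change in the cost of storing a single byte.
cost_per_byte_ratio = capacity_ratio * price_ratio

print(f"capacity: {capacity_ratio:,.0f} times larger")
print(f"price: {price_ratio:,.0f} times lower")
print(f"cost per byte: {cost_per_byte_ratio:,.0f} times lower")
```

Under these assumptions the capacity ratio is roughly 1,500, the price ratio 340, and the combined change in cost per byte roughly 515,000, which is consistent with the figure of "more than 300,000" given above.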
All of us in the academy and in society as a whole, of course, already depend upon
general services, such as Google, that require stunning amounts of storage and
computational power — even academics who may proudly dissociate themselves from
the web of digital services depend completely upon those services for the paper
publications that arrive in the mail and the catalogues by which they find books
on the shelf. And, of course, we already depend upon digital infrastructure for
the paychecks, medical treatments and other fundamental components of material
life. Within classical studies, it is easy to see the need for vast networked
storage and high performance computing for the analysis and visualization of
quantitative and visual evidence from the material culture.
[34]
Consider the basic problem of reading Greek and Latin. The machine-actionable
Liddell-Scott-Jones (LSJ) Greek-English and Lewis and Short Latin-English lexica
developed by the Perseus Project contain 422,000 and 303,000 tagged citations to
800 Greek and 80 Latin authors, respectively. In LSJ, half of the 422,000 citations are to a
half dozen canonical authors. For Lewis and Short, the top dozen authors account
for more than two-thirds (215,000) of the citations.
Not all lexicographic projects have such narrow focus, but extensive lexicographic
coverage is extraordinarily labor intensive. The Thesaurus Linguae Latinae (TLL)
is building a lexicon that covers Latin from
earliest times through AD 600 and bases its work on an archive of 10,000,000 slips
with information about particular words. The TLL, which in 2008 boasts a staff of twenty
Latinists, began work in 1894, published its first fascicle in 1900, and has been an
international project since 1949. Its official website promises that the TLL will
during 2009 “reach the end of the letter P, at
which point more than two thirds of the complete work will have
appeared”.
[35]
The ten million or so words of ancient Latin may require more than a century of
labor, but they constitute, of course, a relatively small corpus. The TLG had
accumulated 99,000,000 words in 2007.
[36] An
individual Latinist, Johann Ramminger, had accumulated a wordlist of later Latin
from Petrarch up through 1700 that was based on 200,000,000 words of text already
available in digital form. Semi-automated methods involving computerized data but
still dependent upon manual analysis of each form may increase productivity by a
factor of two or three, but simply enhancing traditional approaches would require
centuries to provide us with truly comprehensive lexica of Greek and Latin.
Probably no branch of scholarship is older than lexicography, but our traditional
methods do not scale up to the challenges of representing textual materials in
Greek and Latin. We have no choice but to exploit, as vigorously as we can,
automated methods. The essay by
Bamman and Crane in this collection describes some of these methods as
they exist today. The essay by
Finkel and Stump illustrates how automated methods can reconfirm — but
place on a profoundly new foundation — ancient analytical instruments such as the
reduction of Latin verbs to a four dimensional space defined by the traditional
principal parts.
Ultimately, automated and manual methods reinforce one another. Decisions embedded
in print reference materials such as lexica, indices, and grammars can be, at
least in part, extracted and converted into machine actionable data. In effect,
human annotators provide the examples and rules from which automated systems
learn. The automated systems present the results of what they learn when they work
with new materials. Human readers then correct and augment the automated results.
The automated systems then recalculate their statistical models, and the cycle
repeats.
[37] In a mature system, we separate training
data from test data so that we can automatically measure the impact that our
changes have upon performance.
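The cycle described above can be sketched in miniature. The example below is a toy under stated assumptions, not the system used by any project named here: the "model" simply remembers the most frequent lemma for each inflected form, and all data, function names and corrections are hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """Learn the most frequent lemma for each inflected form."""
    counts = defaultdict(Counter)
    for form, lemma in examples:
        counts[form][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def predict(model, form):
    # Fall back to the form itself when it never occurred in training.
    return model.get(form, form)

def accuracy(model, examples):
    correct = sum(predict(model, form) == lemma for form, lemma in examples)
    return correct / len(examples)

# Hypothetical hand-annotated (form, lemma) pairs.
training = [("amat", "amo"), ("amant", "amo"), ("regis", "rex")]
# Held-out test data, never added to the training set.
held_out = [("amat", "amo"), ("regis", "rego"), ("urbem", "urbs")]

model = train(training)
before = accuracy(model, held_out)

# Human readers correct the system's output on new passages. These are
# different tokens from the held-out set, although the same forms recur.
corrections = [("urbem", "urbs"), ("regis", "rego"), ("regis", "rego")]
training += corrections

# The system recalculates its model and is measured again.
model = train(training)
after = accuracy(model, held_out)

print(f"accuracy before corrections: {before:.2f}, after: {after:.2f}")
```

Because the held-out pairs are never used for training, the change in accuracy gives an automatic, if crude, measure of what each round of human correction contributes.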
Complex algorithms can be computationally demanding even when we are working with
small corpora. In preliminary work on sense detection in 2005, we found that by
comparing five different translations with the 150,000 Greek words in Thucydides
we can identify words with many senses in Thucydides: e.g., passages where the
Greek word archê corresponds to “beginning” or to “empire”. It took days
of processing power from a single CPU to identify clusters of word senses in five
translations of the 150,000 words in Thucydides.
[38] Even with more efficient algorithms, analyzing millions of
words and thousands of translations in a half dozen languages would require more
computational power than any desktop system could readily deploy.
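To make the idea of sense detection through translations concrete, the following sketch uses a deliberately simple co-occurrence heuristic. It is not the clustering method used in the 2005 work, and the aligned passages, translations and threshold are invented for illustration: a word counts as a candidate rendering if every translation of a passage uses it and it appears almost only where the Greek word appears.

```python
from collections import Counter, defaultdict

# Hypothetical aligned data: Greek lemmas in each passage plus two translations.
passages = [
    {"greek": {"arche", "polemos"},
     "english": ["the beginning of the war", "the war had its beginning"]},
    {"greek": {"arche", "polis"},
     "english": ["the empire of the city", "the city held an empire"]},
    {"greek": {"polemos", "polis"},
     "english": ["the war of the city", "the city went to war"]},
    {"greek": {"arche"},
     "english": ["they won an empire", "the empire was acquired"]},
]

def shared_words(translations):
    """Words that every translation of a passage uses."""
    word_sets = [set(t.split()) for t in translations]
    return set.intersection(*word_sets)

def candidate_senses(lemma, passages, min_precision=0.9):
    """English words found (almost) only in passages containing the lemma."""
    overall, with_lemma = Counter(), Counter()
    for p in passages:
        words = shared_words(p["english"])
        overall.update(words)
        if lemma in p["greek"]:
            with_lemma.update(words)
    return {w for w, n in with_lemma.items() if n / overall[w] >= min_precision}

def passages_by_sense(lemma, passages):
    """Group passage indices by the candidate rendering they contain."""
    senses = candidate_senses(lemma, passages)
    groups = defaultdict(list)
    for i, p in enumerate(passages):
        if lemma in p["greek"]:
            for w in shared_words(p["english"]) & senses:
                groups[w].append(i)
    return dict(groups)

groups = passages_by_sense("arche", passages)
print(groups)
```

On this toy data the passages containing archê fall into two groups, one for "beginning" and one for "empire". Even this naive approach compares every passage against every translation, which suggests why more serious methods applied to millions of words become computationally demanding.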
The infrastructure of 2008 forces researchers in classics and in the humanities to
develop autonomous, largely isolated, resources. We cannot apply any analysis to
data that is not accessible. We need, at the least, first to be able to gather the data
that is available today and, second, to ensure that we can retrieve the same data
in 2050 or 2110 that we retrieve in 2010.
[39] We need digital
libraries that may be physically distributed in different parts of the world but
that act as a single unit: we need to be able to pose queries such as “find all
Greek editions and modern language translations of Aeschylus,
Persians, lines 1-40” and retrieve machine
actionable results from a variety of sites.
[40]
There are two components to this problem. First, we need libraries that can
preserve collections in the digital world as they have preserved them in the print
world. The institutional repository movement is slowly addressing this
challenge.
[41] Thus, the publications in this
collection are a part of a long-term institutional repository that can manage
static expository prose with very general features such as sections, footnotes,
bibliography etc.
We need, however, more than digital preprints. A second component is the need for
sophisticated citation and reference linking services. Smith’s paper in this
collection,
“Citation in Classical Studies”, describes the system of canonical text citations by which classicists
identify precise chunks of text within the surviving corpus of classical Greek and
Latin. The
Canonical Text Services (CTS) described in this piece begin where
library catalogues end and provide the further layers of granularity essential for
classical scholarship: the CTS provides a common language whereby we can aggregate
information about particular lines in the
Iliad or a
numbered section from a chapter in Thucydides.
[42]
The TEI has developed a shared language whereby humanists can describe the same
phenomena in similar ways so that we can more readily combine documents produced
by different groups. The TEI has many different methods, however, and it is
possible to represent the same phenomenon in many different TEI-compliant ways.
Cayless et al. describe
how experts in Greek inscriptions as a community adapted the very general TEI
framework to their needs, allowing classicists to create documents that are
increasingly interoperable and easy to maintain over time.
Robertson documents research in
methods to describe historical events in a format that is not only machine
actionable but language independent, contributing to the production of
multilingual scholarship.
Dué and
Ebbott describe editorial standards for a new generation of dynamic
digital editions. These new editions do not simply provide a single best attempt
at reconstructing a single text but can dynamically represent multiple versions of
the text as it has appeared over time and provide databases of variants,
conjectures, testimonia and other materials.
Elliott and Gillies look more
generally at how we can then build on these and other services to manage
geographic information about the ancient world in new ways. Wikipedia has provided
a famous and famously successful model for distributed authorship, but classicists
had already begun pioneering such systems in the 1990s.
Mahoney’s article describes the
infrastructure for the Suda On Line project, which has produced translations for
more than 24,000 entries of a fundamental reference work about the classical Greek
world produced in 10th century Byzantium. At the same time,
Finkel and Stump illustrate how
methods from computer science can manage such fundamental structures as Latin
morphology.
And, of course, only a small part of the printed record relevant to classical
Greek and Latin has been — or will be — carefully transcribed and edited. If we
begin to consider the challenge of extracting and analyzing information about
classical Greek and Latin scattered throughout very large collections of books
available as scanned page images, the challenges of storage and computation become
daunting. The collection of essays thus ends with articles about converting print
materials into a form that can support the kinds of services that the previous
articles have articulated.
Rydberg-Cox describes the issues involved in trying to convert early
printed scholarship into a machine actionable form. Later publications lend
themselves much more readily to automated analysis.
Crane et al. consider the
problems and opportunities that emerge for classics as whole research libraries
become available in digital form.
Infrastructure includes not only data, services and physical systems but the
social practices as well.
Figure 6 illustrates some
of the particular elements of the cyberinfrastructure needed for philology. The
papers in this collection illustrate shifts in the practices of classicists as a
new cyberinfrastructure develops:
-
Expository argumentation: While new forms of
scholarship and new intellectual practices are taking shape, we should
emphasize that the collection published here reflects the on-going need for
expository arguments that articulate particular points of view constructed
at a particular time. Nevertheless, even when the
argumentation remains largely traditional in form, the substitution of
dynamic links for static citations can exercise a major impact upon the
content and the audience that publications can reach. Stoa.org was founded
in 1997 to support, among other things, new forms of publication that would
provide rich links to original sources while bringing classics to a broader
audience. Thomas Martin’s Overview of Classical Greek History in the Perseus Digital Library and Ross Scaife’s Diotima, an electronic publication on gender in antiquity, did much to
inspire this goal. All of the publications associated with the Stoa
illustrate forms of publication that were not feasible a generation ago.
Christopher Blackwell’s Demos: Classical Athenian Democracy illustrates how a publication that is traditional in form can exploit
online evidence and publication to provide better documentation on a major
subject to a wider audience than was feasible in print.
-
Collaboration: While the final form of the papers
in this collection may be familiar, their production and content reflect a
fundamental change in scholarly practice: the majority of the papers
published here have multiple authors, while the single-author papers either
report on group projects or on general methods whereby classicists can
create interoperable data.
-
Open access and open source production: All of
the scholars who have contributed to this collection depend upon open access
and open source production. In contrast, Figure
7 illustrates an example of a much more closed form of access. In
cases where authors are making particular arguments at a particular point in
time, open access allows third parties to locate and automatically analyze
what they have produced: search engines such as Google can index and then
deliver their arguments to anyone online; more specialized text mining
systems could analyze what has been written to search for trends in
scholarship or to apply specialized services designed for classics (e.g.,
the ability to recognize strings such as “Thuc. 1.86” as citations to
primary sources).
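The citation recognition mentioned in the last item can be illustrated with a short pattern-matching sketch. The abbreviation table below is a hypothetical three-entry sample, not a standard list, and a production service would need to handle far more variation in how classicists abbreviate authors and works.

```python
import re

# Hypothetical sample of abbreviations; a real service would need the
# full set of conventions used in the field.
ABBREVIATIONS = {
    "Thuc.": "Thucydides",
    "Hdt.": "Herodotus",
    "Hom. Il.": "Homer, Iliad",
}

# Try longer abbreviations first so that multi-part forms match whole.
_abbr_pattern = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)
CITATION = re.compile(
    r"(?P<abbr>" + _abbr_pattern + r")\s+(?P<ref>\d+(?:\.\d+)*)"
)

def find_citations(text):
    """Return (work, reference) pairs for each recognized citation."""
    return [(ABBREVIATIONS[m["abbr"]], m["ref"]) for m in CITATION.finditer(text)]

sample = "The debate at Sparta (Thuc. 1.86) recalls Hdt. 7.104 and Hom. Il. 1.1."
citations = find_citations(sample)
print(citations)
```

Once strings of this kind are recognized, each can be converted into a link to the cited passage, which is only possible at scale when the publications themselves are openly accessible to automated analysis.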
The authors of these papers represent, however, a greater advance than the work
that they have produced so far. In part, this reflects the hope that they will
produce even more in the future. They also represent a new community, one large
enough to foster junior scholars within the field, and in this way they may
indirectly spawn far more productive work than all of them could in the aggregate
produce during their own careers. But more significant than any output is the
sense within this community that the field of classics is being reborn and that
limitations with which many of us grew up are no longer relevant. This new digital
world not only changes what we can do but who can do what. The collection of
essays thus opens with
Blackwell and
Martin’s article about undergraduate research. Before introducing that
discussion, we need to return to the broader topic of classics and the humanities in
a digital environment that has begun to increase the intellectual reach of
humanity as a whole.
Extending the intellectual reach of humanity: eClassics & eHumanities
In short, I say that as a city we are the school
of Hellas; while I doubt if the world can produce a man, who where he has only
himself to depend upon, is equal to so many emergencies, and graced by so happy
a versatility as the Athenian.
(Pericles’ Funeral Oration, Thuc. 2.41.1)
We look to a new digital infrastructure not only so that we can increase the body
of published information about classical Greek and Latin but so that these
languages can play an increased role in the intellectual life of humanity. We can
do this in two ways. First, we can create environments that more fully engage
those already working with Greek and Latin — we have already begun to address this
by creating searchable corpora of Greek and Latin, by making secondary sources
available online as PDF files or by adding links between inflected words in a text
and their dictionary entries and thus reducing time spent flipping through large
dictionaries. These all reduce the time between when we pose a question and when
we receive an answer. It would be hard to overstate the degree to which
cost-benefit decisions, often unconscious, shape the directions that we take in
our intellectual lives. Classicists have for millennia understood the difference
between being in a small, poorly organized collection and a large collection in
which it is easy to find what we want. Cyberinfrastructure provides new threads
that we can follow through the vast body of published information.
The second way to increase the role of classical Greek and Latin is to engage more
people in reading and thinking about these languages. Anecdotal evidence suggests
that this began to happen as soon as substantial bodies of Greek and Latin became
available to the general public. Perseus quickly received letters from students in
isolated locations such as rural homes and naval vessels at sea who were using
online lexica and texts. Even more interesting, people who had studied Greek and
Latin decades before found that the reading support tools available online gave
them the support that they needed to begin reading Greek and Latin again.
The first paragraph in the opening “Call to action” of
the National Science Foundation’s 2007 “Cyberinfrastructure
Vision for 21st Century Discovery” calls for “an
individualized health model of every human being for personalized health care
delivery” (“Cyberinfrastructure Vision for 21st Century
Discovery”, March 2007: page 5). Such models would open up new methods whereby doctors
and patients could not only determine the best courses of treatment for disease
but also identify potential problems and predispositions in advance. Health
records that include decades of medical tests and case histories clearly raise
daunting issues of confidentiality, but the potential benefits are enormous.
Emergent cyberinfrastructure for health care thus includes both methods to
represent our particular background in great detail and a major investment in
maintaining personal privacy.
The same instruments developed for health care can be adapted for our intellectual
backgrounds. We can begin to devise ways for us to keep track of what we have
learned so that we can receive background information customized for our
particular needs when we confront a new object of study.
[44]
Figure 8 illustrates a system that compares an
arbitrary text of Latin against a model of the vocabulary that a particular reader
has encountered, then calculates which words have been seen before and which are
new. Seen words can then be associated with the places where they have been seen
in the past, while unseen words can be ranked by their importance according to
various criteria (e.g., numerical frequency, relevance to a particular theme etc.).
The implementation is conceptually simple but represents the first stage in an
open-ended process. As our data sources improve, we can look for more complex
linguistic phenomena such as syntax and semantics (e.g., a new sense of a seen
word). As our learning models grow more sophisticated, we can begin helping
readers identify areas of weakness on which they can focus to enhance their
ability to read with fluency.
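The comparison of a text against a reader's vocabulary can be sketched as follows. This is a minimal illustration rather than the system shown in the figure: it assumes the text has already been lemmatized, and the reading history, location labels and frequency counts are invented.

```python
def classify_vocabulary(text_lemmas, reader_history, corpus_frequency):
    """Split a text's lemmas into seen and unseen for one reader.

    Seen lemmas are mapped to the places where the reader met them;
    unseen lemmas are ranked by corpus frequency, most frequent first.
    """
    seen, unseen = {}, []
    for lemma in dict.fromkeys(text_lemmas):  # unique lemmas, original order
        if lemma in reader_history:
            seen[lemma] = reader_history[lemma]
        else:
            unseen.append(lemma)
    unseen.sort(key=lambda lemma: corpus_frequency.get(lemma, 0), reverse=True)
    return seen, unseen

# Hypothetical model of one reader: lemma -> where it was encountered.
reader_history = {
    "sum": ["text A, line 3"],
    "in": ["text A, line 1", "text B, line 7"],
    "omnis": ["text B, line 2"],
}
# Hypothetical corpus counts, used only to rank unfamiliar words.
corpus_frequency = {"pars": 900, "tres": 400, "divido": 150, "Gallia": 60}

# Lemmas of "Gallia est omnis divisa in partes tres", assumed to come
# from an upstream morphological analyzer.
text_lemmas = ["Gallia", "sum", "omnis", "divido", "in", "pars", "tres"]

seen, unseen = classify_vocabulary(text_lemmas, reader_history, corpus_frequency)
print("seen:", seen)
print("new, by frequency:", unseen)
```

Ranking by raw frequency is only one of the criteria mentioned above; replacing the sort key with a measure of relevance to a theme would change the ranking without changing the rest of the design.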
Even small advances in our ability to work with multiple languages can be
important if they open up historical languages to new audiences, whether these
audiences are professional researchers using more linguistic sources or members of
the public reading Greek poetry that they would not otherwise have experienced.
The biggest benefits are likely to come when we open up linguistic materials to
audiences with little or no training in the language. None of us has the
opportunity to become familiar with more than a handful of languages. None of us
can, in print culture, work with un-translated sources in dozens of languages.
Classics can, however, show how knowledge about an ancient culture can be designed
to serve the speakers of multiple languages. The traditional method is for
communities to choose a lingua franca — Akkadian, Greek, Latin, French, German,
and now English have all served as common languages of diplomacy and scholarship.
The speakers of an unbounded set of local languages communicate by learning one of
these linguae francae — thus, the Chinese businessman in a Damascus hotel will
probably carry on his business in English. Classicists are more broad-minded but
generally expect scholars to publish materials in English, French, German and
Italian. Speakers of Croatian or Modern Greek must learn these languages if they
are to gain access to most information about the Greco-Roman world.
Classicists can, however, design their cyberinfrastructure from the start to be as
portable as possible across multiple languages. There are at least three basic
strategies, the third and most important of which is peculiarly suitable to
historical fields where primary sources are finite and heavily studied.
First, we need to be able to optimize machine translation for the field of
classics.
[45] We can develop statistical models that
capture the idiosyncrasies of documents about Greco-Roman culture. We develop
these models by adding markup, using a combination of manual and automated
methods, to finite bodies of material as training sets. Machine learning systems
then scan these bodies and recognize that Alexandria usually refers to the city in
Egypt and almost never to the suburb of Washington, DC, by that name. An ambiguous
word such as “case” probably designates a grammatical case in a Greek grammar and
a display case in a museum catalogue. These domain specific features, once
identified, can help general machine translation systems avoid many of the worst
problems they face and improve the quality of their output.
Second, we need to include as much basic information as we can in forms from which
it can be converted into multiple languages. Thus, if we represent birth and
death dates in a generic form, we can then develop modules to represent that
knowledge in multiple languages.
[46] Some ontologies such as the
CIDOC-CRM for museum objects and
FRBR for books have
been under development for years and can represent a great deal of basic
background information.
[47]
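A small sketch may clarify what a generic representation of birth and death dates could look like. It does not follow CIDOC-CRM or FRBR; the record layout, the two language modules and the sample dates (conventional approximate dates for Thucydides) are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LifeDates:
    """Language-independent record: negative years are BCE."""
    birth: int
    death: int
    birth_uncertain: bool = False
    death_uncertain: bool = False

# Hypothetical language modules; each supplies only words and patterns.
TEMPLATES = {
    "en": {"born": "born", "died": "died", "circa": "c. ",
           "bce": "{y} BCE", "ce": "{y} CE"},
    "de": {"born": "geboren", "died": "gestorben", "circa": "ca. ",
           "bce": "{y} v. Chr.", "ce": "{y} n. Chr."},
}

def format_year(year, uncertain, lang):
    t = TEMPLATES[lang]
    era = t["bce"] if year < 0 else t["ce"]
    return (t["circa"] if uncertain else "") + era.format(y=abs(year))

def render(dates, lang):
    """Express the same record in the requested language."""
    t = TEMPLATES[lang]
    born = format_year(dates.birth, dates.birth_uncertain, lang)
    died = format_year(dates.death, dates.death_uncertain, lang)
    return f'{t["born"]} {born}, {t["died"]} {died}'

# Sample data only: both dates are marked as uncertain.
thucydides = LifeDates(birth=-460, death=-400,
                       birth_uncertain=True, death_uncertain=True)

english = render(thucydides, "en")
german = render(thucydides, "de")
print(english)
print(german)
```

Supporting a further language means adding one more entry to the table of templates; the underlying record, including its statement of uncertainty, is entered once and does not change.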
Third, canonical literary texts attract very large amounts of labor. We can use
that labor to create databases of linguistic annotations that describe syntax
(e.g., the subject and object of a verb), co-reference (e.g., which person is the
subject of a particular verb), and semantics (e.g., where
oratio corresponds to “prayer” rather than “oration” or some other concept).
These annotations stored in treebanks and other linguistic databases not only
allow us to put our understanding of Greek and Latin on a wholly new, quantifiable
foundation but can resolve the ambiguities that bedevil machine translation and
can ultimately support higher quality machine translation.
[48] Such
annotations are expensive but are, in effect, the digital successors to print
editions. Where print editors labored to resolve ambiguities and problems in the
textual tradition, digital editors provide machine actionable annotations that
resolve, where possible, ambiguities in the reconstructed texts.
The problem of multilingual knowledge thus breaks down into language independent
and language dependent phases.
Knowledge bases (e.g., basic propositional statements) and linguistic annotation
can be created by speakers of any language. The tag sets of ontologies and
annotation schemes are relatively contained and can themselves be translated,
allowing authors to work entirely with Greek, Latin and their own primary
languages: the birthdate of a given author may be uncertain but that uncertainty
can be represented in a general form by the speaker of any language. We may differ
in how we construe the syntax of a sentence, but anyone who knows Greek,
regardless of their native language, can decide which word depends on which and
represent this in a common format.
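As an illustration of such a common format, the sketch below records which word depends on which in a three-word Latin sentence and then displays the same annotation with labels in two languages. The tag names and their translations are hypothetical and are not drawn from any particular treebank's annotation scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    id: int
    form: str
    head: int       # id of the governing word; 0 marks the root
    relation: str   # language-independent tag

# "puella rosam amat" (the girl loves the rose), annotated by hand.
sentence = [
    Token(1, "puella", 3, "SBJ"),
    Token(2, "rosam", 3, "OBJ"),
    Token(3, "amat", 0, "PRED"),
]

# The tag set is small, so it can itself be translated.
TAG_LABELS = {
    "en": {"SBJ": "subject", "OBJ": "object", "PRED": "predicate"},
    "it": {"SBJ": "soggetto", "OBJ": "oggetto", "PRED": "predicato"},
}

def dependents(sentence, head_id, relation):
    """Forms that depend on the given word with the given relation."""
    return [t.form for t in sentence
            if t.head == head_id and t.relation == relation]

def describe(sentence, lang):
    """Show the same annotation with labels in the reader's language."""
    labels = TAG_LABELS[lang]
    return [f"{t.form}: {labels[t.relation]}" for t in sentence]

print(dependents(sentence, 3, "SBJ"))
print(describe(sentence, "en"))
print(describe(sentence, "it"))
```

The annotation itself contains only Latin word forms, numeric identifiers and tags, so an annotator working in Italian and a reader working in English consult exactly the same data.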
Communities that want to make publications in their own languages accessible to
wider audiences will have to develop the training sets for documents about
classics. The results will not be perfect but readers can then use dictionary
lookups and other translation aids to more closely study the original language.
Each language needs its own training sets but this approach will not only make
publications in the traditional languages of publication accessible to wider
audiences but will also open up publications in less widely read languages (e.g.
Croatian and Dutch) to much larger audiences.
Communities that want to be able to read basic knowledge about the Greco-Roman
world in their own language will need machine translation that can be optimized
for classics, language specific drivers that can convert the basic knowledge
from ontologies into their language, and systems that can exploit the dense
linguistic annotations available for major canonical source texts.
The creation of knowledge bases designed from the start to flow from language to
language would be a radical change from traditional scholarly practice.
Nevertheless, there are profound strategic reasons for this new form of
scholarship in the two major classes of society that produce scholarship about the
Greco-Roman world.
Classical Greek and Latin are the foundational languages of Europe and were the
languages of high culture and trans-European discourse until relatively recent
times — in fact, Turkey, whatever its religious background, would only restore to
Europe a region that had been lost to it from the past. The European Union has a
commitment to make the cultural heritage of its nations intellectually accessible
to the widest possible audience. This implies an infrastructure that maximizes
what can be learned not only in English, French, German, and Italian, but in all
of the other official languages of Europe.
[49]
The United States, Canada, Australia, New Zealand, and South Africa are, however,
not only geographically distinct from Europe but are fashioning themselves into
cosmopolitan societies, European in origin but creating new identities with roots
from every civilization of humanity. The United States has in particular
identified Chinese and Arabic as the two strategic languages on which it will
concentrate its resources. While Europe concentrates on making its cultural
heritage accessible to the speakers of its official languages, American scholars
can take the lead in making classical antiquity increasingly accessible to
speakers of Chinese, Arabic and other languages. Ultimately, the increased
distribution of Greco-Roman cultural materials into many other languages will
speed the complementary process of opening up materials in classical Chinese,
Arabic, Sanskrit and other languages to speakers of English and other European
languages. Our larger goal must be to make the record of humanity accessible to
everyone regardless of linguistic and cultural background.
While a linguistically and culturally portable knowledge base about the
Greco-Roman world may seem daunting, the tools already at hand allow us to rethink
not only who can read and consume primary and secondary sources but who can
contribute substantively to the field.
Blackwell and Martin’s essay
opens this collection by describing how the practices of undergraduates have begun
to change. The rise of undergraduate research is arguably the most important and
promising development for classics as a discipline since classics lost its
privileged position. Before we can appreciate the possibilities of the technology
now available but not yet fully exploited, we need to see how much classicists
have already begun to accomplish.
Before turning to the prospects for undergraduate and more general non-specialist
research in classics, we should emphasize that the essays published
here themselves illustrate the greatest achievement of classical philology in this
digital world. We now have a critical mass of classicists who are committed to
building and exploiting the evolving digital infrastructure upon which all
scholarship and teaching in our field will depend. While discussions of digital
humanities still revert to the problem of tenure and promotion, several of the
contributors to this collection have already earned tenure by pursuing digital
projects. All of the authors here are able to review innovative forms of digital
scholarship on its intellectual merits, neither penalizing nor rewarding the use of
digital technologies per se but assessing the degree to which the new work
advances our ancient and unchanging goals to bring the Greco-Roman heritage in
general and ancient Greek and Latin in particular ever more fully to life in the
minds of the broadest audience possible.
No one showed more vision and patience to create this community than our colleague
and beloved friend, Allen Ross Scaife. He showed the way with his own pioneering
work on Diotima, a digital representation of women
in antiquity. As director of the Stoa from its
founding until his death ten years later, Ross always understood that the greatest
resource for any field was the people whom it attracted. Ross supported, fostered,
encouraged, and advanced careers that will continue now for decades and will shape
other careers as well. “Do not lament,” the
Pericles of Thucydides (1.143.5) tells the Athenians, “houses and land but people, for it is not houses and land that
acquire people but people who acquire them.” The passing of Ross Scaife
wounds the field of classics more deeply than would have the loss of everything
that the field as a whole has produced. But the community that Ross fostered with
intelligence, patience and love and that produced these essays is greater than any
single achievement that their authors could ever produce.