<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="../../common/schema/DHQPublish.rng" type="xml"?>
<DHQarticle xmlns="http://digitalhumanities.org/DHQ/namespace"
  xmlns:cc="http://web.resource.org/cc/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <DHQheader>
    <title>Computational Linguistics and Classical Lexicography</title>
    
    <author>
      <name>David <family>Bamman</family></name>
      <affiliation>Tufts University</affiliation>
      <email>David.Bamman@tufts.edu</email>
      <bio><p>David Bamman is a senior researcher in computational linguistics for 
        the Perseus Project, focusing especially on natural language 
        processing for Latin and Greek, including treebank construction, 
        computational lexicography, morphological tagging and word sense 
        disambiguation. David received a BA in Classics from the University of 
        Wisconsin-Madison and an MA in Applied Linguistics from Boston 
        University. He is currently leading the development of the Latin 
        Dependency Treebank and the Dynamic Lexicon Project.</p></bio>
    </author>
    <author>
      <name>Gregory <family>Crane</family></name>
      <affiliation>Tufts University</affiliation>
      <email>gregory.crane@tufts.edu</email>
      <bio><p>Gregory Crane, Professor of Classics and Winnick Family Chair of
        Technology and Entrepreneurship at Tufts University, is the editor in
        chief of the Perseus Project. He has a broad interest in and has
        published extensively on the interaction between intellectual practice
        and technological infrastructure in the humanities.</p></bio>
    </author>
    <publicationStmt>
      <idno type="DHQarticle-id">000033</idno>
      <idno type="volume">003</idno>
      <idno type="issue">1</idno>
      <issueTitle>Winter 2009</issueTitle>
      <specialTitle>Changing the Center of Gravity: Transforming Classical Studies Through Cyberinfrastructure</specialTitle>
      <articleType>article</articleType>
      <date when="2009-02-26">26 February 2009</date>
      <availability>
        <cc:License rdf:about="http://creativecommons.org/licenses/by-nc-nd/2.5/"/>
      </availability>
    </publicationStmt>
    <langUsage>
      <language id="en"/>
    </langUsage>
    <history>
      <revisionDesc>
        <change when="2009-01-29" who="JHF">Reviewed XML, fixed header.</change>
        <change when="2009-01-30" who="CRB">Added publicationStmt element and associated content.</change>
        <change when="2009-01-30" who="JHF">Fixed encoding of notes.</change>
        <change when="2009-02-10" who="JHF">Made corrections based on author review.</change>
        
      </revisionDesc>
    </history>
    <abstract>
      <p>Manual lexicography has produced extraordinary results for Greek and Latin, but it cannot
        in the immediate future provide for all texts the same level of coverage available for the
        most heavily studied materials. As we build a cyberinfrastructure for Classics in the
        future, we must explore the role that automatic methods can play within it. Using
        technologies inherited from the disciplines of computational linguistics and computer
        science, we can create a complement to these traditional reference works – a dynamic lexicon
        that presents statistical information about a word’s usage in context, including information
        about its sense distribution within various authors, genres and eras, and syntactic
        information as well.</p>
    </abstract>
    <teaser>Automated methods for lexicography</teaser>
  </DHQheader>
  <text>
    <epigraph>
      <quote rend="block">
        <p>...Great advances have been made in the sciences on which lexicography depends. Minute
          research in manuscript authorities has largely restored the texts of the classical
          writers, and even their orthography. Philology has traced the growth and history of
          thousands of words, and revealed meanings and shades of meaning which were long unknown.
          Syntax has been subjected to a profounder analysis. The history of ancient nations, the
          private life of the citizens, the thoughts and beliefs of their writers have been closely
          scrutinized in the light of accumulating information. Thus the student of to-day may
          justly demand of his Dictionary far more than the scholarship of thirty years ago could
          furnish.</p>
      </quote>
      <ref target="#lewisshort">Advertisement for the Lewis &amp; Short Latin
        Dictionary, March 1, 1879.</ref>
    </epigraph>
    <div id="div1">
      <p>The “scholarship of thirty years ago” that Lewis and Short here distance themselves from is
        Andrews' 1850 <ref target="#andrews1850">Latin-English lexicon</ref>, itself largely a
        translation of Freund’s German <ref target="#freund1840">Wörterbuch</ref> published only a
        decade before. As we design a cyberinfrastructure to support Classical Studies in the
        future, we will soon cross a similar milestone: the <ref target="#old">Oxford Latin
          Dictionary</ref> (1968-1982) has begun the slow process of becoming thirty years old
        (several of the earlier fascicles have already done so) and by 2012 the eclipse will be
        complete. Founded on the same lexicographic principles that produced the juggernaut
          <emph>Oxford English Dictionary</emph>, the <emph>OLD</emph> is a testament to the
        extraordinary results that rigorous manual labor can provide. It has, along with the <ref
          target="#tll">Thesaurus Linguae Latinae</ref>, provided extremely thorough coverage for
        the texts of the Golden and Silver Age in Latin literature and has driven modern scholarship
        for the past thirty years.</p>
      <p>Manual methods, however, cannot in the immediate future provide for all texts the same
        level of coverage available for the most heavily studied materials, and as we think toward
        Classics in the next ten years, we must think not only of desiderata, but also of the means that would get
        us there. Like Lewis and Short, we can also say that great advances have been made over the
        past thirty years in the sciences underlying lexicography; but the “sciences” that we group
        in that statement include not only the traditional fields of paleography, philology, syntax
        and history, but computational linguistics and computer science as well.</p>
      <p>Lexicographers have long used computers as an aid in dictionary production, but the recent
        rise of statistical language processing now lets us do far more: instead of using computers
        to simply expedite our largely manual labor, we can now use them to uncover knowledge that
        would otherwise lie hidden in expanses of text. Digital methods also let us deal well with
        scale. For instance, while the <emph>OLD</emph> focused on a canon of Classical authors that
        ends around the second century CE, Latin continued to be a productive language for the
        ensuing two millennia, with prolific writers in the Middle Ages, Renaissance and beyond. The
        Index Thomisticus <ptr target="#thomisticus"/> alone contains 10.6 million words attributed
        to Thomas Aquinas and related authors, which is by itself larger than the entire corpus of
        extant classical Latin.<note>The Bibliotheca Teubneriana BTL-1 collection, for instance, contains 6.6 million
          words, covering Latin literature up to the second century CE. For a recent overview of the
          Index Thomisticus, including the corpus size and composition, see <ref target="#busa2004">Busa
            (2004)</ref>.</note> Many handcrafted lexica exist for this period,
        from the scale of individual authors (cf. Ludwig Schütz’ 1895 <ref target="#schutz1895"
          >Thomas-Lexikon</ref>) to entire periods (e.g., J. F. Niermeyer’s 1976 <ref
          target="#niermeyer1976">Mediae Latinitatis Lexicon Minus</ref>), but we can still do more:
        we can create a dynamic lexicon that can change and grow when fed with new texts, and that
        can present much more information about a word than reference works bound by the conventions
        of the printed page.</p>
      <p>In deciding how we want to design a cyberinfrastructure for Classics over the next ten
        years, there is an important question that lurks between “where are we now?” and
        “where do we want to be?”: where are our colleagues already? Computational
        linguistics and natural language processing generally perform best in high-resource
        languages – languages like English, on which computational research has been focusing for
        over sixty years, and for which expensive resources (such as treebanks, ontologies and
        large, curated corpora) have long been developed. Many of the tools we would want in the
        future are founded on technologies that already exist for English and other languages; our
        task in designing a cyberinfrastructure may simply be to transfer and customize them for
        Classical Studies. Classics has arguably the best-curated collection of texts in the
        world, and the uses its scholars demand from that collection are unique. In the following I
        will document the technologies available to us in creating a new kind of reference work for
        the future – one that complements the traditional lexicography exemplified by the
        <emph>OLD</emph> and the <emph>TLL</emph> and lets scholars interact with their texts in new
        and exciting ways.</p>
    </div>
    <div id="div2">
      <head>Where are we now?</head>
      <p>In answering this question, I am mainly concerned with two issues: the production of
        reference works (i.e., the act of lexicography) and the use that scholars make of them.</p>
      <p>All of the reference works available in Classics are the products of manual labor, in which
        highly skilled individuals find examples of a word in context, cluster those examples into
        distinguishable “senses,” and label those senses with a word or phrase in another language
        (like English) or in the source language (as with the <emph>TLL</emph>). In the past thirty
        years, computers have allowed this process to be significantly expedited, even in such
        simple ways as textual searching. Rather than relying on a vast network of volunteer readers
        to read through scores of books and write down “apt” sentences as they come across them (as
        with the <emph>OED</emph>), we can simply search our electronic corpora, find all examples
        of a word in context, and winnow through them sequentially to find those that most clearly
        illuminate the meaning of any given sense. This approach has been exploited most recently by
        the Greek Lexicon Project<note>See <ref target="http://people.pwf.cam.ac.uk/blf10/GLP/Greek_Lexicon_Project.htm">http://people.pwf.cam.ac.uk/blf10/GLP/Greek_Lexicon_Project.htm</ref>.
        </note> at the University of Cambridge, which has been
        developing a <emph>New Greek Lexicon</emph> since 1998 using a large database of
        electronically compiled slips (with a target completion date of 2010). Here the act of
        lexicography is still very manual, as each dictionary sense is still heavily curated, but
        the tedious job of citation collection is not.</p>
      <p>We can contrast this computer-assisted lexicography with a new variety – which we might
        more properly call “computational lexicography” – that emerged with the COBUILD project
          <ptr target="#sinclair1987"/> of the late 1980s. The <emph>COBUILD English Language
          Dictionary</emph> (1987) is a learner’s dictionary centered around a word’s use in
        context, and is created from an analysis of an evolving English textual corpus (the Bank of
        English, on which current editions of the COBUILD dictionary are based, was officially
        launched in 1991 and now includes 524 million words<note id="boe">See <ref target="http://www.collins.co.uk/books.aspx?group=153">http://www.collins.co.uk/books.aspx?group=153</ref>.
        </note>). This corpus
        evidence allows lexicographers to include frequency information as part of a word’s entry
        (helping learners concentrate on common words) and also to include sentences from the corpus
        that demonstrate a word’s common collocations – the words and phrases that it frequently
        appears with. By keeping the underlying corpus up to date, the editors are also able to add
        new headwords as they appear in the language, and common multi-word expressions and idioms
        (such as <emph>bear fruit</emph>) can be uncovered as well.</p>
      <p>This corpus-based approach has since been augmented in two dimensions. On the one hand,
        dictionaries and lexicographic resources are being built on larger and larger textual
        collections: the German <emph>elexiko</emph> project <ptr target="#elexiko"/>, for instance,
        is built on a modern German corpus of 1.3 billion words, and we can expect much larger
        projects in the future as the web is exploited as a corpus.<note>In 2006, for example, Google released the first version of its Web 1T 5-gram
          corpus <ptr target="#brants2006"/> – a collection of n-grams (n=1-5) and their frequencies
          calculated from 1 trillion words of text on the web.</note> At the
        same time, researchers are also subjecting their corpora to more complex automatic processes
        to extract more knowledge from them. While word frequency and collocation analysis is
        fundamentally a task of simple counting, projects such as Kilgarriff’s Sketch Engine <ptr
          target="#kilgarriff2004"/> enable lexicographers to induce information about a word’s
        grammatical behavior as well.</p>
      <p>In their ability to include statistical information about a word’s actual use, these
        contemporary projects are exploiting advances in computational linguistics that have been
        made over the past thirty years. Before turning, however, to how we can adapt these
        technologies in the creation of a new and complementary reference work, we must first
        address the use of such lexica.</p>
      <p>Like the <emph>OED</emph>, Classical lexica generally include a list of citations under
        each headword, providing testimony by real authors for each sense. Of necessity, these
        citations are usually only exemplary selections, though the <emph>TLL</emph> provides
        comprehensive listings by Classical authors for many of its lemmata. These citations
        essentially function as an index into the textual collection. If I am interested in the
        places in Classical literature where the verb <emph>libero</emph> means <emph>to
        acquit</emph>, I can consult the <emph>OLD</emph> and then turn to the source texts it
        cites: Cic. <emph>Ver</emph>. 1.72, Plin. <emph>Nat</emph>. 6.90, etc. For a more
        comprehensive (but not exhaustive) comparison, I can consult the <emph>TLL</emph>.</p>
      <p>This is what we might consider a manual form of “lemmatized searching.” The Perseus Digital
        Library<note>
          See <ref target="http://www.perseus.tufts.edu/hopper">http://www.perseus.tufts.edu/hopper/</ref>.
        </note> and the Thesaurus Linguae Graecae<note>See <ref target="http://www.tlg.uci.edu/">http://www.tlg.uci.edu/</ref>.
        </note> both
        provide a form of lemmatized searching for their respective texts, but it is a fuzzier
        variety than that presented here: a user can search for a lemma such as <emph>edo</emph>
          (<emph>to eat</emph>) and simultaneously search the texts for all of its various
        inflections, but ambiguity is rampant – a lemmatized search for <emph>edo</emph> would also
        retrieve <emph>est</emph>, which is equally an inflection of the far more common
        <emph>sum</emph> (<emph>to be</emph>). The search results are thus significantly diluted by
        a large number of false positives.</p>
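      <p>The dilution described above can be illustrated with a small sketch. The Python below is
        a toy under stated assumptions, not the search mechanism of Perseus or the TLG: the form
        table is a hand-picked sample rather than a full paradigm, and the names
        <code>FORMS_BY_LEMMA</code>, <code>lemmas_for</code> and <code>lemmatized_search</code> are
        hypothetical.</p>
      <eg lang="python"><![CDATA[
```python
# Toy table of inflected forms per lemma (a tiny illustrative sample, not a full paradigm).
FORMS_BY_LEMMA = {
    "edo": {"edo", "est", "esse", "edunt"},
    "sum": {"sum", "est", "esse", "sunt"},
}

def lemmas_for(form):
    """Return, in sorted order, every lemma in the toy table that can yield this form."""
    return sorted(lemma for lemma, forms in FORMS_BY_LEMMA.items() if form in forms)

def lemmatized_search(lemma, tokens):
    """Return (position, token, candidate lemmas) for each token matching a form of the lemma."""
    forms = FORMS_BY_LEMMA[lemma]
    return [(i, tok, lemmas_for(tok)) for i, tok in enumerate(tokens) if tok in forms]

# Caesar, Gallic War 1.1: here "est" belongs to sum, yet a search for edo retrieves it.
tokens = "gallia est omnis divisa in partes tres".split()
hits = lemmatized_search("edo", tokens)

# Hits whose form is shared by more than one lemma are potential false positives.
false_positive_candidates = [hit for hit in hits if len(hit[2]) > 1]
```
]]></eg>
      <p>In this toy case the only hit for <emph>edo</emph> is an instance of <emph>est</emph> that
        a reader would assign to <emph>sum</emph>; the search itself has no way to tell the two
        apart.</p>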
      <p>The advantage of the Perseus and TLG lemmatized search is that it gives scholars the
        opportunity to find all the instances of a given word form or lemma in the textual
        collections they each contain. The <emph>TLL</emph> may be built on a comprehensive
        collection of 10 million slips containing all of Latin literature up to 200 CE and
        selections beyond, but that complete collection is housed only in its archives;
        what we have in print and on CD-ROM is still only a sample. The <emph>TLL</emph>, however,
        is impeccable in precision, while the Perseus and TLG results are dirty. What we need is a
        resource to combine the best of both.</p>
    </div>
    <div id="div3">
      <head>Where do we want to be?</head>
      <p>The <emph>OLD</emph> and <emph>TLL</emph> are not likely to become obsolete anytime soon;
        as the products of highly skilled editors and over a century of labor, the sense
        distinctions within them are highly precise and well substantiated. What we can provide in the near future, 
        however, is a complement to these resources, one that presents statistics about a
        word’s actual usage in texts – and not only in texts from the Classical period, but from any
        era for which we have electronic corpora. Heavily curated reference works provide great
        detail for a small set of texts; our complement is to provide less detail for
        <emph>all</emph> texts.</p>
      <p>In order to accomplish this, we need to consider the role that automatic methods can play
        within our emerging cyberinfrastructure. I distinguish cyberinfrastructure from the vast
        corpora that exist for modern languages not only in the structure imposed upon the texts
        that make it up, but also in the very composition of those texts: while modern reference
        corpora are typically of little interest in themselves (being mainly newswire), Classical texts
        have been the focus of scholars’ attention for millennia. The meaning of the word
          <emph>child</emph> in a single sentence from the <emph>Wall Street Journal</emph> is
        hardly a research question worth asking, except for the newspaper’s significance in being
        representative of the language at large; but this same question when asked of Vergil’s
        fourth <emph>Eclogue</emph> has been at the center of scholarly debate since the time of the
        emperor Constantine.<note id="bourne">See <ptr target="#bourne1916"/> for an overview of
          <emph>puer</emph> in <emph>Ec.</emph> IV.</note> We need to provide traditional scholars with the
        apparatus necessary to facilitate their own textual research. This will be true of a
        cyberinfrastructure for any historical culture, and for any future structure that develops
        for modern scholarly corpora as well.</p>
      <p>We therefore must concentrate on two problems. First, how much can we automatically learn
        from a large textual collection using machine learning techniques that thrive on large
        corpora? And second, how can the vast labor already invested in handcrafted lexica help
        those techniques to learn?</p>
      <p>What we can learn from such a corpus is actually quite significant. With a large bilingual
        corpus, we can induce a word sense inventory to establish a baseline for how frequently
        certain definitions of a word are manifested in actual use; we can also use the context
        surrounding each word to establish which particular definition is meant in any given
        instance. With the help of a treebank (a handcrafted collection of syntactically parsed
        sentences), we can train an automatic parser to parse the sentences in a monolingual corpus
        and extract information about a word’s subcategorization frames (the common syntactic
        arguments it appears with – for instance, that the verb <emph>dono</emph> (to give) requires
        a subject, direct object and indirect object), and selectional preferences (e.g., that the
        subject of the verb <emph>amo</emph> (to love) is typically animate). With clustering
        techniques, we can establish the semantic similarity between two words based on their
        appearance in similar contexts.</p>
      <p>If we leverage all of these techniques to create a lexicon for both Latin and Greek, the
        lexical entries in each reference work could include the following:</p>
      <list type="ordered">
        <item>a list of possible senses, weighted according to their probability;</item>
        <item>a list of instances of each sense in the source texts;</item>
        <item>a list of common subcategorization frames, weighted according to their probability;
          and</item>
        <item>a list of selectional preferences, weighted according to their probability.</item>
      </list>
      <p>In creating a lexicon with these features, we are exploring two strengths of automated
        methods: they can not only analyze very large bodies of data but also provide customized
        analysis for particular texts or collections. We can thus not only identify patterns in one
        hundred and fifty million words of later Latin but also compare which senses of which words
        appear in the one hundred and fifty thousand words of Thucydides. <ref target="#fig01"
          >Figure 1</ref> presents a mock-up of what a dictionary entry could look like in such a
        dynamic reference work. The first section (“Translation equivalents”) presents items 1 and 2
        from the list, and is reminiscent of traditional lexica for classical languages: a list of
        possible definitions is provided along with examples of use. The main difference between a
        dynamic lexicon and those print lexica, however, lies in the scope of the examples: while
        print lexica select one or several highly illustrative examples of usage from a source text,
        we are in a position to present far more. <figure id="fig01">
          <label>Mock-up of a sample entry in a dynamic lexicon</label>
          <graphic url="resources/images/libero.png"/>
          <figDesc>Mock-up of a sample entry in a dynamic lexicon.</figDesc></figure>
      </p>
    </div>
    <div id="div4">
      <head>How do we get there?</head>
      <p>We have already begun work on a dynamic lexicon like that shown in <ref target="#fig01"
          >Figure 1</ref> <ptr target="#bamman2008"/>. Our approach is to use already established methods in natural language
        processing; as such, our methodology involves the application of three core technologies:</p>
      <list type="ordered">
        <item>identifying word senses from parallel texts;</item>
        <item>locating the correct sense for a word using contextual information; and</item>
        <item>parsing a text to extract important syntactic information.</item>
      </list>
      <p>Each of these technologies has a long history of development both within the Perseus
        Project and in the natural language processing community at large. In the following I will
        detail how we can leverage them all to uncover large-scale usage patterns in a text.</p>
      <div id="div4a">
        <head>Word Sense Induction</head>
        <p>Our work on building a Latin sense inventory from a small collection of parallel texts in
          our digital library is based on that of <ref target="#brown1991">Brown et al. 1991</ref>
          and <ref target="#gale1992">Gale et al. 1992</ref>, who suggest that one way of
          objectively detecting the real senses of any given word is to analyze its translations: if
          a word is translated as two semantically distinct terms in another language, we have
            <emph>prima facie</emph> evidence that there is a real sense distinction. So, for
          example, the Greek word <emph>archê</emph> may be translated in one context as
            <emph>beginning</emph> and in another as <emph>empire</emph>, corresponding respectively
          to LSJ definitions I.1 and II.2.</p>
        <p>Finding all of the translation equivalents for any given word then becomes a task of
          aligning the source text with its translations, at the level of individual words. The
          Perseus Digital Library contains at least one English translation for most of its Latin
          and Greek prose and poetry source texts. Many of these translations are encoded under the
          same canonical citation scheme as their source, but must further be aligned at the
          sentence and word level before individual word translation probabilities can be
          calculated. The workflow for this process is shown in <ref target="#fig02">Figure 2</ref>.</p>
        <figure id="fig02">
          <label>Alignment workflow</label>
          <graphic url="resources/images/alignment.jpg"/>
          <figDesc>Alignment workflow.</figDesc>
        </figure>
        <p>Since the XML files of both the source text and its translations are marked up with the
          same reference points, “chapter 1, section 1” of Tacitus' <emph>Annales</emph> is
          automatically aligned with its English translation (step 1). This results (for Latin at
          least) in aligned chunks of text that are 217 words long. These chunks are then aligned on
          a sentence level in step 2 using Moore’s Bilingual Sentence Aligner <ptr
            target="#moore2002"/>, which aligns sentences that are 1-1 translations of each other
          with a very high precision (98.5% for a corpus of 10,000 English-Hindi sentence pairs <ptr
            target="#singh2005"/>).</p>
        <p>In step 3, we then align these 1-1 sentence pairs at the word level using GIZA++ <ptr
            target="#giza"/>. Prior to alignment, all of the tokens in the source text and
          translation are lemmatized: each word is replaced with all of the lemmas of which it can
          be an inflection (for example, the Latin word <emph>est</emph> is replaced with
            <emph>sum1 edo1</emph> and the English word <emph>is</emph> is replaced with
            <emph>be</emph>). This word alignment is performed in both directions in order to
          discover multi-word expressions (MWEs) in the source language.</p>
        <figure id="fig03">
          <label>Sample word alignment from GIZA++</label>
          <graphic url="resources/images/salvus.jpg"/>
          <figDesc>Sample word alignment from GIZA++.</figDesc>
        </figure>
        <p><ref target="#fig03">Figure 3</ref> shows the result of this word alignment (here with
          English as the source language). The original, pre-lemmatized Latin is <emph>salvum tu me
            esse cupisti</emph> (Cicero, <emph>Pro Plancio</emph>, chapter 33). The original English
          is <emph>you wished me to be safe</emph>. As a result of the lemmatization process, many
          source words are mapped to multiple words in the target – most often to lemmas which share
          a common inflection. For instance, during lemmatization, the Latin word <emph>esse</emph>
          is replaced with the two lemmas from which it can be derived – <emph>sum1</emph> (<emph>to
            be</emph>) and <emph>edo1</emph> (<emph>to eat</emph>). If the word alignment process
          maps the source word <emph>be</emph> to both of these lemmas in a given sentence (as in
            <ref target="#fig03">Figure 3</ref>), the translation probability is divided evenly
          between them.</p>
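        <p>One plausible reading of this even division can be sketched as fractional counting. The
          Python below is a sketch under stated assumptions, not the project's actual code: the
          function name <code>translation_probabilities</code>, the toy alignments and the lemma
          identifier <code>salvus1</code> are invented for illustration, and normalizing the split
          counts into relative frequencies is an assumption about how the overall probabilities
          might be derived.</p>
        <eg lang="python"><![CDATA[
```python
from collections import defaultdict

def translation_probabilities(alignments):
    """Estimate P(target lemma | source word) from word alignments.

    Each alignment is (source_word, [candidate target lemmas]). When one source
    word is linked to several candidate lemmas, its single count is split evenly.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for source, candidates in alignments:
        share = 1.0 / len(candidates)
        for lemma in candidates:
            counts[source][lemma] += share

    probs = {}
    for source, lemma_counts in counts.items():
        total = sum(lemma_counts.values())
        # Rank by descending count, breaking ties alphabetically for stable output.
        ranked = sorted(lemma_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        probs[source] = [(lemma, count / total) for lemma, count in ranked]
    return probs

# Toy alignments (invented): "be" is linked once to the ambiguous pair
# sum1/edo1 and twice to sum1 alone.
alignments = [
    ("be", ["sum1", "edo1"]),
    ("be", ["sum1"]),
    ("be", ["sum1"]),
    ("safe", ["salvus1"]),
]
probs = translation_probabilities(alignments)
```
]]></eg>
        <p>With these toy counts, <emph>sum1</emph> accumulates 2.5 of the three alignments for
            <emph>be</emph> and <emph>edo1</emph> only 0.5, so the spurious reading is ranked far
          lower once unambiguous evidence accumulates.</p>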
        <p>From these alignments we can calculate overall translation probabilities, which we
          currently present as an ordered list, as in <ref target="#fig04">Figure 4</ref>.</p>
        <figure id="fig04">
          <label>Sense inventory for <emph>oratio</emph> induced from parallel texts</label>
          <graphic url="resources/images/oratio.png"/>
          <figDesc>Sense inventory for oratio induced from parallel texts.</figDesc>
        </figure>
        <p>The weighted list of translation equivalents we identify using this technique can provide
          the foundation for our further lexical work. In the example above, we have induced from
          our collection of parallel texts that the headword <emph>oratio</emph> is primarily used
          with two senses: <emph>speech</emph> and <emph>prayer</emph>. </p>
        <p>The granularity of the definitions in such a dynamic lexicon cannot approach that
          achieved by human labor: the Lewis and Short <emph>Latin Dictionary</emph>, for instance,
          enumerates fourteen subsenses of <emph>oratio</emph> in varying degrees of granularity, from “speech” to “formal language”
          to the “power of oratory” and beyond. Our approach, however, does have two clear
          advantages which complement those of traditional lexica: first, this method allows us to
          include statistics about actual word usage in the corpus we derive it from. The use of
            <emph>oratio</emph> to signify <emph>prayer</emph> is not common in classical Latin, but
          since the corpus we induced this inventory from is largely composed of the
          <emph>Vulgate</emph> of Jerome, we are able to mine this use of the word and include
          it in this list as well. Since the lexicon is dynamic, we can generate a sense inventory
          for an entire corpus or any part of it – so that if we are interested, for instance, in
          the use of <emph>oratio</emph> only until the second century CE, we can exclude the texts
          of Jerome from our analysis. And since we can run our word alignment at any time, we are
          always in a position to update the lexicon with the addition of new texts.</p>
        <figure id="fig05">
          <label>Sense inventory for the multi-word expression <emph>res publica</emph> induced from
            parallel texts</label>
          <graphic url="resources/images/respublica.png"/>
          <figDesc>Sense inventory for the multi-word expression res publica induced
            from parallel texts.</figDesc>
        </figure>
        <p>Second, our word alignment also maps multi-word expressions, so we can include
          significant collocations in our lexicon as well. This allows us to provide translation
          equivalents for idioms and common phrases such as <emph>res publica</emph>
          (<emph>republic</emph>) or <emph>gratias ago</emph> (<emph>to give thanks</emph>). </p>
      </div>
      <div id="div4b">
        <head>Word Sense Disambiguation</head>
        <p>Approaches to word sense disambiguation generally come in three varieties: </p>
        <list type="ordered">
          <item>knowledge-based methods (<ref target="#lesk1986">Lesk 1986</ref>, <ref
              target="#banerjee2002">Banerjee and Pedersen 2002</ref>), which rely on existing
            reference works with a clear structure such as dictionaries and Wordnets <ptr
              target="#wordnet"/>; </item>
          <item>supervised corpus methods <ptr target="#grozea2004"/>, which train a classifier on a
            human-annotated sense corpus such as Semcor <ptr target="#miller1993"/> or any of the
            SENSEVAL competition corpora <ptr target="#mihalcea2004"/>; and </item>
          <item>unsupervised corpus methods, which train classifiers on “raw,” unannotated text,
            either a monolingual corpus <ptr target="#mccarthy2004"/> or parallel texts (<ref
              target="#brown1991">Brown et al. 1991</ref>, <ref target="#tufis2004">Tufis et al.
              2004</ref>). </item>
        </list>
        <p>Corpus methods (especially supervised methods) generally perform best in the SENSEVAL
          competitions – at SENSEVAL-3, the best system achieved an accuracy of 72.9% in the English
          lexical sample task and 65.1% in the English all-words task.<note id="senseval">At the time of writing, the SEMEVAL-1/SENSEVAL-4 (2007) competition is
            underway.</note>
          Manually annotated corpora, however, are generally cost-prohibitive to create, and the
          problem is especially acute for sense-tagged corpora, for which human inter-annotator
          agreement is often low.</p>
        <p>Since the Perseus Digital Library contains two large monolingual corpora (the canon of
          Greek and Latin classical texts) and sizable parallel corpora as well, we have
          investigated using parallel texts for word sense disambiguation. This method uses the same
          techniques we used to create a sense inventory to disambiguate words in context. After we
          have a list of possible translation equivalents for a word, we can use the surrounding
          Latin or Greek context as an indicator for which sense is meant in texts where we have no
          corresponding translation. There are several techniques available for deciding which sense
          is most appropriate given the context, and several ways of defining what counts as
          “context” in the first place. One technique that we have experimented with is a
          naive Bayesian classifier (following <ref target="#gale1992">Gale et al. 1992</ref>), with
          context defined as a sentence-level bag of words (all of the words in the sentence
          containing the word to be disambiguated contribute equally to its disambiguation). </p>
        <p>Bayesian classification is perhaps most familiar from spam filtering. A filtering program can
          decide whether or not any given email message is spam by looking at the words that
          make it up and comparing them to the words of messages already known to be spam – some
          words generally only appear in spam messages (e.g., <emph>viagra</emph>,
          <emph>refinance</emph>, <emph>opt-out</emph>, <emph>shocking</emph>), while others only
          appear in non-spam messages (<emph>archê</emph>, <emph>subcategorization</emph>), and some
          appear equally in both (<emph>and</emph>, <emph>your</emph>). By counting how often each
          word appears in each class (spam/not spam), we can estimate the probability that a
          message containing it falls into one class or the other.</p>
        <p>We can also use this principle to disambiguate word senses by building a classifier for
          every sense and training it on sentences where we do know the correct sense for a word.
          Just as a spam filter is trained by a user explicitly labeling a message as spam, this
          classifier can be trained simply by the presence of an aligned translation. </p>
        <p>For instance, the Latin word <emph>spiritus</emph> has several senses, including
            <emph>spirit</emph> and <emph>wind</emph>. In our texts, when <emph>spiritus</emph> is
          translated as <emph>wind</emph>, it is accompanied by words like <emph>mons</emph>
          (mountain), <emph>ala</emph> (wing) or <emph>ventus</emph> (wind). When it is translated
          as <emph>spirit</emph>, its context has (more naturally) a religious tone, including words
          such as <emph>sanctus</emph> (holy) and <emph>omnipotens</emph> (all-powerful). If we are
          confronted with an instance of <emph>spiritus</emph> in a sentence for which we have no
          translation, we can disambiguate it as either <emph>spirit</emph> or <emph>wind</emph> by
          looking at its context in the original Latin.</p>
        <table>
          
          <row>
            <cell role="label">Latin context word</cell>
            <cell role="label">English translation</cell>
            <cell role="label">Probability of accompanying <emph>spiritus</emph> =
            <emph>wind</emph></cell>
          </row>
          <row>
            <cell>Mons</cell>
            <cell>Mountain</cell>
            <cell>98.3%</cell>
          </row>
          <row>
            <cell>Commotio</cell>
            <cell>Commotion</cell>
            <cell>98.3%</cell>
          </row>
          <row>
            <cell>Ventus</cell>
            <cell>Wind</cell>
            <cell>95.2%</cell>
          </row>
          <row>
            <cell>Ala</cell>
            <cell>Wing</cell>
            <cell>95.2%</cell>
          </row>
          <caption>
            Latin contextual probabilities where <emph>spiritus</emph> = <emph>wind</emph>.
          </caption>
        </table>
        <table>
          
          <row>
            <cell role="label">Latin context word</cell>
            <cell role="label">English translation</cell>
            <cell role="label">Probability of accompanying <emph>spiritus</emph> =
              <emph>spirit</emph></cell>
          </row>
          <row>
            <cell>Sanctus</cell>
            <cell>Holy</cell>
            <cell>99.9%</cell>
          </row>
          <row>
            <cell>Testis</cell>
            <cell>Witness</cell>
            <cell>99.9%</cell>
          </row>
          <row>
            <cell>Vivifico</cell>
            <cell>Make alive</cell>
            <cell>99.9%</cell>
          </row>
          <row>
            <cell>Omnipotens</cell>
            <cell>All-powerful</cell>
            <cell>99.9%</cell>
          </row>
          <caption>
            Latin contextual probabilities where <emph>spiritus</emph> = <emph>spirit</emph>.
          </caption>
        </table>
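        <p>The classification step described above can be sketched in a few lines of Python. This
          is a minimal illustration under stated assumptions, not the implementation used in our
          experiments: the training sentences are invented toy data standing in for sense labels
          derived from aligned translations, the function names <code>train</code> and
          <code>classify</code> are hypothetical, and add-one smoothing is an assumption made so
          that word and sense combinations never seen in training do not receive zero
          probability.</p>
        <eg><![CDATA[
```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count senses and context words from (sense, context_words) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sense, context in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(context)
        vocabulary.update(context)
    return sense_counts, word_counts, vocabulary


def classify(model, context, alpha=1.0):
    """Return the most probable sense for a sentence-level bag of words."""
    sense_counts, word_counts, vocabulary = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, count in sense_counts.items():
        # Log prior of the sense, plus the log likelihood of each context word.
        score = math.log(count / total)
        denominator = sum(word_counts[sense].values()) + alpha * len(vocabulary)
        for word in context:
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + alpha) / denominator)
        scores[sense] = score
    return max(scores, key=scores.get)


# Invented toy data: lemmatized Latin contexts of "spiritus", each labeled
# with the English word an aligned translation is assumed to have supplied.
TRAINING = [
    ("wind", ["mons", "ventus", "ala"]),
    ("wind", ["ventus", "commotio", "mare"]),
    ("spirit", ["sanctus", "omnipotens", "deus"]),
    ("spirit", ["sanctus", "testis", "vivifico"]),
]

model = train(TRAINING)
untranslated_context = ["mons", "ala", "altus"]
predicted_sense = classify(model, untranslated_context)
```
]]></eg>
        <p>With this toy model, a context containing <emph>mons</emph> and <emph>ala</emph> is
          labeled <emph>wind</emph>, while one containing <emph>sanctus</emph> and
          <emph>testis</emph> is labeled <emph>spirit</emph>.</p>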
        <p>Word sense disambiguation will be most helpful for the construction of a lexicon when we
          are attempting to determine the sense for words in context for the large body of later
          Latin literature for which there exists no English translation. By training a classifier
          on texts for which we do have translations, we will be able to determine the sense in
          texts for which we don’t: if the context of <emph>spiritus</emph> in a late Latin text
          includes words such as <emph>mons</emph> and <emph>ala</emph>, we can use the
          probabilities we induced from parallel texts to know with some degree of certainty that it
          refers to <emph>wind</emph> rather than <emph>spirit</emph>. This will enable us to
          include these later texts in our statistics on a word’s usage, and link these passages to
          the definition as well.</p>
      </div>
      <div id="div4c">
        <head>Parsing</head>
        <p>Two of the features we would like to incorporate into a dynamic lexicon are based on a
          word’s role in syntax: subcategorization and selectional preference. A verb’s
          subcategorization frame is the set of possible combinations of surface syntactic arguments
          it can appear with. In linear, unlabeled phrase structure grammars, these frames take the
          form of, for example, <emph>NP PP</emph> (requiring a direct object + prepositional
          phrase, as in <emph>I gave a book to John</emph>) or <emph>NP NP</emph> (requiring two
          objects, as in <emph>I gave John a book</emph>). In a labeled dependency grammar, we can
          express a verb’s subcategorization as a combination of syntactic roles (e.g., OBJ OBJ).</p>
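        <p>As a rough sketch of how such frames could be read off a labeled dependency analysis,
          the following Python fragment counts, for each verb, the combinations of argument
          relations attached to it. The token fields (<code>id</code>, <code>lemma</code>,
          <code>pos</code>, <code>head</code>, <code>rel</code>), the relation labels and the two
          parsed sentences are all hypothetical, and treating only OBJ as an argument relation is
          an assumption made to mirror the OBJ OBJ example above.</p>
        <eg><![CDATA[
```python
from collections import Counter, defaultdict


def subcat_frames(sentences, argument_relations=frozenset({"OBJ"})):
    """Count, per verb lemma, the combinations of argument relations it governs."""
    frames = defaultdict(Counter)
    for sentence in sentences:
        # Collect the argument relations attached to each head word.
        relations_by_head = defaultdict(list)
        for token in sentence:
            if token["rel"] in argument_relations:
                relations_by_head[token["head"]].append(token["rel"])
        # Every verb contributes one frame, possibly the empty one.
        for token in sentence:
            if token["pos"] == "verb":
                frame = tuple(sorted(relations_by_head[token["id"]]))
                frames[token["lemma"]][frame] += 1
    return frames


# Hypothetical dependency analyses; head 0 marks the root of the sentence.
SENTENCES = [
    [  # "I gave John a book"
        {"id": 1, "lemma": "I", "pos": "pronoun", "head": 2, "rel": "SBJ"},
        {"id": 2, "lemma": "give", "pos": "verb", "head": 0, "rel": "PRED"},
        {"id": 3, "lemma": "John", "pos": "noun", "head": 2, "rel": "OBJ"},
        {"id": 4, "lemma": "a", "pos": "article", "head": 5, "rel": "ATR"},
        {"id": 5, "lemma": "book", "pos": "noun", "head": 2, "rel": "OBJ"},
    ],
    [  # "I read a book"
        {"id": 1, "lemma": "I", "pos": "pronoun", "head": 2, "rel": "SBJ"},
        {"id": 2, "lemma": "read", "pos": "verb", "head": 0, "rel": "PRED"},
        {"id": 3, "lemma": "a", "pos": "article", "head": 4, "rel": "ATR"},
        {"id": 4, "lemma": "book", "pos": "noun", "head": 2, "rel": "OBJ"},
    ],
]

frames = subcat_frames(SENTENCES)
```
]]></eg>
        <p>Run over a large parsed corpus rather than two sentences, the same counts would show
          which frames a verb favors and how often.</p>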
        <p>A predicate’s selectional preference specifies the type of argument it generally appears
          with. The verb <emph>to eat</emph>, for example, typically requires its object to be a
          thing that can be eaten and its subject to have animacy, unless used metaphorically.
          Selectional preference, however, can also be much more detailed, reflecting not only a
          word class (such as <emph>animate</emph> or <emph>human</emph>), but also individual words
          themselves. For instance, the kinds of arguments used with the Latin verb
          <emph>libero</emph> (to free) are very different in Cicero and Jerome: Cicero, as an
          orator of the republic, commonly uses it to speak of liberation from
          <emph>periculum</emph> (danger), <emph>metus</emph> (fear), <emph>cura</emph> (care) and
            <emph>aes alienum</emph> (debt); Jerome, on the other hand, uses it to speak of
          liberation from a very different set of things, such as <emph>manus Aegyptorum</emph> (the
          hand of the Egyptians), <emph>os leonis</emph> (the mouth of the lion), and
          <emph>mors</emph> (death).<note id="bamman">See <ptr target="#bamman2007"/> for a summary of
            this work.</note> These are syntactic qualities since each
          of these arguments bears a direct syntactic relation to its head as much as it holds a
          semantic place within the underlying argument structure.</p>
        <p>In order to extract this kind of subcategorization and selectional information from
          unstructured text, we first need to impose syntactic order on it. One option for imposing
          this kind of order is through manual annotation, but this option is not feasible here due
          to the sheer volume of data involved – even the best-resourced of such endeavors (such
          as the Penn Treebank <ptr target="#penn"/> or the Prague Dependency Treebank <ptr
            target="#pdt"/>) take years to complete.</p>
        <p>A second, more practical option is to assign syntactic structure to a sentence using
          automatic methods. Great progress has been made in recent years in the area of syntactic
          parsing, both for phrase structure grammars (<ref target="#charniak2000">Charniak
          2000</ref>, <ref target="#collins1999">Collins 1999</ref>) and dependency grammars (<ref
            target="#nivre2006">Nivre et al. 2006</ref>, <ref target="#mcdonald2005">McDonald et al.
            2005</ref>), with labeled dependency parsing achieving an accuracy rate approaching 90%
          for English (a high-resource language with fixed word order) and 80% for Czech (a
          language with relatively free word order, like Latin and Greek). Automatic parsing generally requires the
          presence of a treebank – a large collection of manually annotated sentences – and a
          treebank’s size directly correlates with parsing accuracy: the larger the treebank, the
          better the automatic analysis.</p>
        <p>We are currently in the process of creating a treebank for Latin, and have just begun work on a one-million-word treebank of Ancient Greek. Now in version 1.5, the
          Latin Dependency Treebank<note>See <ref target="http://nlp.perseus.tufts.edu/syntax/treebank/">http://nlp.perseus.tufts.edu/syntax/treebank/</ref>.
          </note> is composed of excerpts from the works of eight authors: Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust and Vergil. Each
          sentence in the treebank has been manually annotated so that every word is assigned a
          syntactic relation, along with the lemma from which it is inflected and its morphological
          code (a composite of nine different morphological features: part of speech, person,
          number, tense, mood, voice, gender, case and degree). Based predominantly on the
          guidelines used for the Prague Dependency Treebank, our annotation style is also
          influenced by the Latin grammar of <ref target="#pinkster1990">Pinkster (1990)</ref>, and
          is founded on the principles of dependency grammar <ptr target="#melcuk1988"/>. Dependency
          grammars differ from phrase-structure grammars in that they forgo non-terminal phrasal
          categories and link words themselves to their immediate heads. This is an especially
          appropriate manner of representation for languages with a free word order (such as Latin
          and Czech), where the linear order of constituents is broken up with elements of other
          constituents. A dependency grammar representation, for example, of <emph>ista meam norit
            gloria canitiem</emph> (Propertius I.8.46) – “that glory would know my old age” – would
          look like the following:</p>
        <figure id="fig06">
          <label>Dependency grammar representation of <emph>ista meam norit gloria canitiem</emph>
            ("that glory would know my old age")</label>
          <graphic url="resources/images/ista.png"/>
          <figDesc>Dependency grammar representation of ista meam norit gloria canitiem
            ("that glory would know my old age").</figDesc>
        </figure>
        <p>While this treebank is still in its infancy, we can already use it to train a parser
          for the volumes of unstructured Latin in our collection. The treebank is too small to
          achieve state-of-the-art parsing results, but we can nonetheless induce valuable lexical
          information from the parser’s output by using a large corpus and simple hypothesis
          testing techniques to outweigh the noise of the occasional error <ptr target="#bamman2008"/>.
          The key to improving parsing accuracy is to increase the size of the annotated treebank:
          the better the parser, the more accurate the syntactic information we can extract from
          our corpus.</p>
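        <p>To illustrate what a simple hypothesis test over noisy parser output might look like,
          the sketch below computes a log-likelihood ratio (G²) for the association between a verb
          and one of its arguments. The choice of this particular statistic is an assumption made
          for illustration, as are the function name and the counts, which are invented.</p>
        <eg><![CDATA[
```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 contingency table of co-occurrence counts.

    k11: argument occurs with the verb      k12: other arguments with the verb
    k21: argument occurs with other verbs   k22: other arguments, other verbs
    """
    total = k11 + k12 + k21 + k22
    row_totals = (k11 + k12, k21 + k22)
    column_totals = (k11 + k21, k12 + k22)
    cells = (
        (k11, row_totals[0], column_totals[0]),
        (k12, row_totals[0], column_totals[1]),
        (k21, row_totals[1], column_totals[0]),
        (k22, row_totals[1], column_totals[1]),
    )
    g2 = 0.0
    for observed, row_total, column_total in cells:
        expected = row_total * column_total / total
        if observed > 0:  # an empty cell contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Invented counts: an argument seen 30 times with a verb and 5 times elsewhere.
associated = log_likelihood_ratio(30, 5, 5, 960)
# Counts in exact proportion, showing no association at all.
independent = log_likelihood_ratio(10, 10, 10, 10)
# 3.84 is the chi-square critical value for p = 0.05 with one degree of freedom.
is_significant = associated > 3.84
```
]]></eg>
        <p>Because the statistic grows with the amount of evidence, an occasional parsing error
          contributes little, while a pattern repeated across a large corpus stands out.</p>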
      </div>
    </div>
    <div id="div5">
      <head>Beyond the lexicon</head>
      <p>These technologies, borrowed from computational linguistics, will give us the grounding to
        create a new kind of lexicon, one that presents information about a word’s actual usage.
        This lexicon resembles its more traditional print counterparts in that it is a work designed
        to be browsed: one looks up an individual headword and then reads its lexical entry. The
        technologies that will build this reference work, however, do so by processing a large Greek
        and Latin textual corpus. The results of this automatic processing go far beyond the
        construction of a single lexicon. </p>
      <p>We noted earlier that all scholarly dictionaries include a list of citations illustrating a
        word’s exemplary use. As <ref target="#fig01">Figure 1</ref> shows, each entry in this new,
        dynamic lexicon ultimately ends with a list of canonical citations to fixed passages in the
        text. These citations are again a natural index to a corpus, but since they are based in an
        electronic medium, they provide the foundation for truly advanced methods of textual
        searching – going beyond a search for individual word form (as in typical search engines) to
        word sense. </p>
      <div id="div5a">
        <head>Searching by word sense</head>
        <figure id="fig07">
          <label>Mock-up of a service to search Latin texts by English word sense</label>
          <graphic url="resources/images/slave-search.png"/>
          <figDesc>Mock-up of a service to search Latin texts by English word sense.</figDesc>
        </figure>
        <p>The ability to search a Latin or Greek text by an English translation equivalent is a
          close approximation to real cross-language information retrieval. Consider scholars
          researching Roman slavery: they could compare all passages where any number of Latin
          “slave” words appear, but this would require separate searches for <emph>servus, serva,
            ancilla, famulus, famula, minister, ministra, puer, puella</emph> etc. (and all of their
          inflections), plus many other less-common words. By searching for word sense, however, a
          scholar can simply search for <emph>slave</emph> and automatically be presented with all
          of the passages for which this translation equivalent applies. <ref target="#fig07">Figure
            7</ref> presents a mock-up of what such a service could look like.</p>
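        <p>One way to picture the data structure behind such a service is an index from English
          translation equivalents to the passages where they apply. The following is a minimal
          sketch under that assumption; the passage identifiers are placeholders rather than real
          citations, the sense assignments are invented, and the function names are
          hypothetical.</p>
        <eg><![CDATA[
```python
from collections import defaultdict


def build_sense_index(annotations):
    """Map each English sense to the (passage, Latin lemma) pairs that carry it."""
    index = defaultdict(set)
    for passage_id, lemma, sense in annotations:
        index[sense].add((passage_id, lemma))
    return index


def search_by_sense(index, sense):
    """Return all passages for a sense, regardless of the Latin word used."""
    return sorted(index.get(sense, ()))


# Placeholder annotations: (passage identifier, Latin lemma, assigned sense).
ANNOTATIONS = [
    ("passage-1", "servus", "slave"),
    ("passage-2", "ancilla", "slave"),
    ("passage-3", "puer", "boy"),
    ("passage-4", "puer", "slave"),
    ("passage-5", "famulus", "slave"),
]

index = build_sense_index(ANNOTATIONS)
slave_passages = search_by_sense(index, "slave")
```
]]></eg>
        <p>A single query for <emph>slave</emph> here returns the passages for
          <emph>servus</emph>, <emph>ancilla</emph> and <emph>famulus</emph>, and only the one
          instance of <emph>puer</emph> that was assigned that sense.</p>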
        <p>Searching by word sense also allows us to investigate problems of changing orthography –
          both across authors and time: as Latin passes through the Middle Ages, for instance, the
          spelling of words changes dramatically even while meaning remains the same. So, for
          example, the diphthong <emph>ae</emph> is often reduced to <emph>e</emph>, and prevocalic
            <emph>ti</emph> is changed to <emph>ci</emph>. Even within a given time frame, spelling
          can vary, especially from poetry to prose. By allowing users to search for a sense rather
          than a specific word form, we can return all passages containing <emph>saeculum, saeclum,
            seculum</emph> and <emph>seclum</emph> – all valid forms for <emph>era</emph>.
          Additionally, we can automate this process to discover common words with multiple
          orthographic variations, and include these in our dynamic lexicon as well.</p>
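        <p>As a sketch of how variant spellings might be grouped automatically, the fragment below
          applies only the two changes named above (<emph>ae</emph> to <emph>e</emph>, and
          prevocalic <emph>ti</emph> to <emph>ci</emph>) and collects forms that fall together.
          The function names are hypothetical, and the rules are applied without exceptions, which
          is a simplification. Syncopated pairs such as <emph>saeculum</emph> and
          <emph>saeclum</emph> are not merged by these rules and would still depend on a shared
          sense.</p>
        <eg><![CDATA[
```python
import re
from collections import defaultdict


def normalize(form):
    """Reduce a word form to a spelling-neutral key using two simple rules."""
    form = form.lower().replace("ae", "e")
    # Rewrite "ti" as "ci" only when a vowel follows.
    return re.sub(r"ti(?=[aeiou])", "ci", form)


def group_variants(forms):
    """Group word forms whose normalized spellings coincide."""
    groups = defaultdict(set)
    for form in forms:
        groups[normalize(form)].add(form)
    # Keep only keys with more than one attested spelling.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


FORMS = ["saeculum", "seculum", "saeclum", "seclum", "gratia", "gracia", "mons"]
variants = group_variants(FORMS)
```
]]></eg>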
      </div>
      <div id="div5c">
        <head>Searching by selectional preference</head>
        <p>The ability to search by a predicate’s selectional preference is also a step toward
          semantic searching – the ability to search a text based on what it “means.” In building
          the lexicon, we automatically assign an argument structure to all of the verbs. Once this
          structure is in place, it can stay attached to our texts and thereby be searchable in the
          future, allowing us to search a text for the subjects and direct objects of any verb. Our
          scholar researching Roman slavery can use this information to search not only for passages
          where any slave has been freed (i.e., when any Latin variant of the English translation
            <emph>slave</emph> is the direct object of the active form of the verb
          <emph>libero</emph>), but also for whoever was doing the freeing (in such instances, the
          subject of that verb). This is a powerful resource that can give us much more information
          about a text than simple search engines currently allow.</p>
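        <p>Assuming that each parsed clause has been reduced to a record of its verb, voice,
          subject and direct object, with the object’s English sense attached, the query described
          above could be sketched as follows. The record layout, the function name and the data
          are all hypothetical and serve only to illustrate the idea.</p>
        <eg><![CDATA[
```python
def who_frees_whom(clauses, verb, object_sense):
    """Find active clauses of a verb whose direct object carries a given sense."""
    return [
        (clause["passage"], clause["subject"], clause["object"])
        for clause in clauses
        if clause["verb"] == verb
        and clause["voice"] == "active"
        and clause["object_sense"] == object_sense
    ]


# Invented clause records with placeholder passage identifiers.
CLAUSES = [
    {"passage": "passage-1", "verb": "libero", "voice": "active",
     "subject": "dominus", "object": "servus", "object_sense": "slave"},
    {"passage": "passage-2", "verb": "libero", "voice": "active",
     "subject": "consul", "object": "urbs", "object_sense": "city"},
    {"passage": "passage-3", "verb": "libero", "voice": "passive",
     "subject": "ancilla", "object": None, "object_sense": None},
    {"passage": "passage-4", "verb": "emo", "voice": "active",
     "subject": "mercator", "object": "servus", "object_sense": "slave"},
]

results = who_frees_whom(CLAUSES, "libero", "slave")
```
]]></eg>
        <p>Only the first record satisfies all three conditions, so the query returns the passage
          together with the subject doing the freeing and the object being freed.</p>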
      </div>
    </div>
    <div id="div6">
      <head>Conclusion</head>
      <p>Manual lexicography has produced fantastic results for Classical languages, but as we
        design a cyberinfrastructure for Classics in the future, our aim must be to build a
        scaffolding that is essentially enabling: it must make historical languages more
        accessible not only on a functional level but on an intellectual one as well; it must give
        students the resources they need to understand a text while also providing scholars the
        tools to interact with it in whatever ways they see fit. In this, a dynamic lexicon fills a gap left by
        traditional reference works. By creating a lexicon directly from a corpus of texts and then
        situating it within that corpus itself, we can let the two interact in ways that traditional
        lexica cannot. </p>
      <p>Even driven by the scholarship of the past thirty years, however, a dynamic lexicon cannot
        yet compete with the fine sense distinctions that traditional dictionaries make, and in this
        the two works are complementary. Classics, though, is only one field among many concerned
        with the technologies underlying lexicography, and by relying on the techniques of other
        disciplines like computational linguistics and computer science, we can count on the future
        progress of disciplines far outside our own. </p>
    </div>
  </text>
 
  <listBibl>
    <bibl id="andrews1850">
      <label>Andrews 1850</label>
      <editor>Andrews, E. A. (ed.)</editor>. <title>A Copious and Critical Latin-English Lexicon,
        Founded on the Larger Latin-German Lexicon of Dr. William Freund; With Additions and
        Corrections from the Lexicons of Gesner, Facciolati, Scheller, Georges, etc.</title>.
        <pubPlace>New York</pubPlace>: <publisher>Harper &amp; Bros.</publisher>,
      <date>1850</date>.</bibl>
    <bibl id="bamman2007">
      <label>Bamman and Crane 2007</label>
      <author>Bamman, David and Gregory Crane</author>. <title rend="quotes">The Latin Dependency
        Treebank in a Cultural Heritage Digital Library</title>, <title>Proceedings of the ACL
        Workshop on Language Technology for Cultural Heritage Data</title> (<date>2007</date>).</bibl>
    <bibl id="bamman2008">
      <label>Bamman and Crane 2008</label>
      <author>Bamman, David and Gregory Crane</author>. <title rend="quotes">Building a Dynamic Lexicon from a Digital Library</title>, <title>Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries</title> (<date>2008</date>).</bibl>
    <bibl id="banerjee2002">
      <label>Banerjee and Pedersen 2002</label>
      <author>Banerjee, Sid and Ted Pedersen</author>. <title rend="quotes">An Adapted Lesk
        Algorithm for Word Sense Disambiguation Using WordNet</title>, <title>Proceedings of the
        Conference on Computational Linguistics and Intelligent Text Processing</title>
      (<date>2002</date>).</bibl>
    <bibl id="bourne1916">
      <label>Bourne 1916</label>
      <author>Bourne, Ella</author>. <title rend="quotes">The Messianic Prophecy in Vergil’s Fourth
        Eclogue</title>, <title>The Classical Journal</title>
      <vol>11.7</vol> (<date>1916</date>).</bibl>
    <bibl id="brants2006">
      <label>Brants and Franz 2006</label>
      <author>Brants, Thorsten and Alex Franz</author>. <title>Web 1T 5-gram Version 1</title>.
        <pubPlace>Philadelphia</pubPlace>: <publisher>Linguistic Data Consortium</publisher>,
        <date>2006</date>.</bibl>
    <bibl id="brown1991">
      <label>Brown et al. 1991</label>
      <author>Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra and Robert L.
      Mercer</author>. <title rend="quotes">Word-sense disambiguation using statistical
      methods</title>, <title>Proceedings of the 29th Conference of the Association for
        Computational Linguistics</title> (<date>1991</date>).</bibl>
    <bibl id="busa2004">
      <label>Busa 2004</label>
      <author>Busa, Roberto</author>. <title rend="quotes">Foreword: Perspectives on the Digital
        Humanities</title>, <title>Blackwell Companion to Digital Humanities</title>.
        <pubPlace>Oxford</pubPlace>: <publisher>Blackwell</publisher>, <date>2004</date>.</bibl>
    <bibl id="thomisticus">
      <label>Busa 1974-1980</label>
      <author>Busa, Roberto</author>. <title>Index Thomisticus: sancti Thomae Aquinatis operum
        omnium indices et concordantiae, in quibus verborum omnium et singulorum formae et lemmata
        cum suis frequentiis et contextibus variis modis referuntur quaeque / consociata plurium
        opera atque electronico IBM automato usus digessit Robertus Busa SI</title>.
        <pubPlace>Stuttgart-Bad Cannstatt</pubPlace>: <publisher>Frommann-Holzboog</publisher>,
        <date>1974-1980</date>.</bibl>
    <bibl id="charniak2000">
      <label>Charniak 2000</label>
      <author>Charniak, Eugene</author>. <title rend="quotes">A Maximum-Entropy-Inspired
      Parser</title>, <title>Proceedings of NAACL</title> (<date>2000</date>).</bibl>
    <bibl id="collins1999">
      <label>Collins 1999</label>
      <author>Collins, Michael</author>. <title rend="quotes">Head-Driven Statistical Models for
        Natural Language Parsing</title>, <title>Ph.D. thesis</title>.
      <pubPlace>Philadelphia</pubPlace>: <publisher>University of Pennsylvania</publisher>,
        <date>1999</date>.</bibl>
    <bibl id="freund1840">
      <label>Freund 1840</label>
      <editor>Freund, Wilhelm (ed.)</editor>. <title>Wörterbuch der lateinischen Sprache: nach
        historisch-genetischen Principien, mit steter Berücksichtigung der Grammatik, Synonymik und
        Alterthumskunde</title>. <pubPlace>Leipzig</pubPlace>: <publisher>Teubner</publisher>,
        <date>1834-1840</date>.</bibl>
    <bibl id="gale1992">
      <label>Gale et al. 1992</label>
      <author>Gale, William, Kenneth W. Church and David Yarowsky</author>. <title rend="quotes"
        >Using bilingual materials to develop word sense disambiguation methods</title>,
        <title>Proceedings of the 4th International Conference on Theoretical and Methodological
        Issues in Machine Translation</title> (<date>1992</date>).</bibl>
    <bibl id="old">
      <label>Glare 1982</label>
      <editor>Glare, P. G. W. (ed.)</editor>. <title>Oxford Latin Dictionary</title>.
        <pubPlace>Oxford</pubPlace>: <publisher>Oxford University Press</publisher>,
      <date>1968-1982</date>.</bibl>
    <bibl id="grozea2004">
      <label>Grozea 2004</label>
      <author>Grozea, Christian</author>. <title rend="quotes">Finding Optimal Parameter Settings
        for High Performance Word Sense Disambiguation</title>, <title>Proceedings of Senseval-3:
        Third International Workshop on the Evaluation of Systems for the Semantic Analysis of
      Text</title> (<date>2004</date>).</bibl>
    <bibl id="pdt">
      <label>Hajič 1999</label>
      <author>Hajič, Jan</author>. <title rend="quotes">Building a Syntactically Annotated Corpus:
        The Prague Dependency Treebank</title>, <title>Issues of Valency and Meaning. Studies in
        Honour of Jarmila Panevová</title>. <pubPlace>Prague</pubPlace>: <publisher>Charles
        University Press</publisher>, <date>1999</date>.</bibl>
    <bibl id="kilgarriff2004">
      <label>Kilgarriff et al. 2004</label>
      <author>Kilgarriff, Adam, Pavel Rychly, Pavel Smrz, and David Tugwell</author>. <title
        rend="quotes">The Sketch Engine</title>, <title>Proceedings of EURALEX</title>
      (<date>2004</date>).</bibl>
    <bibl id="elexiko">
      <label>Klosa et al. 2006</label>
      <author>Klosa, Annette, Ulrich Schnörch, and Petra Storjohann</author>. <title rend="quotes"
        >ELEXIKO – A Lexical and Lexicological, Corpus-based Hypertext Information System at the
        Institut für deutsche Sprache, Mannheim</title>, <title>Proceedings of the 12th Euralex
        International Congress</title> (<date>2006</date>).</bibl>
    <bibl id="lesk1986"><label>Lesk 1986</label><author>Lesk, Michael</author>. <title rend="quotes"
        >Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone
        from an Ice Cream Cone</title>, <title>Proceedings of the ACM-SIGDOC Conference</title>
        (<date>1986</date>).</bibl>
    <bibl id="lewisshort">
      <label>Lewis and Short 1879</label>
      <author>Lewis, Charles T. and Charles Short (eds.)</author>. <title>A Latin
      Dictionary</title>. <pubPlace>Oxford</pubPlace>: <publisher>Clarendon Press</publisher>,
        <date>1879</date>.</bibl>
    <bibl id="lsj">
      <label>Liddell and Scott 1940</label>
      <author>Liddell, Henry George and Robert Scott (eds.)</author>. <title>A Greek-English
        Lexicon, revised and augmented throughout by Sir Henry Stuart Jones</title>.
        <pubPlace>Oxford</pubPlace>: <publisher>Clarendon Press</publisher>, <date>1940</date>.</bibl>
    <bibl id="penn">
      <label>Marcus et al. 1993</label>
      <author>Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz</author>. <title
        rend="quotes">Building a Large Annotated Corpus of English: The Penn Treebank</title>,
        <title>Computational Linguistics</title>
      <vol>19.2</vol> (<date>1993</date>).</bibl>
    <bibl id="mccarthy2004">
      <label>McCarthy et al. 2004</label>
      <author>McCarthy, Diana, Rob Koeling, Julie Weeds and John Carroll</author>. <title
        rend="quotes">Finding Predominant Senses in Untagged Text</title>, <title>Proceedings of the
        42nd Annual Meeting of the Association for Computational Linguistics</title>
      (<date>2004</date>).</bibl>
    <bibl id="mcdonald2005">
      <label>McDonald et al. 2005</label>
      <author>McDonald, Ryan, Fernando Pereira, Kiril Ribarov, and Jan Hajič</author>. <title
        rend="quotes">Non-projective Dependency Parsing using Spanning Tree Algorithms</title>,
        <title>Proceedings of HLT/EMNLP</title> (<date>2005</date>).</bibl>
    <bibl id="melcuk1988">
      <label>Mel’čuk 1988</label>
      <author>Mel’čuk, Igor A.</author>
      <title>Dependency Syntax: Theory and Practice</title>. <pubPlace>Albany</pubPlace>:
        <publisher>State University of New York Press</publisher>, <date>1988</date>.</bibl>
    <bibl id="mihalcea2004">
      <label>Mihalcea and Edmonds 2004</label>
      <editor>Mihalcea, Rada and Philip Edmonds (eds.)</editor>. <title>Proceedings of Senseval-3:
        Third International Workshop on the Evaluation of Systems for the Semantic Analysis of
      Text</title> (<date>2004</date>).</bibl>
    <bibl id="wordnet">
      <label>Miller 1995</label>
      <author>Miller, George</author>. <title rend="quotes">Wordnet: A Lexical Database</title>,
        <title>Communications of the ACM</title>
      <vol>38.11</vol> (<date>1995</date>).</bibl>
    <bibl id="miller1993">
      <label>Miller et al. 1993</label>
      <author>Miller, George, Claudia Leacock, Randee Tengi, and Ross Bunker</author>. <title
        rend="quotes">A Semantic Concordance</title>, <title>Proceedings of the ARPA Workshop on
        Human Language Technology</title> (<date>1993</date>).</bibl>
    <bibl id="moore2002">
      <label>Moore 2002</label>
      <author>Moore, Robert C.</author>
      <title rend="quotes">Fast and Accurate Sentence Alignment of Bilingual Corpora</title>,
        <title>AMTA '02: Proceedings of the 5th Conference of the Association for Machine
        Translation in the Americas on Machine Translation</title> (<date>2002</date>).</bibl>
    <bibl id="niermeyer1976">
      <label>Niermeyer 1976</label>
      <author>Niermeyer, Jan Frederick</author>. <title>Mediae Latinitatis Lexicon Minus</title>.
        <pubPlace>Leiden</pubPlace>: <publisher>Brill</publisher>, <date>1976</date>.</bibl>
    <bibl id="nivre2006">
      <label>Nivre et al. 2006</label>
      <author>Nivre, Joakim, Johan Hall, and Jens Nilsson</author>. <title rend="quotes">MaltParser:
        A Data-Driven Parser-Generator for Dependency Parsing</title>, <title>Proceedings of the
        Fifth International Conference on Language Resources and Evaluation</title>
      (<date>2006</date>).</bibl>
    <bibl id="giza">
      <label>Och and Ney 2003</label>
      <author>Och, Franz Josef and Hermann Ney</author>. <title rend="quotes">A Systematic
        Comparison of Various Statistical Alignment Models</title>, <title>Computational Linguistics</title>
      <vol>29.1</vol> (<date>2003</date>).</bibl>
    <bibl id="pinkster1990">
      <label>Pinkster 1990</label>
      <author>Pinkster, Harm</author>. <title>Latin Syntax and Semantics</title>.
      <pubPlace>London</pubPlace>: <publisher>Routledge</publisher>, <date>1990</date>.</bibl>
    <bibl id="schutz1895">
      <label>Schütz 1895</label>
      <author>Schütz, Ludwig</author>. <title>Thomas-Lexikon</title>.
      <pubPlace>Paderborn</pubPlace>: <publisher>F. Schöningh</publisher>, <date>1895</date>.</bibl>
    <bibl id="sinclair1987">
      <label>Sinclair 1987</label>
      <author>Sinclair, John M. (ed.)</author>. <title>Looking Up: an account of the COBUILD project
        in lexical computing</title>. <publisher>Collins</publisher>, <date>1987</date>.</bibl>
    <bibl id="singh2005">
      <label>Singh and Husain 2005</label>
      <author>Singh, Anil Kumar and Samar Husain</author>. <title rend="quotes">Comparison,
        Selection and Use of Sentence Alignment Algorithms for New Language Pairs</title>,
        <title>Proceedings of the ACL Workshop on Building and Using Parallel Texts</title>
        (<date>2005</date>).</bibl>
    <bibl id="tll">
      <label>TLL</label>
      <title>Thesaurus Linguae Latinae, fourth electronic edition</title>.
      <pubPlace>Munich</pubPlace>: <publisher>K. G. Saur</publisher>, <date>2006</date>.</bibl>
    <bibl id="tufis2004">
      <label>Tufis et al. 2004</label>
      <author>Tufis, Dan, Radu Ion, and Nancy Ide</author>. <title rend="quotes">Fine-Grained Word
        Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned
        Wordnets</title>, <title>Proceedings of the 20th International Conference on Computational
        Linguistics</title> (<date>2004</date>).</bibl>
  </listBibl>
</DHQarticle>
