Neel Smith is Associate Professor of Classics at the College of the Holy Cross, and leads a Technical Working Group at the Center for Hellenic Studies. With Thomas Martin, he co-hosted the meeting where the initial planning for the founding of the Stoa Consortium took place, and was a frequent collaborator with Ross Scaife. He is currently working on a project on the shared interests of Hellenistic literary and scientific scholarship.
Authored for DHQ; migrated from original DHQauthor format
Citation practice reflects a model of a scholarly domain. This paper first considers traditional citation practice in the humanities as a description of our subjects of study. It then describes work at the Center for Hellenic Studies on an architecture for digital scholarship that is explicitly based on this model, and proposes a machine-actionable but technologically independent notation for citing texts, the Canonical Text Services URN.
Citation as ontology
For the past ten years, the Stoa Consortium has provided a hothouse for projects aimed at developing appropriate digital forms for scholarly work. It is tempting to describe this as "translating" traditional print work into digital formats, but the metaphor of "translation"
mischaracterizes the Stoa's accomplishment since it suggests that the print form is an established ideal to be more or less effectively replicated in a new medium. The work that the Stoa Consortium has supported is both more difficult and more profoundly significant: it asks first whether we can model our scholarly work
These conclusions have little meaning if they are not worked out in a real implementation, so I will describe in some detail the design decisions we have made in creating a digital system for scholarly citation at the Center for Hellenic Studies. Since the range of technologies may include topics unfamiliar to some DHQ readers, I append a glossary of technical terms at the end of the paper.
A survey of information technology projects in the humanities over the past quarter of a century would vividly illustrate how difficult it is to conceptualize our work without simply following the rut that some technology has worn for us. I will leave it to my readers to select their own favorite (or least favorite) examples of projects conceived in the narrow terms of a particular technology: whether projects that unthinkingly reproduce the visual appearance of print publications in digital form ("shovel ware"), or projects that subordinate scholarly work to the requirements of an inadequate and rigid digital format. Instead, I would point out that the dangers of adapting scholarly work to an alternative information environment are not new. One telling example from the history of classical scholarship is the changing treatment of scholarship on the Homeric poems.
In the Hellenistic and early Roman periods, the most learned literary critics of the ancient world (scholars like Aristarchus of Samothrace) composed extensive commentaries on the Homeric poems as free-standing works. Lemmata, or catch words, introducing each note provided an explicit link to the texts they commented on, but commentaries in the form of scrolls must have been awkward to use, requiring the reader to advance both the text and commentary to the appropriate passage.
Small wonder that in the transition from the scroll or volumen to the book form of the codex, later copyists saw an opportunity to create a much more convenient reading environment. Not only could the sheets of the codex be more easily turned to a particular passage: the wide areas frequently left as margins of each page in the codex were an inviting space for copying related commentary. By the time of the earliest surviving manuscript copy of the
This improved reading environment came at a high price, however: the independent manuscripts of Aristarchus and others ceased to be copied. The individual comments, or scholia, that were placed in a given codex might only be a selection of an original commentary, and were indistinguishably merged with selections from other commentators. The form of the marginal scholia created a new experience of reading while ensuring the loss of much of the material the scholia were based on.
Recent print publications of the scholia have compounded the problem. The potential of the printed edition is to make the contents of the unique manuscripts accessible to readers who are unable to view the manuscripts themselves. To accomplish this, the two works often considered standard references for the Homeric scholia to the
"Proecdosis" or preliminary release, to the
"scientific publication," evidently conceived of as a print publication.
In developing new scholarly instruments for the codex manuscript of Homer, and for the printed edition of the Homeric scholia, innovators recognized how different technologies offered the possibility for different forms of reading. Both the manuscript adorned with scholia and the printed line-by-line collection of scholia privilege a single form of reading, however, and efface essential features of the sources they draw on. This was not due to any hostility towards their source material. The anonymous scribes who read the ancient commentators on Homer revered them, and cite them as authorities; Erbse and van Thiel are equally scrupulous in citing the specific manuscripts they have consulted. But in spite of their obvious, and self-professed, respect for their sources, both the medieval copyist and modern editors chose to work in a form that threatened the survival of the sources they admired.
The scholiasts' citations of authorities and the modern editors' citations of manuscripts describe a model of study grounded in sources. The gap between this model and the effects of their work suggests that in the struggle to fashion a new and improved scholarly tool, they lost sight of some of the most central objects of their study. Today, we routinely confront examples of how new technologies can effect new forms of reading — that is, more generically, new ways of discovering, visualizing and manipulating information — and we risk falling into similar pitfalls. If we want to pursue the implicit agenda of the Stoa Consortium and define a technologically independent model of scholarly work, then considering our citation practice can serve as a practical first step.
Of course, in most scholarly publications in Classics, an enormous proportion of citations are to other scholarly publications. But scholarly publications are representations of an argument. They in turn depend ultimately on references to more fundamental objects of study. While the length of this chain of dependencies may obscure the relation of argument to source material (and while part of the appeal of digital publications is certainly the possibility of following such a series of dependencies automatically to its ultimate sources), if we want to use citation practice as a guide to modelling the objects we study, we will begin with the most elemental objects, on which others depend.
The following section of this paper begins with
Discrete objects. A citation must, before anything else,
Within a single project citing a set of discrete objects, it should be straightforward to implement a requirement that objects have unique identifiers: relational database systems can enforce constraints for unique values of a field, XML documents can be designed to require unique identifying attributes on elements, etc. But how do we further guarantee that unambiguous identifiers will not conflict when they are disseminated across the internet?
This is directly analogous to a problem that the XML community faced a decade ago. When data from different schemas are commingled on the internet, how can automated systems determine the data structure that an element belongs to? If elements from a Dublin Core document, an XHTML document and a TEI document appear together, for example, how can we disambiguate an element named title that could belong to any of the three types of document?
One conceivable solution might have been to establish a centralized registry of data structures (schemas, DTDs), but that would have enormously burdened developers who frequently need to create or modify data structures with new DTDs or schemas. Instead, the World Wide Web Consortium defined XML namespaces as an easy and flexible way of qualifying XML structures with unambiguous references.
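The disambiguation that namespaces provide can be illustrated with a minimal sketch. The three namespace URIs below are the published ones for Dublin Core, XHTML and TEI; the wrapper element `mixed`, the sample titles and the function name `titles_by_namespace` are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs for the three vocabularies named above.
DC = "http://purl.org/dc/elements/1.1/"
XHTML = "http://www.w3.org/1999/xhtml"
TEI = "http://www.tei-c.org/ns/1.0"

# An invented wrapper document that mixes three different "title" elements.
SAMPLE = f"""
<mixed xmlns:dc="{DC}" xmlns:h="{XHTML}" xmlns:tei="{TEI}">
  <dc:title>Catalog title</dc:title>
  <h:title>Web page title</h:title>
  <tei:title>Edition title</tei:title>
</mixed>
""".strip()


def titles_by_namespace(xml_text):
    """Map each namespace URI to the text of its title element."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        # ElementTree expands a qualified name to "{namespace-uri}localname".
        if element.tag.startswith("{") and element.tag.endswith("}title"):
            namespace = element.tag[1:].split("}", 1)[0]
            found[namespace] = element.text
    return found


titles = titles_by_namespace(SAMPLE)
```

Although all three elements share the local name title, each is retrieved under its own namespace URI, so no central registry of element names is required.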
The Center for Hellenic Studies (CHS) Technical Working Group has followed the same reasoning to qualify references to specific objects (as opposed to data structures) with what we are calling "domain namespace identifiers," or DNIDs. A domain namespace identifier can qualify an object identifier to guarantee that it will be globally unique. Since the CHS owns the domain name chs.harvard.edu, a CHS developer can build on this string to define namespace identifiers to refer uniquely to a set of objects. A data namespace like chs.harvard.edu/datans/images could be used to refer to a set of digital images, for example.
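As a rough sketch of the idea, an identifier can be modelled as a pair of data namespace and local identifier. The namespace chs.harvard.edu/datans/images comes from the example above; the second namespace, the local identifier "0042" and the class name `QualifiedId` are hypothetical, and how the two parts would be serialized into a single string is deliberately left open here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedId:
    """An object identifier qualified by a data namespace."""
    namespace: str
    local_id: str


images_ns = "chs.harvard.edu/datans/images"
coins_ns = "chs.harvard.edu/datans/coins"  # hypothetical second namespace

# The same local identifier in two namespaces names two different objects.
image = QualifiedId(images_ns, "0042")
coin = QualifiedId(coins_ns, "0042")

# Frozen dataclasses are hashable, so qualified identifiers can serve as keys.
catalog = {image: "a digital image", coin: "a coin record"}
```

Two projects can therefore reuse the same local identifiers without collision, provided each builds its namespace on a domain name it controls.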
Hierarchical texts. The ways scholars refer to the texts they read suggests a more complex, hierarchical identity than simple discrete objects. Our model of identifying texts needs to range from references to a poem like the
"group 1 entities" of the FRBR model.
"the works of Sophocles"), provenance ("the inscriptions from Aphrodisias"), or may represent a conventional grouping the semantics of which might even be disputed. "The Homeric poems," for example, might be viewed by some as a category of author, and by others as a generic category. As purely conventional categories, the traditional groupings do not conflict with the librarians' proper concern to separate any kind of subject cataloging from the identification of a work. At the same time, the conventional groups provide a context for citing texts that classicists have found useful, and that permeates classical scholarship, so we will want to retain them in our model for identifying texts. Within these groups, the notional works classicists refer to correspond precisely to the notional works of the FRBR model.
In the center of the FRBR model, the distinction between expression and manifestation is not evident in classicists' citation practice. The difference between expression and manifestation matters to librarians responsible for physical holdings in a collection. Scholars citing texts focus instead on their semantic content. For their purposes, if two manifestations are so different that we need to distinguish them, they may as well be considered a new edition or translation.
Individual physical copies, on the other hand, may matter to the scholar as well as to the librarian, because finally the evidence for a given version rests in real, physical exemplars. If the only available or known exemplars are imperfectly preserved, scholarly citation may need to distinguish between evidence for a version from one copy versus another.
The following table summarizes the differences between the FRBR model and the model of texts suggested by our citation practice.
The internet's domain name system (DNS) provides an example of how a system of hierarchical unique identifiers can function across a global network, and closely parallels what we need in citing primary texts. Key points are
So in DNS, a top-level server for .org addresses tracks the assignment of names like stoa.org to organizations like the Stoa Consortium. The registrar of the Stoa Consortium can extend the hierarchy, and these entries are represented with extended names like www.stoa.org.
We want a system that would assign responsibility for registering identifiers for texts to appropriate organizations or projects, analogous to the internet's top-level domains. The Stoa Consortium, for example, might be a logical registrar for works of Latin literature, while the Aphrodisias project would be an obvious choice as registrar for the inscriptions from Aphrodisias. Within these domains, registrars could extend identifiers to any hierarchical level of a text's identification, from text group down to individual exemplar. In the following section, I will further develop a unified notational scheme for citing works and passages within works, but at this stage, note that we want a simple textual notation that can be extended to each hierarchical level of a text's identification.
As with DNS, this kind of delegated system of authority requires consensus among all participants. In a field like classics, "participants" really means the active projects that are disseminating digital texts: a comparatively small community, and probably a less fractious setting for trying to develop consensus than traditional professional organizations. The slow, hard work of building a consensus would be more than repaid when unambiguous hierarchical identifiers could be used for frequently cited texts. I will return in the following section to some potential rewards of a coordinated registry system for referring to texts.
In contrast to objects that are cited by simple identifiers, a handful of object types may be more precisely identified by a citation pointing to some part of the object. Two familiar examples are geographic objects and images. In printed works, these may be cited with a visualization as a map or illustration, where a digital citation could more generically refer to coordinates within a continuous reference system (which could be visualized as a map or illustration). A geographic object such as a city may have a unique identifier in a collection of locations, but a reference in a geographic coordinate system could point to a particular section of that entity.
"region of interest" in a digital image. These coordinates are expressed in percentage units so that the citation is easily applied regardless of the scale of reproduction of the image. Clipping the cited section or highlighting it on the full image with alpha compositing are possible displays in a digital environment.
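The scale independence of percentage units can be shown with a short sketch. The text above specifies only that coordinates are percentages; the rectangular layout (left, top, width, height), the class name `RegionOfInterest` and the sample image dimensions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionOfInterest:
    """A rectangle given as percentages (0-100) of image width and height."""
    left: float
    top: float
    width: float
    height: float

    def to_pixels(self, image_width, image_height):
        """Return (x, y, w, h) in pixels for a reproduction of the given size."""
        return (
            round(self.left * image_width / 100),
            round(self.top * image_height / 100),
            round(self.width * image_width / 100),
            round(self.height * image_height / 100),
        )


# One citation, applied to two reproductions of the same image.
roi = RegionOfInterest(left=25, top=10, width=50, height=20)
thumbnail_box = roi.to_pixels(400, 300)
full_size_box = roi.to_pixels(4000, 3000)
```

The same citation yields a pixel rectangle suited to each reproduction, so the reference itself never needs to record the resolution of any particular copy.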
coordinate system.
Other humanists have fallen into the trap of citing texts by the accidental physical unit of the page, with the unfortunate consequence that the citation is valid only for a specific manifestation of the work, since references to pages cannot be applied to other printings, much less to other expressions such as translations into different languages. Classicists, together with scholars in Biblical studies, have generally recognized the importance of a citation scheme that is independent of the physical page: "Book 2, chapter 5" of Thucydides has the same meaning no matter whether it is applied to the notional work, or a specific item.
In a classic article provocatively entitled
"coordinate systems" classicists use in the canonical citation of texts.
As early as 1993, however, Renear, Mylonas and Durand had backed away from the universality of their initial claim:
The 1993 revision to their original OHCO model contains important insights, but from the perspective of scholarly citation, the original OHCO thesis describes precisely how we cite texts. There is a single logical hierarchy for citation, and when we are interested in features of a text that are not aligned with the units of the citation scheme, we must nevertheless identify those features in terms of our citation scheme.
Renear et al. illustrate overlapping hierarchies with the example of sentences or speeches in a poetic work organized by metrical lines: the linguistic unit and the prosodic may not align, and the editor of the electronic text will be forced to privilege one hierarchy over the other, or devise a strategy for handling concurrent, overlapping hierarchies. Yet when classical scholars cite works by metrical line, they will not invent a new citation scheme to refer to a speech: they will express the citation in terms of lines no matter what feature they are analyzing. In book 6 of the
"Who are you? I don't think I have ever seen you on the field of battle before." The question begins with line 123, and wraps onto the first metrical foot of line 125. The speech unit and the metrical unit are incompatible, but for most purposes, this is of no consequence: the line is a sufficiently precise pointer that we will simply refer to Diomedes' question with a reference like
This is directly comparable to the approximation we accept when we cite other objects with a simple identifier: use of a common citation system allows us to cite without having to agree in detail on the underlying data model, since the data model will be dictated by the perspective of the individual scholar. For different purposes, we might view a coin as having one set of properties (die axis, weight) or another (attested personal names), but we can recognize a citation as referring to the same coin, regardless of the data model applied to it. Just so we might view a text as having one logical structure (syntactic unit) or another (prosodic unit), but we can recognize a span of lines in the
Perhaps the complexity of citation expressed in prose has obscured the fact that canonical citations marry two hierarchies, one identifying the object, one describing its logical "coordinate system" for purposes of citation. Certainly, the variety in natural-language expressions for these ideas is an obstacle to machine recognition and action on our references.
To express these canonical references to texts concisely and unambiguously, the CHS Technical Working Group has defined a notation for canonical text citation. We chose to express these citations as Uniform Resource Names (URNs). URNs are "persistent, location-independent, resource identifiers": precisely what a citation should be. The syntax of URNs (defined in RFC 2141) is "designed to make it easy to map other namespaces (which share the properties of URNs) into URN-space. Therefore, the URN syntax provides a means to encode character data in a form that can be sent in existing protocols, transcribed on most keyboards, etc."
The values used to identify texts in CTS URN citations must be publicly documented, ideally in a DNS-like distributed registry system. In order to begin developing software using the CTS URN notation immediately, the CHS Technical Working Group is maintaining a hierarchical registry of identifiers for works of ancient Greek literature, and a registry of other CTS registries covering other domains (such as the Stoa Consortium for works of Latin literature).
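A delegated registry of this kind can be sketched as a tree in which each node registers only the identifiers directly beneath it. This is a toy in-memory model, not a description of the CHS registry software: the class `Registrar` and its methods are hypothetical, and the identifiers greekLit, tlg0012 and tlg001 are taken from the Iliad example used elsewhere in this paper.

```python
class Registrar:
    """One node in a delegated hierarchy of identifiers."""

    def __init__(self, label):
        self.label = label
        self.children = {}

    def register(self, identifier, label):
        """Extend the hierarchy one level below this node."""
        child = Registrar(label)
        self.children[identifier] = child
        return child

    def resolve(self, path):
        """Follow identifiers down the hierarchy; None if any is unregistered."""
        node = self
        for identifier in path:
            node = node.children.get(identifier)
            if node is None:
                return None
        return node


# The root knows only about registries; each registry manages its own texts.
root = Registrar("registry of CTS registries")
greek = root.register("greekLit", "CHS registry of ancient Greek literature")
group = greek.register("tlg0012", "text group containing the Iliad")
group.register("tlg001", "Iliad")

found = root.resolve(["greekLit", "tlg0012", "tlg001"])
missing = root.resolve(["unregisteredDomain", "anything"])
```

As in DNS, no single party needs a complete list of texts: resolution succeeds as long as each level can hand the request on to the registrar responsible for the next.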
URNs always begin with the string urn, followed by a protocol identifier. We propose the identifier cts for our protocol. CTS URNs are composed of up to four further top-level elements, separated by colons. They are
"top level registry" where text identifiers can be looked up or resolved. We use the string greekLit to refer to the CHS registry of works of ancient Greek literature.
The general structure of a CTS URN is therefore
urn:cts:DOMAIN:WORK:PASSAGE?
and a full CTS URN citing lines 123-125 of Iliad 6 would be
urn:cts:greekLit:tlg0012.tlg001:6.123-6.125
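Because the notation is a flat, colon-delimited string, it can be taken apart with very little code. The following is a minimal sketch that assumes only the three fields shown in the general structure above, with periods separating hierarchical levels and a hyphen separating the ends of a range; the names `CtsUrn` and `parse_cts_urn` are hypothetical, and no validation against a registry is attempted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    domain: str
    work: Tuple[str, ...]                     # work hierarchy, most general first
    passage_start: Optional[Tuple[str, ...]]  # citation hierarchy, if any
    passage_end: Optional[Tuple[str, ...]]


def parse_cts_urn(urn):
    """Split a CTS URN into its domain, work hierarchy and passage range."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    domain = parts[2]
    work = tuple(parts[3].split("."))
    start = end = None
    if len(parts) > 4 and parts[4]:
        first, _, last = parts[4].partition("-")
        start = tuple(first.split("."))
        # A citation without a hyphen refers to a single passage.
        end = tuple(last.split(".")) if last else start
    return CtsUrn(domain, work, start, end)


iliad = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:6.123-6.125")
```

Representing both hierarchies as tuples makes it straightforward for an application to drop trailing components and interpret the reference at a more general level.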
Because the URN explicitly indicates the hierarchy of both the work and the citation, applications can choose to interpret the reference at whatever level they consider appropriate. A URN might refer to a specific English translation, but an application could ignore the more specific components of either the work or citation hierarchy to apply it to a Greek edition of the
The CHS has a small registry of scholarly publications that it maintains on line, including Gregory Nagy's works
"Pindar's Homer Nagy lyric" ranks as the top match a site offering
"lyric" in its abstract of the book). Searching for "urn:cts:chs:nagy.ph: lyric" finds only content matching that URN as well as the term "lyric". Because the URN structure is hierarchical, searching for "urn:cts:chs:nagy. Achilles" finds all occurrences of Achilles in the textgroup "nagy" in the chs domain, ranked according to Google's search algorithm. Because the CTS URN expresses the semantics of hierarchical text citation in a simple flat string, we can use its precision to limit results when searching Google Base for fuzzier terms.
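The effect of these searches can be imitated locally with a simple string-prefix test. This sketch does not reproduce Google Base or its ranking; the function `search`, the passage components of the sample URNs, the second work and text group, and all of the record texts are invented placeholders.

```python
def search(records, urn_prefix, term):
    """Return records whose URN starts with the prefix and whose text has the term."""
    term = term.lower()
    return [
        record for record in records
        if record["urn"].startswith(urn_prefix) and term in record["text"].lower()
    ]


# Invented records; only the prefix urn:cts:chs:nagy.ph: comes from the text above.
records = [
    {"urn": "urn:cts:chs:nagy.ph:1", "text": "placeholder text mentioning lyric"},
    {"urn": "urn:cts:chs:nagy.ph:2", "text": "placeholder text mentioning Achilles"},
    {"urn": "urn:cts:chs:nagy.other:1", "text": "placeholder text on Achilles and lyric"},
    {"urn": "urn:cts:chs:someone.book:1", "text": "placeholder text mentioning lyric"},
]

# Restrict to one work, then widen to the whole text group.
in_one_work = search(records, "urn:cts:chs:nagy.ph:", "lyric")
in_text_group = search(records, "urn:cts:chs:nagy.", "Achilles")
```

Shortening the prefix widens the scope of the search from a single work to its text group, which is the hierarchical behavior described above.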
More generally, the simple string of a CTS URN is well suited to passing around the internet as a precise form of machine-actionable citation. The links associated with the Google Base entries for the Nagy books, for example, pass the URN as a parameter to a text browsing application so that from a search of Google Base, a reader can pass directly to a continuous browser through the full text.
Source citation is just one part of scholarly publication, and conventions for citing resources digitally must be viewed as part of a larger architectural design. I have previously argued that when the digital library is the global internet, the natural architecture for scholarly publications is a hierarchy of services. A "diff" service, for example, describing differences between the same passage in two versions of a text, could be built on top of a service that retrieved passages by canonical reference.
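The layering can be sketched with two functions, one of which calls the other. Here the retrieval service is stood in for by an in-memory dictionary, where a real one would be a network service; the function names, version labels, reference "1.1" and passage texts are all invented placeholders, and the comparison uses Python's standard difflib module.

```python
import difflib

# Invented stand-in for a passage retrieval service.
TEXTS = {
    ("edition-a", "1.1"): "first placeholder line of text",
    ("edition-b", "1.1"): "first placeholder line of the text",
}


def retrieve_passage(version, reference):
    """Lower tier: return the text of a passage by canonical reference."""
    return TEXTS[(version, reference)]


def diff_passage(reference, version_a, version_b):
    """Upper tier: word-level differences between two versions of a passage."""
    words_a = retrieve_passage(version_a, reference).split()
    words_b = retrieve_passage(version_b, reference).split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b)
    return [
        (tag, words_a[i1:i2], words_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = diff_passage("1.1", "edition-a", "edition-b")
```

The upper-tier function knows nothing about how passages are stored; it depends only on the canonical reference, so the retrieval tier could be replaced without changing it.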
Much of the energy of the CHS Technical Working Group has been focused on defining and implementing network services in support of scholarly applications, in part because we recognize the commonplace that while end-user applications are short-lived, thoughtfully designed services upon which end-user applications can be built can have much longer lives. I would suggest that we can extend our digital architecture one tier deeper to include a "reference tier" that is more fundamental than the service level. This relationship is summarized in the following table.
"book 6, lines 123-125 of the Iliad" should have a fixed meaning, and remain valid; it is important for us to design digital expressions that will be both immediately practical, and likely to remain sustainable for the indefinite future. Both the application of data namespaces to qualify unique identifiers and the use of CTS URNs for references to texts should satisfy those conditions.
To study, as humanists do, historically unique products of culture — works of literature, historical events, artifacts of material culture — is an extraordinarily complex undertaking. I suspect that for most humanists that complexity is part of the fascination of their work, a reflection of the richness that gives meaning to the object of their study. We are not merely comfortable with the complexity of our material, we revel in it. Most often we are less familiar with an idea that seems natural to the computer scientist: that any degree of complexity can be constructed from the composition of simpler elements.
Yet the way we cite sources suggests that at some level we intuit this. We tend to cite a provocatively large proportion of the material we study either as simple objects (expressible as a unique identifier qualified by a data namespace), or as a continuous reference within a hierarchically identified text (expressible as a CTS URN). It is worth considering how much of our work could be modelled in information systems that rest ultimately on foundations laid with these two simple shapes of building blocks.
"group 1 entities" describe a hierarchical model for texts, from a notional work down to a single specific item. (Summary in Wikipedia with links to technical documents: http://en.wikipedia.org/wiki/FRBR.)
"simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references." (See http://www.w3.org/TR/REC-xml-names/.)