Jeffrey C. Witt is an assistant professor of philosophy at Loyola University Maryland. He is the founder, designer, and developer of the Sentences Commentary Text Archive (http://scta.info).
In this paper, I offer an overview of an idea for a metadata archive, called the Sentences Commentary Text Archive, designed to make the commentary tradition surrounding Peter Lombard's Sentences more discoverable, more collaborative, and more amenable to large-scale analysis.
Overview of the Sentences Commentary Text Archive
What Is a Sentences Commentary?
Peter Lombard wrote the medieval book known as the Sentences in the middle of the twelfth century. For centuries afterward, commenting on it was a standard exercise for aspiring masters of theology. Given that Lombard's Sentences was commented on by generation after generation of theologians, the resulting commentary tradition constitutes an enormous corpus stretching across five centuries.
While current scholarly practice primarily aims to make these texts available in printed editions, this workflow suffers from a couple of problems. Today, a static print edition of these commentaries remains the gold standard, and this is not without good reason. The unchangeable nature of the text helps create stable citation practices that online resources often have a hard time replicating. Likewise, the print edition is easier to document for purposes of tenure and promotion. In the present proposal, I am not interested in disputing these realities, but rather in thinking about a workflow towards print that can enable other possibilities at the same time. One such possibility lies in making working drafts and ongoing editions of texts — editions that often take decades to complete — sufficiently discoverable so as to promote collaboration. It is hard for potentially interested scholars to collaborate if they are unaware of texts or editions in need of collaborators. Another important possibility is the future large-scale analysis of the tradition as a whole. The divergence of editorial practices, the variety of file formats, and the generally isolated practice of print production make large-scale analysis of the entire corpus difficult. Often very little is known about the digital format of the post-publication file. But these post-publication files, if archived and catalogued in an accessible way, could become the basis of a new understanding of this five-century-long tradition and its various developments across time and region.
In this short paper, I offer an overview of a metadata archive (still in development) that, if scaled for production, could support the kind of collaboration mentioned above and promote previously impossible analyses of large sections of the Sentences commentary tradition.
What defines the work of an archivist, and so what defines an archive? The suggestion here is that an archive is a collection of materials with an accompanying body of descriptive metadata. Accordingly, this metadata is the beating heart of the archive and the basis on which a user or application can search and sort resources. As such, it is also true to say that this conception of an archive departs from the perhaps more traditional notion of an archive as a collection of material objects; it is instead primarily interested in a purposeful collection of digital surrogates.
The need for such a metadata archive was first impressed on me as I worked (and continue to work) on an edition, encoded according to the standards of the Text Encoding Initiative (TEI), of a late fourteenth-century Sentences commentary by Peter Plaoul.
As the edition developed, I noticed two things. First, this semantically encoded text is rich in interesting metadata (e.g. citations, references, structure, length, word frequency, etc.) that, in order to be usable, simply needed to be harvested and presented. Second, as I began to detach the text from the system used to display it, I recognized the need for a central place to record where these texts live and what they contain.
A primary goal of the Sentences Commentary Text Archive, then, is to collect and expose this metadata so that both researchers and applications can find and reuse these texts. Figure 1 illustrates, in a highly abstract way, how this metadata archive could serve as a kind of switchboard between institutional and development repositories of texts and application uses of those texts.
In this section, I want to provide some fairly specific technical details about the current prototype of the archive that I have built. The point of such a report is not necessarily to declare the best possible practice, or even to describe the final form the archive will take, but to make the present design concrete.
In the current prototype (an early incarnation of which can be found at http://scta.info), the archive metadata extraction begins with an XML document, called a projectdata.xml file, that includes a list of the texts belonging to a given project, along with pointers to the repositories where their transcriptions are stored.
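A simplified sketch of what such a file might contain (the element names here are illustrative assumptions, not the prototype's actual format):

```xml
<projectdata>
  <text id="plaoul-lectio1">
    <title>Commentarius in libros Sententiarum, Lectio 1</title>
    <status>In Progress</status>
    <!-- pointer to the raw TEI transcription in its home repository -->
    <transcription url="https://example.org/repo/plaoul/lectio1.xml"/>
  </text>
</projectdata>
```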
The standardization of these files is critical for the next step in the process: the extraction of metadata according to the Resource Description Framework (RDF) data model. This extraction happens when an XSLT stylesheet, rdfextraction.xsl, is applied to a given projectdata.xml file. Research projects and content providers in the humanities, such as libraries and museums, are increasingly incorporating semantic web technologies. As part of this process, many ontologies initially developed for other contexts are being translated into semantic-web-ready forms to enable the leveraging of existing metadata in a semantic web context. XML schemas in particular are targeted for such translation, and production of RDF based on existing XML markup is increasing, with the W3C offering a conversion tool meant to facilitate such translations.
The distinguishing feature of rdfextraction.xsl is that it not only processes the information in the projectdata.xml file, but also follows the pointers in the file to the raw TEI text, when available, and begins extracting data for each of the XML-encoded transcriptions.
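A highly simplified fragment of such a stylesheet might look as follows, reusing the hypothetical projectdata.xml format sketched above (the scta: property names are likewise illustrative assumptions rather than the prototype's actual vocabulary):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:scta="http://scta.info/property/">

  <xsl:template match="/projectdata">
    <rdf:RDF>
      <xsl:apply-templates select="text"/>
    </rdf:RDF>
  </xsl:template>

  <!-- For each text entry, follow the pointer to the raw TEI file and
       emit one triple for every name the transcription mentions. -->
  <xsl:template match="text">
    <rdf:Description rdf:about="http://scta.info/text/{@id}">
      <scta:status><xsl:value-of select="status"/></scta:status>
      <xsl:for-each select="document(transcription/@url)//tei:name[@ref]">
        <scta:mentions rdf:resource="{@ref}"/>
      </xsl:for-each>
    </rdf:Description>
  </xsl:template>
</xsl:stylesheet>
```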
From the extracted information, this archive can, first and foremost, promote collaboration by simply listing for interested researchers whether a transcription of a particular part of a text (a single book, distinction, or question, for example) is available, in progress, or not yet started.
If users know the exact section of a text they are interested in, they can look it up in the database and see its status. This simple step would simultaneously help scholars avoid redundant work and encourage collaboration.
This kind of information, however, is only the tip of the iceberg. When texts are
encoded according to the TEI schema — or, in the ideal scenario, according to a
customized TEI schema tailored to this specific genre of text — it is possible
to begin automatically harvesting all kinds of information about the text
itself. For example, in the case of the Peter Plaoul edition, the rdfextraction.xsl stylesheet can, with astonishing speed, run through over 650 documents and 1,200,000 words and return information about which authors are mentioned, which texts are referenced, and which passages are quoted.
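Serialized as RDF (here in Turtle, with illustrative URIs), a few of the resulting triples might look like this:

```turtle
@prefix scta: <http://scta.info/property/> .

<http://scta.info/text/plaoul-lectio1>
    scta:status   "In Progress" ;
    scta:mentions <http://scta.info/name/augustinus> ;
    scta:quotes   <http://scta.info/quotation/john-3-5> .
```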
Though this is the ideal method of extraction, it remains the case today that many texts exist in formats not designed with such harvesting in mind, and it often happens that the first set of procedures, transcription, is biased by the second, presentation. Nor is XSLT the only viable route: in contrast to the standard approach to extraction through the use of XSLT and XPath queries, the Orlando Project uses a Python script to target specific regular expressions.
More essential than exactly how this information is harvested is the schema according to which this metadata is catalogued. My present prototype implementation has created a provisional RDF schema. This schema is divided into three primary classes: texts, resources (such as names, works, quotations, etc.), and properties (such as hasTranscriptions, quotes, isQuotedBy, mentions, references, etc.). Properties are further divided into three main categories that allow the archive to organize metadata according to three main data streams: publication information (pubInfo), content information (contentInfo), and linking information (linkingInfo). Figure 2 offers a representation of how these content streams can be presented to users.
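A small fragment of such a schema, expressed in Turtle (with illustrative URIs, and comments marking which data stream each property belongs to), might look like this:

```turtle
@prefix scta: <http://scta.info/property/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# contentInfo stream: what a text contains
scta:quotes a rdf:Property ;
    rdfs:label   "quotes" ;
    rdfs:comment "Links a text to a quotation occurring within it." .

# linkingInfo stream: how texts connect to one another
scta:isQuotedBy a rdf:Property ;
    rdfs:label   "is quoted by" ;
    rdfs:comment "Inverse of scta:quotes." .

# pubInfo stream: publication status and location
scta:status a rdf:Property ;
    rdfs:label   "status" ;
    rdfs:comment "Publication status of a transcription (e.g. draft)." .
```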
Ideally, this schema would be developed by a team of editors and constructed in concert with the TEI customization schema; compatibility between the two schemas makes metadata harvesting extraordinarily efficient.
This kind of metadata collection allows for robust search and finding possibilities. Once in place, users can search for any text where a particular Bible verse is used, or any text that discusses a key word such as faith, baptism, etc. Likewise, by navigating the archive they could discover networked connections, such as sets of texts that quote the same authors in the same places or use the same key words in the context of the same quotations. Right now, the information needed for this kind of search is scattered: scattered throughout printed critical editions and various online editions.
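Once gathered in one place, such a search reduces to a short SPARQL query against the triple store; for example (using the same illustrative vocabulary as above):

```sparql
PREFIX scta: <http://scta.info/property/>

# Every text that quotes a given Bible verse
SELECT ?text WHERE {
  ?text scta:quotes <http://scta.info/quotation/john-3-5> .
}
```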
The current prototype instantiation of the archive stores the extracted RDF triples in a Fuseki triple store, part of the Apache Jena Java framework.
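Because Fuseki exposes the standard SPARQL protocols over HTTP, loading the extracted triples can be scripted in a few lines; a sketch in Ruby (the dataset name scta and the file name are assumptions):

```ruby
require 'net/http'
require 'uri'

# Push the extracted Turtle triples into a local Fuseki dataset
# via the SPARQL Graph Store protocol.
uri = URI('http://localhost:3030/scta/data')
request = Net::HTTP::Post.new(uri)
request['Content-Type'] = 'text/turtle'
request.body = File.read('extracted.ttl')

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
puts response.code  # 200/201 on success
```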
While these possibilities are tantalizing in themselves, the real power of this archive lies not primarily in the user interface, but in the further uses that other applications can make of this information. For this purpose, the RDF.rb library is equipped to handle content negotiation. Thus, when a user requests information through a browser, the application returns nicely formatted HTML tables; when a machine requests the same information in another format specified in the HTTP Accept header (e.g. ttl, nt, json, rdf/xml), the request is easily handled by the Ruby application.
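A Sinatra-style sketch of this negotiation (the route and the load_graph helper are hypothetical):

```ruby
require 'sinatra'
require 'rdf'
require 'rdf/turtle'

get '/resource/:id' do
  graph = load_graph(params[:id])  # hypothetical helper returning an RDF::Graph

  if request.accept?('text/turtle')
    content_type 'text/turtle'
    graph.dump(:ttl)               # serialize the graph as Turtle
  else
    erb :resource, locals: { graph: graph }  # default: an HTML table view
  end
end
```

In this way a single route can serve both the human reader and the machine consumer.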
One good example of an application that can make use of the metadata in this archive is the web application I have been developing to read and display these editions. First, the application can rely on the archive's publication information to determine which texts are available for display.
A second example lies in the gathering of text files to be displayed in a single reading environment. At present, the application depends on a projectdata.xml file, which includes a lot of redundant information already contained in the archive, such as the location of the text in institutional and development repositories, the formal title of the text, its status (draft or otherwise), and whether it has any corresponding diplomatic transcriptions, subdivisions, etc. Future development aims to reduce this file to a simple set of dereferenceable URLs for the texts the user wants to display, in the order he or she wants them displayed. With the archive in place, the application can make a call to the archive and, by parsing the response, access all the other pertinent information it needs. Likewise, if the location of the raw XML file changes (for example, if it is moved to a new repository), no updates will ever need to be made: the application will still know, through the information returned from the archive, exactly where the raw text can be found.
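Under this design, resolving a text might look something like the following sketch (the archive URL pattern and the hasXML key are assumptions for illustration):

```ruby
require 'net/http'
require 'uri'
require 'json'

# Resolve a text through the archive: request its metadata as JSON, then
# follow the returned pointer to the raw TEI file, wherever it now lives.
def fetch_tei(resource_url)
  uri = URI(resource_url)
  request = Net::HTTP::Get.new(uri)
  request['Accept'] = 'application/json'
  response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
  metadata = JSON.parse(response.body)

  Net::HTTP.get(URI(metadata['hasXML']))  # 'hasXML' is a hypothetical key
end

tei = fetch_tei('http://scta.info/text/plaoul-lectio1')
```

Because the application asks the archive rather than hard-coding file locations, a repository migration requires updating only the archive's metadata.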
This kind of dependence would also allow for the prospect of building a dynamic, custom anthology. Suppose a reader wanted to study every available text that discusses baptism. Suppose further that these various texts were edited by several different editors and stored in various institutional repositories across the globe. Using the archive, it is possible first to search for and select all the texts that discuss baptism. Subsequently, an application can follow the metadata pointers to the location of the raw source texts and then (using more associated metadata) display these texts chronologically.
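The selection step, for example, might reduce to a single query (scta:discusses and scta:composedDate are assumed property names, not the archive's actual vocabulary):

```sparql
PREFIX scta: <http://scta.info/property/>

# Every text discussing "baptism", ordered chronologically
SELECT ?text ?date WHERE {
  ?text scta:discusses "baptism" ;
        scta:composedDate ?date .
}
ORDER BY ?date
```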
Third and finally, the ability to observe connections through the metadata archive could be used to suggest connections and topic threads to readers as they consider one text. Future development of the application aims to surface such suggestions alongside the text currently being read.
These kinds of developments — the overall separation of the text from the display system and increasing reliance on a centralized archive — are important strides toward the long-term sustainability of both the text and any particular display system. More than likely, the web application will be a continual work in progress, which is simply the nature of modern web development. Thus, through rigorous separation of the text and the platform in which it is viewed, the text can survive and remain accessible through all the vicissitudes that come with the development of a web application.
Likewise, the dependency of an application like this one on a central archive, rather than on privately stored data, means that the same metadata can serve any number of other applications, present and future.
Figure 3 provides an illustration of a comprehensive workflow from editing to web and print publication that makes use of the proposed archive.
As noted above, the technical requirements, while possessing their own challenges, are not particularly difficult to meet. With a little bit of funding, a small team of developers could easily implement and augment the workflow described above. The biggest challenge is contribution.
One obvious hurdle is getting people to edit their texts according to the TEI schema, or in a format easily convertible to TEI. Yet even if users were allowed to submit in any format, with different scripts designed to harvest the metadata from each, I imagine people would still be reluctant to allow metadata to be harvested from their ongoing work.
In response to this concern, let me conclude with a couple of pertinent facts
about the above proposal. The first is that, while it is possible for the
archive to be a place to actually host the XML files themselves, this is not its
primary function, and it is not necessary for the text to be permanently handed
over to a third party. In the end, the ideal scenario would be for the raw texts
to be deposited into an editor's own institutional repository. In this case, editors and project managers retain control over where their files actually live. The second pertinent fact is that individual editors can still control access to these texts by indicating their publication status in the projectdata.xml file.
In sum, two dominant principles continue to guide the present proposal. The first principle is a healthy appreciation of the need to support and foster collaboration. To support this collaboration, we need to create a finding aid that gives people enough access to a text to become interested in it. The second principle is an equally healthy respect for an editor's desire to control the content he or she is working on, for both development and publication purposes. I believe the proposed Sentences Commentary Text Archive can satisfy both principles at once.