Oxford
Retrieving relatives from historical data
Variation and change in relativization strategies have been well documented (e.g. Ball 1996: 46, Biber and Clark 2002, Biber, Johansson, Leech, Conrad and Finegan 1999, Johansson 2006, Lehmann 2002). Certain types of relative clause, namely that-relatives and zero relatives, were difficult to retrieve from plain-text corpora. Studies therefore relied either on manual extraction of data or on a subset of possible relativization strategies. In some text types, however, the zero relative is an important member of the class of possible relativizers. Recent advances in syntactic annotation should have made that-relatives and zero relatives more accessible to automatic retrieval. In this article, we test precision and recall of searches on a modest-sized corpus, i.e. scientific texts from ARCHER (A Representative Corpus of Historical English Registers), as a preliminary to future work on the large corpora which are increasingly becoming available. The parser retrieved some false positives and at the same time missed some relevant data. We discuss structural reasons for both kinds of shortcoming as well as the possibilities and limitations of parser adaptation.
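For readers unfamiliar with the two measures, the following minimal Python sketch shows how precision and recall relate false positives to missed items. The function name and the sentence IDs are hypothetical and are not taken from the ARCHER study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against a gold set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of retrieved items that are genuinely relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant items that were actually retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sentence IDs: what a parser query returned vs. a manual gold standard.
parser_hits = {1, 2, 3, 5, 8}
gold_relatives = {1, 2, 3, 4, 6, 8}

precision, recall = precision_recall(parser_hits, gold_relatives)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one retrieved item is a false positive (lowering precision) and two gold items are missed (lowering recall).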
A Naive Bayes classifier for Shakespeare's second-person pronoun
In order to investigate in explicit detail the way that y- and th- pronouns alternate in the Shakespearean corpus, I have undertaken a collocational analysis of the full corpus of Shakespeare's 37 plays and found that (1) second-person pronouns can be disambiguated based on context alone, (2) y- pronouns seem to be used in more formal situations or when an inferior is addressing a social better, and (3) the th- pronoun is reserved for addressing peers, servants, or other familiar personages. Through the Python Natural Language Toolkit (Bird et al., 2009, Natural Language Processing with Python. Sebastopol, CA: O'Reilly Media), I implemented a Naïve Bayes classifier that in effect treats each occurrence of a second-person pronoun as a black box that must be resolved into either a y- pronoun or a th- pronoun based only on the surrounding words. Using tenfold cross-validation, the classifier achieves an accuracy of 78.3% when fellow th- and y- pronouns are excluded from the context and 88.0% when we allow fellow th- and y- pronouns to assist in classification. Most interesting, however, are the context words that prove most informative in categorizing the pronouns. Significantly, the words most useful in classifying a pronoun as a y- pronoun include high-register words such as lordship, madam, lords, and sir. After a group of conjugated second-person verbs like art and wert, the words most associated with th- pronouns are words such as torment, nuncle, lesser, and villain. The ability to discriminate between forms based only on context confirms the hypothesis that the two classes of second-person pronoun are indeed used distinctly in the Shakespearean corpus. The list of words most helpful in making that distinction strongly suggests a difference in formality. We can also gain additional insight into the plays by examining some of the unexpected words that collocate with either one form or the other.
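The classification setup described above can be sketched roughly as follows. This is an illustrative standard-library version, not the NLTK implementation used in the study: the class, the window size, the `PRON` placeholder, and the training sentences are all assumptions, and the sentences are invented rather than quoted from the plays.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, index, window=3):
    """Lowercased words within `window` tokens on either side of position `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return [word.lower() for word in left + right]


def featurize(sentence):
    """Context words around the masked pronoun, written as PRON (a hypothetical convention)."""
    tokens = sentence.split()
    return context_features(tokens, tokens.index("PRON"))


class NaiveBayes:
    """Multinomial Naive Bayes over bags of context words, with add-one smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for words, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def classify(self, words):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denominator = (sum(self.word_counts[label].values())
                           + self.smoothing * len(self.vocabulary))
            for word in words:
                if word in self.vocabulary:  # words never seen in training are ignored
                    score += math.log(
                        (self.word_counts[label][word] + self.smoothing) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences; the pronoun itself is hidden behind PRON.
training = [
    ("I beseech PRON my lord", "y"),
    ("madam will PRON walk in", "y"),
    ("sir I thank PRON humbly", "y"),
    ("villain PRON art a knave", "th"),
    ("nuncle PRON wert a fool", "th"),
    ("PRON art a lesser villain", "th"),
]

classifier = NaiveBayes()
classifier.train((featurize(sentence), label) for sentence, label in training)

formal_guess = classifier.classify(featurize("my lord I pray PRON"))
familiar_guess = classifier.classify(featurize("PRON art a villain"))
print(formal_guess, familiar_guess)
```

The sketch omits the tenfold cross-validation reported above; with real data, the examples would be split into ten folds and accuracy averaged across held-out folds.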
The Potosi principle: religious prosociality fosters self-organization of larger communities under extreme natural and economic conditions
We show how in colonial Potosí (present-day Bolivia) social and political stability was achieved through the self-organization of society by means of repeated religious rituals. Our analysis shows that the population of Potosí developed over time a series of cycles of rituals and miracles as a response to social upheaval and natural disasters and that these cycles of religious performance became crucial mechanisms of cooperation among different ethnic and religious groups. Our methodology starts with a close reading and annotation of the Historia de Potosí by Bartolomé Arzans. Then, we model the religious cycles of miracles and rituals and store all social and cultural information about the cycles in a multirelational graph database. Finally, we perform graph analysis through traversal queries in order to establish facts concerning social networks, the historical evolution of behaviors, types of participation of miraculous characters according to dates, parts of the city, ethnic groups, etc. It is also important to note that religious activity at the group level gave native communities a way to participate in social life. It also guaranteed that the city performed its role as a producer of silver in the global economic structure of the Spanish empire. This case proves the importance of religion as a mechanism of stability and self-organization in periods of social or political turbulence. The multidisciplinary methodology combining traditional humanistic techniques with graph analysis shows great potential for other sociological, historical, and literary problems.
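As a rough illustration of what a traversal query over a multirelational graph looks like, the Python sketch below stores labeled edges in memory and follows a fixed path of relation types. It does not reproduce the authors' graph database or schema; the class, the relation labels, and the node names are placeholders invented for this example.

```python
from collections import defaultdict


class MultiRelationalGraph:
    """Directed graph whose edges carry a relation label (in-memory stand-in for a graph database)."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> list of (relation, target)

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def traverse(self, start, relations):
        """Follow the given sequence of relation labels from `start`; return the nodes reached."""
        frontier = {start}
        for relation in relations:
            frontier = {
                target
                for node in frontier
                for edge_relation, target in self.edges.get(node, ())
                if edge_relation == relation
            }
        return frontier


# Placeholder nodes and relations, not data from the Historia de Potosí.
graph = MultiRelationalGraph()
graph.add_edge("cycle_1", "includes", "ritual_A")
graph.add_edge("cycle_1", "includes", "miracle_B")
graph.add_edge("ritual_A", "performed_by", "group_X")
graph.add_edge("ritual_A", "performed_by", "group_Y")
graph.add_edge("miracle_B", "witnessed_by", "group_Y")
graph.add_edge("ritual_A", "located_in", "district_1")

# Which groups performed rituals belonging to cycle_1?
participants = graph.traverse("cycle_1", ["includes", "performed_by"])
# In which parts of the city did those rituals take place?
locations = graph.traverse("cycle_1", ["includes", "located_in"])
print(sorted(participants), sorted(locations))
```

Because every edge is typed, the same graph answers different questions simply by changing the sequence of relation labels that is followed.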
Natural language processing and early-modern dirty data: applying IBM Languageware to the 1641 depositions
This article provides an account of the steps involved in adapting IBM's Languageware natural language processing software to a large corpus of highly non-standard 17th-century documents. It examines the challenges encountered as part of this process, and outlines the approach adopted to provide a robust and reusable tool for the linguistic analysis of early modern source texts.
The liberty of invention: alchemical discourse and information technology standardization
The Chymistry of Isaac Newton project, an online scholarly edition of Newton's alchemical manuscripts, has worked to have a number of core alchemical symbols included in the Unicode standard, a standard for digital representation of characters and symbols from the world's languages, scripts, and writing systems. Our article explores the relationship between information technology standardization and humanities research. We discuss Newton's engagement with alchemy and explore the graphic dimensions of alchemical discourse. We illustrate this discussion with examples of Newton's use of alchemical symbols. We examine Unicode itself, particularly a core Unicode principle distinguishing between the abstract character and the image or glyph of the character, and we discuss the tensions between this core principle and the representation of graphic, symbolic, and pictorial discourse. We describe our experience with the Unicode proposal process and illustrate again, this time with an organizational scheme for the symbols, how the technical standardization process forced a reexamination of our historical materials. Our conclusions reemphasize the potential for mutually beneficial relationships between certain types of information technology standardization and humanities research and suggest that study of the graphic qualities of alchemical discourse, especially in light of competing theories of text represented by standards like Unicode, may contribute to our understanding of the increasingly graphic, iconic, and pictorial nature of information and communication.
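The distinction between abstract character and glyph can be seen in a small Python sketch using the standard `unicodedata` module. It reads the first few code points of the Unicode Alchemical Symbols block, which begins at U+1F700. The choice of these four code points is an assumption made for illustration; the abstract does not say which symbols the project proposed.

```python
import unicodedata

# First four code points of the Alchemical Symbols block (an illustrative selection).
code_points = range(0x1F700, 0x1F704)

# The standard assigns each abstract character a number and a name;
# how the glyph is drawn is left to whichever font renders it.
names = [unicodedata.name(chr(code_point)) for code_point in code_points]

for code_point, name in zip(code_points, names):
    print(f"U+{code_point:04X}  {name}")
```

The printed names identify the characters independently of any font, which is the property that sits uneasily with sources in which the drawn shape of a symbol itself carries meaning.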
Looking for translator's fingerprints: a corpus-based study on Chinese translations of Ulysses
This study investigates the translator's fingerprints as manifested in his/her style in translation. It reports a case study of two Chinese translations of Ulysses, adopting a corpus-based approach. The parallel subcorpora of the self-built Bilingual Corpus of Ulysses (BCU) consist of Joyce's Ulysses and its two Chinese versions produced by Xiao (1994 Tran. Ulysses, Nanjing: Yilin Press) and Jin (1997 Tran. Ulysses, Beijing: People's Literature Publishing House), respectively, and the comparable subcorpora include Xiao's original writings in Chinese. The comparison of the keyword lists shows that Xiao, the literary writer and translator, leaves some traces of lexical idiosyncrasy in his composition and translation. On the syntactic level, the comparison reveals that, owing to interference from the English language, Xiao post-positions more adverbial clauses in translation than in composition, a feature that distinguishes the translated text from non-translated original writing. This indicates that the fingerprints of the translator are left on the translated text both as a result of his/her linguistic idiosyncrasy and of the interference and constraints of the languages s/he is dealing with in translation.