DHQ: Digital Humanities Quarterly
Volume 3 Number 2
2009

Words, Patterns and Documents: Experiments in Machine Learning and Text Analysis

Shlomo Argamon  <argamon_at_iit_dot_edu>, Linguistic Cognition Lab, Dept. of Computer Science, Illinois Institute of Technology
Mark Olsen  <markymaypo57_at_gmail_dot_com>, ARTFL Project, University of Chicago
One of the emerging grand challenges for digital humanities in the next decade is to address rapidly expanding repositories of electronic text. A number of efforts, such as Google Book Search and the Bibliothèque numérique européenne, are digitizing the holdings of many of the world's great research libraries. The resulting collections will contain nothing less than, in Gregory Crane's view, the "stored record of humanity"  [Crane 2006]. This expansion beyond existing digital collections will be one of at least a couple of orders of magnitude, and will introduce a variety of new problems beyond simply scale, including heterogeneity of content and granularity of objects. The problems posed by the emerging global digital library offer opportunities for collaborative work between scholars in the humanities and computer scientists in many domains, from optical character and page recognition to historical ontologies [Argamon and Olsen 2006]. The papers presented here reflect initial collaborative work between the ARTFL Project at the University of Chicago and the Linguistic Cognition Laboratory at the Illinois Institute of Technology on one subset of the technologies required for a future global digital library: the intersection of machine learning, text mining and text analysis.
Traditional models of text analysis in digital humanities have concentrated on searching for a relatively small number of words and reporting results in formats long familiar to humanities scholars, most notably concordances, collocation tables, and word frequency breakdowns.[1] While effective for many types of questions, this approach will not scale effectively beyond collections of a relatively modest size, as result sets for even uncommon groups of words will balloon to a size not readily digestible by humans. Furthermore, this approach does not lend itself to abstract discussions of entire works, the oeuvre of an author or period, or issues related to the language of gender, genre, or ethnicity. It places the onus on the user to construct queries and assimilate results, without leveraging the capacity of machines to identify patterns in massive amounts of data.
Machine learning and text mining approaches appear to offer a compelling complement to traditional text analysis, by having the computer sift through massive amounts of text looking for "suggestive patterns." The power of modern machine learning systems to uncover patterns in large amounts of data has led to their widespread use in many applications, from spam filters to analyzing genetic sequences. The potential for using these sophisticated algorithms to find meaningful patterns in humanistic texts has also recently been noted. Drawing a link between Ian Witten's general description of data mining and the practice of literary criticism, Stephen Ramsay states that "[f]inding interesting patterns and regularities in data is generally held to be of the deepest significance." Any such findings must be approached with critical prudence, he warns, as they will contain "the spurious, the contingent, the inexact, the imperfect, and the accidental in a state of almost guaranteed incompleteness"  [Ramsay 2005, 186]. Ramsay is quite correct to point out both the potential power and the pitfalls of applying text mining to questions in the humanities. Our current work designs sets of relatively constrained experiments that apply text mining systems to specific problems in order to examine what works, what does not, and just what such results might mean.
To this end, the ARTFL Project has developed a set of machine learning extensions to PhiloLogic, our full-text search and analysis system.[2] PhiloMine replaces the notion of "searching" a database for one or more words with "task" submission. We currently envision three broad classes of "tasks": predictive text mining, comparative text mining, and clustering/similarity analysis. Predictive mining approaches are widely used in applications such as spam e-mail filters, which are trained on samples of spam and non-spam messages and then used to identify incoming junk mail. This supervised learning technique can be applied to a wide variety of tasks, such as training on the topically classified articles of the Encyclopédie and assigning those classes to unclassified articles or parts of other documents. It is common in digital humanities to work with corpora where many classifications, such as gender or nationality of author, are already known. In such cases, machine learning algorithms may be used to compare texts based on different attributes. For example, one may compare works by American and non-American Black playwrights, returning measures of how well the classification task was performed, identifying incorrectly classified documents, and listing the features (often words) most characteristic of the distinction. Finally, document similarity and clustering is an unsupervised form of machine learning, designed to statistically identify groups of documents that share common features. We are using nearest-neighbor document similarity, for example, to identify passages in one text that may have been copied from an earlier document.
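The kind of predictive ("supervised") classification described above can be sketched in miniature. Everything below is illustrative invention, not PhiloMine's implementation: the toy "articles," the two class labels, and the choice of a naive Bayes word-count model are all assumptions made for the sake of a self-contained example.

```python
import math
from collections import Counter

# Toy training data standing in for topically classified articles;
# the texts and the "history"/"geometry" labels are invented.
train = [
    ("histoire des anciens peuples et de leurs rois", "history"),
    ("les rois et les batailles de l'histoire ancienne", "history"),
    ("la geometrie des figures et la mesure des angles", "geometry"),
    ("calcul des angles et des figures en geometrie", "geometry"),
]

def fit(examples):
    """Count word occurrences per class (multinomial naive Bayes)."""
    counts, totals, priors = {}, Counter(), Counter()
    for text, label in examples:
        priors[label] += 1
        for word in text.split():
            counts.setdefault(label, Counter())[word] += 1
            totals[label] += 1
    vocab = {w for text, _ in examples for w in text.split()}
    return counts, totals, priors, vocab

def predict(model, text):
    """Assign the class maximizing the add-one-smoothed log-likelihood."""
    counts, totals, priors, vocab = model
    best, best_score = None, -math.inf
    for label in priors:
        score = math.log(priors[label] / sum(priors.values()))
        for word in text.split():
            score += math.log((counts[label][word] + 1) /
                              (totals[label] + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

model = fit(train)
print(predict(model, "une histoire des batailles des rois"))  # prints "history"
```

An "unclassified article" is then labeled with whichever class makes its words most probable; the same scheme, scaled up, underlies the spam-filter analogy in the text.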
The three papers which follow use all three approaches to attempt to shed light on specific research questions in the humanities. "Gender, Race, and Nationality..." examines how well machine learning tools can isolate stylistic or content features of authors and characters by gender, race, and nationality in a large collection of works by Black playwrights. In general, the classification results on a range of mining tasks were quite good, suggesting that these techniques can effectively distinguish, for example, the writing of male and female or American and non-American authors. In some cases, the results provide insight into the texts as literary works, but in others we found the intellectual value of the feature sets to be less interesting. We also found that, while classifying texts under binary oppositions is generally effective for the machine learning algorithms employed, doing so tends to reduce complex works and corpora to very limited sets of common features.
In "Vive la différence...", we examine a single binary classification, on gender of author in French literature predominantly from the 17th to the early 20th centuries. Using balanced male and female corpora, we found substantial agreement with Olsen's previous studies of gendered writing in published works, with our results supporting his observation of a more personal and emotional sphere of female authorship. Our results also comport with Argamon's previous work (with Koppel) on the British National Corpus, where female writing was found to be characterized by more frequent use of personal pronouns, with male writing characterized by more frequent use of determiners and numerical quantifiers. Additionally, a number of strong thematic groups of content words were found for both genders that were consistently useful in classification across the time period represented in the corpus, suggesting some enduring differences between male and female writing in the corpus.
The third paper, "Mining Eighteenth Century Ontologies...", uses predictive classification to examine the ontology of Diderot and d'Alembert's Encyclopédie. Our initial experiments in classifying the unclassified articles of the Encyclopédie led us to reconsider the coherency of the editors' classification scheme and the overall distribution of classes in the entire work. Lastly, by applying this ontology of the classes of knowledge to the Journal de Trévoux, an 18th-century scholarly journal, we were able to make several new connections between the two corpora that had previously gone unnoticed.
The power of machine learning and text mining applications to detect patterns is clearly demonstrated in these papers, yet several issues arose during this work that we believe merit discussion. The first is the surprisingly small size of the patterns detected. In all of the experiments, the systems dutifully created models to fit classes, but these were often based on quite tiny fractions of all of the available features -- a mere 60 surface words can adequately distinguish hundreds of American and non-American plays by Black authors. Similarly, we find that for both predictive classification and clustering tasks, the number of features used is a tiny fraction of all possible features. The resulting features may well reflect a "lowest common denominator" which, while perfectly adequate for specific mining tasks, may not be as useful in characterizing works in an intellectually satisfying fashion. The fact that our studies examining the issue of gendered writing arrived at similar conclusions regarding the differences between male and female writing and characterizations may thus in part be an artifact of the way learners and classifiers function. Finally, our classification tasks are generally considered to have produced a significant result when we achieve an accuracy of 70% or more, although the most successful tasks can surpass 95%. When examining the features most useful to a model, we must not assume that their importance holds for the documents whose class could not be predicted; indeed, their incorrect classification suggests that these documents may have quite different patterns of word usage.
The "lowest common denominator" problem would also appear to be related to a second concern, one which may be specific to machine learning on humanities texts. Because these experiments treat relatively small numbers of documents with very large numbers of possible features, classifiers are given a wide range of features with which to accomplish any particular task. While we used various techniques to validate results, including n-fold cross-validation and random falsification, there would appear to be some danger of obtaining results based on the construction of the task itself. Even if significant results are found, showing, for example, that classification by a particular binary opposition can be performed reliably at 80% accuracy, this in itself says little about the underlying phenomenon under investigation. A binary opposition that is thus "empirically supported" may well be an epiphenomenon, merely correlated with another underlying complex of causes that remains to be teased out. Finding such "statistical patterns" is, ultimately, merely the first step in what must be a critically well-grounded argument, supported also by evidence external to the classification results themselves.
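The two safeguards just mentioned, n-fold cross-validation and random falsification, can be illustrated with a small sketch. The data, the single "word frequency" feature, and the threshold classifier below are all invented for illustration; they stand in for, but do not reproduce, the feature sets and learners used in the papers. Shuffling the class labels and re-running the task should drive accuracy down toward chance, which is the falsification check.

```python
import random

def cross_validate(xs, ys, n_folds, fit, predict):
    """Mean accuracy over n interleaved held-out folds."""
    folds = [list(range(i, len(xs), n_folds)) for i in range(n_folds)]
    accs = []
    for held_out in folds:
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        accs.append(correct / len(held_out))
    return sum(accs) / n_folds

# Toy task: classify "documents" by one invented word-frequency feature.
def fit(xs, ys):
    # Learn a threshold halfway between the two class means.
    mean = lambda label: (sum(x for x, y in zip(xs, ys) if y == label)
                          / ys.count(label))
    return (mean(0) + mean(1)) / 2

def predict(threshold, x):
    return 0 if x < threshold else 1

xs = [0.1, 0.2, 0.15, 0.9, 0.8, 0.85, 0.12, 0.88]
ys = [0, 0, 0, 1, 1, 1, 0, 1]

print(cross_validate(xs, ys, 4, fit, predict))  # prints 1.0 on this toy data

random.seed(0)
shuffled = ys[:]
random.shuffle(shuffled)  # falsified labels: accuracy should fall toward chance
print(cross_validate(xs, shuffled, 4, fit, predict))
```

The point of the contrast is the one made in the text: a high score on the real labels is only meaningful if the same pipeline scores near chance when the labels are randomized.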
To help us argue for the general efficacy of machine learning approaches and address the concerns set forth above, we include a reaction piece by Sean Meehan, who writes about the anxieties of doing criticism by algorithm. Meehan raises the issue of distance in any critical endeavor, pointing out that interpretive analysis is always "a dynamic between tools and texts." In the end, he sounds the theme of scholarly circumspection and care that we try to bring out in all of the articles. Using machine learning tools on humanities texts requires the same understanding of the texts and degree of self-awareness that are necessary for any literary critical study.
As we hope these small-scale experiments have demonstrated, text mining and machine learning algorithms offer novel ways to approach problems of text analysis and interpretation. One can pose questions of many hundreds or thousands of documents and obtain results that are interesting and sometimes even striking. It further seems clear that text mining will be a powerful technology for making the emerging global digital library manageable and meaningful.


[1]The PhiloLogic text search and analysis package, developed at ARTFL, is one of many examples of such traditionally oriented systems. Documentation, downloads and samples are available at http://philologic.uchicago.edu/.
[2]See http://philologic.uchicago.edu/philomine/ for samples, documentation, and downloads. Not all machine learning and text mining tasks in the following papers used PhiloMine.

Works Cited

Argamon and Olsen 2006 Argamon, Shlomo and Olsen, Mark, "Toward meaningful computing," in Communications of the Association for Computing Machinery 49:4 (2006), 33-35.
Crane 2006 Crane, Gregory, "What Do You Do with a Million Books?" in D-Lib Magazine 12:3 (March 2006) [doi:10.1045/march2006-crane].
Ramsay 2005 Ramsay, Stephen, "In Praise of Pattern," Text Technology 2 (2005).