Introduction
Humanities data, for which cultural institutions such as libraries and museums are
becoming progressively more responsible, is like all data: increasing exponentially.
Many scholars have responded to this expanded access by augmenting their fields of
study with theories and practices that draw on advanced computational analysis.
The very popular Digging into Data challenge is a testament
to the wide array of perspectives and methodologies digital projects can encompass.
In particular, the first (2009) and second (2011) rounds of awards include projects
that are using machine learning and visualization to provide new methods of
discovery. Some analyze image files (“Digging into Image Data to
Answer Authorship Related Questions”) or textual data (“Mapping the Republic of Letters” and “Using Zotero and
TAPoR on the Old Bailey Proceedings: Data Mining with Criminal Intent”).
Others provide new methods for discovery with audio files by analyzing “large
amounts of music information” (the Structural Analysis of Large Amounts of Music
and the Electronic Locator of Vertical Interval Successions (ELVIS) projects) and
“large scale data analysis of audio -- specifically the spoken word” (the
“Mining a Year of Speech” and the “Harvesting Speech Datasets for Linguistic Research on the Web” projects).
[1] At
this time, however, none of these projects is looking at how we can analyze literary
texts for patterns of prosody and sound; none is looking at the sound of text as it
contributes to how we make meaning or interpret literature.
At a time when digital humanities scholars are enthusiastic about “Big Data” and are also struggling to make ties between theory
and methodology, this paper discusses theories and research tools that allow scholars
to analyze sound patterns in large collections of literary texts. For the most part,
researchers interested in investigating large collections of text are using analytics
such as frequency trending and collocation, topic modeling, and network analysis that
ultimately rely on word occurrence. The use case discussed here, which is supported
by the Andrew W. Mellon Foundation through a grant titled “SEASR
Services,”
[2] seeks to identify features other than the “word” with which to analyze
literary texts — specifically, those features that constitute sound, including
part-of-speech, accent, phoneme, stress, tone, and phrase units. To this end, this
discussion includes a case study that uses theories of knowledge representation and
research on phonetic and prosodic symbolism to develop analytics and visualizations
that help readers of literary texts to negotiate large data sets and interpret aural
and prosodic patterns in text.
In this piece, we describe how computational analysis, predictive modeling, and
visualization facilitated our discovery process in three texts by Gertrude Stein, the
word portraits “Matisse” and “Picasso” (first published in Alfred Stieglitz’s
Camera Work in 1912 and in her collection
Geography
and Plays, 1922) and the prose poem
Tender
Buttons (1914). The following discussion focuses primarily on the
theories, research, and methodologies that underpin this discovery process. First, we
discuss the theories of knowledge representation and research into phonetic and
prosodic symbolism that underpin the logics and ontologies of aurality incorporated
in this project. This basic theory of aurality is reflected in our use of OpenMary, a
text-to-speech application tool for extracting aural features; in the “flow” we
coordinated to pre-process texts in SEASR’s Meandre,
[3] a data flow environment; in the instance-based
predictive modeling procedure that we developed for the project; and in
ProseVis, the reader interface that we created to allow
readers to discover aural features across literary texts. Second, this discussion
addresses new readings of the word portraits “Matisse” and
“Picasso” and the prose poem
Tender Buttons by Gertrude Stein that have been facilitated by these
modes of inquiry. This article outlines the theoretical underpinnings and the
technical infrastructure that influenced our process of discovery such that
humanities scholars may consider the efficacy of analyzing sound in literary texts
with computational methods.
Knowledge Representation
Theories of knowledge representation can facilitate our ability to express how we
are modeling sound in a computational environment. Before defining what we mean by
“the logics and ontologies of aurality,” however, it is useful to discuss
why these definitions are necessary at all. John F. Sowa writes in his seminal
book on computational foundations that theories of knowledge representation are
particularly useful “for anyone whose job is to analyze
knowledge about the real world and map it to a computable form”
[
Sowa 2000, xi]. He defines knowledge representation as “the application of logic and ontology
to the task of constructing computable models for some domain”
[
Sowa 2000, xi]. In other words, theories of knowledge representation are transparent about
the fact that computers do not afford representations of “truth” but rather
of how we think about the world in a certain context (the
domain).
For Sowa,
logic is “pure form” and
ontology is “the content that is expressed in that
form”
[
Sowa 2000, xiii]. When developing projects that include computational analytics but lack
logic, “knowledge representation is vague,
with no criteria for determining whether statements are redundant or
contradictory;” similarly, “without ontology” (or a clear sense of
what the content represents), Sowa writes, “the terms and symbols are ill-defined,
confused, and confusing”
[
Sowa 2000, xii]. Accordingly, if researchers and developers are unclear about
what we mean and
how we mean when we seek to
represent “sound,” it is difficult for literary scholars to read or
understand the results of any computational analytics we apply to that model.
In his seminal article, “What is Humanities Computing and What
is not?” John Unsworth completes the very useful exercise of identifying
various digital humanities projects that adhere to the aspects of knowledge
representation put forth by AI scientists Davis, Shrobe, and Szolovits [
Davis et al 1993]. Namely, the authors claim that knowledge representation “can best be understood in terms of
five distinct roles it plays”
[
Davis et al 1993]. In the interest of defining and explaining our logic and ontology for this
project, we will likewise map the development of our methodology to these
same parameters. Listed below is each of the five roles that knowledge
representation plays in a project according to Davis, et al.
- A knowledge representation is most fundamentally a surrogate, a
substitute for the thing itself, used to enable an entity to determine
consequences by thinking rather than acting, i.e., by reasoning about the
world rather than taking action in it.
- It is a set of ontological commitments, i.e., an answer to the
question: In what terms should I think about the world?
- It is a fragmentary theory of intelligent reasoning, expressed in
terms of three components: (i) the representation's fundamental
conception of intelligent reasoning; (ii) the set of inferences the
representation sanctions; and (iii) the set of inferences it
recommends.
- It is a medium for pragmatically efficient computation, i.e., the
computational environment in which thinking is accomplished. One
contribution to this pragmatic efficiency is supplied by the guidance a
representation provides for organizing information so as to facilitate
making the recommended inferences.
- It is a medium of human expression, i.e., a language in which we say
things about the world.
[Davis et al 1993]
After defining the first and second roles of knowledge representation in more
detail in the first part of this piece, we aggregate a discussion of the next two
aspects in the second part. Finally, part three of this piece includes the final
role and a more comprehensive discussion of how all five roles are at play within
our specific readings of texts written by Gertrude Stein.
Intelligent Reasoning and Pragmatically Efficient Computation
The above theories in aurality and research in phonetic and prosodic symbolism
undergird the choices we have made in developing a technical, computational
infrastructure for analyzing the sound of literary texts. Shifting our attention to
consider two more of Davis’s roles of knowledge representation, as a
“fragmentary theory of intelligent reasoning” and as a
“medium for pragmatically efficient computation,” this section discusses
three essential parts of the infrastructure that represents the sound of text in our
project. First, we consider our decision to use OpenMary, a text-to-speech application
tool that extracts aural features from literary texts; next, we discuss the data
flow we developed in SEASR’s data flow environment (Meandre) to produce a
representation of the data for modeling as well as the predictive modeling procedure
we implemented to analyze patterns across these extracted features; and finally, we
introduce ProseVis, the reader interface we created to
allow readers to discover and interpret these extracted aural features and patterns
in conversation with (not a replacement for) the literary texts.
OpenMary
In this project, we use OpenMary,
[6] an open source, text-to-speech system,
to create a text-based surrogate of sound. Developed as a collaborative project of
Das Deutsche Forschungszentrum für Künstliche Intelligenz (German Research Center
for Artificial Intelligence) Language Technology Lab and the Institute of
Phonetics at Saarland University, OpenMary captures information about the
structure of the text that makes it possible for a computer to read text in
multiple languages (German, British and American English, Telugu, Turkish, and
Russian; more languages are in preparation) and to create spoken text. We chose
OpenMary as a useful analytic routine for analyzing these texts after first
parsing our texts against the CMU (Carnegie Mellon University) Pronouncing
Dictionary and then validating sections of the Mary XML Output against
human-parsed sections. In a simple comparison based on analyzing Gertrude Stein’s
novel
The Making of Americans, we noticed that many
“unknown” words were returned in the CMU comparison. That is, many of the
words in Stein’s lexicon, though common, were returned as
“unknown” in the results (such as “insensibility,”
“meekness,”
“well-meaning,” and “slinks”). OpenMary’s recommendation, on the other
hand, incorporates a “best guess” model in any given prosodic situation —
that is, it is based on an algorithm or a set of stringent rules that draws on the
kind of research that Tsur, Bolinger and others have mapped for how we make
meaning with sound, which includes part-of-speech, accent, phoneme, stress, tone,
the position of a word in a phrase (e.g., consecutive verbs or multiple nouns),
sentence type (e.g., a declaration or a question), and information structure
(e.g., given and inferable information in a dependent clause is frequently
de-accented) [
Becker et al 2006].
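The CMU validation step described above is straightforward to reproduce in outline. Below is a minimal sketch in Python, assuming NLTK’s packaged copy of the CMU Pronouncing Dictionary (it requires a one-time nltk.download("cmudict")); this is not the project’s actual comparison pipeline, so the tokenization rule and the example sentence are illustrative only.

    # Sketch: flag a text's vocabulary that is missing from the CMU
    # Pronouncing Dictionary, as in the validation step described above.
    import re
    from nltk.corpus import cmudict

    PRONOUNCING = cmudict.dict()  # word -> list of ARPAbet pronunciations

    def unknown_words(text):
        """Return the sorted vocabulary not found in the CMU dictionary."""
        tokens = set(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
        return sorted(word for word in tokens if word not in PRONOUNCING)

    sample = "Insensibility and meekness; the well-meaning one slinks."
    print(unknown_words(sample))  # hyphenated forms, for instance, often miss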
The documentation explains OpenMary’s system for Natural Language Processing
(NLP):
In a first NLP step, part of speech
labelling [sic] and shallow parsing (chunking) is performed. Then, a lexicon
lookup is performed in the pronounciation [sic] lexicon; unknown tokens are
morphologically decomposed and phonemised by grapheme to phoneme (letter to
sound) rules. Independently from the lexicon lookup, symbols for the
intonation and phrase structure are assigned by rule, using punctuation,
part of speech info, and the local syntactic info provided by the chunker.
Finally, postlexical phonological rules are applied, modifying the phone
symbols and/or the intonation symbols as a function of their
context.
[MARY TTS]
Further intelligent reasoning is reflected in OpenMary’s folksonomic technique for
representing words that are not in the CMU Pronouncing Dictionary lexicon; this
technique involves generating a lexicon of known pronunciations from the most
common words in Wikipedia and allowing developers to enter new words manually
(“Adding support for a new language to MARY TTS”). OpenMary will make a
“best guess” at words that are not part of the CMU lexicon because its
rule set or algorithm — its “intelligent reasoning” — for how
it generates audio files is based on the research of both linguists and computer
scientists. As such, this highly technical description speaks to the deeply
interdisciplinary work that has formed the rules by which OpenMary represents the
sound of literary texts in a digital file — as an interface between
human-perceived rules for reading and methods for machine processing.
As a byproduct of this process, OpenMary outputs a representation of the sound of
text in XML that reflects a set of possibilities for speech that are important
indicators of how the text could potentially be read aloud by a reader.
Specifically, OpenMary accepts text input and creates an XML document (MaryXML) as
output with attributes like those shown in Figure 1. This example represents the
phrase “A kind in glass and a cousin, a spectacle and nothing strange” from
Gertrude Stein’s text Tender Buttons.
As shown above, sentences (<s>) are broken into prosodic units and then
phrases (<prosody> and <phrase>), which are, in turn, broken into
words or tokens (<t>). These word elements hold the attributes that mark
“accent”, part of speech (“pos”), and “ph” — phonetic spellings
(transcribed in SAMPA) broken into what we refer to as “sounds” separated
by “–”, with an apostrophe (“ ' ”) preceding stressed syllables. Other
information is included at the phrase level such as “tone” and
“breakindex.”
[7]
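For readers who wish to experiment, the following sketch shows one way to request this MaryXML from a locally running MARY TTS (OpenMary) 5.x server and to pull out the token-level attributes just described; the endpoint and parameter names follow the MARY 5.x HTTP interface and may differ in other versions, and the helper names are ours, not the project’s.

    # Sketch: query a local MARY TTS (OpenMary) server for MaryXML and
    # extract per-token attributes (pos, accent, ph) as described above.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    MARY_URL = "http://localhost:59125/process"  # default MARY TTS 5.x port

    def mary_phonemes(text, locale="en_US"):
        params = urllib.parse.urlencode({
            "INPUT_TEXT": text,
            "INPUT_TYPE": "TEXT",
            "OUTPUT_TYPE": "PHONEMES",  # MaryXML with pos/accent/ph attributes
            "LOCALE": locale,
        })
        with urllib.request.urlopen(MARY_URL + "?" + params) as response:
            return response.read()

    def token_features(maryxml_bytes):
        """Yield (word, pos, accent, ph) for each token (<t>) element."""
        root = ET.fromstring(maryxml_bytes)
        for elem in root.iter():
            if elem.tag.endswith("}t"):  # tolerate the MaryXML namespace
                yield ((elem.text or "").strip(), elem.get("pos"),
                       elem.get("accent"), elem.get("ph"))

    xml_out = mary_phonemes("A kind in glass and a cousin, a spectacle and nothing strange.")
    for word, pos, accent, ph in token_features(xml_out):
        print(word, pos, accent, ph)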
Meandre Data Flow Environment
The SEASR (Software Environment for the Advancement of Scholarly Research) team at
the University of Illinois at Urbana-Champaign has been working on creating a
computational environment in which users who are interested in analyzing large
data sets can develop data flows that push these data sets through various textual
analytics and visualizations.
[8] This environment, called Meandre, provides tools for assembling and
executing data flows. A data flow is a software application consisting of software
components that process data. Processing can include, for example, an application
that accesses a data store, one that transforms the data from that store and
analyzes it with textual analysis, and one that visualizes the transformed
results. Within Meandre, each flow is represented as a graph that shows components
as icons linked through their input and output connections (see Figure 2). Based
on the inputs and properties of a component, an output is generated upon
execution. Meandre provides basic infrastructure for data-intensive computation by
providing tools for creating, linking, and executing components and flows. As
such, Meandre facilitates a user’s ability to choose how her information will be
organized and ultimately the kinds of inferences that can be made from the
resulting data.
The ability to explore a text’s aurality was not represented within SEASR until we
added a Meandre component to use OpenMary (shown as the green box module in Figure
2). Meandre components were used to segment the book into smaller chunks of text
before passing it to OpenMary for feature extraction, because sending large
amounts of text to OpenMary created memory problems associated with processing the
complete document. Consequently, the flow processes each document in our
collection through the OpenMary web service at a paragraph level. Meandre is also
used to create a tabular representation of the data (see Figure 3). The features
represented from the MaryXML are part of speech, accent, phoneme, stress, tone,
and break index, because research shows that these features have a significant
impact on how we make meaning with sound.
[9] We also include
information that is useful in terms of framing the context of the sounds within
the document’s structure (chapter id, section id, paragraph id, sentence id,
phrase id, and word id). This allows words to be associated with accent, phoneme,
and part-of-speech within the context of the phrase, sentence and paragraph
boundary. Figure 2 shows the flow with the components that are used for executing
OpenMary and for post-processing the data to create the database tables. Green
components perform computation (e.g., the OpenMary processing component), blue
components perform transformations (e.g., an XSL transformation), red
components handle input (e.g., loading the XML file), the dark gray
component handles output (e.g., writing a file), and yellow components manage
control flow (e.g., forking, which duplicates an output). Another
benefit of creating this flow in Meandre is that readers who wish to analyze these
results or who wish to produce data for their documents will have access to the
same flow
[10].
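The following sketch mirrors the shape of that flow, assuming an extraction function like the token_features helper sketched earlier (both names are ours, not Meandre’s): chunk a document into paragraphs so that each OpenMary request stays small, then write one row per word into the tabular representation. The column names are illustrative; the project’s real table also carries chapter, section, sentence, phrase, stress, tone, and break-index columns.

    # Sketch: paragraph-level chunking and tabular output, echoing the
    # Meandre flow described above (not the actual Meandre components).
    import csv

    def paragraphs(text):
        """Naive blank-line splitter; Meandre's segmentation may differ."""
        return [p.strip() for p in text.split("\n\n") if p.strip()]

    def build_table(doc_text, token_features_for, out_path="aural_features.csv"):
        """token_features_for: paragraph -> iterable of (word, pos, accent, ph)."""
        with open(out_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["paragraph_id", "word_id", "word", "pos", "accent", "ph"])
            for p_id, para in enumerate(paragraphs(doc_text)):
                for w_id, (word, pos, accent, ph) in enumerate(token_features_for(para)):
                    writer.writerow([p_id, w_id, word, pos, accent, ph])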
Once the features for aurality were extracted for a collection of documents, we
wanted to compare the aurality between the documents and identify the documents
that had similar prosody patterns. This comparison was framed as a predictive
problem, where we used the features from one document to predict similar
documents. We developed an instance-based, machine-learning algorithm for the
predictive analysis that can be broken into the following steps:
-
Defining a prediction problem:
Our hypothesis is that
several books in our collection have similar prosody patterns and should
“sound” more alike.
-
Defining examples for machine learning:
Figure 4 shows the
process we follow to create “examples” for machine learning, starting
with the OpenMary output, and transformation to a database table in Meandre.
Next, we use our predictive analysis algorithms to derive a “symbol”
from the OpenMary output at the sound level (i.e., each row of the tabular
data). This symbol is an id that represents a unique combination of
just those features we associate with prosody including part
of speech, accent, stress, tone, and break index. There are over six
thousand symbols because there are over six thousand combinations of these
attribute values. Once symbols are defined, we create a moving window — a
phrase window — across the sounds to create the examples we use for
comparison. We define the window size of this phrase window to be the
average phrase length produced by the OpenMary analysis. We select the
average length of a phrase in the data set not in order to maximize
classification accuracy, but in order to best simulate how readers perceive
sound at the phrase level (Soderstrom et al.). Shorter or longer phrase
windows are possible, and window size does affect accuracy — these choices,
again, reflect the “intelligent reasoning” and “pragmatically
efficient computation” aspects of knowledge representation that Davis
et al. have identified.
For our collection, the size of the phrase window is fourteen, so the set of
input features is the fourteen symbol ids for the given phrase. In total,
there are 1,434,588 phrase windows of fourteen symbols from nine books.
Finally, we added the “class” attribute, which is an id for the book in
which the phrase window exists. The class attribute (the book) is the
attribute that we predict.
-
Modeling:
For the predictive analysis we use an
instance-based approach, which is based on learning algorithms that
systematically compare new problem instances with training instances. In this
project, we use a full, leave-one-out cross validation. That is, for each
prediction, the phrase window is compared to each phrase window from all
other books.
[11] The prediction is the probability that
the phrase window is in each class (or book), so the probability
distribution over all classes sums to 1.0 (as seen by the row values of the
bottom table in Figure 4). In our collection, there are nine class
attributes, one for each book.
To predict the book in which a given phrase window exists, the phrase window is
compared to all phrase windows from all other books by computing a
distance function (a simplified sketch of this procedure follows this list).
[12] In order to build the best prosody
model, one must systematically optimize the control parameters of the
machine-learning algorithm to maximize accuracy. There are over one million
phrase windows that need to be compared with each other, requiring roughly
two trillion window comparisons (about twenty-eight trillion symbol
comparisons, at fourteen symbols per window). This amount of computation needs
to be done for each bias parameter setting considered during bias
optimization. Each bias setting is a new experiment run with different parameters:
adding a book introduces a new parameter, and changing the phrase window from
fourteen to fifteen constitutes a new bias.
[13]
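A compact sketch of this procedure follows. The distance function and the nearest-neighbour voting below are simplified stand-ins rather than our actual implementation, but the structure is the one just listed: symbol ids for unique prosodic feature combinations, fourteen-symbol phrase windows, leave-one-out comparison against all other books, and a probability distribution over classes that sums to 1.0.

    # Sketch: symbols, phrase windows, and leave-one-out instance-based
    # prediction, simplified from the procedure described in the list above.
    from collections import Counter

    WINDOW = 14  # average OpenMary phrase length across our collection

    def to_symbols(rows, symbol_table):
        """rows: (pos, accent, stress, tone, break_index) tuples, one per sound."""
        return [symbol_table.setdefault(tuple(r), len(symbol_table)) for r in rows]

    def phrase_windows(symbols):
        """Slide a WINDOW-wide phrase window across a book's symbol sequence."""
        return [tuple(symbols[i:i + WINDOW]) for i in range(len(symbols) - WINDOW + 1)]

    def distance(a, b):
        """Toy distance: positions at which two windows hold different symbols."""
        return sum(x != y for x, y in zip(a, b))

    def predict(window, corpus, own_book, k=25):
        """corpus: book -> list of phrase windows. Returns P(book | window)."""
        neighbours = sorted(
            (distance(window, other), book)
            for book, windows in corpus.items() if book != own_book
            for other in windows
        )[:k]
        votes = Counter(book for _, book in neighbours)
        total = sum(votes.values()) or 1
        return {book: votes[book] / total for book in corpus}

Here corpus maps each book to its list of phrase windows; running predict over every window of a book and averaging the resulting distributions would yield rows like those in the bottom table of Figure 4.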
We describe these extensive processes to show that intelligent reasoning and
pragmatically efficient computation require substantial processing power.
As such, these are not experiments that can be run on a home computer. Changing
the way we analyze text (moving away from the grapheme and towards the phoneme) is
complicated by the need to collaborate across disciplinary realms (such as an
English Department or School of Information collaborating with a Supercomputing
Center and Visualization Lab). Further, the results produced by these processes
constitute another large body of data that must be made comprehensible to
readers or scholars interested in analyzing sound patterns in text.
ProseVis
An essential aspect of this project is ProseVis, a visualization tool we developed
to allow a reader to map the features extracted from OpenMary and the predictive
classification data to the words in the contexts with which readers are
familiar.
[14] We developed this project with the ultimate goal of
facilitating a reader’s ability to analyze sonic features of text, and research has
shown that mapping the data to the text in its original form allows for the kind
of reading that literary scholars engage in: they read words and features of language
situated within the contexts of phrases, sentences, lines, stanzas, and paragraphs
[
Clement 2008]. Recreating the contexts of the word not only
allows for the simultaneous consideration of multiple representations of knowledge
or readings (since every reader’s perspective on the context will be different)
but it also allows for a more transparent view of the underlying data. If a reader
can see the data (such as sounds and parts of speech) within the contexts of the
text with which they are familiar and well-versed, then the reader is empowered
within this familiar context to read what might otherwise be an unfamiliar,
tabular representation of the text.
Using the data produced by Meandre, ProseVis highlights features of a text. Figure
5, for example, shows two short prose pieces by Gertrude Stein called “word
portraits” and titled “Matisse” and “Picasso.” Stein’s word portraits were writing projects in
which character development progresses without narrative, much like still-life
portraits of a person that also “do not
tell a story”
[
Stein 1988c, 184]. Rather, portraits provide a telling
snapshot in time. Stein draws the comparison to portraits because her attempt to
create written portraits was much like what she considered a painter’s aim ought to be
— to create “a picture that exists for and
in itself” using “objects
landscapes and people” without being “deeply conscious of these things as a
subject”
[
Stein 1990, 497]. In this first ProseVis example, we see Stein’s portraits “Matisse” and “Picasso” rendered
as a series of rows with colored blocks. Because ProseVis maintains a list of
unique occurrences of each attribute extracted by OpenMary, the reader can choose
to color the visualization by any of these attributes such as part-of-speech,
tone, accent, word, and phonetic sound. (Figure 6 shows the same information,
zoomed to illustrate how the words are legible beneath the blocks of color.) The
panel on the right is the control panel where the reader can choose how to display
the text and prosody features. Options include the ability to show lines by
phrases or sentences or chapter or stanza group. In these ways, the reader can
examine prosodic patterns as they occur at the beginning or end of these text
divisions.
When visualizing the text at the sound level, we encountered two primary issues:
(1) the set of unique sounds in a given text is too large to assign each one an
easily discernible color in the display; and (2) when doing a string-based comparison
of one complete sound to another, it is not possible to detect subtler, and
potentially critical, similarities that form patterns such as alliteration and
rhyming.
[15] To address the second
issue, we broke each syllable down into three primary constituents, and allowed
for the display to target each of these constituents individually. The
constituents that we identified as the most informative were the leading consonant
sound, the vowel sound, and the ending consonant sound:
[16]
Word   | Sound     | Lead Consonant | Vowel Sound | End Consonant
Strike | s tr I ke | S              | AI          | k
This breakdown provides the reader with a finer-grained level of analysis, as well
as a simplified coloring scheme. As a result, if the reader chooses to color the
visualization by the sound, they have the additional option of coloring by the
full sound, or by a component of sound such as a leading or ending consonant or a
vowel sound. Figure 7, Figure 8, and Figure 9 show these alternate views. Further,
instead of text, a reader can render all the words as phonetic spellings (“sound”)
or parts-of-speech (“POS”), or take out the underlying information altogether
(to leave just color).
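As an illustration of this breakdown, the sketch below splits a SAMPA syllable around its vowel. The vowel inventory is a simplified subset, and where ProseVis targets a single leading consonant sound, the sketch returns the whole onset cluster; treat it as a rough analogue rather than ProseVis’s actual rules.

    # Sketch: split a SAMPA syllable into leading consonant(s), vowel
    # nucleus, and ending consonant(s), echoing the table above.
    import re

    # Rough SAMPA vowel/diphthong patterns (illustrative, not exhaustive).
    VOWEL = re.compile(r"aI|aU|OI|eI|@U|oU|i:?|I|E|\{|A:?|O:?|U|u:?|V|@|r=")

    def constituents(syllable):
        """Return (lead consonant, vowel sound, end consonant) for one syllable."""
        match = VOWEL.search(syllable)
        if match is None:  # no vowel found: treat the whole syllable as onset
            return syllable, "", ""
        return (syllable[:match.start()],  # leading consonant cluster
                match.group(0),            # vowel nucleus
                syllable[match.end():])    # ending consonant cluster

    print(constituents("straIk"))  # -> ('str', 'aI', 'k') for "strike"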
Finally, under the “Comparison” menu, readers can see
the predictive modeling data layered on top of the text. Here, each color
represents a different book (listed on the right) and each sound is highlighted
according to which book it is most like. When all the books are selected, the
color reflects which book has the highest probability or comparison for a given
sound.
[17]
Reading the portraits “Matisse” and “Picasso” and Tender Buttons in
ProseVis
As discussed, one of Gertrude Stein’s early modes of experimentation was to create
word portraits in the modernist mode. At the same time, she sensed an immediate
connection between the acts of speech (talking and listening) and her work creating
portraits of people in words. Derrida minimizes the distinction between writing and
speech or voice (and therefore sound) by showing how both are perceived by the
différance that is signification. In reading Gertrude Stein’s
work, however, Scott Pound argues that “Derrida obscures a distinction between
written and spoken language that a discussion of poetics cannot do without.
Poststructuralism’s demonstration of the difference writing makes must
therefore be set in relation to the difference sound makes”
[
Pound 2007, 26–27]. As discussed in the first part of this essay, in order to investigate the
difference that sound makes, we are transparent about the fact that a representation
of sound is subjective. What is most significant for this discussion, therefore, is
Derrida’s claim that “[i]n order to function, that is, to be
readable, a signature must have a repeatable, iterable, imitable form; it must
be able to be detached from the present and singular intention of its
production”
[
Derrida 1991, 106]. In this project, the form of the “signature” or sound of text is the
iterable, repeatable data that is produced by computational analysis.
Further, we can imagine this imitable form as a layer of data (a reading or another
“text”) that we are using as an overlay on the “originary” text as a
means or a lens to read the literary text differently. This “new” perspective on
Stein’s texts is not only important for understanding her creative work; it is
important for reconsidering what we have learned not to consider. For instance, Craig
Monk argues that Gertrude Stein lost favor with Eugene Jolas, founding editor of
transition, for political and personal reasons. Yet, the history can
be and has been read differently: that Jolas preferred James Joyce’s writing to
Stein’s because Joyce was held up as the revolutionary writer of his time. According
to Monk, Jolas threw down a gauntlet in 1929 when he published his “The Revolution of the Word Proclamation” (issue 16/17 of
transition). This revolution, Jolas writes, requires “[t]he
literary creator” or writer “to disintegrate the primal matter of words
imposed on him by textbooks and dictionaries” (“Introduction”). Joyce,
argues Monk, epitomized Jolas’s revolution with his “neologisms and portmanteau words”
[
Monk 1998, 29] while Stein’s “little household words so dear to Sherwood Anderson never
impressed [Jolas]” ([unpublished autobiography], 201 qtd. in Monk 32). As a
result, while Jolas would publish much of Joyce’s work including a serialization of
his ‘Work in Progress’ (which subsequently became
Finnegans
Wake) as well as essays by prominent authors who wrote about Joyce’s work,
he only published one more piece of Stein’s after his 1929 manifesto. Finally, in
1935, Jolas publicly denounced her writing in a
Testimony
Against Gertrude Stein (a supplement to
transition 23, July 1935).
Perhaps the most salient observation Monk makes for this discussion is his conclusion
that “it was only as Jolas’s preference for the
verbal in poetry began to emerge clearly that the discussion of the visual
analogies used often to describe Stein’s works might be read, in hindsight, as
implicitly derogatory”
[
Monk 1998, 30].
In fact, the idea that James Joyce’s mode of experimentation incorporated elements
from music while Stein’s works, in contrast, reflected influences from the visual
arts has been debunked and explored and complicated by too many scholars to rehearse
again in this space, but the fact remains that as a culture, we are not far removed
from the situation in which
transition’s audience found itself: we have
been summarily prohibited from reading sound patterns by a system of production that
favors one mode of interpretation over another: the grapheme over the phoneme. Sound
patterns are difficult to discern. Using computational analysis to mark (to make
imitable and repeatable and
visual) expressions that correspond to sound
is a step in attempting to discern the relationships between all the various features
of text that contribute to meaning. Stein describes this confluence of features this
way: “I began to wonder at about this time
just what one saw when one looked at anything really looked at anything. Did
one see sound, and what was the relation between color and sound, did it make
itself by description by a word that meant it or did it make itself by a word
in itself”
[
Stein 1988c, 91].
Primarily, the last part of this discussion is an exploration, using ProseVis and the
data from OpenMary and Meandre’s processes, of sound patterns in Gertrude
Stein’s portraits “Matisse” and “Picasso” and her prose poem
Tender Buttons.
In this exploration, we are concerned above all with Davis’s final role of knowledge
representation, namely as “a medium of human expression, i.e., a
language in which we say things about the world”
[
Davis et al 1993]. What is at stake in this section is not to create new readings of Stein’s
texts (this would take much more deliberation and space) as much as it is to
demonstrate how we have come to analyze literary texts in digital environments as
visual texts that are divorced, quite often, from attributes of sound. In
computational environments, productive and critical representations of knowledge
should show a consideration for the multiplicity of ways that humans express and
understand themselves through how we say things about the world with literature.
Considering how to represent and analyze the sound of text in these readings
represents a step towards pushing computational discovery practices past singular
representations of the word and, thus, singular modes of interpretation.
“Matisse” and “Picasso”
(1912)
The relative success of Stein’s methods for creating the rhythm of a character is
evident in the response of scholars. Wendy Salkind argues that with her portraits
“Matisse” and “Picasso”,
Stein expresses a “disenchant[ment] with Matisse and his painting” and a
sustained “belief in the genius of Picasso.” In particular, Salkind notes the
ways in which sounds and rhythms work to create these readings:
We can hear that adulation and disappointment in the phrase
repetitions she uses in both pieces. She writes about the effort of creating
art, the struggle to be constantly working, to be consistently expressing
something, and to find greatness among your followers… When the Picasso description above is spoken
aloud, the repetition of the “w” sound continuously brings
your lips forward, as if in a kiss. The monosyllabic sentence flows and
arrives on the emphasis of the double syllable resolution of
the final word. Although also monosyllabic, the Matisse description is pedestrian, lacking
fluidity. When spoken, her words describing him don't feel
nearly as good in your mouth.
These same patterns are evident in the ProseVis visualization in Figure 7 in which
the beginning consonant sound “w” of words like “was,”
“one,” and “were” is represented in red. Clearly, there are fewer
concentrated patterns in the “Matisse” portrait on the
left than in the “Picasso” portrait on the right, but
“Matisse” has 283 “w” sounds out of 2129 total
sounds (roughly one in every 7.5 sounds), which in absolute terms is actually more than “Picasso,”
which has 271 “w” sounds out of 1246 sounds (roughly one in every 4.6 sounds, a denser proportion). The visualization
suggests that rather than volume of sounds, Salkind’s reading may have more to do
with the close repetition of the “w” sounds in “Picasso” — the successive opening and closing of the lips to make these
sounds could mimic kissing more readily than the sporadic lone “w” sounds
used in “Matisse.” Further, if we color the text
according to the “accent” data (see Figure 11 and Figure 12), we see the dark
blue areas that indicate high pitch or accented words that are more prevalent in
“Picasso” than in “Matisse.” These representations
invite us to ask more questions: what is accent doing in the text to contribute to
readings like the one Salkind proposes? What is the role of sound and prosody in
this text?
Other comparative patterns are clear as well. In Figure 5, in which full sounds
are represented, and each vertical line is a phrase, there is an inversion of
patterns between the two pieces. In “Picasso,” phrases
(represented in each line) begin with the yellow/red pattern and evolve into the
blue/green pattern at the end of the phrase (or line). The reverse is true of the
color sequences in “Matisse.” A closer look at these
patterns in Figure 6 shows that Stein starts phrases about Picasso with specific
referents to him such as “This one,” while phrases about Matisse begin with
more general referents such as “Some” as in “Some of a few.” Conversely,
while the “Picasso” phrases evolve into expressing an
abstract notion of a thing, as in “something,” and end with the specific
reference to him again in “one,” the “Matisse”
phrases start with vague language (“Some”), get more specific in the
middle of the phrase (referring to “he”), and end with vague terms
referring to an abstract “thing.” These patterns of expression are
highlighted in this visualization because they are emphasized by certain sounds.
In the “Picasso” phrases, ending sounds are ones
created by first opening and then closing the lips, such as the “o” and
“m” and “n” sounds in “some” and “one” — although with
the “om” sound, the throat remains open. The prevalent “Matisse” sounds are
ones a reader would make by beginning with closed lips and ending with widened or more
opened lips, such as the “i” and “e” sounds in “thing” and “he”
— in this case the reader is closing off the breath, squeezing it with her mouth.
One could argue that Stein’s play with sounds shows how she represents these
artists: the “Picasso” sounds are open, contributing to
the sense of “fluidity” upon which Salkind remarks; the “Matisse” sounds, on the other hand, shorten the breath and restrict the
mouth’s movement into the next sound. The visualizations facilitate our ability to
examine how the words in context correspond to these sound patterns.
Tender Buttons (1914)
For our predictive modeling study, we compared the sounds of Gertrude Stein’s
Tender Buttons to that of
The
New England Cook Book
[
Turner 1905]. Margueritte S. Murphy hypothesizes that
Tender Buttons
“takes the form of domestic guides to
living: cookbooks, housekeeping guides, books of etiquette, guides to
entertaining, maxims of interior design, fashion advice”
[
Murphy 1991, 389]. By writing in this style, Murphy argues, Stein “exploits the vocabulary, syntax, rhythms, and cadences of
conventional women's prose and talk” to “[explain] her own idiosyncratic
domestic arrangement by using and displacing the authoritative discourse of
the conventional woman's world”
[
Murphy 1991, 383–384]. Murphy cites
The New England Cook Book
(
NECB) as a possible source with which to compare
the prosody of
Tender Buttons:
Toklas, of course, collected recipes, and
she later published two cookbooks, The Alice B. Toklas
Cookbook (1954) and Aromas of Past and
Present (1958). Through Toklas then, at least, Stein was familiar
with the genre of the cookbook or recipe collection and would appropriately
“adopt” and parody that genre in writing of their growing intimacy.
Significantly, Toklas’s name as “alas” appears repeatedly in “Cooking” as well as elsewhere in Tender Buttons.
[Murphy 1991, 391]
It is immediately clear from a simple frequency analysis that the word “Alas”
appears only in the one “Cooking” section of Tender Buttons, although it appears there thirteen times.
In order to analyze whether the texts had similar prosodic elements, however, we
attempted to make this comparison evident with predictive modeling.
To focus the machine learning on this hypothesis, we chose nine texts for
comparison and only used features that research has shown reflect prosody such as
part of speech, accent, stress, tone, and break index. The nine texts we chose
were “Picasso” (1912), “Matisse” (1912),
Three Lives (1909),
The Making of Americans (1923),
Ulysses by James Joyce (based on the pre-1923 print
editions),
The Iliad, translated by Andrew Lang,
Walter Leaf, and Ernest Myers (1882),
The Odyssey,
translated by S.H. Butcher and Andrew Lang (1882), and of course,
Tender Buttons (1914) and
The New
England Cook Book (1905).
[18] The features we drew from the
OpenMary data do not include the word or the sound itself. The break index, which marks
the boundaries of syntactic units such as an intermediate phrase break, an
intra-sentential phrase break, a sentence-final boundary, and a paragraph-final
boundary, is particularly important because readers (and correspondingly the
OpenMary system) use phrasal boundaries to determine the rise and fall or emphases
of particular words based on their context within the phrase (Soderstrom et al.).
As mentioned, to further bias the system towards the manner in which readers make
decisions on sound, we selected a window size that represented the average size of
a phrase across the nine texts. We also hypothesized, in order to measure the
tool’s efficacy, that
The Odyssey and
The Iliad are most similar.
First, we defined a prediction problem for machine learning to solve: Predict from
which book the window of prosody features comes. Figure 13 visualizes the results
of our predictive analysis. The analysis results show that machine learning makes
the same similarity judgment that Murphy had made: Tender
Buttons and NECB sound more similar to
each other than they do to any of the other books in the set. In the
visualization, row four shows the results for Tender
Buttons in which the analysis has chosen NECB as the matching text more often (at 14%) than any other book
including others by Stein (“Picasso” is chosen 10% of
the time). As well, row two, which shows the results for NECB, shows that Tender Buttons is chosen
more often (10%) than any other book when trying to predict the actual class or
the book itself. Another prediction that shows the algorithm’s success is
expressed in rows six and seven, which indicate that the computer confuses The Odyssey and The Iliad –
texts that are known to be very similar in terms of prosody — the highest
percentage of times. Interestingly enough, while both the results for the Iliad and the results for the Odyssey show a high correlation with the cookbook (14% and 12%
respectively), neither the results for Tender Buttons
or for the cookbook show a high correlation with Homer’s texts. This seems to
indicate that the aspects of Tender Buttons and the
cookbook that make them sound like each other are not those that make the cookbook
sound like Homer’s texts. Further, the fact that Tender
Buttons shows very little correlation with the texts that are seen as similar
to the NECB suggests that its consistent correlation with the cookbook itself
is a much stronger match than the raw percentages might otherwise indicate.
Using the ProseVis interface, we can see within the context of the text where
these associations have been made. Figure 14 shows Tender
Buttons and NECB in ProseVis. In both
panes, each sound is highlighted according to which book it is most like. When all
the books are selected, the color of the book that has the highest probability for
a given sound is shown. As well, sounds are brighter or less so depending on the
level of probability. Figure 15 shows Tender Buttons
compared to “Picasso” with only a subset of texts
“turned on” including NECB, The Making of
Americans, “Matisse,” and Three Lives. In this view, Tender
Buttons again shows a comparatively higher number of NECB (pink) matches than the shorter “Picasso” text. In the close-up view in Figure 16, it is easier to see
the lighter and darker shades of pink (in the line “alas the back shape of
mussle”) and yellow (in the line “alas a dirty third alas a dirty
third”). The darker shades indicate that the probability that the sound is
more like Tender Buttons or NECB is greater.
[19]
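The highlighting rule just described — color each sound by the most probable book, with brightness scaled by that probability — can be sketched as follows; the colors and the dimming rule are illustrative choices of ours, not ProseVis’s actual palette.

    # Sketch: map a per-sound probability distribution over books to a
    # display color, as in the Comparison view described above.
    BOOK_COLORS = {
        "Tender Buttons": (255, 255, 0),               # yellow (illustrative)
        "The New England Cook Book": (255, 105, 180),  # pink (illustrative)
    }

    def highlight(probabilities):
        """probabilities: book -> probability for one sound (sums to 1.0)."""
        book = max(probabilities, key=probabilities.get)
        p = probabilities[book]
        # Fade the winning book's color toward white as probability drops.
        color = tuple(int(c * p + 255 * (1 - p)) for c in BOOK_COLORS[book])
        return book, color

    print(highlight({"Tender Buttons": 0.3, "The New England Cook Book": 0.7}))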
The story the visualizations tell is two-fold. First, these visualizations are
useful in allowing us to test or generate hypotheses about prosody and sound in
the texts. Figure 16 shows us that the section surrounding “alas” is, in
fact, more like NECB than the other books, supporting
Murphy’s hypothesis that the area around “alas” has the rhythm of NECB. At the same time, we can see in Figure 15 that not all
of the sections of Tender Buttons are
most like NECB. In this figure,
the top of the view is colored red, grey, and light blue, indicating that this
area is more like Three Lives, The Making of Americans, and “Matisse”
respectively. A subsequent research question could concern the nature of this
difference. Further, Figure 17 shows two views of Tender
Buttons visualized in ProseVis. On the left, the same list is divided
into two colors showing that half the list is correlated with NECB and half the list is more strongly correlated with The Making of Americans. On the right, there is another
list with half of the list correlating more strongly to the Odyssey and the bottom half correlating to “Picasso.” This visualization immediately engenders questions concerning
why the first part of the list is different from the second half when the two
halves seem remarkably similar.
Second, other questions and hypotheses may be raised concerning how the algorithm
and ProseVis work together to generate these visualizations. These latter kinds of
questions can be considered in terms of the data sets and the documentation we are
providing, as well as with respect to articles such as this one. In other words, the
goal is not accurate text identification using prosody features, but rather to
test hypotheses that consider the sound and prosodic similarities of texts. Part
of what interests us in the digital humanities is the mistakes we perceive the computer to make
and what these errors reveal about the algorithms we are using to
gauge the significance of textual features. In other words, one benefit to
scholarship represented in this research is determining where the model breaks
down and where the ontology must be tweaked. For example, currently, the machine
learning system is not being tuned to produce the most accurate classifications:
using more context such as a larger window size (i.e., a larger number of phonemes
to consider as part of a window) increases the classification accuracy
dramatically.
[20] As well, if we take
parts of speech out of our analysis, our results are less clear. Keeping in mind
that we are modeling the possibility of sound
as it could be
perceived opens space for discovery and illumination, since we are
not only identifying in this process which text is more like the other (though
this is interesting). Rather, by focusing on where the ontology breaks down under
the weight of computation, we are learning more about how knowledge
representations (our modeling of sound, for instance) are productive for critical
inquiry in literary texts.
Conclusion
Previously, digital humanities scholars have also used phonetic symbolic research to
create tools that mark and analyze sound in poetry. For instance, Marc Plamondon
created AnalysePoems to analyze poems in the Representative Poetry Online (RPO) collection
(http://rpo.library.utoronto.ca). Plamondon’s goal with AnalysePoems was to “automate the identification of the
dominant metrical pattern of a poem and to describe some basic elements of the
structure of the poem such as the number of syllables per line and whether all
lines are of the same syllabic length or whether there are variations in the
syllabic length of the lines in an identifiable pattern”
[
Plamondon 2006, 128]. Like our project, AnalysePoems is not a
tool that attempts to represent the “reality” of a spoken poem, a feat that is
impossible because of the ephemeral elements of performance that comprise a poetry
reading. Instead, AnalysePoems is “built on the prosodic philosophy that a full
scansion of a poem …”
[Plamondon 2006, 129]. Plamondon’s work has been
important for the development of the processes described here as it creates a
precedent and model for analyzing sound from a perspective that also values
pre-speech (aurality) and phonetic symbolism.
Another tool that was built to examine how the “phonetic/phonological structure of a poem may contribute
to its meaning and emotional power” is Pattern-Finder [
Smolinsky and Solokoff 2006, 339]. Smolinsky and Sokoloff’s hypothesis that “feature-patterning is the driving
force in the ‘music’ of the poetry”
[Smolinsky and Solokoff 2006, 340] is also important for the creation of our
visualization tool, ProseVis. With this tool, we are also interested in allowing
readers to identify patterns in the analyzed texts by facilitating their ability to
highlight different aspects or “features” of data such as parts of speech,
syllables, stress, and groups of vowel and consonant sounds. Like the creators of
Pattern-Finder, we are also interested in allowing readers to make groupings of
consonant sounds that include plosives, fricatives, and affricates and groupings of
vowels that include those formed in the front or the back of the mouth. Phonetic
symbolic research and the creation of these tools demonstrate a precedent for
facilitating readings that use these features to analyze text for meaning.
Practically speaking, our system for predictive modeling and the ProseVis tool are in
the early stages of development, but we are encouraged by the results so far. We
predicted that Tender Buttons and The New England Cook Book would be most similar and that The Odyssey and The Iliad
would be most similar, and these predictions were confirmed on our first attempt, but
our work in determining whether or not analyzing sound or interpreting with sound in
these ways is critically productive requires much more research, development, and
experimentation. Future development plans include collecting more use cases from
multiple experts and doing cross validation studies before we can have high
confidence that we have a useful system that compares well to expert predictions.
Similarly, ProseVis has only been used by a few scholars. Use case studies are needed
to establish if and how examining sound in this way can be useful to, or change the
nature of, scholarship in areas of text and sound.
At the same time, while usability studies are still a future goal in the project,
developing the algorithm and the ProseVis interface has already been productive in
terms of interrogating the efficacy of our underlying theories of aurality. Our work
is a new and promising approach to comparing texts based on prosody, but what is
equally promising is that we are ultimately basing our ontology for creating a
machine learning algorithm on an underlying logic of potential and inexact sounds as
they are anticipated in text. Further, the “success” of the comparison of sounds
between texts is based on the extent to which the computer is “confused” about
these possibilities. The fact that this is a “best guess” methodology, which
stems from theories in artificial intelligence and knowledge representation and is
based on potentials and probabilities, suggests that the algorithm and the tool
incorporate and function within a space that invites hypothesis generation and
discoveries in the sound of text.