Sounding Stein’s Texts by Using Digital Tools for Distant Listening

[This is the MLA 2012 talk I gave as part of the Gertrude Stein and Music panel arranged by the Lyrica Society for Word-Music Relations and the Association for the Study of Dada and Surrealism. Presiding: Jeff Dailey, Five Towns Coll.]

At a time when modernist writers were concerned with the extent to which medium played a direct role in characterizing the inward self, and technological innovations and new media comprised, more and more, a larger part of their means for communicating their sense of their ambiguated selves, music seemed like a medium for expression that represented an essential abstraction or expression of the inner self, unmediated. “All art constantly aspires to the condition of music,” Walter Pater wrote in 1873 (86).

In particular, Gertrude Stein’s texts provide an opportunity to question some of these assumptions about meaning making with music and text and technology. A common remark made by readers is that Stein’s texts make more sense (are “clarified,” more “concrete,” or less abstract) when set to music or computed. Scholars argue that both musical and computational interventions provide for a more comprehensible reading of the text, an expression of something inherent to the text that was otherwise obscured in a more traditional reading. For me, a question remains: what do these two interventions (musical and computational) have in common, and what can this activity of computationally or aurally altering Steinian texts tell us about the nature of reading or interpreting literary texts with digital tools in general?

What computational and musical renditions tell us about the nature of interpreting literary texts

At first glance, it seems that there are conflicting effects: setting a Steinian text to music heightens its emotional aesthetics while computing it flattens them out. For example, Virgil Thomson composed the score for Stein’s opera “Four Saints in Three Acts”, which opened on Broadway in 1934. In the introduction to the libretto published later, Carl Van Vechten wrote that the music was a “[p]erfect complement to the finely singable text which it always enhances and never obscures. The music is as transparent to color as the finest old stained glass and has no muddy passages” (Van Vechten 6).

Brad Bucknell goes further, saying that Thomson’s music does more than complement the meaning of Stein’s text; it is the space in which meaning making happens: “Indeed, somberness, spirituality, gayness, and so forth, are really, if present at all, made so by the conventional image repertoire created by the music. The language itself will never completely concretize anything, as we have seen” (211). Kenneth Goldsmith has a similar response to Gregory Laynor’s 2008 recording of Stein’s The Making of Americans (1925) in which Laynor reads and then starts to sing the text on page 135: “It is numbing; it’s repetitive; it’s really boring,” Goldsmith says of Laynor’s initial reading, but then he notes, “and so what happens by page 135 is that Mr. Laynor becomes interpretive and expressive and he begins singing The Making of Americans” (4:12-4:44). For Goldsmith, the text is inexpressive or “numbing”—it inhibits the production of meaning—but the singing is “expressive” and “interpretive”—it gives a sense of narrative and knowledge production to the text. As Wendy Salkind says of her theater students who were confused by a Stein text and began reading it out loud: “they turned it into a square dance or a waltz. They started clapping it out” (Personal interview). In each of these cases, music can bring context or a story with it.

This attempt to disambiguate the meaning of the text by setting it to music is similar to attempts to read Stein computationally. Edith Thacher Hurd, the wife of Clement Hurd, who illustrated many printings of Stein’s children’s book The World is Round, describes this book as a text that is best read within a culture that is enmeshed in advanced technologies and new media. In a companion essay first printed in the 1986 edition, she contrasts the book’s limited success after its first and second printings with Young Scott Books in 1939 and 1966 with its increased popularity after its 1986 Arion Press edition:

The core of meaning in the round songs and the rhyming prose is more comprehensible than it was when the book was first published. Perhaps the electronic age, the age of television and the computer, has enabled us to move along the lines of thought with a speed of cognition that can keep up the swift pace of this expatriate genius (Hurd 158).

Kenneth Goldsmith describes composer Warren Burt’s rendition of “Miss Furr and Miss Skeene”, which is read aloud by computer voices, as indicative of the “true nature of the structure and the form of Gertrude Stein’s repetitious texts;” Burt, Goldsmith says, has successfully “tak[en] the emotion out of Gertrude Stein’s voice and presentation” (emphasis added). Like Hurd, Goldsmith thinks that Stein’s texts are more understandable within the context of computers since “this was a type of repetition that people weren’t accustomed to in the early part of the century,” but “transposed to the computer voices that we’re so accustomed to today, Gertrude Stein’s text makes absolute sense; it’s a sort of emotional flattening, freeing up of the text to become self-sufficient.” Goldsmith consistently praises computationally composed renditions as more “natural” and “self-sufficient” (i.e., understandable) texts because they are “emotionally flattened” much as Bucknell heralds the musically inspired Steinian versions as emotionally inspiring. In either case, the text’s meaning making potential is heightened by these interventions.

Linda Dusman and Wendy Salkind recently composed a performance of Stein’s story “Miss Furr and Miss Skeene” that exemplifies the process through which a composer can use both computational means and musical systems to “sound out” or read the text. In an interview, Dusman admits that they chose Stein’s story because they “fell in love with abstraction” and “to read it out loud suddenly it all made sense, the rhythm,” but Dusman also perceived that there was a system in place in the composition of the text that was making sense to her: “I needed to be true to the text,” she says, “Stein is so rhythmical so I used just percussion . . . used dead strokes . . . not flowery” (Personal interview). Because she believed that Stein had a compositional system, Dusman’s process of setting Stein’s story to music became a systematic translation in which she transposed the sounds into numbers for graphing:

Red is the “a” sound, blue is the “th” sound . . . green is the “e” sound . . . I did it by sentence . . . I looked at the average for blue, for the “th” sound across the entire piece . . . it goes further and further away from the average. So there’s a kind of rhythm there for the average number and for the “a” sound the average is a little higher. But they all come together in the middle . . . which is the longest and most complicated paragraph . . . that big paragraph 12 . . . “the dark heavy men” . . . [Stein] totally changes at that point, everything changes . . . and then there’s this kind of fight that goes on with “a” and “e” get very close here . . . and the “th” and the “e” get very close, but eventually the “e” sound takes over here . . . and that “e” brightness was why towards the end of the piece they’re all only symbol sounds, shiny, shimmery kinds of sounds so that I kind of reflected that transformation so I turned that into a musical score . . . the “th” sound was a muted tom-tom; the “ing” sound was a medium gong; flexatone was the word “pleasant”; and then when she used the word “voice,” I used a high timpani bend; the “e” sound is a high piece of metal; and then Furr is a low wood block and Skeene is a high wood block. (Personal interview)

Linda Dusman's Graph of Miss Furr and Miss Skeene by Gertrude Stein
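Dusman’s description of counting sounds sentence by sentence and comparing each count with the average across the piece can be approximated in a short sketch. This is only an illustration under loud assumptions: it counts spelled letters ("a", "e", and the digraph "th") rather than heard sounds, it splits sentences naively on punctuation, and the function names and the sample text are invented here rather than taken from her working method.

```python
import re
from statistics import mean

# Letter patterns standing in for the sounds Dusman tracked. This is an
# assumption: she describes heard sounds, not spelled letters.
PATTERNS = {"a": "a", "th": "th", "e": "e"}


def split_sentences(text):
    """Naively split a text on sentence-final punctuation."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def counts_per_sentence(text):
    """Return {pattern name: [count in sentence 1, count in sentence 2, ...]}."""
    sentences = split_sentences(text.lower())
    return {name: [s.count(pat) for s in sentences] for name, pat in PATTERNS.items()}


def deviations_from_average(counts):
    """For each pattern, subtract the piece-wide average from every sentence count."""
    result = {}
    for name, series in counts.items():
        average = mean(series)
        result[name] = [count - average for count in series]
    return result


# Invented sample text (not Stein's), used only to exercise the functions.
sample = "The theme is there. A pleasant day began. They then went there together."
counts = counts_per_sentence(sample)
deviations = deviations_from_average(counts)

for name in PATTERNS:
    print(name, counts[name], [round(d, 2) for d in deviations[name]])
```

Plotting each list of deviations against sentence number would give one colored line per sound, which is roughly the kind of graph the quotation describes; any averaging by paragraph rather than by sentence would be a further choice on the composer’s part.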

This is not to say that musical composition (as well as computing) isn’t another level of abstraction. When asked if she wanted to make the sounds correspond exactly to the text, Dusman admitted, “I didn’t want to be that obvious about it, and that would have been too cluttered so I did a kind of averaging . . . sometimes it lined up with her voice and sometimes it didn’t line up with her voice.” Dusman uses a mixture of mathematical terms such as “averaging” and specific musical instruments to correspond with particular sounds such as “th” or words such as “pleasant” as well as the characters Miss Furr and Miss Skeene but she also refers to elements of music that are more abstract: “I wanted,” she says about matching the text with sounds, for “it to sometimes be spot on and sometimes just be like an aura . . . there would be kind of like a spatialization or a sound world that would be created for each paragraph so you’d hear the changes from paragraph to paragraph and you’d hear the changes across the course of it but it wouldn’t be really obvious . . .” It is in this space that the concrete and abstract natures of the textual, the musical, and the computational become so inextricably mixed. Just as juxtaposing words can create multi-layered meanings, mixing musical notes can create auras; as well, computing or quantifying textual features creates a space in which to discover questions.

What computational and musical renditions tell us about the nature of representing literary texts

Dusman’s example shows us that it is at the moment of performance that we come to understand the relationship between the text and musical adaptations of it. In the digital humanities, we access data, most regularly, as a visualization on a computer screen. Likewise, composers visualize their compositions in musical scores. Arguably, it is at the moment at which the image of the text—whether it is in paragraphs on the typographical or manuscript page or in musical staffs or bar graphs—interfaces with the reader that the space of interpretive activity happens. The musical score is understood as an attempt to represent complex relationships such as the co-occurrence of multiple elements across time and space: it is meant to be played, to be spatialized in time and embodied by voices. I am arguing that the same is true of computational visualizations of text. They are meant to be played or performed.

To demonstrate this point, I will briefly perform a computationally adapted reading of Gertrude Stein’s The Making of Americans for you.

The first tool to discuss is Audacity, which is a free, open-source tool that allows a reader to create waveforms and spectrograms with audio files. What I have visualized in the first example is three waveforms of three readings (one per line) of the same three sections of Gertrude Stein’s The Making of Americans.

Waveform created with Audacity of three readings (by OpenMary, Gertrude Stein, and Gregory Laynor) from Gertrude Stein’s The Making of Americans

I created the first reading, represented in the first line, using OpenMary (Modular Architecture for Research on speech sYnthesis), an open-source text-to-speech system that “reads” texts (creates audio files) in a computer-generated, female-gendered voice with an American dialect. The second reading, in line two, is by Gertrude Stein, who originally recorded this reading in 1934. The third reading, in line three, is by Gregory Laynor, who, as mentioned previously, created his reading of The Making of Americans in 2008. At first glance, this is an interesting visualization because the change in the visualization (represented by the vertical line in the center of each reading) is the point at which there is also a change in the format of the reading: a break between a more traditional narrative, in which Stein tells the story of a man and his son and their discovery that pinning butterflies is cruel, and the last part of the reading, in which Stein uses repetition heavily. Here are two representative sentences from Part B and Part C:

Part B:
One of such of these kind of them had a little boy and this one, the little son wanted to make a collection of butterflies and beetles and it was all exciting to him and it was all arranged then and then the father said to the son you are certain this is not a cruel thing that you are wanting to be doing, killing things to make collections of them, and the son was very disturbed then and they talked about it together the two of them and more and more they talked about it then and then at last the boy was convinced it was a cruel thing and he said he would not do it and his father said the little boy was a noble boy to give up pleasure when it was a cruel one.

Part C:
Any family living going on existing is going on and every one can come to be a dead one and there are then not any more living in that family living and that family is not then existing if there are not then any more having come to be living. Any family living is existing if there are some more being living when very many have come to be dead ones.

One reading of Figure 1 is that this is a visualization of Goldsmith’s point that computer voices bring Stein’s “tropes of repetition to a computer inspired level” of intensity. Indeed, in this figure, we see that the computer voice is more dynamic (higher peaks and lower valleys) when it comes to the repetitious section whereas the human voices (of Stein and Laynor) seem flattened in comparison to the previous part. While this is an exciting hypothesis, the visualization is misleading: a waveform simply identifies volume and tempo. In fact, Charles Bernstein makes the claim that waveforms can only identify part of what makes poetry audio files interesting. “There are four features or vocal gestures, that are available on tape but not page that are of special significance for poetry,” he writes; these include: “the cluster of rhythm and tempo (including word duration), the cluster of pitch and intonation (including amplitude), timbre, and accent” (126). Considering that waveforms only show the first two, it would be a stretch to argue that that amplification makes anything more intense or “inspired.”
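The limitation can be made concrete with a small sketch. Assuming that a zoomed-out waveform is read mainly as a loudness contour over time, the code below synthesizes two invented “voices” that differ in pitch but share the same quiet-then-loud shape and reduces each to a root-mean-square envelope. The sample rate, window size, and function names are arbitrary choices for the illustration, not anything Audacity prescribes.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an arbitrary choice for this sketch)


def tone(freq_hz, seconds, amplitude):
    """Synthesize a sine tone as a list of samples in [-amplitude, amplitude]."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def rms_envelope(samples, window):
    """Root-mean-square loudness of each consecutive window of samples."""
    envelope = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        envelope.append(math.sqrt(sum(x * x for x in chunk) / window))
    return envelope


# Two invented "voices": same loudness contour (quiet, then loud), different pitch.
low_voice = tone(200, 0.5, 0.2) + tone(200, 0.5, 0.8)
high_voice = tone(400, 0.5, 0.2) + tone(400, 0.5, 0.8)

low_env = rms_envelope(low_voice, window=400)
high_env = rms_envelope(high_voice, window=400)

# The envelopes match even though the two voices would sound quite different.
print(all(abs(a - b) < 1e-6 for a, b in zip(low_env, high_env)))
```

Because the two envelopes coincide, a reading of intensity based on the contour alone cannot distinguish the voices, which is why pitch, timbre, and accent have to be sought in a different representation.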

The second visualization is a spectrogram created within Audacity of the same readings and the third is a close-up of the same spectrogram on the words “ . . . some such thing. Family living . . .”
Second viz
Spectrogram created with Audacity of three readings (by OpenMary, Gertrude Stein, and Gregory Laynor) from Gertrude Stein’s The Making of Americans
Third viz
Spectrogram created with Audacity of the line “. . . some such thing. Family living . . .” (by OpenMary, Gertrude Stein, and Gregory Laynor) from Gertrude Stein’s The Making of Americans

Unlike a waveform, a spectrogram shows the information necessary to plot prosody features that include timbre and accent, features to which Bernstein and others have attributed meaning-making properties. While the spectrogram does not show the same changes that the waveform indicates in Figure 1, it shows the subtle differences that close reading of phrases has always shown, and in this case we can also see how loudness or amplification corresponds to different frequencies. After enough experience viewing spectrograms, a reader can start to imagine what the image sounds like. For example, one can see in this figure that the consonants and vowels look completely different: the consonants are red floating clouds, while the vowels are bright white spots. Like Stein, the reader can suddenly “hear more pleasantly with the eyes than the ears” (Autobiography 90) and see sound and the relationship between color and sound.
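What a spectrogram adds can be sketched in the same spirit: the signal is cut into short frames and each frame is decomposed into frequency bins, so that loudness is laid out by frequency as well as by time. The code below uses a naive discrete Fourier transform on an invented two-tone signal. It is a teaching sketch under stated assumptions and not a description of how Audacity computes its display; the frame size, sample rate, and function names are my own choices.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary for this sketch)
FRAME = 200         # samples per frame, so each bin is 8000 / 200 = 40 Hz wide


def tone(freq_hz, n_samples, amplitude=0.5):
    """Synthesize n_samples of a sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n_samples)]


def frame_spectrum(frame):
    """Magnitudes of a naive discrete Fourier transform, bins 0 .. len(frame) // 2."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        magnitudes.append(abs(total))
    return magnitudes


def spectrogram(samples, frame_size=FRAME):
    """One spectrum per frame: rows are moments in time, columns are frequency bins."""
    return [
        frame_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def dominant_frequency(spectrum, frame_size=FRAME):
    """Frequency in Hz of the strongest bin in one frame."""
    k = max(range(len(spectrum)), key=lambda i: spectrum[i])
    return k * SAMPLE_RATE / frame_size


# Invented signal: a 200 Hz tone followed by a 400 Hz tone at the same loudness.
signal = tone(200, 800) + tone(400, 800)
rows = spectrogram(signal)
peaks = [dominant_frequency(row) for row in rows]
print(peaks)
```

A waveform of this signal would look uniform from start to finish, whereas the per-frame peaks shift halfway through; the colored image in a spectrogram is a picture of every bin’s magnitude, not just the strongest one.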

While Audacity is a powerful tool for demonstrating features of text that can be visualized, the problem with Audacity is that it produces a visualization of “data” or “given” information rather than what Johanna Drucker calls “capta” or “taken” information. In the digital humanities, we strive to model data and create tools and analyses that help us analyze this data based on our ideas concerning the hermeneutics and interpretive activities with which we are concerned. In this sense, the data is not “given” but rather consciously constructed as a means for reading according to our understandings of how interpretive activity happens. For example, ProseVis is a visualization tool we developed to allow a reader to map the features extracted from OpenMary to the words in context. Research has shown us that mapping the data to the text in its original form allows for the kind of human reading in which literary scholars engage: words in the context of phrases, sentences, lines, stanzas, and paragraphs (Clement 2008). Recreating the context of the page not only allows for the simultaneous consideration of multiple representations of knowledge or readings (since every reader’s perspective on the context will be different) but it also allows for a more transparent view of the underlying data. This data is more “capta” than “given” in the sense that we had to define sound, much like a composer who must choose the key in which she composes, the chords, and the length of a sound and its amplification, depending on her understanding of what that musical movement portrays. Likewise, developing a tool in the digital humanities means choosing the textual features (whether they be parts-of-speech or sentimental phrases) that we believe make meaning; it means choosing the analytics that facilitate the kinds of interpretive activities we have defined as important as a community; it means creating a space in which the aura of interpretive activity takes place for the reader.
For example, in using OpenMary data to capture features of aurality, we are defining sound as the pre-speech potential of sound as it is signified within the structure and syntax of text. Charles Bernstein calls “aurality” the “sounding of the writing” while “orality” has an “emphasis on breath, voice, and speech . . .Aurality precedes orality, just as language precedes speech” (Bernstein 1998, 13). In this way, we are gauging the reading of one imperfect speaker with ProseVis: the computer. That is, in this project we are using the OpenMary text-to-speech system to create a text-based surrogate of sound. It is a choice based on certain research and theoretical models. OpenMary’s rule set or algorithm for generating audio files is based on the research of both linguists and computer scientists. As a result, OpenMary captures information about the structure of the text (features) that make it possible for a computer to read and create speech that is comprehensible to readers of multiple languages. At the same time, using the OpenMary system to create our sound surrogates allows us to represent the aurality of sound or the potential of sound as a “best guess.” Dwight Bolinger writes “in the total absence of all phonological and visual cues, the psychological tendency to impose an accent is so strong that it will be done as a ‘best guess’ from the syntax” (Bolinger 1986, 17). In other words, when we encounter a written word, we make “best guesses” based on the possibilities of sound that are represented by the structural features of a word within its syntactical context. In this project, the OpenMary system also makes a best guess based on the structure of the text. The OpenMary XML output represents potential sounds since the utterance never happens. In other words, we are not using the resulting audio file as the source for our data; we are using the XML transcription that OpenMary produces as a best guess in the process of creating the audio file. 
We have chosen the framework for the data—as such, the data is “capta”—and we have developed the interface to facilitate its exploration.
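The step from an XML transcription to word-level features can be sketched as follows. The XML below is a hand-written stand-in, not actual OpenMary output: the element names, the attribute names, the ad hoc sound symbols, and the accent values are all hypothetical and exist only to show the shape of the move from markup to a table of words, accents, and sounds.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a text-to-speech XML transcription. Every element
# name, attribute name, sound symbol, and accent value here is hypothetical.
SAMPLE_XML = """
<analysis>
  <sentence>
    <token word="some" accent="no"><syllable ph="s u m"/></token>
    <token word="such" accent="no"><syllable ph="s u ch"/></token>
    <token word="thing" accent="yes"><syllable ph="th i ng"/></token>
  </sentence>
</analysis>
"""

VOWELS = {"u", "i"}  # ad hoc vowel symbols used only in this sample


def extract_rows(xml_text):
    """Return one (word, accented, sounds) tuple per token, in reading order."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for token in root.iter("token"):
        sounds = []
        for syllable in token.iter("syllable"):
            sounds.extend(syllable.get("ph", "").split())
        rows.append((token.get("word"), token.get("accent") == "yes", sounds))
    return rows


def words_by_vowel(rows):
    """Group words under each vowel symbol they contain."""
    groups = {}
    for word, _accented, sounds in rows:
        for sound in sounds:
            if sound in VOWELS:
                groups.setdefault(sound, []).append(word)
    return groups


rows = extract_rows(SAMPLE_XML)
groups = words_by_vowel(rows)
print(rows)
print(groups)
```

Even in this toy form, the choices are visible: which attributes count as features, what counts as a vowel, and how sounds are tied back to words are all decisions made by the person modeling the data.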

Fourth viz
The Making of Americans excerpt in ProseVis showing full sounds and accent data

Fifth viz
Excerpt from The Making of Americans showing vowel sounds

In thinking about what a digital humanities tool like ProseVis is showing, it is perhaps more useful to think of computational representations like musical scores: they allude to the “aura” or to the spatialization or embodied nature of the interpretive act. Visualizations are not the end product. Tsur notes that the sounds comprising Baudelaire’s poetry “are perceived and compared to each other; the reader, however, cannot focus his awareness on any of these strings because his attentive perception has been distracted from one string by another, so that a network of highly significant sounds has been generated in rich effects, but only semiconsciously perceived” (58). Tsur asserts that this act of perceiving the minute patterns without consciously rendering them into “information” (rather than noise) lets the reader perceive those larger structures across Shakespeare’s poetry: “Then the sensuous opposition solid flesh ~ resolve into dew is subliminally reinforced on the phonological level of non-referential sound patterns, where a more differentiated phonological system is perceived as dissolving into a less differential one” (Tsur 61). Computational representations, at heart, are as inexact at representing the reading, interpretive, or meaning-making act as a musical score is in representing an opera: they represent a means for starting to read, for putting on the interpretive performance. In the case of ProseVis, for example, Figure 4 and Figure 5 offer a look at larger structures, showing which vowel patterns happen within the context of which phrases, but the subtleties here show that this data is similar to the data we have seen in Figure 3. In the example below, we see lines taken from Figure 4 and Figure 5 that show the line “ . . . some such thing. Family living . . .” The first example shows (like Figure 3) that OpenMary does not place emphasis on “some” while “th” and “ing” are both emphasized. If we look at vowel sounds, we see in Figure 5 that “some” and “such” have a similar vowel sound, as do “come” and “one”.

Sixth viz
The Making of Americans excerpt in ProseVis showing full sounds and accent data

Seventh viz
Excerpt from The Making of Americans in ProseVis showing vowel sounds

Understanding the underlying data as “capta” and being able to toggle back and forth between these various representations provide the interpretive space (the performance space) in which scholars can consider all the ways in which a text makes meaning with sound.


Literary texts can be discrete objects as well as complicated and multi-leveled systems. Rendering the richness of a text is usually seen as the point of musical and digital interventions that often result in seemingly flat, two-dimensional visualizations, but, as I have indicated, I would also like to pose the hypothesis that it is the provocation to performance, and thus the discovery of the larger structures, that is the main point of these interventions. In “The World is Round,” Stein alludes to the relationship between text, sound, and the performance space of reading that helps us understand how computational and musical adaptations are useful for reading her texts:

The teachers taught her
That the world was round
That the sun was round
And that they were all going around and around
And not a sound.
It was so sad it almost made her cry
But then she did not believe it
Because mountains were so high,
And so she thought she had better sing
And then a dreadful thing was happening
She remembered when she had been young
That one day she had sung,
And there was a looking-glass in front of her
And as she sang her mouth was round and was going
around and around.

Like the little girl who suddenly understands her position (her scary responsibility) in the universe, as the agent of not only the sound making but the meaning making, Van Dyke realizes the awesome responsibility that a computational reckoning of a Steinian text entails: “To rectify the noise in each sentence,” she writes, “as I have done for the first three paragraphs, would render Lucy Church Amiably so rich in information as to be incomprehensible, unless larger structures could be found” (186). Likewise, I have written about “distant-reading” Stein’s The Making of Americans using digital tools (Clement 2008). These interventions show us that many Steinian texts require that we pose as performers who are making meaning by any means that allows us to see the larger structures we need in order to make sense.

Works cited
Bernstein, Charles. Attack of the Difficult Poems: Essays and Inventions. University of Chicago Press, 2011. Print.
Bucknell, Brad. Literary Modernism and Musical Aesthetics: Pater, Pound, Joyce, and Stein. Cambridge University Press, 2001. Print.
Drucker, Johanna. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 5.1 (2011): n. pag.
Dusman, Linda, and Wendy Salkind. Personal interview. 13 July 2011.
Pater, Walter. The Renaissance. Oxford: Oxford University Press, 1986. Print.
Stein, Gertrude. The Making of Americans: Being a History of a Family’s Progress. Normal, IL: Dalkey Archive Press, 1995. Print.
Stein, Gertrude. The World is Round. New York: Young Scott Books, 1966. Print.
Van Vechten, Carl. “Introduction.” In Thomson, Virgil, and Gertrude Stein. Four Saints in Three Acts. A-R Editions, Inc., 2008. Print.
