What the Computer Said

This section of the website presents the results of our ongoing effort to develop an algorithmic method for automatically detecting free indirect discourse (FID) in fiction.

We are aware of only one previous effort to theorize a method for automatically detecting FID: that of Michael Toolan. Toolan’s theoretical work, which was never developed into a working model, proceeded in a purely mechanical manner: working from a set of a priori assumptions about the conditions under which FID is most likely to appear, he developed a set of rules that he hypothesized could be applied to any text.

From the beginning, our work has sought a very different approach. Our methodological starting point was a remark by Erich Auerbach in his analysis of To the Lighthouse in Mimesis. Building a general understanding of modernist fiction from Woolf’s “multipersonal representation of consciousness,” Auerbach writes,

These are the forms of order and interpretation [that] modern writers […] attempt to grasp in the random moment — not one order and one interpretation, but many, which may either be those of different persons or of the same person at different times; so that overlapping, complementing, and contradiction yield something we might call a synthesized cosmic view or at least a challenge to the reader’s will to interpretive synthesis. (549)

Following in the spirit of Auerbach’s account of modernist dialogism, we sought a method for building our algorithmic understanding — our “synthesized cosmic view” of FID — out of the “overlapping, complementing, and contradicting” interpretations of human readers. Employing the methods outlined in the “What the Class Said” part of this website, we asked 320 students each to annotate a short section of To the Lighthouse, with 3-4 students marking up each individual section. The goal was to apply machine learning to these student annotations — to “train” an automatic method of detecting FID using artificial intelligence. Eschewing Toolan’s top-down approach and a priori assumptions, we would “crowdsource” the question, building an automatic model out of actual readers’ interpretations.

Since we only had student annotations for two parts of To the Lighthouse (the first four and the last seven chapters), the test of our method would come in our algorithm’s ability to interpret the remainder of the text.

The current algorithm

At this stage of our work, the goal of an algorithm based entirely on machine learning has proven unworkable. The annotations have proven too “messy” — have exhibited too much of Auerbach’s vaunted “contradiction” — to train a reliable model. In most cases, automatic models based solely on student readings tend simply to guess that the entire novel is Mrs. Ramsay’s silent FID — a good guess, statistically speaking, but hardly a revealing one.
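To see why such a guess scores well, consider a minimal sketch in Python. The label counts below are hypothetical illustrations, not our actual annotation data.

```python
# A minimal sketch of why always guessing the most common label looks good
# "statistically speaking." The counts below are hypothetical, not our data.
from collections import Counter

# Hypothetical clause-level labels aggregated from student annotations.
labels = ["mrs_ramsay_fid"] * 700 + ["narration"] * 200 + ["direct_speech"] * 100

majority_label, majority_count = Counter(labels).most_common(1)[0]
accuracy = majority_count / len(labels)
print(f"Always guessing '{majority_label}' is right {accuracy:.0%} of the time.")
```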

The current algorithm for automatic annotation of To the Lighthouse is thus based on a mixture of machine learning and rule-based methods. We tackle the two aspects of the task separately: determining the type of discourse and identifying the character whose speech is being represented. For the former, we first use some simple heuristics (use of quotation marks, presence of relevant verbs and phrases like “she thought”) to identify obvious instances of direct speech and clear examples of narration. When analyzing the portions of the novel not covered by student annotation, we use student data from the tagged sections to build a linear regression model that assigns each clause in the text a value on a narration vs. direct speech scale. (We lump indirect discourse with narration because both are in the voice of the narrator.) Extreme instances on the scale are classified as the appropriate category, while those in the middle — mixtures of the speech patterns of the narrator and a character — are considered free indirect discourse. The regression is based on the presence of common words and word types (parts of speech). For less common words, we use stylistic profiles automatically derived from a much larger corpus (Brooke and Hirst).
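To make this two-stage design concrete, here is a minimal Python sketch of the clause-scoring step. The heuristic rules, training clauses, scores, and thresholds are illustrative assumptions, and the sketch leaves out the part-of-speech features and the corpus-derived stylistic profiles described above.

```python
# Sketch of the discourse-type step: cheap heuristics first, then a linear
# regression over common-word counts for the remaining clauses. All data,
# thresholds, and rules here are illustrative assumptions.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

ATTRIBUTING_PHRASES = re.compile(r"\b(she|he) (said|thought)\b", re.IGNORECASE)

def heuristic_label(clause):
    """Catch obvious cases before falling back to the regression model."""
    if '"' in clause or "\u201c" in clause:
        return "direct_speech"            # quotation marks
    if ATTRIBUTING_PHRASES.search(clause):
        return "narration"                # explicit attribution by the narrator
    return None                           # undecided: defer to the model

# Hypothetical training clauses with student-derived scores on a
# 0 (narration) to 1 (direct speech) scale.
train_clauses = ["She sat by the window", "Never mind that now",
                 "The house was quiet", "How could he say such a thing"]
train_scores = [0.1, 0.9, 0.0, 0.8]

vectorizer = CountVectorizer()                       # common-word features
model = LinearRegression().fit(vectorizer.fit_transform(train_clauses),
                               train_scores)

def classify(clause, low=0.3, high=0.7):
    label = heuristic_label(clause)
    if label:
        return label
    score = model.predict(vectorizer.transform([clause]))[0]
    if score <= low:
        return "narration"
    if score >= high:
        return "direct_speech"
    return "free_indirect_discourse"      # a mixture of narrator and character
```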

Presently, the assignment of characters to spans of direct and free indirect discourse is based entirely on the distribution of names and pronouns in the text. Again, we use a linear regression model: for each sentence and each character, the algorithm derives a value indicating how likely it is that a given character generated that sentence, based on the local distribution of names in the text prior to and near the sentence. (This is a process not unlike the one confused human readers engage in when trying to determine who is speaking in the novel.) The most likely character is chosen for each instance of character speech, though we have included a rule that prevents a character of the opposite gender from being selected when a particular gender is indicated by the narrator (e.g., “she said”). Among the various features we employ, the most relevant is the last character to whom speech has been explicitly attributed (e.g., “Mrs. Ramsay said”). Because the algorithm relies so heavily on this feature, once it makes a mistake it tends to repeat it until it reaches another explicit attribution of speech.
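The sketch below approximates the character-assignment features just described (the local distribution of names, the last explicit attribution, and the gender rule) as simple additive scores rather than a fitted regression; the character list, weights, and helper names are our own illustrative assumptions.

```python
# A simplified, rule-based approximation of the character-attribution features
# described above. The real system combines such features in a regression
# model; the characters, weights, and scoring here are illustrative.
CHARACTERS = {"Mrs. Ramsay": "f", "Mr. Ramsay": "m", "Lily": "f", "James": "m"}

def attribute_speech(sentences, speech_index, narrator_gender=None):
    """Pick the most likely speaker of sentences[speech_index]."""
    scores = {name: 0.0 for name in CHARACTERS}
    last_attributed = None

    # Score each character by how recently their name appears at or before
    # the sentence in question.
    for distance, sentence in enumerate(reversed(sentences[:speech_index + 1])):
        for name in CHARACTERS:
            if name in sentence:
                scores[name] += 1.0 / (1 + distance)
                # Remember the most recent explicit attribution, e.g. "Mrs. Ramsay said".
                if last_attributed is None and name + " said" in sentence:
                    last_attributed = name

    # The last explicitly attributed speaker is weighted heavily; this is also
    # why one early mistake tends to propagate until the next attribution.
    if last_attributed:
        scores[last_attributed] += 5.0

    # Rule: never pick a character whose gender contradicts the narrator's cue.
    if narrator_gender:
        filtered = {n: s for n, s in scores.items()
                    if CHARACTERS[n] == narrator_gender}
        scores = filtered or scores

    return max(scores, key=scores.get)
```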

You can now see the current algorithm in action in the two sections annotated by the class (I:1-4 and III:7-13), as well as in the section that did not receive student annotation (I:5-III:6).

Going forward

We do not anticipate our work stopping here. Going forward, we would like to move closer and closer to realizing our goal of an algorithm based entirely on the interpretations of human readers. To do this, we need your help — specifically, your annotations! You can provide these via the tools in the “Have Your Own Say” section of the website. Playing the “Who Reads Best: Student, Computer, or Expert?” game will also help us to improve our algorithm.

Even when we have improved our algorithm for reading FID in To the Lighthouse, much work will remain. Our overall goal is to develop a quantitative, automatic method for detecting FID in any work of fiction. Given the ethical, political, and historical significance of modernist FID, we feel a reliable quantitative index of a text’s “dialogism” — its “multi-voicedness,” its inclusion of a variety of competing perspectives — would provide a meaningful basis for comparing particular novels, authors, and even literary periods. Though we’ll try not to get ahead of ourselves.

We will take a step toward building this more general algorithmic understanding of FID in the Fall 2013 session of “The Digital Text” when we will take on another large-scale annotation project, this time focused on James Joyce’s “The Dead.”

Understanding Our Visualizations

For more on reading our editions, see the “Understanding our Visualizations” page.

Works Cited

Brooke, Julian, and Graeme Hirst. “Hybrid models for lexical acquisition of correlated styles.” Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP ’13). Nagoya, Japan: 2013.

Toolan, Michael. “Narrative Progression in the Short Story: First Steps in a Corpus Stylistic Approach.” Narrative 16.2 (May 2008): 105–120.