Thursday, December 10, 2009

Reading by numbers


This news story from the BBC website almost sounds like a faux-academic fantasy written by Jorge Luis Borges: physicists at Umeå University in Sweden, using statistical analysis on the works of three classic authors, conclude that every author has a unique linguistic fingerprint. The BBC writes:

The relationship between the number of words an author uses only once and the length of a work forms an identifier for them, they argue.

This happens to be the same team whose insights into traffic jams we highlighted on the blog earlier this year. The team seems fond of using methods from physics to make observations on systems you wouldn't normally think of as being in a physicist's realm, including fads and internet dating. This time they subjected the complete works of Thomas Hardy, Herman Melville, and D.H. Lawrence to statistical analysis. The paper is free to view here.

The BBC writes that the graph of the number of unique words versus the number of total words gives a curve that is unique to each author. The paper itself seems to do something slightly different: it graphs the number of different words, or the book's vocabulary, against the total number of words for texts of different sizes, defining unique as "different" rather than "only appearing once." Presumably, the more different words there are, the more words there will be that appear only once, so it's a minor quibble. The curves for Hardy (H), Melville (M) and Lawrence (L) look like this (just the first graph):



Here M is the total number of words, and N is the number of different words. Looks like Melville's vocabulary was wider than Hardy's. I wonder if that makes Melville a better read?
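For anyone who wants to try this at home, here is a minimal Python sketch of both quantities: the N-versus-M curve the paper plots, and the count of words that appear only once, which is how the BBC describes it. The function names, the simple regex tokenizer, and the `step` parameter are my own assumptions for illustration, not the paper's procedure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (a crude assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def vocabulary_curve(words, step=1000):
    """Return (M, N) pairs: total words read so far vs. different words seen so far."""
    seen = set()
    curve = []
    for m, word in enumerate(words, start=1):
        seen.add(word)
        # Record a point every `step` words, plus one at the very end of the text.
        if m % step == 0 or m == len(words):
            curve.append((m, len(seen)))
    return curve


def hapax_count(words):
    """Count the words that appear exactly once (the BBC's sense of 'unique')."""
    counts = Counter(words)
    return sum(1 for c in counts.values() if c == 1)


sample = "the whale the sea the ship and the whale again"
words = tokenize(sample)
print(vocabulary_curve(words, step=5))
print(hapax_count(words))
```

Feeding in a whole novel instead of the toy sentence, and plotting the pairs on log-log axes, should give a curve of the same general kind as the ones above.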

To analyze shorter texts, the researchers would excerpt works or look at short stories; to get longer texts, they would look at a few books together, finally using the entire oeuvre. For this reason, Hardy's relative paucity of words seems natural; all his novels were set in a fictional shire in the south of England he called Wessex, a pre-Industrial world of milkmaids and shepherds, wagons and plows, Christmas feasts and country fairs. His books are a delight because they are a world unto themselves; I imagine that repeating words throughout one novel or across several helped Hardy establish Wessex as a real place in the reader's imagination.

This visual curve just might be the embodiment of that ethereal, ineffable stuff that makes one author different from another: the Hardy-ness of Hardy, the Melville-ness of Melville. Disappointingly, the paper makes no mention of the most common words, least common words, or words that only appear once for each author. Thank goodness for Amazon's "Statistically Improbable Phrases" (SIPs), which the company computes for books with the "Look Inside" feature. They are a computer algorithm's attempt to say, in a couple of words, why one book is different from another. To generate a book's SIPs, Amazon compares its phrases to the phrases in all the other books with this feature. The result? Literary criticism by numbers, in a way. Amazon explains: "For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements." The SIPs in Far from the Madding Crowd are "spring waggon" and "new shepherd," which do evoke the novel's pastoral setting. It doesn't always make sense, though; the SIP for The Return of the Native is "editorial emendation," which comes not from the novel's text but from this particular edition's hefty appendix.
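Amazon doesn't publish how SIPs are actually calculated, so the following is only a guess at the general idea: score each two-word phrase by how much more often it turns up in one book than in all the others. The function names, the restriction to two-word phrases, and the add-one smoothing are all my own assumptions.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return the list of adjacent word pairs in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(book, other_books, top=3):
    """Rank a book's bigrams by how over-represented they are versus the other books.

    Hypothetical scoring, not Amazon's: log of the phrase's relative frequency
    in the book divided by its smoothed relative frequency in the other books.
    """
    book_counts = Counter(bigrams(book))
    rest_counts = Counter()
    for other in other_books:
        rest_counts.update(bigrams(other))
    book_total = sum(book_counts.values())
    rest_total = sum(rest_counts.values())
    scores = {}
    for phrase, count in book_counts.items():
        p_book = count / book_total
        # Add-one smoothing so phrases absent from the other books don't divide by zero.
        p_rest = (rest_counts[phrase] + 1) / (rest_total + 1)
        scores[phrase] = math.log(p_book / p_rest)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [" ".join(pair) for pair in ranked[:top]]
```

A scheme like this would also explain the "editorial emendation" mishap: it has no way of knowing which pages are the novel and which are the appendix.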

I think Amazon means for SIPs to help potential buyers see at a glance whether they're likely to enjoy the book; SIPs like "new shepherd" and "spring waggon" certainly would send some people running in the opposite direction. But I wonder whether it's somehow more accurate than simply reading the first couple of pages of a book and seeing whether the plot hooks you, and whether, from this tentative sip, you like the flavor of the author's language.

Besides creating the fingerprint-curves for each author, the Umeå group also found that an author's word-frequency distribution, or the probability of finding a word that appears K times in a text, was the same for a short story, novel, or several novels put together. If an excerpt from a novel looks the same as the novel, through this analysis, then the novel, and, by extension, every work the author has produced, can be seen as a mere excerpt of "an imaginary complete infinite corpus written by the same author," what the authors call the "metabook." Computer simulations revealed that Thomas Hardy's looks something like this:


But really, the concept should be intriguing to literary critics. The authors write:

The writing of a text can be described by a process where the author pulls a piece of text out of a large mother book (the meta book) and puts it down on paper. This meta book is an imaginary infinite book that gives a representation of the word-frequency characteristics of everything that a certain author could ever think of writing. This has nothing to do with semantics and the actual meaning of what is written, but rather concerns the extent of the vocabulary, the level and type of education, and the personal preferences of an author. The fact that people have such different backgrounds, together with the seemingly different behavior of the function N(M) for the different authors, opens up the speculation that everyone has a personal and unique meta book, in which case it can be seen as an author's fingerprint.
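To make the "pulling text out of a mother book" picture concrete, here is a small Python sketch. It computes the word-frequency distribution, meaning the fraction of different words that appear exactly k times, and draws an excerpt from a larger text. Treating an excerpt as a random sample of words is my simplifying assumption, and the function names are hypothetical; see the paper for the authors' actual procedure.

```python
import random
from collections import Counter


def frequency_distribution(words):
    """Return {k: P(k)}, the fraction of different words that occur exactly k times."""
    word_counts = Counter(words)
    k_counts = Counter(word_counts.values())
    n_distinct = len(word_counts)
    return {k: n / n_distinct for k, n in sorted(k_counts.items())}


def pull_excerpt(words, size, seed=0):
    """Draw `size` words at random, without replacement, from a larger text.

    Assumption: a random sample stands in for 'pulling a piece of text'
    out of the bigger book.
    """
    rng = random.Random(seed)
    return rng.sample(words, size)


book = "the whale the sea the ship and the whale again".split()
print(frequency_distribution(book))
print(frequency_distribution(pull_excerpt(book, 5)))
```

With a real novel in place of the toy sentence, you could compare the distribution of the whole book with those of excerpts of various sizes.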

They told the BBC that this concept could be used for literary sleuthing: "As their collection of fingerprints grows, Mr Bernhardsson said, they will try to identify the authors of anonymous works." Some people think the plays we attribute to Shakespeare weren't all written by the same man—what about scanning Shakespeare's plays to see if they all bear the same "fingerprint?"
