Wednesday, January 03, 2018

Meet the Scientist Using Physics Techniques to Solve Linguistic Mysteries

"A good idea is useless if [it] does not convince others. An idea that is only convincing to oneself is dead."

These wise words represent a hard-learned lesson for Dr. Ramon Ferrer-i-Cancho, a scientist in the Complexity and Quantitative Linguistics lab at the Universitat Politècnica de Catalunya. Ferrer-i-Cancho has spent nearly two decades fleshing out a mathematical theory to describe the natural elegance of languages, fighting skepticism and intellectual inertia every step of the way. Now, with a publication in the American Physical Society's journal Physical Review E, he hopes to both refute and convert his dissenters once and for all.

When you read or hear a sentence, your brain goes into an amazing matching routine, pairing up adjectives to the words they modify, applying action verbs to their objects, and sorting clauses to create a sensible picture of the speaker's overall meaning. But in any language, the way a sentence is spoken is usually streamlined, optimized with implicit rules to make it easy to understand. For instance, we tend to keep subjects close to their verbs, or modifiers close to the nouns they modify, e.g. "John ate quickly" rather than "Quickly, John ate". Sometimes there are hard-and-fast grammar conventions about this kind of thing, but more often these implicit rules identified by linguists are statements about what sounds right to our ears.

There are a number of other rules that seem to emerge in just about every language, and Ferrer-i-Cancho has been working for years to unravel the mysteries of one...and maybe to disprove its existence. The rule is simple: "Don't cross dependencies."

[Image caption: "It would be bad." Image Credit: Columbia Pictures, via metrograph]
In linguistics, a dependency describes the relationship between any two words. In our example from earlier, there's a dependency between "John" and "ate", and between "ate" and "quickly"—"quickly" modifies "ate", and "ate" is something that "John" did. There's no dependency between "John" and "quickly", because "quickly" only modifies the verb.

A crossed dependency, then, is what you get when two dependencies interleave: if you draw each dependency as an arc above the sentence, two arcs cross when exactly one word of the second pair sits between the two words of the first pair.

[Figure caption: In sentence (b), the line linking "ate" to "yesterday" crosses the line linking "apple" to "which": a crossed dependency. Image Credit: Gomez-Rodriguez & Ferrer-i-Cancho (2017), Physical Review E]
In reading sentences (b) and (c) in the above figure, you can hopefully see what I meant earlier about "what sounds right to our ears". While sentence (b) is still understandable, it sounds stilted and could create confusion. The phrase "which was red" obviously doesn't apply to "yesterday", but a different sentence with a similar structure could be taken multiple ways.

For instance, imagine if sentence (b) said "good" instead of "red". Was it a good apple, or is John a picky eater, so it was good that he ate some fruit? Without other context, I'd assume the latter, and this supports the idea of a no-dependency-crossing rule: if you look at the dependency lines on sentence (b), you'll see that if "which was good" applies to "ate" (rather than "apple"), there are no crossed dependencies.
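To make the idea of a crossing concrete, here is a minimal sketch that counts crossed dependencies in a sentence. The sentence, the word positions, and the hand-written dependency pairs are assumptions made for illustration: they are chosen to mirror the "ate"/"yesterday" versus "apple"/"which" crossing described in the caption, and are not copied from the figure in the paper. The function name `count_crossings` is likewise hypothetical.

```python
from itertools import combinations


def count_crossings(edges):
    """Count pairs of dependencies whose arcs cross.

    Each edge is a pair of word positions. Two arcs cross when exactly
    one endpoint of one arc lies strictly between the endpoints of the other.
    """
    crossings = 0
    for (a, b), (c, d) in combinations(edges, 2):
        a, b = sorted((a, b))
        c, d = sorted((c, d))
        if a < c < b < d or c < a < d < b:
            crossings += 1
    return crossings


# Assumed example: "John ate an apple yesterday which was red"
# positions: John=0 ate=1 an=2 apple=3 yesterday=4 which=5 was=6 red=7
crossed = [(1, 0), (1, 3), (3, 2), (1, 4), (3, 5), (5, 6), (6, 7)]

# Same words with "yesterday" moved to the end:
# positions: John=0 ate=1 an=2 apple=3 which=4 was=5 red=6 yesterday=7
uncrossed = [(1, 0), (1, 3), (3, 2), (1, 7), (3, 4), (4, 5), (5, 6)]

print(count_crossings(crossed))    # 1 (ate-yesterday crosses apple-which)
print(count_crossings(uncrossed))  # 0
```

Dependencies that share a word never count as crossing here, because the strict inequalities fail whenever two arcs meet at the same position.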

This rule shows up in most languages, but in trying to understand where it comes from, Ferrer-i-Cancho came up with a controversial hypothesis: the no-crossing rule isn't a rule in its own right.

"The inspiration of this article came during my PhD (early 2000)," he says. "At that time I was very interested in understanding the origins of the scarcity of crossings dependencies in languages. By chance, I met Pau Fernandez Duran, who was working on applications of the minimum linear arrangement problem of computer science to network theory research."

Sensing an opportunity, Ferrer-i-Cancho applied the technique to his own problems. "In the context of a sentence, the minimum linear arrangement problem consists of finding an ordering of the words where the sum of the distances between syntactically related words is minimized."

Solving the minimum linear arrangement problem in a sentence is essentially finding the mathematically optimal application of the "keep related words close together" rule, which helps speakers and listeners keep track of what's going on in a complex sentence with lots of parts.
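As a rough illustration of what "minimum linear arrangement" means for a sentence, the sketch below tries every possible word order for a tiny example and keeps the one with the smallest sum of dependency lengths. The five-word sentence and its dependency pairs are assumptions made for illustration, and brute force is used only because the example is so small; it is not the method used in the research described here.

```python
from itertools import permutations


def total_dependency_length(order, edges):
    """Sum of distances (in words) between each pair of related words."""
    position = {word: i for i, word in enumerate(order)}
    return sum(abs(position[head] - position[dep]) for head, dep in edges)


def minimum_linear_arrangement(words, edges):
    """Brute-force search over all orderings; only feasible for short sentences."""
    return min(permutations(words),
               key=lambda order: total_dependency_length(order, edges))


# Assumed example sentence and hand-written (head, dependent) pairs.
words = ["John", "ate", "an", "apple", "yesterday"]
edges = [("ate", "John"), ("ate", "apple"), ("apple", "an"), ("ate", "yesterday")]

best = minimum_linear_arrangement(words, edges)
print(total_dependency_length(words, edges))  # 7 for the natural English order
print(total_dependency_length(best, edges))   # 5 for the optimal order
```

The natural English order is not the optimum: the best ordering squeezes the related words closer together than English grammar allows.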

"I thought that languages had to solve a similar problem due to cognitive constraints. Surprisingly, the number of crossings was practically zero in these peculiar orderings, providing support for the hypothesis that the scarcity of crossing dependencies in languages could be a natural consequence of cognitive pressures (the longer the dependency, the longer the cognitive cost)."

The word orders in mathematically optimized sentences are "peculiar", he says, because "real languages don't reach the actual minimum sum of dependency lengths". Although these sentences are theoretically the easiest to process, "reducing the distance between syntactically related elements is in conflict with other word order principles, and...languages have to reach a sort of compromise."

As a result of the complexity at work, the idea—that the scarcity of crossed dependencies is the product of a simpler rule, rather than a rule of its own—was met with intense skepticism by his colleagues in the field. Ferrer-i-Cancho, however, would not give up. To prove his theory, he's had to build an argument starting from foundations that other linguists took for granted.

For instance, in linguistics it's simply assumed to be a fact that dependency crossings are verboten and don't show up very often. But to present their findings in a physics journal, the authors of the current article had to first lay the groundwork, analyzing 30 different languages to prove that dependency crossings really are rare.

Ferrer-i-Cancho wanted to prove that dependency crossings, where they do exist, are the result of the "keep-words-together" rule running up against the other rules in a given language. Proving this rigorously, though, is more difficult than it might seem—and would turn out to require more math than your average linguistics paper.

He decided to attack the problem by using artificial language structures with a small number of dependency crossings, like you'd find in real languages, to develop an argument that could apply to any language. In 2014, he had what seemed to be a breakthrough, when he developed a method to predict the number of crossings that would show up in his simulated language structures. "The argument was powerful because it was abstract enough to be valid to any language," he says. "However, colleagues did not believe that it could work as expected on real sentences: the artificial dependency structures that we were using were generated randomly, and the predictor was based on the assumption that words were placed at random in the sentence."
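To get a feel for what "words placed at random" implies, here is a sketch of the simplest possible baseline, which is not the predictor from the paper. If the word order is shuffled uniformly at random, any two dependencies that share no word cross with probability 1/3, so the expected number of crossings is one third of the number of such pairs. The example sentence, its dependency pairs, and all function names are assumptions made for illustration; the code checks the formula against an exhaustive average over every ordering.

```python
from fractions import Fraction
from itertools import combinations, permutations


def expected_crossings_random_order(edges):
    """Expected crossings under a uniformly random word order.

    Only pairs of dependencies with no word in common can cross,
    and each such pair crosses with probability 1/3.
    """
    independent = [(e, f) for e, f in combinations(edges, 2)
                   if not set(e) & set(f)]
    return Fraction(len(independent), 3)


def crossings_in_order(order, edges):
    """Count crossing arcs for one specific word order."""
    pos = {word: i for i, word in enumerate(order)}
    total = 0
    for (a, b), (c, d) in combinations(edges, 2):
        lo1, hi1 = sorted((pos[a], pos[b]))
        lo2, hi2 = sorted((pos[c], pos[d]))
        if lo1 < lo2 < hi1 < hi2 or lo2 < lo1 < hi2 < hi1:
            total += 1
    return total


def average_crossings_exhaustive(words, edges):
    """Exact average over every possible ordering (small sentences only)."""
    counts = [crossings_in_order(order, edges) for order in permutations(words)]
    return Fraction(sum(counts), len(counts))


# Assumed example sentence and hand-written dependency pairs.
words = ["John", "ate", "an", "apple", "yesterday"]
edges = [("ate", "John"), ("ate", "apple"), ("apple", "an"), ("ate", "yesterday")]

print(expected_crossings_random_order(edges))      # 2/3
print(average_crossings_exhaustive(words, edges))  # 2/3
```

The two numbers agree, which is the point of the baseline: the expected count depends only on which dependencies share words, not on any property of a particular language.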

In spite of—or perhaps because of—his peers' doubts, Ferrer-i-Cancho and his collaborators have persevered, hoping that hard math will convince doubters where other arguments have failed. But to be truly robust, the theory needed to be tested on real language data, a process which wouldn't have been possible without the paper's coauthor, Dr. Carlos Gómez-Rodríguez.

"The article in press in Physical Review E culminates the project, summarizing all alternative hypotheses, synthesizing progress in the mathematical theory of crossings and, last but not least, showing that our arguments work in real sentences in spite of their simplifying assumptions," says Ferrer-i-Cancho. Whether or not this unprecedented rigor will convince linguists remains to be seen, but anyone arguing against the theory will have their work cut out for them.

To those outside linguistics (and especially those outside academia overall), the existence—or non-existence—of the no-dependency-crossing rule may seem like an odd hill to die on, an unusual battle to spend more than ten years fighting. But consider the analogous situation in physics: imagine a scientist finds a way of looking at things such that a law of nature turns out not to be a law at all, simply a new manifestation of a more fundamental principle.

In fact, just such a thing happened in the 1960s and 1970s, when physicists showed that the "weak force" that governs radioactive decay and the electromagnetic force that governs charged particles are two aspects of the same thing. When two laws are found to be different manifestations of the same principles, we call it "unification": a step toward the grand unified theory of the universe that theoretical physicists have searched for since the dawn of modern physics.

The development of electroweak theory earned its originators a Nobel Prize, and rightly so: they brought us toward a more elegant perspective on the universe. Although we may not see a "grand unified theory" of language as a result of Ferrer-i-Cancho's work*, so much of science is a quest to discover the basic rules, the simplest set of axioms or equations necessary to produce the universe we know, in all its stunning complexity and beauty. In that regard, whether or not this latest paper is the final word, Dr. Ferrer-i-Cancho has already succeeded.

—Stephen Skolnick

*Dr. Ferrer-i-Cancho informs us that he has, in fact, made significant progress toward a unified theory of word order.


  1. Linguistics of the published variety is so riddled with errors that it's easy to knock things down. The real stuff is not published because it is too valuable to AGI development, so no one should get excited about any published advance as it is all trivial and out of date to the real leaders in this field.

  2. David Cooper
    That would be easy to say of almost any field, and equally hard to prove or disprove.