Language Scraps: How Google understands language like a 10-year-old

by James Temple, San Francisco Chronicle Staff Writer
Monday, October 18, 2010

Language has long been one of the most difficult challenges in artificial intelligence research, mainly because programs are based on rules, while native tongues cobbled together over hundreds of years tend to flout them.

Researchers only began to make major strides in the last 15 years or so, once they began supplementing rules with a so-called statistical approach.

Put very simply: By analyzing huge quantities of human text, initially labeled and dissected in much the manner of English class sentence diagramming, machines eventually begin to detect the patterns that define the use of language. After a certain stage of development, the algorithms can be unleashed onto raw or unstructured data, and continue to refine their understanding.

The same process has led to similarly momentous advances in language translation tools, and machine perception technologies like facial and voice recognition.

The success of this approach has been further propelled by two key developments: the sudden availability of massive amounts of digital text by way of the Internet, and the enormous computing power available to researchers through server farms strung together across the planet.

Now when Google's computers confront a word with multiple meanings, they can rely on the same clues that humans use to understand the meaning.

Take the word "can." It might be a noun (a metal container), a verb (to put something into such a container) or a modal verb (to be able to do so). You can can something in a can.

Based on the billions of examples its algorithm has analyzed, Google knows that if "can" is preceded by a pronoun ("you"), it's most likely the modal verb. If it's followed by an object ("something"), it's most likely a verb. If it comes after an article ("a"), it's most likely a metal container. (And in just about every case other than the one in the preceding paragraph, two cans in a row would probably denote a dance.)
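To make the idea concrete, here is a toy sketch of how neighboring words can vote on the sense of "can." It is not Google's method: the nine labeled examples, the sense names and the function names are all invented for illustration, and a real system would learn from billions of examples rather than nine.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labeled examples: (word before "can", word after "can", sense).
LABELED = [
    ("you", "see", "modal"),
    ("we", "go", "modal"),
    ("they", "help", "modal"),
    ("a", "of", "noun"),
    ("the", "is", "noun"),
    ("a", "opener", "noun"),
    ("to", "something", "verb"),
    ("to", "tomatoes", "verb"),
    ("can", "something", "verb"),
]

def train(examples):
    """Count how often each sense follows or precedes a given neighbor word."""
    prev_counts = defaultdict(Counter)
    next_counts = defaultdict(Counter)
    for prev, nxt, sense in examples:
        prev_counts[prev][sense] += 1
        next_counts[nxt][sense] += 1
    return prev_counts, next_counts

def tag_can(sentence, model):
    """Return the most likely sense for each "can" in the sentence, in order."""
    prev_counts, next_counts = model
    words = sentence.lower().replace(".", "").split()
    senses = []
    for i, word in enumerate(words):
        if word != "can":
            continue
        prev = words[i - 1] if i > 0 else ""
        nxt = words[i + 1] if i + 1 < len(words) else ""
        # Add up the votes from the left and right neighbors.
        votes = prev_counts.get(prev, Counter()) + next_counts.get(nxt, Counter())
        senses.append(votes.most_common(1)[0][0] if votes else "unknown")
    return senses

model = train(LABELED)
print(tag_can("You can can something in a can.", model))
# ['modal', 'verb', 'noun']
```

The pronoun before the first "can," the object after the second and the article before the third each tip the count toward a different sense, which is the same kind of clue the paragraph above describes.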

The search engine has also begun to understand which words are synonyms for others. That's why today Google knows that a user typing the query "change memory in my laptop" would probably be interested in a string of text online that reads "install laptop RAM," even though only one word is the same. Google was incapable of a match like that as recently as three years ago.
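A rough sketch shows why synonym awareness matters for that query. The synonym table and stop-word list below are hypothetical and hand-built; the article does not say how Google derives its synonyms, and a production system would presumably learn them from data instead.

```python
# Hypothetical synonym table mapping words to one shared concept name.
CANONICAL = {"change": "install", "replace": "install", "memory": "ram"}
# Hypothetical list of filler words to ignore.
STOPWORDS = {"in", "my", "the", "a"}

def literal_terms(text):
    """Lowercased words with the filler words removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def concept_terms(text):
    """The same words, each replaced by its shared concept when one is known."""
    return {CANONICAL.get(w, w) for w in literal_terms(text)}

query = "change memory in my laptop"
page = "install laptop RAM"

literal_overlap = literal_terms(query) & literal_terms(page)
concept_overlap = concept_terms(query) & concept_terms(page)

print(sorted(literal_overlap))  # ['laptop']
print(sorted(concept_overlap))  # ['install', 'laptop', 'ram']
```

Matching on literal words finds only "laptop" in common, while matching on concepts lines up all three terms.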

These improvements have allowed users to increasingly express their queries using natural language, instead of breaking down their wants into three-word Boolean expressions. As consumers have caught on to this, the length of average queries has steadily grown.

Artificial intelligence isn't a silver bullet for online search, however. Google is continually tweaking its algorithms to address shortcomings, but some problems can be quite difficult to solve.

For instance, "pain killers that don't upset stomach," a fairly common query, trips up the engine because it's not great at negation. Typically, the words in a query represent things people do want to find, not things they don't.

And sometimes probability works against the search engine: Google tends to think that Dell and Lenovo are the same thing because so many similar words show up around the names of the two computer manufacturers.
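The Dell and Lenovo mix-up can be reproduced in miniature. The sketch below assumes a simple model in which a word is described by the words that appear alongside it, and two words are compared by the cosine of the angle between those counts. The five sentences are invented, and nothing here is claimed to be Google's actual technique.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; a real system would read billions of sentences.
SENTENCES = [
    "dell laptop with fast memory",
    "lenovo laptop with fast memory",
    "dell desktop computer on sale",
    "lenovo desktop computer on sale",
    "banana bread with ripe fruit",
]

def context_vector(target, sentences):
    """Count the words that share a sentence with the target word."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

dell = context_vector("dell", SENTENCES)
lenovo = context_vector("lenovo", SENTENCES)
banana = context_vector("banana", SENTENCES)

print(cosine(dell, lenovo))  # close to 1: the two names keep the same company
print(cosine(dell, banana))  # much lower: only "with" is shared
```

Because the two brand names are surrounded by identical words in this corpus, the model has no evidence that they differ, which is the failure the paragraph above describes.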

The algorithm's understanding of language "has moved from a 2-year-old infant to something close to an 8 or 10-year-old child," said Amit Singhal, a Google Fellow, an honorific reserved for the company's top engineers. "They're still not approaching the conversations you'd have as a teenager."
