Friday, December 17, 2010

"Culturomics," Google Labs Ngram and "Quantitative Analysis of Culture Using Millions of Digitized Books"

I'd like to close out the year by spreading the word on Google Labs' new tool for searching its digital storehouse of words and phrases and mapping how they appear over time in literature.

You may have seen the original paper, published in the journal Science:

Quantitative Analysis of Culture Using Millions of Digitized Books

Abstract:
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.



Or yesterday's article in the New York Times:
In 500 Billion Words, New Window on Culture

"...represents the first time a data set of this magnitude and searching tools are at the disposal of Ph.D.’s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian.

The intended audience is scholarly, but a simple online tool allows anyone with a computer to plug in a string of up to five words and see a graph that charts the phrase’s use over time — a diversion that can quickly become as addictive as the habit-forming game Angry Birds."



I agree that it can be addictive and a huge time suck. I was almost late for work today because I was comparing the popularity of "dog" vs. "cat" and "Mickey Mouse" vs. "Abraham Lincoln."

The tool can be accessed via: http://ngrams.googlelabs.com/
The raw data is downloadable, and you can read more about the tool at: http://ngrams.googlelabs.com/info
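
If you'd rather play with the raw data than the web tool, here is a rough Python sketch of the kind of comparison the graphs make: how often a word appears each year, relative to all the words printed that year. I'm assuming each line of a downloaded file is tab-separated, with the phrase first, the year second, and the match count third; check the info page for the actual layout before relying on this. The function names and the sample rows are made up for illustration and are not real Ngram data.

```python
from collections import defaultdict


def yearly_counts(lines, phrase):
    """Sum match counts per year for one phrase.

    Assumed layout (not confirmed here): phrase, year, match count,
    separated by tabs. Any extra columns are ignored.
    """
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != phrase:
            continue
        counts[int(fields[1])] += int(fields[2])
    return dict(counts)


def relative_frequency(counts, totals):
    """Divide each year's count by the total word count for that year."""
    return {
        year: count / totals[year]
        for year, count in sorted(counts.items())
        if totals.get(year)  # skip years with no (or zero) total
    }


# Made-up rows and totals, for illustration only.
SAMPLE = "dog\t1900\t120\ncat\t1900\t80\ndog\t1901\t150\ncat\t1901\t90\n"
SAMPLE_TOTALS = {1900: 1000, 1901: 1500}

if __name__ == "__main__":
    for word in ("dog", "cat"):
        counts = yearly_counts(SAMPLE.splitlines(), word)
        print(word, relative_frequency(counts, SAMPLE_TOTALS))
```

Dividing by the yearly total matters because far more books were printed in later years, so raw counts alone would make almost every word look like it is on the rise.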
