Google have just opened up their text-mining project, a vast and ambitious attempt to let anyone search their digital library for the frequency of words and phrases. It’s an astonishing resource, not only for its research potential but also for its ludic possibilities, not to mention its time-frittering capabilities.
It’s easy to play. Just go to http://ngrams.googlelabs.com/, put in some words or phrases, separating multiples with commas, adjust the settings as you wish, and press the button. Up comes a graph showing distribution across your chosen time period. Thrill to the peaks! Gasp at the troughs! Wonder at the abundances! Curse the absences!
As with all good games, there’s much wisdom to be gained from playing. For one thing, it tests the sources, both the original works and their translation into digital format. (Some of Google’s metadata is bizarrely inaccurate.) It makes one think about possible reasons: whatever the results of a query are, they never explain why. It questions the technologies of language, whether printed or digital, including orthography, typeface and grammar. (Note that it covers only written language, not language spoken or sung, with its rhythm, accents and inflections.) In all, it indicates the ambiguity and instability of language, as against any needs or claims of clarity and transparency of meaning.
Below are four jests: testing a grandiose claim, a revealing anachronism, typographical obscenity and the deleterious effects of popular culture on the Queen’s English.
Benjamin was right!
And never mind population, trade, dominions and suchlike. But note that had I included ‘Rome’, Benjamin would have been refuted by the number of works on ancient history.
Surrealism: A Victorian Creation?
Although the word ‘surrealism’ was coined by Apollinaire in 1917, and given substance in the 1920s by André Breton, Google finds it in mid-Victorian English:
There’s an interesting error behind this. In parsing The London Review, volume 15, 1860-1, Google’s OCR software has run together the last, hyphenated word of one page and the first word of the next, which belongs to a chapter heading rather than to the continuation of the text proper. Such bugs, the ‘revealing errors’ of logic, can be considered a manifestation of the surrealist spirit.
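For the curious, here is a minimal sketch of how such a join might happen. It is not Google’s actual pipeline, which I haven’t seen; the function `join_pages` and the page texts are invented purely for illustration, on the assumption that a naive de-hyphenator glues a trailing fragment to whatever opens the next page.

```python
def join_pages(pages):
    """Naively concatenate page texts, merging a trailing hyphenated
    fragment with whatever opens the next page (heading or not)."""
    text = ""
    for page in pages:
        page = page.strip()
        if text.endswith("-"):
            # Drop the hyphen and glue the fragment to the next page's first word.
            text = text[:-1] + page
        elif text:
            text += " " + page
        else:
            text = page
    return text


# Invented page texts, NOT the real London Review: page one ends mid-word,
# page two opens with a chapter heading before the text resumes.
pages = [
    "the critic expressed his sur-",
    "REALISM IN FICTION\nprise at the plot",
]

words = join_pages(pages).lower().split()
print("surrealism" in words)
```

A more careful joiner would skip running heads and chapter headings before looking for the second half of the word, which is exactly the step that seems to have been missed here.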
Early modern printing is notorious for its long s (‘ſ’), which looks very like an ‘f’ and is routinely read as one. Oh the comedic potential, as is borne out by this graph:
The alternative explanation is, of course, that we were a foulmouthed bunch until the Victorians, and only since the 1950s have we begun to throw off those moral shackles.
Star Trek and the split infinitive
That most controversial of grammatical issues had a historical turn in the 1980s, when “to boldly go” overtook “to go boldly.” A way of interrogating the corpus for the frequency of split infinitives isn’t obvious, but the results would be very interesting.
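One crude way to attempt it, offered as a sketch rather than a method: scan the 3-gram files for ‘to’, followed by a word ending in ‘-ly’, followed by one more word. The code assumes each line is tab-separated with the n-gram, the year and the match count in the first three columns; check that against the dataset documentation before trusting it. The function name and the sample rows are made up.

```python
import re
from collections import defaultdict

# Heuristic only: "to" + a word ending in -ly + one more word.
SPLIT_INFINITIVE = re.compile(r"^to \w+ly \w+$", re.IGNORECASE)


def split_infinitive_counts(lines):
    """Sum match counts per year for 3-grams that look like split infinitives.

    Assumes tab-separated lines: ngram, year, match_count, then any
    further columns, which are ignored.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        ngram, year, count = fields[0], fields[1], fields[2]
        if SPLIT_INFINITIVE.match(ngram):
            totals[int(year)] += int(count)
    return dict(totals)


# Made-up rows in the assumed format, not real corpus figures.
sample = [
    "to boldly go\t1985\t12\t10\t8",
    "to go boldly\t1985\t7\t7\t6",
    "to really understand\t1985\t30\t28\t20",
    "to family members\t1985\t5\t5\t5",
]
print(split_infinitive_counts(sample))
```

Note that ‘to family members’ is counted too, since ‘family’ ends in ‘-ly’, while adverbs such as ‘never’ or ‘just’ are missed altogether. A serious attempt would need part-of-speech information, which this sketch doesn’t have.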
Google have provided some basic, but literate, documentation. And the datasets are freely available under a Creative Commons licence. A Guardian article serves as a decent introduction, although it exaggerates the originality of the techniques. See also the New York Times. And don’t ever, ever, use the pseudo-word ‘culturomics.’