Visualizing the Gnu GPL

My suggestion for the Decoding Digital Humanities meeting has been accepted, by both the London and Melbourne groups, for next Tuesday (24th August) here in the Great Wen, and next Thursday (26th August) down under. I’m feeling the warm glow of internationalism!

One reason I suggested the Gnu GPL as a text was for its unfamiliarity of form. It’s a software license, a genre often viewed but rarely read. I’ve clicked through many, barely registering the dense legalese, meaning I’ve probably promised to sacrifice my first-born to Bill Gates. The GPL, to its great credit, has a clear and concise preamble. Nevertheless, it is a legal document, written to withstand exacting juridical scrutiny.

As digital humanists, we shouldn’t be frightened of such things, for we make tools to deal with such difficulties. Whether the texts are in another language, damaged, obscured, fragmentary, long-winded, self-referential, or simply too numerous – not forgetting that no text is so transparent that one simple reading will comprehend it entirely – we can hack them.

One popular way of doing this is with wordles. These are, in essence, visualized concordances. The words are weighted according to frequency, then displayed as clouds. There are various options for colour, layout and font, but these do not reflect any aspect of the text; they are there for aesthetic appeal, which is one reason for wordles’ popularity. (The creator of Wordle, Jonathan Feinberg, discusses this in Viégas et al., “Participatory Visualization with Wordle.”)

So here I present the three versions of the Gnu GPL as wordles. They are made from the 100 most used words, filtered for the common and ordinary (‘the’, ‘and’). I have attempted to minimize the extraneous as much as possible, having the words displayed horizontally, (near) alphabetically, in plain, plain black and white.

Wordle of the 100 most used words in the Gnu GPL v.1.

Wordle of the 100 most used words in the Gnu GPL v.2, 1991.

Wordle of 100 most used words, GPL v.3, 2007.
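The word lists behind wordles like these can be approximated with a few lines of Python. This is only a minimal sketch of the counting step (lower-case the text, drop common words, keep the most frequent); it is not the code behind wordle.net, which is closed. The stop list and the function name `top_words` are my own assumptions, and Wordle's real list of common words is certainly longer.

```python
import re
from collections import Counter

# A tiny, illustrative stop list (an assumption, not Wordle's own list).
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "or", "is", "that", "for"}


def top_words(text, n=100, stop_words=STOP_WORDS):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lower-case the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Count everything that is not in the stop list.
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# A made-up sentence standing in for the licence text.
sample = "The program is free. You may copy the program and share the program."
print(top_words(sample, n=3))
```

To try it on the real thing, read a plain-text copy of one version of the licence into a string and pass it to `top_words` in place of `sample`.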

By taking the three versions, I’m treating the GPL historically, as changing over time. The most obvious and startling finding is that the term ‘program’ has dramatically declined in use from version 2 to version 3, changing the whole picture from being arrow-shaped to more cloud-like. (The algorithm for laying out the words is in Viégas et al.) Its synonym, ‘Work’, has risen in its place. ‘Free’ has declined proportionally, but in absolute terms the story is quite different: it features 23 times in v.1, 28 times in v.2, and 20 times in v.3. ‘Freedom’, not found in the graphics above, rises from 3 usages in v.1, to 4 in v.2, and 8 – doubled – in v.3.
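The difference between absolute and proportional use is easy to check by hand. Here is a rough sketch, again in Python and under my own assumptions: the function name `term_counts` is hypothetical, and words are matched whole, so ‘free’ and ‘freedom’ are counted separately. It returns both the raw count of a term and its share of all the words in the text.

```python
import re


def term_counts(text, term):
    """Return (absolute count, share of all words) for a whole-word term."""
    # Split the text into lower-case words made of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    absolute = sum(1 for w in words if w == term.lower())
    # Guard against empty input when working out the proportion.
    share = absolute / len(words) if words else 0.0
    return absolute, share


# A made-up snippet, not the licence itself.
snippet = "Free software is free"
print(term_counts(snippet, "free"))
```

Run over each version of the licence in turn, this gives both figures side by side, so a word can fall as a proportion of an ever longer text while its raw count moves quite differently.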

I could spend all day poring over these things, but I’ve probably spent too long already when I have a dissertation to write. In any case, the purpose has been to suggest ways of reading the Gnu GPL, and I will leave discussion to the convivial atmosphere of the meetings.

NB: The code behind wordle.net is owned by IBM, and closed. A free version that allows adjusting and playing with the code would be most desirable.

Reference: Fernanda B. Viégas, Martin Wattenberg, Jonathan Feinberg, “Participatory Visualization with Wordle,” IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1137-1144, Nov./Dec. 2009, doi:10.1109/TVCG.2009.171. Behind a paywall, sadly, but the abstract is available.

DH 2010, day two

I really don’t do mornings. But somehow I got to King’s on time (8.30!) and started work watching over the TEI (Text Encoding Initiative) session in the bowels of the Strand building.

Errands meant I only heard the first of those talks, given by Flanders on TEI documentation. To be honest, I wasn’t expecting much, but it proved to be a very important paper. Although it was focused on the needs and capabilities of TEI, the fundamental idea – that people need different forms of documentation, but basically the same information – has far wider application. From this Flanders identified nine (!) different types of document, and ways ‘bricks’ of information could be re-used. This is moving ‘help’ from being a bundle of text files to being a proper software application. I think the TEI ODD (‘One Document Does it All’) system has some similarities with Perl’s POD (Plain Old Documentation) markup, though not knowing a great deal about either means I may be (very) wide of the mark.

In the afternoon I attended the Archives session. First up was Dirk Roorda talking about “The ecology of longevity”, using evolutionary theory to think about the preservation of data. Normally, such biological metaphors have me reaching for my proverbial revolver, but here they were used with some subtlety and care. Unfortunately, a great leap was suddenly made into some thoroughly specious economics, which the audience rightly picked up on in the questions. How, after discussing the complexity and chaos of biology, could the speaker throw up platitudes dating from a century before Darwin?

Schlosser and Ulman’s talk on preserving digital projects had an interesting dialectic going on between the academic and the archivist, and – very important to me – recognized that not all digital projects are ambitious, heavily funded, grand collaborations, but also ‘fragile vessels’, projects that are on the margin, not mission critical. Buchanan then spoke on building Digital Libraries of Scholarly Editions. The problem here is aggregating individual projects into a library: each edition has its own aims, quirks and standards, and a library has to create some uniformity. Buchanan spoke of the difficulties in building such libraries; it occurred to me later that perhaps the problem has to be solved by the makers of the editions, and portability is their responsibility.

Late afternoon was spent looking round the poster displays, noting especially the cartography projects. Google Maps was used, though some were chafing against its limitations. There is a real need for an easily deployed, standalone mapping CMS using free data. (And it’s on my to-do list.)

DH 2010, day one

For the next few days I’m a student assistant at Digital Humanities 2010, doing a bit of everything, from giving directions to waving microphones under people’s noses.

The first day of the conference proper (there have been many associated events in the last few days) was mainly taken up with organization, with only a few events. I missed the second day of THATCamp London, Twitter proving more frustrating than informative as it just made me want to be there more than ever, but managed to catch Dan Cohen afterwards for my first interview.

The only event I attended was the launch of the CHARM (Centre for the History and Analysis of Recorded Music) sound files. These are digitisations of out-of-copyright, lesser-known 78 rpm records from the 20s and 30s, and are freely downloadable. Hallelujah for free, because there are some gems to be discovered. Check out Mischa Spoliansky’s excellent, jaunty version of Gershwin’s Rhapsody in Blue (seemingly no static URLs, but the search interface is easy to use). And thank you to CHARM for not locking the music up: both the speakers spoke with an enthusiasm they wanted to share. Got interviews with them too.

Duties meant I missed the opening ceremony – which also featured CHARM – but had a snigger at the tweets about paleography provoked by the words of King’s lamentable principal.

Serious seminars start tomorrow. Perhaps serious blog posts too.