Visualizing the Gnu GPL

My suggestion for the Decoding Digital Humanities meeting has been accepted, by both the London and Melbourne groups, for next Tuesday (24th August) here in the Great Wen, and next Thursday (26th August) down under. I’m feeling the warm glow of internationalism!

One reason I suggested the Gnu GPL as a text was for its unfamiliarity of form. It’s a software license, a genre often viewed but rarely read. I’ve clicked through many, barely registering the dense legalese, meaning I’ve probably promised to sacrifice my first-born to Bill Gates. The GPL, to its great credit, has a clear and concise preamble. But nevertheless, it is a legal document, written to withstand exacting juridical scrutiny.

As digital humanists, we shouldn’t be frightened of such things, for we make tools to deal with such difficulties. Whether the texts are in another language, damaged, obscured, fragmentary, long-winded, self-referential, or simply too numerous – not forgetting that no text is so transparent that one simple reading will comprehend it entirely – we can hack them.

One popular way of doing this is with wordles. These are, in essence, visualized concordances. The words are weighted according to frequency, then displayed as clouds. There are various options for colour, layout and font, but these do not reflect any aspect of the text, being more for aesthetic appeal, and as such a cause for their popularity. (The creator of Wordle, Jonathan Feinberg, discusses this in Viégas et al, “Participatory Visualization with Wordle.”)

So here I present the three versions of the Gnu GPL as wordles. They are made from the 100 most used words, filtered for the common and ordinary (‘the’, ‘and’). I have attempted to minimize the extraneous as much as possible, having the words displayed horizontally, (near) alphabetically, in plain, plain black and white.
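The counting behind such a picture can be sketched in a few lines of Python. This is only an illustration of a filtered frequency list, not Wordle's own code (which is closed); the function name `top_words` and the tiny stop list are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop list only; Wordle's own list of common words is not reproduced here.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "or", "is", "that", "for"}

def top_words(text, n=100, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

# A made-up sentence, not a quotation from the license.
sample = "The program and the work: you may copy the program, and modify the work or the program."
print(top_words(sample, n=2))
```

Fed the full text of each license version, the resulting list of word and count pairs is the raw material a word cloud weights its type sizes by.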

Wordle of the 100 most used words in the Gnu GPL v.1.

Wordle of the 100 most used words in the Gnu GPL v.2, 1991.

Wordle of 100 most used words, GPL v.3, 2007.

By taking the three versions, I’m treating the GPL historically, as changing over time. The most obvious and startling finding is that the term ‘program’ has dramatically declined in use from version 2 to version 3, changing the whole picture from being arrow-shaped to more cloud-like. (The algorithm for laying out the words is in Viégas et al.) Its synonym, ‘Work’, has risen in its place. ‘Free’ has declined proportionally, but in absolute terms, the story is quite different: it features in v.1 23 times, v.2 28 times, and v.3 20 times. ‘Freedom’, not found in the graphics above, rises from 3 usages in v.1, to 4 in v.2, and 8 – doubled – in v.3.
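The difference between absolute and proportional use is easy to check by hand. A minimal sketch, assuming each version is available as a plain string; `term_stats` is a hypothetical helper, and the two sample texts are invented to show the effect, not taken from the license.

```python
import re

def term_stats(text, term):
    """Return (absolute count, share of all words) for one term, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())
    count = words.count(term.lower())
    share = count / len(words) if words else 0.0
    return count, share

# Invented texts: the later one is longer, so the term's count rises while its share falls.
earlier = "free " * 2 + "word " * 8
later = "free " * 3 + "word " * 27

for label, text in (("earlier", earlier), ("later", later)):
    count, share = term_stats(text, "free")
    print(f"{label}: {count} uses, {share:.1%} of all words")
```

Because a word cloud sizes words relative to one another, a term can shrink in the picture even when its raw count holds steady or grows in a longer text.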

I could spend all day poring over these things, but I’ve probably spent too long already when I have a dissertation to write. In any case, the purpose has been to suggest ways of reading the Gnu GPL, and I will leave discussion to the convivial atmosphere of the meetings.

NB: The code behind wordle.net is owned by IBM, and closed. A free version, that allows adjusting and playing with the code, would be most desirable.

Reference: Fernanda B. Viégas, Martin Wattenberg, Jonathan Feinberg, “Participatory Visualization with Wordle,” IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1137-1144, Nov./Dec. 2009, doi:10.1109/TVCG.2009.171 Behind a paywall, sadly, but abstract available.

DH 2010, day four

For me, the final day was the important one, with both the geography and history sessions taking place. The former saw three excellent presentations, from the University of North Carolina, Ian Gregory and the Hestia project. But the big news is that the UNC have built a locally-deployable, open source map server, called Main Street Carolina and available sometime this summer. There’s not much information available, but it is used for many of their projects including Going To The Show, and there’s a blurb and blogpost online. I have seriously high hopes for this, as a way of easily putting maps on the web without having to go down the Google route.

The highlight of the Professional Reflection strand was Claire Ross’s Pointless Babble or Enabled Backchannel, a witty and zippy analysis of twitter usage during three Digital Humanities conferences in 2009. Far more than 140 characters, without any excess and plenty of time for questions.

The History strand saw two very good presentations. And one that had me gawping in disbelief. Roorda’s Letters, Ideas and Information Technology, on visualizing seventeenth century correspondence, and Sainte’s Reading Darwin Between The Lines, analysing Darwin’s rare use of the term ‘evolution’, were very fine. But Blaney’s Developing a Collaborative Online Environment for History – The Experience of British History Online was a trip into the digital netherworld.

What British History Online wanted to do was crowdsource the Calendars of State Papers, those abstracts of government paperwork compiled in Victorian times and now showing their age. So what do they do? Raise obstacles to participation. First, the CSP are behind a paywall, and as far as I can tell, there are no institutional subscriptions available. So the academics they hoped would annotate the documents had to pay for the honour. Then, to minimise contributions either malicious or erroneous, they deliberately put in obstacles and constraints to make annotation difficult. *rollseyes* Do they have any idea what crowdsourcing is?

Contributions were, unsurprisingly, sparse.

One of the audience asked about re-use. We were informed that the XML was locked up, the documents copyrighted (even though much of the material on BHO has long since passed into the public domain), but generously, we can print off as many copies as we wish. This was the only time I heard such sentiments expressed at DH2010; everyone else understood the importance of openness, of re-use, of contributing corrections and improvements, of sharing. It’s called community. And if you look at the graphic below, you’ll see it’s one of the prominent words (used 25 times) in the closing address from Melissa Terras, Present, Not Voting.

Wordle of Melissa Terras’ speech at DH2010

‘Transcribe’ and ‘Bentham’ also feature as this is a crowdsourcing project Terras is involved in. As she says:

one of the things we want to do with Transcribe Bentham is to provide access to the resulting XML files so that others can reuse the information (via web-services, etc). The hosting and transcription environment we are developing will be open source, so that others can use it. And this sea change, from working in small groups, to really reaching out to users is something we have to embrace, and learn to work with.

The prospect of easily setting up such collaborations is mouthwatering. Access, re-use, reaching out, yes yes yes. Sharing is fundamental to what we do, and we are stronger when we share. And right now the Digital Humanities community – like everyone else – faces terrible pressure, from government and university management, and needs to get stuck in:

We need people who are not just prepared to whine but prepared to roll up their sleeves and do things to improve our associations, our community, and our presence in academia.

Her whole speech was barnstorming, critical but not despondent, electrifying the audience, and the highlight of a conference that, for all the heat and rushing around and getting up way too early, truly inspired me.

DH 2010, day three

Not such an early start, so I missed Joshua Sternfeld’s talk on Digital Historiography. Annoying, but a sign of a good conference is that there’s too much of interest rather than too little.

For me, the important presentation in the Teaching/Managing strand was Nowviskie and Porter’s “The Graceful Degradation Survey: Managing Digital Humanities Projects Through Times of Transition and Decline.” The afterlife of digital projects – and websites in general – is not only very important, but quite neglected, seemingly being done on an ad-hoc, voluntary basis. It was more to do with project management, organization and funding; I had hoped to hear something about technical solutions. It did suggest that there is a move to creating smaller, more preservable packets of information: a granular approach insuring against complete meltdown.

Another suggestion was that Digihum projects are increasingly being operated outside the academy. There’s a subterranean current here at DH2010 of extra-academic projects, ‘fragile vessels’ (as mentioned yesterday), small unfunded projects. One of those – a graduate project now continuing independently – is contextus, which featured in the Scanning Between the Lines: The Search for the Semantic Story panel in the afternoon. Aside from being a very clear and useful introduction to RDFa (foaf etc), and being sprinkled with Doctor Who references, the speakers showed the great potential of the ‘semantic web’, about which I’d previously been a bit doubtful.

Many of the posters displayed, as on day two, were also for small, semi-independent or semi-official projects, using whatever tools are available free (in the financial sense). Somehow, this aspect of the Digital Humanities isn’t getting the full recognition it deserves. The lack of money shouldn’t mean abandoning a good or interesting idea, nor should it be considered a denial of permission to do what we want to do. It’s an obstacle, yes, but not insurmountable. Ways of operating on a shoestring need to be shared. And there is the advantage that without funds, one isn’t beholden to funders.

DH 2010, day two

I really don’t do mornings. But somehow I got to King’s on time (8.30!) and started work watching over the TEI (Text Encoding Initiative) session in the bowels of the Strand building.

Errands meant I only heard the first of those talks, given by Flanders on TEI documentation. To be honest, I wasn’t expecting much, but it proved to be a very important paper. Although it was focused on the needs and capabilities of TEI, the fundamental idea – that people need different forms of documentation, but basically the same information – has far wider application. From this Flanders identified nine (!) different types of document, and ways ‘bricks’ of information could be re-used. This is moving ‘help’ from being a bundle of text files to being a proper software application. I think the TEI ODD (‘One Document Does it All’) system has some similarities with Perl’s POD (Plain Old Documentation) mark up, though not knowing a great deal about either means I may be (very) wide of the mark.

In the afternoon I attended the Archives session. First up was Dirk Roorda talking about “The ecology of longevity”, using evolutionary theory to think about the preservation of data. Normally, such biological metaphors have me reaching for my proverbial revolver, but here they were used with some subtlety and care. Unfortunately, a great leap was suddenly made into some thoroughly specious economics, which the audience rightly picked up on in the questions. How, after discussing the complexity and chaos of biology, could the speaker throw up platitudes dating from a century before Darwin?

Schlosser and Ulman’s talk on preserving digital projects had an interesting dialectic going on between the academic and the archivist, and – very important to me – recognized that not all digital projects are ambitious, heavily funded, grand collaborations, but also ‘fragile vessels’, projects that are on the margin, not mission critical. Buchanan then spoke on building Digital Libraries of Scholarly Editions. The problem here is aggregating individual projects into a library: each edition has its own aims, quirks and standards, and a library has to create some uniformity. Buchanan spoke of the difficulties in building such libraries; it occurred to me later that perhaps the problem has to be solved by the makers of the editions, and portability is their responsibility.

Late afternoon was spent looking round the poster displays, noting especially the cartography projects. Google Maps was used, though some were chafing against its limitations. There is a real need for an easily deployed, standalone mapping CMS using free data. (And it’s on my to-do list.)

DH 2010, day one

For the next few days I’m a student assistant at Digital Humanities 2010, doing a bit of everything, from giving directions to waving microphones under people’s noses.

The first day of the conference proper (there’s been many associated events in the last few days) was mainly dealing with organization, with only a few events. I missed the second day of THATCamp London, twitter proving more frustrating than informative as it just made me want to be there more than ever, but managed to catch Dan Cohen afterwards for my first interview.

The only event I attended was the launch of the CHARM (Centre for the History and Analysis of Recorded Music) sound files. These are digitisations of out-of-copyright, lesser known, 20s and 30s 78 rpm records, and are freely downloadable. Hallelujah for free, because there are some gems to be discovered. Check out Mischa Spoliansky’s excellent, jaunty version of Gershwin’s Rhapsody in Blue (seemingly no static URLs, but the search interface is easy to use). And thank you to CHARM for not locking the music up: both the speakers spoke with an enthusiasm they wanted to share. Got interviews with them too.

Duties meant I missed the opening ceremony – which also featured CHARM – but had a snigger at the tweets about paleography provoked by the words of King’s lamentable principal.

Serious seminars start tomorrow. Perhaps serious blog posts too.

Two Gnus: The Gnu Project and the Gnu GPL

For the next Decoding Digital Humanities meeting, I’d like to propose reading two fundamental documents of the free software movement, Richard Stallman’s Gnu Project and the Gnu GPL (General Public License). These texts build on the last meeting’s reading of Eric Raymond’s The Cathedral and the Bazaar, but are less about the process of coding, and more on the programmer in the world. The first is a brief history of sharing code and a plan for a completely free operating system, the second the most popular free software license, designed to protect both sharing and code.

They’re relevant to the Digital Humanities, and what we’ve been discussing, in numerous ways:

  • They show the human culture around the code, both implicitly (styles of writing, ways of thinking about a problem) and explicitly (Stallman’s description of sharing at MIT). The humanity around the digital, one can say.
  • We face very similar problems with sharing other things, like data and findings. That sharing is fundamental to learning; too much material is being locked up under dubious copyright claims and illiterate t&cs, never mind paywalls.
  • Talking of paywalls, both texts have a subtle attitude to commerce, seemingly unconcerned with money but overtly opposed to monopolisation.

And of course, we use the fruits of these works.

More than that, I think these texts can be read in very different ways: beyond being a license, the GPL can be seen as a ‘hack’, repurposing copyright into copyleft; a history of debate and struggle is found across its three revisions (and its offspring for web-deployed software, the Affero GPL); the Gnu Project is history, philosophy, polemic and an embodiment of sheer will. Reading differently is what the (digital) humanities does.

Further discussion is for the pub; this is just to suggest some suitable – and interesting! – reading around which to talk.