Making the TCP texts accessible, part 2 [Updated]

Nearly five years ago, I uploaded over two thousand eighteenth-century works in plain text from ECCO (Eighteenth Century Collections Online) to the Open Knowledge Foundation’s Datahub. Unfortunately, in a recent server migration, the texts disappeared from that repository; I hope to replace them once the limits on file sizes are raised. In the meantime, they are available courtesy of the Early Modern OCR Project, through their Github account.

Since my original post, the Text Creation Partnership have released many more early modern texts under an open license, from the EEBO, ECCO and Evans collections. (The first two are early modern and eighteenth-century British texts; the last is of early American texts.) There is also another collection, EEBO-TCP Navigations, a selection of early travel literature. Note that these collections have not been freely released in their entirety; many texts are still only available with a subscription of some sort, and are not downloadable in bulk.

The free texts are marked up in XML, rather than plain text, and can be accessed in various ways: as text displayed online through the University of Michigan, with a variety of useful tools for advanced searching; and as bulk downloads through Github repositories.

ECCO web interface

ECCO on Github

EEBO web interface

Evans Collection web interface

Evans Collection on Github

Navigations on Github

The EEBO collection is on Github via the TCP, but not as a single repository downloadable in one zip file. The TCP do offer a script to download them all.
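Because the free texts are XML rather than plain text, a first step for many kinds of analysis is stripping the markup. What follows is only a minimal sketch in Python, not the TCP's own script or method. It assumes a TEI-style file in which the work itself sits inside an element named body; that element name, the function names and the sample document are my own illustrative assumptions, and real files may need more careful handling of notes, gaps and page breaks.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element's tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def xml_to_plain_text(xml_string, container="body"):
    """Return the text inside the first <container> element, markup removed.

    'container' is an assumed element name; if no such element exists,
    the text of the whole document is returned instead.
    """
    root = ET.fromstring(xml_string)
    target = next(
        (el for el in root.iter()
         if isinstance(el.tag, str) and local_name(el.tag) == container),
        root,
    )
    # itertext() walks the element and its descendants in document order.
    text = "".join(target.itertext())
    # Collapse line breaks and runs of spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# A made-up document for illustration, not taken from any TCP file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><title>A sample title</title></teiHeader>
  <text><body><p>Of debt,
  and of <hi>credit</hi>.</p></body></text>
</TEI>"""

print(xml_to_plain_text(SAMPLE))
```

Matching on the local name rather than a fully qualified tag means the sketch should not care whether a file declares a namespace, at the cost of being less strict about what it accepts.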

The ECCO, EEBO and Evans collections are also accessible through the Oxford Text Archive, which holds both the freely available and the restricted access texts, filterable through the menu bar. The Navigations collection doesn’t have a web interface yet; apparently one is coming via Michigan ‘in early 2016’. Those are available at the Oxford Text Archive, but only with a subscription. Other ways of accessing the texts, generally via subscription, can be found at Heather Froehlich’s very handy post Ways of accessing EEBO(TCP), which has links to many other useful resources, and materials relating to her talk on 10 things you can do with eebo-tcp phase 1.

So there’s lots of texts available, more or less organised, from multiple sources. As to the remaining texts, and the TCP project as a whole, the outlook is unclear. Having asked on Twitter, I gather that the project is currently dormant, ‘not dead just resting’.

Neither the TCP Twitter account nor their website has seen any activity since December 2014. Better news is that over one hundred thousand texts have been uploaded to 18th Connect, through which the OCR can be corrected, whereupon the text will be made freely available. But this is going to be a long and painstaking task, reliant on volunteers with the patience to go through large volumes line by line. I seriously doubt anyone will get through the 500-odd pages of mind-numbing legalese contained in a single volume of Statutes At Large.

There are other problems to be aware of. With multiple repositories, and without a canonical set, co-ordinating corrections and additions is going to be very difficult. Consequently, citation becomes more important and more difficult. The arbitrary division into EEBO, ECCO and Navigations collections is problematic, especially as there may be different versions of the same texts. This division is to a considerable extent a consequence of different private companies digitizing the material, and as such has no intellectual grounding. (Evans seems entirely self-contained, but I could be wrong about that.)

Taken together, these are obstacles to building community resources around the texts. The reason for my interest in these texts is that I have my own personal itches to scratch, most immediately as to how debt was discussed in the eighteenth century. To use these corpuses for such an investigation requires a considerable amount of preparatory work, which takes time and effort, and can be a discouraging prospect. A community can ameliorate this to some extent.

In due course, I will follow this post up with a survey of the ways these texts can be manipulated, the tools that can be applied to them, and – a long way off yet! – what can be found through them of debt in the long eighteenth century.

As an addendum, here are some other sources of digitized English early modern and eighteenth-century printed works, which may hold items not openly available, or not even within the EEBO and ECCO collections at all.

Google Books, Hathi Trust and the Internet Archive are the obvious first ports of call, but are bedeviled by poor OCR, lousy metadata and, for Google and Hathi, access problems.

English Broadside Ballad Archive is a magnificent collection of printed songs.

Project Gutenberg has some pre-1800 texts, some of which are post-1800 editions. It can be found on Github, courtesy of Gitenberg. As yet, I haven’t found a way of extracting those texts from the collection.

The Wellcome Library has a large digital collection of medical materials, in the widest sense, including eighteenth-century texts.

Update: I have found some other early modern and eighteenth-century text archives. The Lampeter Corpus contains 120 texts spanning 1640 to 1740. A complete run of the London Gazette is online and searchable, notwithstanding the usual OCR problems. More ballads are available via the Bodleian.

I have also started collecting these texts together in a Github repository. Very much a work in progress, it so far has four sets: Lampeter, Evans, Navigations, and the original set of plain texts.

