Making the TCP texts accessible, part 3: An Index

I have previously posted about the vast collection of early printed texts released by the Text Creation Partnership. To recap: the TCP has released vast numbers of early modern and eighteenth-century texts, but they are not easily discoverable or downloadable. So I have put the six different sets up on GitHub. The sets are:

ECCO: 2,473 eighteenth-century publications in XML.

EEBO: 32,853 texts in XML.

Evans: 4,977 early American texts in XML.

Navigations: 1,482 texts relating to travel, in XML.

TCP Plain Texts: the first set of texts released, 2,188 of them in plain text format.

Unfinished: 628 texts that haven’t been proofread or checked, in XML.

(I’ve also created two further sets, of the Lampeter Corpus and of TCP texts I am transcribing via 18th Connect. But here I only discuss the collections released by TCP.)

This doesn’t actually add up to 43,973 different texts. (Not least because every time I count I get a different total.) Some publications fall into two or more collections: generally because there is both an XML and a plain text version, but it seems that there are different XML variants in different collections as well. Locating these doubles is difficult, and is just one of a number of problems. Simply knowing what’s included in these great corpuses is problematic, since the files have code names rather than anything comprehensible to humans. And only three of the sets actually come with any sort of inventory.

So the next step towards making these texts accessible is simply to list the contents, in .csv files that I have uploaded to GitHub. There is a list for each of the six sets, and an all-encompassing one amalgamating them. Three simply rejig the contents lists that came with the Plain Text, Evans and Navigations sets (the last particularly sparse). The rest I generated through an XSL stylesheet, pulling out information from the header files.
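For anyone wanting to try the same kind of extraction without XSL, here is a minimal sketch in Python rather than the stylesheet actually used. It pulls a few fields out of one header and writes them as a CSV row. The element names (TITLE, AUTHOR, DATE, PUBLISHER), the sample header and the function names are all assumptions made for illustration; the real TCP header files may well be structured differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical header, invented for this example. Real TCP headers
# may use different element names and nesting.
SAMPLE_HEADER = """<HEADER>
  <TITLE>A sample title</TITLE>
  <AUTHOR>Doe, Jane, 1600-1670.</AUTHOR>
  <DATE>[1650]</DATE>
  <PUBLISHER>London : Printed for A.B.</PUBLISHER>
</HEADER>"""

# CSV column name -> assumed XML element name.
FIELDS = {
    "Title": "TITLE",
    "Author": "AUTHOR",
    "Date": "DATE",
    "Publisher": "PUBLISHER",
}


def header_to_row(xml_text, collection):
    """Turn one header document into a dict keyed by CSV column name."""
    root = ET.fromstring(xml_text)
    row = {"COLLECTION": collection}
    for column, tag in FIELDS.items():
        element = root.find(f".//{tag}")
        # Missing elements become empty cells rather than errors.
        row[column] = "".join(element.itertext()).strip() if element is not None else ""
    return row


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    columns = ["COLLECTION"] + list(FIELDS)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv([header_to_row(SAMPLE_HEADER, "EEBO")]))
```

Leaving absent fields as empty cells keeps every row the same width, which matters given how many headers lack an author, a date or both.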

The column headers on these files are consistent:

UID: Unused as yet, will be a unique ID for each individual text file.

COLLECTION: Collection in which the file is to be found.

TCP, EEBO, ECCO: Project codes for the Text Creation Partnership, Early English Books Online and Eighteenth Century Collections Online.

VID: Reference to original page images of publication.

Book ID: Unclear.

STC: Various short title catalogue references.

ESTC: English Short Title Catalogue number.

Status: This column should be ‘Free’ throughout, as all the texts have been made freely available.

Author: Author, sometimes with years of life; missing for Navigations.

Date: Date of publication, generally year; data irregular and full of stray brackets.

Title: Title, sometimes full to overflowing, sometimes very clipped (as for the Navigations series).

Terms: Library category headings.

Publisher: Publisher, often including the place of publication.

Pages: Number of pages; data available for only some texts.

Word Count: as it says; data available for just a few texts.
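With the COLLECTION and TCP columns available, one rough way to hunt for the doubles is to group rows by code and keep any code that turns up in more than one collection. The sketch below assumes that the TCP column carries the same code wherever a publication appears, which may not hold for every set; the sample rows and the function name are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented rows standing in for the amalgamated list.
SAMPLE_INDEX = """COLLECTION,TCP,Title
EEBO,A00001,First title
TCP Plain Texts,A00001,First title
ECCO,K00002,Second title
"""


def find_doubles(csv_file, key="TCP"):
    """Map each identifier to its collections, keeping only those in two or more."""
    collections_by_id = defaultdict(set)
    for row in csv.DictReader(csv_file):
        identifier = (row.get(key) or "").strip()
        if identifier:  # rows with no identifier cannot be matched
            collections_by_id[identifier].add(row["COLLECTION"])
    return {
        identifier: sorted(collections)
        for identifier, collections in collections_by_id.items()
        if len(collections) > 1
    }


if __name__ == "__main__":
    print(find_doubles(io.StringIO(SAMPLE_INDEX)))
```

The key column is a parameter, so the same check can be rerun on ESTC or another identifier if the TCP code turns out to be an unreliable match.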

This data is very far from perfect. In fact it’s pretty poor, not to mention ugly. There are a great many missing fields, irregular and inconsistent data, typos, corrupted characters, inconsistent spelling, and no doubt further horrors I have yet to discover. But for now, it is good enough, and time allowing I will work to improve it. Errors can be pointed out and corrections made through GitHub’s facilities, either by forking and editing the documents, or by raising an issue.
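As one example of the sort of tidying that might help, here is a small sketch for the Date column: it looks for the first four-digit year in a messy value and ignores stray brackets and punctuation around it. The 1400 to 1999 range, the function name and the sample values in the comments are assumptions, not taken from the actual files.

```python
import re

# A four-digit year from 1400 to 1999, not embedded in a longer run of digits.
# The range is an assumption about what dates the collections contain.
YEAR = re.compile(r"(?<!\d)(1[4-9]\d{2})(?!\d)")


def extract_year(raw):
    """Return the first plausible year in a Date value, or None if there is none."""
    match = YEAR.search(raw or "")
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Invented examples of irregular values.
    for value in ["[1650]", "1701.]", "Printed in the year, 1642", "[16--?]", ""]:
        print(repr(value), extract_year(value))
```

Values with no recoverable year come back as None rather than a guess, so they can be reviewed by hand later.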

Meanwhile, enjoy this gratuitous word cloud of the titles of the texts.

Wordle of TCP titles

