Our thoughts on “Crowdsourcing and Variant Digital Editions —some troubles ahead” (2/2)

Yesterday, I posted the first half of a reply to the JISC Digitisation Programme’s recent blog post, “Crowdsourcing and Variant Digital Editions: some troubles ahead.”

In that post, I discussed the TCP’s stance on our texts being made available through an increasing number of access points, or “variant digital editions”: we’re all for it! But, as the JISC post forewarns, the situation becomes more complicated when these multiple platforms allow texts to be edited, not just consumed. In this post, I’ll focus on how the TCP is thinking about this new challenge.

So far, there has really only been an outward flow of data from the TCP to scholars and libraries, which has led to the creation of all kinds of end products, each of which must be evaluated on its own merit. Although the TCP texts have been edited, cleaned up, enhanced, and manipulated by many users, those changes have not been incorporated back into the source data. Instead, we have focused on producing more and more new texts and making them available to our partners—an undertaking that will occupy most of our staff’s time and attention through 2014, at least.

But as our production winds down, we will turn our eyes to the question of curating this archive we’ve created. This challenge will prove especially interesting as restrictions are lifted from the EEBO-TCP texts, and they become freely available for anyone to use. The biggest change on the horizon is not the extent to which TCP texts can be edited, manipulated, or transformed. It is the fact that anyone, not just users at partner institutions, can do it, and the results can be shared with, used by, and further changed by anyone.

For us, the question is, “How far do we go in keeping track of these variants, and using them to improve the corpus?” We’re already starting to get a taste of this challenge.

This spring, we released 2,231 ECCO-TCP texts to the public, free of any restrictions on distribution. Already, we have seen a lot of interest in these texts, mostly from text encoding experts working on converting our XML to other formats, such as P5-compliant TEI and EPUB. It didn’t take long for the question to arise: what is the relationship between these transformed versions and the TCP? If corrections are made to the text along the way, how can they be captured, put back into the data, and distributed to those accessing the texts in other ways?

The JISC post points specifically to crowdsourcing as an example of how data can be improved by users. But crowdsourcing is a method, not an end in and of itself. In order to consider how such a project might affect the TCP texts, we must be more specific about what information the crowdsourcing is meant to produce.

For instance, projects like UCL’s Transcribe Bentham or the New York Public Library’s What’s on the Menu, while excellent examples of successful crowdsourcing efforts, don’t quite match up with the aims of the TCP, because the TCP texts have already been transcribed. We’re very interested in the possibilities of leveraging the knowledge of our users to improve the TCP corpus, but our starting point differs from that of the projects above. In those cases, the platform comes first, and the data is both created and published in that same environment. As Ben Brumfield explains,

When we’re talking about crowdsourced editions, we’re usually talking about user-generated content that is produced in collaboration with an editor or community manager. Without exception, this requires some significant technical infrastructure — a wiki platform for transcribing free-form text or an even more specialized tool for transcribing structured data like census records or menus. For most projects, the resulting edition is hosted on that same platform — the Bentham wiki which displays the transcriptions for scholars to read and analyze is the same tool that volunteers use to create the transcriptions.

In the case of the TCP, the bulk of the data already exists, and the technique of crowdsourcing could be applied to carry out many kinds of work.

We can imagine a project whose aim is to improve the quality of this data, for example, by inviting users to fill in “gaps” where our keyers were unable to confidently identify a letter or word. The TCP would certainly want to be involved in reviewing this kind of work, and incorporating it back into the archive.

On the other hand, a project to identify literary themes in a certain subset of texts could also be crowdsourced, but we would probably consider this a distinct scholarly project that would stand on its own—an enhancement, rather than a correction.

As the creators of this massive archive, we are committed to curating it once it is complete—that is, to maintaining, correcting, and improving the data, and making sure those improvements are consistently available to all. Formalizing a process by which this kind of work can be done iteratively, driven by users working directly with the data, and captured and incorporated into the archive in a consistent way, will be a new challenge for us. Our hope is that by starting with the manageable ECCO-TCP corpus, we’ll be able to develop a strategy and put it into place by the time access restrictions are lifted on the first 25,000 EEBO-TCP texts in 2015.

~Rebecca Welzenbach
