Although humans can read scanned pages (just as we do paper pages), a computer understands these pages as no more than image files with black and white areas. In order to search, index, analyze, or rearrange these books, we need to provide the computer with electronic text as well as page images.
Like most things, this text can be generated either by hand or by machine. The automated creation of electronic text is generally known as optical character recognition (OCR). In this process, software attempts to “read” the page images and map each chunk of black pixels it sees to the correct character or symbol. OCR technology is used in large-scale digitization projects like Google Books and HathiTrust, as well as in common software such as Adobe Acrobat. The full-text search functions offered in the ECCO and Evans databases rely on OCR technology.
OCR software is improving all the time. It works very well on modern books, but the older the book, the more the software struggles. For books printed before 1700, and for images that are blurry, spotty, or have other quality issues, it fails almost entirely, which is why the stand-alone EEBO database doesn’t include any searchable electronic text at all.
That’s where the TCP comes in: we work with vendors who manually key in the letters they see on the page. Each page is processed by more than one person, and the results are compared against one another, to generate electronic text that is 99.995% accurate. Keying is the greatest expense in the TCP’s budget. However, when it is done by a person trained to identify the features of early modern texts, it is actually more cost-effective than sorting through and correcting poor-quality OCR.
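To illustrate the idea of comparing independent keyings against one another, here is a minimal sketch in Python. It is not the vendors' actual procedure, which is not described here; it simply assumes that two keyers have each transcribed the same line and uses the standard library's difflib to list the places where they disagree. The function name find_disagreements and the sample line are hypothetical.

```python
from difflib import SequenceMatcher


def find_disagreements(keying_a, keying_b):
    """List the spans where two independent keyings of the same text differ.

    Each result is (position in keying_a, text in keying_a, text in keying_b).
    This is an illustrative sketch, not the TCP vendors' actual workflow.
    """
    matcher = SequenceMatcher(None, keying_a, keying_b, autojunk=False)
    disagreements = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (a_start, keying_a[a_start:a_end], keying_b[b_start:b_end])
            )
    return disagreements


# Hypothetical sample: the second keyer misreads a long s as "f".
first_keying = "the wages of sinne"
second_keying = "the wages of finne"

for position, a_text, b_text in find_disagreements(first_keying, second_keying):
    print(f"position {position}: {a_text!r} vs {b_text!r}")
```

In a workflow like this, only the flagged spans would need a second look by a reviewer; wherever the two keyings agree, the text is accepted as is.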
Why OCR Won’t Work:
Here is an example of a scanned page from EEBO, followed by the same page as read by OCR software:
Even when OCR works well, its output is usually hidden from the user and used only for indexing and searching, because it contains enough errors to distract the reader. An added benefit of the TCP’s manually keyed text is that it is clean enough to be displayed. Therefore, in addition to supporting full-text searching, the TCP offers modern-type transcriptions of these texts.