Using EEBO/TCP Texts for Lexicons of Early Modern English

This guest post was written by Ian Lancashire and Ruth Peidi Zhao, of the University of Toronto. We’re delighted that a number of TCP texts have been included in the LEME project, and welcome feedback and corrections from the editors, as well as from anyone working with our text files. If you would like to contribute a post describing how you use the EEBO-TCP texts in your research, please contact us at tcp-info[AT]umich.edu. 

Lexicons of Early Modern English (http://leme.library.utoronto.ca ; LEME, pronounced like “lemma”, that is, rhyming with “hem” and bearing a final unstressed e) currently offers tools to search, display, and supply bibliographical information for 617,000 word-entries in 181 lexical works dating from about 1475 to 1702. The University of Toronto Press publishes LEME, and the University of Toronto Libraries – Sian Meikle, LEME’s designer (currently Interim Director of Information Technology Services, Digital Library and Web Services) – hosts it (2006-). Serious researchers license the database for searching. The bibliography and one-off searches are free.

LEME transcriptions started in the late 1980s with John Palsgrave’s Lesclarcissement (1530) and Thomas Thomas’s Latin-English lexicon (1587). In 1996, 16 lexical texts were released freely online with a student-written search engine. A generous grant from the Canada Foundation for Innovation via Geoffrey Rockwell’s TAPoR turned LEME into an SQL database and expanded it to 150 texts. Dr. Marc Plamondon (Nipissing University) was the programmer.

The unit of the EEBO-TCP collection is a book, but that of LEME is a single word-entry. In displaying a dictionary page, LEME shows only the headwords of the word-entries on that page. When a headword is clicked, the entire encoded entry opens. Normally, researchers run searches on the entire database. They enter words, phrases, or collocations for searching, and LEME delivers a chronological list of matching word-entries, each abbreviated but expandable. We do not publish digitized books and so do not compete with scholarly editions or even EEBO-TCP itself.
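The search behaviour described above can be sketched in a few lines of Python. This is only an illustration, not LEME's actual implementation: the table name `word_entry`, its columns, and the functions `build_demo_db` and `search` are all assumptions, and the two sample rows are taken from the entries quoted later in this post.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with one row per word-entry (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_entry ("
        "id INTEGER PRIMARY KEY, year INTEGER, author TEXT, "
        "headword TEXT, explanation TEXT)"
    )
    rows = [
        (1677, "Miège", "OROBE, a kind of pulse,", "orobe, ers, sorte de legume."),
        (1593, "Hollyband", "Bougre,",
         "he that committed such a fact and sodomite villanie: "
         "a buggerer: burne them all."),
    ]
    conn.executemany(
        "INSERT INTO word_entry (year, author, headword, explanation) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn

def search(conn, term, width=30):
    """Return matching word-entries in chronological order, explanations abbreviated."""
    pattern = f"%{term}%"
    cursor = conn.execute(
        "SELECT year, author, headword, explanation FROM word_entry "
        "WHERE headword LIKE ? OR explanation LIKE ? "
        "ORDER BY year, id",
        (pattern, pattern),
    )
    results = []
    for year, author, headword, explanation in cursor:
        # Abbreviate long explanations; a real interface would let the user expand them.
        if len(explanation) > width:
            explanation = explanation[:width].rstrip() + "..."
        results.append((year, author, headword, explanation))
    return results
```

Calling `search(build_demo_db(), "er")` lists the Hollyband entry (1593) before the Miège entry (1677), because results are ordered by year rather than by the order in which the texts were added.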

We enter, proofread, and encode lexical texts to be added to the database each year. In this respect, LEME resembles a journal publication. Since 2006, LEME has increased in size by about 20 percent. Lancashire is editor, assisted by dedicated students. Recently, Zhao transcribed Claudius Hollyband’s French-English dictionary (1593) and Lancashire is now editing John Thorie’s Theatre of the Earth (place-names; 1601, transcribed by Janet Damianopoulos), John Rider’s Bibliotheca Scholastica (1589), and Ortus Vocabulorum (Latin-English; 1500, now being transcribed by Zhao), as well as Guy Miège’s French-English and English-French dictionary (1677) and Thomas Blount’s Nomo-Lexikon (law; 1670). The first four are newly transcribed, and the last two are adapted from EEBO-TCP transcriptions.

LEME delivers explanations and translations of English words written by men alive in the Early Modern English period. LEME is different from the OED: the two overlap by less than five percent, and LEME does not devise its own word-entries. The database shows the size of English vocabulary, information on when words entered the language, typical transcriptions of Early Modern English words, word-senses that were dominant at the time, hard and easy words, synonyms or translations in non-English languages, and evidence of which terms then in foreign languages had no English equivalent (LEME records words in 37 different languages and serves researchers in Renaissance Latin, Italian, French, and Spanish). LEME has an interest in any historical dictionary that includes Early Modern English.

When we adapt an EEBO-TCP transcription like Miège’s (using its source text), we replace its tags with our own private tag-set. We encode word-entries rather than display features. The LEME tag-set identifies the word-entry and, within that, its form (or headword), its explanation, their subforms and sub-explanations, and the language of all strings. Approximately eighty percent of the headwords in Miège’s dictionary contain subentries, and commonly there are 5-10 subentries under each headword. These illustrate idioms, common expressions, and related word-forms. For example, the French headword “Age” has close to thirty subforms, including

<subform lang="fr">Bas&acirc;ge,</subform> <subxpln lang="en">infancy, youth, tender years.</subxpln>

<subform lang="fr">Des mon bas &acirc;ge,</subform> <subxpln lang="en">from my In&shy;fancy.</subxpln>

Because all word-forms in Miège tend to be French, and all words in explanations tend to be English, we can use attributes within the form and explanation tags to label the language of those strings. Often, however, forms and explanations include strings in unexpected languages. Tagging each word by its language, and by its role within a word-entry (form or explanation), is laborious and exacting.
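To make the role of the language attribute concrete, here is a minimal Python sketch that pulls tagged strings out of an encoded fragment and groups them by their `lang` value. It assumes only the tag names and the `lang` attribute shown in the samples in this post; the function name `strings_by_language` is hypothetical, and the sketch does not handle strings in an unexpected language nested inside a form or explanation.

```python
import html
import re
from collections import defaultdict

# Matches one tagged string, e.g. <subform lang="fr">...</subform>.
TAGGED = re.compile(
    r'<(form|xpln|subform|subxpln)\s+lang="([a-z]+)">(.*?)</\1>', re.S
)

def strings_by_language(encoded):
    """Group (tag, text) pairs by the value of the lang attribute."""
    by_lang = defaultdict(list)
    for tag, lang, raw in TAGGED.findall(encoded):
        # Decode entity references, then drop soft hyphens (&shy; is U+00AD).
        text = html.unescape(raw).replace("\u00ad", "").strip()
        by_lang[lang].append((tag, text))
    return dict(by_lang)

sample = (
    '<subform lang="fr">Bas&acirc;ge,</subform> '
    '<subxpln lang="en">infancy, youth, tender years.</subxpln>\n'
    '<subform lang="fr">Des mon bas &acirc;ge,</subform> '
    '<subxpln lang="en">from my In&shy;fancy.</subxpln>'
)

for lang, items in strings_by_language(sample).items():
    print(lang, items)
```

A regular expression is enough for a flat fragment like this one; a full text with nested or overlapping tags would need a proper parser.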

Proofreading and correcting an EEBO-TCP transcription, and supplying its illegible characters (our first tasks), are much easier. EEBO-TCP transcriptions generally are well done. There are typos in its texts, but these occur in LEME too; and the conscientiousness of EEBO-TCP transcribers can be seen in how they signal illegible strings. In proofreading recent EEBO-TCP transcriptions, we have noted rare, unidentified astrological characters in James Moxon’s mathematical dictionary (1679); and, in Miège’s text, miskeying of circumflexes as grave accents, owing to worn or damaged typeface, as well as representation of digraphs ae and oe as two letters. The intentional non-transcription of Greek words might also be mentioned here. With Unicode, identifying entity references for Greek letter-accent combinations is straightforward enough.
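As an illustration of the last point, the following Python sketch uses the standard `unicodedata` module to find the precomposed Unicode character for a Greek letter plus one or more combining accents, and prints a numeric character reference for it. The function name `char_reference` is hypothetical, and the use of numeric references (rather than named entities) is an assumption made for the example, not a statement about LEME's own conventions.

```python
import unicodedata

def char_reference(base, *combining):
    """Compose a base letter with combining marks and describe the result.

    Returns (character, numeric character reference, Unicode name).
    Raises ValueError if Unicode has no single precomposed character.
    """
    composed = unicodedata.normalize("NFC", base + "".join(combining))
    if len(composed) != 1:
        raise ValueError("no single precomposed character for this combination")
    return composed, f"&#x{ord(composed):04X};", unicodedata.name(composed)

# Greek small alpha (U+03B1) followed by a combining grave accent (U+0300).
print(char_reference("\u03b1", "\u0300"))
```

NFC normalization does the lookup: it combines the base letter and its accents into one code point wherever the Unicode tables define such a character.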

After proofreading and encoding, we process each text repeatedly to find tagging errors or inconsistencies. We use a programming text editor such as UltraEdit or Notepad++ to enter and edit LEME texts; these editors offer search-and-replace with regular expressions and macros. We then run each encoded text through specially written LEME software. This processing program, written in Perl, automatically flags bad characters, tagging mistakes, and inconsistencies. It also lists all words by their language and offers a mock-up of the page-displays LEME gives.
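The kind of checking described here can be sketched in Python. This is not the LEME processing program itself, only a rough illustration under two stated assumptions: that every opening tag is expected to have a matching closing tag, and that non-ASCII characters are expected to appear as entity references (as in the samples above). The function names `check_tags` and `check_characters` are hypothetical.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z]+)[^<>]*>")
ENTITY = re.compile(r"&[A-Za-z]+;|&#[0-9]+;|&#x[0-9A-Fa-f]+;")

def check_tags(text):
    """Return (line number, message) pairs for tags that do not balance."""
    problems = []
    stack = []  # open tags seen so far, as (name, line number)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for closing, name in TAG.findall(line):
            if not closing:
                stack.append((name, lineno))
            elif stack and stack[-1][0] == name:
                stack.pop()
            else:
                problems.append((lineno, f"unexpected </{name}>"))
    # Anything still on the stack was opened but never closed.
    problems.extend((lineno, f"<{name}> never closed") for name, lineno in stack)
    return problems

def check_characters(text, allowed_extra=""):
    """Flag characters outside printable ASCII, ignoring tags and entity references."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = ENTITY.sub("", TAG.sub("", line))
        for ch in stripped:
            if not (" " <= ch <= "~") and ch not in allowed_extra:
                problems.append((lineno, f"unexpected character U+{ord(ch):04X}"))
    return problems
```

Reporting line numbers lets the editor jump straight to each problem in a text editor, which suits the repeated process-and-correct cycle described above.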

In adapting EEBO-TCP texts we also place, in the lexeme attribute of our form or explanation tags, a modern-spelling, standardized form of a word-entry’s English headwords and translations (the infinitive for verbs, the nominative singular for nouns). Our lexemes normally follow the OED headword. LEME also supplies information that can supplement the OED, such as unrecorded word-forms and antedatings. Here are examples from two recent French-English dictionaries, the first by Claudius Hollyband in 1593 (not in EEBO-TCP), and the second by Guy Miège in 1677 (from the invaluable EEBO-TCP transcription).

<wordentry type="h">
<form lang="fr">Bougre,</form>
<xpln lang="en">he that committed such a fact and sodomite villanie: a bug&shy;gerer: burne them all.</xpln>

<wordentry type="h">
<form lang="en">OROBE, a kind of pulse,</form>
<xpln lang="fr">oro&shy; be, ers, sorte de legume.</xpln>
<lemenote>Antedates earliest citation in the OED (1714).</lemenote>

Together, these two texts have over 110,000 word-entries.
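One use of the lexeme attribute mentioned above is to gather variant spellings under a single modern form. The Python sketch below shows the idea. The attribute syntax `lexeme="..."` and the lexeme value `infancy` attached to the sample strings are assumptions made purely for illustration, since the entries quoted in this post do not show the attribute; the function name `index_by_lexeme` is likewise hypothetical.

```python
import html
import re
from collections import defaultdict

# Assumed syntax: lexeme="..." appears among the attributes of a form or explanation tag.
LEXEME_TAG = re.compile(
    r'<(\w+)\b[^<>]*\blexeme="([^"]*)"[^<>]*>(.*?)</\1>', re.S
)

def index_by_lexeme(encoded):
    """Map each lexeme value to the original-spelling strings tagged with it."""
    index = defaultdict(list)
    for _tag, lexeme, raw in LEXEME_TAG.findall(encoded):
        # Decode entity references and drop soft hyphens (U+00AD).
        text = html.unescape(raw).replace("\u00ad", "").strip()
        index[lexeme].append(text)
    return dict(index)

# Illustrative fragment: the strings come from Miège's entry for "Age",
# but the lexeme values are invented for this example.
sample = (
    '<subxpln lang="en" lexeme="infancy">from my In&shy;fancy.</subxpln>\n'
    '<subxpln lang="en" lexeme="infancy">infancy, youth, tender years.</subxpln>'
)

print(index_by_lexeme(sample))
```

Indexing on the standardized form rather than the printed spelling is what allows a search for a modern word to reach entries whose original spelling, capitalization, or hyphenation differs.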

Early dictionaries have a well-earned place among our primary sources of information about Early Modern languages in Europe. They make for oddly delightful reading. We expect that researchers will want to edit and analyse dictionaries separately as literary influences. Ben Jonson regarded John Florio, the Italian lexicographer, as a “loving father and worthy friend,” and would have been influenced by the two editions of his World of Words (1598, 1611). Edward Phillips, Milton’s nephew, devised a hard-word dictionary (1658 and after) that, for example, might illuminate his uncle’s Paradise Lost if the two were concorded together. For that reason we try to send EEBO-TCP a list of corrected typos and of identifications of illegible words in those source transcriptions we adopt from it.

May 29, 2013

