Why SGML encoding?

The TCP texts are encoded in Standard Generalized Markup Language, or SGML (the precursor to XML), using a schema based on the P3 version of the Text Encoding Initiative (TEI) Guidelines. TEI P3 was the version of the guidelines in wide use when the TCP was getting started.

The purpose of this encoding is to explicitly mark the parts and structure of the text, and the relationships between these parts, so that the document’s structure can be understood by a computer just as human readers make sense of the layout of a typeset page. This encoding makes many uses possible, such as targeted searching (for example, searching for a term only when it appears in stage directions), flexible rendering of the texts (for example, displaying all notes in a margin, or at the end of the document), and more.
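As a rough illustration of targeted searching, the sketch below looks for a term only inside stage directions. It assumes a text that has already been converted to well-formed XML (Python's standard library does not parse SGML directly), and the sample fragment and the function name `search_in_element` are hypothetical, made up for this example rather than taken from a TCP file. The element names `stage`, `sp`, `speaker`, and `p` follow TEI usage.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration; not an actual TCP text.
SAMPLE = """<text>
<body>
<div1 type="scene">
<stage>Enter the King, with a letter.</stage>
<sp><speaker>King</speaker><p>This letter comes from France.</p></sp>
<stage>Exit.</stage>
</div1>
</body>
</text>"""


def search_in_element(xml_text, tag, term):
    """Return the text of every <tag> element whose content contains term."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    hits = []
    for el in root.iter(tag):
        # itertext() gathers the element's text, including nested elements.
        content = "".join(el.itertext())
        if term in content.lower():
            hits.append(content.strip())
    return hits


# Search for "letter" in stage directions only, then in speech paragraphs only.
stage_hits = search_in_element(SAMPLE, "stage", "letter")
speech_hits = search_in_element(SAMPLE, "p", "letter")
print(stage_hits)
print(speech_hits)
```

The same word appears in both a stage direction and a speech, but because each is marked with its own element, the two searches return different passages.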

The current de facto standard for text encoding is TEI P5. Although the TCP’s schema does not conform to this version of the TEI guidelines, there is a great deal of interest in developing an automated way to convert the TCP corpora to TEI P5. At Oxford, this conversion has already been done for the ECCO-TCP texts. At the University of Nebraska-Lincoln, research is ongoing to develop and fine-tune a tool called Abbot that will make it possible to do this work on a very large scale.


In 2000, a working group of experts came together to develop a schema for the TCP’s encoding practice. The Document Type Definition (DTD) they developed specifies which tags may be used in a document, and in which contexts. A summary of the initial encoding decisions is available to read here.

Because EEBO contains so many different types of text, and because the corpus contains so many page images, the group determined that the DTD would reflect a low and fairly generic level of tagging. Although there are some exceptions, as a general rule the TCP emphasizes capturing block-level elements (such as paragraphs, chapters, tables, and figures) but not phrase-level elements (such as individual names, places, and dates mentioned in the text). The goal is that the TCP’s text will be a useful starting point for scholars who wish to enhance a text (or many texts) with additional, more granular markup.
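To give a sense of what adding more granular markup might look like, here is a minimal sketch that wraps a place name inside an existing paragraph in a phrase-level element. It assumes an XML version of the text; the sample sentence, the helper `tag_first_occurrence`, and the choice of a `name` element with `type="place"` are illustrative assumptions, not TCP practice.

```python
import xml.etree.ElementTree as ET


def tag_first_occurrence(elem, phrase, tag, attrib=None):
    """Wrap the first occurrence of phrase in elem's leading text in a new child.

    Only the text before elem's first child is examined, which keeps the
    sketch short; real texts with nested markup would need more care.
    Returns True if the phrase was found and tagged.
    """
    text = elem.text or ""
    start = text.find(phrase)
    if start == -1:
        return False
    child = ET.Element(tag, attrib or {})
    child.text = phrase
    # Text after the phrase becomes the new child's tail.
    child.tail = text[start + len(phrase):]
    elem.text = text[:start]
    elem.insert(0, child)
    return True


# Hypothetical block-level paragraph with no phrase-level tagging.
p = ET.fromstring("<p>The fleet sailed from London in the spring.</p>")
tag_first_occurrence(p, "London", "name", {"type": "place"})
result = ET.tostring(p, encoding="unicode")
print(result)
```

The paragraph boundary supplied by the block-level encoding stays intact; the new element is simply layered inside it.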