DTD Working Group Notes

A task force convened in Washington on 2 March 2000 to develop encoding instructions for text conversion agencies working on the EEBO project. Much of the group has worked together in the past to develop the TEI Text Encoding in Libraries: Draft Guidelines for Best Encoding Practices.

Charge

  • Specify the most appropriate levels of encoding for the EEBO text corpus. Identify costs and implications of employing greater or lesser levels of encoding.
  • Make recommendations to the selection task force (meeting later this month) regarding the implications or costs associated with including titles or classes of titles with certain physical attributes (e.g., language, typeface).
  • Create a broad outline of a DTD and element naming conventions.
  • Explore the relationship between recommended EEBO encoding specifications based on TEI academic conventions and current commercial practices for marking up humanities text.
  • Consider the opportunities for integrating the metadata indexes from Proquest Information and Learning that describe features of EEBO works (e.g., illustrations) into the text encoding instructions.

Members

Members of the task force are:

  • Perry Willett (Indiana, chair)
  • Natalia Smith (North Carolina – Chapel Hill)
  • Nancy Kushigian (UC Davis)
  • Chris Powell (Michigan)
  • David Seaman (Virginia)
  • Paul Caton (Brown)
  • Lou Burnard (Oxford)
  • LeeEllen Friedland (LC)
  • Chris Ruotolo (Virginia)
  • David Case (Chadwyck-Healey)

If the number of texts, the length of the project, and the amount of money available are fixed, the level of encoding is constrained. EEBO should be encoded with a TEI-based XML DTD.

Working Group Recommendations (March 2000):

  • Level 3 of The Draft Guidelines for Best Encoding Practices seems to be most appropriate, with some additions.
  • The size and scope of the collection force us to think at Levels 1 and 2; since this is not a strictly machine-created corpus, we’ve added elements of Levels 3 and 4.
  • The number of phrase-level elements is limited, but those included can be identified by the vendor and will facilitate later enhancement.
  • The next level of encoding would require much greater costs, experienced encoders, and time.
  • TEI Lite may be appropriate for the finalized version of an EEBO text, but a DTD that removes the header and enforces encoding practices may be necessary for the vendor.
  • It is better to do less than to do wrong or mislead.
  • All encoding decisions should allow for enhancement and avoid tag abuse.
  • In using EEBO, it should be easy to move from transcription to page images. Elaborate encoding schemes to describe rendition are not necessary.
  • Cleaning up mistakes is very difficult. In general, vendors get better at encoding as projects progress. Some kind of pre-processing at the beginning of the project would save a great deal of post-processing. It may be necessary for the first 1 to 2 years of the project to have a position within the EEBO Text Creation Partnership for reviewing texts before they are sent to the vendor, in order to identify structural divisions and difficulties (epigraphs, arguments, subheads). The person in this position will want to work with the TEI Consortium in identifying areas for altering the DTD (e.g., title pages, see below). The first task of this position would be to create a guidebook to encoding for vendors, with examples.
  • The working group assumes that all types of documents could appear among those that are selected for transcription, which will make specifying exacting encoding standards difficult. However, the committee had some thoughts about selection:
    • We would recommend that texts with significant amounts of non-Western scripts be excluded.
    • We recognize that the legibility of the original and of the microfilm image has implications for the value of inclusion: illegible images make these texts all the more desirable for transcription. A plan should be developed for identifying and locating physical copies of these texts.
    • Selectors should plan to “over-select,” because there will be a percentage of texts that are too illegible, or are written in non-Roman scripts. Pre-production screening will be important in this process.

General Transcription and Encoding Practice

One of the primary goals for the project should be accurate transcriptions. By this we mean

  • Letter-for-letter, not modernized, not expanding abbreviations.
    • Superscripted letters, macrons, accented characters will be encoded as entity references.
    • Double v will be recorded as “vv”; long s as “s”
    • Left and right double quotation marks will be recorded as the straight double quotation mark ("); left and right single quotation marks as the straight single quotation mark (').
  • Non-western scripts. The vendor will transcribe non-Western scripts as much as possible, and those which cannot be transcribed will be encoded as <FOREIGN><GAP/></FOREIGN>.
  • Don’t transcribe: catchwords, signatures, book plates, running heads.
  • Don’t use: entities for ligatures (ct, st, ae, oe), long s.
  • Don’t mark line breaks except in verse and on title pages.
  • End-of-line hyphens will be recorded as an entity reference &eolhy;.
  • End-of-page hyphens will be encoded with the <ORIG> element, using the REG attribute to record the entire word without the hyphen, even if the word normally contains a hyphen. Examples:
    • <ORIG REG="condition">condi-<PB>tion
    • <ORIG REG="footsoldier">foot-<PB>soldier
  • Upside-down letters should be turned right-side up (except an upside-down “n” used as “u”, which is retained as “u”).
  • Blank pages should be marked <PB>, with no further encoding to indicate that they are blank.
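
The character-level rules above can be sketched in a few lines of Python. This is an illustration only, not project or vendor tooling. The function names, the small ligature table, and the decision to join a hyphenated line to the next one without a space are all assumptions made for the example.

```python
# Hypothetical sketch of the transcription rules above; none of these
# names come from the working group or from any vendor tooling.

# Long s and ligatures are recorded as plain letters, not as entities.
PLAIN_LETTERS = {
    "\u017f": "s",    # long s
    "\ufb05": "st",   # long s + t ligature
    "\ufb06": "st",   # st ligature
    "\u00e6": "ae",   # ae ligature
    "\u0153": "oe",   # oe ligature
}


def plain_letters(text):
    """Replace long s and ligatures with their component letters."""
    return "".join(PLAIN_LETTERS.get(ch, ch) for ch in text)


def join_lines(lines):
    """Join prose lines, recording end-of-line hyphens as &eolhy;.

    Assumption: the entity replaces the hyphen and the next line follows
    with no space; unhyphenated lines are joined with a single space.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # line breaks are not marked in prose
        if out and not out.endswith("&eolhy;"):
            out += " "
        if line.endswith("-"):
            out += line[:-1] + "&eolhy;"
        else:
            out += line
    return out


def end_of_page_hyphen(before, after):
    """Encode a word split across a page break with ORIG and REG.

    REG holds the whole word without the hyphen. The output mirrors the
    examples above, which show no end tag for ORIG.
    """
    return '<ORIG REG="%s">%s-<PB>%s' % (before + after, before, after)
```

For instance, `end_of_page_hyphen("condi", "tion")` reproduces the first example in the list above, and a double v passes through `plain_letters` unchanged as "vv".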

Structural Divisions

  • Most works will consist of a single <TEXT>, with multiple numbered <DIV>s for sections. Those works that contain multiple texts will use <GROUP> with multiple <TEXT> and numbered <DIV>. Typically, these will be collected works of an author with an overall introduction, and then reprints (or rebindings) of separate works with their own title pages. These new title pages will be strong evidence for a new <TEXT>.
  • <DIV>: Use numbered DIVs. Structural divisions within a text are difficult to identify, and consistency within a work is essential. It is common for vendors to add an extra layer of DIVs, but they seem able to handle nesting. Fewer <DIV>s are better than more. Use visual cues as evidence of <DIV>. Strong evidence includes:
    • Headings that appear in a table of contents
    • Blank page followed by a new heading
    • Heading followed by drop cap or ornamental letter
    • Numbering scheme associated with headings

    Weaker evidence:

    • Ornamental device
    • Marginal numbered headings
  • Use other types of chunking if the evidence for a new DIV isn’t clear: <P>, <LG>.
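
The nesting described above can be sketched with Python's standard `xml.etree.ElementTree` module. This is a rough illustration, not a vendor tool. The function names and the shape of the `outline` input are assumptions made for the example. The element names DIV1, DIV2, and so on follow the TEI convention for numbered divisions, and the outer TEXT around GROUP follows the TEI content model.

```python
import xml.etree.ElementTree as ET


def add_divs(parent, outline, depth=1):
    """Append numbered DIV elements (DIV1, DIV2, ...) for a nested outline.

    `outline` is a list of (heading, children) pairs; this input shape
    and the function name are assumptions for illustration.
    """
    for heading, children in outline:
        div = ET.SubElement(parent, "DIV%d" % depth)
        ET.SubElement(div, "HEAD").text = heading
        add_divs(div, children, depth + 1)


def build_text(outline):
    """Wrap the divisions of one work in TEXT and BODY."""
    text = ET.Element("TEXT")
    body = ET.SubElement(text, "BODY")
    add_divs(body, outline)
    return text


def build_group(works):
    """Collected works: an outer TEXT holding a GROUP of TEXTs."""
    outer = ET.Element("TEXT")
    group = ET.SubElement(outer, "GROUP")
    for outline in works:
        group.append(build_text(outline))
    return outer


# Made-up headings, used only to show the nesting.
work = [("The First Book", [("Chap. I", []), ("Chap. II", [])])]
print(ET.tostring(build_text(work), encoding="unicode"))
```

Deriving the DIV number from the nesting depth, rather than having it typed by hand, is one way to keep numbering consistent within a work.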

Individual Elements

  • Personal letters: It is common to find personal letters quoted in a text. We would recommend the creation of an element for inserted letters with separate openers and closers, at least within the vendor DTD; it could later be expanded to <Q><TEXT><BODY><DIV>.
  • Figures:
    • Typists should indicate <FIGURE> with the Bell & Howell ID.
    • <P> within <FIGURE> contains any printed text associated with figure. It doesn’t have to make sense, but should include captions, border text.
    • A figure starts with a heading or beginning of a graphic, and ends with a caption or end of graphic. If unclear, it is better to record the caption outside of the <FIGURE>, than to include text not associated with the figure inside of <FIGURE>.
    • In the display of the transcription, we recommend not displaying the text within a figure, but instead pointing to the page image. This text would be displayed in any intermediary search results as keywords-in-context.
  • Font shift: <HI> should be used to record all changes in font. When text is mostly italicized, non-italicized text would be encoded as <HI>. If the text is mostly Roman font, then italicized, gothic, or black-letter text, even if only part of a word, would be encoded as <HI>. No attributes.
  • Poetry:
    • In a conventional book of poetry, each poem is encoded as a numbered <DIV> always using <LG>.
    • In cases where poetry is interspersed within prose paragraphs, first close paragraph, then open an <LG>, and encode verse lines with <L>, even if the actual paragraph or even sentence isn’t finished.
    • Numbered lines will use N attribute.
    • Braces that group verse lines (such as triplets) will be disregarded.
  • Drama:
    • Cast lists are encoded as a <LIST>. For complex cast lists, use nested lists and labels to indicate cast groupings. Character names, actor names, and role descriptions should be included within the same <ITEM>.
    • Use <SP>, <SPEAKER>.
    • Stage directions that occur between speeches or occur within brackets [], and thus easily identifiable, should be encoded as <STAGE>.
  • Title pages: The content model for <TITLEPAGE> is very strict, and doesn’t allow for some of the free-form title pages occurring in EEBO. We recommend that the content model be altered to include an <AB> (anonymous block).
  • Languages: Use LANG attribute only if it applies to an entire DIV or TEXT. For parallel texts, transcribe all text of one language within a DIV, and all of the other within another.
  • <TEIHEADER>: The text will be identified by the STC number. The metadata from the MARC record and STC database should be extracted for <SOURCEDESC>. Vendor-specific information can be added in an automated manner to <FILEDESC> and <EDITORIALDESC>. Use the recommendations by the TEIHEADER group from the TEI and Libraries workshop.
  • Openings and closings. For those instances that are obvious:
    • <TRAILER> for “Finis”, “Amen”, etc.
    • <CLOSER><SIGNED>
    • <CLOSER><DATELINE>
    • <OPENER><SALUTE>
    • <OPENER><DATELINE>
  • <ARGUMENT>: Arguments usually summarize the prose or verse that follows. Use <ARGUMENT> only if “argument” is mentioned within the text block. If in question, use <HEAD>.
  • <LIST>: Lists would be defined as those easily identifiable, with repeated categories of data of regular length. Numbered paragraphs would not normally be considered as lists. Tables of contents are lists.
  • <TABLE>: Those texts with important spatial relationships should be encoded as tables; otherwise, the text should be recorded serially in lists or paragraphs.
  • <EPIGRAPH>: These present many difficulties for vendors. They are routinely encoded as notes, heads, arguments. Those epigraphs that are verse, with or without citations and that can be identified, should be encoded as <EPIGRAPH>. If not sure, they should be encoded as <HEAD TYPE="sub">.
  • Additions, deletions, gaps, etc.
    • <ADD><GAP/></ADD> for handwritten insertions. The vendor should not attempt to transcribe the handwritten text. It will be evaluated later by project staff.
    • <DEL><GAP/></DEL> for deleted text, or text that has been crossed out. Similar to <ADD>: the vendor will not try to transcribe the deleted text, but it will be evaluated later.
    • <UNCLEAR> for illegible printed text.
    • <GAP/> where page is torn or not printed.
  • <NOTE>: Anything printed outside of a text flow. Asterisks, numbers, daggers, double bars, etc. are recorded as N values and are not included in the text.
    • Use visual cues to group notes. If it looks like one note, it’s one note. Notes can appear indented into the text block.
    • Text is transcribed until the beginning of the note, then the entire note is transcribed, and then text is resumed. If a note extends over a page, the entire note should be included as a single note. Notes without obvious references are transcribed where they occur.
    • PLACE attribute required: inline, margin, foot, end
    • <NOTE> will always include <P>
  • <PB>: Page numbers will be recorded as they appear on the page, regardless of where they occur on the page, as N attribute values in the PB element at the top of the page. ID values will be used to link to Bell & Howell page images.
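
The constraints on <NOTE> and <PB> above lend themselves to a small check at the point where the markup is generated. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function names are hypothetical, and the ID value in the usage lines is made up, since these notes do not describe the format of Bell & Howell image IDs.

```python
import xml.etree.ElementTree as ET

# The four PLACE values listed above.
NOTE_PLACES = {"inline", "margin", "foot", "end"}


def make_note(place, paragraphs, n=None):
    """Build a NOTE whose content is always wrapped in P.

    `n` carries the reference mark (asterisk, number, dagger), which is
    recorded as the N value rather than in the running text.
    """
    if place not in NOTE_PLACES:
        raise ValueError("PLACE must be one of %s" % sorted(NOTE_PLACES))
    note = ET.Element("NOTE", {"PLACE": place})
    if n is not None:
        note.set("N", n)
    for para in paragraphs:
        ET.SubElement(note, "P").text = para
    return note


def make_pb(image_id, n=None):
    """Build a PB with the printed page number in N and an image ID.

    N is optional here on the assumption that some pages, such as blank
    ones, carry no printed number.
    """
    pb = ET.Element("PB", {"ID": image_id})
    if n is not None:
        pb.set("N", n)
    return pb


# Made-up values, for illustration only.
note = make_note("margin", ["See the former chapter."], n="*")
pb = make_pb("p0012", n="12")
print(ET.tostring(note, encoding="unicode"))
print(ET.tostring(pb, encoding="unicode"))
```

Rejecting an unknown PLACE value when the note is built is one way to catch this kind of error before post-processing, which the notes above describe as very difficult.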