Encoding Guide for Early Printed Books

What is Text Encoding?

There are any number of places where you can read about text encoding: the TEI Guidelines, the documentation of good text encoding projects, or articles describing specific practices and approaches. However, most of these sources address the subject at a more practical level, and aim their discussion at people who are already involved in text encoding of some sort. This Guide attempts to take up the subject at a more fundamental level and starts by considering what text encoding is and how it functions as an intellectual activity, from the standpoint of humanities scholars. We identify here several premises that inform current text encoding practices, and consider the critical freight they carry when applied to the kinds of materials scholars work with.

What is data? What is markup?

At its heart, text encoding is a process of transforming a source into data, using markup. This process is analytical, strategic, and interpretive: the resulting data strongly reflects the disciplinary assumptions and motivating goals of the encoder (or of the encoding system being used), and may encapsulate an entire theory of the text. The textual representation captured in the markup may be very complex and full of nuance, or it may be very simple and lacking in descriptive power, but even the dullest and most routine encoding nonetheless represents a perspective on the text: this may be a structural perspective that does not seem very interesting, or it may simply say “I am not very interested in texts at all.”

For purposes of humanistic research, it is useful to think of data as content in which the information needed to accomplish some systematic process or analysis has been represented explicitly by some means, and thereby rendered accessible to the computer. Clearly the domain of data thus defined represents a broad continuum from the barely processable on one end to the highly processable on the other, depending on the extent and success of the analysis that has gone into the representation. The process could be something as simple and basic as transcription: making explicit the identification of letters and spaces (and in some texts, this might constitute a very significant intervention). It could be something very analytical: e.g. identifying the part of speech of every word (like the encoded materials in the WordHoard project), or creating a database which constitutes a complete lexicon and index for the document (like Roberto Busa’s Index Thomisticus). It is often something in between: identifying the major chunks of text (paragraphs, subdivisions, headings, etc.) and the content features (names, dates, bibliographic references, etc.) according to some scheme such as TEI.

Markup is the means by which you make these identifications explicit. In XML (and in other markup systems as well), this is done by means of tags embedded in the text, which are kept distinct from the content by special characters (called delimiters) such as the < and > signs.
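For example, a short passage might be encoded as follows. The element and attribute names follow the TEI scheme; the passage itself is only an illustration:

    <div type="section">
      <head>An example section</head>
      <p>In <date when="1688">1688</date>, <persName>Aphra Behn</persName>
        published <title>Oroonoko</title> in <placeName>London</placeName>.</p>
    </div>

Here the delimiters < and > set the tags apart from the transcribed content, and the tags make explicit both the structural chunks (the section, its heading, its paragraph) and the content features (the name, the date, the place, the title) described above.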

Content is a deceptively simple term. By content we often mean the information from the source that we care about: the parts of the source that we are going to capture in a transcription. In the encoded document itself, we may consider content, colloquially, as being the part of the document that is not markup. But when we think about markup and what it represents, it is worth probing more deeply. In any textual artifact, whether physical or digital, there exist signifying systems that mark the text’s structure and offer it for our comprehension. These include word spacing, paragraphing, indentation, margins, font differences, capitalization, and all of the other conventions that have evolved in print and manuscript culture. All of these are markup, in an important sense: just like XML tags, they define the boundaries of textual features and make it possible to use and understand them. When we create an XML representation of a physical document, we are potentially doing two things. We may be attempting to create a new document that represents the same textual structures as we identify in the original, and we may also be attempting to replicate the original evidence through which those textual structures were brought to our attention. In both cases, our preservation of information will necessarily be selective, and will be based on our own understanding of what is significant for us about the source document. For a project concerned with linguistic information, presentational features like list numbering and the square brackets around stage directions may be inessential, serving only to tell us of the presence of lists and stage directions. In such a case, identifying the stage direction through markup (e.g. as <stage>) preserves the essential fact, signalling it in a way that is part of the native representational system of the digital text, just as it was signalled previously with square brackets using the native presentational system of the printed text. Conversely, for a project concerned with printing practices, or with the varying rhetoric of the visual document, such presentational details might be essential, and the encoding system used would need to take account of their details.
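To make the stage direction example concrete, the contrast might look like this. The <stage> element and its type attribute are TEI; the rend value is a project convention invented here for illustration:

    [Exit Ghost.]
        (as printed, signalled by square brackets)

    <stage type="exit">Exit Ghost.</stage>
        (as encoded for a project that needs only the fact of the stage direction)

    <stage type="exit" rend="square-brackets">Exit Ghost.</stage>
        (as encoded for a project that also wants to record how the page presented it)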

XML and databases

It’s worth asking what the difference is between markup of this kind and something like a database. A database, after all, stores records in which the individual fields contain separate pieces of data, just as the elements in markup contain separate pieces of content. In fact, these two ways of representing information have converged considerably, and the differences that remain are in many cases differences of degree rather than of kind. Still, those differences are worth noting. First, a database emphasizes the repeatability and predictability of structural patterns, the structural identity between records. It is good for situations in which you want to ensure that all records contain the same kinds of information. However, for cases where the information you’re representing is variable (and in particular where the variability is meaningful) a database is less well adapted. In addition, within the fields of the database there is not much opportunity to do further analysis. You could in theory capture a set of poems in a database: for instance, you could have one field for the title and another for the poem, or a repeatable field to capture all of the individual stanzas. You could also have fields representing the metrical or rhyme structure of the poems. However, if you wanted to represent things like caesuras, quoted material, allusions to other texts, names of people, and the like, the database format would start to feel the strain.
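A single verse line illustrates the problem: the nested, mixed detail below sits naturally in markup but has no comfortable home in a flat database field. The element names are TEI-style; the pointer value and the wording are invented for the example:

    <lg type="stanza">
      <l>One verse line may hold a pause,<caesura/> a name
        (<persName>Stella</persName>), and a phrase
        <quote source="#marlowe">borrowed from elsewhere</quote>.</l>
    </lg>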

XML encoding can also be used to represent very regular structures like the records and fields of a database, and it is often used in this way (for instance, in business or industrial applications). But its real strength, and what distinguishes it from a classic database model, is in representing information which is predictable only in complex ways, as is typical in a written document. These are cases where you know some things about a document’s structure (headings always come first, if they appear at all; paragraphs usually repeat; lists contain two or more items; poems may contain quotations and allusions) but you don’t know the exact structure of the document (the number of paragraphs, whether this document contains any lists) in advance. In these cases, you need to be able to express a set of rules that describe the patterns that are common to all of the documents you are interested in, without knowing exactly how the rules are enacted in any given document instance. The very strict, database-like use of XML is simply a very highly constrained case of this kind of rule-building, in which all rules are expressed as requirements rather than options: for instance, every essay must have a heading, a single introductory paragraph, three body paragraphs, and a conclusion. For scholarly research purposes, the most useful rule sets are those which balance nuance and descriptive power against constraint, so that they accurately describe a wide range of documents in a great deal of detail.
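Such rules are what a schema expresses. Sketched here as a DTD with invented element names, the first group states flexible patterns of the kind just described, and the last line states the rigid, database-like case:

    <!-- flexible: an optional heading first, then any mixture of paragraphs,
         lists, and poems; a list must contain at least two items -->
    <!ELEMENT division  (heading?, (paragraph | list | poem)+)>
    <!ELEMENT list      (item, item+)>
    <!ELEMENT poem      (#PCDATA | quote | allusion)*>

    <!-- rigid: every essay must have a heading, one introductory paragraph,
         three body paragraphs, and a conclusion -->
    <!ELEMENT essay     (heading, introduction, bodyPara, bodyPara, bodyPara, conclusion)>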

In the past, one of the chief differences that might determine whether to use a database or XML encoding (in cases where the two are fairly close) was the question of what you wanted to do with the data. Database software has traditionally been much faster and more capable of handling very large quantities of data, so for searching very large collections a database would provide much better functionality. XML software has tended to be much better at handling functions that involve being aware of the document structure: for instance, queries that involve complex containment (such as “find me all of the cases where a quotation from Shakespeare is found in either the epigraph or the title of a poem” or “list all of the stage directions involving a non-speaking character in the last act of any of this set of plays”). However, XML databases are now emerging that combine these two capabilities, allowing for fast searching even on fairly nuanced data.
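The first of those queries might be expressed against a TEI-style encoding as an XPath expression along the following lines. The element names are TEI, but the way a quotation is attributed to Shakespeare (here, a source pointer) will vary from project to project:

    //quote[@source = '#shakespeare']
           [ancestor::epigraph or ancestor::head[parent::div[@type = 'poem']]]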

Descriptive markup

The kind of markup we have been describing here is often referred to as descriptive markup, and it is worth examining what this term denotes and what assumptions underlie its usage. The term descriptive is used to distinguish this kind of markup from procedural markup. Where procedural markup prescribes procedures and behavior (for instance, giving formatting commands to a text processor), descriptive markup simply describes a state of being: for instance, the existence of a heading or a quotation in the text. The encoded file, in other words, makes no assumptions about how this information will be used, whether for layout (enclosing the quotation within quotation marks or guillemets) or for searching (find the word “serene” only within quoted material) or some other purpose (extract all of the quotations and list them by length).
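Schematically, the contrast looks like this. The procedural commands below are invented for illustration (real examples would be word-processor or typesetting instructions); the descriptive element is TEI-style:

    Procedural markup (instructions about what to do):
        .italics-on
        The quality of mercy is not strain'd
        .italics-off

    Descriptive markup (a statement about what the text is):
        <quote>The quality of mercy is not strain'd</quote>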

Most of the text markup systems now in use for scholarly work (for instance, the Text Encoding Initiative Guidelines, the Encoded Archival Description, the Multi-Element Code System (MECS)) use descriptive markup, and it has been shown to be a powerful and very useful tool for creating high-quality digital representations of text documents for scholarly analysis. However, it rests on a number of beliefs and assumptions which are both interesting in themselves and also consequential in determining how this kind of markup can be used, and the kinds of purposes to which it lends itself. In recent years there has been significant debate about the role of descriptive markup in digital scholarly work, which has yielded important methodological insights into text markup, textual scholarship, and how digital texts function in the academic landscape.

One of the most important tenets of descriptive markup is the separation of structure and content from presentation. This separation began as one of the primary tenets of text encoding when the Standard Generalized Markup Language (SGML) was first emerging in the late 1980s, and is an important motive in the historical development of text encoding, as described in the work of Renear et al. and others. By identifying the structure of a document apart from its specific presentational detail, descriptive markup systems are able to represent the document at a level of abstraction that—for many purposes—is more powerful and flexible than a procedural approach. For the kinds of systems being created at the time, this power and flexibility were extremely important and the benefits of a descriptive approach (about which more below) were particularly vivid. Above all, the ability to describe structure and control the presentation of a document through a stylesheet, rather than knowing nothing about a document apart from its appearance, constituted a significant benefit for anyone needing to create different forms of output from the same data source. The advantages of descriptive markup, however, turned out not to be limited to the industrial communities from which it originally emerged. From an academic standpoint as well, descriptive markup offers the opportunity to identify information about a textual source that is not linked to presentation at all. It can be used to differentiate textual features that are presentationally indistinguishable from one another (such as direct speech, material quoted from a source outside the text, and ironized usages enclosed in scare quotes), or to identify textual features that are not presentationally distinct at all, such as names, dates, and bibliographic citations, to say nothing of more abstruse information such as prosody, rhyme, or literary allusions.
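The TEI, for instance, provides distinct elements for each of these cases: features that the page may print identically between quotation marks, and features with no distinctive appearance on the page at all (the who value below is illustrative):

    <!-- may all be printed inside quotation marks, yet are distinct features -->
    <said who="#speaker">direct speech by a person or character</said>
    <quote>material quoted from a source outside the text</quote>
    <soCalled>an ironized usage in scare quotes</soCalled>

    <!-- not presentationally distinct at all -->
    <persName>a personal name</persName>, <date>a date</date>,
    <bibl>a bibliographic citation</bibl>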

There are several underlying assumptions here which deserve closer scrutiny. Descriptive markup is founded on the idea that presentation derives from function, and furthermore that the three following statements are true:

  • that the relationship between structure and presentation is consistent (even if perhaps complex)
  • that presentation is not decorative but functional: that is, that it exists to express function, not for any other significant purpose
  • that presentation is variable while structure is constant: that is, that structure expresses something fundamental about the document while presentation expresses something secondary or derivative.

These assumptions are largely true for the kinds of information which were first motivating the development of SGML: for instance, technical documentation, legal forms, documents generated and used by the military and the Internal Revenue Service, all of which needed to be encoded for long-term storage, maintenance, and output in multiple formats (including formats that couldn’t be foreseen). If these were the only kinds of documents one had to deal with, the idea that presentation might be important in itself—apart from whatever structural information it might convey—would likely never arise. Even for a significant range of literary and historical documents, these assumptions can be taken as practically true much of the time without significant intellectual discomfort. Furthermore, in cases where these assumptions operate, there are obvious practical benefits to separating presentation and structure, which are familiar to anyone who has worked with large quantities of documents. Being able to adjust the output of a document to suit a particular need (on the web, in print, in Braille, as an audio book) is profoundly useful. Even in HTML, an encoding system not known for its intellectual ambitions, we have seen over the past decade a steady shift away from presentation-oriented encoding (such as the <font> or <blink> tags) towards elements that focus more on structure, with presentation handled increasingly by stylesheets.
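For instance (the class name and the stylesheet rule are illustrative):

    <!-- presentation-oriented HTML, now obsolete -->
    <font size="5" color="maroon"><b>Chapter the First</b></font>

    <!-- structure-oriented HTML, with the appearance supplied by a stylesheet -->
    <h2 class="chapter">Chapter the First</h2>

    /* in an accompanying CSS stylesheet */
    h2.chapter { font-size: 1.5em; color: maroon; }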

There are also conceptual benefits, once you move beyond these kinds of prosaic organizational information and start to consider humanities texts and the ways they might be studied in a digital environment. Making an explicit representation of significant textual structures and content features enables you to treat the text as an object of analysis: literary analysis, historical analysis, rhetorical analysis, linguistic analysis, etc. Even simple digital activities like searching are more effective when they can be focused through structural markup. In even a moderately sized text collection like the WWP, being able to limit the scope of a search to a single textual feature (find a word only within verse lines; find a word only in the footnotes) or to exclude certain features (find a word anywhere except the footnotes) can be a huge advantage. In a large collection, it may make the difference between useful and useless results.
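In XPath terms, such scoped searches might look like the following, assuming TEI-style <l> elements for verse lines and <note place="foot"> for footnotes; the word searched for is arbitrary:

    (only within verse lines)
    //l[contains(., 'liberty')]

    (only within the footnotes)
    //note[@place = 'foot'][contains(., 'liberty')]

    (within paragraphs, but excluding anything inside a footnote)
    //p[not(ancestor::note)][contains(., 'liberty')]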

The experimentation with scholarly text encoding that has taken place over the past fifteen or more years has demonstrated its value and has also prompted deeper critical reflection on the nature of digital representations. In particular, it has become clear that for many kinds of documents that are central to humanities research, the relationship between the informational structure of the document and its presentational detail is more complex than the assumptions described above would suggest. The assumption that presentation consistently reflects structure is clearly false in many older texts. The most vivid examples are cases where typesetting constraints cause the relineation of verse as prose, and vice versa, in the setting of drama. But one also sees huge numbers of cases where the formatting of headings, quotations, and other features is completely variable, whether through accident, carelessness, or practical constraints such as the need to fit more or less onto a given page. Similarly, the assumption that presentation is functional and is intended to represent the document’s structure bears reexamination. In many types of texts, presentation may well be decorative rather than (or in addition to being) functional: it may exist to comment on, ironize, adorn, or distract from the content. Further, there has been an important line of commentary within both editorial theory and text encoding theory, arguing that the distinction between an essential or fundamental content and a variable or inessential presentation is suspect: that in fact the presentation and the physical substance of the document are constitutive of meaning and inseparable from it.

Perhaps most importantly, in many kinds of documents (older texts, manuscript materials, and books that deliberately foreground and play with their own design), the material and visual presence of the document may itself be among the most significant features from the standpoint of scholarly research. Even if one were to regard presentation as secondary, for humanities scholars it turns out to be a very important secondary. Indeed, what may be most useful is not a representation that foregrounds presentation at the cost of structure, but rather one that can represent both in ways that allow them to be used effectively in research. Finding ways of doing this is now a significant research topic within the scholarly text encoding world.