Glossary of terms

Content model

A content model is a small part of a DTD that defines the constraints for a particular element. The content model for a given element is the set of rules that govern what it may contain: what other elements, in what order, and whether or not they are required. The content model for some elements is quite strict: for instance, the outermost element in a TEI document, TEI.2, must contain a single teiHeader element, followed by a single text element, in that order. For most other elements there is more flexibility, reflecting the heterogeneity of textual information. The content model for a p element (to encode prose paragraphs) allows a wide range of phrase-level elements as well as text characters, in any order. The content model for the div element allows: first, any members of the group of elements that typically begin a division (things like headings, salutations, epigraphs, datelines, arguments); next, any members of the group of elements that typically appear in the middle of a division (paragraphs, lists, quotations, and other chunk-like elements, or further div elements); and finally, any members of the group of elements that typically end a division (things like salutations, closers, trailers).
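
As a rough illustration, a content model can be thought of as a pattern over an element's children. The Python sketch below is not how a DTD is written or processed; it is a minimal stand-in, assuming hypothetical names (CONTENT_MODELS, obeys_content_model) and a drastically simplified model for div, to show how a strict model (TEI.2) and a more flexible one (div) behave.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified content models. Each is a regular expression
# matched against the element's child names, each followed by one space.
CONTENT_MODELS = {
    # Strict: exactly one teiHeader, then exactly one text.
    "TEI.2": r"teiHeader text ",
    # Flexible: beginnings, then middle elements, then endings.
    "div": r"(head )*((p|list|div) )*(closer |trailer )*",
}

def obeys_content_model(element):
    """Return True if the element's children match its simplified model."""
    child_names = "".join(child.tag + " " for child in element)
    return re.fullmatch(CONTENT_MODELS[element.tag], child_names) is not None

good_tei = ET.fromstring("<TEI.2><teiHeader/><text/></TEI.2>")
bad_tei = ET.fromstring("<TEI.2><text/><teiHeader/></TEI.2>")
good_div = ET.fromstring("<div><head/><p/><p/><trailer/></div>")
bad_div = ET.fromstring("<div><p/><head/></div>")

print(obeys_content_model(good_tei))  # True
print(obeys_content_model(bad_tei))   # False: wrong order
print(obeys_content_model(good_div))  # True
print(obeys_content_model(bad_div))   # False: head after p
```

Only child elements are checked here; text content and attributes are ignored.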

DTD, Document Type Definition

The document type definition, or DTD, for an XML file is the set of rules that govern the encoding language being used to encode the file. These rules determine the vocabulary of the language (the elements or tags that are permitted) and the syntax of the language: the way elements may be sequenced or nested inside one another.

Elements

The elements in an XML document are the textual features being encoded, together with the tags that mark their boundaries. A p element consists of a <p> start-tag, a </p> end-tag, and the paragraph’s content in between.

ID

An ID is a unique identifier for an element in your document. It is encoded as an attribute on the element being identified. You might think of it as something like a social security number: it is used to identify an element so that it can be pointed or linked to by other elements. IDs are encoded in TEI using the id attribute. For example:

 <note id="n01">The echo of Shakespeare here is surely deliberate.</note>  

An ID must start with a letter or an underscore, and thereafter may contain any sequence of alphanumeric characters, plus the underscore, hyphen, colon, and period. Note that while IDs and IDREFs are an important feature of P4, they do not exist in P5 and their functionality is provided in a different way.

IDREF

An IDREF is a form of pointer, which points to an ID elsewhere in the document. IDREF is not itself the name of any particular attribute. It denotes a kind of attribute which behaves in a certain way: namely, to point (REFer) to an ID. There are quite a number of TEI attributes which function as IDREFs, including corresp, target, sameAs, lang, who, and several others. They all share the property that they must point to an ID value somewhere else in the document, but their meaning (and the meaning of the relationship between elements which that pointing sets up) varies considerably.

Note that while IDs and IDREFs are an important feature of P4, they do not exist in P5 and their functionality is provided in a different way.
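
The relationship between IDs and IDREFs can be sketched in Python using the standard library's xml.etree.ElementTree. This is an illustration only: the sample fragment, the dangling_idrefs function, and the short list of attribute names are assumptions made for the example, and it follows the P4 convention of a plain id attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical P4-style fragment: two ptr elements point at notes by ID,
# but only one of the IDs actually exists.
SAMPLE = """
<body>
  <p>Some text <ptr target="n01"/> and more <ptr target="n99"/></p>
  <note id="n01">The echo of Shakespeare here is surely deliberate.</note>
</body>
"""

# A subset of the attributes described above as behaving like IDREFs.
IDREF_ATTRIBUTES = ("corresp", "target", "sameAs", "who")

def dangling_idrefs(root):
    """Return IDREF values that do not match any id in the document."""
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    missing = []
    for el in root.iter():
        for name in IDREF_ATTRIBUTES:
            value = el.get(name)
            if value is not None and value not in ids:
                missing.append(value)
    return missing

print(dangling_idrefs(ET.fromstring(SAMPLE)))  # ['n99']
```

A validator working from the P4 DTD would report the same kind of problem: an IDREF that points to no ID.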

P4, P5

P4 and P5 are the informal names given to the two most recent versions of the TEI Guidelines. The prefix P means proposal and was first used with the first release of the Guidelines (P1) in 1990, as an initial proposal to the organizations that sponsored the TEI’s development. Subsequent proposals were numbered P2 (1993), P3 (1994, the first widely released version), and P4 (2002, the first XML version). Work on P5 was begun in 2002 and a full release appeared in November 2007; P5 is now the current stable version of the TEI Guidelines. More information about the differences between the versions is available at the TEI web site.

Parser

An XML parser is a piece of software that reads through your document and examines its structure, as represented by the encoding. As it reads, it builds something like an intellectual model of your document, which expresses the nested structure of the elements in your document. It can detect and report problems with the structure: for instance, an element that is missing its end-tag, or an element that overlaps another element. A parser is useful in itself, and most XML editing software has a parser built into it. But it is even more useful when used in tandem with an XML validator.
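
To make this concrete, here is a small Python sketch that uses the parser in the standard library's xml.etree.ElementTree module. The function name structure_problem and the sample strings are made up for the illustration; the exact wording of the error messages depends on the parser and is not shown.

```python
import xml.etree.ElementTree as ET

def structure_problem(xml_text):
    """Return None if the text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None

properly_nested = "<p>A <hi>nested</hi> phrase.</p>"
missing_end_tag = "<p>A <hi>phrase with no end-tag.</p>"
overlapping = "<p>An <hi>overlapping</p> element.</hi>"

for sample in (properly_nested, missing_end_tag, overlapping):
    problem = structure_problem(sample)
    print("OK" if problem is None else f"Problem: {problem}")
```

The first sample parses cleanly; the other two are reported as structural problems, without any reference to a schema.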

Schema

A schema is essentially a set of rules that define an encoding language. These rules identify both the vocabulary of the language (the lexicon of elements and attributes that may be used) and the grammar of the language (the rules that determine how the elements may be nested and where they are permitted to appear). A schema is written in any one of a number of formal languages collectively called schema languages.

The oldest of these schema languages is the DTD language, and you may often hear schemas written in this language being referred to as DTDs (or document type definitions). Two other very popular and widely available schema languages are Relax NG and the W3C XML Schema Language.

The TEI schema for P4 is written in the DTD language. P5 is written in a schema language developed by the TEI, from which schemas in various languages, including Relax NG, DTDs, and W3C XML Schema, can be generated. Since the native language of P5 is closest to Relax NG, it is a good idea to use Relax NG for validating TEI P5 documents. Some constraints that the TEI requires are checked by the Relax NG schemas but not by the DTDs.

However, the TEI requires other constraints that cannot be checked using any of these languages. For some of these constraints TEI P5 uses another schema language, Schematron, as a kind of spot check. For example, the TEI says that the hand attribute of the add element should point to a hand element (probably, but not necessarily, in the header of the current document). This constraint cannot be expressed in a DTD or in Relax NG, so it is checked via a Schematron rule.

Tags: start-tag, end-tag

The tags in an XML document are the individual pieces of markup that mark the beginning and end of an element. Tags are always bounded with angle brackets: < and >. A start-tag contains the name of the element (for instance, p for a paragraph or list for a list in TEI) and may also contain one or more attributes which describe the element. Some sample start-tags: <quote>, <castList>, <titlePage>. The end-tag has a slash to distinguish it (for instance, </quote>, </castList>, </titlePage>), and contains no attributes. Tag names are always case-sensitive, and since capital letters are sometimes used within tag names to improve readability, it’s important to pay attention to case.
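
A brief Python sketch of these points, using the standard library's xml.etree.ElementTree. The sample elements, the attribute values, and the helper name is_well_formed are illustrative assumptions, not taken from any particular TEI file.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if every start-tag has a matching end-tag."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A start-tag carries the element name and any attributes.
note = ET.fromstring('<note id="n01" type="gloss">A short note.</note>')
print(note.tag)     # note
print(note.attrib)  # {'id': 'n01', 'type': 'gloss'}

# Tag names are case-sensitive: </castlist> does not close <castList>.
print(is_well_formed("<castList><castItem>Hamlet</castItem></castList>"))  # True
print(is_well_formed("<castList><castItem>Hamlet</castItem></castlist>"))  # False
```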

Unicode

Unicode is a single, unified system for representing nearly all written characters for nearly all human languages. It assigns to each character a unique number or code point which identifies the essential definition of the character (independent of font or platform). Characters can be transcribed into your document by making reference to the appropriate code point. For instance, the Unicode code point for an e with an acute accent (é), expressed in hexadecimal notation, is 00E9. To include this character in an XML document, you can express it using a numeric character reference, e.g. &#x00E9;. Charts of all the Unicode characters are available at the Unicode web site.
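
The link between a character and its code point can be checked in a few lines of Python. This sketch uses only built-in functions and the standard library's xml.etree.ElementTree; the sample word is arbitrary.

```python
import xml.etree.ElementTree as ET

# From code point to character, and back again.
print(chr(0x00E9))          # é
print(f"{ord('é'):04X}")    # 00E9

# Written as &#x00E9; or typed directly, the character is the same
# once the XML has been parsed.
with_reference = ET.fromstring("<p>caf&#x00E9;</p>")
with_literal = ET.fromstring("<p>café</p>")
print(with_reference.text == with_literal.text)  # True
```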

Validation

An XML file is said to be valid when it obeys the rules of its schema: when it uses only elements and attributes that are defined by the schema, and uses them in ways that the schema permits. Validation is the process by which the file is tested against the schema to determine whether it is in fact valid. A valid file may still be incorrectly encoded (for instance, it may misidentify a poem as a piece of prose), so validation does not really check the correctness of the encoding as a representation of the text. However, it does check to make sure that no nonsensical errors have been made (such as putting a chapter in the middle of a heading).

Validator

A validator is similar to a parser in that both are software concerned with the structure of your document (as represented through your encoding). A validator has a more specific set of constraints to observe, however: instead of simply checking whether elements nest properly, it also reads the schema (for instance, the DTD). This enables it to test whether a given element is included in the vocabulary of the encoding language you are using, or whether a given element is permitted in a specific place. For instance, the validator can tell you that a p element is not permitted inside a head element, or that within a div, a head may come before a p but not the other way around.
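
The difference from a parser can be sketched in Python. The example below is a toy, not a real validator: the ALLOWED_CHILDREN table is a hypothetical and drastically simplified stand-in for a schema, the function name validation_errors is made up, and the sketch checks only vocabulary and parent-child permissions, not the order of elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema: the vocabulary (the keys) and the
# children each element is permitted to contain.
ALLOWED_CHILDREN = {
    "div": {"head", "p", "div"},
    "head": {"hi"},
    "p": {"hi"},
    "hi": set(),
}

def validation_errors(root):
    """Return a list of vocabulary and placement problems."""
    errors = []
    for parent in root.iter():
        if parent.tag not in ALLOWED_CHILDREN:
            errors.append(f"unknown element: {parent.tag}")
            continue
        for child in parent:
            known = child.tag in ALLOWED_CHILDREN
            if known and child.tag not in ALLOWED_CHILDREN[parent.tag]:
                errors.append(f"{child.tag} not permitted inside {parent.tag}")
    return errors

# Both samples are well-formed, so a parser alone would accept them.
misplaced = ET.fromstring("<div><head><p>Oops</p></head><p>Fine</p></div>")
unknown = ET.fromstring("<div><blockquote/></div>")

print(validation_errors(misplaced))  # ['p not permitted inside head']
print(validation_errors(unknown))    # ['unknown element: blockquote']
```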

XSLT, Extensible Stylesheet Language Transformations

XSLT is one of the central tools for manipulating XML data. In essence, it is a language for transforming XML files into other formats: into other kinds of XML, or into non-XML formats such as RTF. It is very commonly used to transform TEI files into HTML for web publication, and also for converting the encoding of TEI files into other kinds of TEI encoding: for instance, to change a set of documents encoded in P4 into P5.