SGML

Standard Generalized Markup Language (SGML) is strictly speaking not an encoding system but a system for defining encoding systems. It is in effect a lingua franca which has been defined by the international standards community as a way to describe one's encoding system in a manner which makes it universally intelligible and usable. To be SGML-conformant, an encoding system must fulfill the following two requirements:

These two requirements allow considerable advantages, both practical and conceptual, and also create concomitant challenges of both sorts which are beyond the scope of this discussion but which are addressed in a growing body of literature; see for instance "Refining Our Notion of What Text Really Is" (Renear, Durand, and Mylonas, 1996). For our present purposes, however, it is important to note the fundamental advantage of SGML, which is that it is both non-proprietary and platform-independent: it separates the encoding of the data from any particular computer, software, or company. This independence frees the data from vulnerability to both the whims of the market and inevitable technological changes, and allows the undertaking of large-scale encoding projects without the fear that a change in platform will require the re-encoding of all the data. Although at present users of SGML may find that "software independence" means that there is little software available specifically for use with SGML, the widening acceptance of the standard and its various applications (such as the Text Encoding Initiative) is generating a steady increase in software for editing, converting, and using SGML-encoded texts. At the time of writing (September 1996), most major word-processors have some SGML capabilities.

To specify what is meant here by "encoding": in an SGML encoding system, encoding is a process of enclosing each textual element within two tags which mark its beginning and ending. A textual element may be a single letter or word, a phrase or grammatical unit, a chunk of text such as a list, a paragraph, or a poetic stanza, a larger conceptual unit such as a poem, a letter, or a play, or an entire text. All tagged elements (as described above) must nest; they may not overlap. Each element may carry additional information describing its appearance, its function, the language in which it is written, or any other details the encoder wishes to record.
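
The nesting rule described above can be illustrated with a small sketch. It is not an SGML parser: it assumes every element has an explicit start tag and end tag (full SGML also permits tags to be omitted in some circumstances, which this ignores), and the element names stanza, line, and name, the lang attribute, and the helper nests_properly are all hypothetical, chosen only for illustration. The check keeps a stack of the elements currently open; an end tag is acceptable only if it closes the most recently opened element.

```python
import re

# Hypothetical helper, not part of any SGML toolkit: matches simple
# start tags such as <line> or <line lang="en"> and end tags such as </line>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")


def nests_properly(text):
    """Return True if every element is closed in reverse order of opening."""
    open_elements = []  # stack of the element names currently open
    for match in TAG.finditer(text):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()
        if not is_end_tag:
            open_elements.append(name)
        elif not open_elements or open_elements.pop() != name:
            # The end tag does not close the innermost open element,
            # so two elements overlap (or the end tag has no start tag).
            return False
    return not open_elements  # no element may be left open


# A stanza containing a line containing a name; the line carries an attribute.
nested = '<stanza><line lang="en">Sing, <name>Muse</name></line></stanza>'
# Here the line ends before the name inside it has been closed.
overlapping = "<line>Sing, <name>Muse</line></name>"

print(nests_properly(nested))
print(nests_properly(overlapping))
```

The first sample is accepted because each element closes inside the element that contains it; the second is rejected because the line and the name overlap.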

SGML has been implemented for many different purposes by a wide range of users and organizations. Its adoption by humanities researchers has been comparatively recent; it was first applied by industrial, military, and bureaucratic organizations. The tag sets and document structures which these organizations developed, however, were obviously little suited to humanities purposes. The Text Encoding Initiative (TEI) was established in order to develop an application of SGML which would accommodate the needs of humanities text encoding projects. The result was the TEI Guidelines for Text Encoding and Interchange, which defines a tag set and document structure for humanities texts as diverse as dictionaries, linguistic corpora, literary texts, and newspapers. The encoding provided by the TEI Guidelines can be applied flexibly, depending on the needs of the project in question; some projects may elect to use very few of the elements available, while others may wish to tag more intensively and record more detail and nuance of the source material. Denser tagging is of course more expensive, but also yields more function in the end: users can perform more effective, carefully targeted searches and get more precise results from more carefully tagged material; they gain more control over output; their analyses yield more meaningful results.

Another more familiar example of an application of SGML for a particular purpose is Hypertext Markup Language (HTML), which was developed to provide a simple system for encoding text which could be reliably exchanged, interpreted and displayed. With simplicity as one of the chief goals in its development, power and nuance could not also be accommodated, with the result that HTML is unsuitable for most serious humanities encoding purposes, or for the creation of large-scale research tools such as electronic dictionaries, encyclopedias, or text corpora.

For more information on SGML, the SGML Web Page created by Robin Cover is an exceptionally useful resource.
