Encoding Guide for Early Printed Books

Publication Tools for TEI Documents

Simple Approaches: CSS

The simplest way to publish a TEI-encoded document (besides emailing it to a colleague) is to make it visible on the web directly. Because TEI is expressed in XML, any XML-aware web browser can read TEI markup. Given a stylesheet that describes how to display this markup, the browser can format and display a TEI document much as it does an HTML page. At the time of writing, most major browsers are capable of displaying TEI/XML documents in this way.

There are some limitations to this approach. At the moment, there is no way of creating working hyperlinks with CSS alone; browsers only treat links as links in HTML documents. There is also no way of altering the order in which the document’s information is presented, although you can adjust the positioning of chunks of text and also suppress them altogether.

What you need to do:

  • Create a CSS stylesheet: Cascading Style Sheets (CSS) is a simple technology that associates each XML element with a set of style information (such as font size, indentation, color, and so forth). The CSS language is easy to learn; tutorials are readily available on the web.
  • Associate the stylesheet with your TEI document: Once you’ve created your stylesheet, you need to tell your TEI document where to find it. This information is stored at the top of your file, as illustrated here:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/css" href="wwpguide.css"?>
    <TEI>
    ...
    </TEI>
          
    The href attribute contains a path or pointer to the CSS stylesheet, wherever it lives.
  • View the document with an XML-aware browser: Browsers with this capability include the current versions of Firefox, Safari, Internet Explorer, Camino, and many others.

More Powerful Approaches: XSLT

XSLT (Extensible Stylesheet Language Transformations) makes it possible to transform XML data (such as TEI-encoded files) into any other XML format (such as XHTML) and also, directly or through further processing, into other useful formats such as PDF, RTF, tab-delimited data, and many others. These transformations may result in a simple mapping of TEI elements onto elements in the target language: for instance, to enable a TEI-encoded text to be displayed as XHTML on the web, with TEI ref elements transformed into working links using XHTML a elements.

However, XSLT can also be used to make more profound changes to the document structure, by selecting, omitting, reordering, or restructuring parts of the original XML document. For example, an index of first lines could be generated from a TEI collection of poems, using an XSLT stylesheet that selects only the first l element from each poem and then sorts the results into alphabetical order. Or a TEI-encoded magazine containing serial fiction could be transformed so as to extract and present each individual narrative in its entirety. In a TEI-encoded file representing a document containing hand-written revisions, XSLT could be used to generate two views of the document: the unrevised version and the final revised version. XSLT is an extraordinarily powerful tool and forms the basis of (or contributes to) most XML publication systems.
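As an illustration, the index of first lines described above might be sketched as follows. This is a minimal XSLT 1.0 sketch rather than a finished stylesheet, and it rests on assumptions not stated in this guide: that the source file uses the TEI P5 namespace, and that each poem is encoded as a div element with type="poem". A project with different encoding practices would need to adjust the two XPath expressions accordingly.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="tei">

  <xsl:output method="xml" indent="yes"/>

  <!-- Build the whole XHTML page from the document root. -->
  <xsl:template match="/">
    <html>
      <head>
        <title>Index of First Lines</title>
      </head>
      <body>
        <h1>Index of First Lines</h1>
        <ul>
          <!-- Assumption: each poem is a div with type="poem". -->
          <xsl:for-each select="//tei:div[@type = 'poem']">
            <!-- Sort the poems by the text of their first l element. -->
            <xsl:sort select="normalize-space((.//tei:l)[1])"/>
            <!-- Output only that first line; the rest of the poem is omitted. -->
            <li>
              <xsl:value-of select="normalize-space((.//tei:l)[1])"/>
            </li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

The sort here uses the processor's default text ordering, so lines that begin with punctuation or with a capitalized article will sort accordingly; a production index would probably need a more carefully constructed sort key.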

Using only XSLT, a small-scale project can develop a serviceable interface through which readers can browse and search a TEI-encoded collection. There are two main ways in which XSLT is used:

  • Static transformations: In a static transformation, the XSLT stylesheet transforms the source data (for instance, TEI) into an output format (for instance, XHTML). This output can then be placed on a server to be viewed on the web (or fed into another transformation for further processing). When the source data is updated—for instance, to correct an error or to add new material—the transformation needs to be run again in order to update the published output.
  • Dynamic transformations: In a dynamic transformation, both the source data and the XSLT stylesheet are located on a web server, as part of an XML publication. When a reader comes to the publication’s web site and asks to see a text (for instance by clicking on a link), the server runs the transformation and produces the output then and there and presents it to the reader. This process enables the server to provide the reader with different views of the content depending on their request: for instance, by clicking on links to “Sort by Author” or “Sort by Title” the reader can receive the same list of texts sorted by author (with the author’s name printed first) or by title (with the title printed first), based on the same source data. When the source data is updated, the changes made are reflected instantly in the interface, without requiring that data be transformed and moved from place to place.
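The “Sort by Author” and “Sort by Title” views mentioned above could be produced from a single stylesheet that accepts a parameter. The following XSLT 1.0 sketch is illustrative only: the parameter name sort-by is hypothetical, and the stylesheet assumes that the source is a TEI P5 file or corpus in which each text's teiHeader contains a titleStmt with title and author elements. In a dynamic setup the server would set the parameter for each request; in a static setup the same stylesheet could simply be run twice to produce two output files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="tei">

  <!-- Hypothetical parameter: 'author' (the default) or 'title'. -->
  <xsl:param name="sort-by" select="'author'"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <html>
      <head>
        <title>List of Texts</title>
      </head>
      <body>
        <ul>
          <!-- One list item per text, taken from its title statement. -->
          <xsl:for-each select="//tei:TEI/tei:teiHeader/tei:fileDesc/tei:titleStmt">
            <!-- Sort on whichever child element the parameter names. -->
            <xsl:sort select="normalize-space(*[local-name() = $sort-by][1])"/>
            <xsl:variable name="author" select="normalize-space(tei:author[1])"/>
            <xsl:variable name="title" select="normalize-space(tei:title[1])"/>
            <li>
              <xsl:choose>
                <!-- Title view: print the title first. -->
                <xsl:when test="$sort-by = 'title'">
                  <xsl:value-of select="concat($title, ', ', $author)"/>
                </xsl:when>
                <!-- Author view (and fallback): print the author first. -->
                <xsl:otherwise>
                  <xsl:value-of select="concat($author, ', ', $title)"/>
                </xsl:otherwise>
              </xsl:choose>
            </li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

With a command-line processor such as xsltproc, the parameter can be supplied using the --stringparam option (for example, --stringparam sort-by title); how a web server passes the reader's choice to the stylesheet depends on the publication framework in use.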

The limitations of XSLT for publishing TEI are partly those of scale: because XSLT operates by processing XML directly, it is comparatively slow as a means of working with large quantities of data. For very large projects with thousands of long documents, XSLT on its own is probably not sufficiently fast or powerful to build a working publication. In addition, although XSLT can be used to search a text, it is not really designed for this purpose, and with any substantial amount of data it will be unworkably slow as a search tool.

Whereas a comparative novice can simply pick up CSS informally, by experimentation, XSLT is a more complex tool and requires both more time and (for most people) more instruction to learn. For most TEI projects of any scale, the development of the necessary XSLT stylesheets will be a substantial task and will probably be the responsibility of a programmer or technical consultant. However, it is by no means beyond the capacity of humanists to learn, and a growing number of workshops and tools are available to assist those who want to work with XSLT on their own. For example, the NINES project offers an annual summer workshop which includes a strand on XSLT, and they are also developing tools to assist scholars in building and using simple XSLT.

XML Publication Systems

XML publication systems are more complex tools (or aggregations of tools) that improve on the publication methods discussed above by increasing the speed and efficiency of the processing being done (and hence of the resulting interface), and also by expanding the functions that can be provided. XML publication systems typically include some kind of search engine coupled with a system for indexing the XML data: essentially, putting the data into a form that (like an index) lends itself to speedy searching. These publication systems also usually incorporate a framework for managing the interface of the publication (through XSLT and CSS) so that the texts being published can be transformed into XHTML and displayed with the appropriate appearance. In addition, they may include modules to permit specialized functions such as text analysis, visualization, and linking to other projects’ data or to centralized information resources such as Google Maps.

Systems like these are typically intended for use by projects with substantial quantities of data and a need for highly functional interfaces that include complex searching and analysis. As a result they are designed to be installed and managed by programmers and those familiar with system administration, and do not usually lend themselves to use by individual humanities scholars or those who lack access to technical support. In addition, they may require significant configuration in order to function within the context of a specific project; if they operate out of the box at all, it may only be in a very plain and generic way that does not do justice to the details of the project’s own data and user needs. These configurations may include the development of project-specific stylesheets, configuration of indexing modules to deal with the specific elements that must be indexed for a specific data set, and basic setup of permissions and paths to allow the system to operate within the local server environment.

There are a number of open-source XML publication systems, as well as commercial products at various levels of cost and functionality. Examples that the WWP has experimented with to some degree include:

  • Philologic (open-source; developed at the University of Chicago). Philologic is a fast and powerful publication tool that indexes TEI data (through a fairly lengthy ingestion process) and stores the results in a database for fast querying. The indexing can be customized so that specific features of interest can be included in the search interface: for instance, in a project focused on drama, one might choose to index all stage directions and speaker labels in addition to the ordinary publication metadata, to permit these to be searched. Philologic also offers text analysis features such as collocation and word frequency analysis. (A new extension to Philologic, PhiloMine, provides text mining features that extend Philologic’s text analysis capabilities. However, PhiloMine is still under development and may not be ready for production use; for more information visit the PhiloMine site.) Philologic will work with basic TEI data out of the box, but requires very significant customization to adapt it to more specialized functional requirements and data. Philologic offers a set of sample collections that demonstrate the kinds of features that are available without requiring customization.
  • eXtensible Text Framework (XTF) (open-source; developed by the California Digital Library). XTF is an indexing, display, and query tool that can be used to create TEI publications. It is built to handle large volumes of data. Like Philologic, XTF requires very significant configuration to produce a working publication of any complexity.
  • TEIPublisher (open-source; developed at the University of Maryland). TEIPublisher is a smaller-scale publishing system which is designed with individual users in mind. It provides a fairly easy graphical interface through which one can configure the basic details of the publication: what elements will be indexed, what metadata will be displayed or offered as search fields, what stylesheets will be used to control the appearance, etc. It can be used to develop a simple digital library or web site based on a collection of TEI-encoded texts, including basic searching and browsing. It uses the eXist XML database and Apache’s Lucene search engine (both open-source). TEIPublisher is in many ways an ideal tool for individuals and small projects, and its only drawback is that it is not currently under active development.

In addition to these systems in which the various components are already bundled together, one can also build an XML publishing system that combines various open-source components as needed to provide the functionality required. The available components and possible combinations are too many to detail here, and such a system would require the work of a programmer or someone with similar expertise.