WWP Book

Encoding methods and delivery system

The RWO project gave the WWP an opportunity not only to add an important group of Renaissance materials to our online collection, but also to test and refine our encoding system on a corpus of earlier texts in a wide range of genres. During the period covered by the RWO grant, we made substantial improvements to our encoding system based on our research with RWO texts, streamlined our encoding infrastructure, and added tools which increase the speed and accuracy of the encoding process. We also developed a customized online delivery system which provides a search and browsing environment suited to teaching and scholarly research.

Encoding methods

The WWP uses the Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange (TEI), with TEI-conformant modifications as necessary to accommodate the idiosyncrasies of early modern texts. These modifications have been carefully documented and will be submitted to the TEI for possible inclusion in the next release of the TEI Guidelines. They fall into several categories:

  • cases where TEI prescribes too strict a limitation on where a given textual element may appear, or what it may contain. For instance, TEI does not provide for handwritten annotations on the title page of a document, but the WWP has often needed to transcribe handwriting on title pages of our early texts. Another example on which we spent considerable research time is notes, which in TEI have a very strictly prescribed structure, but which in our collection take various forms not envisioned by TEI.
  • cases where, on the contrary, TEI is more permissive than we wish to be about how a textual element is recorded, though these are less significant.
  • cases where the TEI does not provide for certain kinds of information which we feel are important to record. For instance, we wish to record the name and gender of each person who contributed to the production of the text, for retrieval purposes (author, translator, editor, printer, publisher, engraver, etc.), but TEI provides no convenient place to group this information. Similarly, TEI deliberately does not provide an explicit, effective means for recording the original rendition of the text, and we have had to develop such a system ourselves.

In addition to the general structural markup of the text itself, the WWP finds it important to record and mark up various kinds of information which are essential to scholars working with primary sources in digital form, and which help provide a familiar environment for scholarly research. Most important of these is the metadata which preserves detailed bibliographic information on the source text, including Wing and STC numbers, source library and shelfmark, and facts of publication and authorship. This information is recorded in the header for each document and is heavily exploited in our search interface. We also document the condition of the source text, including any areas which are damaged or illegible.

Encoding support systems

The WWP's encoding staff use a Unix-based environment with an SGML-aware text editor (Emacs with psgml) for their text encoding work. This basic environment provides constraints which guarantee that the encoded texts conform to the TEI document type definition, and it also guides the encoder by offering a list of legal TEI elements at any given place in the text. Encoders begin their work with a template which already includes standard information and a framework for creating a full TEI header for the document. In addition, the WWP has written several tools which assist the encoder by streamlining the encoding process or by automatically tagging certain kinds of textual features. These tools include:

  • automatic tagging and regularization of early use of i/j/u/v/w
  • automatic tagging and regularization of Biblical citations (under development)
  • automatic tagging of page number sequences
  • automatic checking and flagging of errors in collation, in the encoding of personal names, and in the encoding of rendition (none of which are caught by the standard SGML parser/validator)
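
As an illustration of the kind of check a standard SGML validator cannot perform, here is a minimal sketch in Python of flagging breaks in a page number sequence. It is not the WWP's actual tool, which this report does not describe in detail. The function name and the assumption that page numbers have already been extracted from the encoded text as integers are hypothetical.

```python
from typing import List, Tuple


def flag_page_sequence_breaks(pages: List[int]) -> List[Tuple[int, int, int]]:
    """Return (index, expected, found) for every page number that does not
    follow its predecessor by exactly one.

    Hypothetical helper: assumes the page numbers were already pulled out
    of the encoded text, in document order, as plain integers.
    """
    breaks = []
    for i in range(1, len(pages)):
        expected = pages[i - 1] + 1
        if pages[i] != expected:
            breaks.append((i, expected, pages[i]))
    return breaks


# Invented sample: page 4 is skipped and page 6 is repeated.
printed = [1, 2, 3, 5, 6, 6, 8]
print(flag_page_sequence_breaks(printed))
# [(3, 4, 5), (5, 7, 6), (6, 7, 8)]
```

A flagged break is not necessarily an encoding error, since early printed books are often mispaginated in the source itself. A check of this kind only directs the encoder's attention to places that need a decision.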

Delivery system

As creators of richly encoded SGML data, the WWP is one of a number of projects currently facing the same problem: SGML publication software is still scarce and is designed for industrial production settings rather than for academic projects in the humanities. Tools for publishing SGML content on the World Wide Web (such as INSO's, or since late 1999 Enigma's, DynaWeb) are even scarcer and are also not designed with scholarly uses in mind. The advent of XML is widely predicted to offer a solution to these problems, but at the time the WWP was planning our initial publication, we had the choice of customizing an existing application or of designing one ourselves from scratch. Although the latter option would theoretically have given us more flexibility and control over the resulting product, it raised a number of concerns. The expense of software development was first among these, particularly because the actual cost of creating a functional system from scratch was difficult to estimate with precision. We also knew that although we could probably develop an SGML-to-HTML transformation system fairly easily for our specific texts, we would not be able to make it general enough to allow for easy expansion, nor could we easily support the rapid content-based indexing provided by commercial software. Finally, creating a new application ourselves would necessarily be an all-or-nothing approach: we risked being caught with no delivery system at all if we encountered any serious problems.

We had already experimented with DynaWeb, and although its default interface and functionality were ill-suited to our purposes, we thought we could build a customized interface with most of the functionality we sought. The advantages of this approach were that we would be able to start using the system in its uncustomized state almost immediately, and add improvements as we developed them. Furthermore, if the project turned out to be a long-term success, we could design a custom application ourselves later on, possibly taking advantage of the arrival of XML-aware software and support systems.

Accordingly, we decided to build a custom interface and to base our delivery system on DynaWeb. In DynaWeb the underlying infrastructure of indexing, searching, and processing the encoded data (which is performed by DynaText, an SGML search engine) is separated from the display of this data on the web. The display works by a system of style sheets which dynamically translate SGML data into HTML. From the user's point of view, the data is simply HTML which can be viewed with a standard web browser. However, searches and word- or structure-based functions are passed back to the DynaText engine and performed on a preprocessed form of the SGML data, allowing for the exploitation of specialized markup. Thus, for instance, the user can limit a word search to verse drama, even though HTML has no ability to represent or flag particular genres. The advantages of this general solution for us were considerable: the user would not need any specialized software or skills, the purchasing institution would not need to install anything locally, and the value of our SGML encoding would not be lost by down-translation to HTML (as it would be in a static, one-time translation system). Also, unlike systems such as SoftQuad's Panorama, which downloads an SGML text to the user's computer and allows specialized processing to occur locally, DynaWeb can search and selectively display information from the entire corpus. Panorama requires custom software to be installed locally and can only really handle one document at a time; both disadvantages ruled it out for the kinds of uses we wanted to encourage.

On top of this basic system, we created a custom interface which provides several important features:

  • keyword-in-context (KWIC) display of search results, crucial for viewing large result sets. This display lists search hits with about 10 words of context surrounding each hit word, allowing the user to browse the hit list and quickly identify the hits of interest. This list is also sortable by author, date, and other categories, so that the user can get a quick profile of where the hits occur, or (if sorted by date) of the changing usage patterns over time for a given word. This feature is rarely available in standard industry text delivery systems, although academics have long used such displays in highly individual or customized systems, and in print concordances for even longer.
  • advanced search interface. Our search interface offers the user the ability to do word and phrase searches (including Boolean operators and wildcards), proximity searches, and context-sensitive searches which exploit the text's markup to narrow a search to particular textual features specific to this collection. In addition, the user can search based on bibliographical information, such as Wing or STC number, source library or shelfmark, length or size of the book, and facts of publication, based on metadata encoded in the TEI header. To these categories we will be adding genre and subject keywords as well within the next year. In our next upgrade, we will be offering the ability to combine these different kinds of searches (for instance, to find the word "wit" within ten words of "love" within dramatic texts written between 1670 and 1680). We will also be offering the ability to save a search, either to requery it more specifically, or to reuse it in a later session.
  • navigational features. The challenge in delivering any large electronic collection is to ensure that the user never feels lost within the structure of the collection, or within any given text. Our customization offers intuitive navigation from text to text, and from section to section within a text; it also provides a clear sense of where the user is within the collection at all times. The system is able to take advantage of the structure imparted by SGML for display and chunking of the text, but at the same time it saves the user from the need to be always aware of the hierarchical structures imposed on the document by the encoding.
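
The KWIC display and the proximity search described in the list above can be sketched in a few lines of Python. This is only an illustration of the two techniques on plain text, not the DynaText implementation, which works on indexed SGML. The function names, the simple tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into word tokens (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", text)


def kwic(text: str, keyword: str, width: int = 5) -> List[str]:
    """Return one line per hit, with up to `width` words of context on
    each side (so about 10 words of context with the default)."""
    words = tokenize(text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


def within(text: str, a: str, b: str, distance: int = 10) -> bool:
    """True if some occurrence of `a` lies within `distance` words of
    some occurrence of `b`."""
    words = [w.lower() for w in tokenize(text)]
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


# Invented sample sentence, not taken from the collection.
SAMPLE = "Her wit was sharp and her love of learning sharper still"

print(kwic(SAMPLE, "wit", width=3))      # ['Her [wit] was sharp and']
print(within(SAMPLE, "wit", "love", 10)) # True
```

In a real delivery system the same functions would run against an index rather than raw text, and each hit would carry metadata such as author, date, and genre so that the hit list can be sorted or the search restricted.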
