WWP Digitization Process
The WWP digitization process is in many ways typical of XML-based digital humanities projects. It involves a careful analysis of the source text, transcription and encoding using a schema to constrain the data and ensure consistency, several automated and manual checks for encoding quality, two cycles of proofreading and correction, and a final review. All of these processes are documented internally or elsewhere on this site, but an overview of the system may be useful to give readers a sense of what is involved.
- Cataloguing and metadata development:
When a new text enters the WWP work process, it is catalogued in the database we use to manage information about our texts and workflow. We record essential bibliographic information about the source, including the specific edition and copy being used and the library where the source is held. This enables readers to consult the exact copy from which our transcription is made, if necessary. From this record, we generate a template file with full metadata (encoded in the TEI header, the metadata section of a TEI-encoded document), which serves as the starting point for the encoded text.
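The template-generation step can be sketched as follows. This is a minimal illustration, not the WWP's actual tooling: the record fields, the `build_template` function, and the choice of `<bibl>` and `<note type="copy">` to hold the edition and holding library are all assumptions made for the example, and the record values are placeholders.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical catalogue record; the field names and values are placeholders.
EXAMPLE_RECORD = {
    "title": "An Example Title",
    "author": "Example Author",
    "edition": "Second edition",
    "library": "Example Library",
}


def q(name: str) -> str:
    """Return the namespace-qualified form of a TEI element name."""
    return f"{{{TEI_NS}}}{name}"


def build_template(record: dict) -> ET.Element:
    """Build a skeleton TEI document whose header is filled from the record."""
    tei = ET.Element(q("TEI"))
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))

    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = record["title"]
    ET.SubElement(title_stmt, q("author")).text = record["author"]

    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("p")).text = "Working template, not yet published."

    # Describe the exact source: edition used and the library holding the copy.
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    bibl = ET.SubElement(source_desc, q("bibl"))
    ET.SubElement(bibl, q("title")).text = record["title"]
    ET.SubElement(bibl, q("author")).text = record["author"]
    ET.SubElement(bibl, q("edition")).text = record["edition"]
    note = ET.SubElement(bibl, q("note"), {"type": "copy"})
    note.text = f"Copy held at {record['library']}"

    # Empty body: the encoder's transcription starts here.
    text = ET.SubElement(tei, q("text"))
    ET.SubElement(text, q("body"))
    return tei


if __name__ == "__main__":
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    print(ET.tostring(build_template(EXAMPLE_RECORD), encoding="unicode"))
```

Generating the header from the catalogue record, rather than typing it by hand, keeps the bibliographic details in the encoded file consistent with the database entry.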
- Document analysis:
The next step focuses on analysing the document's encoding requirements. Here we record information about the document's internal structure, its physical construction (pagination, signatures, and presentational details), the languages and handwriting used in the document, and any special characters or features that may require special treatment. Some of this information is used to establish default settings for renditional information, or to add metadata such as handwriting descriptions or language references to the TEI header.
- Transcription and encoding:
Once the document analysis is complete, the encoder begins transcribing the content of the document and representing its structure and content through XML encoding. The WWP's encoding system is a customization of the TEI Guidelines and is documented in detail separately. Encoders are graduate and occasionally undergraduate students who receive extensive ongoing training in XML and the TEI and typically work at the WWP for between two and five years.
- Validation and supravalidation:
While the text is being encoded, and again when the capture process is complete, its encoding is checked by validating it against the WWP encoding schema. This ensures that the encoding follows the essential structural rules defined by this schema. In addition, we use a supravalidation process that checks for specific kinds of inconsistencies and errors that schema validation cannot catch: for instance, it prompts the encoder to check for omitted catchwords, or to verify whether a given speaker label should contain a proper name. For some texts (those later in our chronological period, where spelling is more consistent) we also perform automated spellchecking. When these checks are complete, the text is ready for proofreading.
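To illustrate how supravalidation differs from schema validation, here is a minimal sketch of the two example checks mentioned above. It is not the WWP's actual supravalidation code. The element names are assumptions chosen for the example: `pb` for page breaks, `fw type="catch"` for catchwords, and `persName` inside `speaker` for proper names. The sample is also un-namespaced for brevity. Note that the checks produce prompts for a human to review rather than hard errors, since a missing catchword may accurately reflect the source.

```python
import xml.etree.ElementTree as ET


def supravalidate(xml_text: str) -> list[str]:
    """Return prompts for the encoder; none of these are schema errors."""
    root = ET.fromstring(xml_text)
    prompts = []
    current_page = None   # value of @n on the most recent <pb>
    saw_catchword = False

    # root.iter() visits elements in document order.
    for el in root.iter():
        if el.tag == "pb":
            # A new page begins: report on the page that just ended.
            if current_page is not None and not saw_catchword:
                prompts.append(
                    f"page {current_page}: no catchword encoded; check the source"
                )
            current_page = el.get("n", "?")
            saw_catchword = False
        elif el.tag == "fw" and el.get("type") == "catch":
            saw_catchword = True
        elif el.tag == "speaker" and el.find("persName") is None:
            label = "".join(el.itertext()).strip()
            prompts.append(
                f"speaker label {label!r}: should this contain a proper name?"
            )
    # The final page is deliberately not checked, on the assumption that
    # the last page of a text normally carries no catchword.
    return prompts


SAMPLE = """<text>
  <pb n="1"/><p>First page text</p><fw type="catch">and</fw>
  <pb n="2"/><sp><speaker>Lady</speaker><p>and so on</p></sp>
  <pb n="3"/><sp><speaker><persName>Maria</persName></speaker><p>the end</p></sp>
</text>"""

if __name__ == "__main__":
    for prompt in supravalidate(SAMPLE):
        print(prompt)
```

Both flagged cases in the sample would pass a structural schema check, which is why a separate pass of this kind is useful.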
- Proofreading:
The WWP conducts two cycles of proofreading. The first is performed using a printout that displays both the transcribed content and the XML encoding, to permit checking of both the accuracy of the transcription and the correctness and consistency of the encoding. The second cycle is performed on a formatted output that displays the text in a manner similar to the WWO interface, without the XML visible. This allows the proofreader to focus on catching any remaining transcription errors and to check renditional encoding more transparently. All corrections are marked on the proof copy.
- Correction entry and checking:
Following each round of proofreading, corrections are made in the source XML, and then checked by a second person.
- Final review:
Once the text has completed two cycles of proofreading, a final review is performed by one of the WWP senior staff, who checks the encoding, repeats the validation and supravalidation processes, and reviews the overall quality of the transcription. If there are any concerns at this stage, the document is sent back for a further proofreading of the XML.
- Transformation and publication:
After the final review, the XML text is processed and transformed into various output formats which are published through Women Writers Online. These published versions may display the text in various ways, to support different kinds of reading and usage. For example, a reading interface might display the text with corrected readings of any errors in the source, while a research interface might display the original error together with a corrected reading. These choices are controlled by stylesheets, which also produce the visual form of the published text. The presentation of WWO texts is based on the original layout and font choices, but makes no attempt to reproduce the visual display of the original page precisely.
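The idea of producing different displays from a single encoded source can be sketched as follows. The source describes this as the job of stylesheets; the Python function below is only a stand-in to show the logic, not the WWP's transformation code. It assumes errors are encoded with the TEI `choice`, `sic`, and `corr` elements, and the `render` function, the mode names, and the bracketed display format are hypothetical.

```python
import xml.etree.ElementTree as ET


def render(el: ET.Element, mode: str) -> str:
    """Flatten an element to plain text in 'reading' or 'research' mode."""
    if el.tag == "choice":
        sic = el.findtext("sic", default="")
        corr = el.findtext("corr", default="")
        if mode == "reading":
            # Reading view: show only the corrected form.
            return corr
        # Research view: show the original error alongside the correction.
        return f"{sic} [corrected: {corr}]"

    # Any other element: its own text, then each child followed by its tail.
    parts = [el.text or ""]
    for child in el:
        parts.append(render(child, mode))
        parts.append(child.tail or "")
    return "".join(parts)


SAMPLE = (
    "<p>She was <choice><sic>recieved</sic><corr>received</corr></choice>"
    " kindly.</p>"
)

if __name__ == "__main__":
    paragraph = ET.fromstring(SAMPLE)
    print(render(paragraph, "reading"))   # She was received kindly.
    print(render(paragraph, "research"))  # She was recieved [corrected: received] kindly.
```

Because both readings are recorded in the XML, the choice between them can be made at publication time without altering the transcription.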