Julia Flanders
Women Writers Project
Brown University
Providence, RI
Julia_Flanders@brown.edu

Inside the Black Box: designing and understanding online research tools

Introduction: The Spectre of Bad Research

Roseanne Potter and others, writing on computers and literary study, reflect a concern on the part of humanities scholars that using electronic texts for their research will have one of two bad effects: either it will be irrelevant (that is, it will have no impact on our research habits), or it will be damaging and will pervert the nature of humanities research.

More optimistic critics imagine that the impact will be positive: there is some of this in Potter, in her approving language of "proof" and "certainty" and of "known facts about language".1 By and large, the view that we need improved notions of proof and objectivity in dealing with literary texts is not one explicitly shared by most of today's literary scholars and critics. At the same time, the bread-and-butter literary essay does rely on notions of evidence and proof: it offers a thesis and defends it with material from a set of texts, according to procedures governed by shared standards of convincingness.

There is a seductiveness about scientific rhetoric, with its urge for precision, for proof, for objectivity. Humanities disciplines have been grappling both with an attraction to this kind of rhetoric and with a substantial critique of it, on epistemological grounds (such objectivity is not possible) and on disciplinary or aesthetic grounds (such objectivity is not desirable, and is not appropriate to humanities research or literary criticism).

Given these concerns about the possible irrelevance or inappropriateness of online sources to humanities research, and given the humanities' divided sense of where they stand with respect to "scientific" research methods, it may be useful to ask what seems distinctive about humanities research within the context of online research. That is, if we are interested in preserving what we feel are characteristically humanities approaches and methods, what activities must our online research tools take account of in order to become useful, functional parts of our research environment?

What is distinctive about humanities research?

For one thing, humanities research involves searches for unsystematic things. Particularly when working with older texts, it does not necessarily allow for a controlled vocabulary or spelling, not even to the degree that scientific vocabulary is fairly uniform. It also relies considerably on primary sources, and on a variety of interactions with them, both at the collection level and within the individual text. Just as in other fields, we need to be able to locate our sources, but we may wish to use a wider and more heterogeneous set of criteria in our searches (e.g. gender). At the collection level, we may be looking at large-scale patterns of words or textual structure, looking for specific topics, or looking for pieces of information (e.g. historical data). Within the individual text, humanities scholars are much more apt to be interested in particular verbal patterns and in questions of textual structure and genre.

Humanities research also involves issues of textual identity and authority, issues which rarely emerge in scientific sources. For this reason humanities sources require more detailed and stringent documentation. The reader needs to know the identity of the source text, and sometimes of the individual copy used in transcription. She also needs to know about editorial and transcriptional methods used in preparing the edition, since these can have a far more profound effect on the interpretation of the text than is likely in other disciplines.

We should also note that online research tools are much more prevalent in the sciences and more widely used, partly because of the current (though diminishing) dearth of online primary sources on which humanities research depends, and partly because research tools aren't necessarily modeled on humanities needs.

It's important to realize that the effectiveness of online tools depends more or less equally on the data, on the tools themselves, and on how the two interact. From the researcher's point of view, it can also help to understand something of both (albeit at a high level of abstraction). In what follows, I will discuss three aspects of online literary research tools--navigation/browsing, searching, and textual analysis--and describe the conceptual and technical foundations on which these things rest.

Browsing and navigation

"Browsing" is in a simple sense the electronic equivalent of "skimming" or "reading" a book. Just as with a book, if you look closely you can see some more complex behaviors underlying the simple, familiar activity; we aren't always, or usually, reading a text from the first word through to the last word, and for activities which require skipping around we need markers of various sorts. These behaviors depend on, and are reciprocally adapted to, various kinds of information which are "encoded" into the printed book in its formatting conventions. Without these conventions, the text is "all there" in a sense, but only in a very limited sense.

Compare an ordinarily formatted page of text with a page of scriptio continua which has no word breaks, sentence breaks, paragraph breaks, no special formatting at all. In the formatted text there are a number of cues being provided by the way the text is organized visually: relationships of sequencing, pairing, hierarchization and ranking of importance. We depend on these clues not just in order to read the text, in the fullest sense, but also to perform different kinds of reading: for instance, skimming a long text in search of the heading for a particular section, or interpreting the positional information given by a list or chart.

To browse an electronic text, we need essentially the same information about the structure of the text, but it needs to be provided in a dramatically different form. Its relation to the appearance of the text needs to be flexible, not fixed, since the electronic text needs to function on things like ebooks (with tiny screens) as well as on Web browsers and other kinds of interfaces.

If we think about more complex kinds of textual structures and how we read them, we can see even more clearly how important it is for the computer to have information about what the structure is and how it works. When we "browse" an electronic text, we may be asking not only that the computer deliver us a text whose parts look like what they are--paragraphs, headings, lists, notes, etc.--but also that it allow us to use the identity of these parts as a basis for more complex kinds of reading. We would not be surprised--it would in fact seem like a natural browsing activity--if our interface allowed us to go straight to a given line of poetry, by line number. But in order to do this, the computer needs to "know" what a line of poetry is, and whether in some cases it is made up of two half-lines, whether on the same line of type or not.
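
As a minimal sketch of what it means for the computer to "know" what a line of poetry is, the following assumes a text in which each verse line is explicitly marked. The element names (poem, lg, l, seg), the n attribute, and the sample verses are illustrative assumptions, not a description of any particular project's encoding.

```python
import xml.etree.ElementTree as ET

# Invented sample. <lg> marks a stanza, <l n="..."> a verse line, and
# <seg> a half-line; all of these names are assumptions for illustration.
SAMPLE = """
<poem>
  <lg n="1">
    <l n="1">First line of the stanza</l>
    <l n="2"><seg>A first half-line</seg> <seg>and a second half-line</seg></l>
  </lg>
  <lg n="2">
    <l n="3">Third line, in a new stanza</l>
  </lg>
</poem>
""".strip()


def find_line(xml_text, number):
    """Return the text of the verse line numbered `number`, or None."""
    root = ET.fromstring(xml_text)
    for line in root.iter("l"):
        if line.get("n") == str(number):
            # itertext() also gathers the text of nested half-lines;
            # split/join normalizes the whitespace between them.
            return " ".join("".join(line.itertext()).split())
    return None


print(find_line(SAMPLE, 2))
```

Because the line is identified by its markup rather than by its position on the page, the lookup behaves the same way whether the two half-lines were printed on one line of type or on two.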

Searching

Searching, unlike browsing, is not really an extension of what we think of as natural reading behavior, but it does stem from some sort of intuitive processes which we perform in real life, having to do with things like categorization and taxonomies. We understand the impulse behind saying "Find me all the small black birds that don't have a red spot on the wing", even if in practice we do this by looking by eye in Peterson's. So it's not surprising to find that a lot of the immediately intuitive functionality people seek in online texts is precisely this kind of activity.

There are essentially two ways of designing a search system. The first is to assume that you have no control over how the text is prepared, and to attempt to build a really smart search engine which can try to hypothesize things about the text based on formatting or word patterns. This is the approach used by most web search engines, since the web is a chaotic textual space about which one can assume and control nothing. However, it has its limitations. The other approach is to build the intelligence into the data (as well as, or instead of, the search engine). This approach yields much better search results, if the data is well prepared, since it allows you to encode and identify the important features you want to use in searching (such as modernized spellings, or bibliographic information, or structural components of the text).

There are several ways of doing this. The first is to use metadata 2 associated with the file or encoded directly in the file; this approach is well suited to bibliographic information and information about the preparation of the file itself. The second method is to use metadata recorded in a separate database, such as a finding aid; this approach is good for things like name authority files which will be used in multiple texts, or for metadata associated with non-textual things like images. The final method is to use tagging which is attached directly to specific parts of the text. This is best for things like structural information (pagination, collation, basic parts of the text) and for identifying things like names, dates, quotations, and other aspects of the text's content.
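
The third method, tagging attached to specific parts of the text, might be sketched as follows. The sample sentences are invented, and the element and attribute names (w, reg, name, key) and the key value are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Invented sample: reg records a modernized spelling, and key points to a
# (hypothetical) entry in a name authority file.
SAMPLE = """
<text>
  <p>She did <w reg="receive">receiue</w> a letter from
     <name key="person_001">the Dutchess</name>.</p>
  <p>They did not <w reg="receive">receaue</w> any reply from
     <name key="person_001">her Grace</name>.</p>
</text>
""".strip()

root = ET.fromstring(SAMPLE)


def find_by_regularized(root, modern_form):
    """Return the original spellings tagged with the given modern form."""
    return [w.text for w in root.iter("w") if w.get("reg") == modern_form]


def find_by_name_key(root, key):
    """Return every way the person identified by `key` is referred to."""
    return [" ".join(n.text.split())
            for n in root.iter("name") if n.get("key") == key]


print(find_by_regularized(root, "receive"))
print(find_by_name_key(root, "person_001"))
```

The search code itself is trivial; the intelligence lies in the data, which is the point of this second approach to search design.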

In all three cases, the most crucial thing is not the comprehensiveness of the information included, but rather the precision with which it is recorded. It is essential to define a careful, systematic taxonomy of information with categories that do not overlap or contradict one another, and to use a controlled vocabulary.

Textual Analysis

Textual analysis is probably one of the oldest uses of computers in literary study and one to which they are extremely well suited. However, early textual analysis tools tended to be designed by and for computer scientists, and hence had intimidating user interfaces. Also, since there didn't exist a large body of online texts for analysis, doing this kind of research meant preparing your own data, being something of a geek; it forced your time and attention as a researcher away from the kinds of larger questions literary scholars address, and onto the analysis itself. As a result, although there has been a fair amount of computer-assisted textual analysis, it hasn't had a large impact on literary studies more broadly, despite predictions that Computer Criticism might be the next big trend in literary study 3.

With the rise of online texts as a normal, everyday fact of literary study, the situation has changed; people aren't going to electronic texts because they're interested in word-counting, but because it's part of their ordinary textual research. When they start working with a text, they may find (no matter what kind of reading they're doing) that there are forms of textual analysis which--far from being arcane statistical obfuscations--correspond to the "natural" things one wants to know about texts during the course of a reading.

Some of these are so straightforward they scarcely deserve the term "textual analysis": for instance, one might well ask whether a given word appears elsewhere in the text, or in other texts. From there one might go on to ask a series of more specific questions: how often does it occur? how does its rate of occurrence vary from text to text, and in what kinds of texts does it appear most frequently? for more frequently occurring words, where in the text do they tend to occur? what other words do they most often occur with? what is the full vocabulary profile for the text I'm reading now? how does it compare with that of other texts in the same genre? in other genres? by men?

These are all simple questions whose answers--if we could easily get them--wouldn't detour us by much from the course of our reading, but which would help give us a much more detailed and nuanced sense of how language functions within and across texts. Adding the variable of genre, chronology, geography, or other factors, one could harness such observations to larger questions of cultural history and meaning.
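
Two of the questions above (how often does a word occur, and where in the text?) can be sketched in a few lines. The tokenizing rule is deliberately crude, and the two sample "texts" and their titles are invented for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def rate_per_thousand(word, tokens):
    """Occurrences of `word` per 1,000 tokens."""
    if not tokens:
        return 0.0
    return 1000 * tokens.count(word) / len(tokens)


def positions(word, tokens):
    """Relative positions (0.0 = start of text) at which `word` occurs."""
    return [i / len(tokens) for i, t in enumerate(tokens) if t == word]


# Invented sample texts; real use would load full transcriptions.
texts = {
    "text_a": "The sea was calm and the sea was wide",
    "text_b": "No sea in sight only the dry road",
}

for title, text in texts.items():
    tokens = tokenize(text)
    print(title, rate_per_thousand("sea", tokens), positions("sea", tokens))
```

Expressing frequency as a rate rather than a raw count is what makes comparison across texts of different lengths meaningful.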

The more sophisticated kinds of textual analysis, such as would be required for authorship attribution studies, obviously require both more complex tools and also more statistical expertise on the part of the user. However, for scholars and students engaged in literary criticism or research, textual analysis tools need to reflect precisely the kinds of intuitive questions about language and linguistic patterns that people ask all the time. In essence, they need to help elucidate and represent the linguistic relationships within and between texts, whether having to do with word usage, with textual variants, or with textual structure.

Some tools and possibilities

The most obvious and simple tools, and probably the first to show up at sites like Women Writers Online, are those which assist in simple word usage studies. Some basic components include a unique word list for the text indicating the extent of its vocabulary (preferably with lemmatization to group different forms of the same word), a frequency analysis of the text's vocabulary, and the ability to compare these with the same data from other texts or collections, and to sort and process these comparisons by factors like authorship, date, genre, etc. Along with these, it could be very helpful to be able to display the results of such comparisons intuitively.
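
A minimal sketch of these basic components follows: a lemmatized word list with frequencies, and a comparison between two texts. The lemma table is a tiny hand-built stand-in (an assumption, not a real linguistic resource), and the sample sentences are invented.

```python
import re
from collections import Counter

# Hypothetical, hand-built lemma table; a real project would need a much
# larger resource suited to the period and spelling of its texts.
LEMMAS = {"loves": "love", "loved": "love", "writes": "write", "wrote": "write"}


def word_frequencies(text):
    """Count word forms after grouping them under their lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(LEMMAS.get(t, t) for t in tokens)


def compare(freq_a, freq_b):
    """Return (shared words, words only in a, words only in b), sorted."""
    a, b = set(freq_a), set(freq_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)


first = word_frequencies("She loved and wrote; she writes of love")
second = word_frequencies("He wrote of war and loves war")

print(len(first), "distinct words in the first text")
print(compare(first, second))
```

Sorting or filtering such comparisons by authorship, date, or genre would depend on metadata recorded with each text, which this sketch does not attempt.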

With time, there should also be more advanced tools which would still address what I think of as intuitive questions. These include, for instance, tools for collocation analysis, enabling the reader to look at the words which occur in proximity to a given word, to give a sense of word clustering. They might also include context-sensitive tools which allow the reader to use the structure of the document in her analysis: for instance, to look specifically at the vocabulary of prologues, and to compare it to that of epilogues. Another possibility would be tools that can exploit parallelism and alignment between two or more documents, as in the case of multiple witnesses of the same work. The most sophisticated of these tools right now is Collate, which can take two or more texts and generate a full report on the variants between them (either word variants or spelling variants).
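
Collocation analysis, in its simplest form, can be sketched as counting the words that fall within a fixed window around each occurrence of a chosen word. The window size of two words and the sample sentence are arbitrary assumptions; this is an illustration of the idea, not of any existing tool.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each `node` word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Neighbours on both sides, excluding the node word itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented sample sentence.
sample = "the cold sea and the grey sea met the cold shore"
print(collocates(sample, "sea").most_common(3))
```

A more serious tool would discount very common words and apply a statistical measure of association, which is where the advanced processing mentioned below comes in.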

Finally, there will be truly advanced tools which could be given a more intuitive interface for general use. Examples of these are tools for comparative word frequency analysis and collocation analysis, with advanced statistical processing, and more complex tools for studying and visualizing word clustering and usage patterns, including control over how the text is segmented for sampling purposes.

Conclusion

What online texts do for us, if they do nothing else, is remind us of the existence of media, whether paper or electronic, and of the work that goes into producing a text as embodied in a particular medium. We are so accustomed to producing and using conventional texts that we take for granted the specialized mechanisms on which their user interface depends: indexes, standard textual structures, formatting conventions, editorial practices, and the like. In becoming equally adept users of online texts, we need to gain the same familiarity with the particularities of this medium and the means by which online texts are instantiated for us as readers. We shouldn't be surprised if this process takes some time, and I hope that presentations like this one can at least start the process off.

References

1. Roseanne Potter, Literary Computing and Literary Criticism: Theoretical and Practical Essays on Theme and Rhetoric (University of Pennsylvania Press, 1989), xvii.

2. Metadata is, in effect, "data about the data": that is, additional information associated with an electronic file which documents things like bibliographic information, revision history, file format, and the like.

3. Potter, xvi.