=========================================================================
Date: 12 September 1992 08:17:57 CDT
From: "Wendy Plotkin (312) 413-0331"
To:
Subject: CETH Seminar in Textual Analysis

A Report on CETH Seminar on Textual Analysis
Princeton University
August 9-21, 1992

TEI was prominently featured at the first seminar on textual analysis sponsored by the Center for Electronic Texts in the Humanities (CETH). CETH was established in late 1991 by Rutgers and Princeton Universities to act as a central organization assisting in the creation, dissemination, and use of electronic texts in the humanities. In addition to creating an inventory of machine-readable texts and making them available through the Internet, the Center is committed to offering educational seminars on various aspects of electronic texts.

The two instructors were Susan Hockey, CETH Director, and Willard McCarty, Assistant Director of the University of Toronto's Centre for Computing in the Humanities. Susan chairs the TEI Steering Committee and was formerly Director of the United Kingdom's Computers in Teaching Initiative (CTI) Centre for Textual Studies, located at Oxford University. Willard is a member of the TEI Verse work group, the founding editor of the _Humanist_, and is currently working in classical studies, in particular on Ovid's _Metamorphoses_.

An international group of librarians; literary, linguistic, and social science scholars; and computer and information scientists made up the class. Librarians and library graduate students attended from the Association of Research Libraries and from Arizona State, Columbia, Indiana, Iowa, Manitoba, Maryland, NYU, Princeton, Rutgers, Texas, and Wesleyan. Literary scholars and students from Spain, Virginia, New York State, North Carolina, and Wooster, Ohio ranged in their specialties from Old English to _Piers Plowman_ and modern English and Russian fiction.
Linguistic scholars from England and Canada were working in computational linguistics and discourse analysis. Social scientists from Israel, Missouri, and Illinois brought backgrounds in the history of Judaism and Zionism, Sri Lanka, modern Western social theory, and U.S. urban development; a Princeton art historian, in the Princeton Cyprus expedition. Computer scientists and a mathematician from Rutgers and Wisconsin brought familiarity with higher-level programming techniques and an interest in analyzing literature.

The seminar provided historical information on electronic texts, including their development in the U.S., Europe, and elsewhere. Existing resources such as ARTFL, the Dante Database, the Thesaurus Linguae Graecae, and the Oxford English Dictionary Version II were described. Robert Hollander, Professor of Comparative Literature at Princeton and creator of the Dante Database, demonstrated it. Toby Paff and Hannah Kaufmann of Princeton's Humanities Computer Center demonstrated ARTFL and the OED2. The need for additional effectively structured online dictionaries was expressed. Other electronic texts were made available for individual perusal (Intelex's _Pastmasters_, Georgetown's _The Phenomenology of the Mind_). Each of these pioneer projects includes textual analysis software with which to analyze the text; they are not aimed at the casual browser, in part due to copyright restrictions.

A number of issues were identified as being of continuing concern: the need for collaboration in the creation of electronic texts, ample space for their storage, their easy retrieval, and widespread access to texts. Better user interfaces, improved presentation of individual and parallel texts, hypertext (see below), and dynamic graphic displays were also deemed desirable.

Susan and Willard reviewed two textual analysis programs, one public domain and the other proprietary: TACT and Micro-OCP.
Their common features include the creation of alphabetical frequency lists of all words, concordances (all the occurrences of a word or phrase, in context), and collocations (co-occurrences of words and phrases).

Susan described several studies using stylistic analysis: Mosteller's and Wallace's work on _The Federalist Papers_, Morton's study of Greek texts and their disputed authorship by St. Paul, Kenny's work on _The Aristotelian Ethics_, and Burrows's work on Jane Austen. We also explored the statistical tests used to summarize the findings in these studies.

Beyond stylistic analysis, we looked at linguistic and lexical analysis, and means of using TACT to undertake simple analyses of this type were described. Linguistic and lexical analysis are important for studying language and developing printed and electronic dictionaries. Of even greater significance is their potential for improving information retrieval: as the rules of language are systematized in a manner that computers can understand, computers can apply these rules in interpreting new textual material. The complexity of the task was revealed in the demonstration of a program that automatically parses sentences. It was successful with one sentence but completely fell apart when faced with a particularly ambiguous phrase. (By the end of the workshop, we were all freely talking about the difficulty of "disambiguating" words.) Much additional development was desired in the automated recognition and analysis of "fuzzy" matches, names, concept relations, and figures of speech such as metaphors.

Computer assistance in creating critical editions was explored. Those interested in this topic had the opportunity to try out the Collate program prepared by the chair of the TEI Text Criticism work group, Peter Robinson.

Susan presented the TEI to the participants, many of whom were familiar with its general principles.
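(A technical aside for readers who have not used these programs: the word-frequency lists and concordances that TACT and Micro-OCP produce can be sketched in a few lines of modern code. The Python below is a hypothetical illustration of those two features, not code from either program.)

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Alphabetical frequency list of all words: [(word, count), ...]."""
    return sorted(Counter(tokenize(text)).items())

def concordance(text, keyword, width=3):
    """All occurrences of a word, each shown with `width` words of context."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok.upper()] + right))
    return hits

sample = "In the beginning was the word, and the word was with God."
# frequency_list(sample) begins [('and', 1), ('beginning', 1), ...]
# concordance(sample, "word")[0] == "beginning was the WORD and the word"
```

A collocation list could be built along the same lines, by counting which words fall within a fixed window around each occurrence of the keyword.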
TEI's advantages were described as its transportability across different platforms, the ease of sharing texts and their analyses, and superior analytical tools. Some of those present expressed reservations about the labor-intensiveness of marking up texts and a desire to analyze a "clean" text free of the interpretation implicit in any mark-up system. A major constraint is the lack of existing software to ease the mark-up process and to exploit the mark-up for analysis. Such software is being developed or is used for selective applications: PAT takes advantage of the OED's SGML mark-up, while DynaText, which was demonstrated to the group, uses SGML to create the links in its hypertext electronic books. These applications are presently too limited or too expensive for general use, and much additional effort is needed in this area. In spite of the reservations expressed, the need for a standard means of encoding and sharing texts seemed to be accepted.

About half of us had brought texts to analyze using these tools, and afternoons and late evenings in the dormitory basement were devoted to this task. Texts treated included the poetry of Canadian Margaret Avison, _Piers Plowman_ (B), Shakespeare's tragedies, "My Dinner With Andre," classified ads from modern British newspapers, English translations of French and Egyptian fiction, 15th-century Russian chronicles, Durkheim's works, the diary of Robert Knox (a 17th-century British sea captain's son imprisoned on Ceylon), and an early issue of _The Catholic Worker_, a progressive activist Catholic newspaper. One student created a program for Latin morphological analysis (and, taking a cue from Julius Caesar, proposed that Latin be adopted as Europe's common language).

The projects aptly demonstrated the challenges involved in analyzing electronic texts. In some cases, the difficulty lay in creating or obtaining access to an electronic text.
Several attempts at scanning were unsuccessful, especially on older books such as the _History of the British Royal Society_. Where technology was not a problem, obtaining publishers' approval to convert copyrighted texts such as _The Book of Mormon_ and _Lolita_ was.

Stylistic analysis required choosing characteristic features of style. For example, Nabokov's language in _Pale Fire_ was compared to the poetry of Alexander Pope and Robert Frost, which it parodied, raising questions about the appropriateness of semantic and lexical analysis as the basis of comparison. An interesting study of professional and amateur English translations of the French authors Theophile Gautier and Eugene Sue for "stylistic fingerprints" rested on too small a sample to support conclusions. A successful outcome occurred when a student studying Egyptian short stories and their translations found that TACT and Micro-OCP sped up the analysis he had begun years before without these tools.

Conceptual and thematic analysis called for hard decisions about the relationship between complicated concepts and the words and brief phrases that are the basic units of TACT and Micro-OCP. A study of the use of the words "sin," "redemption," and "atonement" in _The Book of Mormon_ revealed interesting information about the Mormons' connection of these concepts. Willard's description of his work with Ovid's _Metamorphoses_, augmented by his explanatory article in the _TACT Exemplar_, demonstrated how TACT could be used to unveil important themes. Although the results of these analyses were quite interesting, the need for additional development of analytical and auxiliary tools was widely agreed upon.

Don Walker, a member of the TEI Steering Committee and chair of the Association for Computational Linguistics (ACL), described the extensive work being done worldwide with electronic texts, especially in linguistics and lexical analysis.
The ACL Data Collection Initiative is amassing electronic transcriptions of written and spoken English, a portion of which is available on CD-ROM. The Network of European Corpora is developing standards to guide the individual European nations in the creation of language corpora. The Consortium for Lexical Research and the Linguistic Data Consortium have formed to enhance cooperation among the many projects in progress.

Hypertext, the newest frontier in electronic texts, was discussed and debated. Its advantages in integrating different sources of information were acknowledged. Its effect on the behavior of student and scholar is not yet understood, however. Will it stimulate the student or scholar to investigate sources other than those included in the hypertext package, or create the perception that the most important sources are the ones included in the package?

Elli Mylonas, chair of the TEI Performance Texts work group, gave a presentation on Pandora, a new text retrieval program she and others have developed to search the Thesaurus Linguae Graecae, and on Perseus. Perseus, developed by a consortium of universities and located at Harvard, is a multimedia educational Macintosh product that includes Greek/English texts from the classical period, a Greek/English lexicon, a classical encyclopedia, and a wealth of photographs of artifacts and sites. Elli also demonstrated two types of electronic hypertext fiction. The first type is represented by the Voyager Company's books, which tend to treat text in a traditional manner, although they include analytical and note-taking tools for those who want to analyze Sara Paretsky and the like. The second, Storyspace fiction, was created explicitly for the electronic medium and uses the interweaving allowed by hypertext as part of its literary strategy.

Ann Okerson of the Association of Research Libraries (ARL) described ARL's efforts in exploring the extent and advantages of electronic journals, newsletters, and bulletin boards.
Scholarly communication has sped up with the advent of the computer, and collaboration has become a greater possibility with the ease of the electronic medium. The ARL has produced the "Directory of Electronic Journals, Newsletters and Academic Discussion Lists," and Ann described her greater appreciation of publishers' efforts after completing this project.

Finally, Andreas Bjorklind of Sweden offered a presentation on Wide Area Information Servers (WAIS), the new communication system that allows individuals to search and retrieve electronic databases across the world.

The seminar offered a great variety of information about electronic texts and textual analysis, as well as a relaxed setting in which to study. It was enlightening to learn how many universities and libraries are already involved in offering and analyzing electronic texts. Many of the scholars attending were involved in establishing humanities computing centers or services within their institutions, libraries, or departments. The professional and personal relationships established, the understanding gained of textual analysis techniques, and the appreciation of the need for additional hardware and software for more sophisticated analysis were the highlights of the session.

=========================================================================
Date: 12 September 1992 16:04:11 CDT
From: "Wendy Plotkin (312) 413-0331"
To:
Subject: CETH Seminar Summary: Correction

Apologies to James Campbell, Chair of Electronic Information Services at the University of Virginia's Alderman Library*, for my leaving out mention of him in the list of libraries represented. Jim contributed a great deal to the workshop through his familiarity with available electronic textual resources, and showed a strong interest in the TEI.

*Jim is also North Europe Bibliographer.