The Independent Header

Many libraries, text repositories, research sites and related institutions collect bibliographic and documentary information about machine readable texts without necessarily collecting the texts themselves. Such institutions may thus want access to the header of a TEI document without its attached text in order to build catalogues, indexes and databases that can be used by people to locate relevant texts at remote locations, obtain full documentation about those texts, and learn how to obtain them. This chapter of the Guidelines describes a set of practices by which the headers of TEI documents can be extracted from those documents and exchanged as freestanding independent TEI documents to remote locations. Headers exchanged independently of the documents they describe are called independent headers. The responsibility for this chapter was borne by the Working Committee for Text Documentation, whose members are listed in the front matter. The chapter was drafted by Richard Giordano.

This chapter outlines practices recommended for encoders (especially those responsible for the documentation of text) when creating independent headers to be distributed, and specifies the set of recommended elements that should be included in the independent header. Of interest to librarian cataloguers who may receive independent headers from remote sites, it also discusses the relationship between the elements of TEI headers and MARC tags, in order to facilitate the cataloguing of these headers or the loading of independent headers into local MARC-based bibliographic databases. This chapter does not describe how to create a header. Guidance on the creation of headers and descriptions of each element in the header can be found in chapter .

Definition and Principles for Encoders

An independent header is a header extracted from a TEI text that can be exchanged as an independent document between libraries, archives, collections, projects, and individuals. The file description of the independent header (enclosed by the fileDesc element) can be used to generate bibliographic records. The profile description, encoding description, and revision history (encoded by the profileDesc, encodingDesc, and revisionDesc elements) can form part of a bibliographic description or, more appropriately, be used as an attached codebook that fully documents the analysis of the text and how it was encoded. Thus, the independent header can serve as the primary means by which libraries, archives, related repositories, research projects, and individual researchers can obtain bibliographic, descriptive, and full documentary information on machine-readable texts that reside in remote locations.

The structure of an independent header is exactly the same as that of a teiHeader attached to a document, and can therefore be validated using the same document type definition (DTD). In practice, this means that a teiHeader and its DTD can be extracted from a TEI document and shipped to a receiving institution with little or no change. However, some fields that are listed as optional in the header are listed as recommended for the independent header. For this reason, this chapter should be consulted in connection with any plan to send headers as independent documents to remote locations.

When deciding which information to include in the independent header, and the format or structure of that information, the following should be kept in mind:

The independent header should provide full bibliographic information on the encoded text, its source, where the text can be located, and any restrictions governing its use.

The header should contain useful information about the encoding of the text itself. In this regard, it is highly recommended that the encoding description be as complete as possible. The Guidelines do not require that the encoding description be included in the header (since some simple transcriptions of small items may not require it), but in practice the use of a header without an encoding description would be severely limited.

The independent header should be amenable to automatic processing, particularly for loading into databases and for the creation of publications, indexes, and finding aids, without undue editorial intervention on the part of the receiving institution. For this reason, two recommendations are made regarding the format or structure of the header: first, where there is a choice between a prose content model and one that contains a formal series of specialized elements, wherever possible and appropriate the specialized elements should be preferred to unstructured prose. For instance, the source description can contain either a free-prose citation (tagged bibl or even p) or else a biblStruct element, which provides a more rigorous structure for the bibliographic information (see examples in section ). The more structured biblStruct element is more suitable for automatic processing, and is therefore recommended over the less structured alternatives whenever the header is to be exchanged as an independent header. Second, with respect to corpora, information about each of the texts within a corpus should be included in the overall corpus-level teiHeader. That is, source information, editorial practices, encoding descriptions, and the like should be included in the relevant sections of the corpus teiHeader, with pointers to them from the headers of the individual texts included in the corpus. There are three reasons for this recommendation: first, the corpus-level header will contain the full array of bibliographic and documentary information for each of the texts in a corpus, and thus be of great benefit to remote users searching for particular texts, who may have access only to the independent header; second, such a layout is easier for the coder to maintain than searching for information throughout a text; and third, generally speaking, this practice results in greater overall consistency, especially with respect to bibliographic citations.

Required and Recommended Tags

The richness and size of the header reflect the diversity of uses to which electronic texts conforming to these Guidelines will be put. It is not intended, however, that all of the elements recommended in this chapter be present in every header. As described in section , the TEI header allows for the provision of a very large amount of information concerning the text itself, its source, encodings, and revisions as well as detailed descriptive information that can be used by researchers in analysing the text. The amount of encoding will depend on the nature and intended use of the text. At one extreme, an encoder may expect that the header will be needed only to provide bibliographic information of the text adequate to local needs. At the other, wishing to ensure that their texts can be used for the widest range of applications, encoders will want to document as explicitly as possible both bibliographic and descriptive information in such a way that no prior or ancillary knowledge about the text is needed in order to process it. The header, in the latter case, will be very full, approximating to the kind of documentation often supplied in the form of a manual. Most texts will lie somewhere between these extremes; textual corpora in particular will tend toward the latter extreme.

There follows a list of the components of the header, in the order in which they are presented in chapter , together with an indication of their importance in constructing an independent header.

fileDesc: required. Some subelements are required, others optional or recommended:
  titleStmt: required; subelements are required or optional:
    title: required
    author: required, if known
    sponsor: optional
    funder: optional
    principal: required, if known
    resp: required, if known
      role and name: required, if known, when the responsibility is not an author, sponsor, funding body, or principal researcher. Details may be found in section .
  editionStmt: recommended
    edition: recommended
    resp: recommended
      role and name: recommended primarily to distinguish editions.
  extent: optional
  publicationStmt: required
    date: recommended
    publisher, distributor, or authority: required
    city: recommended
    address: recommended; prose is sufficient
    idno: recommended
    availability: recommended
  seriesStmt: optional
    title: required
    idno: recommended
    resp and name: optional
  notesStmt: recommended
  sourceDesc: required. As much information as possible should be provided to identify the source. The following tags are either required or recommended, but other tags not listed here should be used wherever applicable in order to provide an accurate identification of the source. In some instances, the biblFull tag is preferable to the biblStruct tag.
    biblStruct: recommended. For a full discussion of biblStruct, see section .
      analytic: required when the citation describes an item within a larger collection, such as an essay within a collection or an article in a journal, and is not an independent publication. If used, it should contain the following elements in this order:
        author: required, if known.
        title: required.
        editor: recommended.
      monogr: mandatory when applicable; this element should contain the following elements in this order:
        author: required, if known.
        title: required. The level attribute must be used to indicate whether this is the title of a book, journal, or series. It is highly recommended that the type attribute be used to distinguish the main title from subordinate, parallel, or other titles.
        All elements that indicate intellectual responsibility for a work, such as editor, are required, if known.
        imprint: required.
          city: required, if known.
          pubPlace: recommended when city is unknown.
          org: recommended.
          date: required. If the date is unknown, n.d. may be used.
        idno: recommended.
      series: required, if the item is part of a series.
        title: required, but type attribute is optional.
    scriptStmt: required for transcribed speech. See section .
    recordingStmt: mandatory when applicable:
      resp and name: recommended
      recording: recommended
      equipment: recommended
      broadcast: recommended
      comment: optional

encodingDesc: very highly recommended, especially for projects, collections, or corpora. If the encodingDesc element is used, it is recommended that it contain one or more of the following elements, rather than a prose description. See section .
  projectDesc: optional
  samplingDecl: optional
  editorialDecl: recommended; it is also recommended that the editorial declaration make use of the specialized elements defined in section , rather than only consisting of prose paragraphs. Prose may of course be used in addition to these elements for material otherwise not handled.
  tagsDecl: recommended
  refsDecl: optional in general, but recommended if a standard referencing system is built into the encoded works. Section describes three different methods for documenting the referencing system: the prose method, the stepwise method, and the milestone method. No preference is expressed for one type of method over another, since this depends on the convenience of the coder and the likely efficiency of the particular software applications envisaged for the text. Only one method can be used within a single refsDecl element. If a text uses both hierarchical and milestone tagging, this can only be described in prose.
  classDecl: required where the scheme attribute has been used to identify the classification scheme or taxonomy used by any of the elements keywords, classCode, occupation or socecstatus. Even where this is not done, this element may usefully document the classification employed, either explicitly as a series of taxonomy elements, or implicitly by means of bibliographic citation.

profileDesc: recommended
  langUsage: recommended
    language: recommended
  textDesc: optional in most instances, but recommended when the encoder wants to provide a full description of the situation within which a text was produced or experienced, wants to characterize it in a relatively continuous manner (in contrast to discrete categories based on type or topic), and believes that this characterization of the text will be helpful to the understanding, analysis, or retrieval of this text by remote users. If a collection or corpus uses a pre-existing descriptive typology as its organizing principle, it is recommended that its components be re-expressed in terms of the parameters listed here. If the encoder believes that pre-existing text categories (such as a standard classification scheme) are sufficient, then it is recommended that the textClass element be used instead. See section for details and guidance.
    channel: required
    constitution: required
    derivation: required
    domain: required
    factuality: required
    interaction: required
    preparedness: required
    purposes: required
    purpose: required
  textClass: optional in most instances; this element may be used as an alternative or in addition to the textDesc element. textClass is recommended in the following situations:
      a standard text category, such as the Library of Congress List of Subject Headings or a Dewey Decimal Classification category, clearly describes the text
      situational parameters (or the demographic elements of the particDesc element) are used and a text category can be constructed by the encoder based on a recurring set of values for those parameters.
    See section for details and guidance. One or more of the following sub-elements can be used.
    keywords: recommended only if using a standard thesaurus such as the Library of Congress List of Subject Headings, a discipline-specific thesaurus, or a thesaurus defined explicitly in the header. In each case, the source should be indicated by the scheme attribute and defined in the classDecl element.
    classCode: recommended only if the text is categorized by an internationally accepted classification scheme, such as the Dewey Decimal or Universal Decimal classification schemes. The scheme should be indicated by the scheme attribute and defined in the classDecl element.
    catRef: optional in most instances, but recommended when a user-defined classification is in use. The scheme should be indicated by the scheme attribute and defined in the classDecl element.
  particDesc: optional, but recommended for spoken text when the encoder judges that such information is useful to remote users in the analysis of that text, and for both written and spoken text if such information is useful in the analysis of language usage. For details and guidance, see section .
    participant or particGroup: recommended. Though the substructure of both the participant and particGroup elements can be prose, in independent headers one or more of the following sub-elements providing more specific details should be used in preference. Users of these Guidelines are free to extend the set of headings listed below.
      name: recommended when the information is available
      birthDate: recommended when the information is available
      birthPlace: recommended when the information is available
      firstLang: recommended when the information is available
      langKnown: recommended when the information is available
      residence: recommended when the information is available
      education: recommended when the information is available
      affiliation: recommended when the information is available
      occupation: it is recommended that, where possible, the classification of the trade, occupation, or profession be derived from a standard classification or taxonomy, and that the source taxonomy be identified in the scheme attribute.
      socecstatus: it is recommended that, where possible, the encoding of social and economic status be derived from a standard classification or taxonomy, and that the source taxonomy be identified in the scheme attribute.
    particRelations: optional, but recommended where it is judged by the encoder that such information is important to the analysis of the text. If the particRelations tag is used, it is recommended that the special purpose relation element be used. See section .
  settingDesc: optional, but recommended when the encoder judges that this information is useful in the analysis of the text, particularly in the analysis of language usage.

revisionDesc: required in the independent header when available. It is recommended that the revisionDesc be encoded with a series of special purpose change elements with grouped name, what and date tags.

Header Elements and their Relationship to the MARC Record

This section offers some guidance to both cataloguers and bibliographic analysts who want to load TEI independent headers into a MARC-based retrieval system. Because there are variations in cataloguing practice across local sites, among bibliographic utilities (such as OCLC and RLIN), and differences in MARC usage in different countries, only tentative advice is possible. Note that the following examples are based on USMARC, not UNIMARC. For more information on UNIMARC, see Brian P. Holt, UNIMARC Manual (London, U.K.: IFLA Universal Bibliographic Control and International MARC Programme, British Library, 1987). For USMARC, see Walt Crawford, MARC for library use: understanding USMARC (Boston: G.K. Hall, 1989), USMARC format for bibliographic data, including content designation (Washington, D.C.: Library of Congress, 1987), and Deborah J. Byrne, MARC manual : understanding and using MARC records (Englewood, Colo.: Libraries Unlimited, Inc., 1991). UNIMARC offers cataloguers in different countries the opportunity to combine different national practices in a single MARC format, and is the preferred flavour of MARC records for distribution across national boundaries. The implementation of UNIMARC, however, will be affected by local practice and by guidelines offered by the bibliographic utilities. Though UNIMARC is a stable format, the guidelines for its implementation are not sufficiently known or stabilized to be included in this chapter.

There are some major differences between the MARC record and the TEI header that will cause problems for librarians trying to map from the TEI independent header to the MARC record. The most important difference between the MARC record and the TEI header is the function of each. Despite the efforts and claims of some members of the library community, the MARC record remains fundamentally an electronic version of the catalogue card, with the limitations of its model. The primary function of the MARC record when it was first designed in the mid-1960s was to allow for the electronic distribution of cataloguing records in support of card production. See Henriette Avram, The MARC Pilot Project (Washington D.C.: Library of Congress, 1968), p. 3. For discussion of the relationship between the MARC record and the catalogue card, see Michael Gorman, "After AACR2R: The Future of the Anglo-American Cataloging Rules," in Richard Smiraglia, ed., Origins, Content and Future of AACR2 Revised (Chicago: American Library Association, 1992). The catalogue card is a unitary record for a physical object containing complex bibliographic data of varying sorts. The catalogue card points to the physical object. The TEI header provides full bibliographic information (as would a card), as well as documentary non-bibliographic information that supports the analysis, either by humans or machines, of the electronic text documented by the header. Most of this analytical information, which is found in the profile description, encoding description, and revision history, has little direct provision in the MARC record, and if retained must be recorded in unstructured notes (5XX) fields. Notes fields usually do not have the structure to support machine retrieval and analysis, while properly formatted profile, encoding, and revision descriptions lend themselves to retrieval, can support machine processing (including analysis) and point directly to the electronic text attached to the header. Moreover, the electronic text points back to the relevant elements in the header.

Though this chapter offers some advice on where the profile, encoding, and revision descriptions might go in a MARC record, for practical reasons a repository might want to create a codebook from these divisions of the header, and create a MARC record from the file description only. The MARC record should contain a reference to the codebook.

Subfields (or delimiters) are indicated by the dollar sign ($).

MARC fields for the File Description

Note that there is no provision for the Main Entry (or USMARC 1XX fields) in the TEI header. The main entry should be constructed, using appropriate name authority control, by the cataloguer from information derived from the header that indicates who is primarily responsible for the intellectual content of the work. There is an author tag, but the form of the name will have to be checked by a cataloguer before the main entry is constructed.

titleStmt: corresponds to title and statement of responsibility fields in MARC, typically 240 (for uniform title) and 245 (for title proper).

title: 240 $a (for uniform titles) or 245 $a fields. Put any subtitles in 24X $b. Insert the constant [computer file] in the 24X $h (GMD) subfield.

The following elements belong in the 245 $c subfield (statement of responsibility):
    sponsor
    funder
    principal

Example:

    Two stories by Edgar Allan Poe: electronic version
    Poe, Edgar Allan (1809-1849)
    compiled by James D. Benson

This might be tagged in MARC as:

    245 Two stories by Edgar Allan Poe :$belectronic version ; compiled by $cJames D. Benson.

edition: 250 $a

name: 250 $b

Example:

    Student's edition, June 1987
    New annotation by George Brown

This might be tagged in MARC as:

    250 $aStudent's edition, June, 1987, new annotation by $bGeorge Brown.

extent: The extent is analogous to the Physical Description MARC field. Fields 256 or 3XX, depending on local practice, are appropriate.

date: 260 $c, and appropriate fixed fields.

publisher, distributor, or authority: 260 $b

city: 260 $a

Example:

    Columbia University Press
    New York
    1993

This may be tagged in MARC as:

    260 $aNew York :$bColumbia University Press, $c1993.
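The mappings above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the values are assumed to have been pulled out of the header already, the 260 punctuation simply copies the example above, and the order of subfields in the 245 field ($a, $h, $b) is an assumption that local cataloguing practice may override.

```python
def marc_field(tag, subfields):
    """Render one MARC field as text, marking each subfield with '$'."""
    return tag + " " + "".join(f"${code}{value}" for code, value in subfields)


def publication_260(city, publisher, date):
    """Map city, publisher and date to 260 $a, $b and $c.

    The punctuation (' :', ', ', '.') copies the 260 example in the text.
    """
    return marc_field(
        "260",
        [("a", f"{city} :"), ("b", f"{publisher}, "), ("c", f"{date}.")],
    )


def title_245(title, subtitle=None):
    """Map a title to 245 $a, the constant [computer file] to $h, a subtitle to $b.

    No ISBD punctuation is added here; the subfield order is an assumption.
    """
    subfields = [("a", title), ("h", "[computer file]")]
    if subtitle:
        subfields.append(("b", subtitle))
    return marc_field("245", subfields)


if __name__ == "__main__":
    print(publication_260("New York", "Columbia University Press", "1993"))
    print(title_245("Two stories by Edgar Allan Poe", "electronic version"))
```

The main entry (1XX) is deliberately left out, since the text above leaves its construction to the cataloguer.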

Local practice will determine appropriate MARC fields for address, idno, and availability. Restrictions on access should normally be placed in the 506 field, while the place where an item may be ordered will be located in a local notes (590) field. If local practice warrants it, the address of the publisher should be indicated in the 260 field.

The series title and the idno should be placed in the appropriate 490 fields (series untraced), if series authority checking needs to be done. Further, because the TEI tags do not differentiate between name, conference, or title series, there is no simple mechanical method for determining which MARC tag (410, 411, etc.) should be used. Safe practice would be to load any series statements into 490 fields, and then to conduct authority work on those fields.

notesStmt: These are usually reserved for general notes (500) fields.

The sourceDesc can be mapped to a source of data note (537 in RLIN MDF format) with the print constant Transcribed from: at the beginning of the note. The biblStruct itself can be mapped onto a 581 field (note on primary publication) using the ISBD format to separate each data element.

The scriptStmt, recordingStmt, recording, equipment, and broadcast elements do not easily map on to existing MARC fields, and should be put into a local notes field (590), treating the TEI tag introducing each component as a print constant at the head of the field in order to facilitate future local processing and retrieval.

Example:

    CNN Network News News Headlines 12 Jun 1991

This may be tagged in MARC thus:

    590 CNN Network News News Headlines 12 Jun 1991

Example:

    <recordingStmt>
    <recording type=video dur="10 mins">
    <equipment><p>Recorded from FM radio to chrome tape</p></equipment>
    <broadcast>
    <bibl><title>Britain's pleasure parade BBC Radio 4 FM Robin Day Margaret Thatcher The World Tonight</></series>
    <date>27 Nov 89</date>
    </bibl>
    </broadcast>
    </recording>
    </recordingStmt>

This can be tagged in MARC as:

    590 <recordingStmt> <recording type=video dur="10 mins"> <equipment><p>Recorded from FM radio to chrome tape</p></equipment><broadcast> <bibl><title>Britain's pleasure parade BBC Radio 4 FM Robin Day Margaret Thatcher The World Tonight</></series> <date>27 Nov 89</date> </bibl> </broadcast> </recording> </recordingStmt>

MARC Fields for the Encoding Description

The encodingDesc element provides useful information documenting the relationship between an electronic text and the source or sources from which it was derived. The projectDesc, samplingDecl, editorialDecl, and refsDecl elements provide details of decisions and rationales used about the process and purposes of the project, how text was sampled, principles of editorial practice, and how canonical references are constructed. The 567 field (notes on methodology) appears to be the most appropriate for this sort of information, though this field is normally intended for methodologies characterizing the social sciences. Practically, it would be wise to transcribe the projectDesc, editorialDecl, refsDecl, and classDecl elements directly as one or more 567 fields without intervention, with the element name at the beginning of each field, and any TEI tags left intact. This may facilitate any locally-developed retrieval software.
Example:

    <encodingDesc>
    <projectDesc><p>Texts were collected to illustrate the full range of twentieth-century spoken and written Swedish, written by native Swedish authors.</projectDesc>
    <samplingDecl><p>Sample of 2000 words taken from the beginning of the text.</p></samplingDecl>
    <editorialDecl>
    <interpretation><p>Errors in transcription controlled by using the SUC spell checker, v.2.4</p></interpretation>
    </editorialDecl>
    </encodingDesc>

This may be tagged in MARC as:

    567 <projectDesc><p>Texts were collected to illustrate the full range of twentieth-century spoken and written Swedish, written by native Swedish authors.</p>
    567 <samplingDecl><p>Sample of 2000 words taken from the beginning of the text.</p>
    567 <editorialDecl> <interpretation><p>Errors in transcription controlled by using the SUC spell checker, v. 2.4</p> </interpretation> </editorialDecl>

MARC Fields for the Profile Description

The profile description is the most problematical element in the TEI header for librarian cataloguers, because it provides a detailed description of the non-bibliographic aspects of the text, specifically the languages and sublanguages used, the situation in which it was produced, and the participants and their setting. This information can be used for retrieval purposes, or in machine-supported analysis of the text. The information can be loaded into a separate "codebook" and referenced by the MARC record. Little guidance can be offered on the appropriate MARC location for the elements that make up the profile description, except to suggest that if a site wants to load the profile description into a MARC record for archival and possibly retrieval purposes, then the contents of the profile description may be mapped into a locally-defined notes field (59X) with its TEI tags intact, as in the examples above.
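The practice recommended for the encoding and profile descriptions, one notes field per component with the TEI tags left intact, might be automated along the following lines. This is a minimal sketch, not a prescribed procedure: the function name notes_fields is hypothetical, and the header section is assumed to be available as well-formed XML (every tag explicitly closed), which is stricter than the SGML shown in the examples.

```python
import xml.etree.ElementTree as ET


def notes_fields(section_xml, marc_tag):
    """Return one MARC note line per child element, with its TEI tags intact.

    section_xml: a well-formed XML string such as an encodingDesc element.
    marc_tag:    the notes field to use, for example "567" or "590".
    """
    root = ET.fromstring(section_xml)
    lines = []
    for child in root:
        child.tail = None  # drop whitespace that follows the element
        markup = ET.tostring(child, encoding="unicode")
        # Collapse line breaks and runs of spaces so each field is one line.
        lines.append(f"{marc_tag} {' '.join(markup.split())}")
    return lines


if __name__ == "__main__":
    sample = """<encodingDesc>
      <projectDesc><p>Texts were collected to illustrate spoken Swedish.</p></projectDesc>
      <samplingDecl><p>Sample of 2000 words.</p></samplingDecl>
    </encodingDesc>"""
    for line in notes_fields(sample, "567"):
        print(line)
```

Because each field begins with the element's own start-tag, the element name serves as the print constant at the head of the field. The same function could be called with a profileDesc and a locally defined 59X tag.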
MARC fields for the Revision Description

The revision history (revisionDesc) logs all changes to a machine readable file whether or not these constitute a new edition of the file. Aside from the edition area of the MARC record, there are no MARC fields that deal specifically with changes of this sort. This information might be best included in a "codebook", rather than a MARC record. As before, the simplest way of approaching this problem is to include the material with its TEI tags intact as a locally-defined note (59X) in order to support future local processing.

Structure of the DTD for Independent Headers

The following document type definition is provided in file teishd2.dtd and constitutes the auxiliary DTD for independent headers, as described in this chapter.

    <!-- 26.8: File teishd2.dtd: Auxiliary DTD for Independent   -->
    <!-- Header                                                  -->
    <!-- Text Encoding Initiative: Guidelines for Electronic     -->
    <!-- Text Encoding and Interchange. DRAFT Version 2. 1992-93. -->
    <!-- Copyright (c) 1990, 1992, 1993 ACH, ACL, ALLC.          -->
    <!-- Permission to copy in any form is granted for use with  -->
    <!-- TEI-aware systems and applications, provided this       -->
    <!-- notice is included in all copies.                       -->
    <!-- These materials may not be altered; modifications to    -->
    <!-- these DTDs should be performed as specified in chapter  -->
    <!-- MD of the Guidelines.                                   -->
    <!-- These materials subject to revision. Current versions   -->
    <!-- available from the Text Encoding Initiative.            -->
    <!-- Embed entities for TEI generic identifiers.             -->
    <!ENTITY % TEI.elementNames system 'teigis2.ent' >
    %TEI.elementNames;
    <!-- Embed entities for TEI keywords.                        -->
    <!ENTITY % TEI.keywords.ent system 'teikey2.ent' >
    %TEI.keywords.ent;
    <!-- Define element classes for content models, shared       -->
    <!-- attributes for element classes, and global attributes.  -->
    <!-- (This all happens within the file teiclas2.ent.)        -->
    <!ENTITY % TEI.elementClasses system 'teiclas2.ent' >
    %TEI.elementClasses;
    <!-- Now declare the IHS element.                            -->
    <!ELEMENT ihs - O (teiHeader+) >
    <!ATTLIST ihs %a.global; >
    <!-- Finally, embed the TEI header and core tag sets.        -->
    <!ENTITY % TEI.header.dtd system 'teihdr2.dtd' >
    %TEI.header.dtd;
    <!ENTITY % TEI.core.dtd system 'teicore2.dtd' >
    %TEI.core.dtd;

The overall structure of a set of independent headers, encoded for interchange as a group, is thus:

    <!DOCTYPE ihs system 'teishd2.dtd'>
    <ihs>
    <teiHeader>
      <fileDesc> ... </fileDesc>
      <encodingDesc> ... </encodingDesc>
      <profileDesc> ... </profileDesc>
      <revisionDesc> ... </revisionDesc>
    </teiHeader>
    <teiHeader>
      <fileDesc> ... </fileDesc>
      <encodingDesc> ... </encodingDesc>
      <profileDesc> ... </profileDesc>
      <revisionDesc> ... </revisionDesc>
    </teiHeader>
    <teiHeader> ... </teiHeader>
    <!-- ... etc. -->
    </ihs>

In practice, headers might be stored in separate operating system files, to reduce redundant storage requirements; in this case, the top-level file for a typical document might have the following structure:

    <!DOCTYPE tei.2 system 'tei2.dtd' [
    <!ENTITY txt01 system 'text01.tei' >
    <!ENTITY hdr01 system 'text01.hdr' >
    ]>
    <tei.2>
    &hdr01
    &txt01
    </tei.2>

while that for a set of independent headers might have this structure:

    <!DOCTYPE ihs system 'teishd2.dtd' [
    <!ENTITY hdr01 system 'text01.hdr' >
    <!ENTITY hdr02 system 'text02.hdr' >
    <!ENTITY hdr03 system 'text03.hdr' >
    <!-- ... etc. -->
    ]>
    <ihs>
    &hdr01
    &hdr02
    &hdr03
    <!-- etc. -->
    </ihs>
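A top-level file for a set of independent headers stored in separate files can be generated mechanically from a list of file names. The sketch below is one possible way of doing so; the function name ihs_driver and the hdrNN entity naming are assumptions modelled on the structure described in this section, and each entity reference is closed with an explicit semicolon.

```python
def ihs_driver(header_files, dtd="teishd2.dtd"):
    """Return the text of a top-level file that embeds each header file as an entity.

    header_files: header file names in the order they should appear.
    dtd:          system identifier of the auxiliary DTD for independent headers.
    """
    # hdr01, hdr02, ... : one entity name per header file (naming is an assumption).
    names = [f"hdr{i:02d}" for i in range(1, len(header_files) + 1)]
    declarations = [
        f"<!ENTITY {name} system '{path}' >"
        for name, path in zip(names, header_files)
    ]
    references = [f"&{name};" for name in names]
    lines = [
        f"<!DOCTYPE ihs system '{dtd}' [",
        *declarations,
        "]>",
        "<ihs>",
        *references,
        "</ihs>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(ihs_driver(["text01.hdr", "text02.hdr", "text03.hdr"]), end="")
```

The function only builds the text of the file; it does not check that the named header files exist or that they are valid against the DTD.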