Definition and Principles for Encoders
An independent header is a header extracted from a TEI
text that can be exchanged as an independent document between libraries,
archives, collections, projects, and individuals. The file description
of the independent header (enclosed by the fileDesc element)
can be used to generate bibliographic records. The profile description,
encoding description, and revision history (encoded by the
profileDesc, encodingDesc, and revisionDesc
elements) can form part of a bibliographic description or, more
appropriately, be used as an attached codebook for
full documentation of the analysis of the text, and how it was encoded.
Thus, the independent header can serve as the primary means by which
libraries, archives, related repositories, research projects, and
individual researchers can obtain bibliographic, descriptive, and full
documentary information on machine-readable texts that reside in remote
locations.
The structure of an independent header is exactly
the same as that of a teiHeader attached to a document, and
can therefore be validated using the same document type definition (DTD).
In practice,
this means that a teiHeader and its DTD can be extracted from a
TEI document and shipped to a receiving institution with little or no
change. However, some fields that are listed as optional
in the
header are listed as recommended
for the independent header.
For this reason, this chapter should be consulted in connection with
any plan to send headers as independent documents to remote locations.
When deciding which information to include in the independent header,
and the format or structure of that information, the following should
be kept in mind:
- The independent header should provide full bibliographic information
on the encoded text, its source, where the text can be located, and any
restrictions governing its use.
- The header should contain useful information about the encoding of
the text itself. In this regard, it is highly recommended that the
encoding description be as complete as possible. The Guidelines do not
require that the encoding description be included in the header (since
some simple transcriptions of small items may not require it), but in
practice the use of a header without an encoding description would be
severely limited.
- The independent header should be amenable to automatic processing,
particularly for loading into databases and for the creation of
publications, indexes, and finding aids, without undue editorial
intervention on the part of the receiving institution. For this reason,
two recommendations are made regarding the format or structure of the
header: first, where there is a choice between a prose content model
and one that contains a formal series of specialized elements,
wherever possible and appropriate the specialized elements should
be preferred to unstructured prose. For instance, the source
description can contain either a free-prose citation (tagged
bibl or even p) or else a biblStruct element,
which provides a more rigorous structure for the bibliographic
information (see examples in section ). The more
structured biblStruct element is more suitable for automatic
processing, and is therefore recommended over the less structured
alternatives whenever the header is to be exchanged as an independent
header. Second, with respect to corpora, information about each of the
texts within a corpus should be included in the overall corpus-level
teiHeader. That is, source information, editorial practices,
encoding descriptions, and the like should be included in the relevant
sections of the corpus teiHeader, with pointers to them from
the headers of the individual texts included in the corpus. There are
three reasons for this recommendation: first, the corpus-level header
will contain the full array of bibliographic and documentary information
for each of the texts in a corpus, and thus be of great benefit to
remote users searching for particular texts, who may have access only to
the independent header; second, such a layout is easier for the coder to
maintain than searching for information throughout a text; and third,
generally speaking, this practice results in greater overall
consistency, especially with respect to bibliographic citations.
Required and Recommended Tags
The richness and size of the header reflect the diversity of uses to
which electronic texts conforming to these Guidelines will be put. It
is not intended, however, that all of the elements recommended in this
chapter be present in every header. As described in section , the TEI header allows for the provision of a very large
amount of information concerning the text itself, its source, encodings,
and revisions as well as detailed descriptive information that can be
used by researchers in analysing the text. The amount of encoding will
depend on the nature and intended use of the text. At one extreme, an
encoder may expect that the header will be needed only to provide
bibliographic information of the text adequate to local needs. At the
other, wishing to ensure that their texts can be used for the widest
range of applications, encoders will want to document as explicitly as
possible both bibliographic and descriptive information in such a way
that no prior or ancillary knowledge about the text is needed in order
to process it. The header, in the latter case, will be very full,
approximating to the kind of documentation often supplied in the form of
a manual. Most texts will lie somewhere between these extremes; textual
corpora in particular will tend toward the latter extreme.
There follows a list of the components of the header, in the order in
which they are presented in chapter , together with an
indication of their importance in constructing an independent header.
- fileDesc required. Some subelements are required,
others optional or recommended:
- titleStmt required; subelements are required or
optional:
- title required
- author required, if known
- sponsor optional
- funder optional
- principal required, if known
- resp required, if known
- role and name required, if known, when the
responsibility is not an author, sponsor, funding body, or
principal researcher. Details may be found in
section .
- editionStmt recommended
- edition recommended
- resp recommended
- role and name recommended primarily to
distinguish editions.
- extent optional
- publicationStmt required
- date recommended
- publisher, distributor,
or authority required
- city recommended
- address recommended: prose is sufficient
- idno recommended
- availability recommended
- seriesStmt optional
- title required
- idno recommended
- resp and name optional
- notesStmt recommended
- sourceDesc required. As much information as possible
should be provided to identify the source. The following tags are
either required or recommended, but other tags not listed here should be
used wherever applicable in order to provide an accurate identification
of the source. In some instances, the biblFull tag is
preferable to the biblStruct tag.
- biblStruct recommended. For a full discussion
of biblStruct, see section .
- analytic required when the citation describes an item
within a larger collection, such as an essay within a
collection or an article in a journal, and is not an
independent publication. If used, it should contain the
following elements in this order:
- author required, if known.
- title required.
- editor recommended.
- monogr mandatory when applicable; this element should
contain the following elements in this order:
- author required, if known.
- title required. The level attribute must
be used to indicate whether this is the title of a book,
journal, or series. It is highly
recommended that the type attribute be used
to distinguish the main title from subordinate, parallel,
or other titles. All elements that indicate intellectual
responsibility for a work, such as editor, are
required, if known.
- imprint required.
- city required, if known.
- pubPlace recommended when city is
unknown.
- org recommended.
- date required.
If the date is unknown, n.d. may be used.
- idno recommended.
- series required, if the item is part of a series.
- title required, but type attribute
is optional.
- scriptStmt required for transcribed speech.
See section .
- recordingStmt mandatory when applicable:
- resp and name recommended
- recording recommended
- equipment recommended
- broadcast recommended
- comment optional
- encodingDesc very highly recommended, especially
for projects, collections, or corpora.
If the encodingDesc element is used, it is recommended
that it contain one or more of the following elements, rather
than a prose description. See section .
- projectDesc optional
- samplingDecl optional
- editorialDecl recommended; it is also recommended
that the editorial declaration make use of the specialized
elements defined in section , rather than
only consisting of prose paragraphs.
Prose may of course be used in addition to these
elements for material otherwise not handled.
- tagsDecl recommended
- refsDecl optional in general, but recommended if
a standard referencing system is built into the encoded works.
Section describes three different methods
for documenting the referencing system: the prose method,
the stepwise method, and the milestone method. No preference
is expressed for one type of method over another, since this
depends on the convenience of the coder and the likely
efficiency of the particular software applications envisaged for
the text. Only one method can be used within a
single refsDecl element. If a text uses both
hierarchical and milestone tagging, this can only be described
in prose.
- classDecl required where the scheme
attribute has been used to identify the
classification scheme or taxonomy used by any of the elements
keywords, classCode, occupation or
socecStatus. Even where this is not done, this element
may usefully document the
classification employed, either explicitly as a series of
taxonomy elements, or implicitly by means of
bibliographic citation.
- profileDesc recommended
- langUsage recommended
- language recommended
- textDesc optional in most instances, but recommended
when the encoder wants to provide a full description of the
situation within which a text was produced or experienced,
characterize it in a relatively continuous manner (in contrast
to discrete categories based on type or topic), and believes
that this characterization of the text will be helpful to the
understanding, analysis, or retrieval of this text by remote
users. If a collection or corpus uses a pre-existing descriptive
typology as its organizing principle, it is recommended that
its components be re-expressed in terms of the parameters listed
here. If the encoder believes that pre-existing text categories
(such as a standard classification scheme) are sufficient, then
it is recommended that the textClass element be used
instead. See section for details and guidance.
- channel required
- constitution required
- derivation required
- domain required
- factuality required
- interaction required
- preparedness required
- purposes required
- purpose required
- textClass optional in most instances; this element may
be used as an alternative or in addition to the textDesc
element. textClass is recommended in the following
situations:
- a standard text category, such as the Library of Congress
List of Subject Headings or a Dewey Decimal Classification
category, clearly describes the text
- situational parameters (or the demographic elements of
the particDesc element) are used and a text category
can be constructed by the encoder based on a recurring set of
values for those parameters.
See section for details and guidance. One
or more of the following sub-elements can be used.
- keywords recommended only if using a standard
thesaurus such as the Library of Congress List of Subject
Headings, a discipline-specific thesaurus, or a thesaurus
defined explicitly in the header. In each case, the source
should be indicated by the scheme attribute and
defined in the classDecl element.
- classCode recommended only if the text is
categorized by an internationally accepted classification
scheme, such as the Dewey Decimal or Universal Decimal
classification schemes. The scheme
should be indicated by the scheme attribute and
defined in the classDecl element.
- catRef optional in most instances, but recommended
when a user-defined classification is in use. The scheme
should be indicated by the scheme attribute and
defined in the classDecl element.
- particDesc optional, but recommended for spoken
text when the encoder judges that such information is useful
to remote users in the analysis of that text, and for both
written and spoken text if such information is useful in the
analysis of language usage. For details and guidance,
see section .
- participant or particGroup recommended.
Though the substructure of both the participant
and particGroup elements can be prose, in independent headers
one or more of the following sub-elements providing more specific
details should be used in preference. Users of these
Guidelines are free to extend the set of headings listed below.
- name recommended when the information is available
- birthDate recommended when the information is available
- birthPlace recommended when the information is available
- firstLang recommended when the information is available
- langKnown recommended when the information is available
- residence recommended when the information is available
- education recommended when the information is available
- affiliation recommended when the information is available
- occupation it is recommended that, where possible,
the classification of the trade, occupation, or profession
be derived from a standard classification or taxonomy, and that
the source taxonomy be identified in the scheme
attribute.
- socecStatus it is recommended that, where
possible, the encoding of social and economic status be
derived from a standard classification or taxonomy, and that
the source taxonomy be identified in the scheme
attribute.
- particRelations optional, but recommended where it
is judged by the encoder that such information is important to the
analysis of the text. If the particRelations tag is used, it
is recommended that the special purpose relation element
be used. See section .
- settingDesc optional, but recommended when the
encoder judges that this information is useful in the analysis of the
text, particularly in the analysis of language usage.
- revisionDesc required in the independent header when
available. It is recommended that the revisionDesc be encoded
with a series of special purpose change elements with grouped
name, what and date tags.
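The list above can be turned into a simple automated check before a header is shipped. The following is a minimal sketch in Python, not part of the Guidelines: it assumes the header is available as well-formed XML without namespaces, and the choice of which paths to test (only those marked as unconditionally required above) is an illustrative assumption. The names REQUIRED_PATHS and missing_required are hypothetical.

```python
import xml.etree.ElementTree as ET

# Paths, relative to teiHeader, that the list above marks as required
# without qualification. This selection is an illustrative assumption.
REQUIRED_PATHS = [
    "fileDesc",
    "fileDesc/titleStmt",
    "fileDesc/titleStmt/title",
    "fileDesc/publicationStmt",
    "fileDesc/sourceDesc",
]

def missing_required(header_xml):
    """Return the required paths that are absent from a teiHeader string."""
    header = ET.fromstring(header_xml.strip())
    return [path for path in REQUIRED_PATHS if header.find(path) is None]

# Hypothetical sample header that lacks a sourceDesc.
SAMPLE = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Two stories: electronic version</title></titleStmt>
    <publicationStmt>
      <publisher>Columbia University Press</publisher>
    </publicationStmt>
  </fileDesc>
</teiHeader>
"""

print(missing_required(SAMPLE))  # ['fileDesc/sourceDesc']
```

A check of this kind complements validation against the DTD: the DTD enforces what is required in any header, while a list such as REQUIRED_PATHS can be extended locally to cover elements that are only recommended for independent headers.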
Header Elements and their Relationship to the MARC Record
This section offers some guidance to both cataloguers and
bibliographic analysts who want to load TEI independent headers into a
MARC-based retrieval system. Because there are variations in
cataloguing practice across local sites, among bibliographic utilities
(such as OCLC and RLIN), and differences in MARC usage in different
countries, only tentative advice is possible. Note that the following
examples are based on USMARC, not UNIMARC.
For more information on UNIMARC, see Brian P. Holt,
UNIMARC Manual (London, U.K.: IFLA Universal
Bibliographic Control and International MARC Programme, British Library,
1987). For USMARC, see Walt Crawford, MARC
for library use: understanding USMARC (Boston: G.K. Hall,
1989), USMARC format for bibliographic data,
including content designation (Washington, D.C.: Library of
Congress, 1987), and Deborah J. Byrne, MARC
manual : understanding and using MARC records (Englewood, Colo.:
Libraries Unlimited, Inc., 1991).
UNIMARC offers cataloguers in different countries the opportunity to
combine different national practices in a single MARC format, and is the
preferred flavour of MARC records for distribution
across national boundaries. The implementation of UNIMARC, however,
will be affected by local practice and by guidelines offered by the
bibliographic utilities. Though UNIMARC is a stable format, the
guidelines for its implementation are not sufficiently known or
stabilized to be included in this chapter.
There are some major differences between the MARC record and the TEI
header that will cause problems for librarians trying to map from the
TEI independent header to the MARC record. The most important
difference between the MARC record and the TEI header is the function of
each. Despite the efforts and claims of some members of the library
community, the MARC record remains fundamentally an electronic version
of the catalogue card, with the limitations of its model.
The primary function of the MARC record when it was
first designed in the mid-1960s was to allow for the electronic
distribution of cataloguing records in support of card production. See
Henriette Avram, The MARC Pilot Project
(Washington D.C.: Library of Congress, 1968), p. 3. For
discussion of the relationship between the MARC record and the catalogue
card, see Michael Gorman, "After AACR2R: The Future of the
Anglo-American Cataloging Rules," in Richard Smiraglia, ed.,
Origins, Content and Future of AACR2 Revised (Chicago:
American Library Association, 1992).
The catalogue card is a unitary record for a physical object containing
complex bibliographic data of varying sorts. The catalogue
card points to the physical object. The TEI header provides full
bibliographic information (as would a card), as well as documentary
non-bibliographic information that supports the analysis, either by
humans or machines, of the electronic text documented by the header. Most
of this analytical information, which is found in profile description,
encoding description, and revision history, has little direct provision
for it in the MARC record,
and if retained must be recorded in unstructured notes (5XX) fields.
Notes fields usually do not have the structure to support machine
retrieval and analysis, while properly formatted profile, encoding, and
revision descriptions lend themselves to retrieval, can support machine
processing (including analysis) and point directly to the electronic
text attached to the header. Moreover, the electronic text points back
to the relevant elements in the header.
Though this chapter offers some advice on where the profile,
encoding, and revision descriptions might go in a MARC record, for
practical reasons a repository might want to create a codebook from these
divisions of the header, and create a MARC record from the file
description only. The MARC record should contain a reference to the
codebook.
Subfields (or delimiters) are indicated by the dollar sign ($).
MARC fields for the File Description
Note that there is no provision for the Main
Entry (or USMARC 1XX fields) in the TEI header. The main
entry should be constructed, using appropriate name authority control,
by the cataloguer from information derived from the header that
indicates who is primarily responsible for the intellectual content of
the work. There is an author tag, but the form of the name
will have to be checked by a cataloguer before the main entry is
constructed.
- titleStmt corresponds to title and statement of
responsibility fields in MARC, typically 240 (for uniform title)
and 245 (for title proper).
- title 240 $a (for uniform titles) or 245 $a fields. Put
any subtitles in 24X $b. Insert the constant [computer file] in
the 24X $h (general material designation) subfield.
The following elements belong in the 245 $c subfield (statement of
responsibility):
- sponsor
- funder
- principal
Example:
Two stories by Edgar Allan Poe: electronic
version
Poe, Edgar Allan (1809-1849)
compiled by
James D. Benson
This might be tagged in MARC as:
245 Two stories by Edgar Allan Poe :$belectronic version ;
compiled by $cJames D. Benson.
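The title statement mapping can be sketched in code. The following Python fragment is an illustration under stated assumptions, not a prescribed conversion: it assumes an XML titleStmt without namespaces, treats any text after the first colon in the title as a subtitle for $b, and places sponsor, funder, and principal in $c. The " /" before $c follows common ISBD punctuation and is an assumption; the example above punctuates differently, and local practice should govern. The function name field_245 and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def field_245(title_stmt_xml, gmd=None):
    """Build a 245 field string from a titleStmt element (sketch only)."""
    stmt = ET.fromstring(title_stmt_xml.strip())
    # Collapse line breaks and runs of spaces inside the title.
    title = " ".join((stmt.findtext("title") or "").split())
    # Assumption: text after the first colon is a subtitle for $b.
    main, _, sub = title.partition(":")
    field = "245 $a" + main.strip()
    if gmd:
        field += " $h[" + gmd + "]"
    if sub.strip():
        field += " :$b" + sub.strip()
    # Sponsor, funder, and principal go to the statement of responsibility.
    names = [" ".join(el.text.split())
             for tag in ("sponsor", "funder", "principal")
             for el in stmt.findall(tag) if el.text]
    if names:
        field += " /$c" + ", ".join(names)
    return field + "."

# Hypothetical sample; the principal element is invented for illustration.
SAMPLE_TITLE_STMT = """
<titleStmt>
  <title>Two stories by Edgar Allan Poe: electronic
  version</title>
  <principal>James D. Benson</principal>
</titleStmt>
"""

print(field_245(SAMPLE_TITLE_STMT, gmd="computer file"))
```

Because the main entry (1XX) and the authorized form of names still require a cataloguer's judgement, output of this kind is best treated as a draft field for review rather than a finished record.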
- edition 250 $a
- name 250 $b
Example:
Student's edition,
June 1987
New annotation by
George Brown
This might be tagged in MARC as:
250 $aStudent's edition, June, 1987, new annotation by
$bGeorge Brown.
- extent. The extent is analogous to the
Physical Description MARC field. Fields 256
or 3XX, depending on local practice, are appropriate.
- date 260 $c, and appropriate fixed fields.
- publisher, distributor, or authority
260 $b
- city 260 $a
Example:
Columbia University Press
New York
1993
This may be tagged in MARC as:
260 $aNew York :$bColumbia University Press, $c1993.
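The 260 mapping is mechanical enough to sketch in a few lines. The Python fragment below is a hedged illustration: it assumes an XML publicationStmt without namespaces that uses the element names listed above (city, publisher or distributor or authority, date), and it reproduces the punctuation of the example above. The function name field_260 and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def field_260(pub_stmt_xml):
    """Build a 260 field string from a publicationStmt element (sketch only)."""
    stmt = ET.fromstring(pub_stmt_xml.strip())

    def first(*tags):
        # Return the normalized text of the first non-empty element found.
        for tag in tags:
            text = stmt.findtext(tag)
            if text and text.strip():
                return " ".join(text.split())
        return None

    place = first("city")
    name = first("publisher", "distributor", "authority")
    date = first("date")

    parts = []
    if place:
        parts.append("$a" + place)
    if name:
        # A colon precedes $b only when a place is present.
        parts.append((":" if place else "") + "$b" + name + ("," if date else ""))
    if date:
        parts.append("$c" + date)
    return "260 " + " ".join(parts) + "."

SAMPLE_PUB_STMT = """
<publicationStmt>
  <publisher>Columbia University Press</publisher>
  <city>New York</city>
  <date>1993</date>
</publicationStmt>
"""

print(field_260(SAMPLE_PUB_STMT))
# 260 $aNew York :$bColumbia University Press, $c1993.
```

The fixed fields that also depend on the date are not handled here; they would need separate treatment according to local practice.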
Local practice will determine appropriate MARC fields for
address, idno, and availability.
Restrictions on access should normally be placed in the 506 field,
while the place where an item may be ordered will be located in a local
notes (590) field. If local practice warrants it, the address of the
publisher should be indicated in the 260 field.
The series title and the idno should be placed in
the appropriate 490 fields (series untraced), if series authority
checking needs to be done. Further, because the TEI tags do not
differentiate between name, conference, or title series, there is no
simple mechanical method for determining which MARC tag (410, 411, etc.)
should be used. Safe practice would be to load any series statements
into 490 fields, and then to conduct authority work on those fields.
- notesStmt These are usually reserved for general notes
(500) fields.
The sourceDesc can be mapped to a source of
data note (537 in RLIN MDF format) with the print constant
"Transcribed from:" at the beginning of the note. The
biblStruct itself can be mapped onto a 581 field (note on
primary publication) using the ISBD format to separate each data
element.
The scriptStmt, recordingStmt, recording,
equipment, and broadcast elements do not easily map on
to existing MARC fields, and should be put into a local notes field
(590) treating the TEI tag introducing each component as a print
constant at the head of the field in order to facilitate future local
processing and retrieval.
Example:
CNN Network News
News Headlines
12 Jun 1991
This may be tagged in MARC thus:
CNN Network News
News Headlines
12 Jun 1991
Example:
Recorded from FM radio to chrome
tape
Britain's pleasure parade
BBC Radio 4 FM
Robin Day
Margaret Thatcher
The World Tonight
27 Nov 89
This can be tagged in MARC as:
Recorded from FM radio to chrome
tape
Britain's pleasure parade
BBC Radio 4 FM
Robin Day
Margaret Thatcher
The World Tonight
27 Nov 89
MARC Fields for the Encoding Description
The encodingDesc element provides useful information
documenting the relationship between an electronic text and the source or
sources from which it was derived. The projectDesc,
samplingDecl, editorialDecl, and refsDecl
elements provide details of decisions and rationales used about the
process and purposes of the project, how text was sampled, principles
of editorial practice, and how canonical references are constructed.
The 567 field (notes on methodology) appears to be the most appropriate
for this sort of information, though this field is normally intended
for methodologies characterizing the social sciences. Practically, it
would be wise to transcribe the projectDesc,
editorialDecl, refsDecl, and classDecl
elements directly as one or more 567 fields without intervention, with
the element name at the beginning of each field, and any TEI tags left
intact. This may facilitate the use of any locally developed retrieval software.
Example:
Texts were collected to illustrate the
full range of twentieth-century spoken and written Swedish,
written by native Swedish authors.
Sample of 2000 words taken from the
beginning of the text.
Errors in transcription controlled
by using the SUC spell checker, v.2.4
This may be tagged in MARC as:
567 Texts were collected to illustrate the
full range of twentieth-century spoken and written
Swedish, written by native Swedish authors.
567 Sample of 2000 words taken from the
beginning of the text.
567 Errors in transcription controlled
by using the SUC spell checker, v. 2.4
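The transcription described above, one 567 field per component with the element name leading and the TEI tags left intact, might be sketched as follows. This Python fragment is an illustration only: it assumes an XML encodingDesc without namespaces, it includes samplingDecl alongside the four elements named in the text because the example above does so, and the names METHOD_ELEMENTS and fields_567 are hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements to transcribe as 567 fields. Including samplingDecl follows
# the example above and is an assumption.
METHOD_ELEMENTS = ("projectDesc", "samplingDecl", "editorialDecl",
                   "refsDecl", "classDecl")

def fields_567(encoding_desc_xml):
    """Return one 567 field string per selected child of encodingDesc."""
    desc = ET.fromstring(encoding_desc_xml.strip())
    fields = []
    for child in desc:
        if child.tag in METHOD_ELEMENTS:
            # Serialize the element with its tags intact, then collapse
            # line breaks so that each field occupies a single line.
            markup = ET.tostring(child, encoding="unicode")
            fields.append("567 " + " ".join(markup.split()))
    return fields

# Simplified sample: prose is placed directly inside each element.
SAMPLE_ENCODING_DESC = """
<encodingDesc>
  <projectDesc>Texts were collected to illustrate the full range of
  twentieth-century spoken and written Swedish.</projectDesc>
  <samplingDecl>Sample of 2000 words taken from the beginning of
  the text.</samplingDecl>
</encodingDesc>
"""

for field in fields_567(SAMPLE_ENCODING_DESC):
    print(field)
```

Keeping the tags in the field content means that a later local process can recover the original element boundaries from the MARC record without consulting the header again.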
MARC Fields for the Profile Description
The profile description is the most problematical element in the TEI
header for librarian cataloguers, because it provides a detailed
description of the non-bibliographic aspects of the text,
specifically the languages and sublanguages used, the situation in which
it was produced, and the participants and their setting. This
information can be used for retrieval purposes, or in
machine-supported analysis of the text. The information can be loaded
into a separate codebook and referenced by the MARC
record. Little guidance can be offered on the appropriate MARC
location for the elements that make up the profile description, except
to suggest that if a site wants to load the profile description into a
MARC record for archival and possibly retrieval purposes, then the
contents of the profile description may be mapped into a locally-defined
notes field (59X) with its TEI tags intact, as in the examples
above.
MARC fields for the Revision Description
The revision history (revisionDesc) logs all changes to a
machine-readable file whether or not these constitute a new edition of
the file. Aside from the edition area of the MARC record, there are no
MARC fields that deal specifically with changes of this sort. This
information might be best included in a codebook,
rather than a MARC record. As before, the simplest way of approaching
this problem is to include the material with its TEI tags intact as a
locally-defined note (59X) in order to support future local processing.
Structure of the DTD for Independent Headers
The following document type definition is provided in file
teishd2.dtd and constitutes the auxiliary DTD for
independent headers, as described in this chapter.
%TEI.elementNames;
%TEI.keywords.ent;
%TEI.elementClasses;
%TEI.header.dtd;
%TEI.core.dtd;
The overall structure of a set of independent headers, encoded for
interchange as a group, is thus:
...
...
...
...
...
...
...
...
...
In practice, headers might be stored in separate operating system
files, to reduce redundant storage requirements; in this case, the
top-level file for a typical document might have the following
structure:
]>
&hdr01
&txt01
while that for a set of independent headers might have this structure:
]>
&hdr01
&hdr02
&hdr03