The writing system declaration or WSD is an
auxiliary document which provides information on the methods used to
transcribe portions of text in a particular language and script.
The primary responsibility for this chapter was borne by work
group TR 1 (character set issues), chaired by Harry Gaylord. The
original design of the writing system declaration and the first draft of
its documentation were created by Steven J. DeRose; revisions have been
made by Harry Gaylord and Syun Tutiya, among others.
For purposes of the WSD, a writing system is a given method
of representing a particular language, in a particular script or
alphabet; the WSD specifies one method of representing a given writing
system in electronic form. A single WSD thus links three distinct
objects:
the language in question
the writing system (script, alphabet, syllabary) used to write the
language
the coded character set, entity names, or transliteration scheme
used to represent the graphic characters of the writing system
Different natural languages thus have different writing system
declarations, even if they use the same script. Different methods used
to write the same language (e.g. Cyrillic or Latin encoding of
Serbo-Croatian), and different methods of representing the same script
in electronic form (e.g. different coded character sets such as ASCII or
EBCDIC, or different transliteration schemes) similarly must use
different writing system declarations.
This chapter describes first the overall structure of the WSD, and
then documents the specific elements used to document the natural
language, writing system, coded character sets, SGML entity names, and
transliteration schemes united by the WSD. The concluding sections
describe how the WSD is associated with different portions of a
document. Standard pre-defined TEI writing system declarations, which
should suffice for most uses, are also included. There follows a brief
description of how to create a new WSD on the basis of an existing WSD.
The final section provides a more formal discussion of the meaning of
a WSD than is provided elsewhere.
Overall Structure of Writing System Declaration
A writing system declaration is a distinct auxiliary document,
separate from any transcription for which it is used. It is encoded
using a single writingSystemDeclaration element, and
contains the following elements:
writingSystemDeclaration: declares the coded character set,
transliteration scheme, or entity set used to transcribe a given
writing system of a given language.
Attributes include:
name: gives a formal name for the writing system declaration.
date: gives the date on which the writing system declaration was last
revised.
language: identifies the language being described in the writing system
declaration.
script: contains a prose description of the script declared by a writing
system declaration.
direction: specifies one or more conventional directions in which a
language is written using a given script.
characters: contains a specification of the characters used in a
particular writing system to write a particular language, and of how
those characters are represented in electronic form.
note: (in a writing system) contains a note of any type.
All elements in the writing system declaration may bear either of the
following two attributes:
id: gives a unique identifier for the element.
lang: gives the language in which the content of the element is
written.
These attributes function in the same
way as the global id and lang attributes of the
main TEI DTD (although for technical reasons the latter is declared
differently): the former provides a unique SGML identifier for the
element, and the latter identifies the language in which the contents of
the element are expressed, using a code from ISO 639.
The overall structure of a writing system declaration is thus as
follows:
The attributes date and name are required on
the writingSystemDeclaration element. The date
attribute is used to specify the date on which the WSD was written or
last changed; this must be given in the format
yyyy-mm-dd, as defined by ISO 8601: 1988, Data elements and
interchange formats - Information interchange - Representation of
dates and times.
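The date format requirement lends itself to a mechanical check. The
following is a minimal sketch in Python, offered only as an
illustration: the helper name is_valid_wsd_date is hypothetical and is
not defined by the TEI DTD.

```python
import re
from datetime import date

# Hypothetical helper, not part of the TEI DTD: accepts only the
# yyyy-mm-dd form required for the date attribute.
_DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_valid_wsd_date(value: str) -> bool:
    """Return True if value is a real calendar date written as yyyy-mm-dd."""
    match = _DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects impossible dates such as 1993-02-30
    except ValueError:
        return False
    return True
```

A value such as 1993-04-02 is accepted, while 1993-4-2 (missing
leading zeros) and 1993-02-30 (no such day) are rejected.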
The name attribute is used to assign a formal name to the
writing system declaration, for references to it from elsewhere. It is
recommended, though not required, that the name be constructed as an
SGML formal public identifier; for purposes of writing system
declarations, this means it should follow the pattern of the following
examples:
-//TEI P2: 1993//NOTATION WSD for Modern English//EN
-//WWP 1993//NOTATION WSD for 17th-century English//EN
-//OTA 1991//NOTATION WSD for Old English//EN
-//GLDV 1997//NOTATION WSD Mittelhochdeutsch//DE
That is, the identifier should consist of
either the string -// (indicating that the organization
issuing the WSD is not registered with ISO) or the string +//
(indicating that the organization is so registered)
a short string identifying the issuing organization and optionally
the document which defines the WSD (or at least the issuing organization
and the date)
the string //
the keyword NOTATION, indicating that in SGML terms, the
writing system declaration functions as a non-SGML notation
a descriptive phrase indicating the contents of the WSD (which
may, as here, begin WSD for ...)
the string //
the ISO 639 code for the language in which the WSD is written
The other elements of the WSD are described in the following
sections.
The DTD for writing system declarations is included in file
teiwsd2.dtd; a writing system declaration will thus
begin with a document type declaration invoking that file:
The formal declaration of the writing system declaration is as
follows:
%TEI.wsdNames;
Identifying the Language
The language element is used to name the language associated
with the WSD. Its iso639 attribute gives the ISO standard
code for the language as defined by ISO 639: 1988, Code for the
representation of names of languages.
language: identifies the language being described in the writing system
declaration.
Attributes include:
iso639: gives the standard language code from ISO 639.
If the language in question is not included in the list in ISO 639,
the value of the attribute iso639 should be '' (the
empty string).
The language element should not be confused with the global
lang attribute; the element identifies the language whose
writing system is being documented, while the attribute identifies the
language in which the description is being written. A writing system
declaration for classical Greek, for example, which itself is written in
English, would have the value eng for the lang
attribute on the top-level element, and the value grc for the
iso639 attribute on the language element:
Classical Greek. This WSD documents the Beta
transcription code for classical Greek developed by the Thesaurus
Linguae Graecae of the University of California, Irvine.
Normally, the language described is a natural language; in some
cases, however, artificial languages, dialects, or other sublanguages
may be usefully regarded as a language and documented in a writing
system declaration. When a sublanguage is documented, a description of
the sublanguage should be included in the language element:
Japanese (specialized writing system for waka)
When a writing system declaration is prepared solely in order to
document a coded character set or entity set suitable for use with many
natural languages, the content of the language element should
be Various (or the equivalent in the language of the WSD):
Various
The language element is formally defined thus:
Describing the Writing System
The writing system itself is described in general terms using the
following elements:
script: contains a prose description of the script declared by a writing
system declaration.
direction: specifies one or more conventional directions in which a
language is written using a given script.
Attributes include:
chars: indicates the order in which characters within a line are
conventionally presented in this writing system.
Suggested values include:
left to right
right to left
top to bottom
bottom to top
lines: indicates the order in which lines conventionally follow each
other in this writing system.
Suggested values include:
top to bottom
bottom to top
left to right
right to left
The script element contains a prose description of the
script, alphabet, syllabary, or other system of writing used to write
the language in question. The direction element indicates the
direction(s) in which the script is conventionally written. Both these
elements are provided for the sake of human readers; neither is likely
to be suited to machine processing without human intervention.
The Latin alphabet conventionally used to write English, for example,
might be described thus:
Latin alphabet (with diacritics for loan words)
The chars and lines attributes are used to
indicate the direction in which characters within a line, and lines on
the page, may legitimately be written using the script in question. If
more than one direction is possible, the direction element may
repeat or its attributes may be given more complex values. A script
written vertically top to bottom, with lines arranged either left to
right or right to left, for example, might be declared either thus:
or thus:
In very complex cases, the attributes may be given prose values:
or the element may be omitted entirely (in which case experts on the
script should be consulted for advice on proper processing):
Japanese (mixture of kanji, hiragana, katakana)
It should be noted that the direction element describes
conventional display only: all scripts are subject to unusual treatment
for aesthetic or other reasons, and such unusual treatment need not be
foreseen here. (The Latin alphabet, for example, although
conventionally written left-to-right, top-to-bottom, can be set
vertically in signs or in other special cases.) Unusual methods of
arranging the text on a page are best documented within the document
instance by means of the global rend attribute.
The script and direction elements are declared
thus:
Documenting the Character Set and Its Encoding
Base Components of the WSD
The characters or graphic symbols of the writing system are
documented in the characters element of the WSD. This
documentation can take any of the following forms:
reference to an international standard, national standard, or
private coded character set
reference to a public set of SGML entities
reference to another WSD which documents the same script and
the same methods of representing it electronically
formal declaration of each graphic unit in the writing system
a combination of the above: reference to one or more standard
coded character sets, entity sets, or writing system declarations,
followed by individual declaration of all exceptions
The coded character sets, entity sets, and external WSDs referred to
are called the base components of the writing system
declaration. The base components of a WSD are declared within the
characters element using the following elements:
characters: contains a specification of the characters used in a
particular writing system to write a particular language, and of how
those characters are represented in electronic form.
codedCharSet: identifies a public or private coded character set which
is used as a basic component of a writing system declaration.
baseWsd: identifies a writing system declaration whose mappings among
characters, forms, entity names, and bit patterns are to be incorporated
(possibly with modifications) in this writing system declaration.
entitySet: identifies a public or private entity set whose mappings
between entity names and characters are to be incorporated (perhaps with
modifications) into this writing system declaration.
The elements codedCharSet, baseWsd, and
entitySet are all members of the class
baseStandard and inherit from it the following attributes:
gives the normal citation form for the standard being referred
to.
indicates the authority responsible for issuing the standard being
referred to: the TEI, the International Organization for Standardization
(ISO), a national body, or a private body.
Legal values are:
the base writing system declaration is a standard WSD issued by
the Text Encoding Initiative
the character set or entity set was issued by ISO
the character set or entity set was issued by a national standards
body
the writing system declaration, character set, or entity set was
issued publicly by a private organization or project
the writing system declaration, character set, or entity set has
not been publicly issued by any organization; it is specific to an
individual text or project
Some simple examples of the use of these elements follow:
The base components identify the set of characters used in the
writing system, and further specify, for each character, the string(s)
of bytes and entity names used to encode it in the text. This
information may be modified by further information given within the
exceptions element, as described below in the section "Exceptions in
the WSD".
The elements for identifying the base components of the writing
system declaration are declared thus:
Exceptions in the WSD
The exceptions element contains definitions for any
character which differs in any respect from the specifications contained
in the base components of the WSD. If no base components are named,
then every character in the writing system must be defined explicitly.
The documentation for each character in the writing system indicates
at least the following:
the string of bytes used to represent the character
whether the character is a letter, a punctuation mark, a
diacritical mark, or falls into some other class
a brief conventional name or description of the character
any standard or local entity names used for the character
the position of the character in the Universal Character Set (UCS)
defined by ISO 10646, if known
the position of the character's form in the tables prepared by the
Association for Font Information Interchange
In addition, images of the character encoded in a graphics format or
other notation may be associated with the character as internal
or external figures.
This information is encoded using the following elements:
exceptions: documents ways in which a writing system declaration
differs from the coded character sets, base writing system
declarations, and entity sets which form its bases.
character: defines one unit in a writing system, supplementing or
overriding information provided in the base coded character sets,
writing system declarations, and entity sets.
Attributes include:
describes the function of the character using a prescribed
classification.
Legal values are:
character is used in writing words (lexical items) of the language
(includes members of syllabaries and ideographic systems, as well as
composite letter-plus-diacritic combinations)
character is a punctuation mark which does not appear within
lexical items
character can appear as a normal punctuation mark, but can also
appear within a lexical item (and should usually, when occurring between
two lexical characters, be treated as lexical; in English, hyphen and
apostrophe are typically treated as members of this class)
character is an Arabic decimal numeral (0, 1, ... 9) (does not
include superscript numbers, circled numbers, numeric dingbats,
etc.)
character represents some form of white space (space character,
horizontal or vertical tab, newline, etc.)
character is a diacritic applying to the following lexical
character
character is a diacritic applying to the preceding lexical
character
character is a diacritic which is explicitly joined to a lexical
character by a joiner character
character is used to join a diacritic to the lexical character to
which it applies (in some encoding schemes, the backspace control
character may be used as a joiner; in others, a graphic character is
used for the same function)
character does not fall into any of the other classes (dingbats
and other unusual characters fall here)
(in a writing system declaration) contains a description of a
character or character form.
form: identifies one letter form taken by a particular character in a
writing system declaration.
Attributes include:
string: gives the byte string used to encode the letter form in the
text.
codedCharSet: specifies which base coded character set the string
value occurs in.
entityStd: gives the name of one or more entities defined for this
character form in some standard entity set(s).
entityLoc: gives one or more entity names used locally for this
character form.
gives the position of the character form in the thirty-two bit
universal character set defined by ISO 10646.
gives one or more codes associated with this letter form by the
Association for Font Information Interchange.
(in a writing system declaration) contains an image of a character
form, stored in-line in some declared notation.
Attributes include:
identifies the notation in which the figure is encoded.
extFigure: (in a writing system declaration) refers to a figure or
illustration depicting the character form, which is stored in
some declared notation external to the text.
Attributes include:
identifies the notation in which the figure is stored.
gives the SGML name of the external entity which contains the
figure.
The exceptions element contains a series of
character elements only, each of which may contain descriptions
of the character (including its name), notes, and a series of
form elements documenting the different forms the character can
take. Attributes on the character and form elements
are used to convey the information mentioned above: byte string, entity
names, UCS-4 code, etc.
A simple example:
When transliteration schemes are used, the string used to
encode the character will typically be in a different alphabet:
The UCS-4 code is given as eight hexadecimal digits, one for each
four bits of the thirty-two-bit value. For legibility a hyphen may be
inserted as a separator after the fourth hexadecimal digit:
00000308 has the same meaning as 0000-0308. Since in
almost all cases at present the leading sixteen bits are zero, however,
by convention the leading four hexadecimal zeros may be dropped
entirely: the value 0308 is identical in meaning to the value
0000-0308.
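The three equivalent spellings of a UCS-4 code can be reduced to a
single canonical form before comparison. The following Python sketch
does so under the rules just stated; the function name normalize_ucs4
and the choice of upper-case, eight-digit output are assumptions made
for the illustration.

```python
import re

# An optional group of four hexadecimal digits (optionally followed by
# a hyphen), then the four low-order hexadecimal digits.
_UCS4 = re.compile(r"(?:([0-9A-Fa-f]{4})-?)?([0-9A-Fa-f]{4})")


def normalize_ucs4(value: str) -> str:
    """Return the eight-digit, upper-case form of a single UCS-4 code."""
    match = _UCS4.fullmatch(value)
    if match is None:
        raise ValueError(f"not a UCS-4 code: {value!r}")
    high = match.group(1) or "0000"  # dropped leading zeros are restored
    return (high + match.group(2)).upper()
```

With this helper, the values 0308, 0000-0308, and 00000308 all
normalize to 00000308, so they compare as equal.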
In some cases, the character is represented not as a single UCS
character but as a sequence of such characters; in this case, each
thirty-two-bit value except the last must be followed by a plus sign:
If a given character element has more than one encoding
using ISO 10646 (e.g. both as a-umlaut and as a plus
umlaut), then both encodings may be given, separated by blanks:
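A processor reading such a value has to separate the alternative
encodings and, within each, the sequence of codes. The Python sketch
below illustrates one way of doing this. It assumes that the plus sign
joins the codes of a sequence directly, with no intervening blanks
(blanks being reserved for separating alternatives); the function name
parse_ucs4_attribute is hypothetical, and hyphens are simply discarded
rather than checked for position.

```python
from typing import List


def parse_ucs4_attribute(value: str) -> List[str]:
    """Return one Python string per alternative encoding in the value."""
    alternatives = []
    for alternative in value.split():        # blanks separate alternatives
        characters = []
        for code in alternative.split("+"):  # plus signs join a sequence
            # Lenient: any hyphen is dropped before the hex conversion.
            characters.append(chr(int(code.replace("-", ""), 16)))
        alternatives.append("".join(characters))
    return alternatives
```

Given the value "00E4 0061+0308", the function returns two strings:
the single character a-umlaut, and the letter a followed by the
combining diaeresis.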
In most cases, identifying the character or character form by means
of its UCS-4 and AFII codes will suffice to identify the character for
all later users of the WSD. In some cases, however, further information
must be provided.
This may be provided in a note attached to the
character or form element:
This character has the form of a capital-letter N,
but is written the same height as a lower-case N.
Its appearance is thus that of UCS-4 0274, but it
does not have the same semantics.
In some cases, it will be necessary or useful to provide an image of
the character in question, or to refer to a standard reference work for
such an image. The following character element might be used
to describe, for example, a common Old French abbreviation for
est, for which the local entity est has been
defined:
Here, Cappelli is the name of a standard reference work which may
be consulted to see what the character in question looks like.
Dizionario di Abbreviature latine ed italiane
per cura di Adriano Cappelli, 6th ed. (Milan: Ulrico Hoepli, 1979).
This work on Latin abbreviations might be less convenient for the
purpose than one concentrating on Old French, but it is more widely
used than any other.
Where recourse to reference works is impossible, a picture of the
character may be encoded using any standard graphics format, and
associated with the character by standard SGML techniques. The SGML
document must then have:
an SGML notation declaration for the graphics format used
an external entity declaration for the file containing the image
an extFigure element to name the notation and the entity
For a discussion of graphic images and of the declaration of non-SGML
notations, see chapter . If the Old French abbreviation
is encoded using CGM (Computer Graphics Metafile) format in a file
called est.cgm, then it may be associated with the appropriate
character declaration as follows. In the DTD subset of the WSD, the
following declarations are required:
In the body of the WSD itself:
Despite now having a picture of the character, we retain the prose
description and reference to Cappelli, for the sake of those without
ready access to the appropriate graphics processors.
The exceptions element and its contents are declared thus:
Documenting Coded Character Sets and Entity Sets
Public or private coded character sets and entity sets may be
usefully documented using WSDs; the WSD will make explicit some
information (such as the UCS-4 and AFII codes) not normally given
explicitly in character set standards or public entity sets. The coded
character set or entity set being documented should be included by means
of a codedCharSet or entitySet element; the
exceptions element should include one character
element for each character included in the character set or the entity
set. Deciding whether to treat two entities or two bit patterns as
separate characters or as forms of the same character will require
knowledge of the script involved, and different encoders may reach
different decisions. In cases of doubt, though, it is usually
acceptable practice to treat each bit pattern in a coded character set,
and each entity in an entity set, as a distinct character.
A non-standard local coded character set (e.g. an EBCDIC character
set) may be documented in a WSD by defining one character
element for each printable code point in the character set, adding the
names of standard (and local) entities, UCS-4 codes, and AFII codes as
appropriate. Since this extra information is useful in packing
documents for interchange, and in processing pattern
arguments in the TEI extended-pointer syntax described in section
, those responsible for a
local installation are strongly encouraged to document the local system
character set in a WSD, if it is not already so documented.
Documenting Transliteration Schemes
When a script is encoded not in a character set designed for it,
but in one designed for another script (e.g. Greek encoded using the
Latin alphabet), a transliteration scheme is necessary. In
documenting such a transliteration scheme, the coded character set
actually in use should be named as a base component. An
exceptions element can then be used to override the normal
meaning of the individual byte strings used in the transliteration.
For example, the following character element overrides the
usual association of the byte representing A with the Latin
letter A and substitutes instead an association with the Greek
letter alpha:
Care should be taken in choosing or developing transliteration schemes
to ensure that they are unambiguously reversible.
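One simple, conservative test of reversibility is to confirm that no
byte string used in the scheme is a prefix of another, since such a
pair makes it unclear where one character ends and the next begins.
The Python sketch below performs this test. It checks a sufficient
condition only (a scheme may fail it and still be reversible by other
means), and the function name prefix_conflicts is hypothetical.

```python
from typing import Iterable, List, Tuple


def prefix_conflicts(strings: Iterable[str]) -> List[Tuple[str, str]]:
    """Return pairs (shorter, longer) where one string begins another.

    After sorting, a string that is a prefix of any other string is
    always a prefix of its immediate successor, so comparing neighbours
    is enough to decide whether the set is prefix-free. The list is
    empty exactly when no string is a prefix of another; it does not
    necessarily enumerate every conflicting pair.
    """
    ordered = sorted(set(strings))
    return [(shorter, longer)
            for shorter, longer in zip(ordered, ordered[1:])
            if longer.startswith(shorter)]
```

For an invented scheme using the strings A, B, and G, the result is
empty. If a scheme used both T and TH for different letters, the pair
(T, TH) would be reported, signalling that a text containing TH could
be read in two ways.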
Notes in the WSD
Notes on the WSD, individual characters, or individual character
forms may be included in the note element at the appropriate
level.
note: (in a writing system) contains a note of any type.
Unlike its counterpart in the main TEI DTD, the note element
within the writing system declaration may contain no paragraphs and no
phrase-level elements: only character data. It is formally declared
thus:
Linkage between WSD and Main Document
The writing system declaration is associated with different portions
of a main document by means of the global lang attribute.
This attribute is defined as an SGML IDREF and its value must be the
SGML identifier on a language element within the TEI header of
the main document. The language element in turn provides, in
its wsd attribute, the SGML name of the entity (usually an
external file) containing the writing system declaration associated with
that lang value.
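The chain of references just described (lang value, language element,
wsd attribute, entity) can be pictured as two successive look-ups.
The following Python sketch models it with plain dictionaries; the
function name resolve_wsd, the parameter names, and the sample
identifiers in the usage note are all hypothetical, and a real
processor would obtain these tables from the SGML parser.

```python
from typing import Dict


def resolve_wsd(lang_value: str,
                header_languages: Dict[str, str],
                entities: Dict[str, str]) -> str:
    """Follow a lang value to the storage location of its WSD.

    header_languages maps the identifier of each language element in
    the TEI header to the value of its wsd attribute (an entity name);
    entities maps entity names to the files or other storage objects
    that hold them.
    """
    if lang_value not in header_languages:
        raise LookupError(f"no language element with identifier {lang_value!r}")
    entity_name = header_languages[lang_value]
    if entity_name not in entities:
        raise LookupError(f"entity {entity_name!r} is not declared")
    return entities[entity_name]
```

If the header held a language element with the identifier grc whose
wsd attribute named an entity wsd.grc, and that entity were declared
as the file greek.wsd, then resolving the lang value grc would yield
greek.wsd.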
At least one writing system declaration must be associated with any
TEI document: this follows from the requirement that a value be
specified for the lang attribute on the outermost element
(tei.2 or teiCorpus.2) of any TEI document, since the
lang attribute is required to point at a language
element in the TEI header, which in turn is required to indicate an
entity containing the writing system declaration associated with that
language.
Predefined TEI WSDs
The Text Encoding Initiative has defined standard writing system
declarations for the following languages and scripts:
The nine official languages of the European Community (Danish,
Dutch, English, French, Greek, German, Italian, Portuguese, and
Spanish), encoded using the character set ISO 8859-1
Hebrew (using ISO 8859-8)
Russian (using ISO 8859-5)
classical Greek (using the Thesaurus Linguae
Graecae's Beta Code transliteration scheme)
Work is underway on a standard writing system declaration for
Japanese; resources permitting, declarations for Korean and Chinese will
also be made.
In addition to the language-specific WSDs just mentioned, the TEI has
defined a number of WSDs to document specific public coded character
sets and entity sets. These are used as components in the
language-specific WSDs, and may be similarly used for locally developed
WSDs. WSDs for specific ISO standard character sets include:
ISO 646 (non-national subset)
ISO 646 (International Reference Version)
ISO 8859-1 (Latin 1, Western Europe)
ISO 8859-2 (Latin 2, Eastern Europe)
ISO 8859-5 (Latin and Cyrillic)
ISO 8859-7 (Latin and Greek)
ISO 8859-8 (Latin and Hebrew)
ISO 8859-9 (Western Europe and Turkey)
Writing system declarations are also under preparation for a number
of commercial character sets in wide use:
IBM Code Page 437 (IBM PC, early models)
IBM Code Page 850 (IBM PS/2 and later-model IBM PCs)
IBM Code Page 1014
Apple Macintosh default system character set
Adobe Postscript default character set
The Text Encoding Initiative has prepared entity sets for use in
transcribing some languages; these are distributed both as SGML entity
sets and as TEI writing system declarations documenting those entity
sets:
-//TEI P2: 1993//ENTITIES Arabic//EN
-//TEI P2: 1993//ENTITIES Coptic//EN
-//TEI P2: 1993//ENTITIES Classical Greek//EN
-//TEI P2: 1993//ENTITIES International Phonetic Alphabet//EN
WSDs will also be prepared for other standard entity sets, including:
ISO 8879: 1986//ENTITIES Added Latin 1//EN
ISO 8879: 1986//ENTITIES Added Latin 2//EN
ISO 8879: 1986//ENTITIES Russian Cyrillic//EN
ISO 8879: 1986//ENTITIES Non-Russian Cyrillic//EN
ISO 8879: 1986//ENTITIES Greek Letters//EN
ISO 8879: 1986//ENTITIES Diacritical Marks//EN
ISO 8879: 1986//ENTITIES Box and Line Drawing//EN
ISO 8879: 1986//ENTITIES Numeric and Special Graphic//EN
ISO 8879: 1986//ENTITIES Publishing//EN
ISO 8879: 1986//ENTITIES General Technical//EN
All TEI writing system declarations are distributed with the TEI
document type definitions.
The standard TEI writing system declarations are expected to meet the
needs of many encoders; some, however, will need to prepare new WSDs to
describe character-encoding schemes not included in the standard WSDs.
Details of WSD Semantics
This section describes the meaning of the WSD in more formal terms
than have been used elsewhere in this chapter; it can be skipped by most
readers, but should be read carefully by those who wish to write complex
writing system declarations or to implement software to process writing
system declarations or to interpret them in the processing of
TEI-conformant documents.
WSD Semantics: General Principles
A writing system declaration provides a complicated bundle of
mappings:
a 1:1 partial function from strings in given coded character sets
to character forms
a function from entity names to character forms, and therefore
derivatively:
a function from entity names to strings
a function from character forms to characters, and therefore
derivatively:
a function from strings to characters
a function from entity names to characters
a relation between UCS-4 codes and character forms
a relation between AFII codes and character forms
a function from UCS-4 codes to characters
a relation from AFII codes to characters
To ensure that the relations described as functions are in fact
functional, the following constraints apply on the WSD:
No two form elements can have the same values for both
codedCharSet and string. Since usually there is
only one codedCharSet used as a basic component, this usually
means each string attribute value must be unique in the WSD.
No two form elements can name the same entity in either
entityStd or entityLoc. (It is legal, though
pointless, for both entityStd and entityLoc on the
same form element to name the same entity.)
More than one form element may have the same
UCS-4 value, but if so
they must be within the same character element.
These constraints may be summarized thus: one
character (however the creator of the WSD defines a
character) can be associated with more than one byte string, entity
name, UCS-4 code, or AFII code, but any single byte string (given a
specific coded character set), any single entity name, and any single
UCS-4 code must be associated with only one single
character. One can, for example, associate both
tilde and
logical not with a character meaning logical
negation, but one cannot associate both a character called
tilde and one called logical negation with the ASCII
character 7/14: given a 7/14 in the text, it must be unambiguously
clear whether the character is a tilde or a logical
negation. If one wishes to retain the ambiguity, one must define a
character called (for example)
logical-not or tilde or swung-dash.
Similar restrictions apply to entity names and UCS-4 codes: each must
be associated with a single character element.
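The three constraints can be checked mechanically once the form
elements of a WSD have been read into memory. The Python sketch below
is one possible arrangement, not a prescribed algorithm: the Form
class and the function check_constraints are hypothetical, each form
records the identifier of its parent character element, and UCS-4
codes are assumed to have been put into a single canonical spelling
beforehand.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Form:
    character: str                       # identifier of the parent character
    string: Optional[str] = None
    coded_char_set: Optional[str] = None
    entity_std: Tuple[str, ...] = ()
    entity_loc: Tuple[str, ...] = ()
    ucs4: Tuple[str, ...] = ()           # codes in one canonical spelling


def check_constraints(forms: List[Form]) -> List[str]:
    """Return a message for each violation of the three constraints."""
    errors: List[str] = []
    strings: Dict[Tuple[Optional[str], str], int] = {}
    entities: Dict[str, int] = {}
    ucs4_owner: Dict[str, str] = {}
    for index, form in enumerate(forms):
        # 1. A (codedCharSet, string) pair may occur on one form only.
        if form.string is not None:
            key = (form.coded_char_set, form.string)
            if key in strings:
                errors.append(f"string {form.string!r} is used by two forms")
            else:
                strings[key] = index
        # 2. An entity may be named by one form only; naming it in both
        #    entityStd and entityLoc of the same form is legal.
        for entity in sorted(set(form.entity_std) | set(form.entity_loc)):
            if entity in entities:
                errors.append(f"entity {entity!r} is named by two forms")
            else:
                entities[entity] = index
        # 3. Forms sharing a UCS-4 code must share a character element.
        for code in form.ucs4:
            owner = ucs4_owner.setdefault(code, form.character)
            if owner != form.character:
                errors.append(f"UCS-4 code {code} occurs in two characters")
    return errors
```

Two forms that both give the string for ASCII position 7/14, one under
a character called tilde and one under a character called logical
negation, would produce a single error under the first constraint.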
Semantics of WSD Base Components
The effects of naming coded character sets, entity sets, and other
WSDs as base components may now be defined thus:
reference to a coded character set makes available the set of
bit-pattern-to-character mappings defined in the coded character set.
I.e. if a WSD refers to a coded character set, then whenever the WSD is
in use, any character in that coded character set may be used with its
standard meaning (unless it has been redefined using the
exceptions element). It is recommended that a WSD be provided
for each coded character set, to make the mappings fully explicit.
reference to an entity set makes available the set of
entity-name-to-character mappings defined in the entity set. (It is
assumed that standard public entity sets contain enough information to
count as a valid mapping; for private entity sets, the preferred method
of providing the necessary information is to define the entity set in a
WSD). If for example a WSD refers to the ISO Latin 1 entity set, then
whenever that WSD is in use, any entity in that set may be used with
its public meaning, unless it has been redefined in the
exceptions element.
reference to a WSD makes available the set of mappings declared in
that WSD; the language and writing system direction information given
in the base WSD is ignored.
If reference is made only to standard character sets and entity
sets, there is no mechanical method of associating the
characters involved in one mapping with those
involved in another. E.g. a reference to ISO 646 IRV provides a map
from code point 5/11 to a character one might call left square
bracket. A reference to entity set ISOpub1 provides
a map from the entity name lbr to what should probably be
considered the same character. There is however no guarantee that any
processing software will necessarily be sufficiently intelligent to
make this association of mappings automatically; it requires
hard-coded knowledge of the specifics of certain character sets and
entity sets.
When, however, base WSDs are used to document important entity sets
and character sets, it does become possible to define mechanical
methods of associating character elements in different base
components.
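The following is a minimal sketch of such a mechanical association, not part of the WSD specification itself. It assumes that each base WSD has already been reduced to a simple table giving a UCS-4 code for every form; the function names code_point and associate, the two tables, and the UCS-4 value recorded for lbr are all hypothetical illustrations.

```python
def code_point(notation):
    """Convert ISO column/row notation such as '5/11' to an integer code."""
    column, row = (int(part) for part in notation.split("/"))
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("not a column/row position: %r" % notation)
    return column * 16 + row


def associate(string_forms, entity_forms):
    """Group code points and entity names that share a UCS-4 code."""
    by_ucs4 = {}
    for code, ucs4 in string_forms.items():
        by_ucs4.setdefault(ucs4, {"codes": [], "entities": []})["codes"].append(code)
    for name, ucs4 in entity_forms.items():
        by_ucs4.setdefault(ucs4, {"codes": [], "entities": []})["entities"].append(name)
    return by_ucs4


# Hypothetical extracts from two base WSDs (assumed data, for illustration):
# one documenting ISO 646 IRV, one documenting the entity set naming lbr.
ISO646_FORMS = {code_point("5/11"): 0x5B}  # code point -> UCS-4 code
ISOPUB_FORMS = {"lbr": 0x5B}               # entity name -> UCS-4 code
```

Because both tables record the same UCS-4 code, associate(ISO646_FORMS, ISOPUB_FORMS) places code point 5/11 and the entity name lbr in the same group without any knowledge specific to either set.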
Multiple Base Components
When multiple bases of the same type are referred to, the effects are
these:
if more than one coded character set is named, then it is
expected that character-set shifting as described in ISO 2022 or
some equivalent is in use, and proper shifting is the responsibility
of the user. All strings in the WSD must specify the ID of
the proper coded-character-set base, using the
codedCharSet attribute.
if more than one entity set is named, then entity names from all
named sets may be used as values of the entityStd and
entityLoc attributes. If the same name occurs in more than
one entity set, the assumption is made that it refers each time to the
same character.
if more than one base WSD is named, then all characters declared
in all the WSDs are available. For this case, the way in which the
different base components are merged can be defined more precisely than
for the other types of base component.
any two form elements which name the same entity or the
same string in the same coded character set are considered the same
form, and are merged as described below in the section Merger of Form
and Character Elements.
any two form elements which give the same UCS-4 code are
considered forms of the same character, and their parent
character elements are merged. The forms themselves may be
merged or may remain distinct: if the forms have conflicting values for
any attribute, they must remain distinct; if they don't conflict, they
may be merged, at the option of the processing software. In the general
case, there might be more than one way to perform mergers, so merger is
not required.
The result of invoking multiple base WSDs is thus a merged WSD in
which the form and character elements have been merged
as prescribed. If the merger is impossible because the two WSDs are
incompatible, a semantic error occurs. A set of WSDs is compatible and
may be invoked together if all of the following are true:
any given entity name is associated with a single string
(in a given coded character set) and a single character class
any given string or UCS-4 code is
associated with a single character class
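As an illustration only, the two compatibility conditions might be checked along the following lines. The sketch assumes that the form elements of all the base WSDs have been flattened into simple records; the Form record, its field names, and the function compatible are hypothetical, and the class names used with it are placeholders rather than values defined by this chapter.

```python
from typing import NamedTuple, Optional


class Form(NamedTuple):
    """Hypothetical flattened view of one form and its parent's class."""
    char_class: str
    coded_char_set: Optional[str] = None
    string: Optional[str] = None
    entity: Optional[str] = None
    ucs4: Optional[int] = None


def compatible(forms):
    """Return True if the forms satisfy both compatibility conditions."""
    entity_strings = {}  # (entity, coded character set) -> string
    classes = {}         # entity, string, or UCS-4 key -> character class
    for form in forms:
        keys = []
        if form.entity is not None:
            keys.append(("entity", form.entity))
            if form.string is not None:
                pair = (form.entity, form.coded_char_set)
                # An entity may have only one string per coded character set.
                if entity_strings.setdefault(pair, form.string) != form.string:
                    return False
        if form.string is not None:
            keys.append(("string", form.coded_char_set, form.string))
        if form.ucs4 is not None:
            keys.append(("ucs4", form.ucs4))
        for key in keys:
            # Each entity, string, and UCS-4 code may have only one class.
            if classes.setdefault(key, form.char_class) != form.char_class:
                return False
    return True
```

A set of WSDs for which this check fails corresponds to the semantic error described above, and should not be invoked together.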
Semantics of Exceptions
We can now define the semantics of the exceptions element
this way:
The base components provide a preliminary set of mappings, as
described above. For convenience let us call this the default
map. The exceptions element allows the user to modify
the default map by defining further mappings and by overriding parts of
the default map. There are three cases: a new character
element replaces an old one, is merged with an old one, or is added to
the set without affecting any old ones.
Case 1: If a form element within exceptions
(F-new) collides with a form element in
the default map (F-old), then the parent character element of
F-new replaces the parent element of F-old. Two form elements
collide if they have the same values for codedCharSet and
string. (N.B. if this condition occurs within the default
map, the two form elements are merged.)
E.g. to define the TLG Beta code transliteration of alpha as a
we may first name ISO 646 IRV as a base component; this has the effect
of creating the following (possibly imaginary) form element:
]]>
We then include the following within the exceptions element:
]]>
This overrides the character element for latin A, and
indicates that in the transliteration scheme documented by this WSD,
character 6/01 represents a Greek alpha, no matter what ISO 646 says.
Case 2: If a character element within exceptions overlaps with one in the default map, then the two
character elements are merged. Two character elements
overlap if any of their children name the same entity or UCS-4 code.
(N.B. if these conditions occur within the default map, they lead to
merger either of the two form elements --- for entity name
overlap --- or of the two character elements.)
For example: suppose we wish to document the three-Rs transcription
described in section . We name ISO 646 IRV as a base
character set (or WSD) and add the following exceptions:
lowercase latin letter r
]]>
As a second example, imagine we wish to document a local entity set
for Old English in which we define local entities t (thorn), d (eth) and
a (aesc). Assuming the TEI has provided a WSD for the Latin 1 entities,
the whole WSD is this:
VariousThis WSD is just to document the local entities; it should be
named as a base WSD by the actual writing system declaration.
]]>
This has the effect of merging the character elements for
thorn, eth, and aesc (or a-e
ligature) defined in the ISO Latin 1 WSD with those given here, which
specify the local entity name. The form elements may or may
not be merged, so the software may or may not actually realize that
the local entity t corresponds with the UCS-4 code given
in the TEI WSD for ISO Latin 1.
The full local WSD can then be this:
Anglo-Saxon / Old English
]]>
We refer explicitly to ISO Latin 1, for clarity, but in theory it has
already been included in -//OTA 1990//WSD Old English
entities//EN and need not be repeated. At this time, the rules for
merger would force our local form elements to be merged with
the standard form elements, so the local entity t
would map correctly into the UCS-4 character set.
Case 3: If a character element has no form
children which collide with anything in the default map, and does not
itself overlap with anything in the default map, then it is simply added
to the default map.
For example, suppose we wish to document an abbreviation used for Old
French est in our manuscript, which resembles e with a tilde or
macron. Since we expect we may have more abbreviations for est,
we use the local entity name est1 for this one. Within
exceptions, we declare the abbreviation thus:
]]>
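The three cases can be summarized in a short sketch. It is an illustration under stated assumptions, not a reference implementation: character elements are modelled as dictionaries with a class and a list of forms, the keys codedCharSet, string, entities, and ucs4 are a hypothetical flattening of the form attributes, and merger is simplified to concatenating the lists of forms.

```python
def collides(f_new, f_old):
    """Case 1 test: same values for codedCharSet and string."""
    return (f_new.get("string") is not None
            and f_new.get("codedCharSet") == f_old.get("codedCharSet")
            and f_new.get("string") == f_old.get("string"))


def names(character):
    """Entity names and UCS-4 codes named by the forms of a character."""
    found = set()
    for form in character["forms"]:
        found.update(("entity", name) for name in form.get("entities", ()))
        if form.get("ucs4") is not None:
            found.add(("ucs4", form["ucs4"]))
    return found


def apply_exceptions(default_map, exceptions):
    """Return a new map with the exceptions applied to the default map."""
    result = list(default_map)
    for new in exceptions:
        hits = [i for i, old in enumerate(result)
                if any(collides(fn, fo)
                       for fn in new["forms"] for fo in old["forms"])]
        if hits:
            # Case 1: the new character replaces every character it collides with.
            result[hits[0]] = new
            for i in reversed(hits[1:]):
                del result[i]
            continue
        overlaps = [i for i, old in enumerate(result) if names(new) & names(old)]
        if overlaps:
            # Case 2: merge with the first overlapping character (simplified).
            old = result[overlaps[0]]
            if old["class"] != new["class"]:
                raise ValueError("semantic error: classes differ")
            result[overlaps[0]] = {"class": old["class"],
                                   "forms": old["forms"] + new["forms"]}
            continue
        # Case 3: no collision and no overlap, so the character is added.
        result.append(new)
    return result
```

Collision is tested before overlap here, so that an exception which redefines a string (as in the Beta code example) replaces the old character even when it names a different UCS-4 code.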
Merger of Form and Character Elements
In some cases, the form and character elements
introduced notionally by reference to a coded character set or entity
set, or introduced explicitly by reference to a base WSD, may be
considered as referring to identical objects; this is called
merger. Two form elements F1 and F2 can be merged
if they both have the same values for codedCharSet and
string, or if codedCharSet and string
are unspecified (implied) in at least one. When F1 and F2 are merged,
the result is a (possibly imaginary) form element (F3) the
attributes of which are derived thus:
if F1 has no value for codedCharSet, then F3 has the
same value for this attribute as does F2. (If both F1 and F2 have
explicit values, the values must be identical.)
if F1 has an empty string for string, then F3 has the
same value as F2. (If both have values other than '', they must be
identical.)
for entityStd, entityLoc, ucs-4,
and afiiCode, F3 gets a value containing all the entity names
or codes which appear in the corresponding attribute values of either F1
or F2. (I.e. the attribute values are viewed as sets, and F3 gets the
union of F1 and F2.)
The children of the new element are derived by taking all the
desc children of F1, then all the desc children of F2;
all the figure children of F1, then those of F2; all the
note children of F1, then all the note children of F2.
In other words, all the children of the source elements survive as
children of the result element.
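The derivation of F3 can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory Form record in which the set-valued attributes are held as Python sets and each child is a (kind, text) pair; none of these names is defined by the WSD itself.

```python
from dataclasses import dataclass, field
from typing import Optional

SET_ATTRS = ("entityStd", "entityLoc", "ucs4", "afiiCode")
CHILD_ORDER = ("desc", "figure", "note")


@dataclass
class Form:
    """Hypothetical model of a form element (names are assumptions)."""
    coded_char_set: Optional[str] = None
    string: str = ""
    sets: dict = field(default_factory=dict)      # attribute name -> set of values
    children: list = field(default_factory=list)  # (kind, text) pairs


def pick(a, b, empty):
    """Take whichever value is specified; two specified values must agree."""
    if a == empty:
        return b
    if b == empty or a == b:
        return a
    raise ValueError("forms cannot be merged: %r conflicts with %r" % (a, b))


def merge_forms(f1, f2):
    """Build F3 from F1 and F2 following the rules given above."""
    coded_char_set = pick(f1.coded_char_set, f2.coded_char_set, None)
    string = pick(f1.string, f2.string, "")
    # Set-valued attributes: F3 gets the union of the values of F1 and F2.
    sets = {attr: f1.sets.get(attr, set()) | f2.sets.get(attr, set())
            for attr in SET_ATTRS}
    # Children: desc of F1 then F2, figure of F1 then F2, note of F1 then F2.
    children = [child for kind in CHILD_ORDER for f in (f1, f2)
                for child in f.children if child[0] == kind]
    return Form(coded_char_set, string, sets, children)
```

A ValueError in this sketch corresponds to the case in which both forms give explicit but different values for codedCharSet or string, so that they cannot be merged.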
Two character elements C1 and C2 may be merged unless their
values for class differ. The resulting character
element C3 has the same class value as C1 and C2, and all the
children of C1 and C2 are made children of C3 (desc children
first, then form children).
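The merger of character elements is simpler and might be sketched as below. The dictionary layout and the function name merge_characters are hypothetical; children are again modelled as (kind, content) pairs.

```python
def merge_characters(c1, c2):
    """Merge two character elements whose class values agree."""
    if c1["class"] != c2["class"]:
        raise ValueError("semantic error: class values differ")
    children = c1["children"] + c2["children"]
    # sorted() is stable, so desc children come first and the relative
    # order within each kind (C1 before C2) is preserved.
    ordered = sorted(children, key=lambda child: child[0] != "desc")
    return {"class": c1["class"], "children": ordered}
```

The stable sort keeps every child of both source elements, so nothing is lost in the result.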
Note that merger is sometimes required by the semantic rules given
above, and sometimes optional. If merger is required but not legal
(because the two elements to be merged are incompatible), then a
semantic error has occurred and the two base WSDs which give rise to it
should not be invoked together.