Characters and Character Sets

Computer systems vary greatly in the sets of characters they make available for use in electronic documents; this variety enables users with widely different needs to find computer systems suitable to their work, but it also complicates the interchange of documents among systems; hence the need for a chapter on this topic in these Guidelines.

Three character-set problems arise for the encoder of electronic texts: (1) what character set to use in the creation or for the processing and storage of the electronic text; (2) how to signal shifts from one character set to another, e.g. from the Latin alphabet to Greek and back, or to a special symbol character set and back; and (3) how to prepare documents for interchange so that the character data are not corrupted in transit.

This chapter describes the recommended solutions to these problems, in enough detail to satisfy the needs of most users. More detail and more technical information can be found in chapters , , , and .

Local Character Sets

No single character set is required for use in TEI-encoded documents. Users may use any character set available to them. It is recommended that the character set used be documented by a writing system declaration, on which see below.

In general, it is most convenient to use a character set readily available on one's computer system, though for special purposes it may be preferable to customize the character set using software specialized for the purpose. Whether to use the usual character set or create a custom set depends on the documents being encoded, the tools available for customizing the character set, the user's technical facility, and the perceived relative convenience of living with the existing character set or of modifying it to suit one's documents more closely. The choice must be made by each individual according to individual circumstances; no general recommendations are made here as to whether locally customized character sets should be used. For local processing, encoders should use whatever character set they find convenient.

Characters Available Locally

When the characters in a text exist in the local character set, the appropriate character codes should be used to represent them. Virtually all computer systems provide at least the following characters (in addition to the space character):

  a b c d e f g h i j k l m n o p q r s t u v w x y z
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  0 1 2 3 4 5 6 7 8 9
  " % & ' ( ) * + , - . / : ; < = > ? _

Other characters, such as Latin characters with diacritics (e.g. ä or é) or non-Latin characters (e.g. Greek, Hebrew, Arabic, Cyrillic, and Oriental scripts), are less universally provided. If the local character set provides an ä, however, there is normally no reason not to use it where that character appears in the text. The exception is an electronic text that is to be moved frequently among machines, in which case one may wish to restrict it to characters known to translate well among machines. (For more information on moving characters among machines, see section , below.)

Full use of a local character set will require that the SGML declaration define all the characters used as legal SGML characters. For further information see chapter .

Characters Not Available Locally

Characters not available in the local character set should usually be encoded using SGML entity references. In SGML terms, an entity (described in more detail in section ) is any named string of characters, from a single character to an entire system file. Entities are included in SGML documents by entity references.

For example, the standard entity name for the character ä is auml. A reference to an entity gives the name of the entity, preceded by an ampersand and followed by a semicolon; the reference for ä is thus &auml;. (Strictly speaking, the semicolon is not always required; for details see any full treatment of SGML.) Using the standard names for a-umlaut and u-umlaut, one could transcribe the German words Mädchen and für as M&auml;dchen and f&uuml;r.

Standard entity names have been defined for most characters used by languages written in the Latin alphabet, and for some other alphabetic scripts. A useful subset of these may be found in chapter .
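The substitution of entity references for characters missing from the local character set can be sketched in a few lines of Python. This is an illustration only: the table is a small hand-picked selection of standard entity names rather than the full set, and the names ENTITY_NAMES and to_entity_references are hypothetical, introduced here for the example.

```python
# A small, hand-picked selection of standard entity names (assumption:
# only these characters need replacing; a real table would be much larger).
ENTITY_NAMES = {
    "&": "amp",   # the ampersand itself, so that references stay unambiguous
    "ä": "auml",
    "ö": "ouml",
    "ü": "uuml",
    "Ä": "Auml",
    "Ö": "Ouml",
    "Ü": "Uuml",
    "ß": "szlig",
    "é": "eacute",
}


def to_entity_references(text):
    """Replace every character listed in ENTITY_NAMES with an &name; reference.

    Characters not in the table are copied through unchanged.
    """
    return "".join(
        f"&{ENTITY_NAMES[ch]};" if ch in ENTITY_NAMES else ch for ch in text
    )


print(to_entity_references("Mädchen für alles"))  # M&auml;dchen f&uuml;r alles
```

The table is consulted one character at a time, so the sketch assumes that accented letters arrive as single precomposed characters.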

Where no standard entity name exists, or where the standard name is felt unsuitable for some reason, the encoder may declare non-standard entities, using the normal SGML syntax. If, for example, it is desired to distinguish, in the transcription of a manuscript, among three distinct forms of the letter r, one could use the following declarations, embedded in the DTD subset of the SGML document:

By manipulating the declarations, one could make the three forms reduce to a single form (as here), or maintain the distinction by giving the three forms three distinct expansions.

To ensure that the SGML output uses the same entity references for them as the SGML input, for example, one could use the following declarations.

Of course, it may be preferable to develop a custom character set for such cases, but the technique of defining distinct entities for the distinct characters (or character forms) one wishes to encode will be needed to transfer the data to another system without that custom character set.
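The effect of manipulating entity expansions can be imitated outside an SGML parser. The Python sketch below assumes three hypothetical entity names (r1, r2, r3) for the three forms of r; these names, the sample text, and the functions declarations and expand are inventions for illustration and are not taken from these Guidelines. One table collapses the three forms to a plain r; the other maps each entity to its own reference, so that the output preserves the distinction.

```python
import re

# An entity reference: ampersand, a name, semicolon. SGML names may contain
# letters, digits, periods, and hyphens, and begin with a letter.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def declarations(entities):
    """Render simple entity declarations, one per line, for the given table."""
    return "\n".join(
        f'<!ENTITY {name} "{value}">' for name, value in entities.items()
    )


def expand(text, entities):
    """Expand known entity references in a single pass.

    References to names missing from the table are left untouched.
    """
    return ENTITY_REF.sub(
        lambda m: entities.get(m.group(1), m.group(0)), text
    )


# Hypothetical entity names for three manuscript forms of the letter r.
collapsed = {"r1": "r", "r2": "r", "r3": "r"}               # reduce to one form
preserved = {name: f"&{name};" for name in collapsed}       # keep the distinction

text = "fo&r1; ou&r2; fathe&r3;"

print(declarations(collapsed))
print(expand(text, collapsed))   # for our father
print(expand(text, preserved))   # fo&r1; ou&r2; fathe&r3;
```

Because expand makes only one pass, an expansion that itself looks like a reference is not expanded again. A real SGML parser would need a form of declaration that prevents such re-parsing, which this sketch does not attempt to model.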

For transcriptions in scripts not supported by the local character set, entity references may prove unwieldy. In such cases, it is also possible to transliterate the material from its original script into the script of the local character set; like a customized local character set, a transliteration scheme should be documented with a writing system declaration. Transliteration schemes should be reversible (i.e. from the transliteration it should be possible to reconstruct the original writing exactly); where possible, standard schemes should be preferred to ad hoc schemes. Reversible transliteration schemes are defined by national and international standards too numerous to list here; another useful source is ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman Scripts, approved by the Library of Congress and the American Library Association, tables compiled and edited by Randall K. Barry (Washington: Library of Congress, 1991). Many of these schemes, however, use diacritics which are themselves not always available in standard electronic character sets and thus may require careful adaptation for use in electronic work. Some standard transliteration schemes for commonly encoded languages are included in chapter .

Shifting Among Character Sets

Many documents contain material from more than one language: loan words, quotations from foreign languages, etc. Since languages use a variety of writing systems, which in turn use a variety of character repertoires, shifts in language frequently go hand in hand with shifts in character repertoire and writing system. Since language change is frequently of importance in processing a document, even when no character set shift is needed, the encoding scheme defined here provides a global attribute lang to make it possible to mark language shifts explicitly. This attribute may also be used to trigger character-set shifting by application programs.

Some languages use more than one writing system. Japanese may be written in kanji, hiragana, katakana, or combinations of these. Hebrew may be written with or without vowel points. Some languages may be written either in the Latin or in the Cyrillic alphabet; or Cyrillic may alternate with Arabic script. In such cases, each writing system must be treated separately, as if a separate language. Each distinct value of the lang attribute, therefore, represents a single natural language and a single writing system (and hence a single character set).

It is recommended that each value used for the lang attribute correspond to a writing system declaration suitable for the character set and writing system being used. A number of standard writing system declarations are provided in this document; others may be provided locally.
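Whether every value of the lang attribute has a corresponding writing system declaration can be checked mechanically. The Python sketch below is one rough way to do so, under several assumptions: the language codes (la, el), the tagged sample, and the function names are hypothetical, and the pattern is a simple text search rather than an SGML parse, so it would also match the string lang= occurring in ordinary prose.

```python
import re

# Matches lang=value where the value is double-quoted, single-quoted,
# or an unquoted name token. Attribute names are matched case-insensitively.
LANG_ATTR = re.compile(
    r"""\blang\s*=\s*(?:"([^"]*)"|'([^']*)'|([A-Za-z0-9.\-]+))""",
    re.IGNORECASE,
)


def lang_values(text):
    """Collect the distinct values given for the lang attribute in the text."""
    values = set()
    for match in LANG_ATTR.finditer(text):
        # Exactly one of the three groups matched; take that one.
        values.add(next(g for g in match.groups() if g is not None))
    return values


def undocumented_langs(text, declared):
    """Return, in sorted order, lang values lacking a writing system declaration."""
    return sorted(lang_values(text) - set(declared))


# Hypothetical tagged sample; the language codes are assumptions.
sample = (
    '<p>... which I call <term lang="la">Experimenta lucifera</term>, '
    "experiments of light, to distinguish them from those which I call "
    "<term lang=la>fructifera</term>, experiments of fruit.</p>"
    "<p lang='el'>...</p>"
)

print(undocumented_langs(sample, declared={"la"}))  # ['el']
```

Here the set passed as declared stands in for whatever local record exists of the writing system declarations in use.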

Like any global attribute, the lang attribute may be used on any element in the SGML document. To mark a technical term, for example, as being in a particular language, one may simply specify the appropriate language on the term tag:

  ... But then only will there be good ground of hope for the further advance of knowledge, when there shall be received and gathered together into natural history a variety of experiments, which are of no use in themselves, but simply serve to discover causes and axioms, which I call Experimenta lucifera, experiments of light, to distinguish them from those which I call fructifera, experiments of fruit.

  Now experiments of this kind have one admirable property and condition: they never miss or fail. ...

Character Set Problems and Interchange

Electronic texts may be exchanged over electronic networks, through exchange of magnetic media, or by other means. In every case except the transmission of magnetic media (e.g. disk or tape) from one machine to another machine of the same hardware type running the same operating system, the electronic data is subject to translation and interpretation, and hence to misinterpretation and distortion, by utility software working somewhere on the interchange path. Network gateways, tape-reading software, and disk utilities routinely translate from one character set to another before passing the data on. If the utility errs in identifying the character set, or if several utilities translate back and forth among character sets using non-reversible translations, the chances are good that characters will be garbled and information lost.

At this time (1992), the characters least susceptible to loss or misinterpretation in transit among systems are those shown below, which represent a subset of the characters in the international standard ISO 646 and may thus be called the ISO 646 subset:

  a b c d e f g h i j k l m n o p q r s t u v w x y z
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  0 1 2 3 4 5 6 7 8 9
  " % & ' ( ) * + , - . / : ; < = > ? _

In interchange over any transmission link, the transmitted document should contain only those characters which safely survive transmission over the link; others should be represented with entity references, or with transliterations, as described above.
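Before a document is sent, it can be scanned for characters that fall outside the ISO 646 subset shown above. The Python sketch below is a minimal check of that kind. The names ISO646_SUBSET and unsafe_characters are hypothetical, and treating line breaks as safe is an assumption that may not hold on every transmission link.

```python
import string

# The ISO 646 subset as listed in the text, plus the space character.
ISO646_SUBSET = set(
    string.ascii_lowercase
    + string.ascii_uppercase
    + string.digits
    + "\"%&'()*+,-./:;<=>?_"
    + " "
)


def unsafe_characters(text, allow_line_breaks=True):
    """Return, in sorted order, the distinct characters outside the subset.

    Line breaks are tolerated by default (an assumption of this sketch).
    """
    allowed = set(ISO646_SUBSET)
    if allow_line_breaks:
        allowed |= {"\n", "\r"}
    return sorted(set(text) - allowed)


print(unsafe_characters("f&uuml;r"))  # []
print(unsafe_characters("für #1"))    # ['#', 'ü']
```

An empty result means that every character in the text belongs to the subset. Note that some common punctuation marks, such as # and !, are not in the subset and are reported along with accented letters.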

In blind interchange by means of magnetic media, it is recommended that the document be encoded using some well documented and widely used standard character set.

In blind interchange over networks, it is recommended that the transmitted document contain only characters known to travel safely over the networks involved. In the most general case, those characters are the ISO 646 subset given above.

The Writing System Declaration

As from 3.2.4, with modifications.