|
Using character sets other than Latin, Greek, and Cyrillic (all of which fit
neatly into a 256-code matrix with room to spare) used to be something of a black art. There was a lot of
confusion because of competing standards; as some anonymous observer noted, "The nice thing about standards
is that there are so many to choose from". Even the International Organization for Standardization (ISO) had
conflicting standards for some character sets.
Some languages are written from right to left, some from top to bottom, and in some the characters change
shape according to position within a word. Once upon a time those things did not pose any great problem in
Australia, but that has changed. The need to provide all kinds of information in multiple languages, the
requirements of commerce, and the Internet are all important factors in the push to automate all writing
systems and make them more easily available.
One of the first analyses of character sets for language automation was written by John Clews in 1988
(Language Automation Worldwide: The Development of Character Set Standards). He explains some of the
problems associated with standardising character sets, such as transliteration and sorting order.
Libraries have a particular interest in transliteration; they wanted a neat way of transcribing book titles
into Latin characters that represented the original sound. However, non-Latin writing systems do not follow
the same sequence of consonants and vowels as in English (or French, German, etc.). Transliteration-oriented
character sets were often incompatible with proper sorting. Cyrillic is a particular case where the problem
took years to resolve.
Clews mooted the introduction of a multi-byte character set that would cover all writing systems. In late
1990 the ISO issued a draft standard, DIS 10646, which proposed a four-byte code with room for over four
billion characters. Because of the need to reserve certain positions (such as 0-31) for control codes the
number of effective positions is much reduced, but there was still plenty of room for every writing system to
have its own dedicated space.
American vendors flexed their collective muscle to sink DIS 10646 by forming the Unicode Consortium.
Unicode is a 2-byte system that provides 65,536 (2^16) code places. The argument was that many
writing systems share common characters, and - so to speak - a common pool should be created.
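As a rough check on the figures quoted for the competing proposals, the raw size of a code space follows directly from its width in bytes. The sketch below is an illustration only: the function name code_places is invented here, and the counts ignore any positions reserved for control codes.

```python
def code_places(num_bytes: int) -> int:
    """Return the number of distinct codes a fixed-width code can hold."""
    return 2 ** (8 * num_bytes)


# Compare the single-byte matrix, the 2-byte Unicode proposal,
# and the 4-byte DIS 10646 proposal.
for width in (1, 2, 4):
    print(f"{width}-byte code: {code_places(width):,} places")
# 1-byte code: 256 places
# 2-byte code: 65,536 places
# 4-byte code: 4,294,967,296 places
```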
In particular, the Chinese/Japanese/Korean (CJK) writing systems share the use of what are generally called
Chinese characters. Unicode did not allow for the different way in which many of those characters are
rendered, or for the very large number of Chinese characters (presently in the order of 85,000) and the fact
that new ones continue to be created. According to The Unicode Standard, Unicode provides for 27,484,
but Ken Lunde (in CJKV Information Processing) refers to "the standard set of 20,902 Chinese
characters". Whatever the figure, it is still short of the 40,000 or so found in better dictionaries.
In a compromise between the ISO and the Unicode Consortium, ISO 10646-1993 was published with the Unicode
standard included as a subset. That is to say, the original 4-byte proposal is retained (but not implemented)
and a second tier (Unicode) inserted; it is also known as UCS-2 (2-byte Universal Character Set), and the
Basic Multilingual Plane (BMP).
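To make the two tiers concrete, the following sketch encodes three sample characters in a two-byte and a four-byte form. It assumes Python's utf-16-be and utf-32-be codecs as stand-ins for UCS-2 and UCS-4 (for characters in the BMP the byte values are the same); the sample characters are arbitrary choices.

```python
# Three BMP characters: one Latin, one Greek, one CJK ideograph.
samples = {
    "Latin capital A": "A",
    "Greek small alpha": "\u03b1",
    "CJK ideograph": "\u4e2d",
}

widths = {}
for label, ch in samples.items():
    two_byte = ch.encode("utf-16-be")    # stand-in for UCS-2
    four_byte = ch.encode("utf-32-be")   # stand-in for UCS-4
    widths[label] = (len(two_byte), len(four_byte))
    print(f"{label}: U+{ord(ch):04X}  "
          f"2-byte {two_byte.hex()}  4-byte {four_byte.hex()}")
```

For each of these characters the four-byte form is simply the two-byte form with two leading zero bytes.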
It seems unlikely that UCS-4 (the 4-byte Universal Character Set) will be implemented in the near future,
which means we have to live, for better or worse, with Unicode. Several standards, including HTML and XML, are
Unicode compliant; Windows NT/2000 uses Unicode instead of ASCII; Java and Delphi are Unicode-ready; and
there are some Unicode-enabled applications. Apple's Macintosh operating system has been Unicode-enabled for
some time, which put it well ahead of PCs in the multilingual field. That gap is now closing, but there
is not much documentation for users or developers.
For anyone with a serious interest in character set standards and their application, there are two important
books. One deals with Unicode and the other with what has become known as CJKV (Chinese, Japanese,
Korean, and Vietnamese).
The Unicode Standard Version 3.0
Published by Addison-Wesley for the Unicode Consortium, The Unicode Standard Version 3.0 is in A4
format, runs to over a thousand pages, and comes with a CD that contains substantial additional information
and data.
Apart from CJK ideographs and syllabaries, the current Unicode Standard includes Latin, Runic, Ogham, Greek,
Cyrillic, Glagolitic, Georgian, Arabic, Armenian, Hebrew, Syriac, Thaana, languages of the Indian
sub-continent, Tibetan, Thai, Lao, Khmer, Myanmar, Mongolian, Ethiopic, technical and mathematical symbols,
currency symbols, dingbats, Braille, and the International Phonetic Alphabet.
Each is fully described with a table showing the glyphs (what each character looks like). There are variant
glyphs, such as the differences between italic "a" and roman "a". The term, roman, usually
means "upright" (as opposed to "italic"), but is sometimes used to mean "with serifs".
The glyphs contained in The Unicode Standard are a valuable resource for anyone who wants to create or
vary a character set.
Apart from the glyphs there are complete lists of the names of all the characters. Standardised naming is
important for a number of reasons; if someone asks about "Greek small letter alpha with varia and
ypogegrammeni" one can find it in the index from where the glyph can be located with details of how the
character is formed (it is Greek alpha with a grave accent above and a small iota below). Just
think of the potential for one-upmanship if you can say, "Shouldn't that have been written using a capital
omega with dasia and oxia", or to be able to recite the Tibetan alphabet.
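The same kind of name lookup can be sketched with Python's standard unicodedata module. This is an illustration, not the book's index: Python ships a later version of the Unicode character database than 3.0, and the variable names are invented here.

```python
import unicodedata

# Look up the two characters mentioned above by their standard names.
alpha = unicodedata.lookup(
    "GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI")
omega = unicodedata.lookup(
    "GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA")
print(f"U+{ord(alpha):04X}", f"U+{ord(omega):04X}")

# Break the alpha into its base letter and marks to see how it is formed.
parts = unicodedata.normalize("NFD", alpha)
part_names = [unicodedata.name(c) for c in parts]
for name in part_names:
    print(name)
```

The decomposition lists the base alpha, a combining grave accent (the varia), and a combining ypogegrammeni (the small iota below).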
Many characters are made of a base letter with additional marks of various kinds that, in English, are
commonly called accents. In an ideal typographic world each combination would be a character in its own right
(and that is why a 4-byte system was proposed). Because Unicode does not have room for a full repertoire of
composed characters, users have to make do with non-spacing marks, which is much the same as dead keys on a
typewriter. What happens is that the user enters the base character and then the non-spacing mark; the effect
is that the mark is printed in the same space as that of the base character.
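The base-then-mark entry order can be sketched in a few lines. The example assumes Python's unicodedata module and uses U+0303 (COMBINING TILDE) as the non-spacing mark; the choice of "a" with a tilde is only for illustration.

```python
import unicodedata

base = "a"
tilde = "\u0303"           # COMBINING TILDE, a non-spacing mark
sequence = base + tilde    # entered as base character, then mark

# The sequence is two codes, but is meant to occupy one character space.
print(len(sequence), unicodedata.category(tilde))

# Where a ready-made character exists, normalisation (NFC) substitutes it.
composed = unicodedata.normalize("NFC", sequence)
print(len(composed), f"U+{ord(composed):04X}", unicodedata.name(composed))
```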
Non-spacing is achieved by use of a negative offset. For example, the tilde on your keyboard prints as an
ordinary character, but by changing the offset to a negative value (usually in the order of -412) the tilde
will print to the left of an ordinary character position. Thus, type 'a' followed by the modified tilde and,
voilà, the "a" has a tilde above it. At least, that's the theory. Try it with an "i" and the tilde is
not accurately centred. There are (complicated) ways around that problem, but it is interesting that Donald
Knuth had it all sorted out as far back as 1977 when he introduced the first version of TeX.
Unicode provides separate combined characters for all the European languages that use a Latin alphabet, but
falls back to what are called composed characters for a number of other languages. Greek and
Vietnamese are two prime examples. The Unicode Standard contains extensive explanations, mainly for
developers, of how composing works. A letter with a single mark is easy to handle (unless it is thin), but
multiple marks are by no means uncommon and can present some complexity.
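As a small illustration of multiple marks, the sketch below takes one Vietnamese letter that carries both a circumflex and a dot below, and shows that the two marks can be entered in either order with the same result. It relies on Python's unicodedata module, whose data is later than Unicode 3.0; the letter chosen is just an example.

```python
import unicodedata

# Vietnamese "e" with a circumflex above and a dot below.
letter = "\u1ec7"
parts = unicodedata.normalize("NFD", letter)
mark_names = [unicodedata.name(c) for c in parts]
for name in mark_names:
    print(name)

# Marks typed in either order normalise to the same character.
circumflex_first = "e" + "\u0302" + "\u0323"
dot_first = "e" + "\u0323" + "\u0302"
same = (unicodedata.normalize("NFC", circumflex_first)
        == unicodedata.normalize("NFC", dot_first)
        == letter)
print(same)
```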
It is easy to create a custom character set using an application such as Fontographer. The problem is
that it has either to be embedded in any document created for distribution, or the recipients have to be
provided with a copy of the font. Adobe's Acrobat has an option for font embedding, but embedding has
two drawbacks: it makes larger files, and the fonts can be extracted. There is presently some concern about
the ease with which proprietary fonts can be lifted from PDF files; methods of protection were being
investigated when I last discussed the problem with font foundries. However, that is another subject.
Unicode does set aside space for custom character sets, which might include logos and other special symbols.
The idea is to enable corporate users to remain within Unicode and have in-house character sets, and to
provide for scientific communities (amongst others) to have their own specialist character sets.
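The reserved space can be inspected programmatically. The sketch below assumes the Private Use Area of the Basic Multilingual Plane, U+E000 to U+F8FF (a range taken from the Unicode Standard, not from the text above), and checks that Python's unicodedata module reports every position in it as private use (category "Co").

```python
import unicodedata

# Private Use Area in the Basic Multilingual Plane.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF

pua_size = PUA_LAST - PUA_FIRST + 1
all_private = all(
    unicodedata.category(chr(code)) == "Co"
    for code in range(PUA_FIRST, PUA_LAST + 1)
)
print(f"{pua_size:,} private-use positions, all category Co: {all_private}")
```

What appears at any of these positions depends entirely on the font in use, so an in-house character set still has to be distributed along with its font.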
The most comprehensive single source of information about matters affecting the application of Unicode is
The Unicode Standard; the present edition is, for a hard-covered book of its size and scope of content,
well-priced. It contains the detailed specifications for Unicode; implementation guidelines; a character
database (on CD); character mappings used by various vendors and national standards; and technical reports
covering topics such as sorting, compression, and XML.
The Unicode Consortium: The Unicode Standard Version 3.0
ISBN 0-201-61633-5
Published by Addison-Wesley, 1040 pp. + CD,
RRP $79.95
|

|
CJKV Information Processing
The author of this book, Ken Lunde, is well-known to everyone who deals with the complexities of Japanese text
processing; since 1991 he has been with Adobe and - at the time of writing the book - was Manager of CJKV Type
Development.
The CJKV in the title stands for Chinese, Japanese, Korean, Vietnamese. The use of Chinese characters is
common to the written form of all four languages, although that usage is no longer common in Vietnamese.
Vietnamese was originally written using characters adapted from Chinese; the French introduced the phonetic
Latin form of writing that is now standard. The Chinese characters one sees in shop signs, writing, and
printed material are used by people of Chinese ethnic origin who speak and write in a Chinese
dialect.
Pre-French Chinese and native Vietnamese characters are found in historical and old family records, but for
all practical purposes Vietnamese is now written using modified Latin.
It is complicated by the addition of two consonants, a crossed upper case "D" (code D0h in ISO 8859/2) and a crossed
lower case "d" (code F0h in ISO 8859/2). As well there are twelve base characters, but there is not space
here to describe them fully. The result is that the base alphabet runs to forty characters.
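The two code positions can be checked with Python's built-in iso8859_2 codec. This is only a quick sketch; the variable names are invented for the illustration.

```python
import unicodedata

# Decode the single bytes D0h and F0h as ISO 8859-2.
upper = bytes([0xD0]).decode("iso8859_2")
lower = bytes([0xF0]).decode("iso8859_2")

for ch in (upper, lower):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0110 LATIN CAPITAL LETTER D WITH STROKE
# U+0111 LATIN SMALL LETTER D WITH STROKE
```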
There is more confusion: Vietnamese is a tonal language with five tone marks. It all adds up to characters
that can have quite a bird's nest of marks above them as well as one of the tone marks (a dot) below.
It is possible to squeeze the lot into an 8-bit matrix, which is one of the standard solutions used in
Vietnam. The problem is to know which standard has been used in the creation of any given document. Ken Lunde
provides the most complete account of relevant standards to be found in English language literature.
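The difficulty of not knowing which 8-bit standard was used can be shown with a single byte. The sketch below decodes the same byte under three encodings that happen to ship with Python; they were chosen for convenience and are not necessarily the standards used in Vietnam, although cp1258 is the Windows Vietnamese code page.

```python
# One byte, three possible readings depending on the assumed encoding.
raw = bytes([0xD5])
encodings = ("latin_1", "iso8859_2", "cp1258")

decoded = {name: raw.decode(name) for name in encodings}
for name, ch in decoded.items():
    print(f"{name:10} -> U+{ord(ch):04X} {ch}")

# The readings disagree, so the encoding has to be known in advance.
distinct = len(set(decoded.values()))
print(distinct, "different characters from the same byte")
```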
Vietnamese looks complicated to those who are used to our plain, vanilla, 26-character alphabet. Chinese,
Japanese, and Korean can be even more complex, largely because of different national systems that have been
introduced over time, and various writing directions: left-to-right, and top-to-bottom. There are many
typographical issues that have to be addressed according to writing direction, and this is the only text I
have seen in which they are addressed.
I was once asked to translate information on an identification plate from a piece of (unseen) Japanese
military equipment; it was totally incomprehensible until I realised it dated from a period when horizontal
text was written right-to-left.
The best single resource for learning about CJKV information processing is Ken Lunde's book. It is an
essential resource and reference for anyone involved in developing information systems that use any of those
languages, including publishing (online or print).
Chapters cover: Writing Systems; Character Set Standards; Encoding Methods; Input Methods; Output Methods;
Font Formats; Typography; Information Processing Techniques; Operating Systems, Text Editors, and Word
Processors; Dictionaries and Dictionary Software; The Internet; and The Web. There are twenty-three
appendices that include a number of encoding tables, software and document sources, mailing lists,
professional organisations, Perl code examples, and a glossary.
Ken Lunde: CJKV Information Processing
ISBN 1-56592-224-7
Published by O'Reilly,
1101 pp., RRP $150.00
|

|
Reprinted from the August 2000 issue of PC Update, the
magazine of Melbourne PC User Group, Australia
|