INTRO_INTERNATIONALIZATION(7P) — SUNPHIGS LIBRARY

NAME

INTRO_INTERNATIONALIZATION − introduction to the internationalization extensions to the SunPHIGS graphics library

DESCRIPTION

This document describes changes made to SunPHIGS to support internationalized applications. Such applications can be easily adapted to a specific language, that is, localized. Internationalized applications typically have a core that is constant across locales, and locale-specific modules that contain language-specific information.

Areas of SunPHIGS applications that require internationalization are: the character sets used for string device input and echoing, the formatting of numbers and dates in output, and the character sets for text output. SunOS is being internationalized and will provide support for non-English string input and echoing, and localized formatting of numbers and dates. These localizations will be based on environment variables and files which will indicate the installation’s locale. The internationalization extensions to SunPHIGS will allow applications to produce text primitives in the correct character sets for the locale.

OVERVIEW

The internationalization extensions provide the ability to switch between different languages or character sets, and to represent and display characters from very large character sets. The extensions are in three main areas: an extended character encoding, called EUC; a new text attribute to select several simultaneous character sets; and a text attribute that extends the text font attribute. The attributes are supported through two GENERALIZED STRUCTURE ELEMENTs and several ESCAPEs.

DEFINITIONS

The following terms are used throughout this document:

Character Set

Character sets may be composed of alphabets, ideograms or other symbols. ASCII and Kanji (Japanese) are examples of character sets. As a text attribute, the character set determines the graphical characters available for text output, but not their stylistic appearance.

Font

The shapes of the symbols representing each character in a character set. In SunPHIGS, graphical characters in different character sets with the same font will have similar styles. Monospaced, Simplex and Italic are examples of font styles. As a text attribute, the font determines the style in which graphical characters are displayed.

Codeset

A character encoding that designates the character set and font attribute to be applied to the character. This allows characters from several different character sets to be differentiated within a string. There is a character set attribute and font attribute associated with each codeset.

Cswidth

The width of a character set, which is the number of bytes needed to specify a character in that character set. ASCII has a cswidth of 1; Japanese Kanji has a cswidth of 2. Cswidth is a characteristic of the character set and is constant.

EUC - EXTENDED UNIX CODE

The Extended UNIX Code (EUC) defines an encoding for codesets that allows multi-byte characters as well as the mixing of several different character sets within a string. EUC encoding uses bit patterns and control characters to separate characters within a string into four codesets, numbered 0 to 3. In SunPHIGS, each codeset has associated character set and font attributes. For example, a character that matches the encoding pattern for codeset 1 is in codeset 1, and it is rendered with the character set and font associated with that codeset.

EUC encoding uses the most significant bit (MSB) of the character byte and two special characters, SS2 and SS3, to map the characters in the input string into the codesets:

Codeset 0 is defined as characters with the MSB set to 0. This is normal 7-bit ASCII character encoding.

Codeset 1 is defined as characters with the MSB set to 1. This is equivalent to a logical OR of the character with the octal value \200: ASCII A (octal \101) encoded in codeset 1 is \301 (\101 OR \200).

Codeset 2 is defined as characters with the MSB set, and preceded by a single shift byte, called SS2. SS2 has the value 0x8E (octal \216). The characters are otherwise like codeset 1 characters. ASCII A in codeset 2 is \216\301.

Codeset 3 is defined as characters with the MSB set, and preceded by a single shift byte, called SS3. SS3 has the value 0x8F (octal \217). The characters are otherwise like codeset 1 characters. ASCII A in codeset 3 is \217\301.

Because EUC uses the MSB of characters to determine the codeset, applications must not use this bit for purposes other than controlling the codeset of the character.

Some character sets such as Japanese Kanji have a cswidth of two: two bytes are required to specify one character for display. In these cases, the second byte is encoded the same way as the first. For example, if a Kanji character is \101\102 in codeset 0, it will be \216\301\302 in codeset 2.
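The encoding rules above can be sketched in C. This is a minimal illustration, not part of SunPHIGS: the function name euc_encode and the macros EUC_SS2 and EUC_SS3 are hypothetical, and the input is assumed to be the 7-bit bytes of one character.

```c
#include <assert.h>
#include <stddef.h>

#define EUC_SS2 0x8E  /* single shift 2, octal \216 (hypothetical macro name) */
#define EUC_SS3 0x8F  /* single shift 3, octal \217 (hypothetical macro name) */

/*
 * Encode one character, given as n 7-bit bytes (n equals the cswidth of
 * its character set), into the given codeset. The buffer out must have
 * room for n + 1 bytes. Returns the number of bytes written.
 */
size_t euc_encode(int codeset, const unsigned char *ch, size_t n,
                  unsigned char *out)
{
    size_t len = 0;

    assert(codeset >= 0 && codeset <= 3);

    /* Codesets 2 and 3 are introduced by a single shift byte. */
    if (codeset == 2)
        out[len++] = EUC_SS2;
    if (codeset == 3)
        out[len++] = EUC_SS3;

    for (size_t i = 0; i < n; i++) {
        assert((ch[i] & 0x80) == 0);  /* input bytes must be 7-bit */
        /* Codeset 0 keeps the MSB clear; codesets 1 to 3 set it. */
        out[len++] = (codeset == 0) ? ch[i] : (unsigned char)(ch[i] | 0x80);
    }
    return len;
}
```

Under these assumptions, encoding the two bytes \101\102 into codeset 2 yields the three bytes \216\301\302, matching the Kanji example above.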

Because EUC supports multi-byte character sets, there is no longer a direct relationship between the number of bytes in a string and the number of characters it represents. The cswidths of the character sets active in the encoding affect the interpretation of the input characters, and the relationship between the number of input characters and output characters. For example, if the character set associated with codeset 2 has a cswidth of 2, the string \216\301\302 will be interpreted as the single character \101\102 in codeset 2. If the character set associated with codeset 2 has a cswidth of 1, the string \216\301\302 will be interpreted as \101 in codeset 2 followed by \102 in codeset 1.
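The dependence on cswidth can be illustrated with a small decoder. This is a sketch only: the type EucChar and the function euc_next are hypothetical names, and the error handling (returning 0 for truncated or inconsistent input) is an assumption, since this page does not describe how SunPHIGS treats malformed strings.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int           codeset;  /* 0 to 3 */
    unsigned char code[2];  /* character bytes with the MSB cleared */
    int           width;    /* number of valid bytes in code */
} EucChar;

/*
 * Decode the character that starts at s[0]. cswidth[i] is the cswidth
 * of the character set associated with codeset i. Returns the number of
 * bytes consumed, or 0 if the input is empty, truncated, or a byte does
 * not carry the MSB expected for its codeset.
 */
size_t euc_next(const unsigned char *s, size_t len, const int cswidth[4],
                EucChar *out)
{
    size_t pos = 0;

    if (len == 0)
        return 0;

    if (s[0] == 0x8E) {         /* SS2 introduces codeset 2 */
        out->codeset = 2;
        pos = 1;
    } else if (s[0] == 0x8F) {  /* SS3 introduces codeset 3 */
        out->codeset = 3;
        pos = 1;
    } else {
        out->codeset = (s[0] & 0x80) ? 1 : 0;
    }

    int w = cswidth[out->codeset];
    assert(w == 1 || w == 2);
    if (pos + (size_t)w > len)
        return 0;

    for (int i = 0; i < w; i++) {
        unsigned char b = s[pos + i];
        int msb_set = (b & 0x80) != 0;
        if (msb_set != (out->codeset != 0))
            return 0;
        out->code[i] = b & 0x7F;
    }
    out->width = w;
    return pos + (size_t)w;
}
```

With cswidth[2] equal to 2, the string \216\301\302 decodes in one call as a single codeset 2 character. With cswidth[2] equal to 1, the first call consumes \216\301 and a second call on the remaining byte reports \102 in codeset 1.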

The complete encoding can be represented as follows:

EUC Name	EUC bit pattern for 1-Byte Character Set	EUC bit pattern for 2-Byte Character Set
Codeset 0	0xxxxxxx	0xxxxxxx 0xxxxxxx
Codeset 1	1xxxxxxx	1xxxxxxx 1xxxxxxx
Codeset 2	SS2 1xxxxxxx	SS2 1xxxxxxx 1xxxxxxx
Codeset 3	SS3 1xxxxxxx	SS3 1xxxxxxx 1xxxxxxx

EXAMPLE

Many Japanese applications use EUC to encode English, Kanji, and Katakana (phonetic Japanese) characters within a single string. The codesets are often associated with character sets as follows:

Codeset 0 - ASCII

Codeset 1 - Kanji characters, each pair of input characters specifies one Kanji character

Codeset 2 - Katakana characters

Codeset 3 - Implementation dependent

Codeset 3 could be used for an additional character set, such as one for special symbols.
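For the Japanese configuration above, the number of characters in a string can be counted as in the following sketch. The function name euc_count_chars is hypothetical, the cswidth array is supplied by the caller, and a return value of -1 for a truncated string is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the characters in an EUC string. cswidth[i] is the cswidth of
 * the character set associated with codeset i. Returns -1 if the last
 * character is truncated. Only the first byte of each character is
 * inspected; the remaining bytes are skipped without validation.
 */
long euc_count_chars(const unsigned char *s, size_t len,
                     const int cswidth[4])
{
    long   count = 0;
    size_t pos = 0;

    while (pos < len) {
        int codeset;

        if (s[pos] == 0x8E) {         /* SS2: codeset 2 */
            codeset = 2;
            pos++;
        } else if (s[pos] == 0x8F) {  /* SS3: codeset 3 */
            codeset = 3;
            pos++;
        } else {
            codeset = (s[pos] & 0x80) ? 1 : 0;
        }

        assert(cswidth[codeset] == 1 || cswidth[codeset] == 2);
        pos += (size_t)cswidth[codeset];
        if (pos > len)
            return -1;
        count++;
    }
    return count;
}
```

With cswidths of 1, 2, 1, and 1 for codesets 0 to 3, a five-byte string holding one ASCII character, one two-byte Kanji character, and one Katakana character preceded by SS2 counts as three characters.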

USING INTERNATIONALIZED TEXT

SunPHIGS supports two new attributes in the PHIGS traversal state list: character sets and extended fonts. The character sets attribute is the set of four character set indices associated with the EUC codesets. The extended fonts attribute is the set of four font indices associated with the EUC codesets. The values of each of these attributes may be set using GENERALIZED STRUCTURE ELEMENTs.

CHARACTER SETS

The character sets attribute is set using the set character set for codeset GENERALIZED STRUCTURE ELEMENT (GSE). This GSE is used to associate a character set with one of the codesets. When this element is traversed, the character set entry for the specified codeset in the PHIGS traversal state list is set to the specified value. See GENERALIZED STRUCTURE ELEMENT for more specific information.

The character sets supported by a workstation type can be inquired using ESCAPE -7, inquire character set facilities. The character sets available are defined as constants in phigs.h and PARAMETER statements in phigs77.h:

Value	C Name	FORTRAN Name	Cswidth	Character Set
1	PCS_ASCII	PCSASCII	1	ISO-646 (ASCII)
-1	PCS_GREEK	PCSGREEK	1	Greek
-2	PCS_SYMBOL	PCSSYMBOL	1	Symbol
-3	PCS_CARTOGRAPHIC	PCSCARTOGRAPHIC	1	Cartographic
-4	PCS_KANJI	PCSKANJI	2	JIS-X0208 (Japanese Kanji, formerly JIS-C6226)
-5	PCS_KATAKANA	PCSKTKANA	1	Katakana (Japanese Phonetic)
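An application that interprets EUC strings itself needs the cswidth of each character set. The sketch below copies the values and cswidths from the table into local stand-in constants; the EX_CS_ names and both functions are hypothetical, and a real program would use the constants from phigs.h instead.

```c
#include <assert.h>

/* Local stand-ins for the character set values listed in the table. */
enum {
    EX_CS_ASCII        =  1,
    EX_CS_GREEK        = -1,
    EX_CS_SYMBOL       = -2,
    EX_CS_CARTOGRAPHIC = -3,
    EX_CS_KANJI        = -4,
    EX_CS_KATAKANA     = -5
};

/* Return the cswidth of a character set, or 0 for an unknown value. */
int charset_cswidth(int charset)
{
    switch (charset) {
    case EX_CS_ASCII:
    case EX_CS_GREEK:
    case EX_CS_SYMBOL:
    case EX_CS_CARTOGRAPHIC:
    case EX_CS_KATAKANA:
        return 1;
    case EX_CS_KANJI:
        return 2;
    default:
        return 0;
    }
}

/* Fill cswidth[i] for the character set associated with codeset i. */
void codeset_cswidths(const int charset[4], int cswidth[4])
{
    for (int i = 0; i < 4; i++) {
        cswidth[i] = charset_cswidth(charset[i]);
        assert(cswidth[i] != 0);  /* caller must pass values from the table */
    }
}
```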

EXTENDED FONTS

The extended fonts attribute is set using the set font for codeset GENERALIZED STRUCTURE ELEMENT. This GSE is used to associate a font with one of the codesets. When this element is traversed, the font index entry for the specified codeset in the PHIGS traversal state list is set to the specified value.

Each character set may support different fonts. A list of the fonts and precisions supported by a workstation type for a particular character set can be inquired using ESCAPE -8, inquire fonts for character set.

All the font indices have named constants defined in phigs.h and phigs77.h (FORTRAN) as shown below. Note that the constants denote the style of the text, and thus have the same names and values for different character sets.

The ASCII fonts available are as follows:

Value	C Name	FORTRAN Name	Style
1	PFONT_MONO	PFONTMONO	Monospaced
-2	PFONT_SIMPLEX	PFONTSIMPLEX	Simplex
-3	PFONT_DUPLEX	PFONTDUPLEX	Duplex
-4	PFONT_COMPLEX	PFONTCOMPLEX	Complex
-5	PFONT_TRIPLEX	PFONTTRIPLEX	Triplex
-6	PFONT_ITALIC_COMPLEX	PFONTITALICCMPLX	Italic Complex
-7	PFONT_ITALIC_TRIPLEX	PFONTITALICTRPLX	Italic Triplex
-10	PFONT_SCRIPT_SIMPLEX	PFONTSCRIPTSMPLX	Script Simplex
-11	PFONT_SCRIPT_COMPLEX	PFONTSCRIPTCMPLX	Script Complex

Additional fonts available for the ASCII character set for compatibility with previous releases are as follows:

Value	C Name	FORTRAN Name
-8	PFONT_GREEK_SIMPLEX	PFONTGREEKSMPLX
-9	PFONT_GREEK_COMPLEX	PFONTGREEKCMPLX
-12	PFONT_CARTOGRAPHIC	PFONTCARTO
-13	PFONT_SYMBOL	PFONTSYMBOL

The Greek fonts available are as follows:

Value	C Name	FORTRAN Name	Style
1	PFONT_MONO	PFONTMONO	Monospaced
-2	PFONT_SIMPLEX	PFONTSIMPLEX	Simplex
-4	PFONT_COMPLEX	PFONTCOMPLEX	Complex

The symbol fonts available are as follows:

Value	C Name	FORTRAN Name	Style
1	PFONT_MONO	PFONTMONO	Monospaced
-2	PFONT_SIMPLEX	PFONTSIMPLEX	Simplex

The cartographic fonts available are as follows:

Value	C Name	FORTRAN Name	Style
1	PFONT_MONO	PFONTMONO	Monospaced
-2	PFONT_SIMPLEX	PFONTSIMPLEX	Simplex

The Kanji fonts available are as follows:

Value	C Name	FORTRAN Name	Style
1	PFONT_MONO	PFONTMONO	Monospaced

The Katakana fonts available are as follows:

Value	C Name	FORTRAN Name	Style
1	PFONT_MONO	PFONTMONO	Monospaced

BUNDLED TEXT REPRESENTATION EXTENSIONS

Bundled text representations (in the workstation state list) have been extended to include a font for each codeset. The complete text representation now contains the following fields:

font for codeset 0 (font in normal text representation)
font for codeset 1 (an internationalization extension)
font for codeset 2 (an internationalization extension)
font for codeset 3 (an internationalization extension)
text precision
character expansion factor
character spacing
color index

ESCAPE -10, set extended text representation fonts, is used to set only the new fields. The font for codeset 0 field is set by the font field in the normal text representation. The font for codeset 0, text precision, character expansion factor, and character spacing fields can only be set by SET TEXT REPRESENTATION. The font for codeset 1, font for codeset 2, and font for codeset 3 fields are set only by ESCAPE -10, set extended text representation fonts.
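The division of the fields between the two functions can be modeled as follows. This is an illustrative data model only and does not call SunPHIGS: the struct and function names are hypothetical, and the color index is assumed to be set together with the other normal fields, because ESCAPE -10 sets only the new font fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one extended text representation. */
typedef struct {
    int    font[4];         /* font for codeset 0 to 3 */
    int    precision;
    double char_expansion;
    double char_spacing;
    int    color_index;
} ExtTextRep;

/* Models the effect of SET TEXT REPRESENTATION: fonts 1 to 3 are untouched. */
void model_set_text_rep(ExtTextRep *rep, int font, int precision,
                        double expansion, double spacing, int color_index)
{
    assert(rep != NULL);
    rep->font[0]        = font;
    rep->precision      = precision;
    rep->char_expansion = expansion;
    rep->char_spacing   = spacing;
    rep->color_index    = color_index;  /* assumption, see the text above */
}

/* Models the effect of ESCAPE -10: only the fonts for codesets 1 to 3 change. */
void model_set_ext_fonts(ExtTextRep *rep, const int fonts[3])
{
    assert(rep != NULL);
    for (int i = 0; i < 3; i++)
        rep->font[i + 1] = fonts[i];
}
```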

The current value of the fonts section of an extended text representation can be inquired using ESCAPE -11, inquire extended text representation fonts.

The predefined values for the fonts section of an extended text representation can be inquired using ESCAPE -12, inquire predefined extended text representation fonts.

When the current text font Aspect Source Flag (ASF) is set to BUNDLED, the fonts for all of the codesets will be taken from the workstation’s representation indicated by the current text index. When the current text font ASF is set to INDIVIDUAL, the fonts for all of the codesets will be taken from the workstation’s extended fonts attribute in the PHIGS traversal state list.
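The selection rule can be expressed as a short function. This is a sketch of the rule as stated above, not SunPHIGS code: the enum ExAsf and the function effective_fonts are hypothetical, and the two input arrays stand for the fonts of the representation selected by the current text index and the extended fonts attribute in the traversal state list.

```c
#include <assert.h>

typedef enum { EX_ASF_BUNDLED, EX_ASF_INDIVIDUAL } ExAsf;

/*
 * Choose the fonts used for codesets 0 to 3. One text font ASF governs
 * all four codesets, so the fonts are taken from a single source.
 */
void effective_fonts(ExAsf text_font_asf,
                     const int bundle_fonts[4],
                     const int individual_fonts[4],
                     int out[4])
{
    assert(text_font_asf == EX_ASF_BUNDLED ||
           text_font_asf == EX_ASF_INDIVIDUAL);

    const int *src = (text_font_asf == EX_ASF_BUNDLED)
                         ? bundle_fonts
                         : individual_fonts;
    for (int i = 0; i < 4; i++)
        out[i] = src[i];
}
```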

DEFAULT BEHAVIOR

The default character set for all codesets is 1. The default font for all codesets is 1. During traversal, if a font is not available in the specified character set, font 1 and that character set will be used. During traversal, if a specified character set is not available, font 1 and character set 1 will be used.
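The fallback rules can be sketched as a lookup over the font tables in EXTENDED FONTS. The names below are hypothetical, and the table assumes that every listed character set and font is available; in practice availability depends on the workstation type and would be inquired with ESCAPE -7 and ESCAPE -8.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int charset;
    int font;
} TextPair;

typedef struct {
    int        charset;
    const int *fonts;   /* font 1 is listed first for every set */
    size_t     nfonts;
} CharsetFonts;

/* Font values copied from the tables in EXTENDED FONTS. */
static const int ascii_fonts[]   = { 1, -2, -3, -4, -5, -6, -7,
                                     -8, -9, -10, -11, -12, -13 };
static const int greek_fonts[]   = { 1, -2, -4 };
static const int simplex_fonts[] = { 1, -2 };
static const int mono_fonts[]    = { 1 };

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

static const CharsetFonts available[] = {
    {  1, ascii_fonts,   COUNT(ascii_fonts)   },  /* ASCII */
    { -1, greek_fonts,   COUNT(greek_fonts)   },  /* Greek */
    { -2, simplex_fonts, COUNT(simplex_fonts) },  /* Symbol */
    { -3, simplex_fonts, COUNT(simplex_fonts) },  /* Cartographic */
    { -4, mono_fonts,    COUNT(mono_fonts)    },  /* Kanji */
    { -5, mono_fonts,    COUNT(mono_fonts)    },  /* Katakana */
};

/* Apply the default-behavior rules to a requested character set and font. */
TextPair resolve_text_pair(int charset, int font)
{
    TextPair result = { 1, 1 };  /* used when the character set is unavailable */

    for (size_t i = 0; i < COUNT(available); i++) {
        const CharsetFonts *cs = &available[i];

        if (cs->charset != charset)
            continue;
        assert(cs->nfonts > 0 && cs->fonts[0] == 1);

        /* Character set found: keep it, and keep the font only if listed. */
        result.charset = charset;
        for (size_t j = 0; j < cs->nfonts; j++)
            if (cs->fonts[j] == font)
                result.font = font;
        return result;
    }
    return result;
}
```

Under these assumptions, requesting the Simplex font (-2) with the Kanji character set (-4) resolves to font 1 with the Kanji character set, and requesting an unknown character set resolves to font 1 with character set 1.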

RELATION TO EXISTING FUNCTIONS

The behavior of existing PHIGS functions has been adapted to give compatible behavior when using the internationalization extensions. Specifically, SET TEXT FONT sets the font for codeset 0 to the specified value. Likewise, the font in SET TEXT REPRESENTATION sets the font entry for codeset 0 in the text representation. The inquiry INQUIRE TEXT EXTENT uses character set 1 for all the codesets, the specified font for codeset 0, and font 1 for codesets 1 to 3.

ESCAPE -9, inquire extended text extent, is provided to perform the same function as INQUIRE TEXT EXTENT using the character sets and fonts for all four codesets.
