COL_SEQ_8(5)
NAME
col_seq_8 - Collating sequence table for languages with 8-bit character sets
HP-UX COMPATIBILITY
Level: HP-UX/STANDARD
Origin: HP
Native Language Support:
8-bit data, customs
DESCRIPTION
There are four language-dependent collation algorithms for European languages. These algorithms are:
2-to-1 Conversions: Some languages, like Spanish, require two adjacent characters to occupy one position in the collating sequence. Examples are "CH" (which follows "C") and "LL" (which follows "L").
1-to-2 Conversions: Some languages, like German, require one character (e.g. "sharp S") to occupy two adjacent positions in the collating sequence.
Don’t care Characters: Some languages designate certain characters to be ignored in character comparisons. For example, if "-" is a "Don’t Care" character, then the strings "REACT" and "RE-ACT" compare as equal.
Case and Accent Priority: Many languages require a "two pass" collating algorithm: in pass one, the accents are stripped off the letters and the resulting two strings are compared; if they are equal, a second pass with the accents back in place is performed to break the tie. The case of letters may also be used in this fashion.
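The "Don’t care" and two-pass rules can be illustrated with a small sketch. The table below is a hypothetical toy mapping invented for illustration (TOY_TABLE and collation_key are not part of col_seq_8), and it assumes each character has already been resolved to a sequence number and a priority. Comparing the resulting keys compares all sequence numbers first and uses the priorities only to break ties.

```python
# Hypothetical toy table: character -> (sequence number, priority).
# A sequence number of None marks a "Don't care" character.
TOY_TABLE = {
    "a": (1, 0), "A": (1, 1), "á": (1, 2),
    "b": (2, 0), "B": (2, 1),
    "-": (None, 0),
}


def collation_key(text):
    """Build a two-pass sort key for text.

    Pass one compares the sequence numbers (accents and case ignored).
    Pass two compares the priorities, and only matters on a tie.
    """
    # Drop "Don't care" characters before building the key.
    pairs = [TOY_TABLE[ch] for ch in text if TOY_TABLE[ch][0] is not None]
    sequences = tuple(seq for seq, _ in pairs)
    priorities = tuple(prio for _, prio in pairs)
    # Tuples compare element by element, so priorities are consulted
    # only when the sequence tuples are identical.
    return (sequences, priorities)
```

With this key, sorted(words, key=collation_key) orders strings by their unaccented letters and falls back to accent and case only when those are identical.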
The table has four sections: a file header, a sequence table, a 2-to-1 mapping table, and a 1-to-2 mapping table.
| Header |
| Sequence Table |
| 2-to-1 Mapping Table |
| 1-to-2 Mapping Table |
Lengths and pointers are in units of two bytes.
Header:
| Offset | Byte 0 | Byte 1 |
| 0 | Table Length ||
| 2 | Language Id Number ||
| 4 | Reserved ||
| 6 | Pointer to Sequence Table ||
| 8 | Length of Sequence Table ||
| 10 | Pointer to 2-to-1 Mapping Table ||
| 12 | Length of 2-to-1 Mapping Table ||
| 14 | Pointer to 1-to-2 Mapping Table ||
| 16 | Length of 1-to-2 Mapping Table ||
| 18 | Lowest Char | Highest Char |
| 20 | Reserved ||
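As a rough sketch, the 22-byte header could be read as shown here. Two points are assumptions not stated on this page: that the 16-bit fields are stored big-endian, and that pointers count two-byte units from the start of the table. The names Header, parse_header and byte_offset are made up for the example.

```python
import struct
from collections import namedtuple

# Field names are illustrative; they follow the header layout row by row.
Header = namedtuple(
    "Header",
    "table_length language_id reserved1 "
    "seq_ptr seq_len two_to_one_ptr two_to_one_len "
    "one_to_two_ptr one_to_two_len lowest_char highest_char reserved2",
)

# Assumption: big-endian byte order. Nine 16-bit words (offsets 0-16),
# two single bytes (offset 18), one 16-bit word (offset 20).
HEADER_FORMAT = ">9H2BH"


def parse_header(data):
    """Unpack the fixed-size header from the start of the table."""
    return Header._make(struct.unpack_from(HEADER_FORMAT, data, 0))


def byte_offset(word_pointer):
    """Convert a pointer or length in two-byte units to a byte count.

    Assumption: pointers are relative to the start of the table.
    """
    return word_pointer * 2
```

For a file image held in data, byte_offset(parse_header(data).seq_ptr) would then give the byte position of the sequence table under these assumptions.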
Sequence Table:
| Sequence Entry 0 |
| Sequence Entry 1 |
| (other entries from 2-254) |
| Sequence Entry 255 |
The byte value of a character is used as an index into the sequence table.
Sequence Entry Format: Each entry in the sequence table above uses two bytes and has one of the following formats:
| First Byte (bits 15-8) | Second Byte (bits 7-6) | Second Byte (bits 5-0) | Format Type |
| - | - | - | - |
| 0 | 00 | 0 | don’t-care characters |
| sequence no. | 00 | priority | all 1-to-1 mapped characters w/o priority |
| sequence no. | 01 | index | 2-to-1 mapped characters |
| sequence no. (1st char) | 10 | index | 1-to-2 mapped characters |
The 6-bit index indexes into either the 2-to-1 or the 1-to-2 mapping table.
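The bit layout of a sequence entry can be unpacked with a few shifts and masks. This is only a sketch: it takes the entry as an already-assembled 16-bit integer, and the function name and the returned labels are invented for the example.

```python
def decode_entry(entry):
    """Split a 16-bit sequence entry into (kind, sequence_no, low6).

    low6 is the priority for 1-to-1 entries and the mapping-table
    index for 2-to-1 and 1-to-2 entries.
    """
    sequence_no = (entry >> 8) & 0xFF   # bits 15-8
    type_bits = (entry >> 6) & 0x3      # bits 7-6
    low6 = entry & 0x3F                 # bits 5-0

    if entry == 0:
        # An all-zero entry marks a don't-care character.
        return ("dont_care", 0, 0)
    if type_bits == 0b00:
        return ("1-to-1", sequence_no, low6)
    if type_bits == 0b01:
        return ("2-to-1", sequence_no, low6)
    if type_bits == 0b10:
        return ("1-to-2", sequence_no, low6)
    # The page does not define type bits 11.
    raise ValueError("undefined entry type bits 11")
```

Because the character's byte value indexes the sequence table, the entry for a character c is the c-th 16-bit word of that table.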
Mapping Table for 2-to-1 Mapped Characters
| 2-to-1 Mapping Table |
| Entry Pair 1 |
| Entry Pair 2 |
| (other entry pairs) |
| Entry Pair n |
| Sequence Entry Format for Mapped Pairs | ||
| Byte 0 | Byte 1 | |
| 0 | Legal Char 1 | |
| Sequence Entry for This Pair | ||
| (other entries for this pair) | ||
| 0 | Legal Char n | |
| Sequence Entry for This Pair | ||
| Sentinel: -1 | ||
| 00 | priority | |
For each character that can begin a 2-to-1 combination, the table lists its "legal" successor characters. "Legal" means that the combination of the two characters is treated as a single character. If the following character matches a legal successor, the corresponding sequence entry is used for the pair. If no legal successor is found in the table, the character is treated according to 1-to-1 mapping: the priority in the last entry, combined with the sequence number of the character, forms the sequence entry.
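The 2-to-1 lookup might be sketched as follows. Several details are assumptions rather than facts from this page: the mapping table is taken as a list of 16-bit words, the sentinel -1 is taken to be stored as 0xFFFF, and start is taken to be the position of the character's first (legal char, sequence entry) word pair. The function name is hypothetical.

```python
SENTINEL = 0xFFFF  # assumption: -1 stored as an unsigned 16-bit word


def resolve_two_to_one(words, start, sequence_no, next_char):
    """Return (sequence_entry, characters_consumed).

    words       -- 2-to-1 mapping table as a list of 16-bit integers
    start       -- position of the first word pair for this character
    sequence_no -- sequence number from the character's own entry
    next_char   -- byte value of the following character, or None
    """
    i = start
    while words[i] != SENTINEL:
        legal_char = words[i] & 0xFF      # byte 1 holds the legal successor
        if next_char is not None and legal_char == next_char:
            # Legal pair: both characters share one sequence entry.
            return words[i + 1], 2
        i += 2                            # skip to the next word pair
    # No legal successor: fall back to 1-to-1 mapping, using the
    # priority stored in the word after the sentinel.
    priority = words[i + 1] & 0x3F
    return (sequence_no << 8) | priority, 1
```

The second element of the result tells the caller how many input characters the entry covers, so a scanning loop can advance by one or two positions.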
Mapping Table for 1-to-2 Mapped Characters
| 1-to-2 Mapping Table |
| Sequence Entry |
| Sequence Entry |
| (other sequence entries) |
| Sequence Entry |
Entries in the 1-to-2 mapping table have the same format as entries in the sequence table. The sequence number of the first character is known from the entry in the sequence table. The sequence number of the second character is found in the 1-to-2 mapping entry, and the priority is used for both characters.
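A 1-to-2 expansion could be sketched like this, assuming the mapping entry is available as a 16-bit integer with the same bit layout as a sequence entry. The function name and the sample numbers in the usage note are illustrative only.

```python
def expand_one_to_two(first_sequence_no, mapping_entry):
    """Expand one character into two (sequence_no, priority) positions.

    first_sequence_no -- sequence number from the sequence table entry
    mapping_entry     -- 16-bit entry from the 1-to-2 mapping table
    """
    second_sequence_no = (mapping_entry >> 8) & 0xFF  # bits 15-8
    priority = mapping_entry & 0x3F                   # bits 5-0
    # The same priority applies to both generated positions.
    return [
        (first_sequence_no, priority),
        (second_sequence_no, priority),
    ]
```

For instance, if a hypothetical table gave "S" the sequence number 0x53, a "sharp S" whose mapping entry is 0x5302 would expand to two positions with sequence number 0x53 and priority 2.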
SEE ALSO
Hewlett-Packard — last mod. May 11, 2021