⇒ col_seq_8(5) — HP-UX 5.00

COL_SEQ_8(5)

NAME

col_seq_8 - Collating sequence table for languages with 8-bit character sets

HP-UX COMPATIBILITY

Level: HP-UX/STANDARD

Origin: HP

Native Language Support:
8-bit data, customs

DESCRIPTION

There are four language-dependent collation algorithms for European languages:
 
2-to-1 Conversions: Some languages, like Spanish, require two adjacent characters to occupy one position in the collating sequence. Examples are "CH" (which follows "C") and "LL" (which follows "L"). 
 
1-to-2 Conversions: Some languages, like German, require one character (e.g. "sharp S") to occupy two adjacent positions in the collating sequence. 
 
Don't-Care Characters: Some languages designate certain characters to be ignored in character comparisons. For example, if "-" is a don't-care character, then the strings "REACT" and "RE-ACT" compare as equal.
 
Case and Accent Priority: Many languages require a "two pass" collating algorithm: in pass one, the accents are stripped off the letters and the resulting two strings are compared; if they are equal, a second pass with the accents back in place is performed to break the tie. The case of letters may also be used in this fashion. 
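The don't-care and two-pass rules above can be illustrated with a small sketch. The table contents below (sequence numbers, priorities, and the choice of "-" as a don't-care character) are invented for illustration and are not taken from any real col_seq_8 file; the function names are likewise hypothetical.

```python
# Hypothetical toy table: character -> (sequence number, priority).
# Accented letters share the sequence number of their base letter and
# differ only in priority, which is consulted in the second pass.
TABLE = {
    "A": (1, 0), "Á": (1, 1),
    "C": (2, 0),
    "E": (3, 0), "É": (3, 1),
    "R": (4, 0),
    "T": (5, 0),
}
DONT_CARE = {"-"}  # assumed don't-care set for this example


def collation_key(s):
    """Return (sequence numbers, priorities), skipping don't-care characters."""
    entries = [TABLE[c] for c in s if c not in DONT_CARE]
    return ([seq for seq, _ in entries], [pri for _, pri in entries])


def compare(a, b):
    """Two-pass compare: sequence numbers first, priorities only to break ties."""
    ka, kb = collation_key(a), collation_key(b)
    # Tuple comparison looks at the sequence-number lists first and only
    # falls through to the priority lists when those are equal.
    return (ka > kb) - (ka < kb)
```

With these assumed values, compare("REACT", "RE-ACT") returns 0 because "-" is skipped, while "RÉACT" sorts after "REACT" only because the second pass sees the higher priority of "É".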
 
The table has four sections: a file header, a sequence table, a 2-to-1 mapping table, and a 1-to-2 mapping table.

Header
Sequence Table
2-to-1 Mapping Table
1-to-2 Mapping Table

Lengths and pointers are expressed in units of two bytes.

Header:

Offset   Byte 0           Byte 1
------   ---------------------------------
   0     Table Length
   2     Language Id Number
   4     Reserved
   6     Pointer to Sequence Table
   8     Length of Sequence Table
  10     Pointer to 2-to-1 Mapping Table
  12     Length of 2-to-1 Mapping Table
  14     Pointer to 1-to-2 Mapping Table
  16     Length of 1-to-2 Mapping Table
  18     Lowest Char   |  Highest Char
  20     Reserved
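As a rough illustration of the header layout, the following sketch unpacks the 22 header bytes. It assumes big-endian byte order and assumes that pointers are counted from the start of the table; neither is stated on this page. The names Header, parse_header, and word_to_byte_offset are hypothetical.

```python
import struct
from collections import namedtuple

Header = namedtuple(
    "Header",
    "table_len lang_id seq_ptr seq_len "
    "two_to_one_ptr two_to_one_len one_to_two_ptr one_to_two_len "
    "low_char high_char",
)

# Nine 16-bit fields (offsets 0-16), two single bytes (offset 18),
# one trailing 16-bit reserved field (offset 20).
# ">" selects big-endian, which is an ASSUMPTION about the file format.
HEADER_FORMAT = ">9HBBH"


def parse_header(data):
    """Unpack the header fields, dropping the two reserved words."""
    (table_len, lang_id, _reserved1,
     seq_ptr, seq_len,
     p21, l21, p12, l12,
     low_char, high_char, _reserved2) = struct.unpack_from(HEADER_FORMAT, data, 0)
    return Header(table_len, lang_id, seq_ptr, seq_len,
                  p21, l21, p12, l12, low_char, high_char)


def word_to_byte_offset(pointer):
    """Lengths and pointers are in units of two bytes.

    Measuring the pointer from the start of the table is an ASSUMPTION.
    """
    return pointer * 2
```

Keeping the unit conversion in a separate helper makes it easy to adjust if the pointers turn out to be relative to some other origin.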

Sequence Table:

Sequence Entry 0
Sequence Entry 1
(other entries from 2-254)
Sequence Entry 255

The byte value of a character is used as an index into the sequence table. 

Sequence Entry Format: Each entry in the sequence table above uses two bytes and has one of the following formats:
 

First Byte         Second Byte              Format Type
Bits 15-8          Bits 7-6   Bits 5-0
----------------   --------   --------     -----------------------------------------
0                  00         0            don't-care characters
sequence no.       00         priority     all 1-to-1 mapped characters w/o priority
sequence no.       01         index        2-to-1 mapped characters
seq # (1st char)   10         index        1-to-2 mapped characters

The 6-bit index selects an entry in either the 2-to-1 or the 1-to-2 mapping table, depending on the format type.
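The bit layout of a sequence entry can be decoded with a few shifts and masks. This is a minimal sketch; the function name and the returned labels are hypothetical, and the bit pattern 11 in bits 7-6 is reported as "undefined" because this page does not describe it.

```python
def decode_entry(word):
    """Split a 16-bit sequence entry into (format type, sequence no., low 6 bits).

    The low 6 bits are a priority for 1-to-1 entries and a mapping-table
    index for 2-to-1 and 1-to-2 entries.
    """
    seq_no = (word >> 8) & 0xFF   # bits 15-8
    fmt = (word >> 6) & 0x3       # bits 7-6
    low6 = word & 0x3F            # bits 5-0

    if word == 0:
        # All-zero entry: don't-care character.
        return ("dont_care", 0, 0)
    if fmt == 0b00:
        return ("one_to_one", seq_no, low6)   # low6 = priority
    if fmt == 0b01:
        return ("two_to_one", seq_no, low6)   # low6 = index
    if fmt == 0b10:
        return ("one_to_two", seq_no, low6)   # low6 = index
    return ("undefined", seq_no, low6)        # 11 is not described here
```

For example, the word 0x2045 has 01 in bits 7-6, so it decodes as a 2-to-1 entry with sequence number 0x20 and index 5.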

Mapping Table for 2-to-1 Mapped Characters

2-to-1 Mapping Table
Entry Pair 1
Entry Pair 2
(other entry pairs)
Entry Pair n

Sequence Entry Format for Mapped Pairs

Byte 0           Byte 1
--------------   --------------
0                Legal Char 1
Sequence Entry for This Pair
(other entries for this pair)
0                Legal Char n
Sequence Entry for This Pair
Sentinel: -1
00               priority

For each character that can begin a 2-to-1 pair, the table lists its "legal" successor characters. "Legal" means that the combination of the two characters is treated as a single character. If the following character matches a legal successor, the corresponding sequence entry is used for the pair. If no legal successor is found in the table, the character is treated according to 1-to-1 mapping: the priority in the last entry, combined with the sequence number of the character, forms the sequence entry.
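The lookup described above might be sketched as follows. The mapping table is modelled as a list of 16-bit words, and `start` is the word position of the first entry pair for the character. How the 6-bit index translates into that position is not spelled out on this page, so the conversion is left to the caller. The function name and the sample values are hypothetical.

```python
SENTINEL = 0xFFFF  # -1 seen as an unsigned 16-bit word


def resolve_two_to_one(words, start, seq_no, next_char):
    """Return (sequence entry, characters consumed) for a 2-to-1 candidate.

    words     -- the 2-to-1 mapping table as a list of 16-bit words
    start     -- word position of the first entry pair (assumed known)
    seq_no    -- sequence number of the current character
    next_char -- byte value of the following character, or None at end of string
    """
    i = start
    while words[i] != SENTINEL:
        legal_char = words[i] & 0xFF          # byte 0 is 0, byte 1 is the char
        if next_char is not None and legal_char == next_char:
            return words[i + 1], 2            # entry for the pair; both chars used
        i += 2                                # skip the char word and its entry
    # No legal successor: fall back to 1-to-1 mapping using the priority
    # stored in the word that follows the sentinel.
    priority = words[i + 1] & 0x3F
    return (seq_no << 8) | priority, 1
```

As a usage example with invented numbers, a table of [ord("H"), 0x0D00, 0xFFFF, 0x0001] for the letter "C" with sequence number 0x0C yields the pair entry 0x0D00 when the next character is "H", and the fallback entry 0x0C01 otherwise.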

Mapping Table for 1-to-2 Mapped Characters

1-to-2 Mapping Table
Sequence Entry
Sequence Entry
(other sequence entries)
Sequence Entry

Entries in the 1-to-2 mapping table have the same format as entries in the sequence table.  The sequence number of the first character is known from the entry in the sequence table. The sequence number of the second character is found in the 1-to-2 mapping entry, and the priority is used for both characters. 
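A 1-to-2 expansion can be sketched in the same style. The sketch assumes the 6-bit index is a direct word index into the 1-to-2 mapping table, which this page does not state explicitly; the function name and the sample values are hypothetical.

```python
def expand_one_to_two(seq_entry, one_to_two_words):
    """Expand a 1-to-2 sequence entry into two (sequence no., priority) pairs.

    seq_entry        -- 16-bit entry from the sequence table (format bits 10)
    one_to_two_words -- the 1-to-2 mapping table as a list of 16-bit words
    """
    first_seq = (seq_entry >> 8) & 0xFF      # known from the sequence table
    index = seq_entry & 0x3F                 # ASSUMED to be a word index
    mapping = one_to_two_words[index]
    second_seq = (mapping >> 8) & 0xFF       # from the 1-to-2 mapping entry
    priority = mapping & 0x3F                # shared by both positions
    return [(first_seq, priority), (second_seq, priority)]
```

With invented values, a "sharp S" entry of 0x5380 and a mapping table of [0x5302] expands to two positions that both carry sequence number 0x53 and priority 2.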
 

SEE ALSO

sort(1), nl_string(3C). 

Hewlett-Packard  —  last mod. May 11, 2021

Typewritten Software • bear@typewritten.org • Edmonds, WA 98026