18 The optional Extended-Character word set
18.1 Introduction
This word set deals with variable-width character encodings; it also
works with fixed-width encodings.
Since the standard specifies ASCII encoding for characters, only
ASCII-compatible encodings may be used. Because ASCII compatibility
has so many benefits, most encodings actually are ASCII compatible.
The characters beyond the ASCII encoding are called "extended
characters" (xchars).
All words dealing with strings shall handle xchars when the xchar word
set is present. This includes dictionary definitions. White space
parsing does not have to treat code points greater than $20 as white
space.
18.2 Additional terms and notation
18.2.1 Definition of Terms
- code point: A member of an extended character set.
18.2.2 Parsed-text notation
Append table 18.1 to table 2.1.
Table 18.1: Parsed text abbreviations
Abbreviation | Description |
<xchar> | the delimiting extended character |
See: 2.2.3 Parsed-text notation.
18.3 Additional usage requirements
18.3.1 Data types
Append table 18.2 to table 3.1.
Table 18.2: Data Types
Symbol | Data type | Size on stack |
pchar | primitive character | 1 cell |
xchar | extended character | 1 cell |
xc-addr | xchar-aligned address | 1 cell |
See: 3.1 Data types.
18.3.1.1 Extended Characters
An extended character (xchar) is the code point of a character within an
extended character set; on the stack it is a subset of u. Extended
characters are stored in memory encoded as one or more primitive
characters (pchars).
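The relation between an xchar on the stack and its pchars in memory can
be illustrated with a short round trip. This is only a sketch: it assumes
a system that provides XC-SIZE, XC!+ and XC@+ from this word set, and the
names xbuf and round-trip are hypothetical. The 8-character buffer is an
assumption that covers the encodings mentioned in this chapter; a careful
program would size it from the XCHAR-MAXMEM environmental query.

```forth
\ Hypothetical scratch buffer, assumed large enough for one xchar.
CREATE xbuf 8 CHARS ALLOT

\ Store an xchar in memory as pchars, read it back, and report
\ how many address units its encoding occupies.
: round-trip ( xchar -- xchar' u )
  DUP XC-SIZE >R      \ memory needed by the encoding of xchar
  xbuf XC!+ DROP      \ encode xchar as pchars at xbuf
  xbuf XC@+ NIP       \ decode it again, dropping the advanced address
  R> ;
```

For example, `HEX 20AC round-trip . . DECIMAL` would be expected to
display `3 20AC` on a system whose encoding is UTF-8, since the euro sign
occupies three pchars there.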
18.3.2 Environmental queries
Append table 18.3 to table 3.4.
Table 18.3: Environmental Query Strings
String | Value data type | Constant? | Meaning |
XCHAR-ENCODING | c-addr u | no | Returns a printable ASCII string that represents the encoding, and uses the preferred MIME name (if any) or the name in the IANA character-set register[1] (RFC-1700) such as "ISO-LATIN-1" or "UTF-8", with the exception of "ASCII", where the alias "ASCII" is preferred. |
MAX-XCHAR | u | no | Maximal value for xchar |
XCHAR-MAXMEM | u | no | Maximal memory consumed by an xchar in address units |
See: 3.2.6 Environmental queries.
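A program might inspect these queries as sketched here. The word name
.xchar-env is hypothetical, and the sketch assumes only ENVIRONMENT? and
the three query strings from table 18.3; a system is free to report any
of them as unknown, in which case ENVIRONMENT? returns only a false flag.

```forth
\ Report the xchar-related environmental queries, where known.
: .xchar-env ( -- )
  S" XCHAR-ENCODING" ENVIRONMENT? IF
    CR ." Encoding: " TYPE
  ELSE
    CR ." XCHAR-ENCODING is not known to this system"
  THEN
  S" MAX-XCHAR" ENVIRONMENT? IF
    CR ." Largest xchar: " U.
  THEN
  S" XCHAR-MAXMEM" ENVIRONMENT? IF
    CR ." Largest xchar size in address units: " U.
  THEN ;
```

Because none of the three values is a constant, a program that changes
the encoding should query again rather than cache the results.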
18.3.3 Common encodings
Input and files are often encoded ISO-Latin-1 or UTF-8. The encoding
depends on settings of the computer system such as the LANG environment
variable on Unix. You can use the system consistently only when you do
not change the encoding, or only use the ASCII subset.
The typical practice in environments requiring more than one encoding
is that the base system is ASCII only, and the character set is then
extended to specify the required encoding.
18.3.4 The Forth text interpreter
In section 3.4.1.3 Text interpreter input number conversion, <cnum>
should be redefined to be:
<cnum> | the number is the value of <xchar> |
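As an illustration of the redefined <cnum>, the following lines use the
quoted-character number syntax of section 3.4.1.3. The sketch assumes
that the source text is read in the system's own xchar encoding. On a
system whose encoding is UTF-8 the two lines would be expected to display
65 and 8364 (the code point of the euro sign, $20AC).

```forth
DECIMAL
'A' .     \ an ASCII character
'€' .     \ an extended character
```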
18.3.5 Input and Output
IO words such as KEY, EMIT, TYPE, READ-FILE, READ-LINE, WRITE-FILE, and
WRITE-LINE operate on pchars. Therefore, it is possible that these words
read or write incomplete xchars, which are completed in the next
consecutive operation(s). The IO system shall combine these pchars into
a complete xchar on output, or split an xchar into pchars on input, and
shall not throw a "malformed xchar" exception when the combination of
these pchars forms a valid xchar. -TRAILING-GARBAGE can be used to
process an incomplete xchar at the end of such an IO operation.
ACCEPT, as an input editor, may be aware of xchars in order to provide
conveniences such as backspace or cursor movement.
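One way to handle a buffer that may end in the middle of an xchar is to
split it into the part made of complete xchars and the incomplete tail.
The following is a sketch under assumptions: split-complete is a
hypothetical name, -TRAILING-GARBAGE comes from the Extended-Character
extension words, and PICK and /STRING are assumed to be available from
the Core extension and String word sets.

```forth
\ c-addr u1 holds only complete xchars; c-addr2 u2 is the incomplete
\ tail left over at the end of the buffer (u2 is 0 if there is none).
: split-complete ( c-addr u -- c-addr u1 c-addr2 u2 )
  2DUP -TRAILING-GARBAGE    \ ( c-addr u c-addr u1 )
  2SWAP 2 PICK /STRING ;    \ skip the complete part to reach the tail
```

A reader loop could process c-addr u1 immediately, then move the u2 tail
pchars to the start of its buffer so that the next READ-FILE appends the
pchars that complete the xchar.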
18.4 Additional documentation requirements
18.4.1 System documentation
18.4.1.1 Implementation-defined options
Unicode input and display pose a number of challenges, such as input
method editors for different languages, left-to-right and right-to-left
writing, and fonts that contain only a subset of Unicode glyphs. Systems
should therefore document their capabilities. File IO and in-memory
string handling should work transparently with xchars.
18.4.1.2 Ambiguous conditions
18.4.1.3 Other system documentation
- no additional requirements.
18.4.2 Program documentation
- no additional requirements.
18.5 Compliance and labeling
18.5.1 Forth-2012 systems
The phrase "Providing the Extended-Character word set" shall be
appended to the label of any Standard System that provides all of
the Extended-Character word set.
The phrase "Providing name(s) from the Extended-Character Extensions
word set" shall be appended to the label of any Standard System that
provides portions of the Extended-Character Extensions word set.
The phrase "Providing the Extended-Character Extensions word set"
shall be appended to the label of any Standard System that provides
all of the Extended-Character and Extended-Character Extensions
word sets.
18.5.2 Forth-2012 programs
The phrase "Requiring the Extended-Character word set" shall be
appended to the label of Standard Programs that require the system
to provide the Extended-Character word set.
The phrase "Requiring name(s) from the Extended-Character Extensions
word set" shall be appended to the label of Standard Programs that
require the system to provide portions of the Extended-Character
Extensions word set.
The phrase "Requiring the Extended-Character Extensions word set"
shall be appended to the label of Standard Programs that require the
system to provide all of the Extended-Character and
Extended-Character Extensions word sets.
18.6 Glossary
18.6.1 Extended-Character words
18.6.2 Extended-Character extension words