18 The optional Extended-Character word set

18.1 Introduction

This word set deals with variable width character encodings. It also works with fixed width encodings.

Since the standard specifies ASCII encoding for characters, only ASCII-compatible encodings may be used. Because ASCII compatibility has so many benefits, most encodings actually are ASCII compatible. The characters beyond the ASCII encoding are called "extended characters" (xchars).

All words dealing with strings shall handle xchars when the xchar word set is present. This includes dictionary definitions. White space parsing does not have to treat code points greater than $20 as white space.
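
The difference between the length of a string in pchars and its length in xchars can be illustrated with a short sketch. The name `xc-count` is hypothetical and not part of this word set; the sketch assumes that the string holds only complete, well-formed xchars, since X-SIZE is ambiguous otherwise, and it uses /STRING from the String word set.

```forth
\ Sketch only: xc-count is a hypothetical helper, not a standard word.
: xc-count ( xc-addr u -- n )   \ n is the number of xchars in the string
   0 >R                         \ running count kept on the return stack
   BEGIN  DUP WHILE             \ while pchars remain
      2DUP X-SIZE /STRING       \ step over the first xchar
      R> 1+ >R
   REPEAT
   2DROP R> ;
```

For a string of ASCII characters the result equals the string length; in a variable-width encoding such as UTF-8 it may be smaller than the length in pchars.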

18.2 Additional terms and notation

18.2.1 Definition of Terms

code point:: A member of an extended character set.

18.2.2 Parsed-text notation

Append table 18.1 to table 2.1.

Table 18.1: Parsed text abbreviations


Abbreviation	Description

<xchar>	the delimiting extended character

See: 2.2.3 Parsed-text notation.

18.3 Additional usage requirements

18.3.1 Data types

Append table 18.2 to table 3.1.

Table 18.2: Data Types


Symbol	Data type	Size on stack

pchar	primitive character	1 cell
xchar	extended character	1 cell
xc-addr	xchar-aligned address	1 cell

See: 3.1 Data types.

18.3.1.1 Extended Characters

An extended character (xchar) is the code point of a character within an extended character set; on the stack it is a subset of u. Extended characters are stored in memory encoded as one or more primitive characters (pchars).
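
As an illustration of the relation between an xchar on the stack and its encoding in memory, the following sketch stores an xchar with XC!+ and fetches it again with XC@+. The names `xbuf` and `roundtrip` are hypothetical, and the buffer size of 16 characters is an assumption that the system's XCHAR-MAXMEM does not exceed it.

```forth
\ Sketch only: xbuf and roundtrip are hypothetical names.
CREATE xbuf 16 CHARS ALLOT      \ assumed to be at least XCHAR-MAXMEM long

: roundtrip ( xchar -- xchar' u )
   DUP XC-SIZE >R               \ u: memory needed to encode the xchar
   xbuf XC!+ DROP               \ encode into xbuf, discard end address
   xbuf XC@+ NIP                \ decode again, discard end address
   R> ;
```

On a system using UTF-8, `$20AC roundtrip` would be expected to leave the same code point together with a size of 3; an ASCII character such as `CHAR A` takes one address unit on systems with byte-sized pchars.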

18.3.2 Environmental queries

Append table 18.3 to table 3.4.

Table 18.3: Environmental Query Strings


String	Value data type	Constant?	Meaning

`XCHAR-ENCODING`	c-addr u	no	Returns a printable ASCII string that represents the encoding, using the preferred MIME name (if any) or the name in the IANA character-set register^[1] (RFC-1700), such as "`ISO-LATIN-1`" or "`UTF-8`", with the exception of "`ASCII`", where the alias "`ASCII`" is preferred.
`MAX-XCHAR`	u	no	Maximal value for xchar
`XCHAR-MAXMEM`	u	no	Maximal memory consumed by an xchar in address units


^[1] http://www.iana.org/assignments/character-sets

See: 3.2.6 Environmental queries.
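
The three query strings might be inspected as in the following sketch. The name `.xchar-env` is hypothetical; the sketch relies only on ENVIRONMENT? returning false for a query that the system does not recognise.

```forth
\ Sketch only: .xchar-env is a hypothetical name.
: .xchar-env ( -- )
   CR ." encoding:   "
   S" XCHAR-ENCODING" ENVIRONMENT? IF  TYPE  ELSE  ." unknown"  THEN
   CR ." max xchar:  "
   S" MAX-XCHAR"      ENVIRONMENT? IF  U.    ELSE  ." unknown"  THEN
   CR ." max memory: "
   S" XCHAR-MAXMEM"   ENVIRONMENT? IF  U.    ELSE  ." unknown"  THEN ;
```

The displayed values depend on the system and its current settings, which is why none of the queries is a constant.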

18.3.3 Common encodings

Input and files are often encoded as ISO-LATIN-1 or UTF-8. The encoding depends on settings of the computer system such as the LANG environment variable on Unix. You can use the system consistently only when you do not change the encoding, or when you use only the ASCII subset. The typical practice in environments requiring more than one encoding is that the base system is ASCII only, and the character set is then extended to specify the required encoding.
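
A program that depends on one particular encoding could test for it before doing any work. The following is a minimal sketch, assuming the system reports the name exactly as "UTF-8"; the name `utf-8?` is hypothetical and COMPARE is taken from the String word set.

```forth
\ Sketch only: utf-8? is a hypothetical helper, not a standard word.
: utf-8? ( -- flag )
   S" XCHAR-ENCODING" ENVIRONMENT? IF
      S" UTF-8" COMPARE 0=       \ exact, case-sensitive match
   ELSE
      FALSE                      \ query not recognised
   THEN ;
```

A program restricted to the ASCII subset needs no such test.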

18.3.4 The Forth text interpreter

In section 3.4.1.3 Text interpreter input number conversion, <cnum> should be redefined to be:

<cnum>

the number is the value of <xchar>
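
For example, with the character-literal syntax of 3.4.1.3 the text interpreter would convert the following input as shown. The second line assumes that both the source text and the system use UTF-8; with another encoding the result may differ or the conversion may fail.

```forth
'A' .        \ displays 65 on any standard system
'€' U.       \ displays 8364 ($20AC), assuming UTF-8 source and system
```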

18.3.5 Input and Output

IO words such as KEY, EMIT, TYPE, READ-FILE, READ-LINE, WRITE-FILE, and WRITE-LINE operate on pchars. Therefore, it is possible that these words read or write incomplete xchars, which are completed in the next consecutive operation(s). The IO system shall combine these pchars into a complete xchar on output, or split an xchar into pchars on input, and shall not throw a "malformed xchar" exception when the combination of these pchars forms a valid xchar. -TRAILING-GARBAGE can be used to process an incomplete xchar at the end of such an IO operation. ACCEPT, as an input editor, may be aware of xchars in order to provide conveniences such as backspace or cursor movement.
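
When a file is read in fixed-size chunks, the last xchar of a chunk may be cut off. A minimal sketch of how -TRAILING-GARBAGE could separate the complete part of a buffer from the incomplete remainder follows; the name `split-complete` is hypothetical.

```forth
\ Sketch only: split-complete is a hypothetical name.
: split-complete ( xc-addr u1 -- xc-addr u2 u3 )
   \ u2: length of the part ending in a complete xchar
   \ u3: number of trailing pchars belonging to an incomplete xchar
   DUP >R                       \ remember the original length
   -TRAILING-GARBAGE            \ drop a cut-off xchar at the end, if any
   R> OVER - ;                  \ u3 = u1 - u2
```

A program could process the first u2 pchars at once and move the remaining u3 pchars to the start of the buffer before the next READ-FILE. For a string that ends in a complete xchar, u3 is zero.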

18.4 Additional documentation requirements

18.4.1 System documentation

18.4.1.1 Implementation-defined options

Unicode input and display pose a number of challenges, such as input method editors for different languages, left-to-right and right-to-left writing, and fonts that contain only a subset of Unicode glyphs; systems should therefore document their capabilities. File IO and in-memory string handling should work transparently with xchars.

18.4.1.2 Ambiguous conditions

the data in memory does not encode a valid xchar (18.6.1.2486.50 X-SIZE);
the xchar value is outside the range of allowed code points of the current character set used;
words improperly used outside 6.1.0490 <# and 6.1.0040 #> (18.6.2.2488.20 XHOLD).
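
A program can avoid the second and third conditions by checking a code point against MAX-XCHAR and by using XHOLD only between <# and #>. The following sketch assumes a system that provides XHOLD from the Extended-Character Extensions word set; the names `xchar-ok?` and `.euro` are hypothetical.

```forth
\ Sketch only: xchar-ok? and .euro are hypothetical names.
: xchar-ok? ( u -- flag )       \ true if u does not exceed MAX-XCHAR
   S" MAX-XCHAR" ENVIRONMENT? IF  U> 0=  ELSE  DROP FALSE  THEN ;

: .euro ( u -- )                \ display u with a leading euro sign
   $20AC xchar-ok? 0= IF  U. ." EUR " EXIT  THEN   \ fallback in ASCII
   0 <#  #S  $20AC XHOLD  #> TYPE SPACE ;
```

Because pictured numeric output is built from right to left, XHOLD is executed after #S so that the sign precedes the digits. If the MAX-XCHAR query is not recognised, the sketch treats every code point as out of range and uses the ASCII fallback.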

18.4.1.3 Other system documentation

no additional requirements.

18.4.2 Program documentation

no additional requirements.

18.5 Compliance and labeling

18.5.1 Forth-2012 systems

The phrase "Providing the Extended-Character word set" shall be appended to the label of any Standard System that provides all of the Extended-Character word set.

The phrase "Providing name(s) from the Extended-Character Extensions word set" shall be appended to the label of any Standard System that provides portions of the Extended-Character Extensions word set.

The phrase "Providing the Extended-Character Extensions word set" shall be appended to the label of any Standard System that provides all of the Extended-Character and Extended-Character Extensions word sets.

18.5.2 Forth-2012 programs

The phrase "Requiring the Extended-Character word set" shall be appended to the label of Standard Programs that require the system to provide the Extended-Character word set.

The phrase "Requiring name(s) from the Extended-Character Extensions word set" shall be appended to the label of Standard Programs that require the system to provide portions of the Extended-Character Extensions word set.

The phrase "Requiring the Extended-Character Extensions word set" shall be appended to the label of Standard Programs that require the system to provide all of the Extended-Character and Extended-Character Extensions word sets.