2.1 Character Set
{AI95-00285-01} {AI95-00395-01} {character set} The character repertoire
for the text of an Ada program consists of the entire coding space described
by the ISO/IEC 10646:2003 Universal Multiple-Octet Coded Character Set. This
coding space is organized in planes, each plane comprising 65536 characters.
{plane (character)} {character plane}
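The plane organization can be illustrated with a little arithmetic. The following is a minimal sketch in Python (chosen only for illustration; it is not part of this International Standard), and the names PLANE_SIZE and plane_and_position are hypothetical. It splits a code position into its plane number and its relative code position within that plane.

```python
PLANE_SIZE = 0x10000  # 65536 cells per plane, as stated in the text


def plane_and_position(code_point):
    """Split a code position into (plane number, relative position in plane)."""
    if code_point < 0:
        raise ValueError("code positions are non-negative")
    return divmod(code_point, PLANE_SIZE)


# Code position 16#1F600# lies in plane 1 at relative position 16#F600#.
print(plane_and_position(0x1F600))  # (1, 62976)
```

The relative code position is the quantity referred to later in this clause when positions 16#FFFE# and 16#FFFF# are excluded.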
Discussion: {AI95-00285-01} It is our intent to follow the terminology of
ISO/IEC 10646:2003 where appropriate, and to remain compatible with the
character classifications defined in A.3, “Character Handling”.
Syntax
Paragraphs 2 and 3 were deleted.
{AI95-00285-01} {AI95-00395-01} A character is defined by this International
Standard for each cell in the coding space described by ISO/IEC 10646:2003,
regardless of whether or not ISO/IEC 10646:2003 allocates a character
to that cell.
Static Semantics
{AI95-00285-01} {AI95-00395-01} The coded representation for characters is
implementation defined [(it need not be a representation defined within
ISO/IEC 10646:2003)]. A character whose relative code position in its plane
is 16#FFFE# or 16#FFFF# is not allowed anywhere in the text of a program.
Implementation defined: The coded representation
for the text of an Ada program.
Ramification: {AI95-00285-01} Note that this rule doesn't really have much
force, since the implementation can represent characters in the source in
any way it sees fit. For example, an implementation could simply define that
what seems to be an other_private_use character is actually a representation
of the space character.
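The exclusion of relative code positions 16#FFFE# and 16#FFFF# can be sketched as a simple mask test. This is only an illustration in Python with hypothetical helper names; as the Ramification points out, the coded representation is implementation defined, so the check is assumed to run on characters after the implementation has decoded its source representation.

```python
def is_excluded_position(code_point):
    """True if the relative code position in its plane is 16#FFFE# or 16#FFFF#."""
    return (code_point & 0xFFFF) in (0xFFFE, 0xFFFF)


def first_excluded(text):
    """Return (index, code position) of the first excluded character, or None."""
    for index, ch in enumerate(text):
        if is_excluded_position(ord(ch)):
            return index, ord(ch)
    return None
```

The mask keeps the low 16 bits, so the same test covers every plane (for example 16#1FFFE# as well as 16#FFFE#).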
{AI95-00285-01} The semantics of an Ada program whose text is not in
Normalization Form KC (as defined by section 24 of ISO/IEC 10646:2003) is
implementation defined.
Implementation defined: The semantics
of an Ada program whose text is not in Normalization Form KC.
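A tool that wants to warn about text falling under this rule could test for Normalization Form KC before compilation. The sketch below uses Python's standard unicodedata module; note the assumption that its normalization tables follow the Unicode version shipped with the interpreter, which need not match ISO/IEC 10646:2003 exactly. The function name is_nfkc is hypothetical.

```python
import unicodedata


def is_nfkc(text):
    """True if normalizing to Form KC leaves the text unchanged."""
    return unicodedata.normalize("NFKC", text) == text


# U+FB01 (LATIN SMALL LIGATURE FI) is replaced by the two letters "fi"
# under Form KC, so an identifier containing it is not normalized.
print(is_nfkc("File_Count"))     # True
print(is_nfkc("\ufb01le_count"))  # False
```

Text for which the check fails is not illegal; its semantics are merely implementation defined.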
{AI95-00285-01} The description of the language definition in this
International Standard uses the character properties General Category,
Simple Uppercase Mapping, Uppercase Mapping, and Special Case Condition of
the documents referenced by the note in section 1 of ISO/IEC 10646:2003. The
actual set of graphic symbols used by an implementation for the visual
representation of the text of an Ada program is not specified.
{unspecified [partial]}
Discussion: Our character classification considers that the cells not
allocated in ISO/IEC 10646:2003 are graphic characters, except for those
whose relative code position in their plane is 16#FFFE# or 16#FFFF#. This
seems to provide the best compatibility with future versions of ISO/IEC
10646, as future characters can already be used in Ada character and string
literals.
This paragraph was deleted. {AI95-00285-01}
letter_uppercase
Any character whose General Category is defined to be “Letter, Uppercase”.
letter_lowercase
Any character whose General Category is defined to be “Letter, Lowercase”.
letter_titlecase
Any character whose General Category is defined to be “Letter, Titlecase”.
letter_modifier
Any character whose General Category is defined to be “Letter, Modifier”.
letter_other
Any character whose General Category is defined to be “Letter, Other”.
mark_non_spacing
Any character whose General Category is defined to be “Mark, Non-Spacing”.
mark_spacing_combining
Any character whose General Category is defined to be “Mark, Spacing
Combining”.
number_decimal
Any character whose General Category is defined to be “Number, Decimal”.
number_letter
Any character whose General Category is defined to be “Number, Letter”.
punctuation_connector
Any character whose General Category is defined to be “Punctuation,
Connector”.
other_format
Any character whose General Category is defined to be “Other, Format”.
separator_space
Any character whose General Category is defined to be “Separator, Space”.
separator_line
Any character whose General Category is defined to be “Separator, Line”.
separator_paragraph
Any character whose General Category is defined to be “Separator,
Paragraph”.
format_effector
The characters whose code positions are 16#09# (CHARACTER TABULATION),
16#0A# (LINE FEED), 16#0B# (LINE TABULATION), 16#0C# (FORM FEED), 16#0D#
(CARRIAGE RETURN), 16#85# (NEXT LINE), and the characters in categories
separator_line and separator_paragraph.
{control character: See also format_effector}
Discussion: ISO/IEC 10646:2003 does not
define the names of control characters, but rather refers to the names
defined by ISO/IEC 6429:1992. These are the names that we use here.
other_control
Any character whose General Category is defined to be “Other, Control”,
and which is not defined to be a format_effector.
other_private_use
Any character whose General Category is defined to be “Other, Private
Use”.
other_surrogate
Any character whose General Category is defined to be “Other, Surrogate”.
graphic_character
Any character that is not in the categories other_control,
other_private_use, other_surrogate, format_effector, and whose relative
code position in its plane is neither 16#FFFE# nor 16#FFFF#.
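The categorization above can be sketched as a lookup on the General Category property. The following Python illustration rests on several assumptions: the two-letter General Category codes come from Python's unicodedata module (whose data follows the interpreter's Unicode version, not necessarily ISO/IEC 10646:2003), the category names such as letter_uppercase are the Ada 2005 syntactic category names, and ada_categories is a hypothetical helper. A character can belong to more than one category (for example separator_line and format_effector), so a set is returned.

```python
import unicodedata

# General Category code -> category name used in this clause.
_BY_GENERAL_CATEGORY = {
    "Lu": "letter_uppercase", "Ll": "letter_lowercase",
    "Lt": "letter_titlecase", "Lm": "letter_modifier",
    "Lo": "letter_other", "Mn": "mark_non_spacing",
    "Mc": "mark_spacing_combining", "Nd": "number_decimal",
    "Nl": "number_letter", "Pc": "punctuation_connector",
    "Cf": "other_format", "Zs": "separator_space",
    "Zl": "separator_line", "Zp": "separator_paragraph",
    "Cc": "other_control", "Co": "other_private_use",
    "Cs": "other_surrogate",
}
_FORMAT_EFFECTOR_POSITIONS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}
_NON_GRAPHIC = {"other_control", "other_private_use",
                "other_surrogate", "format_effector"}


def ada_categories(ch):
    """Return the set of categories from this clause that `ch` belongs to."""
    code_point = ord(ch)
    general_category = unicodedata.category(ch)
    categories = set()

    if (code_point in _FORMAT_EFFECTOR_POSITIONS
            or general_category in ("Zl", "Zp")):
        categories.add("format_effector")

    name = _BY_GENERAL_CATEGORY.get(general_category)
    # other_control excludes the characters defined to be format effectors.
    if name == "other_control" and "format_effector" in categories:
        name = None
    if name is not None:
        categories.add(name)

    # Unallocated cells count as graphic characters, except 16#FFFE#/16#FFFF#.
    excluded = (code_point & 0xFFFF) in (0xFFFE, 0xFFFF)
    if not excluded and not (categories & _NON_GRAPHIC):
        categories.add("graphic_character")
    return categories
```

For instance, ada_categories("A") yields letter_uppercase and graphic_character, while the character at position 16#09# yields only format_effector.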
This paragraph was deleted.
Discussion: {AI95-00285-01} We considered basing the definition of lexical
elements on Annex A of ISO/IEC TR 10176 (4th edition), which lists the
characters which should be supported in identifiers for all programming
languages, but we finally decided against this option. Note that it is not
our intent to diverge from ISO/IEC TR 10176, except to the extent that
ISO/IEC TR 10176 itself diverges from ISO/IEC 10646:2003 (which is the case
at the time of this writing [January 2005]).
More precisely, we intend to align strictly with ISO/IEC 10646:2003. It must
be noted that ISO/IEC TR 10176 is a Technical Report while ISO/IEC
10646:2003 is a Standard. If one has to make a choice, one should conform
with the Standard rather than with the Technical Report. And, it turns out
that one must make a choice because there are important differences
between the two:
ISO/IEC TR 10176 is still based on ISO/IEC
10646:2000 while ISO/IEC 10646:2003 has already been published for a
year. We cannot afford to delay the adoption of our amendment until ISO/IEC
TR 10176 has been revised.
There are considerable differences between
the two editions of ISO/IEC 10646, notably in supporting characters beyond
the BMP (this might be significant for some languages, e.g. Korean).
ISO/IEC TR 10176 does not define case conversion
tables, which are essential for a case-insensitive language like Ada.
To get case conversion tables, we would have to reference either ISO/IEC
10646:2003 or Unicode, or we would have to invent our own.
For the purpose of defining the lexical elements of the language, we need
character properties like categorization, as well as case conversion tables.
These are mentioned in ISO/IEC 10646:2003 as useful for implementations,
with a reference to Unicode. Machine-readable tables are available on the
web at URLs:
with an explanatory document found at URL:
The actual text of the standard only makes specific
references to the corresponding clauses of ISO/IEC 10646:2003, not to
Unicode.
{AI95-00285-01} The following names are used when referring to certain
characters (the first name is that given in ISO/IEC 10646:2003):
{quotation mark} {number sign} {ampersand} {apostrophe} {tick}
{left parenthesis} {right parenthesis} {asterisk} {multiply} {plus sign}
{comma} {hyphen-minus} {minus} {full stop} {dot} {point} {solidus} {divide}
{colon} {semicolon} {less-than sign} {equals sign} {greater-than sign}
{low line} {underline} {vertical line} {exclamation point} {percent sign}
Discussion: {AI95-00285-01} {graphic symbols} {glyphs} This table serves to
show the correspondence between ISO/IEC 10646:2003 names and the graphic
symbols (glyphs) used in this International Standard. These are the
characters that play a special role in the syntax of Ada.
graphic symbol | name                   | graphic symbol | name
"              | quotation mark         | :              | colon
#              | number sign            | ;              | semicolon
&              | ampersand              | <              | less-than sign
'              | apostrophe, tick       | =              | equals sign
(              | left parenthesis       | >              | greater-than sign
)              | right parenthesis      | _              | low line, underline
*              | asterisk, multiply     | |              | vertical line
+              | plus sign              | /              | solidus, divide
,              | comma                  | !              | exclamation point
-              | hyphen-minus, minus    | %              | percent sign
.              | full stop, dot, point  |                |
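The first name given for each symbol in the table can be compared with character names as reported by a Unicode database. The sketch below does so with Python's unicodedata module; this is an assumption-laden illustration, since Python reports Unicode character names for the Unicode version of the interpreter, and TABLE_NAMES and mismatches are hypothetical names. Under that assumption the only entry expected to differ is "!", which Unicode names EXCLAMATION MARK whereas the table uses "exclamation point".

```python
import unicodedata

# First name listed in the table for each graphic symbol.
TABLE_NAMES = {
    '"': "quotation mark", "#": "number sign", "&": "ampersand",
    "'": "apostrophe", "(": "left parenthesis", ")": "right parenthesis",
    "*": "asterisk", "+": "plus sign", ",": "comma", "-": "hyphen-minus",
    ".": "full stop", ":": "colon", ";": "semicolon",
    "<": "less-than sign", "=": "equals sign", ">": "greater-than sign",
    "_": "low line", "|": "vertical line", "/": "solidus",
    "!": "exclamation point", "%": "percent sign",
}


def mismatches():
    """Return the symbols whose table name differs from the database name."""
    return sorted(
        symbol
        for symbol, name in TABLE_NAMES.items()
        if unicodedata.name(symbol).lower() != name
    )


print(mismatches())  # ['!']
```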
Implementation Permissions
1 {AI95-00285-01} The characters in categories other_control,
other_private_use, and other_surrogate are only allowed in comments.
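A source checker could enforce this note line by line. The following Python sketch is a simplification under stated assumptions: it works on one decoded line at a time, treats "--" outside a string literal as the start of a comment, recognizes string literals delimited by quotation marks and the character literal '"', and takes General Category data from Python's unicodedata module. The name comment_only_violations is hypothetical.

```python
import unicodedata

_FORMAT_EFFECTOR_POSITIONS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85}


def _comment_only(ch):
    """True for other_control, other_private_use and other_surrogate."""
    general_category = unicodedata.category(ch)
    if general_category == "Cc":
        # Control characters that are format effectors are not other_control.
        return ord(ch) not in _FORMAT_EFFECTOR_POSITIONS
    return general_category in ("Co", "Cs")


def comment_only_violations(line):
    """Return indices of comment-only characters found outside a comment."""
    violations = []
    in_string = False
    i = 0
    while i < len(line):
        if not in_string:
            if line.startswith("--", i):
                break  # the rest of the line is a comment
            if line.startswith("'\"'", i):
                i += 3  # skip the character literal '"'
                continue
        ch = line[i]
        if ch == '"':
            in_string = not in_string  # a doubled "" toggles twice
        elif _comment_only(ch):
            violations.append(i)
        i += 1
    return violations
```

Characters inside string literals are still reported, since the note allows these categories in comments only.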
2 The language does not specify the source
representation of programs.
Discussion: Any source representation
is valid so long as the implementer can produce an (information-preserving)
algorithm for translating in both directions between the representation
and the standard character set. (For example, every character in the
standard character set has to be representable, even if the output devices
attached to a given computer cannot print all of those characters properly.)
From a practical point of view, every implementer will have to provide
some way to process the ACATS. It is the intent to allow source representations,
such as parse trees, that are not even linear sequences of characters.
It is also the intent to allow different fonts: reserved words might
be in bold face, and that should be irrelevant to the semantics.
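The information-preserving requirement can be illustrated with a round-trip test. In this Python sketch a text codec stands in for an implementation's source representation; that is only an assumption for illustration, since the Discussion also permits representations (such as parse trees) that are not linear character sequences. The name round_trips is hypothetical.

```python
def round_trips(text, codec):
    """True if `text` survives encoding and decoding with `codec` unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # Some character is not representable in this codec.
        return False


source = "\u03a0 : constant := 3.14;"  # identifier is GREEK CAPITAL LETTER PI
print(round_trips(source, "utf-8"))    # True
print(round_trips(source, "latin-1"))  # False
```

A representation such as Latin-1 fails the test for this text because it cannot represent every character of the standard character set.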
Extensions to Ada 83
{extensions to Ada 83} Ada 95 allows 8-bit and 16-bit characters, as well as
implementation-specified character sets.
Wording Changes from Ada 83
{AI95-00285-01} The syntax rules in this clause are modified to remove the
emphasis on basic characters vs. others. (In this day and age, there is no
need to point out that you can write programs without using (for example)
lower case letters.) In particular, character (representing all characters
usable outside comments) is added, and basic_graphic_character,
other_special_character, and basic_character are removed. Special_character
is expanded to include Ada 83's other_special_character, as well as new
8-bit characters not present in Ada 83. Ada 2005 removes special_character
altogether; we want to stick to ISO/IEC 10646:2003 character
classifications. Note that the term “basic letter” is used in A.3,
“Character Handling” to refer to letters without diacritical marks.
{AI95-00285-01} Character names now come from ISO/IEC 10646:2003.
Extensions to Ada 95
{AI95-00285-01} {AI95-00395-01} {extensions to Ada 95} Program text can use
most characters defined by ISO/IEC 10646:2003. This clause has been
rewritten to use the categories defined in that Standard. This should ease
programming in languages other than English.