ICU 72.1
Old Unicode normalization API. More...
#include <normlzr.h>
Public Types | |
enum | { DONE =0xffff } |
If DONE is returned from an iteration function that returns a code point, then there are no more normalization results available. More... | |
Public Member Functions | |
Normalizer (const UnicodeString &str, UNormalizationMode mode) | |
Creates a new Normalizer object for iterating over the normalized form of a given string. More... | |
Normalizer (ConstChar16Ptr str, int32_t length, UNormalizationMode mode) | |
Creates a new Normalizer object for iterating over the normalized form of a given string. More... | |
Normalizer (const CharacterIterator &iter, UNormalizationMode mode) | |
Creates a new Normalizer object for iterating over the normalized form of the given text. More... | |
Normalizer (const Normalizer &copy) | |
Copy constructor. More... | |
virtual | ~Normalizer () |
Destructor. More... | |
UChar32 | current (void) |
Return the current character in the normalized text. More... | |
UChar32 | first (void) |
Return the first character in the normalized text. More... | |
UChar32 | last (void) |
Return the last character in the normalized text. More... | |
UChar32 | next (void) |
Return the next character in the normalized text. More... | |
UChar32 | previous (void) |
Return the previous character in the normalized text and decrement. More... | |
void | setIndexOnly (int32_t index) |
Set the iteration position in the input text that is being normalized, without any immediate normalization. More... | |
void | reset (void) |
Reset the index to the beginning of the text. More... | |
int32_t | getIndex (void) const |
Retrieve the current iteration position in the input text that is being normalized. More... | |
int32_t | startIndex (void) const |
Retrieve the index of the start of the input text. More... | |
int32_t | endIndex (void) const |
Retrieve the index of the end of the input text. More... | |
bool | operator== (const Normalizer &that) const |
Returns true when both iterators refer to the same character in the same input text. More... | |
bool | operator!= (const Normalizer &that) const |
Returns false when both iterators refer to the same character in the same input text. More... | |
Normalizer * | clone () const |
Returns a pointer to a new Normalizer that is a clone of this one. More... | |
int32_t | hashCode (void) const |
Generates a hash code for this iterator. More... | |
void | setMode (UNormalizationMode newMode) |
Set the normalization mode for this object. More... | |
UNormalizationMode | getUMode (void) const |
Return the normalization mode for this object. More... | |
void | setOption (int32_t option, UBool value) |
Set options that affect this Normalizer's operation. More... | |
UBool | getOption (int32_t option) const |
Determine whether an option is turned on or off. More... | |
void | setText (const UnicodeString &newText, UErrorCode &status) |
Set the input text over which this Normalizer will iterate. More... | |
void | setText (const CharacterIterator &newText, UErrorCode &status) |
Set the input text over which this Normalizer will iterate. More... | |
void | setText (ConstChar16Ptr newText, int32_t length, UErrorCode &status) |
Set the input text over which this Normalizer will iterate. More... | |
void | getText (UnicodeString &result) |
Copies the input text into the UnicodeString argument. More... | |
virtual UClassID | getDynamicClassID () const override |
ICU "poor man's RTTI", returns a UClassID for the actual class. More... | |
Public Member Functions inherited from icu::UObject | |
virtual | ~UObject () |
Destructor. More... | |
virtual UClassID | getDynamicClassID () const |
ICU4C "poor man's RTTI", returns a UClassID for the actual ICU class. More... | |
Static Public Member Functions | |
static void | normalize (const UnicodeString &source, UNormalizationMode mode, int32_t options, UnicodeString &result, UErrorCode &status) |
Normalizes a UnicodeString according to the specified normalization mode. More... | |
static void | compose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status) |
Compose a UnicodeString. More... | |
static void | decompose (const UnicodeString &source, UBool compat, int32_t options, UnicodeString &result, UErrorCode &status) |
Static method to decompose a UnicodeString. More... | |
static UNormalizationCheckResult | quickCheck (const UnicodeString &source, UNormalizationMode mode, UErrorCode &status) |
Performing quick check on a string, to quickly determine if the string is in a particular normalization format. More... | |
static UNormalizationCheckResult | quickCheck (const UnicodeString &source, UNormalizationMode mode, int32_t options, UErrorCode &status) |
Performing quick check on a string; same as the other version of quickCheck but takes an extra options parameter like most normalization functions. More... | |
static UBool | isNormalized (const UnicodeString &src, UNormalizationMode mode, UErrorCode &errorCode) |
Test if a string is in a given normalization form. More... | |
static UBool | isNormalized (const UnicodeString &src, UNormalizationMode mode, int32_t options, UErrorCode &errorCode) |
Test if a string is in a given normalization form; same as the other version of isNormalized but takes an extra options parameter like most normalization functions. More... | |
static UnicodeString & | concatenate (const UnicodeString &left, const UnicodeString &right, UnicodeString &result, UNormalizationMode mode, int32_t options, UErrorCode &errorCode) |
Concatenate normalized strings, making sure that the result is normalized as well. More... | |
static int32_t | compare (const UnicodeString &s1, const UnicodeString &s2, uint32_t options, UErrorCode &errorCode) |
Compare two strings for canonical equivalence. More... | |
static UClassID | getStaticClassID () |
ICU "poor man's RTTI", returns a UClassID for this class. More... | |
Old Unicode normalization API.
This API has been replaced by the Normalizer2 class and is only available for backward compatibility. This class simply delegates to the Normalizer2 class. There is one exception: The new API does not provide a replacement for Normalizer::compare().
The Normalizer class supports the standard normalization forms described in Unicode Standard Annex #15: Unicode Normalization Forms.
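The Normalizer class consists of two parts:
- static functions that normalize strings or test whether strings are normalized
- a Normalizer object, which is an iterator that takes any kind of text and provides iteration over its normalized form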
The Normalizer class is not suitable for subclassing.
For basic information about normalization forms and details about the C API please see the documentation in unorm.h.
The iterator API, consisting of the Normalizer constructors and the non-static functions, uses a CharacterIterator as input. It is possible to pass a string, which is then internally wrapped in a CharacterIterator. The input text is not normalized all at once, but incrementally where needed (providing efficient random access). This makes it possible to pass in a large text but spend only a small amount of time normalizing a small part of that text. However, if the entire text is normalized, then the iterator will be slower than normalizing the entire text at once and iterating over the result. Another possible use of the Normalizer iterator is to report an index into the original text that is close to where the normalized characters come from.
Important: The iterator API was cleaned up significantly for ICU 2.0. The earlier implementation reported the getIndex() inconsistently, and previous() could not be used after setIndex(), next(), first(), and current().
Normalizer can start normalizing from anywhere in the input text after a call to setIndexOnly(), first(), or last(). Without calling any of these, the iterator will start at the beginning of the text.
At any time, next() returns the next normalized code point (UChar32), with post-increment semantics (like CharacterIterator::next32PostInc()). previous() returns the previous normalized code point (UChar32), with pre-decrement semantics (like CharacterIterator::previous32()).
current() returns the current code point (respectively the one at the newly set index) without moving the getIndex(). Note that if the text at the current position needs to be normalized, then these functions will do that. (This is why current() is not const.) It is more efficient to call setIndexOnly() instead, which does not normalize.
getIndex() always refers to the position in the input text where the normalized code points are returned from. It does not always change with each returned code point. The code point that is returned from any of the functions corresponds to text at or after getIndex(), according to the function's iteration semantics (post-increment or pre-decrement).
next() returns a code point from at or after the getIndex() from before the next() call. After the next() call, the getIndex() might have moved to where the next code point will be returned from (from a next() or current() call). This is semantically equivalent to array access with array[index++] (post-increment semantics).
previous() returns a code point from at or after the getIndex() from after the previous() call. This is semantically equivalent to array access with array[--index] (pre-decrement semantics).
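The array analogy for next() and previous() can be sketched without ICU at all. `Cursor` below is a hypothetical stand-in, not an ICU class; it only mirrors the index arithmetic and performs no bounds checking (the real iterator returns DONE at either end instead).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an iterator over already-normalized text.
// It is not an ICU class; the caller must keep the index in range.
struct Cursor {
    std::vector<int> data;
    std::size_t index = 0;

    // Post-increment: read at the current index, then advance (array[index++]).
    int next() { return data[index++]; }

    // Pre-decrement: step back first, then read (array[--index]).
    int previous() { return data[--index]; }
};
```

With these semantics, a next() followed immediately by a previous() returns the same element twice and leaves the index where it started.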
Internally, the Normalizer iterator normalizes a small piece of text starting at the getIndex() and ending at a following "safe" index. The normalized result is stored in an internal string buffer, and the code points are iterated from there. With multiple iteration calls, this is repeated until the next piece of text needs to be normalized, and the getIndex() needs to be moved.
The following "safe" index, the internal buffer, and the secondary iteration index into that buffer are not exposed on the API. This also means that it is currently not practical to return to a particular, arbitrary position in the text because one would need to know, and be able to set, in addition to the getIndex(), at least also the current index into the internal buffer. It is currently only possible to observe when getIndex() changes (with careful consideration of the iteration semantics), at which time the internal index will be 0. For example, if getIndex() is different after next() than before it, then the internal index is 0 and one can return to this getIndex() later with setIndexOnly().
Note: While setIndexOnly() and getIndex() refer to indices in the underlying Unicode input text, the next() and previous() methods iterate through characters in the normalized output. This means that there is not necessarily a one-to-one correspondence between characters returned by next() and previous() and the indices passed to setIndexOnly() and returned from getIndex(). It is for this reason that Normalizer does not implement the CharacterIterator interface.
anonymous enum |
If DONE is returned from an iteration function that returns a code point, then there are no more normalization results available.
icu::Normalizer::Normalizer | ( | const UnicodeString & | str, |
UNormalizationMode | mode | ||
) |
Creates a new Normalizer object for iterating over the normalized form of a given string.
str | The string to be normalized. The normalization will start at the beginning of the string. |
mode | The normalization mode. |
icu::Normalizer::Normalizer | ( | ConstChar16Ptr | str, |
int32_t | length, | ||
UNormalizationMode | mode | ||
) |
Creates a new Normalizer object for iterating over the normalized form of a given string.
str | The string to be normalized. The normalization will start at the beginning of the string. |
length | Length of the string, or -1 if NUL-terminated. |
mode | The normalization mode. |
icu::Normalizer::Normalizer | ( | const CharacterIterator & | iter, |
UNormalizationMode | mode | ||
) |
Creates a new Normalizer object for iterating over the normalized form of the given text.
iter | The input text to be normalized. The normalization will start at the beginning of the string. |
mode | The normalization mode. |
icu::Normalizer::Normalizer | ( | const Normalizer & | copy | ) |
Copy constructor.
copy | The object to be copied. |
virtual icu::Normalizer::~Normalizer | ( | ) |
virtual |
Destructor.
Normalizer * icu::Normalizer::clone | ( | ) | const |
Returns a pointer to a new Normalizer that is a clone of this one.
The caller is responsible for deleting the new clone.
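static int32_t icu::Normalizer::compare | ( | const UnicodeString & | s1, |
const UnicodeString & | s2, | ||
uint32_t | options, | ||
UErrorCode & | errorCode | ||
) |
inline static |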
Compare two strings for canonical equivalence.
Further options include case-insensitive comparison and code point order (as opposed to code unit order).
Canonical equivalence between two strings is defined as their normalized forms (NFD or NFC) being identical. This function compares strings incrementally instead of normalizing (and optionally case-folding) both strings entirely, improving performance significantly.
Bulk normalization is only necessary if the strings do not fulfill the FCD conditions. Only in this case, and only if the strings are relatively long, is memory allocated temporarily. For FCD strings and short non-FCD strings there is no memory allocation.
Semantically, this is equivalent to strcmp[CodePointOrder](NFD(foldCase(s1)), NFD(foldCase(s2))) where code point order and foldCase are all optional.
UAX 21 2.5 Caseless Matching specifies that for a canonical caseless match the case folding must be performed first, then the normalization.
s1 | First source string. |
s2 | Second source string. |
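options | A bit set of options:
- U_FOLD_CASE_DEFAULT or 0: default options, i.e. case-sensitive comparison in code unit order, with the input strings quick-checked for FCD.
- UNORM_INPUT_IS_FCD: set if the caller knows that both s1 and s2 fulfill the FCD conditions. If not set, the function will quick-check for FCD and normalize if necessary.
- U_COMPARE_CODE_POINT_ORDER: set to choose code point order instead of code unit order.
- U_COMPARE_IGNORE_CASE: set to compare strings case-insensitively using case folding, instead of case-sensitively. |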
errorCode | ICU error code in/out parameter. Must fulfill U_SUCCESS before the function call. |
Definition at line 800 of file normlzr.h.
References icu::UnicodeString::getBuffer(), icu::UnicodeString::length(), icu::toUCharPtr(), and unorm_compare().
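static void icu::Normalizer::compose | ( | const UnicodeString & | source, |
UBool | compat, | ||
int32_t | options, | ||
UnicodeString & | result, | ||
UErrorCode & | status | ||
) |
static |
Compose a UnicodeString.
This is equivalent to normalize() with mode UNORM_NFC or UNORM_NFKC. This is a wrapper for unorm_normalize(), using UnicodeString's.
The options parameter specifies which optional Normalizer features are to be enabled for this operation.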
source | the string to be composed. |
compat | Perform compatibility decomposition before composition. If this argument is false, only canonical decomposition will be performed. |
options | the optional features to be enabled (0 for no options) |
result | The composed string (on output). |
status | The error code. |
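static UnicodeString & icu::Normalizer::concatenate | ( | const UnicodeString & | left, |
const UnicodeString & | right, | ||
UnicodeString & | result, | ||
UNormalizationMode | mode, | ||
int32_t | options, | ||
UErrorCode & | errorCode | ||
) |
static |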
Concatenate normalized strings, making sure that the result is normalized as well.
If both the left and the right strings are in the normalization form according to "mode/options", then the result will be normalize(left + right, mode, options).
For details see unorm_concatenate in unorm.h.
left | Left source string. |
right | Right source string. |
result | The output string. |
mode | The normalization mode. |
options | A bit set of normalization options. |
errorCode | ICU error code in/out parameter. Must fulfill U_SUCCESS before the function call. |
UChar32 icu::Normalizer::current | ( | void | ) |
Return the current character in the normalized text.
current() may need to normalize some text at getIndex(). The getIndex() is not changed.
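static void icu::Normalizer::decompose | ( | const UnicodeString & | source, |
UBool | compat, | ||
int32_t | options, | ||
UnicodeString & | result, | ||
UErrorCode & | status | ||
) |
static |
Static method to decompose a UnicodeString.
This is equivalent to normalize() with mode UNORM_NFD or UNORM_NFKD. This is a wrapper for unorm_normalize(), using UnicodeString's.
The options parameter specifies which optional Normalizer features are to be enabled for this operation.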
source | the string to be decomposed. |
compat | Perform compatibility decomposition. If this argument is false, only canonical decomposition will be performed. |
options | the optional features to be enabled (0 for no options) |
result | The decomposed string (on output). |
status | The error code. |
int32_t icu::Normalizer::endIndex | ( | void | ) | const |
Retrieve the index of the end of the input text.
This is the end index of the CharacterIterator or the length of the string over which this Normalizer is iterating. This end index is exclusive, i.e., the Normalizer operates only on characters before this index.
UChar32 icu::Normalizer::first | ( | void | ) |
Return the first character in the normalized text.
This is equivalent to setIndexOnly(startIndex()) followed by next(). (Post-increment semantics.)
virtual UClassID icu::Normalizer::getDynamicClassID | ( | ) | const |
override virtual |
ICU "poor man's RTTI", returns a UClassID for the actual class.
Reimplemented from icu::UObject.
int32_t icu::Normalizer::getIndex | ( | void | ) | const |
Retrieve the current iteration position in the input text that is being normalized.
A following call to next() will return a normalized code point from the input text at or after this index.
After a call to previous(), getIndex() will point at or before the position in the input text where the normalized code point was returned from with previous().
UBool icu::Normalizer::getOption | ( | int32_t | option | ) | const |
Determine whether an option is turned on or off.
If multiple options are specified, then the result is true if any of them are set.
option | the option(s) that are to be checked |
static UClassID icu::Normalizer::getStaticClassID | ( | ) |
static |
ICU "poor man's RTTI", returns a UClassID for this class.
void icu::Normalizer::getText | ( | UnicodeString & | result | ) |
Copies the input text into the UnicodeString argument.
result | Receives a copy of the text under iteration. |
UNormalizationMode icu::Normalizer::getUMode | ( | void | ) | const |
Return the normalization mode for this object.
This is an unusual name because there used to be a getMode() that returned a different type.
int32_t icu::Normalizer::hashCode | ( | void | ) | const |
Generates a hash code for this iterator.
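static UBool icu::Normalizer::isNormalized | ( | const UnicodeString & | src, |
UNormalizationMode | mode, | ||
int32_t | options, | ||
UErrorCode & | errorCode | ||
) |
static |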
Test if a string is in a given normalization form; same as the other version of isNormalized but takes an extra options parameter like most normalization functions.
src | String that is to be tested if it is in a normalization format. |
mode | Which normalization form to test for. |
options | the optional features to be enabled (0 for no options) |
errorCode | ICU error code in/out parameter. Must fulfill U_SUCCESS before the function call. |
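static UBool icu::Normalizer::isNormalized | ( | const UnicodeString & | src, |
UNormalizationMode | mode, | ||
UErrorCode & | errorCode | ||
) |
inline static |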
Test if a string is in a given normalization form.
This is semantically equivalent to source.equals(normalize(source, mode)).
Unlike unorm_quickCheck(), this function returns a definitive result, never a "maybe". For NFD, NFKD, and FCD, both functions work exactly the same. For NFC and NFKC where quickCheck may return "maybe", this function will perform further tests to arrive at a true/false result.
src | String that is to be tested if it is in a normalization format. |
mode | Which normalization form to test for. |
errorCode | ICU error code in/out parameter. Must fulfill U_SUCCESS before the function call. |
UChar32 icu::Normalizer::last | ( | void | ) |
Return the last character in the normalized text.
This is equivalent to setIndexOnly(endIndex()) followed by previous(). (Pre-decrement semantics.)
UChar32 icu::Normalizer::next | ( | void | ) |
Return the next character in the normalized text.
(Post-increment semantics.) If the end of the text has already been reached, DONE is returned. The DONE value could be confused with a U+FFFF non-character code point in the text. If this is possible, you can test getIndex()<endIndex() before calling next(), or (getIndex()<endIndex() || last()!=DONE) after calling next(). (Calling last() will change the iterator state!)
The C API unorm_next() is more efficient and does not have this ambiguity.
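static void icu::Normalizer::normalize | ( | const UnicodeString & | source, |
UNormalizationMode | mode, | ||
int32_t | options, | ||
UnicodeString & | result, | ||
UErrorCode & | status | ||
) |
static |
Normalizes a UnicodeString according to the specified normalization mode.
This is a wrapper for unorm_normalize(), using UnicodeString's.
The options parameter specifies which optional Normalizer features are to be enabled for this operation.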
source | the input string to be normalized. |
mode | the normalization mode |
options | the optional features to be enabled (0 for no options) |
result | The normalized string (on output). |
status | The error code. |
bool icu::Normalizer::operator!= | ( | const Normalizer & | that | ) | const |
inline |
Returns false when both iterators refer to the same character in the same input text.
that | a Normalizer object to compare this one to |
Definition at line 781 of file normlzr.h.
References icu::operator==().
bool icu::Normalizer::operator== | ( | const Normalizer & | that | ) | const |
Returns true when both iterators refer to the same character in the same input text.
that | a Normalizer object to compare this one to |
UChar32 icu::Normalizer::previous | ( | void | ) |
Return the previous character in the normalized text and decrement.
(Pre-decrement semantics.) If the beginning of the text has already been reached, DONE is returned. The DONE value could be confused with a U+FFFF non-character code point in the text. If this is possible, you can test (getIndex()>startIndex() || first()!=DONE). (Calling first() will change the iterator state!)
The C API unorm_previous() is more efficient and does not have this ambiguity.
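static UNormalizationCheckResult icu::Normalizer::quickCheck | ( | const UnicodeString & | source, |
UNormalizationMode | mode, | ||
int32_t | options, | ||
UErrorCode & | status | ||
) |
static |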
Performing quick check on a string; same as the other version of quickCheck but takes an extra options parameter like most normalization functions.
source | string for determining if it is in a normalized format |
mode | normalization format |
options | the optional features to be enabled (0 for no options) |
status | A reference to a UErrorCode to receive any errors |
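static UNormalizationCheckResult icu::Normalizer::quickCheck | ( | const UnicodeString & | source, |
UNormalizationMode | mode, | ||
UErrorCode & | status | ||
) |
inline static |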
Performing quick check on a string, to quickly determine if the string is in a particular normalization format.
This is a wrapper for unorm_quickCheck(), using a UnicodeString.
Three types of result can be returned: UNORM_YES, UNORM_NO, or UNORM_MAYBE. UNORM_YES indicates that the argument string is in the desired normalized format; UNORM_NO indicates that it is not. A UNORM_MAYBE result indicates that a more thorough check is required; the user may have to put the string in its normalized form and compare the results.
source | string for determining if it is in a normalized format |
mode | normalization format |
status | A reference to a UErrorCode to receive any errors |
void icu::Normalizer::reset | ( | void | ) |
Reset the index to the beginning of the text.
This is equivalent to setIndexOnly(startIndex()).
void icu::Normalizer::setIndexOnly | ( | int32_t | index | ) |
Set the iteration position in the input text that is being normalized, without any immediate normalization.
After setIndexOnly(), getIndex() will return the same index that is specified here.
index | the desired index in the input text. |
void icu::Normalizer::setMode | ( | UNormalizationMode | newMode | ) |
Set the normalization mode for this object.
Note: If the normalization mode is changed while iterating over a string, calls to next() and previous() may return previously buffered characters in the old normalization mode until the iteration is able to re-sync at the next base character. It is safest to call setIndexOnly(), reset(), setText(), first(), last(), etc. after calling setMode().
newMode | the new mode for this Normalizer. |
void icu::Normalizer::setOption | ( | int32_t | option, |
UBool | value | ||
) |
Set options that affect this Normalizer's operation.
Options do not change the basic composition or decomposition operation that is being performed, but they control whether certain optional portions of the operation are done. Currently the only available option is obsolete.
It is possible to specify multiple options that are all turned on or off.
option | the option(s) whose value is/are to be set. |
value | the new setting for the option. Use true to turn the option(s) on and false to turn it/them off. |
void icu::Normalizer::setText | ( | const CharacterIterator & | newText, |
UErrorCode & | status | ||
) |
Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning.
newText | a CharacterIterator object that replaces the current input text |
status | a UErrorCode |
void icu::Normalizer::setText | ( | const UnicodeString & | newText, |
UErrorCode & | status | ||
) |
Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning.
newText | a string that replaces the current input text |
status | a UErrorCode |
void icu::Normalizer::setText | ( | ConstChar16Ptr | newText, |
int32_t | length, | ||
UErrorCode & | status | ||
) |
Set the input text over which this Normalizer will iterate.
The iteration position is set to the beginning.
newText | a string that replaces the current input text |
length | the length of the string, or -1 if NUL-terminated |
status | a UErrorCode |
int32_t icu::Normalizer::startIndex | ( | void | ) | const |
Retrieve the index of the start of the input text.
This is the begin index of the CharacterIterator or the start (i.e. index 0) of the string over which this Normalizer is iterating.