Class CaseCanonicalize
java.lang.Object
com.google.javascript.jscomp.regex.CaseCanonicalize
Implements the ECMAScript 5 Canonicalize operation used to specify how case-insensitive regular expressions match.
From section 15.10.2.8,
The abstract operation Canonicalize takes a character parameter ch and performs the following steps:
- If IgnoreCase is false, return ch.
- Let u be ch converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
- If u does not consist of a single character, return ch.
- Let cu be u's character.
- If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch.
- Return cu.
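The steps above can be sketched in plain Java as follows. This is only an illustration, not the class's actual implementation: the name `CanonicalizeSketch` and the `ignoreCase` parameter are hypothetical, and Java's `String.toUpperCase` follows the Unicode version of the running JVM rather than Unicode 3.0.0, so results for some code units may differ from the ECMAScript 5 tables.

```java
import java.util.Locale;

final class CanonicalizeSketch {
  /** Follows the quoted steps, in order, for one UTF-16 code unit. */
  static char canonicalize(char ch, boolean ignoreCase) {
    // If IgnoreCase is false, return ch.
    if (!ignoreCase) {
      return ch;
    }
    // Upper-case the one-character string. Locale.ROOT avoids
    // locale-specific mappings such as the Turkish dotted I.
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    // Multi-character results (for example "ß" becoming "SS") are rejected.
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    // A non-ASCII code unit never canonicalizes to an ASCII one.
    if (ch >= 128 && cu < 128) {
      return ch;
    }
    return cu;
  }
}
```

For instance, `canonicalize('a', true)` yields `'A'`, while `'ß'` and the long s `'\u017F'` come back unchanged because of the single-character and ASCII checks respectively.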
Field Summary
| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final com.google.javascript.jscomp.regex.CharRanges | CASE_SENSITIVE | Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8. |
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| static char | caseCanonicalize(char ch) | Returns the case canonical version of the given code-unit. |
| static String | caseCanonicalize(String) | Returns the case canonical version of the given string. |
| static com.google.javascript.jscomp.regex.CharRanges | expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges) | Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input. |
| static com.google.javascript.jscomp.regex.CharRanges | reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges) | Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output. |
Field Details
CASE_SENSITIVE
public static final com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8. The case sensitive characters are the ones that canonicalize to a character other than themselves or have a character that canonicalizes to them. Canonicalize is based on the definition of String.prototype.toUpperCase, which is itself based on Unicode 3.0.0 as specified at UnicodeData-3.0.0 and SpecialCasings-2.txt.
This table was generated by running the below on Chrome:

    for (var cc = 0; cc < 0x10000; ++cc) {
      var ch = String.fromCharCode(cc);
      var u = ch.toUpperCase();
      if (ch != u && u.length === 1) {
        var cu = u.charCodeAt(0);
        if (cc <= 128 || u.charCodeAt(0) > 128) {
          print('0x' + cc.toString(16) + ', 0x' + cu.toString(16) + ',');
        }
      }
    }
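The definition of the case sensitive characters given above can be sketched in Java as a scan over all 16-bit code units. This is an illustrative approximation only: `CaseSensitiveSetSketch` and its methods are hypothetical names, a `java.util.BitSet` stands in for `CharRanges`, and the JVM's Unicode tables are newer than Unicode 3.0.0, so the resulting set is not guaranteed to match the generated table.

```java
import java.util.BitSet;
import java.util.Locale;

final class CaseSensitiveSetSketch {
  /** Canonicalize with IgnoreCase assumed true. */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /**
   * Collects every code unit that canonicalizes to something other than
   * itself, together with the code unit it canonicalizes to.
   */
  static BitSet caseSensitiveCodeUnits() {
    BitSet result = new BitSet(0x10000);
    for (int cc = 0; cc < 0x10000; ++cc) {
      char canon = canonicalize((char) cc);
      if (canon != cc) {
        result.set(cc);     // canonicalizes to a different code unit
        result.set(canon);  // has a code unit that canonicalizes to it
      }
    }
    return result;
  }
}
```

Under these assumptions the set contains both `'a'` and `'A'` but not `'1'`.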
Method Details
caseCanonicalize
Returns the case canonical version of the given string.
caseCanonicalize
public static char caseCanonicalize(char ch)
Returns the case canonical version of the given code-unit. ECMAScript 5 explicitly says that code-units are to be treated as their code-point equivalent, even surrogates.
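One way to picture the relationship between the two overloads is a string version that canonicalizes each UTF-16 code unit independently, so that surrogates are handled one at a time. The sketch below assumes that reading; `StringCanonicalizeSketch` is a hypothetical name, and the per-code-unit helper relies on the JVM's case tables rather than the Unicode 3.0.0 data the class is based on.

```java
import java.util.Locale;

final class StringCanonicalizeSketch {
  /** Canonicalize one code unit with IgnoreCase assumed true. */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Canonicalizes a string one code unit at a time. */
  static String canonicalize(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ++i) {
      // Each half of a surrogate pair is treated as its own character.
      sb.append(canonicalize(s.charAt(i)));
    }
    return sb.toString();
  }
}
```

With this sketch, `canonicalize("Hello ß")` gives `"HELLO ß"`, and a surrogate pair such as `"\uD801\uDC28"` is returned unchanged because neither half has a case mapping of its own.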
expandToAllMatched
public static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input.
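The expansion described above can be illustrated with a small stand-alone sketch. It does not use the real `CharRanges` type: a `java.util.BitSet` of code units stands in for it, `ExpandSketch` is a hypothetical name, and the brute-force scan is chosen for clarity rather than efficiency.

```java
import java.util.BitSet;
import java.util.Locale;

final class ExpandSketch {
  /** Canonicalize one code unit with IgnoreCase assumed true. */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Adds every code unit that shares a canonical form with a member. */
  static BitSet expandToAllMatched(BitSet ranges) {
    // Canonical forms of everything in the input.
    BitSet canon = new BitSet(0x10000);
    for (int i = ranges.nextSetBit(0); i >= 0; i = ranges.nextSetBit(i + 1)) {
      canon.set(canonicalize((char) i));
    }
    // Keep the input and add any code unit whose canonical form is covered.
    BitSet out = (BitSet) ranges.clone();
    for (int cc = 0; cc < 0x10000; ++cc) {
      if (canon.get(canonicalize((char) cc))) {
        out.set(cc);
      }
    }
    return out;
  }
}
```

Applied to the code units of `[0-9B-M]`, this sketch produces the code units of `[0-9B-Mb-m]`.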
reduceToMinimum
public static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output.
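One plausible reading of this reduction is that every code unit is replaced by its canonical form and duplicates are dropped. The sketch below assumes that reading and may not match the actual method in every detail: `ReduceSketch` is a hypothetical name, and a `java.util.BitSet` again stands in for `CharRanges`.

```java
import java.util.BitSet;
import java.util.Locale;

final class ReduceSketch {
  /** Canonicalize one code unit with IgnoreCase assumed true. */
  static char canonicalize(char ch) {
    String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
    if (u.length() != 1) {
      return ch;
    }
    char cu = u.charAt(0);
    return (ch >= 128 && cu < 128) ? ch : cu;
  }

  /** Keeps only the canonical representative of each input code unit. */
  static BitSet reduceToMinimum(BitSet ranges) {
    BitSet out = new BitSet(0x10000);
    for (int i = ranges.nextSetBit(0); i >= 0; i = ranges.nextSetBit(i + 1)) {
      // Equivalent code units collapse onto the same canonical bit.
      out.set(canonicalize((char) i));
    }
    return out;
  }
}
```

Under this assumption, the code units of `[0-9B-Mb-m]` reduce to those of `[0-9B-M]`, since each lower-case letter collapses onto its upper-case canonical form.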