Class CaseCanonicalize

java.lang.Object
com.google.javascript.jscomp.regex.CaseCanonicalize

public final class CaseCanonicalize extends Object
Implements the ECMAScript 5 Canonicalize operation used to specify how case-insensitive regular expressions match.

From section 15.10.2.8,

The abstract operation Canonicalize takes a character parameter ch and performs the following steps:
  • If IgnoreCase is false, return ch.
  • Let u be ch converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
  • If u does not consist of a single character, return ch.
  • Let cu be u's character.
  • If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch.
  • Return cu.
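The steps above can be sketched in Java as follows. This is a minimal illustration, not the class's actual implementation: the name CanonicalizeSketch and the explicit ignoreCase parameter are hypothetical, and Java's String.toUpperCase follows the JDK's Unicode tables rather than Unicode 3.0.0, so results for some code units may differ from a JavaScript engine.

```java
import java.util.Locale;

final class CanonicalizeSketch {
    /** Follows the quoted Canonicalize steps for a single UTF-16 code unit. */
    static char canonicalize(char ch, boolean ignoreCase) {
        if (!ignoreCase) {
            return ch;
        }
        // Locale.ROOT avoids locale-specific mappings (an assumption of this sketch).
        String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
        if (u.length() != 1) {
            return ch; // e.g. U+00DF upper-cases to the two-character "SS"
        }
        char cu = u.charAt(0);
        if (ch >= 128 && cu < 128) {
            return ch; // a non-ASCII code unit never canonicalizes to an ASCII one
        }
        return cu;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize('a', true));
        System.out.println(canonicalize('a', false));
    }
}
```

The final guard is what keeps a code unit such as U+017F (long s), whose upper-case form is the ASCII letter S, from matching ASCII letters.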
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final com.google.javascript.jscomp.regex.CharRanges
    CASE_SENSITIVE
    Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8.
  • Method Summary

    Modifier and Type
    Method
    Description
    static char
    caseCanonicalize(char ch)
    Returns the case canonical version of the given code-unit.
    static String
    caseCanonicalize(String s)
    Returns the case canonical version of the given string.
    static com.google.javascript.jscomp.regex.CharRanges
    expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
    Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input.
    static com.google.javascript.jscomp.regex.CharRanges
    reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
    Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • CASE_SENSITIVE

      public static final com.google.javascript.jscomp.regex.CharRanges CASE_SENSITIVE
      Set of code units that are case-insensitively equivalent to some other code unit according to the ECMAScript Canonicalize operation described in section 15.10.2.8. The case-sensitive characters are those that canonicalize to a character other than themselves, or that some other character canonicalizes to. Canonicalize is based on the definition of String.prototype.toUpperCase, which is itself based on Unicode 3.0.0 as specified in UnicodeData-3.0.0 and SpecialCasings-2.txt.

      This table was generated by running the following in Chrome:

       for (var cc = 0; cc < 0x10000; ++cc) {
         var ch = String.fromCharCode(cc);
         var u = ch.toUpperCase();
         // Keep only code units whose upper-case form is a different, single code unit.
         if (ch !== u && u.length === 1) {
           var cu = u.charCodeAt(0);
           // Skip mappings from a non-ASCII code unit to an ASCII one.
           if (cc < 128 || cu >= 128) {
             console.log('0x' + cc.toString(16) + ', 0x' + cu.toString(16) + ',');
           }
         }
       }
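The definition of the case-sensitive set can also be sketched in Java. This is an illustration only: it uses java.util.BitSet in place of CharRanges, the class and method names are hypothetical, and it relies on the JDK's case mappings rather than the Unicode 3.0.0 table, so the exact membership may differ from CASE_SENSITIVE.

```java
import java.util.BitSet;
import java.util.Locale;

final class CaseSensitiveSetSketch {
    /** Canonicalize with IgnoreCase assumed true. */
    static char canonicalize(char ch) {
        String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
        if (u.length() != 1) {
            return ch;
        }
        char cu = u.charAt(0);
        return (ch >= 128 && cu < 128) ? ch : cu;
    }

    /** Marks each code unit that changes under canonicalize, plus the code unit it maps to. */
    static BitSet caseSensitiveCodeUnits() {
        BitSet result = new BitSet(0x10000);
        for (int cc = 0; cc < 0x10000; cc++) {
            char cu = canonicalize((char) cc);
            if (cu != cc) {
                result.set(cc); // canonicalizes to something other than itself
                result.set(cu); // has another code unit that canonicalizes to it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(caseSensitiveCodeUnits().cardinality());
    }
}
```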
       
  • Method Details

    • caseCanonicalize

      public static String caseCanonicalize(String s)
      Returns the case canonical version of the given string.
    • caseCanonicalize

      public static char caseCanonicalize(char ch)
      Returns the case canonical version of the given code-unit. ECMAScript 5 explicitly says that code-units are to be treated as their code-point equivalent, even surrogates.
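One way to read the relationship between the two overloads is that the String version applies the code-unit version to each UTF-16 code unit in turn, so both halves of a surrogate pair are canonicalized independently. The sketch below illustrates that reading under stated assumptions; the names are hypothetical and it is not the library's implementation.

```java
import java.util.Locale;

final class StringCanonicalizeSketch {
    /** Canonicalize with IgnoreCase assumed true. */
    static char canonicalize(char ch) {
        String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
        if (u.length() != 1) {
            return ch;
        }
        char cu = u.charAt(0);
        return (ch >= 128 && cu < 128) ? ch : cu;
    }

    /** Canonicalizes one code unit at a time; surrogates get no special treatment. */
    static String canonicalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(canonicalize(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The surrogate pair encodes a supplementary-plane letter; each half is handled alone.
        System.out.println(canonicalize("a\uD801\uDC28").length());
    }
}
```

Because each surrogate is processed on its own, a supplementary-plane letter is left unchanged here, whereas calling toUpperCase on the whole string may map it as a single code point.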
    • expandToAllMatched

      public static com.google.javascript.jscomp.regex.CharRanges expandToAllMatched(com.google.javascript.jscomp.regex.CharRanges ranges)
      Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes all the code-units in the input and those that are case-insensitively equivalent to a code-unit in the input.
    • reduceToMinimum

      public static com.google.javascript.jscomp.regex.CharRanges reduceToMinimum(com.google.javascript.jscomp.regex.CharRanges ranges)
      Given a character range that may include case sensitive code-units, such as [0-9B-M], returns the character range that includes the minimal set of code units such that for every code unit in the input there is a case-insensitively equivalent canonical code unit in the output.
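The two range operations are complementary, and a small sketch may make the difference concrete. It assumes java.util.BitSet as a stand-in for CharRanges, treats "minimal set" as "the canonical form of each input code unit", and uses the hypothetical names expandSketch and reduceSketch; the real methods may be implemented differently.

```java
import java.util.BitSet;
import java.util.Locale;

final class RangeCanonicalizeSketch {
    /** Canonicalize with IgnoreCase assumed true. */
    static char canonicalize(char ch) {
        String u = String.valueOf(ch).toUpperCase(Locale.ROOT);
        if (u.length() != 1) {
            return ch;
        }
        char cu = u.charAt(0);
        return (ch >= 128 && cu < 128) ? ch : cu;
    }

    /** Replaces every input code unit (assumed below 0x10000) with its canonical form. */
    static BitSet reduceSketch(BitSet input) {
        BitSet out = new BitSet(0x10000);
        for (int cc = input.nextSetBit(0); cc >= 0; cc = input.nextSetBit(cc + 1)) {
            out.set(canonicalize((char) cc));
        }
        return out;
    }

    /** Keeps the input and adds every code unit sharing a canonical form with it. */
    static BitSet expandSketch(BitSet input) {
        BitSet canonical = reduceSketch(input);
        BitSet out = (BitSet) input.clone();
        for (int cc = 0; cc < 0x10000; cc++) {
            if (canonical.get(canonicalize((char) cc))) {
                out.set(cc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet ranges = new BitSet();
        ranges.set('0', '9' + 1); // [0-9]
        ranges.set('B', 'M' + 1); // [B-M]
        System.out.println(expandSketch(ranges).get('b'));
    }
}
```

Under these assumptions, expanding [0-9B-M] adds the lower-case letters b through m, while reducing a set such as {b, B, c} leaves only the canonical code units B and C.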