7.7.2.7. `NormalizerTable`

7.7.2.7.1. Summary

New in version 11.0.4.

NormalizerTable normalizes text by using a user-defined normalization table. A user-defined normalization table is an ordinary table, but it must satisfy some conditions, which are described in the Usage and Parameters sections.

Note

The normalized text depends on the contents of the user-defined normalization table. If you use this normalizer for a lexicon, you need to re-index whenever you change your user-defined normalization table.
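The reason for re-indexing can be illustrated with a small Python sketch. It is not Groonga code: the `normalize` function below is a hypothetical stand-in for NormalizerTable that replaces table keys in a single pass, preferring longer keys. It shows that a key normalized with an old table no longer matches the same text normalized with a changed table.

```python
import re


def normalize(text, table):
    """Hypothetical stand-in for NormalizerTable (single pass, longest key first)."""
    if not table:
        return text
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


old_table = {"a": "<A>"}
new_table = {"a": "<A>", "ac": "<AC>"}

# The lexicon key was produced at index time with the old table.
indexed_key = normalize("hack", old_table)  # -> h<A>ck
# The query is normalized at search time with the new table.
query_key = normalize("hack", new_table)    # -> h<AC>k

# The stale key does not match until the data is indexed again.
print(indexed_key == query_key)  # -> False
```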

7.7.2.7.2. Syntax

There is one required parameter and there are two optional parameters.

Required parameter:

NormalizerTable("normalized", "UserDefinedTable.normalized_column")

Optional parameters:

NormalizerTable("normalized", "UserDefinedTable.normalized_column",
                "target", "target_column")

NormalizerTable("normalized", "UserDefinedTable.normalized_column",
                "unicode_version", "13.0.0")

7.7.2.7.3. Usage

7.7.2.7.3.1. Simple usage

Here is an example of NormalizerTable.

NormalizerTable normalizes text by using a user-defined normalization table. The user-defined normalization table used here must satisfy the following conditions:

Table type must be TABLE_PAT_KEY.

Table key type must be ShortText.

Table must have at least one ShortText column.

Here are the schema and data for this example:

Execution example:

table_create Normalizations TABLE_PAT_KEY ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Normalizations normalized COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Normalizations
[
{"_key": "a", "normalized": "<A>"},
{"_key": "ac", "normalized": "<AC>"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 2]

You can normalize a to <A> and ac to <AC> with this user-defined normalization table. For example:

Groonga -> Groong<A>

hack -> h<AC>k
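The two results above are consistent with a longest-match lookup: at each position, the longest table key that prefixes the remaining text is replaced by its normalized value, and characters with no matching key are copied through unchanged. The following Python sketch illustrates that reading. It is not Groonga's implementation, the longest-match rule is an assumption inferred from the hack example, and the function name `normalize_with_table` is hypothetical.

```python
def normalize_with_table(text, table):
    """Replace the longest matching table key at each position.

    Illustrative only: assumes longest-prefix matching, which is what
    the "hack" example suggests.
    """
    result = []
    i = 0
    max_len = max((len(key) for key in table), default=0)
    while i < len(text):
        # Try the longest candidate first so that "ac" wins over "a".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in table:
                result.append(table[candidate])
                i += length
                break
        else:
            # No key matches at this position; keep the character as-is.
            result.append(text[i])
            i += 1
    return "".join(result)


normalizations = {"a": "<A>", "ac": "<AC>"}
print(normalize_with_table("Groonga", normalizations))  # -> Groong<A>
print(normalize_with_table("hack", normalizations))     # -> h<AC>k
```

With a shortest-match rule, hack would instead become h<A>ck, so the order in which keys are tried matters whenever one key is a prefix of another.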

Here are examples of NormalizerTable with this user-defined normalization table:

Execution example:

normalize 'NormalizerTable("normalized", "Normalizations.normalized")' "Groonga"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "normalized": "Groong<A>",
#     "types": [],
#     "checks": []
#   }
# ]
normalize 'NormalizerTable("normalized", "Normalizations.normalized")' "hack"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "normalized": "h<AC>k",
#     "types": [],
#     "checks": []
#   }
# ]

7.7.2.7.3.2. Unicode version

Some internal processing, such as tokenization and highlighting, uses character types. NormalizerTable provides character types based on Unicode. You can specify the Unicode version to use with the unicode_version option.

Here is an example that uses Unicode 13.0.0:

Execution example:

normalize 'NormalizerTable("normalized", "Normalizations.normalized", "unicode_version", "13.0.0")' "Groonga" WITH_TYPES
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "normalized": "Groong<A>",
#     "types": [
#       "alpha",
#       "alpha",
#       "alpha",
#       "alpha",
#       "alpha",
#       "alpha",
#       "symbol",
#       "alpha",
#       "symbol"
#     ],
#     "checks": []
#   }
# ]

The default Unicode version is 5.0.0.
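In the output above, the types array has one entry per character of the normalized text. As a rough illustration of how a character type can be derived from Unicode data, the following Python sketch maps Unicode general categories to coarse type names. The mapping and the name `rough_char_type` are assumptions made for this example. It does not attempt to reproduce Groonga's own classification, and Python uses the Unicode version bundled with its unicodedata module rather than the one selected by unicode_version.

```python
import unicodedata


def rough_char_type(ch):
    """Map a character to a coarse type name (hypothetical mapping)."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "alpha"   # letters such as "G" or "a"
    if category.startswith("N"):
        return "digit"   # numbers
    if category.startswith(("S", "P")):
        return "symbol"  # symbols and punctuation such as "<" and ">"
    return "others"


# The Unicode version this Python build classifies characters with.
print(unicodedata.unidata_version)

types = [rough_char_type(ch) for ch in "Groong<A>"]
print(types)
```

For the ASCII text Groong<A>, this sketch yields six alpha entries followed by symbol, alpha, symbol, which matches the types shown in the execution example.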

7.7.2.7.3.3. Advanced usage

You can put normalization target texts in a column instead of _key. In this case, you need to create an index column for that column, and the index column must satisfy the following conditions:

Lexicon type of the index column must be TABLE_PAT_KEY.

Lexicon key type of the index column must be ShortText.

Lexicon of the index column must not have tokenizer.

With this usage, the normalization table itself can be of any type, such as TABLE_NO_KEY. This is useful when you can’t control the table type. For example, PGroonga users can use only this usage.

Here are the schema and data for this example:

Execution example:

table_create ColumnNormalizations TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create ColumnNormalizations target_column COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create ColumnNormalizations normalized COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Targets TABLE_PAT_KEY ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Targets column_normalizations_target_column \
   COLUMN_INDEX ColumnNormalizations target_column
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table ColumnNormalizations
[
{"target_column": "a", "normalized": "<A>"},
{"target_column": "ac", "normalized": "<AC>"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 2]

You need to use the target option to use this user-defined normalization table. The above schema names the column target_column for explanation purposes. Generally, the _column suffix in target_column is redundant, but it is added here to make it easy to distinguish the parameter name (target) from the parameter value (target_column).
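The relationship between the two columns can be sketched in Python as follows. This is only an illustration under stated assumptions: `build_lookup` is a hypothetical helper that plays the role of the index column by mapping each target text to the normalized text of the same record, and `normalize` is a hypothetical stand-in for NormalizerTable that prefers longer keys.

```python
import re

# Records of the key-less table: the target text lives in a column.
rows = [
    {"target_column": "a", "normalized": "<A>"},
    {"target_column": "ac", "normalized": "<AC>"},
]


def build_lookup(rows, target, normalized):
    """Map target text to normalized text, as the index column allows."""
    return {row[target]: row[normalized] for row in rows}


def normalize(text, lookup):
    """Hypothetical stand-in for NormalizerTable (longest key first)."""
    if not lookup:
        return text
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: lookup[match.group(0)], text)


# "target" is the parameter name; "target_column" is its value.
lookup = build_lookup(rows, target="target_column", normalized="normalized")
print(normalize("Groonga", lookup))  # -> Groong<A>
print(normalize("hack", lookup))     # -> h<AC>k
```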

Here are examples of NormalizerTable with this user-defined normalization table:

Execution example:

normalize 'NormalizerTable("normalized", "ColumnNormalizations.normalized", "target", "target_column")' "Groonga"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "normalized": "Groong<A>",
#     "types": [],
#     "checks": []
#   }
# ]
normalize 'NormalizerTable("normalized", "ColumnNormalizations.normalized", "target", "target_column")' "hack"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "normalized": "h<AC>k",
#     "types": [],
#     "checks": []
#   }
# ]

7.7.2.7.4. Parameters

7.7.2.7.4.1. Required parameter

7.7.2.7.4.1.1. `normalized`

This option specifies a column that holds normalized texts. The normalization target texts are the texts in the corresponding _key column or in the column specified by target.

The value type of the column specified for this option must be one of ShortText, Text and LongText.

If you don’t use target, the table of the column specified for this option must satisfy the following conditions:

Table type is TABLE_PAT_KEY

Table key type is ShortText

See Simple usage for usage of this case.

7.7.2.7.4.2. Optional parameters

7.7.2.7.4.2.1. `target`

This option specifies a column that holds normalization target texts.

The value type of the column specified for this option must be one of ShortText, Text and LongText.

You must create an index column for the column specified for this option. The index column and its lexicon must satisfy the following conditions:

The index column can be a single-column index or a multi-column index.

Lexicon type of the index column must be TABLE_PAT_KEY.

Lexicon key type of the index column must be ShortText.

Lexicon of the index must not have tokenizer.

See Advanced usage for usage of this case.

7.7.2.7.4.2.2. `unicode_version`

This option specifies the Unicode version to use when determining character types.

The default Unicode version is 5.0.0.

See Unicode version for usage.

7.7.2.7.5. See also

normalize
