7.7. Normalizers

7.7.1. Summary

Groonga has normalizer module that normalizes text. It is used when tokenizing text and storing table key. For example, A and a are processed as the same character after normalization.

Normalizer module can be added as a plugin. You can customize text normalization by registering your normalizer plugins to Groonga.

A normalizer module is attached to a table. A table can have zero or one normalizer module. You can attach a normalizer module to a table by normalizer option in table_create.

Here is an example table_create that uses NormalizerAuto normalizer module:

Execution example:

table_create Dictionary TABLE_HASH_KEY ShortText --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]

Note

Groonga 2.0.9 or earlier doesn’t have --normalizer option in table_create. KEY_NORMALIZE flag was used instead.

You can open an old database by Groonga 2.1.0 or later. An old database means that the database is created by Groonga 2.0.9 or earlier. But you cannot open the opened old database by Groonga 2.0.9 or earlier. Once you open the old database by Groonga 2.1.0 or later, KEY_NORMALIZE flag information in the old database is converted to normalizer information. So Groonga 2.0.9 or earlier cannot find KEY_NORMALIZE flag information in the opened old database.

Keys of a table that has a normalizer module are normalized:

Execution example:

load --table Dictionary
[
{"_key": "Apple"},
{"_key": "black"},
{"_key": "COLOR"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Dictionary
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "_key",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "apple"
#       ],
#       [
#         2,
#         "black"
#       ],
#       [
#         3,
#         "color"
#       ]
#     ]
#   ]
# ]

NormalizerAuto normalizer normalizes a text as a downcased text. For example, "Apple" is normalized to "apple", "black" is normalized to "black" and "COLOR" is normalized to "color".

If a table is a lexicon for fulltext search, tokenized tokens are normalized. Because tokens are stored as table keys. Table keys are normalized as described above.

7.7.2. Built-in normalizers

Here is a list of built-in normalizers:

7.7.3. Additional normalizers

There are additional normalizers:

7.7.4. See also