7.3.68. tokenize
7.3.68.1. Summary
The tokenize command tokenizes text with the specified tokenizer.
It is useful for debugging tokenization.
7.3.68.2. Syntax
This command takes many parameters. tokenizer and string are required
parameters. The others are optional:
tokenize tokenizer
         string
         [normalizer=null]
         [flags=NONE]
         [mode=ADD]
         [token_filters=NONE]
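For reference, parameters can be passed positionally in the order above
or by name. A minimal sketch (the values are placeholders and the
output is not shown here):
# Positional form: tokenizer, string, normalizer, flags, mode, token_filters.
tokenize TokenBigram "Fulltext Search" NormalizerAuto NONE ADD
# Equivalent named form, like the --mode examples later in this section.
tokenize --tokenizer TokenBigram --string "Fulltext Search" --normalizer NormalizerAuto --mode ADD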
7.3.68.3. Usage
Here is a simple example.
Execution example:
tokenize TokenBigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "Fu"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ul"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lt"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "ex"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "t "
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": " S"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "Se"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ea"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "ar"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "rc"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "ch"
# },
# {
# "position": 14,
# "force_prefix": false,
# "value": "h"
# }
# ]
# ]
It uses only the required parameters: tokenizer is TokenBigram and
string is "Fulltext Search". It returns the tokens generated by
tokenizing "Fulltext Search" with the TokenBigram tokenizer. It doesn't
normalize "Fulltext Search".
7.3.68.4. Parameters
This section describes all parameters. Parameters are categorized into
required parameters and optional parameters.
7.3.68.4.1. Required parameters
There are two required parameters, tokenizer and string.
7.3.68.4.1.1. tokenizer
Specifies the tokenizer name. The tokenize command uses the tokenizer
that is named tokenizer.
See Tokenizers about built-in tokenizers.
Here is an example that uses the built-in TokenTrigram tokenizer.
Execution example:
tokenize TokenTrigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "Ful"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ull"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "llt"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lte"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "tex"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "ext"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "xt "
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "t S"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": " Se"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "Sea"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ear"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "arc"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "rch"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "ch"
# },
# {
# "position": 14,
# "force_prefix": false,
# "value": "h"
# }
# ]
# ]
If you want to use other tokenizers, you need to register an additional
tokenizer plugin with the register command. For example, you can use a
KyTea based tokenizer by registering tokenizers/kytea, as sketched
below.
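Here is a hedged sketch of that flow. The tokenizer name TokenKytea is
an assumption (the exact name depends on the plugin) and the output is
omitted:
register tokenizers/kytea
# Assumption: the plugin registers a tokenizer named TokenKytea.
# The string means "This is a pen" in Japanese.
tokenize TokenKytea "これはペンです"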
7.3.68.4.1.2. string
Specifies any string which you want to tokenize.
If you want to include spaces in string, you need to quote string with
single quotes (') or double quotes (").
Here is an example that uses spaces in string.
Execution example:
tokenize TokenBigram "Groonga is a fast fulltext earch engine!"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "Gr"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ro"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "oo"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "on"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "ng"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "ga"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "a "
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": " i"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "is"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "s "
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": " a"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "a "
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": " f"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "fa"
# },
# {
# "position": 14,
# "force_prefix": false,
# "value": "as"
# },
# {
# "position": 15,
# "force_prefix": false,
# "value": "st"
# },
# {
# "position": 16,
# "force_prefix": false,
# "value": "t "
# },
# {
# "position": 17,
# "force_prefix": false,
# "value": " f"
# },
# {
# "position": 18,
# "force_prefix": false,
# "value": "fu"
# },
# {
# "position": 19,
# "force_prefix": false,
# "value": "ul"
# },
# {
# "position": 20,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 21,
# "force_prefix": false,
# "value": "lt"
# },
# {
# "position": 22,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 23,
# "force_prefix": false,
# "value": "ex"
# },
# {
# "position": 24,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 25,
# "force_prefix": false,
# "value": "t "
# },
# {
# "position": 26,
# "force_prefix": false,
# "value": " e"
# },
# {
# "position": 27,
# "force_prefix": false,
# "value": "ea"
# },
# {
# "position": 28,
# "force_prefix": false,
# "value": "ar"
# },
# {
# "position": 29,
# "force_prefix": false,
# "value": "rc"
# },
# {
# "position": 30,
# "force_prefix": false,
# "value": "ch"
# },
# {
# "position": 31,
# "force_prefix": false,
# "value": "h "
# },
# {
# "position": 32,
# "force_prefix": false,
# "value": " e"
# },
# {
# "position": 33,
# "force_prefix": false,
# "value": "en"
# },
# {
# "position": 34,
# "force_prefix": false,
# "value": "ng"
# },
# {
# "position": 35,
# "force_prefix": false,
# "value": "gi"
# },
# {
# "position": 36,
# "force_prefix": false,
# "value": "in"
# },
# {
# "position": 37,
# "force_prefix": false,
# "value": "ne"
# },
# {
# "position": 38,
# "force_prefix": false,
# "value": "e!"
# },
# {
# "position": 39,
# "force_prefix": false,
# "value": "!"
# }
# ]
# ]
7.3.68.4.2. Optional parameters
There are several optional parameters.
7.3.68.4.2.1. normalizer
Specifies the normalizer name. The tokenize command uses the normalizer
that is named normalizer. A normalizer is important for N-gram family
tokenizers such as TokenBigram.
A normalizer detects the character type of each character while
normalizing. N-gram family tokenizers use those character types while
tokenizing.
Here is an example that doesn't use a normalizer.
Execution example:
tokenize TokenBigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "Fu"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ul"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lt"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "ex"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "t "
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": " S"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "Se"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ea"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "ar"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "rc"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "ch"
# },
# {
# "position": 14,
# "force_prefix": false,
# "value": "h"
# }
# ]
# ]
All alphabetic characters are tokenized two characters at a time. For
example, Fu is a token.
Here is an example that uses a normalizer.
Execution example:
tokenize TokenBigram "Fulltext Search" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "fulltext"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "search"
# }
# ]
# ]
A continuous run of alphabetic characters is tokenized as one token.
For example, fulltext is a token.
If you want to tokenize two characters at a time even with a
normalizer, use TokenBigramSplitSymbolAlpha.
Execution example:
tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "fu"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ul"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lt"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "ex"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "t"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "se"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ea"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ar"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "rc"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "ch"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "h"
# }
# ]
# ]
All alphabetic characters are tokenized two characters at a time and
normalized to lower case. For example, fu is a token.
7.3.68.4.2.2. flags
Specifies tokenization customize options. You can specify multiple
options separated by "|". For example,
NONE|ENABLE_TOKENIZED_DELIMITER.
Here are the available flags.
Flag | Description
NONE | Just ignored.
ENABLE_TOKENIZED_DELIMITER | Enables tokenized delimiter. See Tokenizers about tokenized delimiter details.
Here is an example that uses ENABLE_TOKENIZED_DELIMITER.
Execution example:
tokenize TokenDelimit "Fulltext Seacrch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "full"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "text sea"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "crch"
# }
# ]
# ]
The TokenDelimit tokenizer is one of the tokenizers that support the
tokenized delimiter. ENABLE_TOKENIZED_DELIMITER enables the tokenized
delimiter.
The tokenized delimiter is a special character that indicates a token
border. It is U+FFFE. This code point is not assigned to any character,
which means it does not appear in normal strings, so it is a good
character for this purpose. If ENABLE_TOKENIZED_DELIMITER is enabled,
the target string is treated as an already tokenized string and the
tokenizer just splits it at the tokenized delimiters.
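In the execution example above, the string actually contains the
invisible U+FFFE character between "Full" and "text" and between "Sea"
and "crch", which is why those fragments come out as the tokens "full",
"text sea", and "crch". A minimal sketch of the same call, writing the
delimiter as <U+FFFE> for readability (this placeholder is for display
only; the real command must embed the actual U+FFFE character):
# <U+FFFE> below stands for the literal U+FFFE character, which is invisible.
tokenize TokenDelimit "Full<U+FFFE>text Sea<U+FFFE>crch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER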
7.3.68.4.2.3. mode
Specifies the tokenize mode. If the mode is ADD, the text is tokenized
by the rule used when adding a document. If the mode is GET, the text
is tokenized by the rule used when searching for a document. If the
mode is omitted, the text is tokenized in ADD mode.
The default mode is ADD.
Here is an example of the ADD mode.
Execution example:
tokenize TokenBigram "Fulltext Search" --mode ADD
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "Fu"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ul"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lt"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "ex"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "t "
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": " S"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "Se"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ea"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "ar"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "rc"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "ch"
# },
# {
# "position": 14,
# "force_prefix": false,
# "value": "h"
# }
# ]
# ]
The last alphabetic character is tokenized as a one-character token.
Here is an example of the GET mode.
Execution example:
tokenize TokenBigram "Fulltext Search" --mode GET
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "Fu"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "ul"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lt"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "te"
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": "ex"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "xt"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "t "
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": " S"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "Se"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "ea"
# },
# {
# "position": 11,
# "force_prefix": false,
# "value": "ar"
# },
# {
# "position": 12,
# "force_prefix": false,
# "value": "rc"
# },
# {
# "position": 13,
# "force_prefix": false,
# "value": "ch"
# }
# ]
# ]
The last alphabetic characters are tokenized as a two-character token.
7.3.68.4.2.4. token_filters
Specifies the token filter names. The tokenize command uses the token
filters that are named by token_filters.
See Token filters about token filters.
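Here is a hedged sketch of passing a token filter. The plugin name
token_filters/stem and the filter name TokenFilterStem are assumptions
(they must be available in your installation) and the output is
omitted:
# Assumption: token_filters/stem is installed and registers TokenFilterStem.
register token_filters/stem
tokenize TokenBigram "I develop Groonga" NormalizerAuto --token_filters TokenFilterStem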
7.3.68.5. Return value
The tokenize command returns the tokenized tokens. Each token has some
attributes besides the token itself. More attributes may be added in
the future:
[HEADER, tokens]
HEADER
See Output format about HEADER.
tokens
tokens is an array of tokens. Each token is an object that has the
following attributes.
Name | Description
value | Token itself.
position | The N-th token.
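For reference, each element of tokens in the execution examples above
looks like the following object (the output also includes a
force_prefix attribute, which is not listed in the table above):
{
  "position": 0,
  "force_prefix": false,
  "value": "Fu"
}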