7.8.13. TokenNgram
7.8.13.1. Summary
TokenNgram can define its behavior dynamically via its options.
For example, we can use it as a unigram, bigram, or trigram tokenizer by changing the value of the n option, as below.
Uni-gram:
Execution example:
tokenize --tokenizer 'TokenNgram("n", 1)' --string "Hello World"
# [
# [
# 0,
# 1556063953.911713,
# 0.05761265754699707
# ],
# [
# {
# "value": "H",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "e",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "l",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "l",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "o",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " ",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "W",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "o",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "r",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "l",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "d",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
Bi-gram:
Execution example:
tokenize --tokenizer 'TokenNgram("n", 2)' --string "Hello World"
# [
# [
# 0,
# 1556064877.095833,
# 0.0002193450927734375
# ],
# [
# {
# "value": "He",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "el",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lo",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "o ",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " W",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Wo",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "or",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rl",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ld",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "d",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
Tri-gram:
Execution example:
tokenize --tokenizer 'TokenNgram("n", 3)' --string "Hello World"
# [
# [
# 0,
# 1556064951.790899,
# 0.0002274513244628906
# ],
# [
# {
# "value": "Hel",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ell",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "llo",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lo ",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "o W",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " Wo",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Wor",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "orl",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rld",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ld",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "d",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
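The uni-gram, bi-gram, and tri-gram outputs above all follow the same sliding-window rule: one token starts at every character position, and tokens near the end of the text are truncated. A minimal Python sketch of that rule (an illustration only, not Groonga's implementation):

```python
def ngram_tokens(text, n):
    # One token starts at every character position; Python slicing
    # naturally truncates tokens that run past the end of the text.
    return [(text[i:i + n], i) for i in range(len(text))]

print(ngram_tokens("Hello World", 3)[:3])   # [('Hel', 0), ('ell', 1), ('llo', 2)]
print(ngram_tokens("Hello World", 3)[-2:])  # [('ld', 9), ('d', 10)]
```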
This tokenizer also has other options besides n, as described below.
7.8.13.2. Syntax
TokenNgram has optional parameters.
No options:
TokenNgram
If we don’t specify any options, TokenNgram behaves the same as TokenBigram.
Specify option:
TokenNgram("n", 3)
TokenNgram("loose_symbol", true)
TokenNgram("loose_blank", true)
TokenNgram("remove_blank", true)
TokenNgram("report_source_location", true)
TokenNgram("unify_alphabet", true)
TokenNgram("unify_symbol", true)
TokenNgram("unify_digit", true)
Specify multiple options:
TokenNgram("loose_symbol", true, "loose_blank", true)
We can also specify combinations of multiple options other than the example above.
7.8.13.3. Usage
7.8.13.4. Simple usage
Here is an example of TokenNgram.
If we use TokenNgram without any options, it behaves the same as TokenBigram, as below:
If no normalizer is used:
Execution example:
tokenize --tokenizer TokenNgram --string "Hello World"
# [
# [
# 0,
# 1556063548.341422,
# 0.0002450942993164062
# ],
# [
# {
# "value": "He",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "el",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lo",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "o ",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " W",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Wo",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "or",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rl",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ld",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "d",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
If a normalizer is used:
Execution example:
tokenize --tokenizer TokenNgram --string "Hello World" --normalizer NormalizerAuto
# [
# [
# 0,
# 1556063752.798628,
# 0.0001776218414306641
# ],
# [
# {
# "value": "hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "world",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
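As the output above shows, when a normalizer is in effect the default TokenNgram groups each run of alphabetic characters into a single lowercased token instead of emitting character bigrams. A rough Python sketch of that grouping (illustrative only; Groonga's real character classification is much richer than this regular expression):

```python
import re

def unify_alpha_tokens(text):
    # NormalizerAuto lowercases ASCII; each alphabetic run then
    # becomes a single token, numbered by token position.
    runs = re.findall(r"[a-z]+", text.lower())
    return list(zip(runs, range(len(runs))))

print(unify_alpha_tokens("Hello World"))  # [('hello', 0), ('world', 1)]
```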
7.8.13.5. Advanced usage
We can specify multiple options for TokenNgram.
For example, we can handle variants of a phone number by using loose_symbol and loose_blank as below.
We can search for 0123(45)6789, 0123-45-6789, and 0123 45 6789 all with the query 0123456789, as in the examples below.
Execution example:
tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123(45)6789" --normalizer NormalizerAuto
# [
# [
# 0,
# 1557187317.825164,
# 0.0003116130828857422
# ],
# [
# {
# "value": "0123",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "(",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "45",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": ")",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "6789",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "0123456789",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123-45-6789" --normalizer NormalizerAuto
# [
# [
# 0,
# 1557187366.441741,
# 0.0002820491790771484
# ],
# [
# {
# "value": "0123",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "-",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "45",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "-",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "6789",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "0123456789",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123 45 6789" --normalizer NormalizerAuto
# [
# [
# 0,
# 1557187404.034283,
# 0.0003006458282470703
# ],
# [
# {
# "value": "0123",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "45",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "6789",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "0123456789",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
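In the three outputs above, the ordinary tokens are followed by an empty separator token and then a "loose" form with symbols and blanks removed; the loose form is what lets the single query 0123456789 match every variant. A hedged Python sketch of deriving such a loose form (the separator-token mechanics are Groonga-internal and omitted here):

```python
import re

def loose_form(text):
    # Drop everything that is not alphanumeric so every variant of
    # the same phone number collapses to one searchable token.
    return re.sub(r"[^0-9A-Za-z]", "", text)

for variant in ["0123(45)6789", "0123-45-6789", "0123 45 6789"]:
    print(loose_form(variant))  # each prints: 0123456789
```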
7.8.13.6. Parameters
7.8.13.6.1. Optional parameters
There are eight optional parameters.
7.8.13.6.1.1. n
This option specifies the N of the N-gram. For example, n is 3 for a trigram.
7.8.13.6.1.2. loose_symbol
Tokenize keywords that include symbols so that they can be found by queries both with and without the symbols.
For example, a keyword 090-1111-2222 will be found by any of 09011112222, 090, 1111, 2222, and 090-1111-2222.
7.8.13.6.1.3. loose_blank
Tokenize keywords that include blanks so that they can be found by queries both with and without the blanks.
For example, a keyword 090 1111 2222 will be found by any of 09011112222, 090, 1111, 2222, and 090 1111 2222.
7.8.13.6.1.4. remove_blank
Tokenize keywords that include blanks so that they can be found by queries without the blanks.
For example, a keyword 090 1111 2222 will be found by any of 09011112222, 090, 1111, or 2222.
Note that the keyword won’t be found by a query that includes blanks, such as 090 1111 2222.
7.8.13.6.1.5. report_source_location
This option reports the position of each token in the source text, which highlight_html uses to locate the highlight target.
We only need this option when we highlight tokens with highlight_html.
Until now, Groonga always used NormalizerAuto as the normalizer when tokenizing the text to be highlighted.
Therefore, if we use NormalizerNFKC100 as the normalizer, it sometimes can’t find the position of the highlight, as below.
Execution example:
table_create Entries TABLE_NO_KEY
#[[0,0.0,0.0],true]
column_create Entries body COLUMN_SCALAR ShortText
#[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText --normalizer 'NormalizerNFKC100'
#[[0,0.0,0.0],true]
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
#[[0,0.0,0.0],true]
load --table Entries
[
{"body": "ア㌕Az"}
]
#[[0,0.0,0.0],1]
select Entries --match_columns body --query 'グラム' --output_columns 'highlight_html(body, Terms)'
#[
# [
# 0,
# 1558915202.24493,
# 0.0003714561462402344
# ],
# [[[0],[["highlight_html",null]]]]
#]
This is because a different normalizer is used when normalizing the tokens.
This option reduces the shift of the highlight position, as below.
Execution example:
table_create Entries TABLE_NO_KEY
#[[0,0.0,0.0],true]
column_create Entries body COLUMN_SCALAR ShortText
#[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer 'TokenNgram("report_source_location", true)' --normalizer 'NormalizerNFKC100'
#[[0,0.0,0.0],true]
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
#[[0,0.0,0.0],true]
load --table Entries
[
{"body": "ア㌕Az"}
]
#[[0,0.0,0.0],1]
select Entries --match_columns body --query 'グラム' --output_columns 'highlight_html(body, Terms)'
#[
# [
# 0,
# 0.0,
# 0.0
# ],
# [
# [
# [
# 1
# ],
# [
# [
# "highlight_html",
# null
# ]
# ],
# [
# "ア<span class=\"keyword\">㌕</span>Az"
# ]
# ]
# ]
#]
7.8.13.6.1.6. unify_alphabet
If we set this option to false, TokenNgram uses the bigram tokenize method for ASCII characters.
The default value of this option is true.
Execution example:
tokenize --tokenizer 'TokenNgram("unify_alphabet", false)' --string "abcd ABCD"
#[
# [
# 0,
# 1558570398.145967,
# 0.0009109973907470703
# ],
# [
# {
# "value": "ab",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "bc",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "cd",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "d ",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " A",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "AB",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "BC",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "CD",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "D",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
#]
7.8.13.6.1.7. unify_symbol
If we set this option to false, TokenNgram uses the bigram tokenize method for symbols.
TokenNgram("unify_symbol", false) behaves the same as TokenBigramSplitSymbol.
The default value of this option is true.
Execution example:
tokenize --tokenizer 'TokenNgram("unify_symbol", false)' --string "___---" --normalizer NormalizerAuto
#[
# [
# 0,
# 1558913369.875591,
# 0.0008268356323242188
# ],
# [
# {
# "value": "__",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "__",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "_-",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "--",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "--",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "-",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
#]
7.8.13.6.1.8. unify_digit
If we set this option to false, TokenNgram uses the bigram tokenize method for digits.
The default value of this option is true.
Execution example:
tokenize --tokenizer 'TokenNgram("unify_digit", false)' --string "012345 6789" --normalizer NormalizerAuto
#[
# [
# 0,
# 1558914023.967506,
# 0.001635313034057617
# ],
# [
# {
# "value": "01",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "12",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "23",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "34",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "45",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "5",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "67",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "78",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "89",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "9",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
#]
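When one of the unify_* options is set to false, each run of that character class is split with the same sliding-window rule shown in the Summary: one bigram per position, with a shorter token at the end of each run. A small Python sketch that reproduces the token values and positions of the unify_digit example above (illustrative only, not Groonga's implementation):

```python
def split_runs_bigram(runs):
    # Emit character bigrams within each run (the last token of a run
    # is a single character), numbering positions continuously.
    tokens, pos = [], 0
    for run in runs:
        for i in range(len(run)):
            tokens.append((run[i:i + 2], pos))
            pos += 1
    return tokens

print(split_runs_bigram(["012345", "6789"]))
# [('01', 0), ('12', 1), ('23', 2), ('34', 3), ('45', 4), ('5', 5),
#  ('67', 6), ('78', 7), ('89', 8), ('9', 9)]
```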