7.8.13. TokenNgram

7.8.13.1. Summary

TokenNgram can define its behavior dynamically via its options. For example, we can use it as a unigram, bigram, or trigram tokenizer by changing the value of the n option as below.

Uni-gram:

Execution example:

tokenize --tokenizer 'TokenNgram("n", 1)' --string "Hello World"
# [
#   [
#     0,
#     1556063953.911713,
#     0.05761265754699707
#   ],
#   [
#     {
#       "value": "H",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "e",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "l",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "l",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " ",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "W",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "r",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "l",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Bi-gram:

Execution example:

tokenize --tokenizer 'TokenNgram("n", 2)' --string "Hello World"
# [
#   [
#     0,
#     1556064877.095833,
#     0.0002193450927734375
#   ],
#   [
#     {
#       "value": "He",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "el",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ll",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "lo",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o ",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " W",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "Wo",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "or",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "rl",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ld",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Tri-gram:

Execution example:

tokenize --tokenizer 'TokenNgram("n", 3)' --string "Hello World"
# [
#   [
#     0,
#     1556064951.790899,
#     0.0002274513244628906
#   ],
#   [
#     {
#       "value": "Hel",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ell",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "llo",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "lo ",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o W",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " Wo",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "Wor",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "orl",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "rld",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ld",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

This tokenizer also has options other than n. See the Parameters section below for details.

7.8.13.2. Syntax

TokenNgram has optional parameters.

No options:

TokenNgram

If we don’t specify any options, TokenNgram behaves the same as TokenBigram.

Specify option:

TokenNgram("n", 3)

TokenNgram("loose_symbol", true)

TokenNgram("loose_blank", true)

TokenNgram("remove_blank", true)

TokenNgram("report_source_location", true)

TokenNgram("unify_alphabet", true)

TokenNgram("unify_symbol", true)

TokenNgram("unify_digit", true)

Specify multiple options:

TokenNgram("loose_symbol", true, "loose_blank", true)

We can also specify combinations of options other than the example above.

7.8.13.3. Usage

7.8.13.4. Simple usage

Here is an example of TokenNgram.

If we use TokenNgram without any options, it behaves the same as TokenBigram, as below:

  • If no normalizer is used.

Execution example:

tokenize --tokenizer TokenNgram --string "Hello World"
# [
#   [
#     0,
#     1556063548.341422,
#     0.0002450942993164062
#   ],
#   [
#     {
#       "value": "He",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "el",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ll",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "lo",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o ",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " W",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "Wo",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "or",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "rl",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ld",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
  • If a normalizer is used.

Execution example:

tokenize --tokenizer TokenNgram --string "Hello World" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1556063752.798628,
#     0.0001776218414306641
#   ],
#   [
#     {
#       "value": "hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "world",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.8.13.5. Advanced usage

We can specify multiple options for TokenNgram.

For example, we can handle variant notations of a phone number by using loose_symbol and loose_blank as below.

With these options, the query 0123456789 finds any of 0123(45)6789, 0123-45-6789, and 0123 45 6789, as in the examples below.

Execution example:

tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123(45)6789" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1557187317.825164,
#     0.0003116130828857422
#   ],
#   [
#     {
#       "value": "0123",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "(",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "45",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": ")",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "6789",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "￰",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "0123456789",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123-45-6789" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1557187366.441741,
#     0.0002820491790771484
#   ],
#   [
#     {
#       "value": "0123",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "-",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "45",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "-",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "6789",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "￰",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "0123456789",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123 45 6789" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1557187404.034283,
#     0.0003006458282470703
#   ],
#   [
#     {
#       "value": "0123",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "45",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "6789",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "￰",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "0123456789",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.8.13.6. Parameters

7.8.13.6.1. Optional parameter

There are eight optional parameters.

7.8.13.6.1.1. n

This option specifies the N of the N-gram. For example, n is 3 for a trigram.
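
The token boundaries shown in the Summary examples can be sketched in Python. This is a minimal illustration of the N-gram idea only, not Groonga's actual implementation:

```python
def ngram_tokens(text, n):
    """Emit the length-n substring starting at every character
    position; tokens near the end of the text get shorter, as in
    the "ld", "d" tail of the bigram example in the Summary."""
    return [text[i:i + n] for i in range(len(text))]

print(ngram_tokens("Hello", 3))
# ['Hel', 'ell', 'llo', 'lo', 'o']
```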

7.8.13.6.1.2. loose_symbol

Tokenizes keywords that include symbols so that they can be found by queries both with and without symbols. For example, a keyword 090-1111-2222 will be found by any of 09011112222, 090, 1111, 2222, and 090-1111-2222.

7.8.13.6.1.3. loose_blank

Tokenizes keywords that include blanks so that they can be found by queries both with and without blanks. For example, a keyword 090 1111 2222 will be found by any of 09011112222, 090, 1111, 2222, and 090 1111 2222.
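
The token stream that the loose options produce (visible in the Advanced usage examples above) can be sketched as follows. This is an illustration of the indexing idea only, not Groonga's actual algorithm; U+FFF0 stands in for the separator token that appears as "￰" in the output above:

```python
def loose_blank_tokens(text):
    # Index the blank-separated groups first, then a separator
    # token, then the blank-stripped form, so queries both with
    # and without blanks can match the same keyword.
    groups = text.split()
    return groups + ["\ufff0"] + ["".join(groups)]

print(loose_blank_tokens("0123 45 6789"))
# ['0123', '45', '6789', '\ufff0', '0123456789']
```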

7.8.13.6.1.4. remove_blank

Tokenizes keywords that include blanks so that they can be found only by queries without blanks. For example, a keyword 090 1111 2222 will be found by any of 09011112222, 090, 1111, or 2222.

Note that the keyword won’t be found by a query that includes blanks, like 090 1111 2222.
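
The difference from loose_blank can be sketched as follows: remove_blank strips the blanks before tokenizing, so only the blank-free form is indexed. This is a simplified Python illustration only; the real tokenizer also applies digit grouping via unify_digit by default:

```python
def remove_blanks(text):
    # Strip all blanks before tokenizing; only the blank-free
    # form is indexed, so a query containing blanks cannot match.
    return "".join(text.split())

print(remove_blanks("090 1111 2222"))
# 09011112222
```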

7.8.13.6.1.5. report_source_location

This option reports the source location of each token; highlight_html uses it to locate the highlight target in the original text.

We only need this option when we highlight tokens with highlight_html.

Until now, Groonga always used NormalizerAuto as the normalizer when tokenizing the text to highlight. Therefore, if we use NormalizerNFKC100 as the normalizer, highlight_html sometimes can’t find the position of the highlight, as below.

Execution example:

table_create Entries TABLE_NO_KEY
#[[0,0.0,0.0],true]
column_create Entries body COLUMN_SCALAR ShortText
#[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText --normalizer 'NormalizerNFKC100'
#[[0,0.0,0.0],true]
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
#[[0,0.0,0.0],true]
load --table Entries
[
{"body": "ア㌕Az"}
]
#[[0,0.0,0.0],1]
select Entries   --match_columns body   --query 'グラム'   --output_columns 'highlight_html(body, Terms)'
#[
#  [
#    0,
#    1558915202.24493,
#    0.0003714561462402344
#  ],
#  [[[0],[["highlight_html",null]]]]
#]

This happens because a different normalizer was used to normalize the tokens.

This option reduces this shift of the highlight position, as below.

Execution example:

table_create Entries TABLE_NO_KEY
#[[0,0.0,0.0],true]
column_create Entries body COLUMN_SCALAR ShortText
#[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText   --default_tokenizer 'TokenNgram("report_source_location", true)'   --normalizer 'NormalizerNFKC100'
#[[0,0.0,0.0],true]
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
#[[0,0.0,0.0],true]
load --table Entries
[
{"body": "ア㌕Az"}
]
#[[0,0.0,0.0],1]
select Entries   --match_columns body   --query 'グラム'   --output_columns 'highlight_html(body, Terms)'
#[
#  [
#    0,
#    0.0,
#    0.0
#  ],
#  [
#    [
#      [
#        1
#      ],
#      [
#        [
#          "highlight_html",
#          null
#        ]
#      ],
#      [
#        "ア<span class=\"keyword\">㌕</span>Az"
#      ]
#    ]
#  ]
#]

7.8.13.6.1.6. unify_alphabet

If we set this option to false, TokenNgram uses the bigram tokenize method for ASCII characters instead of grouping consecutive alphabetic characters into one token.

The default value of this option is true.

Execution example:

tokenize --tokenizer 'TokenNgram("unify_alphabet", false)' --string "abcd ABCD"
#[
#  [
#    0,
#    1558570398.145967,
#    0.0009109973907470703
#  ],
#  [
#    {
#      "value": "ab",
#      "position": 0,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "bc",
#      "position": 1,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "cd",
#      "position": 2,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "d ",
#      "position": 3,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": " A",
#      "position": 4,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "AB",
#      "position": 5,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "BC",
#      "position": 6,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "CD",
#      "position": 7,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "D",
#      "position": 8,
#      "force_prefix": false,
#      "force_prefix_search": false
#    }
#  ]
#]

7.8.13.6.1.7. unify_symbol

If we set this option to false, TokenNgram uses the bigram tokenize method for symbols. TokenNgram("unify_symbol", false) behaves the same as TokenBigramSplitSymbol.

The default value of this option is true.

Execution example:

tokenize --tokenizer 'TokenNgram("unify_symbol", false)' --string "___---" --normalizer NormalizerAuto
#[
#  [
#    0,
#    1558913369.875591,
#    0.0008268356323242188
#  ],
#  [
#    {
#      "value": "__",
#      "position": 0,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "__",
#      "position": 1,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "_-",
#      "position": 2,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "--",
#      "position": 3,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "--",
#      "position": 4,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "-",
#      "position": 5,
#      "force_prefix": false,
#      "force_prefix_search": false
#    }
#  ]
#]

7.8.13.6.1.8. unify_digit

If we set this option to false, TokenNgram uses the bigram tokenize method for digits instead of grouping consecutive digits into one token.

The default value of this option is true.

Execution example:

tokenize --tokenizer 'TokenNgram("unify_digit", false)' --string "012345 6789" --normalizer NormalizerAuto
#[
#  [
#    0,
#    1558914023.967506,
#    0.001635313034057617
#  ],
#  [
#    {
#      "value": "01",
#      "position": 0,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "12",
#      "position": 1,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "23",
#      "position": 2,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "34",
#      "position": 3,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "45",
#      "position": 4,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "5",
#      "position": 5,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "67",
#      "position": 6,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "78",
#      "position": 7,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "89",
#      "position": 8,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "9",
#      "position": 9,
#      "force_prefix": false,
#      "force_prefix_search": false
#    }
#  ]
#]

7.8.13.7. See also