7.8.13. TokenNgram

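7.8.13.1. Summary

TokenNgram can define its behavior dynamically through options. For example, by changing the value of the n option as shown below, it can be used as a unigram, bigram, or trigram tokenizer.

Unigram:

Execution example: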

tokenize --tokenizer 'TokenNgram("n", 1)' --string "Hello World"
# [
#   [
#     0,
#     1556063953.911713,
#     0.05761265754699707
#   ],
#   [
#     {
#       "value": "H",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "e",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "l",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "l",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " ",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "W",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "r",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "l",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Bigram:

Execution example:

tokenize --tokenizer 'TokenNgram("n", 2)' --string "Hello World"
# [
#   [
#     0,
#     1556064877.095833,
#     0.0002193450927734375
#   ],
#   [
#     {
#       "value": "He",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "el",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ll",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "lo",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o ",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " W",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "Wo",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "or",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "rl",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ld",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Trigram:

Execution example:

tokenize --tokenizer 'TokenNgram("n", 3)' --string "Hello World"
# [
#   [
#     0,
#     1556064951.790899,
#     0.0002274513244628906
#   ],
#   [
#     {
#       "value": "Hel",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ell",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "llo",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "lo ",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o W",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " Wo",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "Wor",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "orl",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "rld",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ld",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
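
The three outputs above follow one sliding-window rule. The Python sketch below is only an illustration of that rule, not Groonga's implementation; the function name ngram_tokens is hypothetical. It assumes no normalizer is used, so every character (including the blank) takes part in a token.

```python
def ngram_tokens(text: str, n: int) -> list[str]:
    """Slide a window of width n over text, one character at a time.

    The window is allowed to run past the end of the text, so the last
    n - 1 tokens are shorter than n. This mirrors the trailing "ld" and
    "d" tokens in the trigram output.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # The list index of each token corresponds to its "position" value.
    return [text[i:i + n] for i in range(len(text))]
```

Calling ngram_tokens("Hello World", 2) yields the same eleven values as the bigram example, from "He" through the final single-character "d".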

This tokenizer also has options other than the one shown above.

7.8.13.2. Syntax

TokenNgram has optional arguments.

No options:

TokenNgram

Without options, TokenNgram behaves the same as TokenBigram.

Specifying an option:

TokenNgram("n", 3)

TokenNgram("loose_symbol", true)

TokenNgram("loose_blank", true)

TokenNgram("remove_blank", true)

TokenNgram("report_source_location", true)

TokenNgram("unify_alphabet", true)

TokenNgram("unify_symbol", true)

TokenNgram("unify_digit", true)

Specifying multiple options:

TokenNgram("loose_symbol", true, "loose_blank", true)

Options can also be combined in ways other than the example above.
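
When the tokenizer string is assembled by a client program, the alternating name and value arguments can be generated from a mapping. The helper below is a hypothetical convenience written in Python, not part of Groonga or of any Groonga client library; it only formats the string forms shown in this section.

```python
def tokenizer_spec(name: str, options: dict) -> str:
    """Render options as alternating "name", value arguments."""
    if not options:
        return name  # no options: the bare tokenizer name
    args = []
    for key, value in options.items():
        args.append(f'"{key}"')
        # bool must be checked before anything else: str(True) is "True",
        # while the syntax above uses lowercase true/false.
        if isinstance(value, bool):
            args.append("true" if value else "false")
        else:
            args.append(str(value))
    return f"{name}({', '.join(args)})"
```

For example, tokenizer_spec("TokenNgram", {"loose_symbol": True, "loose_blank": True}) produces the multiple-option form shown above.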

7.8.13.3. Usage

7.8.13.4. Simple usage

The following are examples of TokenNgram.

When used without options, TokenNgram behaves the same as TokenBigram, as shown below:

  • Without a normalizer.

Execution example:

tokenize --tokenizer TokenNgram --string "Hello World"
# [
#   [
#     0,
#     1556063548.341422,
#     0.0002450942993164062
#   ],
#   [
#     {
#       "value": "He",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "el",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ll",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "lo",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "o ",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": " W",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "Wo",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "or",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "rl",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ld",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "d",
#       "position": 10,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
  • With a normalizer.

Execution example:

tokenize --tokenizer TokenNgram --string "Hello World" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1556063752.798628,
#     0.0001776218414306641
#   ],
#   [
#     {
#       "value": "hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "world",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

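7.8.13.5. Advanced usage

TokenNgram accepts multiple options.

For example, loose_symbol and loose_blank can be combined to handle variations in how phone numbers are written.

As the following example shows, a search for 0123456789 can find 0123(45)6789, 0123-45-6789, and 0123 45 6789.

Execution example: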

tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123(45)6789" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1557187317.825164,
#     0.0003116130828857422
#   ],
#   [
#     {
#       "value": "0123",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "(",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "45",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": ")",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "6789",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "￰",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "0123456789",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123-45-6789" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1557187366.441741,
#     0.0002820491790771484
#   ],
#   [
#     {
#       "value": "0123",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "-",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "45",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "-",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "6789",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "￰",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "0123456789",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
tokenize --tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)' --string "0123 45 6789" --normalizer NormalizerAuto
# [
#   [
#     0,
#     1557187404.034283,
#     0.0003006458282470703
#   ],
#   [
#     {
#       "value": "0123",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "45",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "6789",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "￰",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "0123456789",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
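
All three outputs end with the same pair of tokens: a marker token followed by 0123456789, which is why one query matches every notation. The Python sketch below reproduces that shape for digit-only input such as phone numbers. It is a simplified model rather than Groonga's implementation: the names char_class and loose_tokens are hypothetical, the marker is assumed to be U+FFF0, and the sketch always appends the loose part even though the examples above do not show what happens when nothing is removed.

```python
from itertools import groupby

LOOSE_MARKER = "\uFFF0"  # assumed code point of the marker token shown above


def char_class(ch: str) -> str:
    """Classify a character roughly the way the outputs above group them."""
    if ch.isspace():
        return "blank"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "symbol"


def loose_tokens(text: str) -> list[str]:
    """Strict tokens, then the marker, then the text without symbols/blanks."""
    lowered = text.lower()
    # Strict part: one token per run of same-class characters, blanks dropped.
    strict = [
        "".join(run)
        for cls, run in groupby(lowered, key=char_class)
        if cls != "blank"
    ]
    # Loose part: the same text with every symbol and blank removed.
    loose = "".join(
        ch for ch in lowered if char_class(ch) not in ("blank", "symbol")
    )
    return strict + [LOOSE_MARKER, loose]
```

With this model, loose_tokens("0123(45)6789"), loose_tokens("0123-45-6789"), and loose_tokens("0123 45 6789") differ in their strict part but share the final token 0123456789.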

7.8.13.6. Parameters

7.8.13.6.1. Optional parameters

There are eight optional parameters.

7.8.13.6.1.1. n

This option specifies the N of N-gram. For example, n is 3 for trigram.

7.8.13.6.1.2. loose_symbol

Tokenizes keywords that contain symbols so that they can be found by queries both with and without the symbols. For example, the keyword 090-1111-2222 can be found by 09011112222, 090, 1111, 2222, and 090-1111-2222.

7.8.13.6.1.3. loose_blank

Tokenizes keywords that contain blanks so that they can be found by queries both with and without the blanks. For example, the keyword 090 1111 2222 can be found by 09011112222, 090, 1111, 2222, and 090 1111 2222.

7.8.13.6.1.4. remove_blank

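Tokenizes keywords that contain blanks so that they can be found by queries without the blanks. For example, the keyword 090 1111 2222 can also be found by 09011112222, 090, 1111, or 2222.

Note that the keyword cannot be found by a query that contains blanks, such as 090 1111 2222.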

7.8.13.6.1.5. report_source_location

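This option reports the location of each token in the source text so that highlight_html can find the tokens to highlight.

Use this option only when you want to highlight tokens with highlight_html.

Until now, Groonga always used NormalizerAuto when tokenizing the text to be highlighted. As a result, when the index is normalized with NormalizerNFKC100, the highlight position sometimes cannot be found.

Execution example: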

table_create Entries TABLE_NO_KEY
#[[0,0.0,0.0],true]
column_create Entries body COLUMN_SCALAR ShortText
#[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText --normalizer 'NormalizerNFKC100'
#[[0,0.0,0.0],true]
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
#[[0,0.0,0.0],true]
load --table Entries
[
{"body": "ア㌕Az"}
]
#[[0,0.0,0.0],1]
select Entries   --match_columns body   --query 'グラム'   --output_columns 'highlight_html(body, Terms)'
#[
#  [
#    0,
#    1558915202.24493,
#    0.0003714561462402344
#  ],
#  [[[0],[["highlight_html",null]]]]
#]

This happens because the tokens are normalized with different normalizers.

This option reduces the misalignment of highlight positions.

Execution example:

table_create Entries TABLE_NO_KEY
#[[0,0.0,0.0],true]
column_create Entries body COLUMN_SCALAR ShortText
#[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText   --default_tokenizer 'TokenNgram("report_source_location", true)'   --normalizer 'NormalizerNFKC100'
#[[0,0.0,0.0],true]
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
#[[0,0.0,0.0],true]
load --table Entries
[
{"body": "ア㌕Az"}
]
#[[0,0.0,0.0],1]
select Entries   --match_columns body   --query 'グラム'   --output_columns 'highlight_html(body, Terms)'
#[
#  [
#    0,
#    0.0,
#    0.0
#  ],
#  [
#    [
#      [
#        1
#      ],
#      [
#        [
#          "highlight_html",
#          null
#        ]
#      ],
#      [
#        "ア<span class=\"keyword\">㌕</span>Az"
#      ]
#    ]
#  ]
#]
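
The second result highlights the single character ㌕ even though the query is グラム, because normalization expands ㌕ into several characters. The Python sketch below illustrates the idea of carrying source offsets through normalization. It is not Groonga's implementation: it uses Python's plain NFKC from unicodedata instead of NormalizerNFKC100, normalizes character by character (which differs from whole-string normalization for combining sequences), does not escape HTML, and the function names are hypothetical.

```python
import unicodedata


def normalize_with_offsets(source: str) -> tuple[str, list[int]]:
    """NFKC-normalize each character and remember its source index."""
    chars, offsets = [], []
    for index, ch in enumerate(source):
        for out in unicodedata.normalize("NFKC", ch):
            chars.append(out)
            offsets.append(index)  # every expanded character maps back here
    return "".join(chars), offsets


def highlight(source: str, keyword: str) -> str:
    """Wrap the source characters that produced the first keyword match."""
    normalized, offsets = normalize_with_offsets(source)
    needle = unicodedata.normalize("NFKC", keyword)
    hit = normalized.find(needle)
    if not needle or hit < 0:
        return source
    start = offsets[hit]                      # first source character touched
    end = offsets[hit + len(needle) - 1] + 1  # one past the last one
    return (
        f'{source[:start]}<span class="keyword">'
        f'{source[start:end]}</span>{source[end:]}'
    )
```

Here the match is found in the normalized text, but the span is cut from the source text, so a partial match inside an expanded character highlights that whole source character.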

7.8.13.6.1.6. unify_alphabet

When set to false, TokenNgram tokenizes ASCII characters with the bigram method.

The default value is true.

Execution example:

tokenize --tokenizer 'TokenNgram("unify_alphabet", false)' --string "abcd ABCD" --normalizer NormalizerAuto
#[
#  [
#    0,
#    1558570398.145967,
#    0.0009109973907470703
#  ],
#  [
#    {
#      "value": "ab",
#      "position": 0,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "bc",
#      "position": 1,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "cd",
#      "position": 2,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "d ",
#      "position": 3,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": " A",
#      "position": 4,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "AB",
#      "position": 5,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "BC",
#      "position": 6,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "CD",
#      "position": 7,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "D",
#      "position": 8,
#      "force_prefix": false,
#      "force_prefix_search": false
#    }
#  ]
#]

7.8.13.6.1.7. unify_symbol

When set to false, TokenNgram tokenizes symbols with the bigram method. TokenNgram("unify_symbol", false) behaves the same as TokenBigramSplitSymbol.

The default value is true.

Execution example:

tokenize --tokenizer 'TokenNgram("unify_symbol", false)' --string "___---" --normalizer NormalizerAuto
#[
#  [
#    0,
#    1558913369.875591,
#    0.0008268356323242188
#  ],
#  [
#    {
#      "value": "__",
#      "position": 0,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "__",
#      "position": 1,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "_-",
#      "position": 2,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "--",
#      "position": 3,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "--",
#      "position": 4,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "-",
#      "position": 5,
#      "force_prefix": false,
#      "force_prefix_search": false
#    }
#  ]
#]

7.8.13.6.1.8. unify_digit

When set to false, TokenNgram tokenizes digits with the bigram method.

The default value is true.

Execution example:

tokenize --tokenizer 'TokenNgram("unify_digit", false)' --string "012345 6789" --normalizer NormalizerAuto
#[
#  [
#    0,
#    1558914023.967506,
#    0.001635313034057617
#  ],
#  [
#    {
#      "value": "01",
#      "position": 0,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "12",
#      "position": 1,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "23",
#      "position": 2,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "34",
#      "position": 3,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "45",
#      "position": 4,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "5",
#      "position": 5,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "67",
#      "position": 6,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "78",
#      "position": 7,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "89",
#      "position": 8,
#      "force_prefix": false,
#      "force_prefix_search": false
#    },
#    {
#      "value": "9",
#      "position": 9,
#      "force_prefix": false,
#      "force_prefix_search": false
#    }
#  ]
#]
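
The unify_symbol and unify_digit outputs above suggest a simple model: with a normalizer, the text is split into runs of same-class characters, and each run is either kept whole or cut into bigrams depending on the matching unify_* flag. The Python sketch below implements that model as an illustration only. It is not Groonga's implementation; the names are hypothetical, lowercasing stands in for NormalizerAuto, and treating only ASCII letters as "alphabet" is an assumption.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Classify a character; non-ASCII text is never unified in this model."""
    if ch.isspace():
        return "blank"
    if not ch.isascii():
        return "other"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "symbol"


def bigrams(run: str) -> list[str]:
    # The last token of a run is a single character, as in the outputs above.
    return [run[i:i + 2] for i in range(len(run))]


def tokenize(text, unify_alphabet=True, unify_symbol=True, unify_digit=True):
    unify = {
        "alpha": unify_alphabet,
        "symbol": unify_symbol,
        "digit": unify_digit,
        "other": False,
    }
    tokens = []
    for cls, chars in groupby(text.lower(), key=char_class):
        if cls == "blank":
            continue  # blanks separate runs and produce no token
        run = "".join(chars)
        tokens.extend([run] if unify[cls] else bigrams(run))
    return tokens
```

Under this model, tokenize("012345 6789", unify_digit=False) and tokenize("___---", unify_symbol=False) give the same values as the two execution examples above, and tokenize("Hello World") gives hello and world as in the simple usage example.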

7.8.13.7. See also