7.8.15. TokenRegexp

7.8.15.1. Summary

New in version 5.0.1.

Caution

This tokenizer is experimental. Specification may be changed.

Caution

This tokenizer can be used only with UTF-8. You can’t use this tokenizer with EUC-JP, Shift_JIS and so on.

TokenRegexp is a tokenizer for supporting regular expression search by index.

7.8.15.2. Syntax

TokenRegexp hasn’t parameter:

TokenRegexp

7.8.15.3. Usage

In general, regular expression search is evaluated as sequential search. But the following cases can be evaluated as index search:

  • Literal only case such as hello

  • The beginning of text and literal case such as \A/home/alice

  • The end of text and literal case such as \.txt\z

In most cases, index search is faster than sequential search.

TokenRegexp is based on bigram tokenize method. TokenRegexp adds the beginning of text mark (U+FFEF) at the begging of text and the end of text mark (U+FFF0) to the end of text when you index text:

Execution example:

tokenize TokenRegexp "/home/alice/test.txt" NormalizerAuto --mode ADD
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "￯"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "/h"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ho"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "om"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "me"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "e/"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "/a"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "al"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "li"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "ic"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "ce"
#     },
#     {
#       "position": 11,
#       "force_prefix": false,
#       "value": "e/"
#     },
#     {
#       "position": 12,
#       "force_prefix": false,
#       "value": "/t"
#     },
#     {
#       "position": 13,
#       "force_prefix": false,
#       "value": "te"
#     },
#     {
#       "position": 14,
#       "force_prefix": false,
#       "value": "es"
#     },
#     {
#       "position": 15,
#       "force_prefix": false,
#       "value": "st"
#     },
#     {
#       "position": 16,
#       "force_prefix": false,
#       "value": "t."
#     },
#     {
#       "position": 17,
#       "force_prefix": false,
#       "value": ".t"
#     },
#     {
#       "position": 18,
#       "force_prefix": false,
#       "value": "tx"
#     },
#     {
#       "position": 19,
#       "force_prefix": false,
#       "value": "xt"
#     },
#     {
#       "position": 20,
#       "force_prefix": false,
#       "value": "t"
#     },
#     {
#       "position": 21,
#       "force_prefix": false,
#       "value": "￰"
#     }
#   ]
# ]