7.8.10. TokenDelimit

7.8.10.1. Summary

TokenDelimit extracts tokens by splitting text on one or more space characters (U+0020). For example, Hello World is tokenized to Hello and World.

TokenDelimit is suitable for tag text. For example, you can extract groonga, full-text-search, and http as tags from groonga full-text-search http.
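For example, a schema like the following sketch could index a tags column with TokenDelimit. This is only a minimal illustration; the Entries, Tags, and entries_tags names are hypothetical:

# Hypothetical table holding one record per entry, with space-separated tags.
table_create Entries TABLE_NO_KEY
column_create Entries tags COLUMN_SCALAR ShortText
# Index table: each tag extracted by TokenDelimit becomes one term.
table_create Tags TABLE_PAT_KEY ShortText --default_tokenizer TokenDelimit --normalizer NormalizerAuto
column_create Tags entries_tags COLUMN_INDEX|WITH_POSITION Entries tags

With such a schema, a query like select Entries --match_columns tags --query full-text-search should match entries tagged full-text-search.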

7.8.10.2. Syntax

TokenDelimit has optional parameters.

No options (extracts tokens by splitting one or more space characters (U+0020)):

TokenDelimit

Specify delimiter:

TokenDelimit("delimiter",  "delimiter1", delimiter", "delimiter2", ...)

Specify delimiter with regular expression:

TokenDelimit("pattern", pattern)

The delimiter option and the pattern option cannot be used at the same time.

7.8.10.3. Usage

7.8.10.4. Simple usage

Here is an example of TokenDelimit:

Execution example:

tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "groonga"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "full-text-search"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "http"
#     }
#   ]
# ]

Note that the tokens are in lower case because NormalizerAuto normalizes the input before tokenizing.

TokenDelimit also accepts options: the delimiter option and the pattern option.

The delimiter option splits tokens on a specified character or string.

For example, Hello,World is tokenized to Hello and World with the delimiter option as below.

Execution example:

tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "Hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "World",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

The pattern option splits tokens with a regular expression. You can exclude needless spaces with the pattern option.

For example, This is a pen. This is an apple. is tokenized to This is a pen. and This is an apple. with the pattern option as below.

Normally, when This is a pen. This is an apple. is split by ., a needless space is included at the beginning of “This is an apple.”.

You can exclude the needless spaces with the pattern option as in the example below.

Execution example:

tokenize 'TokenDelimit("pattern", "\\.\\s*")' "This is a pen. This is an apple."
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "This is a pen.",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "This is an apple.",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.8.10.5. Advanced usage

The delimiter option can also specify multiple delimiters.

For example, Hello, World is tokenized to Hello and World. "," and " " (a space) are the delimiters in the example below.

Execution example:

tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "Hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "World",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

You can extract tokens under complex conditions with the pattern option.

For example, これはペンですか!?リンゴですか?「リンゴです。」 is tokenized to これはペンですか, リンゴですか, and 「リンゴです。」 with the pattern option as below.

Execution example:

tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
# [
#   [
#     0,
#     1545179416.22277,
#     0.0002887248992919922
#   ],
#   [
#     {
#       "value": "これはペンですか",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "リンゴですか",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "「リンゴです。」",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

\\s* at the end of the above regular expression matches 0 or more spaces after a delimiter.

[。!?]+ matches 1 or more 。, !, or ?. For example, [。!?]+ matches !? of これはペンですか!?.

(?![)」]) is a negative lookahead. (?![)」]) matches only if the next character is not ) or 」. A negative lookahead is interpreted in combination with the regular expression just before it.

Therefore, the expression is interpreted as a whole: [。!?]+(?![)」]).

[。!?]+(?![)」]) matches 。, !, or ? only when it is not followed by ) or 」.

In other words, [。!?]+(?![)」]) matches 。 of これはペンですか。. But [。!?]+(?![)」]) doesn’t match 。 of 「リンゴです。」, because 。 is followed by 」.

[\\r\\n]+ matches 1 or more newline characters.

In conclusion, ([。!?]+(?![)」])|[\\r\\n]+)\\s* uses 。, !, ?, and newline characters as delimiters. However, 。, !, and ? are not delimiters when followed by ) or 」.
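Such a pattern could also be passed to the tokenizer when defining an index, for example to index text sentence by sentence. Here is a minimal, hypothetical sketch; the Documents and Sentences names are made up for illustration:

# Hypothetical source table holding the text to index.
table_create Documents TABLE_NO_KEY
column_create Documents content COLUMN_SCALAR Text
# Index table: each sentence extracted by the pattern above becomes one term.
table_create Sentences TABLE_PAT_KEY ShortText --default_tokenizer 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")'
column_create Sentences documents_content COLUMN_INDEX|WITH_POSITION Documents content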

7.8.10.6. Parameters

7.8.10.6.1. Optional parameter

There are two optional parameters: delimiter and pattern.

7.8.10.6.1.1. delimiter

Splits tokens on a specified delimiter.

You can use one or more characters for a delimiter.
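For example, a multi-character delimiter such as <> could be specified as in the following untested sketch, which follows the same form as the examples above:

tokenize 'TokenDelimit("delimiter", "<>")' "Hello<>World"

This should tokenize Hello<>World to Hello and World.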

7.8.10.6.1.2. pattern

Splits tokens with a regular expression.

7.8.10.7. See also