7.8.10. TokenDelimit
7.8.10.1. Summary
TokenDelimit extracts tokens by splitting text on one or more space characters (U+0020). For example, Hello World is tokenized to Hello and World.
TokenDelimit is suitable for tag text. For example, you can extract groonga, full-text-search and http as tags from groonga full-text-search http.
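The default space-splitting behavior can be sketched in Python. This is only an illustration of the behavior described above, not Groonga's implementation:

```python
import re

# Sketch of TokenDelimit's default behavior (illustration, not
# Groonga's implementation): split on runs of one or more U+0020
# space characters and drop empty tokens.
def token_delimit(text):
    return [t for t in re.split(" +", text) if t]

print(token_delimit("groonga full-text-search http"))
# ['groonga', 'full-text-search', 'http']
```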
7.8.10.2. Syntax
TokenDelimit has optional parameters.

No options (extracts tokens by splitting on one or more space characters (U+0020)):

TokenDelimit

Specify delimiters:

TokenDelimit("delimiter", "delimiter1", "delimiter", "delimiter2", ...)

Specify a delimiter with a regular expression:

TokenDelimit("pattern", "pattern")

The delimiter option and the pattern option cannot be used at the same time.
7.8.10.3. Usage
7.8.10.4. Simple usage
Here is an example of TokenDelimit:
Execution example:
tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "groonga"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "full-text-search"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "http"
# }
# ]
# ]
TokenDelimit can also accept options: the delimiter option and the pattern option.

The delimiter option splits tokens on a specified character. For example, Hello,World is tokenized to Hello and World with the delimiter option as below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
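The effect of a single delimiter can be approximated in Python. This is an illustration of the behavior, not Groonga's implementation:

```python
# Illustration of the "delimiter" option with a single delimiter
# (not Groonga's implementation): split on the given character and
# drop empty tokens.
def split_by_delimiter(text, delimiter):
    return [t for t in text.split(delimiter) if t]

print(split_by_delimiter("Hello,World", ","))
# ['Hello', 'World']
```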
The pattern option splits tokens with a regular expression, which lets you exclude needless characters such as trailing spaces. For example, This is a pen. This is an apple. is tokenized to This is a pen. and This is an apple. with the pattern option as below.

Normally, when This is a pen. This is an apple. is split by ., a needless space remains at the beginning of This is an apple.. You can exclude such needless spaces with the pattern option as in the below example.
Execution example:
tokenize 'TokenDelimit("pattern", "\\.\\s*")' "This is a pen. This is an apple."
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "This is a pen.",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "This is an apple.",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
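This splitting can be approximated in Python's re module. Note one detail: in the Groonga output above the sentence-ending period stays attached to each token, so this sketch (an illustration, not Groonga's implementation) splits on the whitespace that follows a period rather than on the literal \\.\\s* pattern:

```python
import re

# Illustration only: Groonga keeps the matched "." at the end of each
# token above, so this sketch splits on whitespace preceded by a
# period (lookbehind) instead of consuming the period itself.
def split_sentences(text):
    return [t for t in re.split(r"(?<=\.)\s+", text) if t]

print(split_sentences("This is a pen. This is an apple."))
# ['This is a pen.', 'This is an apple.']
```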
7.8.10.5. Advanced usage
The delimiter option can also accept multiple delimiters. For example, Hello, World is tokenized to Hello and World in the below example, where , and the space character are both delimiters.
Execution example:
tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
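Splitting on multiple delimiters can likewise be sketched in Python (illustration only, not Groonga's implementation):

```python
import re

# Illustration of multiple "delimiter" options (not Groonga's
# implementation): any of the given delimiters splits the text,
# and empty tokens between adjacent delimiters are dropped.
def split_by_delimiters(text, delimiters):
    pattern = "|".join(re.escape(d) for d in delimiters)
    return [t for t in re.split(pattern, text) if t]

print(split_by_delimiters("Hello, World", [",", " "]))
# ['Hello', 'World']
```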
You can extract tokens under complex conditions with the pattern option. For example, これはペンですか!?リンゴですか?「リンゴです。」 is tokenized to これはペンですか, リンゴですか and 「リンゴです。」 with the pattern option as below.
Execution example:
tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
# [
# [
# 0,
# 1545179416.22277,
# 0.0002887248992919922
# ],
# [
# {
# "value": "これはペンですか",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "リンゴですか",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "「リンゴです。」",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
\\s* at the end of the above regular expression matches zero or more spaces after a delimiter.

[。!?]+ matches one or more of 。, ! or ?. For example, [。!?]+ matches the !? of これはペンですか!?.

(?![)」]) is a negative lookahead. It succeeds only when the next character is neither ) nor 」. A negative lookahead applies to the expression just before it, so it is interpreted together with [。!?]+, as [。!?]+(?![)」]). [。!?]+(?![)」]) matches one or more of 。, ! or ? only when they are not followed by ) or 」. In other words, [。!?]+(?![)」]) matches the 。 of これはペンですか。, but it doesn't match the 。 of 「リンゴです。」, because 」 comes after the 。.

[\\r\\n]+ matches one or more newline characters.

In conclusion, ([。!?]+(?![)」])|[\\r\\n]+)\\s* uses 。, !, ? and newline characters as delimiters. However, 。, ! and ? are not treated as delimiters when ) or 」 follows them.
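The same pattern can be tried with Python's re module to see the lookahead in action. This is an illustration; Groonga's regular-expression engine may differ from Python's re in details, and the pattern is rewritten without the capturing group (re.split would otherwise include the captured delimiters in its result) and without the command-line escaping:

```python
import re

# The document's pattern ([。!?]+(?![)」])|[\r\n]+)\s*, restructured
# without the capturing group so re.split discards the delimiters.
pattern = r"[。!?]+(?![)」])\s*|[\r\n]+\s*"

text = "これはペンですか!?リンゴですか?「リンゴです。」"
tokens = [t for t in re.split(pattern, text) if t]
print(tokens)
# ['これはペンですか', 'リンゴですか', '「リンゴです。」']
```

The 。 inside 「リンゴです。」 is not split because the negative lookahead sees the following 」.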
7.8.10.6. Parameters
7.8.10.6.1. Optional parameters
There are two optional parameters: delimiter and pattern.
7.8.10.6.1.1. delimiter
Splits tokens on the specified delimiters. You can specify multiple delimiters, and each delimiter can consist of one or more characters.
7.8.10.6.1.2. pattern
Splits tokens with a regular expression.