7.8.1. Summary¶
Groonga has tokenizer module that tokenizes text. It is used when the following cases:
Indexing text
Searching by query
Tokenizer is an important module for full-text search. You can change trade-off between precision and recall by changing tokenizer.
Normally, TokenBigram is a suitable tokenizer. If you don’t know much about tokenizer, it’s recommended that you choose TokenBigram.
You can try a tokenizer by tokenize and table_tokenize. Here is an example to try TokenBigram tokenizer by tokenize:
Execution example:
tokenize TokenBigram "Hello World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "He"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "el"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "ll"
# },
# {
# "position": 3,
# "force_prefix": false,
# "value": "lo"
# },
# {
# "position": 4,
# "force_prefix": false,
# "value": "o "
# },
# {
# "position": 5,
# "force_prefix": false,
# "value": " W"
# },
# {
# "position": 6,
# "force_prefix": false,
# "value": "Wo"
# },
# {
# "position": 7,
# "force_prefix": false,
# "value": "or"
# },
# {
# "position": 8,
# "force_prefix": false,
# "value": "rl"
# },
# {
# "position": 9,
# "force_prefix": false,
# "value": "ld"
# },
# {
# "position": 10,
# "force_prefix": false,
# "value": "d"
# }
# ]
# ]
“tokenize” is the process that extracts zero or more tokens from a text. There are some “tokenize” methods.
For example, Hello World
is tokenized to the following tokens by
bigram tokenize method:
He
el
ll
lo
o_
(_
means a white-space)
_W
(_
means a white-space)
Wo
or
rl
ld
In the above example, 10 tokens are extracted from one text Hello
World
.
For example, Hello World
is tokenized to the following tokens by
white-space-separate tokenize method:
Hello
World
In the above example, 2 tokens are extracted from one text Hello
World
.
Token is used as search key. You can find indexed documents only by
tokens that are extracted by used tokenize method. For example, you
can find Hello World
by ll
with bigram tokenize method but you
can’t find Hello World
by ll
with white-space-separate tokenize
method. Because white-space-separate tokenize method doesn’t extract
ll
token. It just extracts Hello
and World
tokens.
In general, tokenize method that generates small tokens increases recall but decreases precision. Tokenize method that generates large tokens increases precision but decreases recall.
For example, we can find Hello World
and A or B
by or
with
bigram tokenize method. Hello World
is a noise for people who
wants to search “logical and”. It means that precision is
decreased. But recall is increased.
We can find only A or B
by or
with white-space-separate
tokenize method. Because World
is tokenized to one token World
with white-space-separate tokenize method. It means that precision is
increased for people who wants to search “logical and”. But recall is
decreased because Hello World
that contains or
isn’t found.