7.15.21. snippet

7.15.21.1. Summary

This function extracts snippets of target text around search keywords (KWIC: KeyWord In Context).

If you want to use this function in a normal Web application, snippet_html may be more suitable. It’s an HTML-specific version of this function.
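
The KWIC idea can be sketched in a few lines of Python. This is only a conceptual illustration, not how snippet is implemented: the kwic helper and its context parameter are hypothetical names, and the window is counted in characters here for simplicity.

```python
def kwic(text, keyword, open_tag, close_tag, context=20):
    """Return the first occurrence of keyword with surrounding context.

    Hypothetical helper for illustration only. The keyword is wrapped in
    open_tag/close_tag, and up to `context` characters are kept on each side.
    Returns None when the keyword does not occur in the text.
    """
    start = text.find(keyword)
    if start < 0:
        return None
    end = start + len(keyword)
    left = max(0, start - context)
    right = min(len(text), end + context)
    return text[left:start] + open_tag + text[start:end] + close_tag + text[end:right]


text = "Groonga is a fast and accurate full text search engine."
print(kwic(text, "fast", "[", "]", context=10))
print(kwic(text, "slow", "[", "]", context=10))
```

The first call prints the keyword with ten characters of context on each side, and the second prints None because the keyword isn't found.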

7.15.21.2. Syntax

snippet requires at least one parameter that is the snippet target text:

snippet(column, ...)

You can specify one or more tuples of keyword, open tag and close tag:

snippet(column,
        "keyword1", "open-tag1", "close-tag1",
        "keyword2", "open-tag2", "close-tag2",
        ...)

If you specify a default open tag and a default close tag, you can specify keywords alone:

snippet(column,
        "keyword1",
        "keyword2",
        ...,
        {
          "default_open_tag": "open-tag",
          "default_close_tag": "close-tag"
        })

New in version 11.0.9: If you specify a default open tag and a default close tag and omit keywords, keywords are extracted automatically from the current condition, as snippet_html does:

snippet(column,
        {
          "default_open_tag": "open-tag",
          "default_close_tag": "close-tag"
        })

You can specify options as the last argument with all syntaxes:

snippet(column,
        ...,
        {
          "width": 200,
          "max_n_results": 3,
          "skip_leading_spaces": true,
          "html_escape": false,
          "prefix": null,
          "suffix": null,
          "normalizer": null,
          "default_open_tag": null,
          "default_close_tag": null,
          "default": null,
          "delimiter_regexp": null,
        })

7.15.21.3. Usage

Here are a schema definition and sample data to show usage.

Execution example:

table_create Documents TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Documents content COLUMN_SCALAR Text
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram  --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms documents_content_index COLUMN_INDEX|WITH_POSITION Documents content
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Documents
[
["content"],
["Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, groonga allows updates without read locks. These characteristics result in superior performance on real-time applications."],
["Groonga is also a column-oriented database management system (DBMS). Compared with well-known row-oriented systems, such as MySQL and PostgreSQL, column-oriented systems are more suited for aggregate queries. Due to this advantage, groonga can cover weakness of row-oriented systems."]
]
# [[0, 1337566253.89858, 0.000355720520019531], 2]

snippet extracts keywords automatically from the conditions specified in --query and/or --filter when you specify the default_open_tag and default_close_tag options and don’t specify keywords. This is similar to snippet_html.

The following example uses --query "fast performance". In this case, fast and performance are used as keywords.

Execution example:

select Documents \
  --output_columns 'snippet(content, \
                            { \
                               "default_open_tag": "[", \
                               "default_close_tag": "]" \
                            })' \
  --match_columns content \
  --query "fast performance"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         1
#       ],
#       [
#         [
#           "snippet",
#           null
#         ]
#       ],
#       [
#         [
#           "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro",
#           "onga allows updates without read locks. These characteristics result in superior [performance] on real-time applications."
#         ]
#       ]
#     ]
#   ]
# ]

--query "fast performance" matches only the first record’s content. This snippet extracts two text parts that include the keywords fast or performance and surrounds the keywords with [ and ].

The maximum number of text parts is 3 by default. You can change it with the max_n_results option:

Execution example:

select Documents \
  --output_columns 'snippet(content, \
                            { \
                               "default_open_tag": "[", \
                               "default_close_tag": "]", \
                               "max_n_results": 1 \
                            })' \
  --match_columns content \
  --query "fast performance"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         1
#       ],
#       [
#         [
#           "snippet",
#           null
#         ]
#       ],
#       [
#         [
#           "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro"
#         ]
#       ]
#     ]
#   ]
# ]

It returns only one snippet because "max_n_results": 1 is specified.

The maximum size of a text part is 200 bytes by default. The unit is bytes, not characters. The size doesn’t include the inserted [ and ]. You can change it with the width option:

Execution example:

select Documents \
  --output_columns 'snippet(content, \
                            { \
                               "default_open_tag": "[", \
                               "default_close_tag": "]", \
                               "width": 50 \
                            })' \
  --match_columns content \
  --query "fast performance"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         1
#       ],
#       [
#         [
#           "snippet",
#           null
#         ]
#       ],
#       [
#         [
#           "Groonga is a [fast] and accurate full text search en",
#           " result in superior [performance] on real-time appli"
#         ]
#       ]
#     ]
#   ]
# ]
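
Because width is measured in bytes, the same width covers fewer characters of multibyte text than of ASCII text. The following Python sketch only illustrates that difference, assuming UTF-8 encoded text. The utf8_size and truncate_to_bytes helpers are hypothetical and don’t describe how snippet chooses its boundaries.

```python
def utf8_size(text):
    """Return the size of text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def truncate_to_bytes(text, width):
    """Keep at most `width` UTF-8 bytes without splitting a character.

    Hypothetical helper: an incomplete trailing character is dropped.
    """
    return text.encode("utf-8")[:width].decode("utf-8", errors="ignore")


ascii_text = "Groonga"
japanese_text = "全文検索エンジン"

# 7 characters, 7 bytes
print(len(ascii_text), utf8_size(ascii_text))
# 8 characters, 24 bytes (3 bytes per character in UTF-8)
print(len(japanese_text), utf8_size(japanese_text))
# A 10-byte budget holds only 3 of these characters
print(truncate_to_bytes(japanese_text, 10))
```

With such text, a width of 50 may hold far fewer than 50 characters, so you may want a larger width for non-ASCII content.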

You can detect snippet delimiters with a regular expression by using the delimiter_regexp option. For example, \.\s* limits each snippet to text within the sentence that contains the keyword. Note that you need to escape \ in the string:

Execution example:

select Documents \
  --output_columns 'snippet(content, \
                            { \
                               "default_open_tag": "[", \
                               "default_close_tag": "]", \
                               "delimiter_regexp": "\\\\.\\\\s*" \
                            })' \
  --match_columns content \
  --query "fast performance"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         1
#       ],
#       [
#         [
#           "snippet",
#           null
#         ]
#       ],
#       [
#         [
#           "Groonga is a [fast] and accurate full text search engine based on inverted index",
#           "These characteristics result in superior [performance] on real-time applications"
#         ]
#       ]
#     ]
#   ]
# ]

You can see the detected delimiters (. and following white spaces) aren’t included in the result snippets. This is intentional behavior.
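
The effect of the \.\s* pattern can be tried outside Groonga. The following sketch uses Python’s re module as a stand-in, which is an assumption: Groonga’s regular expression engine may differ in details, and the split_by_delimiter helper is hypothetical. It only shows why the pieces between delimiters contain neither the period nor the following white spaces.

```python
import re


def split_by_delimiter(text, delimiter_regexp=r"\.\s*"):
    """Split text at every match of delimiter_regexp.

    The matched delimiters are discarded, and empty pieces
    (for example, after a trailing period) are dropped.
    """
    return [part for part in re.split(delimiter_regexp, text) if part]


text = "Groonga is fast. It allows updates without read locks.  Performance is superior."
for part in split_by_delimiter(text):
    print(repr(part))
```

Each printed piece is one sentence without its terminating period, which matches the shape of the snippets in the execution example above.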

You can specify keywords explicitly instead of extracting keywords from the current condition:

Execution example:

select Documents \
  --output_columns 'snippet(content, \
                            "fast", \
                            "performance", \
                            { \
                               "default_open_tag": "[", \
                               "default_close_tag": "]" \
                            })'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "snippet",
#           null
#         ]
#       ],
#       [
#         [
#           "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro",
#           "onga allows updates without read locks. These characteristics result in superior [performance] on real-time applications."
#         ]
#       ],
#       [
#         null
#       ]
#     ]
#   ]
# ]

This snippet returns two snippets for the first record and null for the second record, because the second record doesn’t contain any of the specified keywords.

You can specify open tag and close tag for each keyword:

Execution example:

select Documents \
  --output_columns 'snippet(content, \
                            "fast", "[", "]", \
                            "performance", "(", ")")'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "snippet",
#           null
#         ]
#       ],
#       [
#         [
#           "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro",
#           "onga allows updates without read locks. These characteristics result in superior (performance) on real-time applications."
#         ]
#       ],
#       [
#         null
#       ]
#     ]
#   ]
# ]

This snippet surrounds fast with [ and ] and performance with ( and ).

TODO: html_escape option and so on

7.15.21.4. Parameters

7.15.21.4.1. Required parameters

TODO

7.15.21.4.2. Optional parameters

TODO

7.15.21.4.2.1. max_n_results

TODO

7.15.21.4.2.2. width

TODO

7.15.21.5. Return value

This function returns an array of strings or null. If this function can’t find any snippets, it returns null.

An element of array is a snippet:

[SNIPPET1, SNIPPET2, ...]

A snippet includes one or more keywords. The maximum size of a snippet, excluding open tags and close tags, is 200 bytes by default. The unit is bytes, not the number of characters.

You can change this with the width option.

By default, the array contains at least 1 and at most 3 snippets.

You can change the upper limit with the max_n_results option.
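
Because the value is either an array or null, client code should handle both cases. The following Python sketch assumes a select response shaped like the execution examples in the usage section, with snippet as the only output column. The snippets_per_record helper is hypothetical, and the snippet texts are shortened for illustration.

```python
import json

# Shortened response: the first record has two snippets, the second has none.
response_text = """
[
  [0, 1337566253.89858, 0.000355720520019531],
  [
    [
      [2],
      [["snippet", null]],
      [["Groonga is a [fast] engine", "superior [performance] on real-time applications."]],
      [null]
    ]
  ]
]
"""


def snippets_per_record(response_text):
    """Return one list of snippets per record, using [] where snippet returned null."""
    _header, body = json.loads(response_text)
    result_set = body[0]
    # result_set[0] is the number of hits and result_set[1] is the column list.
    records = result_set[2:]
    return [record[0] or [] for record in records]


for snippets in snippets_per_record(response_text):
    print(len(snippets), snippets)
```

Mapping null to an empty list lets the caller iterate over every record in the same way.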

7.15.21.6. See also