7.11.3.2. scorer_tf_idf

New in version 5.0.1.

7.11.3.2.1. Summary

scorer_tf_idf is a scorer based on the TF-IDF (term frequency-inverse document frequency) score function.

Put simply, TF-IDF is TF (term frequency) divided by DF (document frequency). “TF” means that “a larger number of occurrences is more important”. “TF divided by DF” means that “a larger number of occurrences of an important term is more important”.

The default score function in Groonga is TF (term frequency). It is fast but does not take term importance into account.

TF-IDF cares about term importance but is slower than TF.
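
As a rough illustration of the idea above, the following Python sketch uses a common textbook formulation, tf * log(N / df), where N is the number of all documents. The helper name tf_idf and all numbers are assumptions made for illustration. This is not Groonga's internal implementation, and Groonga's exact formula may differ.

```python
import math


def tf_idf(tf, df, n_documents):
    """Textbook TF-IDF: term frequency times log-scaled inverse document frequency.

    Hypothetical helper for illustration only; not Groonga's implementation.
    """
    return tf * math.log(n_documents / df)


# Assumed collection size for this illustration.
n_documents = 1000

# A common term: 3 occurrences in this document, found in 500 documents.
common_score = tf_idf(tf=3, df=500, n_documents=n_documents)

# A rare term: 1 occurrence in this document, found in only 5 documents.
rare_score = tf_idf(tf=1, df=5, n_documents=n_documents)

# The single occurrence of the rare term outweighs three occurrences
# of the common term.
print(rare_score > common_score)
```

With plain TF the common term would score higher (3 against 1). Weighting by document frequency reverses the order.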

In many cases TF-IDF computes a more suitable score than TF, but it is not perfect.

If a document repeats the same keyword many times, such as “They are keyword, keyword, keyword … and keyword”, its score increases under both TF and TF-IDF. Search engine spammers may use this technique, and TF-IDF does not guard against it.

Okapi BM25 can solve this case, but it is slower than TF-IDF and is not implemented in Groonga yet.

Groonga provides the scorer_tf_at_most scorer, which can also solve this case.
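
The idea of limiting how much repeated keywords can contribute to a score can be sketched in Python as follows. The helper capped_tf and the limit 2.0 are hypothetical, chosen only for illustration. This is a sketch of the idea and not the implementation of scorer_tf_at_most.

```python
def capped_tf(n_occurrences, max_score):
    """Return the term frequency limited to max_score (hypothetical helper)."""
    return min(float(n_occurrences), max_score)


# A document stuffed with 50 copies of a keyword and an ordinary document
# that mentions it twice get the same score once TF is capped.
stuffed = capped_tf(50, 2.0)
ordinary = capped_tf(2, 2.0)

# A single mention stays below the cap and is left unchanged.
single = capped_tf(1, 2.0)

print(stuffed, ordinary, single)
```

Because the cap removes the benefit of repeating a keyword beyond the limit, keyword stuffing no longer pushes a document to the top.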

You don’t need to solve scoring with the score function alone. The score function depends heavily on the search query. You may also be able to use metadata of the matched records.

For example, Google uses PageRank for scoring. You may be able to use the data type (“title” data is more important than “memo” data), tags, geolocation and so on.

Please don’t think only about the score function when you work on scoring.

7.11.3.2.2. Syntax

This scorer has only one parameter, which is either a data column or an index column:

scorer_tf_idf(column)
scorer_tf_idf(index)

7.11.3.2.3. Usage

This section describes how to use this scorer.

Here are a schema definition and sample data to show usage.

Sample schema:

Execution example:

table_create Logs TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Logs message COLUMN_SCALAR Text
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms message_index COLUMN_INDEX|WITH_POSITION Logs message
# [[0, 1337566253.89858, 0.000355720520019531], true]

Sample data:

Execution example:

load --table Logs
[
{"message": "Error"},
{"message": "Warning"},
{"message": "Warning Warning"},
{"message": "Warning Warning Warning"},
{"message": "Info"},
{"message": "Info Info"},
{"message": "Info Info Info"},
{"message": "Info Info Info Info"},
{"message": "Notice"},
{"message": "Notice Notice"},
{"message": "Notice Notice Notice"},
{"message": "Notice Notice Notice Notice"},
{"message": "Notice Notice Notice Notice Notice"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 13]

You specify scorer_tf_idf in match_columns like the following:

Execution example:

select Logs \
  --match_columns "scorer_tf_idf(message)" \
  --query "Error OR Info" \
  --output_columns "message, _score" \
  --sort_keys "-_score"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         5
#       ],
#       [
#         [
#           "message",
#           "Text"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "Info Info Info Info",
#         3
#       ],
#       [
#         "Error",
#         2
#       ],
#       [
#         "Info Info Info",
#         2
#       ],
#       [
#         "Info Info",
#         1
#       ],
#       [
#         "Info",
#         1
#       ]
#     ]
#   ]
# ]

Info Info Info and Error both score 2, even though Info Info Info contains three Info terms, because Error is a more important term than Info. The number of documents that include Info is 4, while the number of documents that include Error is 1. A term that appears in fewer documents is a more characteristic term, and a characteristic term is an important term.
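
The document counts quoted above can be checked with a short Python sketch over the sample data. Lowercasing and splitting on whitespace is a simplifying assumption that stands in for the TokenBigram tokenizer and the NormalizerAuto normalizer. The variable names are illustrative only.

```python
from collections import Counter

# The 13 sample messages loaded into Logs.
messages = (
    ["Error"]
    + [" ".join(["Warning"] * n) for n in range(1, 4)]
    + [" ".join(["Info"] * n) for n in range(1, 5)]
    + [" ".join(["Notice"] * n) for n in range(1, 6)]
)

document_frequency = Counter()
for message in messages:
    # Count each term once per document, however often it occurs there.
    document_frequency.update(set(message.lower().split()))

print(len(messages))
print(document_frequency["info"], document_frequency["error"])
```

Error appears in only one of the 13 documents, so it is the most characteristic term in this data set.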

7.11.3.2.4. Parameters

This section describes all parameters.

7.11.3.2.4.1. Required parameters

There is only one required parameter. Specify either column or index.

7.11.3.2.4.1.1. column

The data column to match against. The data column must be indexed.

7.11.3.2.4.1.2. index

The index column to be used for search.

7.11.3.2.4.2. Optional parameters

There are no optional parameters.

7.11.3.2.5. Return value

This scorer returns the score as Float32.

select returns _score as Int32, not Float, because it casts the score from Float to Int32 to keep backward compatibility.

The score is computed by a TF-IDF based algorithm.

7.11.3.2.6. See also