7.11.3.1. scorer_tf_at_most

Note

This scorer is an experimental feature.

New in version 5.0.1.

7.11.3.1.1. Summary

scorer_tf_at_most is a scorer based on TF (term frequency).

TF based scorers, including TF-IDF based scorers, have a problem in the following case:

If a document contains the same keyword many times, such as “They are keyword, keyword, keyword … and keyword”, the document gets a high score. This is not expected. Search engine spammers may use this technique.

scorer_tf_at_most is a TF based scorer, but it can handle this case.

scorer_tf_at_most limits the maximum score value. This means that scorer_tf_at_most limits the effect of a match.

If a document contains the same keyword many times, such as “They are keyword, keyword, keyword … and keyword”, scorer_tf_at_most(column, 2.0) returns at most 2 as the score.
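The capping behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not Groonga's implementation: the helper names plain_tf and tf_at_most are hypothetical, and whitespace splitting stands in for real tokenization.

```python
# Illustrative model only; plain_tf and tf_at_most are hypothetical helpers,
# not Groonga functions. Tokenization is simplified to whitespace splitting.

def plain_tf(tokens, term):
    """Uncapped term frequency: the number of times term occurs."""
    return float(tokens.count(term))


def tf_at_most(tokens, term, max_score):
    """Term frequency capped at max_score."""
    return min(plain_tf(tokens, term), max_score)


# A keyword-stuffed document, similar to the spam case in the text.
spam = "they are keyword keyword keyword keyword and keyword".split()

print(plain_tf(spam, "keyword"))         # 5.0: grows with every repetition
print(tf_at_most(spam, "keyword", 2.0))  # 2.0: repetitions beyond the cap add nothing
```

With the cap in place, repeating a keyword beyond the limit no longer raises the score, which removes the incentive for keyword stuffing.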

You don’t need to solve scoring with the score function alone. A score function depends heavily on the search query. You may also be able to use metadata of the matched record.

For example, Google uses PageRank for scoring. You may be able to use the data type (“title” data are more important than “memo” data), tags, geolocation and so on.

Please stop thinking about the score function as the only way to do scoring.

7.11.3.1.2. Syntax

This scorer has two parameters:

scorer_tf_at_most(column, max)
scorer_tf_at_most(index, max)

7.11.3.1.3. Usage

This section describes how to use this scorer.

Here are a schema definition and sample data to show the usage.

Sample schema:

Execution example:

table_create Logs TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Logs message COLUMN_SCALAR Text
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms message_index COLUMN_INDEX|WITH_POSITION Logs message
# [[0, 1337566253.89858, 0.000355720520019531], true]

Sample data:

Execution example:

load --table Logs
[
{"message": "Notice"},
{"message": "Notice Notice"},
{"message": "Notice Notice Notice"},
{"message": "Notice Notice Notice Notice"},
{"message": "Notice Notice Notice Notice Notice"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 5]

Specify scorer_tf_at_most in match_columns as follows:

Execution example:

select Logs \
  --match_columns "scorer_tf_at_most(message, 3.0)" \
  --query "Notice" \
  --output_columns "message, _score" \
  --sort_keys "-_score"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         5
#       ],
#       [
#         [
#           "message",
#           "Text"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "Notice Notice Notice Notice Notice",
#         3
#       ],
#       [
#         "Notice Notice Notice Notice",
#         3
#       ],
#       [
#         "Notice Notice Notice",
#         3
#       ],
#       [
#         "Notice Notice",
#         2
#       ],
#       [
#         "Notice",
#         1
#       ]
#     ]
#   ]
# ]

If a document has three or more Notice terms, its score is 3, because the select specifies 3.0 as the max score.

If a document has one or two Notice terms, its score is 1 or 2, because the score is less than the 3.0 specified as the max score.
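The scores in the result above can be reproduced with a small Python model. This is a sketch under stated assumptions, not Groonga's code: displayed_score is a hypothetical helper, and lowercasing plus whitespace splitting only approximates what NormalizerAuto and TokenBigram do for this simple English sample data.

```python
# Simplified model of the example above. displayed_score is a hypothetical
# helper; lowercasing and whitespace splitting are stand-ins (an assumption)
# for NormalizerAuto and TokenBigram on this sample data.

MESSAGES = [
    "Notice",
    "Notice Notice",
    "Notice Notice Notice",
    "Notice Notice Notice Notice",
    "Notice Notice Notice Notice Notice",
]


def displayed_score(message, query, max_score):
    """Count query occurrences, cap at max_score, and show as an integer."""
    tf = message.lower().split().count(query.lower())
    capped = min(float(tf), max_score)
    # _score is shown as Int32 by select. The values here are whole numbers,
    # so the exact rounding behavior of that cast does not matter.
    return int(capped)


scores = [displayed_score(m, "Notice", 3.0) for m in MESSAGES]
print(scores)  # [1, 2, 3, 3, 3]
```

The first two records stay below the cap and keep their term counts as scores, while the last three are all limited to 3.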

7.11.3.1.4. Parameters

This section describes all parameters.

7.11.3.1.4.1. Required parameters

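Both parameters are required. The first parameter is the match target, given as either column or index. The second parameter, max, is the maximum score value.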

7.11.3.1.4.1.1. column

The data column that is the match target. The data column must be indexed.

7.11.3.1.4.1.2. index

The index column to be used for search.

7.11.3.1.4.2. Optional parameters

There are no optional parameters.

7.11.3.1.5. Return value

This scorer returns the score as Float.

select returns _score as Int32, not Float, because it casts the score from Float to Int32 to keep backward compatibility.

The score is computed as TF, limited to at most max.

7.11.3.1.6. See also