7.11.3.1. scorer_tf_at_most
Note
This scorer is an experimental feature.
New in version 5.0.1.
7.11.3.1.1. Summary¶
scorer_tf_at_most is a scorer based on TF (term frequency).
TF-based scorers, including TF-IDF-based scorers, have a problem in the following case:
If a document contains the same keyword many times, such as “They are keyword, keyword, keyword … and keyword”, the document gets a high score. This is not the expected result. Search engine spammers may exploit this technique.
scorer_tf_at_most is a TF-based scorer, but it can handle this case.
scorer_tf_at_most limits the maximum score value. This means that scorer_tf_at_most limits the effect of a single match.
If a document contains the same keyword many times, such as “They are keyword, keyword, keyword … and keyword”, scorer_tf_at_most(column, 2.0) returns at most 2 as the score.
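The capping behavior can be sketched in a few lines of Python. This is an illustrative model, not Groonga's implementation; the function name `tf_at_most` and the use of a raw occurrence count as TF are assumptions made for this example.

```python
def tf_at_most(term_frequency: float, max_score: float) -> float:
    """Return the term frequency capped at max_score (hypothetical helper)."""
    return min(float(term_frequency), float(max_score))


# With a cap of 2.0, a document that repeats the keyword 50 times gains
# nothing over a document that contains the keyword twice.
spam_score = tf_at_most(50, 2.0)
normal_score = tf_at_most(2, 2.0)
single_score = tf_at_most(1, 2.0)
print(spam_score, normal_score, single_score)
```

Under this model, repeating a keyword beyond the cap no longer raises the score, which is what removes the incentive for keyword stuffing.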
You don’t need to solve scoring with the score function alone. A score function depends heavily on the search query. You may also be able to use metadata of the matched record.
For example, Google uses PageRank for scoring. You may be able to use the data type (“title” data are more important than “memo” data), tags, geolocation and so on.
Don’t think only about the score function when you design scoring.
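As a rough illustration of combining a capped TF with record metadata, the following Python sketch weights matches by the kind of data they occur in. Everything here is hypothetical: the field names, the weight values, and the `combined_score` helper are assumptions for this example and do not correspond to any Groonga API.

```python
# Hypothetical per-field weights: a match in "title" data counts for more
# than a match in "memo" data. The values are arbitrary.
FIELD_WEIGHTS = {"title": 10.0, "memo": 1.0}


def combined_score(field_tf, max_tf=3.0):
    """Sum the capped term frequency of each field, scaled by its weight."""
    total = 0.0
    for field, tf in field_tf.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields get weight 1.0
        total += weight * min(tf, max_tf)
    return total


# One match in a title outweighs a memo stuffed with 100 matches.
print(combined_score({"title": 1, "memo": 0}))
print(combined_score({"title": 0, "memo": 100}))
```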
7.11.3.1.2. Syntax¶
This scorer has two parameters:
scorer_tf_at_most(column, max)
scorer_tf_at_most(index, max)
7.11.3.1.3. Usage¶
This section describes how to use this scorer.
Here are a schema definition and sample data to show usage.
Sample schema:
Execution example:
table_create Logs TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Logs message COLUMN_SCALAR Text
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
--default_tokenizer TokenBigram \
--normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms message_index COLUMN_INDEX|WITH_POSITION Logs message
# [[0, 1337566253.89858, 0.000355720520019531], true]
Sample data:
Execution example:
load --table Logs
[
{"message": "Notice"},
{"message": "Notice Notice"},
{"message": "Notice Notice Notice"},
{"message": "Notice Notice Notice Notice"},
{"message": "Notice Notice Notice Notice Notice"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 5]
Specify scorer_tf_at_most in match_columns as follows:
Execution example:
select Logs \
--match_columns "scorer_tf_at_most(message, 3.0)" \
--query "Notice" \
--output_columns "message, _score" \
--sort_keys "-_score"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         5
#       ],
#       [
#         [
#           "message",
#           "Text"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "Notice Notice Notice Notice Notice",
#         3
#       ],
#       [
#         "Notice Notice Notice Notice",
#         3
#       ],
#       [
#         "Notice Notice Notice",
#         3
#       ],
#       [
#         "Notice Notice",
#         2
#       ],
#       [
#         "Notice",
#         1
#       ]
#     ]
#   ]
# ]
If a document has three or more Notice terms, its score is 3, because the select specifies 3.0 as the max score.
If a document has one or two Notice terms, its score is 1 or 2, because the score is less than the 3.0 specified as the max score.
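The scores in this result can be reproduced with the following Python sketch. It assumes that TF is the number of whitespace-separated tokens equal to the query term, which holds for this sample data but does not model Groonga's tokenizer or normalizer. `capped_scores` is a hypothetical helper, not part of Groonga.

```python
def capped_scores(messages, term, max_score):
    """Score each message by its term count, capped at max_score.

    Illustrative model only: TF is the count of whitespace-separated
    tokens equal to `term`. Records with no match are skipped.
    """
    scored = []
    for message in messages:
        tf = message.split().count(term)
        if tf == 0:
            continue
        # The result reports _score as an integer, so convert the capped value.
        scored.append((message, int(min(tf, max_score))))
    # Highest score first; the order among equal scores may differ from Groonga.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# The same five records as the sample data: "Notice" repeated 1 to 5 times.
messages = [" ".join(["Notice"] * n) for n in range(1, 6)]
for message, score in capped_scores(messages, "Notice", 3.0):
    print(score, message)
```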
7.11.3.1.4. Parameters¶
This section describes all parameters.
7.11.3.1.4.1. Required parameters¶
There are two required parameters: the match target, given as either column or index, and max, the upper limit of the score.
7.11.3.1.4.1.1. column
The data column to match against. The data column must be indexed.
7.11.3.1.4.1.2. index
The index column to be used for search.
7.11.3.1.4.2. Optional parameters¶
There are no optional parameters.
7.11.3.1.5. Return value¶
This scorer returns the score as Float.
select returns _score as Int32, not Float, because it casts the score from Float to Int32 to keep backward compatibility.
The score is computed as TF, limited to at most max.