/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.search.similarities;


import java.util.Collections;

import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.util.SmallFloat;

/**
 * Similarity defines the components of Lucene scoring.
 * <p>
 * Expert: Scoring API.
 * <p>
 * This is a low-level API; you should only extend it if you want to implement
 * an information retrieval <i>model</i>. If you are instead looking for a
 * convenient way to alter Lucene's scoring, consider tweaking the default
 * implementation ({@link BM25Similarity}) or extending {@link SimilarityBase},
 * which makes it easy to compute a score from index statistics.
 * <p>
 * Similarity determines how Lucene weights terms, and Lucene interacts with
 * this class at both <a href="#indextime">index-time</a> and
 * <a href="#querytime">query-time</a>.
 * <p>
 * <a name="indextime">Indexing Time</a>
 * At indexing time, the indexer calls {@link #computeNorm(FieldInvertState)},
 * allowing the Similarity implementation to set a per-document value for the
 * field that will later be accessible via
 * {@link org.apache.lucene.index.LeafReader#getNormValues(String)}. Lucene
 * makes no assumption about what is in this norm, but it is most useful for
 * encoding length normalization information.
 * <p>
 * Implementations should carefully consider how the normalization is encoded:
 * while Lucene's {@link BM25Similarity} encodes length normalization
 * information with {@link SmallFloat} into a single byte, this might not be
 * suitable for all purposes.
 * <p>
 * Many formulas require the use of average document length, which can be
 * computed via a combination of
 * {@link CollectionStatistics#sumTotalTermFreq()} and
 * {@link CollectionStatistics#docCount()}.
 * <p>
 * Additional scoring factors can be stored in named
 * {@link NumericDocValuesField}s and accessed at query-time with
 * {@link org.apache.lucene.index.LeafReader#getNumericDocValues(String)}.
 * However, this should not be done in the {@link Similarity} but externally,
 * for instance by using <tt>FunctionScoreQuery</tt>.
 * <p>
 * Finally, using index-time boosts (either via folding into the normalization
 * byte or via DocValues) is an inefficient way to boost the scores of
 * different fields if the boost will be the same for every document. Instead,
 * the Similarity can simply take a constant boost parameter <i>C</i>, and
 * {@link PerFieldSimilarityWrapper} can return different instances with
 * different boosts depending upon field name.
 * <p>
 * <a name="querytime">Query time</a>
 * At query-time, queries interact with the Similarity via these steps:
 * <ol>
 *   <li>The {@link #scorer(float, CollectionStatistics, TermStatistics...)}
 *   method is called a single time, allowing the implementation to compute
 *   any statistics (such as IDF, average document length, etc.) across
 *   <i>the entire collection</i>. The {@link TermStatistics} and
 *   {@link CollectionStatistics} passed in already contain all of the raw
 *   statistics involved, so a Similarity can freely use any combination of
 *   statistics without causing any additional I/O. Lucene makes no assumption
 *   about what is stored in the returned {@link Similarity.SimScorer} object.
 *   <li>Then {@link SimScorer#score(float, long)} is called for every
 *   matching document to compute its score.
 * </ol>
 * <p>
 * <a name="explaintime">Explanations</a>
 * When {@link IndexSearcher#explain(org.apache.lucene.search.Query, int)} is
 * called, queries consult the Similarity's {@link SimScorer} for an
 * explanation of how it computed its score. The query passes in an
 * {@link Explanation} of how the frequency was computed, along with the
 * document's encoded norm.
 *
 * @see org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)
 * @see IndexSearcher#setSimilarity(Similarity)
 * @lucene.experimental
 */
public abstract class Similarity {
  /**
   * Sole constructor. (For invocation by subclass constructors, typically
   * implicit.)
   */
  public Similarity() {}
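
  /*
   * For illustration only (not part of the Lucene API): a minimal sketch of a
   * complete Similarity that scores by raw term frequency and ignores norms.
   * The class name is hypothetical.
   *
   *   public class RawTFSimilarity extends Similarity {
   *     @Override
   *     public long computeNorm(FieldInvertState state) {
   *       return 1; // norms unused; 1 is the legal norm that scores highest
   *     }
   *     @Override
   *     public SimScorer scorer(float boost,
   *         CollectionStatistics collectionStats, TermStatistics... termStats) {
   *       return new SimScorer() {
   *         @Override
   *         public float score(float freq, long norm) {
   *           // non-decreasing in freq, constant in norm: both contracts hold
   *           return boost * freq;
   *         }
   *       };
   *     }
   *   }
   */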
  /**
   * Computes the normalization value for a field, given the accumulated
   * state of term processing for this field (see {@link FieldInvertState}).
   *
   * <p>Matches in longer fields are less precise, so implementations of this
   * method usually set smaller values when <code>state.getLength()</code> is
   * large, and larger values when <code>state.getLength()</code> is small.
   *
   * <p>Note that for a given term-document frequency, greater unsigned norms
   * must produce scores that are lower or equal, i.e. for two encoded norms
   * {@code n1} and {@code n2} such that {@code Long.compareUnsigned(n1, n2) > 0},
   * it must hold that {@code SimScorer.score(freq, n1) <= SimScorer.score(freq, n2)}
   * for any legal {@code freq}.
   *
   * <p>{@code 0} is not a legal norm, so {@code 1} is the norm that produces
   * the highest scores.
   *
   * @lucene.experimental
   *
   * @param state current processing state for this field
   * @return computed norm value
   */
  public abstract long computeNorm(FieldInvertState state);
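
  /*
   * A sketch of a typical implementation that, like BM25Similarity,
   * compresses the field length into a single byte with SmallFloat
   * (assuming position-increment overlaps should be discounted):
   *
   *   @Override
   *   public long computeNorm(FieldInvertState state) {
   *     int numTerms = state.getLength() - state.getNumOverlap();
   *     return SmallFloat.intToByte4(numTerms); // invert with SmallFloat.byte4ToInt
   *   }
   */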
  /**
   * Compute any collection-level weight (e.g. IDF, average document length,
   * etc.) needed for scoring a query.
   *
   * @param boost a multiplicative factor to apply to the produced scores
   * @param collectionStats collection-level statistics, such as the number of
   *        tokens in the collection.
   * @param termStats term-level statistics, such as the document frequency of
   *        a term across the collection.
   * @return SimScorer object with the information this Similarity needs to
   *         score a query.
   */
  public abstract SimScorer scorer(float boost,
      CollectionStatistics collectionStats, TermStatistics... termStats);
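
  /*
   * A sketch of how an implementation might derive per-query constants from
   * the supplied statistics without causing additional I/O; the BM25-style
   * idf formula here is just one possible choice:
   *
   *   double avgFieldLength =
   *       (double) collectionStats.sumTotalTermFreq() / collectionStats.docCount();
   *   float idf = 0;
   *   for (TermStatistics term : termStats) {
   *     long df = term.docFreq();
   *     idf += (float) Math.log(1 + (collectionStats.docCount() - df + 0.5D) / (df + 0.5D));
   *   }
   *   // capture boost, idf and avgFieldLength in the returned SimScorer
   */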
  /** Stores the weight for a query across the indexed collection, and
   * computes the score of each matching document. This abstract
   * implementation is empty; descendants of {@code Similarity} should
   * subclass {@code SimScorer} and define the statistics they require in the
   * subclass. Examples include idf, average field length, etc. */
  public static abstract class SimScorer {
    /**
     * Sole constructor. (For invocation by subclass constructors.)
     */
    protected SimScorer() {}
    /**
     * Score a single document. {@code freq} is the document-term sloppy
     * frequency and must be finite and positive. {@code norm} is the
     * encoded normalization factor as computed by
     * {@link Similarity#computeNorm(FieldInvertState)} at index time, or
     * {@code 1} if norms are disabled. {@code norm} is never {@code 0}.
     * <p>
     * Score must not decrease when {@code freq} increases, i.e. if
     * {@code freq1 > freq2}, then {@code score(freq1, norm) >=
     * score(freq2, norm)} for any value of {@code norm} that may be produced
     * by {@link Similarity#computeNorm(FieldInvertState)}.
     * <p>
     * Score must not increase when the unsigned {@code norm} increases, i.e.
     * if {@code Long.compareUnsigned(norm1, norm2) > 0} then
     * {@code score(freq, norm1) <= score(freq, norm2)} for any legal
     * {@code freq}.
     * <p>
     * As a consequence, the maximum score that this scorer can produce is
     * bounded by {@code score(Float.MAX_VALUE, 1)}.
     *
     * @param freq sloppy term frequency, must be finite and positive
     * @param norm encoded normalization factor, or {@code 1} if norms are
     *        disabled
     * @return document's score
     */
    public abstract float score(float freq, long norm);
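
    /*
     * A sketch of a score implementation that satisfies both monotonicity
     * contracts: a saturating frequency component divided down by a length
     * decoded from the norm. Here weight stands for a hypothetical per-query
     * constant computed in scorer(...), and SmallFloat.byte4ToInt inverts
     * SmallFloat.intToByte4.
     *
     *   @Override
     *   public float score(float freq, long norm) {
     *     int length = SmallFloat.byte4ToInt((byte) norm);
     *     return weight * freq / (freq + length); // grows with freq, shrinks with length
     *   }
     */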
    /**
     * Explain the score for a single document.
     *
     * @param freq Explanation of how the sloppy term frequency was computed
     * @param norm encoded normalization factor, as returned by
     *        {@link Similarity#computeNorm}, or {@code 1} if norms are
     *        disabled
     * @return explanation of the document's score
     */
    public Explanation explain(Explanation freq, long norm) {
      return Explanation.match(
          score(freq.getValue().floatValue(), norm),
          "score(freq=" + freq.getValue() + "), with freq of:",
          Collections.singleton(freq));
    }
  }
}
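
/*
 * Typical wiring (a sketch; see the @see tags above): the same Similarity
 * implementation should normally be used both at index time and at search
 * time, e.g.
 *
 *   IndexWriterConfig config = new IndexWriterConfig(analyzer);
 *   config.setSimilarity(new BM25Similarity());
 *   ...
 *   IndexSearcher searcher = new IndexSearcher(reader);
 *   searcher.setSimilarity(new BM25Similarity());
 */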