BM25 Retriever

This guide shows how to use the BM25Retriever for keyword-based document retrieval with Okapi BM25 scoring.

Overview

BM25 (Best Matching 25) is a classic information retrieval algorithm that scores documents based on term frequency, inverse document frequency, and document length normalization. Unlike vector-based retrieval, BM25 does not require embeddings -- it works directly on the text.

BM25 is a good choice when:

You need exact keyword matching rather than semantic similarity.
You want fast retrieval without the cost of computing embeddings.
You want to combine it with a vector retriever in an ensemble (see Ensemble Retriever).

Basic usage

use synaptic::retrieval::{BM25Retriever, Document, Retriever};

let docs = vec![
    Document::new("1", "Rust is a systems programming language focused on safety"),
    Document::new("2", "Python is widely used for data science and machine learning"),
    Document::new("3", "Go was designed at Google for concurrent programming"),
    Document::new("4", "Rust provides memory safety without garbage collection"),
];

let retriever = BM25Retriever::new(docs);

let results = retriever.retrieve("Rust memory safety", 2).await?;
// Returns documents 4 and 1 (highest BM25 scores for those query terms)

The retriever pre-computes term frequencies, document lengths, and inverse document frequencies at construction time, so retrieval itself is fast.
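
To make the pre-computation step concrete, here is a minimal standalone sketch of the statistics described above. It is not Synaptic's actual implementation: the `Bm25Index` struct, the `build_index` function, and the lowercase, split-on-non-alphanumeric tokenizer are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Corpus statistics gathered once, up front.
/// Hypothetical struct for illustration, not part of Synaptic.
struct Bm25Index {
    /// Per-document term counts.
    term_freqs: Vec<HashMap<String, usize>>,
    /// Per-document length in tokens.
    doc_lens: Vec<usize>,
    /// Number of documents containing each term.
    doc_freqs: HashMap<String, usize>,
    /// Average document length across the corpus.
    avgdl: f64,
}

/// Assumed tokenizer: lowercase, split on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

fn build_index(docs: &[&str]) -> Bm25Index {
    let mut term_freqs = Vec::with_capacity(docs.len());
    let mut doc_lens = Vec::with_capacity(docs.len());
    let mut doc_freqs: HashMap<String, usize> = HashMap::new();

    for doc in docs {
        let tokens = tokenize(doc);
        doc_lens.push(tokens.len());

        // Count how often each term occurs in this document.
        let mut tf: HashMap<String, usize> = HashMap::new();
        for token in tokens {
            *tf.entry(token).or_insert(0) += 1;
        }
        // Each distinct term counts once toward its document frequency.
        for term in tf.keys() {
            *doc_freqs.entry(term.clone()).or_insert(0) += 1;
        }
        term_freqs.push(tf);
    }

    let avgdl = if docs.is_empty() {
        0.0
    } else {
        doc_lens.iter().sum::<usize>() as f64 / docs.len() as f64
    };

    Bm25Index { term_freqs, doc_lens, doc_freqs, avgdl }
}
```

With these tables built once, answering a query only needs lookups for the query's own terms rather than a pass over every document's text.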

Custom BM25 parameters

BM25 has two tuning parameters:

k1 (default 1.5) -- controls term frequency saturation. Higher values let repeated occurrences of a term keep raising the score for longer before it levels off.
b (default 0.75) -- controls document length normalization. 1.0 means full length normalization, so documents longer than average are penalized the most; 0.0 means no length normalization.

let retriever = BM25Retriever::with_params(docs, 1.2, 0.8);
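
To get a feel for what the two parameters do, the sketch below isolates the term-frequency part of the standard BM25 formula, leaving IDF out. The `tf_factor` function is a hypothetical helper written for this illustration and is not part of Synaptic's API.

```rust
/// Term-frequency factor of the standard BM25 formula, with IDF omitted.
/// Hypothetical helper for illustration only.
fn tf_factor(tf: f64, k1: f64, b: f64, dl: f64, avgdl: f64) -> f64 {
    // b blends between no length normalization (0.0) and full normalization (1.0).
    let length_norm = 1.0 - b + b * dl / avgdl;
    // The factor grows with tf but is bounded above by k1 + 1.
    tf * (k1 + 1.0) / (tf + k1 * length_norm)
}
```

For a document of exactly average length with k1 = 1.5, the factor is 1.0 at tf = 1 and approaches k1 + 1 = 2.5 as tf grows, which is the saturation that k1 controls. With b = 0.0 the `dl / avgdl` ratio drops out entirely, so document length no longer affects the result.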

How scoring works

For each query term, BM25 computes the per-term score below; a document's total score is the sum of these over all query terms:

score = IDF * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

Where:

IDF = ln((N - df + 0.5) / (df + 0.5) + 1) -- inverse document frequency
tf -- term frequency in the document
dl -- document length (in tokens)
avgdl -- average document length across the corpus
N -- total number of documents
df -- number of documents containing the term

Documents with a total score of zero (no matching terms) are excluded from results.
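
The formula and definitions above can be put together in a short standalone sketch. This is an illustration under stated assumptions, not the code Synaptic runs: `bm25_rank` and `tokenize` are hypothetical names, the tokenizer (lowercase, split on non-alphanumeric characters) is assumed, and statistics are recomputed on every call for brevity rather than cached at construction time.

```rust
use std::collections::{HashMap, HashSet};

/// Assumed tokenizer: lowercase, split on non-alphanumeric characters.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Scores every document against the query and returns
/// (document index, score) pairs sorted by descending score.
/// Documents with a total score of zero are left out.
fn bm25_rank(docs: &[&str], query: &str, k1: f64, b: f64) -> Vec<(usize, f64)> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = docs.len() as f64;
    // avgdl: average document length in tokens (0.0 for an empty corpus).
    let avgdl = tokenized.iter().map(|t| t.len()).sum::<usize>() as f64 / n.max(1.0);

    // df: number of documents containing each term.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for tokens in &tokenized {
        let unique: HashSet<&str> = tokens.iter().map(|t| t.as_str()).collect();
        for term in unique {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let query_terms = tokenize(query);
    let mut ranked: Vec<(usize, f64)> = Vec::new();
    for (i, tokens) in tokenized.iter().enumerate() {
        let dl = tokens.len() as f64;
        let mut score = 0.0;
        for term in &query_terms {
            // tf: occurrences of the query term in this document.
            let tf = tokens.iter().filter(|t| *t == term).count() as f64;
            if tf == 0.0 {
                continue;
            }
            let df_t = df.get(term.as_str()).copied().unwrap_or(0) as f64;
            let idf = ((n - df_t + 0.5) / (df_t + 0.5) + 1.0).ln();
            score += idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * dl / avgdl));
        }
        if score > 0.0 {
            ranked.push((i, score));
        }
    }
    ranked.sort_by(|x, y| y.1.total_cmp(&x.1));
    ranked
}
```

Run on the four sample documents from Basic usage with the query "Rust memory safety" and the default parameters, this sketch ranks document 4 first (it matches all three terms and is shorter than average), then document 1, and omits documents 2 and 3 because they share no terms with the query.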

Combining with vector search

BM25 pairs well with vector retrieval through the EnsembleRetriever. Each retriever is given a weight, so exact keyword matches and semantic similarity both contribute to the final ranking:

use std::sync::Arc;
use synaptic::retrieval::{BM25Retriever, EnsembleRetriever, Retriever};

let bm25 = Arc::new(BM25Retriever::new(docs.clone()));
let vector_retriever = Arc::new(/* VectorStoreRetriever */);

let ensemble = EnsembleRetriever::new(vec![
    (vector_retriever as Arc<dyn Retriever>, 0.5),
    (bm25 as Arc<dyn Retriever>, 0.5),
]);

let results = ensemble.retrieve("query", 5).await?;

See Ensemble Retriever for more details on combining retrievers.