BM25 + a semantic stand-in, fused with RRF — offline, polyglot, identical rankings in Python, Node and .NET.
Two documents about the same problem rarely use the same words. Hybrid retrieval runs two searchers over your documents and merges their results, so you catch both the exact matches and the ones that only match in meaning.
BM25 is keyword search — unbeatable at exact terms like an error code or a product ID, but blind to synonyms. A semantic searcher (embeddings) matches by meaning, so it finds paraphrases BM25 misses. Reciprocal Rank Fusion (RRF) combines the two ranked lists into one.
Everything here — the code and the tiny document set — already ships in the repo, so there's nothing to copy, paste or create. You'll read the program one piece at a time — tokenizer, BM25, semantic arm, fusion — in Python, Node.js or C#, then run the finished demo with a single command. Pick a language with the selector above; every step follows your choice. Press → to begin.
┌──────────────────────┐
┌────▶│ BM25 (lexical) │ exact terms: codes, IDs, names
│ └──────────┬───────────┘
query ───────┤ │ ranked list A
│ ▼
│ ┌──────────┐
│ │ RRF │ fuse by rank ──▶ ranked results
│ └──────────┘
│ ▲
│ │ ranked list B
│ ┌──────────┴───────────┐
└────▶│ semantic (meaning) │ paraphrases, synonyms
└──────────────────────┘You follow this lesson by reading and running — not by writing code. Clone the repo and cd into it; the walkthrough, the demo program and its data all live inside. The demo is offline and dependency-free — nothing to install for Python; for the other ports you only need the runtime (Node.js 18+ or the .NET 8 SDK). Every command below runs through the dispatcher from the repo root:
$ ./run -l 3 # or: ./run -l 3 --lang node | --lang csharpOur "knowledge base" is three small Markdown files that already ship with the lesson in data/ — about error codes, power problems, and first-time setup. You don't create these — they're already there; this step just shows what the demo searches over. A deliberately tiny corpus keeps the rankings easy to eyeball. Here is one of them:
# Error code reference
## E_4096 — overheating shutdown
If you see **error E_4096**, the device detected an over-temperature condition and shut down to protect
itself. Let it cool for ten minutes, clear any vents, then power it on again. If E_4096 returns, the
fan may have failed.
## E_2001 — storage full
Error E_2001 means internal storage is full. Delete old recordings to continue.The next four steps walk through the program that already lives in the repo — python/hybrid_demo.py and its Node/C# twins. *Just read along to see how it works — nothing to copy, and nothing to run until Run the demo below.*
Both searchers work on tokens, not raw text. We lowercase and split on runs of letters, digits and underscores — the same rule in every port, so the three implementations agree exactly.
def tokenize(text):
return re.findall(r"[a-z0-9_]+", text.lower())function tokenize(text) {
return (text.toLowerCase().match(/[a-z0-9_]+/g) ?? []);
}static List<string> Tokenize(string text) =>
Regex.Matches(text.ToLowerInvariant(), "[a-z0-9_]+").Select(m => m.Value).ToList();BM25 scores a document by how many query terms it contains, weighting rare terms far more (a code like E_4096 counts for much more than the) and damping very long documents. It is fast, needs no model, and is unbeatable when the user knows the exact string — but it can't match a synonym it has never seen.
def bm25_scores(query_tokens, docs):
n = len(docs)
if n == 0: # empty corpus → no scores (avoid /0 on avgdl)
return []
avgdl = sum(len(d["tokens"]) for d in docs) / n
df = {}
for d in docs:
for t in set(d["tokens"]):
df[t] = df.get(t, 0) + 1
scores = []
for d in docs:
dl = len(d["tokens"])
score = 0.0
for t in query_tokens:
if t not in df:
continue
idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
tf = d["tokens"].count(t)
score += idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avgdl))
scores.append(score)
return scoresfunction bm25Scores(queryTokens, docs) {
const n = docs.length;
if (n === 0) return []; // empty corpus → no scores (avoid Infinity/NaN on avgdl)
const avgdl = docs.reduce((s, d) => s + d.tokens.length, 0) / n;
const df = new Map();
for (const d of docs) for (const t of new Set(d.tokens)) df.set(t, (df.get(t) ?? 0) + 1);
return docs.map((d) => {
const dl = d.tokens.length;
let score = 0;
for (const t of queryTokens) {
if (!df.has(t)) continue;
const idf = Math.log(1 + (n - df.get(t) + 0.5) / (df.get(t) + 0.5));
const tf = count(d.tokens, t);
score += idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avgdl));
}
return score;
});
}List<double> Bm25Scores(List<string> queryTokens, List<Doc> corpus)
{
int n = corpus.Count;
if (n == 0) return new List<double>(); // empty corpus → no scores (Average throws on empty)
double avgdl = corpus.Average(d => d.Tokens.Count);
var df = new Dictionary<string, int>();
foreach (var d in corpus)
foreach (var t in d.Tokens.Distinct())
df[t] = df.GetValueOrDefault(t) + 1;
return corpus.Select(d =>
{
int dl = d.Tokens.Count;
double score = 0;
foreach (var t in queryTokens)
{
if (!df.TryGetValue(t, out int dft)) continue;
double idf = Math.Log(1 + (n - dft + 0.5) / (dft + 0.5));
int tf = d.Tokens.Count(x => x == t);
score += idf * (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * dl / avgdl));
}
return score;
}).ToList();
}A real system embeds text into vectors and matches by meaning, so paraphrases land near each other even with no shared words. To stay offline we stand in a small synonym map (broken → fail, dead) and score by overlap — enough to behave like embeddings for the demo, recovering documents BM25 would miss.
def semantic_scores(query_tokens, docs):
q = set(query_tokens)
for t in list(q):
q.update(SYNONYMS.get(t, []))
scores = []
for d in docs:
toks = set(d["tokens"])
scores.append(len(q & toks) / (len(q) or 1))
return scoresfunction semanticScores(queryTokens, docs) {
const q = new Set(queryTokens);
for (const t of [...q]) for (const s of SYNONYMS[t] ?? []) q.add(s);
return docs.map((d) => {
const toks = new Set(d.tokens);
let overlap = 0;
for (const t of q) if (toks.has(t)) overlap++;
return overlap / (q.size || 1);
});
}List<double> SemanticScores(List<string> queryTokens, List<Doc> corpus)
{
var q = new HashSet<string>(queryTokens);
foreach (var t in q.ToList())
if (synonyms.TryGetValue(t, out var syns))
foreach (var s in syns) q.Add(s);
return corpus.Select(d =>
{
var toks = new HashSet<string>(d.Tokens);
int overlap = q.Count(toks.Contains);
return (double)overlap / Math.Max(q.Count, 1);
}).ToList();
}Each arm produces a ranked list. RRF adds up 1 / (k + rank) across the lists, so a document near the top of either list rises — it fuses by rank, not raw score, giving the two arms an equal vote. We also drop zero-score documents, so an arm that finds nothing contributes nothing.
def rank(docs, scores):
"""Best-first doc names, dropping zero-score (unmatched) docs so an arm that
finds nothing contributes nothing to RRF. Deterministic: score desc, name asc."""
order = sorted(range(len(docs)), key=lambda i: (-scores[i], docs[i]["name"]))
return [docs[i]["name"] for i in order if scores[i] > 0]
def rrf(rankings):
fused = {}
for ranking in rankings:
for pos, name in enumerate(ranking):
fused[name] = fused.get(name, 0.0) + 1.0 / (RRF_K + pos + 1)
return [name for name, _ in sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))]function rank(docs, scores) {
return docs
.map((d, i) => ({ name: d.name, score: scores[i] }))
.filter((x) => x.score > 0)
.sort((a, b) => b.score - a.score || cmpOrdinal(a.name, b.name))
.map((x) => x.name);
}
function rrf(rankings) {
const fused = new Map();
for (const ranking of rankings) {
ranking.forEach((name, pos) => fused.set(name, (fused.get(name) ?? 0) + 1 / (RRF_K + pos + 1)));
}
return [...fused.entries()]
.sort((a, b) => b[1] - a[1] || cmpOrdinal(a[0], b[0]))
.map(([name]) => name);
}static List<string> Rank(List<Doc> corpus, List<double> scores) =>
corpus.Select((d, i) => (d.Name, Score: scores[i]))
.Where(x => x.Score > 0)
.OrderByDescending(x => x.Score)
.ThenBy(x => x.Name, StringComparer.Ordinal)
.Select(x => x.Name)
.ToList();
static List<string> Rrf(IEnumerable<List<string>> rankings)
{
var fused = new Dictionary<string, double>();
foreach (var ranking in rankings)
for (int pos = 0; pos < ranking.Count; pos++)
fused[ranking[pos]] = fused.GetValueOrDefault(ranking[pos]) + 1.0 / (RrfK + pos + 1);
return fused.OrderByDescending(kv => kv.Value)
.ThenBy(kv => kv.Key, StringComparer.Ordinal)
.Select(kv => kv.Key)
.ToList();
}Now run it — this executes the exact file you just read through, over the corpus from Step 1. No flags, no editing. It prints the BM25, semantic and fused (RRF) rankings for two contrasting queries, then exits — identical output in all three languages.
$ ./run -l 3$ ./run -l 3 --lang node$ ./run -l 3 --lang csharpLook at the two queries. error E_4096 is an exact code: BM25 nails error_codes.md, and the semantic arm finds it too. broken gadget is a paraphrase whose words appear in no document: BM25 returns nothing, yet the semantic arm maps broken → fail/dead and recovers power_issues.md — and the hybrid keeps that hit. That single example is the whole reason to fuse.
Query: "error E_4096"
BM25 (lexical): ['error_codes.md']
Semantic (stand): ['error_codes.md']
Hybrid (RRF): ['error_codes.md']
Query: "broken gadget"
BM25 (lexical): []
Semantic (stand): ['power_issues.md']
Hybrid (RRF): ['power_issues.md']Want to experiment? This is the one place you change something: edit the query list near the bottom of python/hybrid_demo.py (or add a file to the corpus in data/), then re-run ./run -l 3. Start with the exact code — BM25 pins the matching document:
error E_4096Now one whose words appear in no document — BM25 returns nothing, the semantic arm recovers it, and the hybrid agrees:
broken gadgetAn offline test pins the lesson's claims: BM25 nails the exact term, BM25 finds nothing for the paraphrase, and the hybrid recovers it. No network, no model.
$ ./run -l 3 testFor a richer, hands-on feel, launch an interactive search over a bundled 5-chapter story (The Voyage of Caretta the Magnificent). Type a query and watch all three rankings update live. The web app reuses the same hybrid() you just built — no retrieval logic is duplicated.
$ ./run -l 3 webAn exact name BM25 loves:
Nuevo Edén…and a paraphrase the semantic arm recovers:
who found the new planetThis is a teaching demo. For real systems:
Replace the synonym stand-in with a real embedding model for the semantic arm. Add a cross-encoder reranker over the fused top-k — usually the single biggest quality win. Rewrite vague queries before retrieval, and filter by metadata (product, version, date). And measure it: a golden-question set tells you whether hybrid actually beats either arm on your data.
BM25 wins exact terms; a semantic arm wins paraphrases; RRF fuses them so you keep whichever arm found the answer. You built all three by hand — the same algorithm in Python, Node.js and C#, with byte-identical results — and saw a query where keyword search returns nothing yet the hybrid still answers.
Next: Lesson 4 · RAG safety & prompt injection.
#step-N.