State-of-the-art retrieval over past conversations — 93.9% R@5 on LoCoMo, 98.4% on LongMemEval. No LLM calls. $0 per query. Your words, stored exactly as you said them.
Evaluated on the two most widely used conversational-memory benchmarks, LoCoMo and LongMemEval. No LLM in the loop — just embeddings, sparse retrieval, and a free cross-encoder reranker.
| System | LoCoMo R@5 | LLM required | Cost / query |
|---|---|---|---|
| Engram | 93.9% | No | $0 |
| EverMemOS | 92.3% | Yes (cloud) | $$ |
| Hindsight | 89.6% | Yes (cloud) | $$ |
| Zep | ~85% | Yes (cloud) | $$ |
| Letta / MemGPT | ~83.2% | Yes (cloud) | $$ |
| SLM V3 (zero-cloud) | 74.8% | No | $0 |
| Supermemory | ~70% | Yes | $$ |
| Mem0 (independent) | ~58% | Yes | $$ |
Dense semantic search catches meaning. Sparse BM25 catches exact words. A cross-encoder reranker scores the finalists. Nothing is summarized.
1. **Dense:** a bge-large bi-encoder (1024-d) finds semantically similar past turns.
2. **Sparse:** BM25 catches exact names, dates, and rare terms that embeddings miss.
3. **Fusion:** Reciprocal Rank Fusion combines both signals without per-query tuning.
4. **Rerank:** a cross-encoder scores the top candidates jointly for the final ranking.
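The fusion step can be sketched in a few lines. This is a minimal illustration of plain Reciprocal Rank Fusion, not Engram's internal code; `k=60` is the constant commonly used in the RRF literature:

```python
from collections import defaultdict

def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """RRF: score(d) = sum over ranked lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked near the top of both lists beats a doc that only
# one retriever liked — no score normalization or tuning needed.
dense = ["d3", "d1", "d7"]
sparse = ["d1", "d9", "d3"]
fused = rrf_fuse(dense, sparse)  # -> ["d1", "d3", "d9", "d7"]
```

Because RRF only uses ranks, the dense cosine scores and BM25 scores never need to be put on a common scale.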
Long sessions dilute embeddings. Chunking at ~6 turns with 1-turn overlap keeps individual facts retrievable.
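The sliding-window idea above can be sketched as follows. `chunk_turns` is a hypothetical helper, not Engram's API; it shows a window of 6 turns advancing 5 turns at a time so that adjacent chunks share 1 turn:

```python
def chunk_turns(turns, size=6, overlap=1):
    """Slide a window of `size` turns, stepping size - overlap each time,
    so every chunk shares `overlap` turns with its predecessor."""
    step = size - overlap
    chunks = []
    for start in range(0, len(turns), step):
        chunks.append(turns[start:start + size])
        if start + size >= len(turns):
            break  # last window already covers the tail
    return chunks

turns = [f"turn {i}" for i in range(16)]
chunks = chunk_turns(turns)
# 16 turns -> 3 chunks starting at turns 0, 5, and 10
```

The overlap matters: a fact stated at a chunk boundary still appears in full context in at least one chunk, so its embedding is not split across two documents.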
Prepending [2024-01-15] to each document lets both dense and BM25 match temporal queries.
First-person turns don't contain the speaker's name, so entity-attribute queries fail. Prepending the speaker's name bridges the gap and lifts LoCoMo R@5 by ~3 points.
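Both augmentations amount to a one-line prefix at ingest time. A sketch under assumed inputs — the speaker name and the `augment` helper here are illustrative, not part of Engram's schema:

```python
def augment(turn_text, speaker, date):
    """Prefix the date and speaker so both dense and BM25 retrieval can
    match temporal and entity-attribute queries against first-person turns."""
    return f"[{date}] {speaker}: {turn_text}"

doc = augment("I'm switching our API from REST to GraphQL.",
              speaker="Alice", date="2024-01-15")
# -> "[2024-01-15] Alice: I'm switching our API from REST to GraphQL."
```

A query like "what did Alice decide in January" now has lexical overlap with the stored document even though the original turn mentioned neither the name nor the date.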
One pip install. Works locally with FAISS + SQLite, or plugs into Qdrant for cloud deployment.
```bash
# Install
$ pip install engram-search

# Initialize a memory store
$ engram init ./my_memories

# Ingest past conversations
$ engram ingest conversations.json --store ./my_memories

# Search
$ engram search "why did we switch to GraphQL" --store ./my_memories
```
```python
from engram.backends.faiss_backend import FaissBackend
from engram.backends.base import Document
from engram.ingestion.parser import session_to_documents
from engram.retrieval.embedder import Embedder
from engram.retrieval.pipeline import RetrievalPipeline

embedder = Embedder("bge-large")
backend = FaissBackend(path="./my_memories", dimension=1024)
pipeline = RetrievalPipeline(embedder=embedder)

turns = [
    {"role": "user", "content": "I'm switching our API from REST to GraphQL."},
    {"role": "assistant", "content": "What's driving the switch?"},
    {"role": "user", "content": "Too many round trips — 12 calls per screen."},
]
docs = session_to_documents(turns, session_id="s1", timestamp="2025-01-15")

results = pipeline.search("why did we switch to GraphQL", documents=docs, top_k=3)
for r in results:
    print(r.text)
```
```bash
# Point Engram at a managed Qdrant cluster
$ export ENGRAM_BACKEND=qdrant
$ export ENGRAM_QDRANT_URL=https://your-cluster.qdrant.io:6333
$ export ENGRAM_QDRANT_API_KEY=your-api-key

# Start the API server
$ pip install fastapi uvicorn
$ uvicorn engram.server:app --host 0.0.0.0 --port 8000

# Endpoints available
# POST /ingest — add conversations
# POST /search — retrieve memories
# GET  /health — health check
# GET  /stats  — store statistics
```
Retrieval only. Deterministic, reproducible, no per-query spend, no prompt drift, no rate limits.
Nothing is summarized or paraphrased on the way in. What you said is what gets returned.
FAISS + SQLite out of the box. Runs entirely on your machine. No API keys needed to get started.
Plug into Qdrant for multi-tenant, horizontally-scalable memory. Same API, same accuracy.
MIT licensed. Reproducible benchmarks. Drop it into your RAG pipeline today.