State-of-the-art retrieval over past conversations — 93.9% R@5 on LoCoMo, 98.4% on LongMemEval. No LLM calls. $0 per query. Your words, stored exactly as you said them.
Evaluated on the two most widely used conversational-memory benchmarks, LoCoMo and LongMemEval. No LLM in the loop — just embeddings, sparse retrieval, and a free cross-encoder reranker.
| System | LoCoMo R@5 | LLM required | Cost / query |
|---|---|---|---|
| Engram | 93.9% | No | $0 |
| EverMemOS | 92.3% | Yes (cloud) | $$ |
| Hindsight | 89.6% | Yes (cloud) | $$ |
| Zep | ~85% | Yes (cloud) | $$ |
| Letta / MemGPT | ~83.2% | Yes (cloud) | $$ |
| SLM V3 (zero-cloud) | 74.8% | No | $0 |
| Supermemory | ~70% | Yes | $$ |
| Mem0 (independent) | ~58% | Yes | $$ |
Dense semantic search catches meaning. Sparse BM25 catches exact words. A cross-encoder reranker scores the finalists. Nothing is summarized.
1. **Dense:** a bge-large bi-encoder (1024-d) finds semantically similar past turns.
2. **Sparse:** BM25 catches exact names, dates, and rare terms that embeddings miss.
3. **Fusion:** Reciprocal Rank Fusion combines both signals without per-query tuning.
4. **Rerank:** a cross-encoder scores the top candidates jointly for the final ranking.
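The fusion step can be sketched in a few lines. This is a minimal illustration of plain Reciprocal Rank Fusion, not Engram's internal code; `k=60` is the constant commonly used in the RRF literature:

```python
from collections import defaultdict

def rrf_fuse(dense_ranking, sparse_ranking, k=60):
    """RRF: score(d) = sum over ranked lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked near the top of both lists beats a doc that only
# one retriever liked — no score normalization or tuning needed.
dense = ["d3", "d1", "d7"]
sparse = ["d1", "d9", "d3"]
fused = rrf_fuse(dense, sparse)  # -> ["d1", "d3", "d9", "d7"]
```

Because RRF only uses ranks, the dense cosine scores and BM25 scores never need to be put on a common scale.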
Long sessions dilute embeddings. Chunking at ~6 turns with 1-turn overlap keeps individual facts retrievable.
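The sliding-window idea above can be sketched as follows. `chunk_turns` is a hypothetical helper, not Engram's API; it shows a window of 6 turns advancing 5 turns at a time so that adjacent chunks share 1 turn:

```python
def chunk_turns(turns, size=6, overlap=1):
    """Slide a window of `size` turns, stepping size - overlap each time,
    so every chunk shares `overlap` turns with its predecessor."""
    step = size - overlap
    chunks = []
    for start in range(0, len(turns), step):
        chunks.append(turns[start:start + size])
        if start + size >= len(turns):
            break  # last window already covers the tail
    return chunks

turns = [f"turn {i}" for i in range(16)]
chunks = chunk_turns(turns)
# 16 turns -> 3 chunks starting at turns 0, 5, and 10
```

The overlap matters: a fact stated at a chunk boundary still appears in full context in at least one chunk, so its embedding is not split across two documents.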
Prepending [2024-01-15] to each document lets both dense and BM25 match temporal queries.
First-person turns don't contain the speaker's name, so entity-attribute queries fail. Prepending the speaker's name bridges the gap and lifts LoCoMo R@5 by ~3 points.
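Both augmentations amount to a one-line prefix at ingest time. A sketch under assumed inputs — the speaker name and the `augment` helper here are illustrative, not part of Engram's schema:

```python
def augment(turn_text, speaker, date):
    """Prefix the date and speaker so both dense and BM25 retrieval can
    match temporal and entity-attribute queries against first-person turns."""
    return f"[{date}] {speaker}: {turn_text}"

doc = augment("I'm switching our API from REST to GraphQL.",
              speaker="Alice", date="2024-01-15")
# -> "[2024-01-15] Alice: I'm switching our API from REST to GraphQL."
```

A query like "what did Alice decide in January" now has lexical overlap with the stored document even though the original turn mentioned neither the name nor the date.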
One pip install. Works locally with FAISS + SQLite, or plugs into Qdrant for cloud deployment.
```bash
# Install
$ pip install engram-search

# Initialize a memory store
$ engram init ./my_memories

# Ingest past conversations
$ engram ingest conversations.json --store ./my_memories

# Search
$ engram search "why did we switch to GraphQL" --store ./my_memories
```
```python
from engram.backends.faiss_backend import FaissBackend
from engram.backends.base import Document
from engram.ingestion.parser import session_to_documents
from engram.retrieval.embedder import Embedder
from engram.retrieval.pipeline import RetrievalPipeline

embedder = Embedder("bge-large")
backend = FaissBackend(path="./my_memories", dimension=1024)
pipeline = RetrievalPipeline(embedder=embedder)

turns = [
    {"role": "user", "content": "I'm switching our API from REST to GraphQL."},
    {"role": "assistant", "content": "What's driving the switch?"},
    {"role": "user", "content": "Too many round trips — 12 calls per screen."},
]
docs = session_to_documents(turns, session_id="s1", timestamp="2025-01-15")

results = pipeline.search("why did we switch to GraphQL", documents=docs, top_k=3)
for r in results:
    print(r.text)
```
```bash
# Point Engram at a managed Qdrant cluster
$ export ENGRAM_BACKEND=qdrant
$ export ENGRAM_QDRANT_URL=https://your-cluster.qdrant.io:6333
$ export ENGRAM_QDRANT_API_KEY=your-api-key

# Start the API server
$ pip install fastapi uvicorn
$ uvicorn engram.server:app --host 0.0.0.0 --port 8000

# Endpoints available
# POST /ingest — add conversations
# POST /search — retrieve memories
# GET  /health — health check
# GET  /stats  — store statistics
```
Retrieval only. Deterministic, reproducible, no per-query spend, no prompt drift, no rate limits.
Nothing is summarized or paraphrased on the way in. What you said is what gets returned.
FAISS + SQLite out of the box. Runs entirely on your machine. No API keys needed to get started.
Plug into Qdrant for multi-tenant, horizontally-scalable memory. Same API, same accuracy.
MIT licensed. Reproducible benchmarks. Drop it into your RAG pipeline today.