The repo was about 11,000 files. The agent had a bash tool and a vague instruction: "find where we validate webhook signatures." It ran grep -rn "signature" ., got 600 hits across vendored JS, test fixtures, a CHANGELOG, and three unrelated crypto helpers, then confidently pointed at the wrong file. We watched it do this four times in a row with different phrasings before someone muttered the obvious thing: grep matches strings, and the agent was asking a question about meaning.
That's the whole problem in one sentence. grep and rg are excellent at "where does this exact token appear." They are useless at "where is the thing that does X," because the code that does X almost never contains the words you'd use to describe X. The function is called verify_hmac. You searched for "webhook signature." Zero overlap, and the one tool the agent reaches for first returns nothing useful.
The fix is semantic search: embed the code, embed the query, find the chunks whose meaning is closest. The catch is that most "just use embeddings" advice assumes you'll ship your source to a cloud API and rent a GPU. For a private monorepo, neither is acceptable — legal won't sign off on uploading proprietary code to a third party, and you don't want a per-token bill that scales with every reindex.
So this is the setup we actually run: Octocode, indexing a large repo for semantic search entirely locally — no GPU, no cloud embeddings, nothing leaving the machine. Here's exactly how it works, the config that matters, and the four things that bit us.
Why grep isn't enough, concretely
Before reaching for embeddings, it's worth being precise about why text search fails for agents, because it tells you what the replacement has to do.
- Vocabulary mismatch. The query is intent ("rate limiting"), the code is mechanism (Semaphore, token_bucket, RetryAfter). No shared tokens.
- No ranking. rg gives you every match with equal weight. 600 hits is not an answer; it's a second search problem.
- No structure. A text match in a comment, a test, and a core function all look identical to grep. It can't tell you the third one is the definition.
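The vocabulary mismatch is easy to reproduce. A minimal Python sketch, assuming a hypothetical `tokens` helper that splits on anything non-alphanumeric (so snake_case identifiers break apart) and an invented signature for `verify_hmac`: the query from the opening anecdote and the function that answers it share no tokens.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; underscores and punctuation act as separators."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}

query = "webhook signature"
# Hypothetical signature; only the function name comes from the story above.
code = "def verify_hmac(payload, secret, header): ..."

overlap = tokens(query) & tokens(code)
print(overlap)  # set(): a literal search has nothing to match on
```

Any search that scores on shared tokens starts from zero here, which is the gap embeddings are meant to close.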
What you want instead is: rank by semantic closeness, but don't lose the thing grep is genuinely good at — exact identifier matches. Octocode's answer is to do both and fuse them, which I'll get to. First, how the code gets into a searchable form at all.
How Octocode chunks and indexes code
Standard RAG slices files into fixed-size text windows — every N characters, overlap a bit, embed. For prose that's fine. For code it's actively bad: it splits a function across two chunks, embeds half a match arm, and loses the fact that a method belongs to a type.
Octocode parses each file with tree-sitter first, walks the AST, and chunks along real symbol boundaries — functions, methods, classes, modules — instead of arbitrary byte offsets. The architecture is a parser per language extracting functions, classes, imports and exports, then chunk-based processing for anything too large to embed in one piece. A method pulled from inside an impl block or a class carries its parent type's name in its symbol list, so searching for Suppression.mark_set or Foo.bar hits the method chunk directly.
The language coverage is real tree-sitter grammars, not regex heuristics. From the repo, indexed as code with full AST parsing:
| Language | Extensions |
|---|---|
| Rust | .rs |
| Python | .py |
| TypeScript / JavaScript | .ts, .tsx, .js, .jsx |
| Go | .go |
| PHP | .php |
| C / C++ | .c, .h, .cpp, .hpp, .cc, .cxx, plus C++20 module extensions .cppm, .ixx, .mxx, .ccm, .cxxm |
| Ruby | .rb |
| Java | .java |
| Swift | .swift |
| JSON, Bash, CSS, Lua, Svelte, Markdown | dedicated tree-sitter paths |
Everything else useful for context — yaml, toml, dockerfile, makefile, ini, conf, env, xml, html, sql, rst, and friends — is indexed as text blocks so it's still searchable, just without semantic symbol extraction.
The chunking knobs live under [index]:
[index]
chunk_size = 2000 # max characters per code chunk
chunk_overlap = 100 # overlap between adjacent chunks
quantization = true # RaBitQ vector compression, ~32x, minimal quality loss
require_git = true # only index inside a git repo by default
chunk_size is a ceiling, not a target — a 40-line function smaller than the limit stays whole. The limit only kicks in for genuinely large bodies, and Octocode already indexes large classes in Python, TypeScript, C++, and Ruby method-by-method rather than dumping the whole class into one chunk. The point of all this: when you later search for "the function that handles remote pull setup," the unit that gets embedded is the function, with its name and its parent type attached — not a window that happens to straddle it.
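To make "chunk along symbol boundaries" concrete, here is a rough Python sketch. It uses Python's built-in `ast` module as a stand-in for tree-sitter, so it only handles Python source, and it is not Octocode's implementation: `chunk_by_symbol` and the sample source are made up for illustration, and the `chunk_size` ceiling is left out.

```python
import ast

SOURCE = '''
class Suppression:
    def mark_set(self, key):
        return key

    def clear(self):
        return None

def parse_remote(url):
    return url.split("/")
'''

def chunk_by_symbol(source: str) -> list[dict]:
    """Return one chunk per function or method, tagged with its qualified name."""
    lines = source.splitlines()
    chunks = []

    def visit(node: ast.AST, parents: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Descend into the class, remembering its name for the methods.
                visit(child, parents + [child.name])
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                chunks.append({
                    "symbol": ".".join(parents + [child.name]),
                    "text": "\n".join(lines[child.lineno - 1:child.end_lineno]),
                })

    visit(ast.parse(source), [])
    return chunks

chunks = chunk_by_symbol(SOURCE)
print([c["symbol"] for c in chunks])
# ['Suppression.mark_set', 'Suppression.clear', 'parse_remote']
```

Each chunk is a whole function body, and the method chunks carry the parent class name, which is what lets a query like Suppression.mark_set land on the method directly.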
Local embeddings: no GPU, no cloud, no API key
The part people assume requires a cloud: turning code into vectors. It doesn't.
Octocode ships a local-first stack built on fastembed (ONNX Runtime under the hood). You point the config at a local model with the fastembed: prefix and embeddings run on CPU, on your machine:
[embedding]
code_model = "fastembed:jinaai/jina-embeddings-v2-base-code" # 768-d, purpose-built for code
text_model = "fastembed:nomic-ai/nomic-embed-text-v1.5" # text and docs
A model string is always provider:model. The provider prefix is the whole routing decision — fastembed: and huggingface: are local (no key, no network); voyage:, jina:, google:, openai:, together:, octohub: are cloud and need the corresponding API key in the environment. Mix them if you want — local code embeddings, cloud text embeddings — but for a private repo the all-local path is the point.
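The routing rule is simple enough to sketch. This is an illustrative Python version, not Octocode's code: `parse_model` is a hypothetical helper, the provider sets are copied from the paragraph above, and `voyage:example-model` is a placeholder model name.

```python
LOCAL_PROVIDERS = {"fastembed", "huggingface"}
CLOUD_PROVIDERS = {"voyage", "jina", "google", "openai", "together", "octohub"}

def parse_model(spec: str) -> tuple[str, str, bool]:
    """Split 'provider:model' on the first colon; report whether it runs locally."""
    provider, sep, model = spec.partition(":")
    if not sep or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    if provider not in LOCAL_PROVIDERS | CLOUD_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r}")
    return provider, model, provider in LOCAL_PROVIDERS

print(parse_model("fastembed:jinaai/jina-embeddings-v2-base-code"))
# ('fastembed', 'jinaai/jina-embeddings-v2-base-code', True)
print(parse_model("voyage:example-model")[2])  # False: cloud, needs an API key
```

Splitting on the first colon only matters because model names can contain slashes and further punctuation; everything after the prefix is passed through untouched.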
There's no GPU requirement and no GPU code path here: this is CPU ONNX inference. On a modern multi-core machine with AVX2, that's fast enough that the embedder is rarely the bottleneck — disk I/O and tree-sitter parsing usually are. The repo's own performance notes put local FastEmbed indexing around 45 seconds for 1,000 files as a rough order of magnitude; your mileage varies with file size and core count, so treat that as illustrative, not a promise.
Two honest caveats:
- First run downloads the model. The first octocode index after pointing at a fastembed: model pulls the ONNX weights (a few hundred MB) into a cache directory (~/.local/share/octocode/fastembed/ on Linux/macOS). One-time cost, then it's offline forever.
- Local providers are feature-gated and platform-sensitive. FastEmbed and HuggingFace require the fastembed / huggingface build features. Default builds ship them where they're well-supported; on platforms where they aren't, you fall back to a cloud provider. Check octocode models list fastembed to see what your binary actually exposes. The recent local embedding provider work in 0.17.0 is what made this listing honest about what's available where.
If you've never set up semantic search over code before, the Octocode semantic code search intro covers the conceptual basics; this post assumes you're past that and want the production setup.
Indexing the repo
With the config in place, indexing is one command from the repo root:
cd /path/to/big-monorepo
octocode index
# → Indexed 12,847 blocks across 342 files
What actually happens, in order: discover files (respecting ignore rules), parse each with tree-sitter, extract symbols, chunk along AST boundaries, embed each chunk locally, and write everything to a LanceDB columnar store. The database does not live in your repo — it's keyed per-project under the system data dir (~/.local/share/octocode/<project-id>/storage on Linux/macOS). That keeps your working tree clean and means a git clean -fdx won't nuke your index.
Useful flags:
octocode index --force # ignore the incremental cache, rebuild from scratch
octocode index --verbose # see per-file progress
octocode stats # how many code/text/doc blocks, graph nodes, index staleness
octocode stats is the one to run when something feels off — it tells you whether your HEAD matches the last indexed commit, so you can see at a glance whether the index is stale.
Incremental vs. full reindex
Subsequent octocode index runs are incremental — only changed files get re-parsed and re-embedded. Each indexing run ends with a table optimization pass so the query path stays fast as the database grows (repeated incremental indexing used to leave an unindexed tail that searches had to brute-force scan; that's handled automatically now).
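This post doesn't spell out how Octocode decides that a file has changed, so treat the following as a generic illustration of incremental indexing rather than its actual mechanism. The sketch assumes a stored map of path to content hash from the previous run; `changed_files` and the sample paths are hypothetical.

```python
import hashlib

def changed_files(previous: dict[str, str], current: dict[str, bytes]) -> list[str]:
    """Paths whose content hash differs from the last run, plus brand-new paths."""
    stale = []
    for path, content in current.items():
        digest = hashlib.sha256(content).hexdigest()
        if previous.get(path) != digest:
            stale.append(path)
    return sorted(stale)

last_run = {
    "src/auth.py": hashlib.sha256(b"def login(): ...").hexdigest(),
    "src/db.py": hashlib.sha256(b"def connect(): ...").hexdigest(),
}
working_tree = {
    "src/auth.py": b"def login(): ...",        # untouched
    "src/db.py": b"def connect(pool): ...",    # edited
    "src/cache.py": b"def get(key): ...",      # new
}
print(changed_files(last_run, working_tree))  # ['src/cache.py', 'src/db.py']
```

Only the two stale paths would need re-parsing and re-embedding. A real indexer also has to drop chunks for deleted files, which this sketch ignores.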
When you change how things are indexed — a new chunking rule, a different embedding model — incremental won't retroactively re-chunk files it already has. Switching the embedding model is the big one: vectors from one model are not comparable with another's. After changing code_model or text_model, do a clean rebuild:
octocode clear # drop the indexed tables
octocode index # rebuild with the new model
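Why a clean rebuild is non-negotiable: a toy Python illustration, assuming two made-up three-dimensional "models" that encode the same two features on different axes. Real embedding models differ in far messier ways, but the failure has the same shape.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Two toy "models": same information, different coordinate systems.
def model_a(auth: float, network: float) -> list[float]:
    return [auth, network, 0.0]

def model_b(auth: float, network: float) -> list[float]:
    return [0.0, auth, network]

indexed_auth = model_a(1.0, 0.0)   # auth chunk, embedded before the switch
indexed_net = model_a(0.0, 1.0)    # networking chunk, embedded before the switch
query_auth = model_b(1.0, 0.0)     # auth query, embedded after the switch

print(cosine(query_auth, indexed_auth))  # 0.0
print(cosine(query_auth, indexed_net))   # 1.0
```

The auth query scores zero against the auth chunk and perfectly against the networking chunk. Nothing errors; the ranking is just wrong, which is why the stale vectors have to go.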
Watch mode: keep the index live
A one-shot index goes stale the moment someone commits. For an agent that's supposed to answer questions about the current state of the repo, run the watcher:
octocode watch # auto-reindex on file changes
octocode watch --debounce 5 # wait 5s after the last change before reindexing
octocode watch --quiet
The watcher debounces (configurable, 1–30 seconds) so a git checkout that touches 400 files triggers one reindex, not 400. It honors the same ignore rules as a full index, so editing a node_modules file doesn't wake it up.
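The debounce logic can be modeled as a pure function over event timestamps. This is a sketch of the general technique, not the watcher's code: `debounced_runs` is a hypothetical name, and a real watcher would use timers rather than a sorted list.

```python
def debounced_runs(event_times: list[float], debounce: float) -> list[float]:
    """Times at which a reindex fires: `debounce` seconds after each burst goes quiet."""
    runs = []
    events = sorted(event_times)
    for i, t in enumerate(events):
        is_last = i == len(events) - 1
        # A run fires only if no further event arrives inside the quiet window.
        if is_last or events[i + 1] - t >= debounce:
            runs.append(t + debounce)
    return runs

checkout = [i * 0.01 for i in range(400)]   # 400 file events in about 4 seconds
print(len(debounced_runs(checkout, debounce=5)))           # 1
print(debounced_runs(checkout + [60.0], debounce=5)[-1])   # 65.0
```

The 400-event burst collapses into a single reindex, while an edit a minute later gets its own run.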
If you're running the MCP server (next section), there's an even cleaner option: set mcp_index = true under [index] and the MCP server indexes and watches in-process — one running thing instead of two. The default is false, which serves an existing index read-only and assumes you index separately; flip it on when you want the server to own the whole lifecycle.
Hybrid search: semantic and exact, fused
Here's the part that fixes the original grep failure without throwing away grep's one strength.
Pure vector search loses on identifier-heavy queries. Search for parse_remote and dense embeddings will happily return things that are semantically near parse_remote while missing the function literally named parse_remote. BM25 keyword search has the opposite failure: it nails exact identifiers and whiffs on paraphrased intent like "function that handles remote pull setup."
Octocode runs both on every query and fuses them with Weighted Reciprocal Rank Fusion, inside LanceDB. The weights are exposed:
[search.hybrid]
enabled = true
default_vector_weight = 0.7
default_keyword_weight = 0.3
For a code-heavy repo where exact identifiers carry a lot of signal, tilt toward keyword. The repo's own retrieval benchmark (127 code-search queries with line-range ground truth, run fully local with jina-embeddings-v2-base-code and no reranker) shows the effect clearly: moving from the default 0.7/0.3 split to a keyword-tuned 0.3/0.7 lifted Hit@5 from 0.598 to 0.732 and Recall@10 from 0.671 to 0.807, at zero added cost. That's a relative gain of about 22% in Hit@5 and about 20% in Recall@10 from changing two weights. For long-form docs, tilt the other way (0.8/0.2) since intent dominates there.
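For intuition on what the weights do, here is a minimal Python sketch of weighted Reciprocal Rank Fusion. The formula (each list contributes weight / (k + rank)) is the textbook one, and k = 60 is the conventional constant. Whether LanceDB uses exactly this form and constant is an assumption, and the function and result names are invented.

```python
def weighted_rrf(rankings: dict[str, list[str]], weights: dict[str, float], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for name, ranked in rankings.items():
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + weights[name] / (k + rank)
    # Highest score first; ties broken by name so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

rankings = {
    "vector": ["fetch_remote", "sync_origin", "pull_setup"],  # semantically near
    "keyword": ["parse_remote", "parse_local"],               # exact identifier hit
}

print(weighted_rrf(rankings, {"vector": 0.7, "keyword": 0.3})[0])  # fetch_remote
print(weighted_rrf(rankings, {"vector": 0.3, "keyword": 0.7})[0])  # parse_remote
```

In this toy case the exact-identifier hit is missing from the vector list, so the weights alone decide whether it reaches the top. Because RRF works on ranks rather than raw scores, the two searches never need comparable score scales.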
On top of fusion, an optional reranker reorders the top candidates:
[search.reranker]
enabled = true
model = "fastembed:jina-reranker-v2-base-multilingual" # local reranker, no key
top_k_candidates = 50 # fetch this many from vector search
final_top_k = 10 # return this many after reranking
The reranker runs locally too if you point it at a fastembed: model. It's a cross-encoder pass over the top 50 candidates, which catches cases where the first-stage ranking got the order wrong. Cost is real but bounded — it only sees top_k_candidates, not the whole index.
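The two-stage shape is worth seeing in miniature. In this sketch `toy_score` is a deliberately crude stand-in for a cross-encoder (it just counts query words found in the candidate), and `rerank` is a hypothetical helper whose parameter names mirror the config keys above.

```python
def rerank(query: str, candidates: list[str], score,
           top_k_candidates: int = 50, final_top_k: int = 10) -> list[str]:
    """Rescore only the first `top_k_candidates`, then keep the best `final_top_k`."""
    shortlist = candidates[:top_k_candidates]
    rescored = sorted(shortlist, key=lambda doc: score(query, doc), reverse=True)
    return rescored[:final_top_k]

def toy_score(query: str, doc: str) -> int:
    """Stand-in for a cross-encoder: counts query words that appear in the doc."""
    return sum(word in doc for word in query.split())

first_stage = ["retry helper", "verify webhook hmac", "webhook router", "hmac utils"]
print(rerank("verify webhook", first_stage, toy_score, top_k_candidates=3, final_top_k=2))
# ['verify webhook hmac', 'webhook router']
```

The fourth candidate is never scored at all. That's the cost bound: the expensive model runs at most `top_k_candidates` times per query, however large the index gets.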
From the CLI, the search surface is straightforward:
octocode search "webhook signature verification"
octocode search "auth" "middleware" "session" # multi-query, up to 10
octocode search "database connection pool" --mode code # restrict to code blocks
octocode search "auth" --detail-level signatures # signatures only, compact
octocode search "authentication refactor" --mode commits # search git history
The --mode commits path is worth knowing about: commit history is indexed lazily on first commit-search (not during octocode index), so you can ask "when did we change the auth flow" and get semantically-ranked commits, not just a git log --grep.
Structural search: when you need AST patterns, not meaning
Sometimes you don't want semantics at all — you want "every .unwrap() in Rust" or "every new Foo() in JS." That's a structural query, and text search gets it wrong constantly (a comment mentioning .unwrap() is not a call). Octocode wraps ast-grep:
octocode grep '$FUNC.unwrap()' --lang rust
octocode grep 'new $CLASS($$$ARGS)' --lang javascript
octocode grep 'console.log($ARG)' --lang javascript --rewrite 'logger.info($ARG)' --update-all
$VAR matches one AST node, $$$ARGS matches a run of arguments, and --rewrite does AST-aware refactoring in place. The genuinely useful bit for agents: LLMs reliably get the node kind wrong — they'll write function_declaration for Python, which is actually function_definition. When a pattern matches nothing, Octocode tries progressively looser interpretations and knows the right kind per language, so the agent's near-miss still returns results instead of a silent zero.
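The comment-versus-call distinction is easy to demonstrate. This Python sketch uses the standard `ast` module in place of ast-grep, so it only illustrates the idea on Python source. The sample code is invented, and `print(` stands in for `.unwrap()`.

```python
import ast

SOURCE = '''
# TODO: remove every print( call before release
def handler(event):
    print(event)
    note = "print(debug) is noisy"
    return event
'''

# Text search: any line containing the substring.
text_hits = [n for n, line in enumerate(SOURCE.splitlines(), start=1) if "print(" in line]

# Structural search: only real call nodes whose callee is the name `print`.
call_hits = [
    node.lineno
    for node in ast.walk(ast.parse(SOURCE))
    if isinstance(node, ast.Call)
    and isinstance(node.func, ast.Name)
    and node.func.id == "print"
]

print(text_hits)  # [2, 4, 5]
print(call_hits)  # [4]
```

Two of the three text hits are a comment and a string literal. The structural query returns only the actual call, which is the property you need before letting a rewrite loose on the codebase.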
GraphRAG: relationships, not just similarity
Semantic search finds similar code. It can't tell you that auth_middleware.rs imports jwt.rs, calls user_store.rs, and is wired into router.rs. That's a graph question, and Octocode builds the graph during indexing.
[graphrag]
enabled = true
use_llm = true
It extracts imports, calls, extends, and implements edges across nine languages from the AST, and with use_llm = true adds higher-level architectural relationships (configures, uses, factory/observer/strategy patterns) discovered by an LLM. From the CLI:
octocode graphrag get-relationships --node-id "src/auth/middleware.rs"
octocode graphrag find-path --source-id "src/auth/mod.rs" --target-id "src/database/mod.rs"
octocode graphrag overview
This is the difference between "show me code that looks like auth" and "show me everything that depends on the auth module." For an agent doing a refactor, the second question is the one that prevents breakage.
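Both questions reduce to graph traversal. A small Python sketch, using the file names from the example above as a hand-written edge list: `find_path` and `dependents` are illustrative helpers, not the graphrag subcommands, and the real graph also carries typed edges (imports, calls, extends, implements) that this ignores.

```python
from collections import deque
from typing import Optional

# Edges read "source depends on target".
EDGES = {
    "router.rs": ["auth_middleware.rs"],
    "auth_middleware.rs": ["jwt.rs", "user_store.rs"],
    "jwt.rs": [],
    "user_store.rs": [],
}

def find_path(graph: dict[str, list[str]], source: str, target: str) -> Optional[list[str]]:
    """Shortest dependency path from source to target, via breadth-first search."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def dependents(graph: dict[str, list[str]], target: str) -> set[str]:
    """Every node that reaches `target` directly or transitively."""
    return {node for node in graph if node != target and find_path(graph, node, target)}

print(find_path(EDGES, "router.rs", "jwt.rs"))
# ['router.rs', 'auth_middleware.rs', 'jwt.rs']
print(sorted(dependents(EDGES, "jwt.rs")))
# ['auth_middleware.rs', 'router.rs']
```

The second call is the refactor question: touch jwt.rs and both the middleware and the router are in the blast radius, even though router.rs never mentions JWT.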
There's a newer, optional twist: graph-aware retrieval expansion. With [search] graph_expansion = true (and GraphRAG enabled), search pulls in code blocks from files that are structurally related to your top hits before reranking — so a query that lands on the auth middleware can surface the JWT helper it calls even if that helper didn't match the query text. It's off by default and the code comments are blunt about it: A/B it on your own eval before trusting it, because expansion can add noise as easily as signal.
Wiring it to an agent over MCP
All of the above is exposed to AI assistants through a built-in MCP server. This is the primary way to use Octocode — the agent gets tools, not a shell.
octocode mcp --path /path/to/your/project
Or in a client config (Claude Desktop, Cursor, Windsurf, Claude Code):
{
"mcpServers": {
"octocode": {
"command": "octocode",
"args": ["mcp", "--path", "/path/to/your/project"]
}
}
}
The tools the agent sees, verified against the server:
| MCP tool | What it does |
|---|---|
| semantic_search | Hybrid semantic + keyword search, multi-query, all modes including commits |
| view_signatures | File structure (signatures, classes, imports) by glob, without reading whole files |
| structural_search | AST pattern matching via ast-grep, with the kind-fallback recovery |
| graphrag | Relationship queries: search, get-node, get-relationships, find-path, overview |
view_signatures is the unsung one. Instead of an agent burning context reading three 800-line files to find a function signature, it asks for signatures by glob and gets the skeleton. (As of recent versions you can pass a single glob string, not just an array — a small thing that stopped a lot of malformed tool calls.)
For a monorepo with many sub-projects, there's a multi-repo mode — octocode mcp --multi --path /workspace scans the immediate subdirectories for git repos and serves all of them from one endpoint, with each tool gaining a project argument to pick the target. One MCP server, every repo in the workspace.
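"Scans the immediate subdirectories for git repos" amounts to a one-level directory walk. A Python sketch under that reading: `discover_repos` is a hypothetical helper, the directory names are invented, and the real server may apply extra rules that this post doesn't cover.

```python
import tempfile
from pathlib import Path

def discover_repos(workspace: Path) -> list[str]:
    """Names of immediate subdirectories that contain a .git entry."""
    return sorted(
        child.name
        for child in workspace.iterdir()
        if child.is_dir() and (child / ".git").exists()
    )

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "api" / ".git").mkdir(parents=True)
    (root / "web" / ".git").mkdir(parents=True)
    (root / "notes").mkdir()                                        # no .git: skipped
    (root / "api" / "vendor" / "lib" / ".git").mkdir(parents=True)  # nested: not scanned
    found = discover_repos(root)

print(found)  # ['api', 'web']
```

The practical consequence is that a repo nested two levels down won't be picked up, so point --path at the directory that directly contains your checkouts.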
The four things that bit us
Grounded, not hypothetical — these are what cost us time.
1. The index lives outside the repo, and that confused everyone. First time someone ran the agent on a fresh clone, it had no index — the database is per-project under ~/.local/share/octocode/, not in the working tree. Clone ≠ indexed. The fix is a one-liner in onboarding: after cloning, run octocode index once. Obvious in hindsight; not obvious at 2am.
2. Switching embedding models silently degraded results. Someone changed code_model to try a different local model and didn't clear the index. New queries embedded with the new model, old chunks embedded with the old one, vectors not comparable, garbage rankings. Vectors from different models do not live in the same space. Always octocode clear && octocode index after changing an embedding model. There's no warning that will save you here — it just quietly gets worse.
3. The default vector/keyword split was wrong for our repo. We ran with 0.7/0.3 (vector-heavy) for a week and kept missing exact-identifier queries. Code is identifier-dense; the benchmark above shows keyword-tuned weights are dramatically better for code search. Flipping to 0.3/0.7 was the single highest-leverage config change we made.
4. Binary and generated files don't get indexed — and that's correct, but check it. Octocode skips binary files (it checks for null bytes and a printable-character ratio before treating content as text) and respects .gitignore, .git/info/exclude, and .noindex files. That's exactly what you want — you don't want embeddings of minified vendor bundles. But if a directory you care about is gitignored (some teams gitignore generated API clients), it won't be indexed and you'll wonder why search can't find it. Drop a .noindex to exclude extra paths; check your .gitignore if something you expected is missing. octocode stats will show you the block counts so you can sanity-check coverage.
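The binary check in item 4 is a familiar heuristic. Here is a rough Python sketch: the null-byte test and the printable-ratio idea come from the description above, but the 0.85 threshold and the ASCII-only definition of "printable" are assumptions of this example, not Octocode's actual values.

```python
def looks_like_text(data: bytes, min_printable_ratio: float = 0.85) -> bool:
    """Heuristic: no NUL bytes, and most bytes are printable ASCII or whitespace."""
    if not data:
        return True
    if b"\x00" in data:
        return False
    # Printable ASCII plus tab, newline, and carriage return.
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in data)
    return printable / len(data) >= min_printable_ratio

print(looks_like_text(b"fn main() {\n    println!(\"hi\");\n}\n"))  # True
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))      # False
print(looks_like_text(bytes(range(128, 200))))                      # False
```

A byte-level ASCII ratio like this would misjudge files that are mostly non-ASCII UTF-8, so a production check needs to be more forgiving than this sketch.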
A minimal end-to-end setup
Putting it together, this is the whole thing for a private monorepo, all local:
# ~/.local/share/octocode/config.toml (or a project-local override)
[embedding]
code_model = "fastembed:jinaai/jina-embeddings-v2-base-code"
text_model = "fastembed:nomic-ai/nomic-embed-text-v1.5"
[search.hybrid]
enabled = true
default_vector_weight = 0.3 # identifier-heavy code → favor keyword
default_keyword_weight = 0.7
[search.reranker]
enabled = true
model = "fastembed:jina-reranker-v2-base-multilingual"
[graphrag]
enabled = true
use_llm = false # AST-only relationships, no LLM calls, fully offline
cd /path/to/monorepo
octocode index # first run downloads models, then indexes
octocode search "webhook signature verification" # sanity check
octocode watch --quiet & # keep it live
claude mcp add octocode -- octocode mcp --path . # wire to the agent
Nothing in that flow touches the network after the one-time model download. No GPU. No per-token bill. The agent that was grepping for "signature" and pointing at the wrong file now calls semantic_search, gets verify_hmac ranked first, and uses view_signatures to confirm before it touches anything.
FAQ
Do I need a GPU? No. Local embeddings run on CPU via ONNX Runtime. A modern multi-core CPU is fine; the embedder is rarely the bottleneck.
Does any code leave my machine? With fastembed:/huggingface: models and graphrag.use_llm = false, no — indexing, search, and reranking are all local. Cloud embedding providers and the LLM-powered GraphRAG/commit-message features are opt-in and clearly need keys.
Where does the index live, and is it in my repo? Under the system data dir, keyed per project (~/.local/share/octocode/<project-id>/ on Linux/macOS), not in your working tree. A git clean won't touch it; a fresh clone won't have it.
How big does the database get? On the order of 10 KB per file, with RaBitQ quantization giving ~32x vector compression. A 10k-file repo is tens to low-hundreds of MB, not gigabytes.
Incremental or full reindex? Incremental by default — only changed files are re-embedded. Force a full rebuild with octocode index --force, or octocode clear && octocode index after changing the embedding model.
Can it search git history? Yes, octocode search "..." --mode commits. Commits are indexed lazily on first commit-search, so the initial octocode index stays fast.
— Don
Octocode is open source under Apache-2.0. The local-first stack landed in 0.15.0 and the local embedding provider listing in 0.17.x. If your setup hits a gotcha this post didn't cover, open an issue — that's how most of these tips got written down in the first place.



