Octocode 0.17.1: Local Embeddings with Ollama, vLLM, and Any OpenAI-Compatible Server

0.15.0 made Octocode local-first by default: no API key required, embeddings running on your own machine through fastembed. Clone a repo, run octocode index, get results, nothing leaves your laptop.

fastembed runs the handful of models it bundles, on your CPU, in-process. Great for getting started — less great when you want a bigger or newer embedding model, a GPU, or one server the whole team shares. The local provider was added for exactly that: point Octocode at any OpenAI-compatible embedding server — Ollama, LM Studio, vLLM, llama.cpp, LocalAI, text-embeddings-inference — and index with whatever model it hosts.

0.17.1 is the release where that works end to end. The provider now detects your model's embedding dimension automatically, which is the value indexing needs before it can build its vector store. Pick the model, pick the hardware, and your code still never leaves your network.

This post covers the local provider plus the other changes since 0.16.0 (the June round-up caught you up through there): read-only MCP mode and the AST-precise structural search from 0.16.1.

Bring Your Own Embedding Server

Two lines of config:

[embedding]
code_model = "local:nomic-embed-text"
text_model = "local:nomic-embed-text"

The local: prefix tells Octocode to send embedding requests to a server speaking the OpenAI /v1/embeddings API. By default it points at http://localhost:11434/v1/embeddings — Ollama's port — so if Ollama is already running, there's nothing else to set up:

ollama pull nomic-embed-text
octocode index

Pointing somewhere else is one environment variable:

# vLLM, LM Studio, llama.cpp server, LocalAI, text-embeddings-inference —
# anything that exposes POST /v1/embeddings
export LOCAL_EMBED_API_URL="http://gpu-box.internal:8000/v1/embeddings"

# Optional — sent as `Authorization: Bearer <key>` if your server wants one
export LOCAL_EMBED_API_KEY="..."

That's the whole setup. No SDK, no provider-specific glue.
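
To make the wire format concrete, here is a minimal Python sketch of the request an OpenAI-compatible embeddings endpoint expects. It is an illustration, not Octocode's code: the function name build_embedding_request is made up, and stripping the local: prefix before sending the model name is an assumption about how the prefix is handled. The environment variables and the default URL are the ones described above.

```python
import json
import os

# Default endpoint named in this post (Ollama's port).
DEFAULT_URL = "http://localhost:11434/v1/embeddings"


def build_embedding_request(model, texts, env=None):
    """Return (url, headers, body) for an OpenAI-style POST /v1/embeddings call."""
    env = os.environ if env is None else env
    # Assumption: the "local:" prefix is config-side routing only, so the
    # server is sent the bare model name.
    name = model[len("local:"):] if model.startswith("local:") else model
    url = env.get("LOCAL_EMBED_API_URL", DEFAULT_URL)
    headers = {"Content-Type": "application/json"}
    key = env.get("LOCAL_EMBED_API_KEY")
    if key:
        # Optional bearer token, only sent when a key is configured.
        headers["Authorization"] = f"Bearer {key}"
    body = json.dumps({"model": name, "input": list(texts)})
    return url, headers, body
```

Calling build_embedding_request("local:nomic-embed-text", ["fn main() {}"]) with neither variable set targets the Ollama default and sends no Authorization header.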

No dimension to configure

Different embedding models produce vectors of different widths — 384, 768, 1024 — and the vector store has to know that width up front to build its index. An OpenAI-compatible server doesn't advertise it anywhere, so Octocode figures it out for you: the first time it builds a provider for your model, it sends one small probe request, reads the vector width straight from the response, and caches it for the rest of the run. The index is then built around that dimension.

The practical result: any model your server can serve just works. There's nothing per-model to look up or set. Swap nomic-embed-text for a larger code model and Octocode adapts to the new width on the next index — the only requirement is a re-index, since vectors from different models aren't comparable.
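
The probe-and-cache idea is easy to sketch. The Python below is a rough illustration under stated assumptions, not Octocode's implementation: detect_dimension, post_json, the cache key, and the probe text "probe" are all hypothetical. The only thing taken as given is the OpenAI-style response shape, where vectors arrive under data[i].embedding.

```python
import json
import urllib.request

# (url, model) -> vector width, kept for the life of the process.
_dimension_cache = {}


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def detect_dimension(url, model, send=post_json):
    """Probe the server once per (url, model) and return the embedding width."""
    cache_key = (url, model)
    if cache_key not in _dimension_cache:
        # One tiny request; the probe text itself is arbitrary.
        reply = send(url, {"model": model, "input": ["probe"]})
        # The width is simply the length of the first returned vector.
        _dimension_cache[cache_key] = len(reply["data"][0]["embedding"])
    return _dimension_cache[cache_key]
```

The send parameter exists only so the sketch can be exercised without a live server; a second call for the same URL and model is answered from the cache and sends nothing.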

`fastembed` vs. `local` — which one?

They're both local. The difference is who runs the model.

| | `fastembed:` | `local:` |
|---|---|---|
| Runs | In-process (ONNX, on your CPU) | A separate server you manage |
| Models | The curated set Octocode bundles | Anything your server can host |
| Hardware | This machine's CPU | Wherever the server lives (a GPU, a shared box) |
| Setup | Zero (models download on first use) | You run Ollama/vLLM/etc. |

Reach for fastembed when you want zero-setup and a solid default. Reach for local when you've outgrown it:

A model fastembed doesn't bundle — a newer code embedding model, a larger one, something domain-tuned. If your server can serve it, Octocode can index with it.
Real hardware. Embedding a large monorepo on a laptop CPU is slow. Point LOCAL_EMBED_API_URL at a GPU server and the embedding runs there instead of on your laptop.
One server for the team. Run a single Ollama or vLLM instance; every developer's Octocode points at it. Consistent vectors, one place to upgrade the model.

And the property that doesn't change between them: it's self-hosted, private code search — your code stays inside your network. Same guarantee 0.15.0 shipped, now without being boxed into the bundled models or your laptop's CPU.

Read-Only MCP Mode

By default, when an AI agent connects to Octocode's MCP server, the server now serves search over your existing index and never modifies it. No background indexer, no file watcher, no surprise re-index kicked off the moment your assistant connects.

This is the mcp_index toggle, and false is the default:

[index]
# false (default): MCP serves search, view_signatures, and structural_search
#   over the EXISTING index, read-only. No background indexing or file watcher.
# true: MCP keeps the index fresh in-process with a background indexer + watcher.
mcp_index = false

Why default to read-only? Indexing and serving are different jobs. You index in CI, on a git hook, or by hand — deliberately, when the code changes. The MCP server's job is to answer the agent's questions fast and predictably, not to spin up a file watcher and re-embed your repo because an editor touched a file. Read-only means lower idle resource use, no contention, and no unexpected work the moment a tool connects.

The index CLI command is unaffected — octocode index indexes, same as always. mcp_index only gates the in-process indexer the MCP server would otherwise run.

There's a small touch for read-only mode too: if a semantic search comes back empty (because the index is stale or hasn't been built yet), Octocode appends a one-line nudge steering the agent toward structural_search and view_signatures instead of letting it retry semantic search into the void.
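
As a rough picture of that nudge, here is a small Python sketch. Everything in it is hypothetical: the function name, the result fields, and the hint wording are invented for illustration, and the real message Octocode emits may read differently.

```python
# Hypothetical wording; the actual hint text is not quoted in this post.
HINT = (
    "No semantic matches; the index may be stale or missing. "
    "Try structural_search or view_signatures."
)


def render_search_results(results, read_only=True):
    """Format hits as text, appending a one-line hint when a read-only search is empty."""
    lines = [f"{hit['path']}:{hit['line']}  {hit['snippet']}" for hit in results]
    if not results and read_only:
        # Steer the agent toward tools that do not depend on fresh embeddings.
        lines.append(HINT)
    return "\n".join(lines)
```

The point of the design is that an empty result carries a next step, so the agent changes tools instead of repeating the same query.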

Want the old always-fresh behavior — a long-running server that keeps itself current? Set mcp_index = true and the background indexer and watcher come back.

Structural Search, Now AST-Precise (0.16.1)

0.16.1 didn't get its own post, so it's worth calling out: Octocode's MCP server gained a proper structural search tool, separate from semantic search and grep.

Semantic search is great for "where does payment retry live?" Grep is great for literal strings. Structural search is for the questions in between — find this code shape, regardless of formatting or naming — and for symbols:

Symbol definitions and references with wildcard support — "where is handle_* defined, and who calls it?"
Smart and relaxed matching strategies. Like the structural grep improvements in 0.15.0, it recovers when the requested node kind is wrong for the language instead of returning nothing.
Metavariable constraints — narrow matches with regex and relational filters, not just raw patterns.
Breadcrumb tracking so each match tells you which class/function/module it's nested in, across languages (Swift included).
Pagination and request caching so an agent can page through large result sets without re-running the whole scan.

Under the hood it walks the file tree in parallel and respects .gitignore, with literal-token prefiltering and a lexical fallback when the AST path comes up empty, so it stays fast on big repos and doesn't silently miss matches. For an AI agent, that's a precise, navigable way to find exact code structures and trace symbols.
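
To show what matching on structure rather than text means, here is a toy Python sketch built on the standard library's ast module. It is not how Octocode's structural_search works (that tool covers many languages and far more than definitions); it only illustrates wildcard symbol lookup with breadcrumbs. The names find_definitions and SAMPLE are made up for the example.

```python
import ast
from fnmatch import fnmatchcase

SAMPLE = '''
class Billing:
    def handle_retry(self):
        pass

    def   handle_refund( self ):
        pass

def handler():
    pass
'''


def find_definitions(source, pattern):
    """Return (breadcrumb, line) for each function or class whose name matches pattern."""
    matches = []

    def visit(node, trail):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                here = trail + [child.name]
                if fnmatchcase(child.name, pattern):
                    # The breadcrumb records which enclosing definitions contain the match.
                    matches.append((" > ".join(here), child.lineno))
                visit(child, here)
            else:
                visit(child, trail)

    visit(ast.parse(source), [])
    return matches
```

Running find_definitions(SAMPLE, "handle_*") finds both methods under Billing despite the odd spacing in the second one, and skips handler, because the match is made against parsed names instead of raw text.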

Everything Else

view_signatures takes a single file as a string. Agents (and humans) can pass one file as a plain string instead of wrapping it in a single-element array. Less friction for the most common case.
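
The usual way to accept both shapes is to normalize the argument at the edge. A minimal Python sketch, where the function and parameter names are hypothetical rather than taken from Octocode:

```python
def normalize_files(files):
    """Accept one path as a string or several as a list; always return a list."""
    if isinstance(files, str):
        # A bare string is one file, not an iterable of characters.
        return [files]
    return list(files)
```

Both normalize_files("src/main.rs") and normalize_files(["src/main.rs"]) yield the same one-element list, so the rest of the tool only ever handles lists.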

Tool definitions are no longer hardcoded. The MCP server's tool list is generated rather than maintained by hand, which keeps the schema and the implementation from drifting apart.

Dependency and CI cleanup. Rust dependencies bumped, the release pipeline migrated to shared reusable workflows. No behavior change — just a tidier, more reliable build.

Upgrade

# Homebrew
brew upgrade muvon/tap/octocode

# Universal installer
curl -fsSL https://raw.githubusercontent.com/Muvon/octocode/master/install.sh | sh

# Cargo
cargo install octocode --version 0.17.1

If you're happy with fastembed, there's nothing to do — your config and index keep working. To switch to your own embedding server:

Start a server that speaks the OpenAI embeddings API (e.g. ollama serve) and load a model.
Set code_model / text_model to local:<model> in config.toml. Point LOCAL_EMBED_API_URL at the server if it isn't on Ollama's default port.
Re-index — the vectors change with the model: octocode clear && octocode index.

Want your MCP server to keep indexing in the background like older versions did? Set mcp_index = true. Otherwise the read-only default takes over — index from the CLI or CI, and let the server serve.

Octocode is open source (Apache 2.0) at github.com/Muvon/octocode. It's the code-search engine behind Octomind — and now you can run every embedding it generates on hardware you control, with whatever model you choose.
