Every time Octocode runs a semantic search, every time Octomind routes a prompt to the right model, every time one of our products talks to an LLM — it's going through the same piece of code. We extracted it from Octocode last year. Now it runs everything.

It's called Octolib. And we just open-sourced it.

From internal glue to shared infrastructure

Octolib started as the messy part of Octocode nobody wanted to think about.

Octocode needs embeddings to understand code semantically. LLMs to answer questions about your codebase. Reranking to surface the results that actually matter. Early on, that meant three different client libraries, three different error handling patterns, and three different ways of parsing token usage. (I still remember adding Cohere embeddings and having to learn their entire SDK just to get a vector back.)

Adding a new provider meant touching code in six places. We did what most teams do: we wrote a thin wrapper.

Then the wrapper got features. Then it got providers. Then we realized Octomind — our AI agent platform — needed the exact same thing. So did Octobrain.

Copy-pasting the wrapper three times felt stupid. Extracting it felt obvious.

Octolib is that extraction. A self-sufficient Rust library for LLM inference, embeddings, and reranking across every major provider.

One string, any backend

The pitch is simple. One provider:model string. One API. Any backend.

let provider = ProviderFactory::get_provider_for_model("openai:gpt-5.5")?;
let response = provider.chat_completion(params).await?;

Swap openai:gpt-5.5 for anthropic:claude-opus-4-5-20251101, deepseek:deepseek-chat, nvidia:meta/llama-3.3-70b-instruct, or openrouter:anthropic/claude-opus-4.5. The code doesn't change. The response type doesn't change. Token usage and cost tracking still work.

Here's the trait that makes that possible:

#[async_trait]
pub trait AiProvider {
    /// Short identifier, e.g. "openai" or "anthropic".
    fn name(&self) -> &str;
    /// Whether this provider can serve the given model string.
    fn supports_model(&self, model: &str) -> bool;
    /// Send a chat request and return the normalized response.
    async fn chat_completion(&self, params: ChatCompletionParams) -> Result<ProviderResponse>;
    /// Per-model pricing used for cost tracking, if known.
    fn get_model_pricing(&self, model: &str) -> Option<Pricing>;
}

Every provider implements the same interface. OpenAI has its own native implementation. NVIDIA NIM, Cerebras, Together, Ollama, and local endpoints all go through a shared OpenAI-compatible layer. The factory handles model string parsing and routes to the right backend.
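What does that routing look like? Here's a rough sketch, not Octolib's actual source; OpenAiProvider, OpenAiCompatibleProvider, and the error handling are stand-ins:

// Illustrative sketch of "provider:model" routing, not copied from Octolib.
pub struct ProviderFactory;

impl ProviderFactory {
    pub fn get_provider_for_model(spec: &str) -> anyhow::Result<Box<dyn AiProvider>> {
        // Split "openai:gpt-5.5" into ("openai", "gpt-5.5").
        let (provider, model) = spec
            .split_once(':')
            .ok_or_else(|| anyhow::anyhow!("expected 'provider:model', got '{spec}'"))?;

        match provider {
            // Providers with native implementations get their own client.
            "openai" => Ok(Box::new(OpenAiProvider::for_model(model)?)),
            // OpenAI-compatible backends share one implementation,
            // parameterized by base URL.
            "ollama" => Ok(Box::new(OpenAiCompatibleProvider::new(
                "http://localhost:11434/v1",
                model,
            )?)),
            other => Err(anyhow::anyhow!("unknown provider '{other}'")),
        }
    }
}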

Boring by design. That's the point.

LLM providers

Provider                  Type
OpenAI                    Native
Anthropic                 Native
Google (Gemini)           Native
DeepSeek                  Native
Moonshot                  Native
MiniMax                   Native
Z.ai                      Native
OpenRouter                OpenAI-compatible
NVIDIA NIM                OpenAI-compatible
Cerebras                  OpenAI-compatible
Together                  OpenAI-compatible
Cloudflare Workers AI     OpenAI-compatible
OctoHub                   OpenAI-compatible
Ollama                    OpenAI-compatible
Local endpoints           OpenAI-compatible

Embeddings & reranking

Jina, Voyage, Cohere, OpenAI, Google, and FastEmbed (local).

New providers land as pull requests. If you need one that's missing, the template is in src/llm/providers/openai.rs.
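Adding one boils down to implementing the trait from earlier. A minimal sketch, with made-up names rather than Octolib's internals:

// Minimal shape of a new provider, purely illustrative.
pub struct MyProvider {
    api_key: String,
}

#[async_trait]
impl AiProvider for MyProvider {
    fn name(&self) -> &str {
        "myprovider"
    }

    fn supports_model(&self, model: &str) -> bool {
        model.starts_with("my-model-")
    }

    async fn chat_completion(&self, params: ChatCompletionParams) -> Result<ProviderResponse> {
        // Translate params into the provider's HTTP request, call it,
        // then map the reply (and its token usage) into ProviderResponse.
        todo!()
    }

    fn get_model_pricing(&self, _model: &str) -> Option<Pricing> {
        // Return per-model pricing so cost tracking keeps working.
        None
    }
}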

The parts nobody talks about until they break

Multi-provider libraries aren't rare anymore. The difference is in the details that only matter once you're running them hard.

Cost tracking. Every response includes token usage and calculated cost. For native API providers, we maintain per-model pricing tables. For proxies and aggregators that host open-weight models, we fall back to reference pricing matched by model name. If the upstream API doesn't return a cost field, we compute it ourselves from the token count.
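The arithmetic itself is trivial; the work is keeping the pricing tables current. A sketch of the fallback computation, assuming a Pricing struct with per-million-token rates (field names are illustrative, not Octolib's exact types):

// Illustrative: compute cost from token counts and per-million-token rates.
pub struct Pricing {
    pub input_per_million: f64,  // USD per 1M prompt tokens
    pub output_per_million: f64, // USD per 1M completion tokens
}

pub fn cost_usd(p: &Pricing, prompt_tokens: u64, completion_tokens: u64) -> f64 {
    prompt_tokens as f64 / 1_000_000.0 * p.input_per_million
        + completion_tokens as f64 / 1_000_000.0 * p.output_per_million
}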

This sounds minor. It isn't. Try running thousands of requests across five providers and guessing where your money went.

Structured output. Not every provider supports JSON schema constraints. Octolib knows which ones do — OpenAI, Anthropic, DeepSeek, Moonshot, MiniMax — and handles the request formatting differences. You set response_format and get back parsed JSON. No manual prompt engineering. No "please respond in valid JSON" hacks.
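In code, that looks roughly like this. The exact ChatCompletionParams fields, Message::user, ResponseFormat, and the field holding the model's text are assumptions for the sketch; the repo's examples show the real surface:

use serde::Deserialize;

// The shape we want the model to fill in.
#[derive(Deserialize)]
struct Triage {
    severity: String,
    component: String,
}

// Sketch only: parameter and field names are assumptions.
let params = ChatCompletionParams {
    model: "openai:gpt-5.5".into(),
    messages: vec![Message::user("Classify this bug report: ...")],
    response_format: Some(ResponseFormat::json_schema::<Triage>()),
    ..Default::default()
};

let response = provider.chat_completion(params).await?;
// The provider enforces the schema, so this parse shouldn't need retries.
let triage: Triage = serde_json::from_str(&response.content)?;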

Embeddings and reranking. The same provider abstraction applies to generate_embeddings() and rerank(). Jina, Voyage, Cohere, OpenAI, FastEmbed — same API, different backend. Modern RAG pipelines touch three different service types. Managing separate clients for each is busywork you shouldn't have to do.
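A sketch of what that feels like in practice. The method names follow this post; the factory functions and model strings are illustrative:

// Illustrative: factory function names and model strings are assumptions.
let chunks: Vec<String> = vec![
    "fn load_config(path: &Path) -> Result<Config> { ... }".into(),
    "static DEFAULT_CONFIG: &str = include_str!(\"default.toml\");".into(),
];

let embedder = ProviderFactory::get_embedding_provider("jina:jina-embeddings-v3")?;
let vectors = embedder.generate_embeddings(&chunks).await?;

let reranker = ProviderFactory::get_reranking_provider("cohere:rerank-v3.5")?;
let ranked = reranker.rerank("where is the config loaded?", &chunks).await?;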

No panics, no println. It's a library, not an app. Every public function returns Result. Errors carry context. Debug output goes through tracing, not stdout. These aren't exciting features. They're the difference between a library you trust in production and one that pages you at 3am.
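On the calling side that turns into the usual Rust pattern; the usage field here is an assumption:

// Sketch: handle the Result explicitly and log through tracing, not stdout.
match provider.chat_completion(params).await {
    Ok(response) => {
        // `usage` is an assumed field name for this sketch.
        tracing::debug!(tokens = response.usage.total_tokens, "completion succeeded");
    }
    Err(err) => {
        // Decide here: retry, fall back to another provider, or bubble up.
        tracing::warn!(error = %err, "completion failed");
    }
}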

Three products, one library

Octolib is the AI layer for our entire stack:

  • Octocode — semantic code search, codebase Q&A, diff summarization
  • Octomind — agent reasoning, tool calling, multi-step workflows
  • Octobrain — knowledge retrieval, document processing, embedding pipelines

When we add a new model — say, a reasoning model from DeepSeek or the latest Claude release — we update Octolib once. Every product gets it. When OpenAI changes their API response format or Anthropic introduces cache pricing, the fix lives in one place.

This is why we extracted it. Not because wrapper code is interesting. Because maintaining four versions of the same wrapper is expensive, and getting provider integrations wrong means broken products.

The Rust question

We get asked this sometimes. The short answer: we build infrastructure in Rust, and Octolib is infrastructure.

The longer answer: LLM calls are I/O-bound network requests. But the surrounding pipeline — embedding generation, reranking, token counting, request batching, structured output validation — benefits from Rust's combination of async performance and compile-time correctness. When you're running inference at the scale where milliseconds matter, you don't want garbage collection pauses or runtime type errors in your hot path.

And the Rust ecosystem for AI infrastructure is maturing fast. Candle for local inference. tokio for async. serde for the endless JSON parsing. It all fits together cleanly.

Open source, open ecosystem

Octolib is one piece of a larger picture. Octocode is open source. Octomind's architecture is documented publicly. We're building in public because the problems we're solving — how to talk to ten different AI providers without losing your mind — aren't unique to us.

The library is on GitHub. It compiles with cargo check. It runs the examples without configuration beyond an API key. If you're building AI features in Rust and tired of writing the same provider boilerplate for the third time, it might save you a few weeks.

We're adding providers as things evolve. The next one is probably already in a pull request.