Building OLLM: SSE streaming, BYOK, and why I wanted control over my own AI client

OLLM is a BYOK AI chat platform built on Next.js. Here's the architecture: SSE streaming, race condition guards for overlapping requests, and the OKLCH color space rabbit hole I fell into.

I built OLLM because I wanted an AI chat client I actually controlled. Most clients lock you into their model choices, their pricing, their data policies. BYOK — Bring Your Own Key — means you connect your own API key to whatever OpenAI-compatible endpoint you want. The platform just provides the interface.

It's in MVP right now. This is a write-up of the interesting technical parts.

The BYOK architecture

The key design decision: the frontend sends requests through a Next.js API route, not directly to the model provider. This keeps the API key server-side and lets us add a layer of request normalization — every provider gets the same shape of request, regardless of quirks in their API.

The backend route accepts a standard chat message format, swaps in the user's configured endpoint and key, and streams the response back. Users can point OLLM at any OpenAI-compatible API — commercial providers, self-hosted models, local inference servers.
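Here's a minimal sketch of that normalization step, not OLLM's actual code. It assumes the provider exposes the usual OpenAI-style /chat/completions path with Bearer authentication. The names ChatMessage, ProviderConfig, UpstreamRequest, and buildUpstreamRequest are hypothetical.

```typescript
// Hypothetical types for illustration; OLLM's real shapes may differ.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ProviderConfig {
  baseUrl: string; // user-configured, e.g. a local inference server's /v1 root
  apiKey: string; // stays on the server, never reaches the browser
  model: string;
}

interface UpstreamRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Turn the standard message format into a request for an
// OpenAI-compatible endpoint (assumed: POST {baseUrl}/chat/completions).
function buildUpstreamRequest(
  messages: ChatMessage[],
  config: ProviderConfig,
): UpstreamRequest {
  // Tolerate a trailing slash in the user-supplied base URL.
  const base = config.baseUrl.replace(/\/+$/, "");
  return {
    url: `${base}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${config.apiKey}`,
      },
      // stream: true asks the provider to answer with Server-Sent Events.
      body: JSON.stringify({ model: config.model, messages, stream: true }),
    },
  };
}
```

The route handler would pass url and init straight to fetch. Keeping the builder a pure function makes it easy to test against each provider's quirks without touching the network.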

This decoupling is the whole product. The interface stays the same; the model behind it is the user's choice.

SSE streaming

Chat responses stream token by token using Server-Sent Events. The flow: the API route opens a connection to the model provider, reads the response as a stream, and forwards it to the client via an AsyncGenerator.

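async function* streamCompletion(req: ChatRequest): AsyncGenerator<string> {
  const response = await fetch(req.endpoint, { ... }); // request options elided
  const reader = response.body?.getReader();
  if (!reader) throw new Error("Response has no body");
  const decoder = new TextDecoder(); // stream: true below keeps split multi-byte characters intact
  for (let r = await reader.read(); !r.done; r = await reader.read()) {
    yield decoder.decode(r.value, { stream: true });
  }
}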

The client side accumulates chunks into the displayed message. One nuance: you need to buffer partial chunks. Network reads don't line up with SSE event boundaries, so a chunk can end in the middle of a JSON payload. You can't parse each chunk independently; you have to accumulate until you have a complete data line, then parse the JSON in it.
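A minimal sketch of that buffering, under two assumptions: each event carries its JSON on a single data: line, and the stream ends with a data: [DONE] sentinel, as OpenAI-style streams do. SseBuffer and extractDelta are hypothetical names, not OLLM's actual code.

```typescript
// Hypothetical helper: collects decoded text chunks and releases
// only the `data:` payloads that have fully arrived.
class SseBuffer {
  private pending = "";

  // Feed one decoded chunk; returns the payloads it completed.
  push(chunk: string): string[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an unfinished line ("" if the chunk ended on a newline).
    this.pending = lines.pop() ?? "";
    const payloads: string[] = [];
    for (const raw of lines) {
      const line = raw.replace(/\r$/, ""); // tolerate CRLF line endings
      if (!line.startsWith("data:")) continue; // skip blank separators and comments
      payloads.push(line.slice(5).trimStart());
    }
    return payloads;
  }
}

// Pull the text delta out of one payload, assuming the
// OpenAI-style shape { choices: [{ delta: { content } }] }.
function extractDelta(payload: string): string {
  if (payload === "[DONE]") return "";
  const parsed = JSON.parse(payload);
  return parsed.choices?.[0]?.delta?.content ?? "";
}
```

Because the unfinished tail is carried over in pending, JSON.parse only ever sees whole payloads, wherever the network happened to cut the stream.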

Race condition guards

There's a classic problem with streaming UIs: the user sends a message, a response starts streaming, they send another message before it finishes. Without a guard, the second response can start updating the UI while the first is still writing to the same state.

The fix is a request counter. Each send increments a requestId. The streaming handler checks before each update that its requestId still matches the current one. If not, it drops the update silently. The superseded response finishes its fetch in the background but never touches the UI.
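The counter can be sketched framework-free like this. RequestGuard and guardedUpdater are hypothetical names; in a React component the counter would more likely live in a ref, which is an assumption on my part rather than a description of OLLM's code.

```typescript
// Hypothetical guard: tracks which request is the latest one.
class RequestGuard {
  private current = 0;

  // Call on every send; returns the id the new stream must present.
  begin(): number {
    this.current += 1;
    return this.current;
  }

  isCurrent(id: number): boolean {
    return id === this.current;
  }
}

// Wrap a state updater so it only runs while its request is still the latest.
function guardedUpdater(
  guard: RequestGuard,
  id: number,
  apply: (text: string) => void,
): (text: string) => void {
  return (text) => {
    if (!guard.isCurrent(id)) return; // superseded: drop the update silently
    apply(text);
  };
}

// Usage: the first stream stops touching state once a second send happens.
const demoGuard = new RequestGuard();
let demoShown = "";
const writeFirst = guardedUpdater(demoGuard, demoGuard.begin(), (t) => { demoShown += t; });
writeFirst("Hel"); // applied
const writeSecond = guardedUpdater(demoGuard, demoGuard.begin(), (t) => { demoShown += t; });
writeFirst("lo"); // dropped, request 1 is stale
writeSecond("!"); // applied
```

The stale stream keeps running to completion, but every update it attempts is filtered out at the last step, so the state only ever reflects the newest request.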

React's Transition API is also useful here. Wrapping the message state updates in startTransition marks them as non-urgent, so React can handle user input first and the UI stays responsive during streaming.

The visual side

I'm using Tailwind 4, which brings first-class support for the OKLCH color space. OKLCH is designed to be perceptually uniform: equal steps in the lightness channel produce roughly equal perceived lightness changes, which makes building a coherent color system much more predictable than with HSL.

The color palette for OLLM is built entirely in OKLCH. In practice this means the light and dark themes are consistent in a way that HSL-based themes often aren't, and accent colors at different lightness levels look intentional rather than accidental.
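To illustrate the idea, here's a small sketch that generates a lightness ramp as CSS oklch() values. The function name oklchRamp and the 95% to 25% lightness range are arbitrary choices for the example, not OLLM's actual palette.

```typescript
// Hypothetical helper: builds evenly spaced lightness steps at a
// fixed chroma and hue, formatted as CSS oklch() color values.
function oklchRamp(hue: number, chroma: number, steps: number): string[] {
  if (steps < 2) throw new Error("a ramp needs at least 2 steps");
  const ramp: string[] = [];
  for (let i = 0; i < steps; i++) {
    // Lightness runs from 95% (lightest) down to 25% (darkest).
    const lightness = 95 - (i * 70) / (steps - 1);
    ramp.push(`oklch(${lightness.toFixed(1)}% ${chroma} ${hue})`);
  }
  return ramp;
}

// Example: a three-step ramp at hue 250.
const blueRamp = oklchRamp(250, 0.15, 3);
// → ["oklch(95.0% 0.15 250)", "oklch(60.0% 0.15 250)", "oklch(25.0% 0.15 250)"]
```

Because only the lightness channel changes, the steps should read as evenly spaced shades of one color. One caveat: a fixed chroma can fall outside the displayable gamut at the lightest and darkest ends, so a real palette would probably taper chroma there.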

KaTeX handles math rendering, and markdown-it handles the rest of the message formatting. Both run client-side on the accumulated message content, not on each streamed chunk.

Current state

MVP means it works well enough for daily use, which it does — I've been using it as my primary AI client for the past few months. The parts that need more work are settings management and conversation history persistence. Those are next.
