Building a local-first browser TTS studio with Kokoro

Mar 28, 2026
ai · tts · local-first · accessibility · testing
A cyberpunk illustration of Martin in a neon-lit recording studio operating a futuristic audio console. Glowing soundwaves flow from a 'Kokoro ONNX' hopper, surrounded by Vue.js holograms and an 'AI Speed' rocket anchored by chains labeled 'Linting' and 'Accessibility'

I wanted audio versions of my blog posts. Just a play button for an article, with speech that sounded natural enough that people would actually use it. I could have solved that by paying for a hosted text-to-speech (TTS) service and moving on, but I wanted more control than that.

Hosted TTS was the baseline, not the answer

I tried ElevenLabs first, and to be fair, it worked. The output quality was good, setup was easy, and it immediately proved that article audio was worth doing.

It still felt wrong for what I actually needed.

I did not want a big platform with a long feature list, pricing I had to keep thinking about, and my content flowing through someone else's product just to narrate a few articles. My use case was much smaller and much more specific: take text I already own, turn it into audio, keep the workflow simple, and keep the whole thing under my control.

That pushed me toward open models. I wanted something I could inspect, adapt, and run locally. Once I found the Kokoro ONNX model on Hugging Face, the project stopped looking like a vague idea and started looking like an engineering problem I actually wanted to solve.

From Python proof of concept to LocalVoice Studio

I started with the smallest possible test: a Python script in the terminal. Paste in some text, run synthesis, listen back, repeat. That first proof of concept was rough, but the voice quality was good enough that I immediately stopped thinking in terms of "can this work?" and started thinking in terms of "how far can I push this in the browser?"

That turned into LocalVoice Studio, a static frontend app built with Vue, TypeScript, ONNX Runtime Web, and kokoro-js. No backend. No cloud inference. No telemetry. Speech generation happens in the browser and, after the initial model download, the app is meant to keep working offline too. That distinction matters. "Local-first" is easy to say, but I wanted to be clear about what it means in practice. The model download is a one-time cost, but the app's behavior after that is what defines the experience.

At that point the goal changed a little. I was no longer building a thin article player. I was building a small studio for producing article audio. It was about being able to give a personal touch to the audio and play with the voice settings. I wanted to make it feel like a creative tool, not just a synthesis utility.

The difference between a demo and a tool

The model itself was only part of the work. The harder part was making browser TTS feel stable enough that I would trust it for real writing.

Inference runs in a dedicated worker because I had no interest in freezing the UI every time the model initialized or generated audio. The app prefers WebGPU when the browser supports it, but retries on WASM if GPU startup fails. That fallback path was one of the most important decisions in the project. Browser AI is full of features that look impressive on one machine and fall apart on the next. If runtime selection is brittle, the app is brittle.
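The fallback idea can be sketched as a small retry loop. This is a simplified illustration, not the app's actual code: the names `backendOrder` and `initWithFallback` are hypothetical, and in the real app the `init` callback would be something like an ONNX Runtime Web session setup running inside the worker.

```typescript
type Backend = "webgpu" | "wasm";

// Decide which backends to try, in order. In the browser, `hasWebGPU`
// would come from checking for `navigator.gpu`.
export function backendOrder(hasWebGPU: boolean): Backend[] {
  return hasWebGPU ? ["webgpu", "wasm"] : ["wasm"];
}

// Try each backend until one initializes. `init` stands in for the real
// session setup with that execution provider.
export async function initWithFallback(
  hasWebGPU: boolean,
  init: (backend: Backend) => Promise<unknown>,
): Promise<Backend> {
  let lastError: unknown;
  for (const backend of backendOrder(hasWebGPU)) {
    try {
      await init(backend);
      return backend;
    } catch (err) {
      lastError = err; // e.g. the GPU adapter request failed; try the next one
    }
  }
  throw lastError;
}
```

The point of isolating the order in a pure function is that the selection logic becomes trivially testable without a real GPU.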

I also did not want voice settings to feel like blind guesses. So the app generates previews, keeps local history, stores presets in the browser, supports voice blending, and lets you control pronunciation more directly with inline markup for pauses, stress, and phoneme overrides. That is the part I enjoyed most as a web developer. Once the raw speech quality is good enough, the real product work shifts to feedback loops. Can I preview a change quickly? Can I recover a setting I liked? Can I export something useful without extra cleanup? To me, delivering a good user experience is more important than putting "AI-powered" on the landing page.
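Browser-persisted presets are a small piece of that feedback loop. Here is a minimal sketch of the idea; the preset fields, the storage key, and the voice ids are assumptions for illustration, not the app's actual schema. Storage is injected so the same code works against `window.localStorage` in the browser and a stub in tests.

```typescript
export interface VoicePreset {
  name: string;
  voice: string; // e.g. a Kokoro voice id
  speed: number; // playback rate multiplier
}

// Minimal subset of the Storage interface, so tests can pass a stub.
export interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const KEY = "localvoice:presets"; // assumed key, not necessarily the app's

export function loadPresets(store: KVStore): VoicePreset[] {
  try {
    return JSON.parse(store.getItem(KEY) ?? "[]");
  } catch {
    return []; // corrupted data should not break the studio
  }
}

export function savePreset(store: KVStore, preset: VoicePreset): VoicePreset[] {
  // Replace a preset with the same name instead of duplicating it.
  const presets = loadPresets(store).filter((p) => p.name !== preset.name);
  presets.push(preset);
  store.setItem(KEY, JSON.stringify(presets));
  return presets;
}
```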

That is also why I kept the output practical. You can generate speech, play it back, download it as WAV, clear caches, and keep moving. It needed to feel like a local tool I could come back to, not a one-shot demo I would show once and forget.
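The WAV export is a good example of how little "practical" takes. A mono 16-bit PCM WAV file is a 44-byte RIFF header followed by clamped, scaled samples. This is a generic sketch of that layout, not necessarily how LocalVoice Studio implements it (the real app may well use a library):

```typescript
// Encode mono float samples in [-1, 1] as a 16-bit PCM WAV file.
export function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeString = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  // RIFF container header
  writeString(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // total size minus the first 8 bytes
  writeString(8, "WAVE");

  // fmt chunk: PCM, mono, 16-bit
  writeString(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // audio format 1 = PCM
  view.setUint16(22, 1, true);                           // channel count
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample

  // data chunk: clamp floats and scale to int16
  writeString(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In the browser, the resulting buffer becomes a download with `new Blob([buffer], { type: "audio/wav" })` and an object URL.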

AI made it fast. Guardrails made it shippable.

This project was also an experiment in AI-assisted development. I used Codex in VS Code, Gemini through Antigravity, and the usual modern tooling around a Vue codebase. The speed is real: you can move from rough idea to working interface far faster than was possible a year ago.

The part people keep underselling is how quickly that speed turns into drift if you do not put boundaries around it.

I had to be explicit about the shape of the project: static frontend only, worker-driven inference, typed state, clear constraints around privacy, and no backend escape hatches. I also had to keep the boring guardrails in place from the start: linting, formatting, Vitest, Playwright, accessibility checks, and runtime recovery when WebGPU failed. That was the QA lead part of my brain taking over. I do not really care that an agent can scaffold ten components in a minute if the result becomes impossible to trust once real users touch it.

AI is great at accelerating implementation inside a box. It is much worse at deciding where the box should be, how strict it needs to stay, and what quality bar the project has to clear before it deserves to ship. That part is still on us.

What I took away

I did not build this because I think every problem needs a local AI app. I built it because this one did. Article audio is a narrow problem, and a browser-first, open-source TTS tool turned out to be a much better fit for it than another subscription.

The bigger lesson for me is that local-first AI gets interesting when you stop treating the model as the product. The product is everything around it: fallback behavior, caching, previews, export, tests, and enough constraints that the code does not dissolve into vibes halfway through the build.

If you want to study or reuse the approach, the repo is here. If you are building something similar, I would start with the smallest proof of concept you can get running, then spend more time than you think on guardrails. The demo is the easy part. Making it hold up is the work.

///

Built with Vue & Nuxt