Voice Mode Arrived in Main Chat

LikeClaw just moved real-time voice out of a side experiment and into the main chat flow, with live mic sessions, credit safeguards, friendlier errors, and a fast sequence of fixes that made voice usable in production.

Five days from voice layer to production hardening.

Versions shipped: 3.1.34→3.2.19

Voice-related commits: 15+

Core shift: Voice inside main chat

Biggest win: Usable real-time conversations

Voice stopped being a side feature

A lot of AI products add voice in the least ambitious way possible.

They put a microphone icon somewhere in the corner, transcribe what you say, paste the text into the chat box, and call it voice mode.

That is not really voice. That is dictation with branding.

Over the last few days, LikeClaw took a more serious step. Voice was moved into the main chat experience itself, backed by a live interaction layer, then hardened through a fast run of production fixes. The result is not just that you can talk to the agent. The result is that talking now fits the way the product already works.

The important shift: from experimental surface to core workflow

The most meaningful change was not the first voice commit. It was the decision to bring voice into the main composer.

A dedicated voice experiment can be impressive in a demo and irrelevant in a product. Users do not want to choose between “the real app” and “the voice version of the app.” They want to continue the same conversation in the same place, with a different input mode.

That is what changed here.

LikeClaw added a mic button and inline voice panel directly in the chat composer. Voice now lives where users already think and work: inside the conversation thread, attached to the active agent, alongside the normal flow of prompts and responses.

That sounds smaller than it is. Product-wise, it is the difference between a feature you try once and a feature you actually keep using.

Real-time interaction instead of fake voice

Before a voice feature becomes useful, it has to feel live.

That is why the initial work mattered: it built a real-time voice interaction layer on Gemini Live, not just a slower route to submitting transcribed text. Spoken sessions are handled as their own interaction mode, with live communication and dedicated session behavior.

From there, the team had to solve the less glamorous part: make that live experience behave correctly in the product people already use every day.

The hardening sprint is the story

The commit history tells the real story better than any launch graphic could.

Voice WebSocket routing had to be corrected so production traffic used the right API base path. CDN URLs for the voice activity detection (VAD) and ONNX Runtime dependencies had to be pinned to the installed versions. The chat UI had to stop treating voice as separate from the current conversation context. Empty chats needed automatic session creation before voice could start. Cached messages needed invalidation after each dispatch so users could actually see updated results.
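The routing and pinning fixes are easier to picture with a small sketch. The TypeScript below is illustrative only and is not LikeClaw's code: the function names, the `/voice/` path segment, the choice of jsDelivr as the CDN, and the version number in the usage note are all assumptions made for the example.

```typescript
// Hypothetical helpers; names, paths, and the CDN host are assumptions.

/** Build a voice WebSocket URL under the same base path as the HTTP API. */
function buildVoiceSocketUrl(apiBaseUrl: string, sessionId: string): string {
  // Swap http(s) for ws(s) and drop trailing slashes so the socket
  // shares the API's host and base path instead of a hard-coded one.
  const wsBase = apiBaseUrl.replace(/^http/, "ws").replace(/\/+$/, "");
  return `${wsBase}/voice/${encodeURIComponent(sessionId)}`;
}

/** Build a CDN URL pinned to the exact version installed in the app. */
function pinnedCdnUrl(pkg: string, installedVersion: string, file: string): string {
  // Reject floating tags such as "latest" so the CDN assets cannot
  // drift away from the version the bundled code was built against.
  if (!/^\d+\.\d+\.\d+$/.test(installedVersion)) {
    throw new Error(`Expected an exact x.y.z version for ${pkg}, got "${installedVersion}"`);
  }
  return `https://cdn.jsdelivr.net/npm/${pkg}@${installedVersion}/${file}`;
}
```

For example, `buildVoiceSocketUrl("https://example.com/api/v1/", "abc")` yields `wss://example.com/api/v1/voice/abc`, and `pinnedCdnUrl("onnxruntime-web", "1.2.3", "dist/ort.min.js")` yields a URL locked to that placeholder version. In a real build the version would more likely be read from the lockfile than typed by hand.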

Then came the deeper interaction fixes.

Voice sessions were wired to the existing chat session. An inline panel replaced a more awkward detached flow. Hard timeouts were removed in favor of a design that can inject asynchronous chat updates into the live session. Persona handling was tightened so that specifics are preserved verbatim instead of being flattened away. And when microphone permission is denied or the microphone is unavailable, the interface now says so clearly instead of failing like a broken app.
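One way to produce clear microphone failures is to map the browser's error names to plain-language messages. The sketch below assumes the microphone is requested through `navigator.mediaDevices.getUserMedia`, which rejects with a `DOMException` whose `name` is one of the standard values shown. The function name, the message wording, and the `showVoiceError` call in the comment are hypothetical.

```typescript
// Hypothetical mapper. The error names are the standard DOMException
// names used by getUserMedia(); the messages are illustrative.
type MicErrorLike = { name: string };

function describeMicError(err: MicErrorLike): string {
  switch (err.name) {
    case "NotAllowedError":
      // The user or a browser/site policy denied microphone access.
      return "Microphone access is blocked. Allow it in your browser's site settings and try again.";
    case "NotFoundError":
      // No audio input device is available.
      return "No microphone was found. Connect one and try again.";
    case "NotReadableError":
      // The device exists but could not be opened (often in use elsewhere).
      return "Your microphone could not be read. Close other apps that may be using it.";
    default:
      return "Voice could not start. Please try again.";
  }
}

// Assumed browser wiring (showVoiceError is a placeholder UI call):
// navigator.mediaDevices.getUserMedia({ audio: true })
//   .catch((err) => showVoiceError(describeMicError(err)));
```

Keeping the mapping in a pure function means it can be unit-tested without a browser, and the fallback branch ensures an unexpected error still produces a readable message.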

This is what real product work looks like. Not one launch. A chain of corrections until the feature stops fighting the user.

Safety and cost controls had to exist before scale

Voice is one of those features that can become expensive, messy, or abusable quickly if you ship it without guardrails.

LikeClaw added session safeguards early: duration caps, per-user caps, and credit checks. That matters because real-time features are easy to love in a demo and easy to regret in production if there is no cost discipline behind them.
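As a rough illustration of how these three safeguards might fit together, the sketch below checks the per-user cap and the credit balance before a session starts, and checks the duration cap while it runs. Every name, field, and threshold here is an assumption; the post does not describe how LikeClaw implements or sizes its limits.

```typescript
// Hypothetical guardrail checks; all names and limits are assumptions.
interface VoiceLimits {
  maxSessionSeconds: number;   // duration cap for a single session
  maxSessionsPerUser: number;  // per-user cap on active sessions
  minCreditsToStart: number;   // credits required before a session may start
}

interface UserVoiceState {
  activeSessions: number;
  credits: number;
}

interface StartDecision {
  allowed: boolean;
  reason?: "session_limit" | "insufficient_credits";
}

/** Decide whether a user may open a new voice session. */
function canStartVoiceSession(user: UserVoiceState, limits: VoiceLimits): StartDecision {
  if (user.activeSessions >= limits.maxSessionsPerUser) {
    return { allowed: false, reason: "session_limit" };
  }
  if (user.credits < limits.minCreditsToStart) {
    return { allowed: false, reason: "insufficient_credits" };
  }
  return { allowed: true };
}

/** True once a running session has reached its duration cap. */
function isOverDurationCap(startedAtMs: number, nowMs: number, limits: VoiceLimits): boolean {
  return (nowMs - startedAtMs) / 1000 >= limits.maxSessionSeconds;
}
```

Returning a reason code rather than a bare boolean lets the interface explain why voice did not start. Checks like these belong on the server, since a client-side check alone can be bypassed.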

The product also needed operational fixes around deployment headers and microphone policy so the browser would actually permit the experience users were being invited to start. Again, boring on paper. Essential in practice.
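The post does not name the headers involved. A common cause of this class of problem is the `Permissions-Policy` response header, which controls whether a page and its embedded frames may use the microphone. The sketch below assumes that is the header in question and shows one way to generate it; the function name and the example origin are hypothetical.

```typescript
// Hypothetical helper; assumes Permissions-Policy is the relevant header.

/** Build response headers that allow microphone use for the listed origins. */
function voiceSecurityHeaders(extraOrigins: string[] = []): Record<string, string> {
  // "self" allows the page's own origin; additional origins must be quoted.
  const allowlist = ["self"]
    .concat(extraOrigins.map((origin) => `"${origin}"`))
    .join(" ");
  return { "Permissions-Policy": `microphone=(${allowlist})` };
}
```

Calling `voiceSecurityHeaders()` produces `microphone=(self)`, while a policy of `microphone=()` would block microphone access on the page entirely. Browsers also expose `getUserMedia` only in secure contexts, so the page needs to be served over HTTPS outside of localhost.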

A voice feature is not real if it only works on the founder’s laptop.

Why this matters for LikeClaw specifically

LikeClaw is not trying to be a toy voice chatbot. It is building an agent platform where conversations connect to workspaces, tools, background execution, schedules, and persistent context.

That makes voice more valuable here than in a generic assistant.

If voice lives inside the same chat that can already trigger tools, store files, and continue work over time, speaking to the agent is not just another input novelty. It is a faster way to use the whole platform.

A founder can talk through a task while walking. A user can brainstorm hands-free, then continue the same thread later by typing. A support workflow can begin with voice and end with structured output in the workspace. The same context survives.

That continuity is the real feature.

Shipping the usable version first

There is a common mistake in AI product development: keep polishing the side experience until it looks magical, while never integrating it into the main product deeply enough to matter.

LikeClaw did the opposite.

It shipped a real-time voice layer, moved it into the core chat workflow, then spent the following releases fixing the parts that determine whether users trust it: routing, session creation, cache behavior, async updates, credit controls, permissions, and deployment policy.

That is a better path.

Because once voice is in the main chat, every further improvement compounds on a real usage path. You are no longer polishing a concept. You are improving an actual behavior users can adopt.

What comes next

Voice in LikeClaw is not done. It does not need to be.

What matters is that the platform crossed the line from “we have a voice feature” to “voice is part of the product.” That is a much higher bar.

The last week of work shows exactly how that happens: not with a single announcement, but with repeated product and infrastructure decisions that make the experience coherent.

The microphone button is now in the place it always should have been: inside the conversation itself.

What changed in LikeClaw voice

  1. Real-time voice layer

    Gemini Live-powered voice sessions landed with a dedicated real-time interaction layer instead of a fake push-to-talk wrapper around text chat.

  2. Voice moved into the main composer

    The mic button and inline voice panel now live directly in the chat composer, so voice is part of the normal conversation flow instead of a detached experiment.

  3. Session continuity and async updates

    Voice now stays attached to the existing chat session, can create a session in an empty chat, and can surface asynchronous updates instead of feeling stateless.

  4. Guardrails and clearer failures

    Duration caps, per-user caps, credit checks, pinned CDN dependencies, routing fixes, and friendly microphone-permission errors made the feature safer to ship broadly.

Questions about LikeClaw voice mode

Is this just speech-to-text on top of chat?

No. The work was not just transcription. LikeClaw added a real-time voice interaction layer, then integrated it into the main chat UI so voice behaves like a first-class conversation mode rather than a novelty button.

Why were there so many follow-up fixes?

Because real-time features only become useful after edge cases are handled. Empty-chat startup, websocket routing, CDN version pinning, cache invalidation, session continuity, timeouts, credit safety, and microphone permissions all had to be tightened before voice felt dependable.

What does this change for users?

It lowers the friction to start and continue a conversation. Instead of typing every prompt, users can talk to the agent inside the same chat they already use for work, while keeping the same context and receiving the same task results.

Is voice now finished?

No. It is usable, which matters more. Shipping voice into the main chat flow means LikeClaw can now refine a real production path instead of polishing an isolated demo surface.