Voice agents
@cloudflare/voice gives Cloudflare's Agent SDK a streaming voice transport — WebSocket frames carry STT input and TTS output, with Workers AI doing the heavy lifting. ayjnt detects withVoice(), provisions the AI binding, and emits a typed useVoiceAgent hook the client can call without thinking about transport URLs.
The mixin
A voice agent extends a class produced by withVoice(Agent):
import { Agent } from "agents";
import { withVoice } from "@cloudflare/voice";
import { WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice/providers";
import type { GeneratedEnv } from "@ayjnt/env";

type State = { transcript: Array<{ role: "user" | "assistant"; text: string }> };

export default class ChatAgent extends withVoice(Agent)<GeneratedEnv, State> {
  override initialState: State = { transcript: [] };

  voice = {
    stt: new WorkersAIFluxSTT({ binding: this.env.AI }),
    tts: new WorkersAITTS({ binding: this.env.AI }),
  };

  // onTranscript fires once a full utterance is recognized.
  async onTranscript(text: string): Promise<string> {
    this.setState({
      transcript: [...this.state.transcript, { role: "user", text }],
    });
    const reply = await this.respondTo(text);
    this.setState({
      transcript: [...this.state.transcript, { role: "assistant", text: reply }],
    });
    return reply; // ← TTS'd back to the client.
  }

  private async respondTo(text: string): Promise<string> {
    // call your LLM here
    return `I heard you say: ${text}`;
  }
}

What ayjnt wires up
Detecting withVoice( in the agent's source file flips a voice feature flag and:
- Adds an ai binding to wrangler.jsonc (idempotent: it's already there if you also have browser tools).
- Adds AI: Ai to GeneratedEnv.
- Generates a typed useVoiceAgent hook in @ayjnt/<agentId> that points at the right route path and uses ayjnt's URL shape instead of the SDK's default.
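The idempotent binding step can be pictured with a small sketch. This is not ayjnt's generator code: the WranglerConfig type and the ensureAiBinding helper are hypothetical names introduced for illustration, and the only detail taken from Wrangler itself is the "ai": { "binding": "AI" } shape of a Workers AI binding.

```typescript
// Hypothetical type covering only the slice of wrangler.jsonc this sketch touches.
type WranglerConfig = {
  name?: string;
  ai?: { binding: string };
  [key: string]: unknown;
};

// Hypothetical helper: returns a config that has an `ai` binding.
// An existing binding is left untouched, so repeated calls change nothing.
function ensureAiBinding(config: WranglerConfig, binding = "AI"): WranglerConfig {
  if (config.ai) return config; // already provisioned, e.g. by another feature
  return { ...config, ai: { binding } };
}

const once = ensureAiBinding({ name: "my-app" });
const twice = ensureAiBinding(once); // same object back: nothing left to add
```

Returning the input unchanged when the binding already exists is what makes the step safe to run for every feature that needs Workers AI.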
The generated hook
Voice agents get a different client hook than regular agents.
Where a regular agent would generate useAgent from agents/react, a voice
agent gets useVoiceAgent, a thin wrapper around useAyjntVoiceAgent from ayjnt/voice/client:
import { useAyjntVoiceAgent } from "ayjnt/voice/client";

export function useVoiceAgent(opts?: {
  name?: string;
  host?: string;
  enabled?: boolean;
  onReconnect?: () => void;
}) {
  return useAyjntVoiceAgent({
    agent: "chat",
    routePath: "/chat", // ← ayjnt's URL shape
    ...opts,
  });
}
Why the wrapper: the upstream useVoiceAgent from @cloudflare/voice/react hardcodes a WebSocketVoiceTransport backed by PartySocket, which
connects to /agents/<kebab>/<name>.
ayjnt's URL shape is /<route>/<instance>; the wrapper supplies
a raw-WebSocket transport that connects to the right URL.
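To make the difference between the two URL shapes concrete, here is a minimal sketch that builds both. The helper names (toKebab, upstreamUrl, ayjntUrl), the wss:// scheme, and the kebab-casing rule are assumptions for illustration; neither PartySocket's nor ayjnt's actual URL-building code is reproduced here.

```typescript
// Assumed kebab-casing rule: "ChatAgent" -> "chat-agent".
function toKebab(className: string): string {
  return className.replace(/([a-z0-9])([A-Z])/g, "$1-$2").toLowerCase();
}

// Upstream shape described above: /agents/<kebab>/<name>
function upstreamUrl(host: string, className: string, name: string): string {
  return `wss://${host}/agents/${toKebab(className)}/${encodeURIComponent(name)}`;
}

// ayjnt shape described above: /<route>/<instance>
function ayjntUrl(host: string, routePath: string, instance: string): string {
  const route = routePath.replace(/^\/+|\/+$/g, ""); // tolerate "/chat" or "chat/"
  return `wss://${host}/${route}/${encodeURIComponent(instance)}`;
}

const viaUpstream = upstreamUrl("example.com", "ChatAgent", "demo");
const viaAyjnt = ayjntUrl("example.com", "/chat", "demo");
```

Under these assumptions the same agent instance resolves to wss://example.com/agents/chat-agent/demo upstream and wss://example.com/chat/demo in ayjnt, which is why the upstream transport cannot be reused as-is.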
Using the hook in your UI
import { useVoiceAgent } from "@ayjnt/chat";

export default function VoiceUI() {
  const voice = useVoiceAgent({ name: "demo" });
  // voice.connected, voice.listening, voice.speaking, voice.error
  // voice.start(), voice.stop()
  return (
    <div>
      <button onClick={voice.connected ? voice.stop : voice.start}>
        {voice.connected ? "Stop" : "Talk"}
      </button>
      <p>{voice.listening ? "Listening…" : voice.speaking ? "Speaking…" : "Idle"}</p>
    </div>
  );
}

STT and TTS providers
The voice mixin doesn't pick STT or TTS for you — it expects
the agent to expose a voice = { stt, tts }
instance field. The most useful providers from @cloudflare/voice/providers:
- WorkersAIFluxSTT: streaming STT against Workers AI's Whisper model.
- WorkersAITTS: Workers AI's TTS endpoint. Streams audio frames back over the WebSocket.
You can also implement your own STT/TTS interfaces if you want to point at OpenAI, Deepgram, ElevenLabs, etc.
onTranscript lifecycle
onTranscript(text) is the central hook. It fires
once the STT engine flushes a complete utterance (silence
detection + finalization). Whatever string you return is sent
to the TTS engine and streamed back to the client as audio.
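Return undefined to skip the TTS pass (the return type then needs to be
Promise<string | undefined> rather than the Promise<string> used in the
ChatAgent example), which is useful when you want to send a state update
without speaking.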
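The "speak only when a string comes back" rule can be simulated without the SDK. The sketch below is a stand-in, not the @cloudflare/voice implementation: runTurn, TranscriptHandler, and the note-taking handler are hypothetical, and the handler is synchronous here for brevity, whereas the real hook is async.

```typescript
// Hypothetical handler type: a reply to speak, or undefined to stay silent.
type TranscriptHandler = (text: string) => string | undefined;

// Hypothetical stand-in for the dispatch step: hands the reply to `speak`
// only when the handler returned a string. Returns whether anything was spoken.
function runTurn(
  handler: TranscriptHandler,
  text: string,
  speak: (reply: string) => void,
): boolean {
  const reply = handler(text);
  if (reply === undefined) return false; // nothing to synthesize
  speak(reply);
  return true;
}

// Example handler: "note ..." utterances update state silently,
// everything else gets a spoken reply.
const notes: string[] = [];
const handler: TranscriptHandler = (text) => {
  if (text.toLowerCase().startsWith("note ")) {
    notes.push(text.slice(5));
    return undefined;
  }
  return `I heard you say: ${text}`;
};
```

With this handler, saying "note buy milk" records the note and produces no audio, while any other utterance is echoed back through the speak callback.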
Detection rules
// ✓ detected
export default class ChatAgent extends withVoice(Agent)<…> {}
// ✓ detected
const VoiceAgent = withVoice(Agent);
export default class ChatAgent extends VoiceAgent<…> {}
// ✗ NOT detected — aliased import
import { withVoice as wv } from "@cloudflare/voice";
export default class ChatAgent extends wv(Agent)<…> {}
Detection is a source-level regex match for the literal text withVoice( in the agent's file, which is why an aliased import goes undetected.
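A rough sketch of what such a check looks like follows. The exact pattern ayjnt uses is not shown in this guide, so the regex and the usesVoice helper are assumptions based only on the description of matching withVoice( in the source text.

```typescript
// Assumed check: does the file contain the literal text "withVoice("?
function usesVoice(source: string): boolean {
  return /withVoice\(/.test(source);
}

const direct = `export default class ChatAgent extends withVoice(Agent)<Env, State> {}`;

const viaConst = `const VoiceAgent = withVoice(Agent);
export default class ChatAgent extends VoiceAgent<Env, State> {}`;

const aliased = `import { withVoice as wv } from "@cloudflare/voice";
export default class ChatAgent extends wv(Agent)<Env, State> {}`;

const results = [usesVoice(direct), usesVoice(viaConst), usesVoice(aliased)];
// results: [true, true, false]
```

Because a check like this looks at text rather than the parsed program, it would also fire on withVoice( inside a comment or string, and it misses any call made through a renamed import.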
Why a custom transport
Upstream WebSocketVoiceTransport uses PartySocket
with prefix: "agents", which hardcodes a URL
shape of /agents/<kebab>/<name>.
ayjnt's URL shape is /<route>/<instance> — flatter, no
"agents" segment. The generated hook supplies an AyjntVoiceTransport that opens a raw WebSocket
to the right URL while keeping the upstream message protocol
intact.
Reference
- Cloudflare's Voice Agents docs
- examples/voice-agent: ChatAgent + UI from this guide.
- src/runtime/voiceClient.tsx: AyjntVoiceTransport + the wrapper hook.