WebSocket Integration
Connect your telephony provider to UpVoot using a standard WebSocket.
Endpoint
wss://api.upvoot.com/voice/<api-key>
This is a standard WebSocket endpoint. No custom headers, SDKs, or libraries are required. Any telephony platform that supports WebSocket audio streaming can connect.
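Since the API key is simply the last path segment, a client only has to assemble one string. A minimal sketch follows; `buildVoiceUrl` and `UPVOOT_VOICE_BASE` are hypothetical names introduced for illustration, not part of any UpVoot SDK.

```javascript
// Hypothetical helper (not an UpVoot API): builds the voice endpoint URL.
const UPVOOT_VOICE_BASE = "wss://api.upvoot.com/voice/";

function buildVoiceUrl(apiKey) {
  if (typeof apiKey !== "string" || apiKey.trim() === "") {
    throw new TypeError("apiKey must be a non-empty string");
  }
  // Percent-encode the key so unexpected characters cannot alter the path.
  return UPVOOT_VOICE_BASE + encodeURIComponent(apiKey.trim());
}
```

Failing early on an empty key keeps a misconfigured deployment from opening a connection that would only be rejected.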
Connection flow
- Your telephony provider initiates a WebSocket upgrade to the endpoint above.
- UpVoot validates the API key. If invalid, it responds with HTTP 401 and rejects the upgrade.
- The agent configuration (system prompt, greeting, voice settings) is loaded based on the API key.
- The agent speaks the greeting immediately after the connection is established.
- Your provider streams raw audio from the caller to UpVoot; UpVoot streams agent audio back to the caller.
- When the caller hangs up, your provider closes the WebSocket. The call is logged and points are deducted.
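The flow above can be observed from the client side by listening to socket lifecycle events. The sketch below assumes the client uses the Node.js `ws` library, which emits `unexpected-response` when the server refuses the upgrade (for example with HTTP 401). `watchConnection` is a hypothetical helper, and a plain `EventEmitter` stands in for the socket so the example runs without a network.

```javascript
const { EventEmitter } = require("events");

// Hypothetical helper: attaches lifecycle logging to a socket that emits
// `ws`-style events. `log` is injectable so it can be exercised offline.
function watchConnection(socket, log = console.log) {
  // Emitted by `ws` when the upgrade is refused, e.g. an invalid API key.
  socket.on("unexpected-response", (req, res) => {
    log(res.statusCode === 401
      ? "upgrade rejected: invalid API key (HTTP 401)"
      : `upgrade failed: HTTP ${res.statusCode}`);
    // Abandon the handshake request if one was provided.
    if (req && typeof req.destroy === "function") req.destroy();
  });
  socket.on("open", () => log("connected: expect the greeting audio next"));
  socket.on("close", (code) => log(`closed with code ${code}`));
  socket.on("error", (err) => log(`socket error: ${err.message}`));
  return socket;
}

// Offline demonstration: each event is emitted once just to exercise its
// handler; a real call would see either a rejection or open followed by close.
const demoLog = [];
const demoSocket = watchConnection(new EventEmitter(), (line) => demoLog.push(line));
demoSocket.emit("unexpected-response", null, { statusCode: 401 });
demoSocket.emit("open");
demoSocket.emit("close", 1000);
```

With a real connection, pass the `ws` client instance instead of the `EventEmitter`.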
Audio streaming
Audio is streamed as raw binary frames over the WebSocket in both directions:
| Direction | Format | Sample rate | Channels | Bit depth |
|---|---|---|---|---|
| Client → UpVoot | PCM (raw) | 16 kHz | Mono | 16-bit signed little-endian |
| UpVoot → Client | PCM (raw) | 16 kHz | Mono | 16-bit signed little-endian |
See Audio Format for conversion examples and codec compatibility.
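To make the format concrete: 16 kHz mono at 16 bits is 32,000 bytes per second, so a 20 ms frame is 640 bytes. The sketch below works through that arithmetic and shows one way to encode and chunk audio. The 20 ms frame size is an assumption chosen for illustration (this page does not specify a frame size), and all function names are hypothetical.

```javascript
// Format constants taken from the table above.
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit
const CHANNELS = 1;         // mono

// Bytes of PCM needed for `ms` milliseconds of audio.
function pcmBytesForMs(ms) {
  return (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * ms) / 1000;
}

// Split a PCM buffer into frames of `frameMs` each; the last may be shorter.
// The 20 ms default is an assumption, not a documented requirement.
function chunkPcm(pcm, frameMs = 20) {
  const frameBytes = pcmBytesForMs(frameMs);
  const frames = [];
  for (let offset = 0; offset < pcm.length; offset += frameBytes) {
    frames.push(pcm.subarray(offset, offset + frameBytes));
  }
  return frames;
}

// Encode float samples in [-1, 1] as 16-bit signed little-endian PCM.
function floatToPcm16le(samples) {
  const out = Buffer.alloc(samples.length * BYTES_PER_SAMPLE);
  samples.forEach((s, i) => {
    const clamped = Math.max(-1, Math.min(1, s));
    out.writeInt16LE(Math.round(clamped * 32767), i * BYTES_PER_SAMPLE);
  });
  return out;
}
```

Each frame returned by `chunkPcm` can be sent as one binary WebSocket message.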
Control messages
In addition to binary audio frames, JSON text frames signal call state. Most are sent by the server; the client can send end_call to end the call:
| Message type | Direction | Description |
|---|---|---|
| audio_start | Server → Client | Agent has started speaking; binary audio frames will follow |
| audio_end | Server → Client | Agent finished speaking; ready to listen |
| transcript | Server → Client | Final transcript of the last turn (agent or caller) |
| interim | Server → Client | Partial (streaming) transcript of caller speech |
| turn_start | Server → Client | LLM has started processing a turn |
| end_call | Client → Server | Instruct the agent to end the call gracefully |
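One way to consume these messages is a small dispatcher keyed by message type. This is a sketch under stated assumptions: the `type`, `role`, and `text` fields follow the client example on this page, while the `{ "type": "end_call" }` payload shape is a guess, since the table gives no schema for it. `handleControlMessage` and `endCallFrame` are hypothetical names.

```javascript
// Dispatch one JSON text frame to a handler keyed by message type.
// Returns the parsed message, or null if the frame is not a JSON object.
function handleControlMessage(text, handlers) {
  let msg;
  try {
    msg = JSON.parse(text);
  } catch {
    return null; // ignore malformed frames instead of crashing the call
  }
  if (msg === null || typeof msg !== "object") return null;
  // Only call handlers defined directly on the handlers object.
  if (Object.prototype.hasOwnProperty.call(handlers, msg.type)) {
    handlers[msg.type](msg);
  }
  return msg;
}

// Assumed payload shape for the client-to-server end_call message.
function endCallFrame() {
  return JSON.stringify({ type: "end_call" });
}

// Example: track whether the agent is currently speaking.
let agentSpeaking = false;
const handlers = {
  audio_start: () => { agentSpeaking = true; },
  audio_end: () => { agentSpeaking = false; },
  transcript: (m) => console.log(`[${m.role}] ${m.text}`),
};
handleControlMessage('{"type":"audio_start"}', handlers);
```

Message types without a handler are parsed and returned but otherwise ignored, so new server message types would not break the client.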
Handling barge-in
UpVoot detects when the caller speaks while the agent is talking and interrupts the agent automatically. You do not need to handle this yourself: caller audio is always processed, whether or not the agent is speaking.
Connection limits
- Maximum call duration: 30 minutes (configurable per agent)
- Maximum turns per call: configurable in agent settings (default 20)
- Silence timeout: 10 seconds of continuous silence ends the call
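A client may want to mirror these limits locally, for example to warn an operator before a call is cut off. The sketch below is only local bookkeeping built on the defaults listed above; `createLimitTracker` is a hypothetical helper, and what counts as a "turn" on the client side is an assumption.

```javascript
// Defaults from the list above; per-agent settings may differ.
const DEFAULT_LIMITS = {
  maxCallMs: 30 * 60 * 1000,   // 30 minutes
  maxTurns: 20,
  silenceTimeoutMs: 10 * 1000, // 10 seconds
};

// Hypothetical client-side tracker; it does not enforce anything itself.
function createLimitTracker(limits = DEFAULT_LIMITS, startedAt = Date.now()) {
  let turns = 0;
  return {
    // Call once per completed turn (assumed: once per final transcript).
    recordTurn() { turns += 1; },
    // Milliseconds left before the maximum call duration is reached.
    remainingMs(now = Date.now()) {
      return Math.max(0, limits.maxCallMs - (now - startedAt));
    },
    // Turns left before the configured maximum is reached.
    remainingTurns() {
      return Math.max(0, limits.maxTurns - turns);
    },
  };
}
```

Passing `startedAt` and `now` explicitly keeps the tracker easy to test without real timers.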
Minimal client example (Node.js)
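const WebSocket = require("ws");

const ws = new WebSocket(
  "wss://api.upvoot.com/voice/sk_live_YOUR_API_KEY"
);

ws.on("open", () => {
  console.log("Connected: agent will speak greeting");
});

// ws v8+ delivers every frame as a Buffer and flags text frames with
// isBinary === false; older versions deliver text frames as strings.
ws.on("message", (data, isBinary) => {
  const isText = typeof data === "string" || isBinary === false;
  if (isText) {
    // Text frame: JSON control message
    const msg = JSON.parse(data.toString());
    if (msg.type === "transcript") {
      console.log(`[${msg.role}] ${msg.text}`);
    }
  } else {
    // Binary frame: PCM audio from the agent, forward it to the caller
    streamToCaller(data);
  }
});

// Placeholder: replace with your telephony provider's playback call
function streamToCaller(pcmBuffer) {}

// Stream PCM audio from the caller to UpVoot
function onCallerAudio(pcmBuffer) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(pcmBuffer);
  }
}
Exotel-specific setup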
In the Exotel Applet builder, create a Connect to WebSocket node and set the URL to:
wss://api.upvoot.com/voice/<api-key>
Set the audio codec to PCM / L16 / 16 kHz (also called "raw linear"). No additional headers are required.
See Supported Providers for a full Exotel walkthrough.