AI · AGENTS | FEB 3, 2025

A Deep Dive Into LiveKit, the Open Source Platform Powering OpenAI’s Speech-To-Speech (STS) Technology at Scale

LiveKit is an open-source WebRTC SFU powering OpenAI’s Speech-to-Speech system. With efficient media forwarding and dynamic participant management, it’s the foundation for building low-latency, real-time communication apps at scale.

Author: Sakthi (Sandy) Santhosh

LiveKit is an open-source WebRTC SFU that powers scalable, low-latency real-time communications, such as OpenAI's Speech-To-Speech system. Instead of relying on resource-intensive MCU architectures, LiveKit forwards media streams efficiently and supports dynamic room and participant management. The platform enables flexible agent registration and dispatching, allowing developers to build everything from basic conferencing tools to advanced multimodal applications with ease.

In this article, let’s discuss LiveKit’s architecture, which enables OpenAI to scale its Speech-To-Speech (STS) system to serve millions of users concurrently. We will explore the older peer-to-peer and MCU approaches, their shortcomings, and the SFU-based architecture that achieves efficiency at scale.

LiveKit is an open-source WebRTC infrastructure designed for real-time audio, video, and data streaming at scale. Unlike monolithic solutions, LiveKit follows a modular and scalable architecture that enables developers to build interactive applications with low-latency communications. This article dissects the core components of LiveKit and explains how they interact to deliver a seamless real-time experience.

NOTE 📝
I wrote this article because, when I studied LiveKit’s documentation, I found it hard to map the concepts together. My goal here is to give an overview of LiveKit in a single article.

Why LiveKit Exists

OpenAI uses WebRTC for real-time STS communication. However, it is inherently a peer-to-peer (P2P) protocol, where two peers exchange audio and video ("media") directly, bypassing central servers. This works well for a small number of users, typically up to three. Since most users lack the bandwidth required to simultaneously upload multiple high-resolution video streams, scaling WebRTC beyond a few participants necessitates a client-server model.

One widely used model is the Multipoint Conferencing Unit (MCU) architecture. In an MCU setup, each user transmits encoded media streams to a central server, which then decodes, composites, and re-encodes them before sending a single stream to each participant. This conserves bandwidth but limits flexibility, as individual audio or video streams cannot be adjusted separately. Additionally, MCU architectures require significant computational resources, making large-scale deployments challenging.

LiveKit instead adopts a Selective Forwarding Unit (SFU) model, which acts as a specialized media router. Instead of processing media streams, the SFU forwards them directly to subscribers without alteration. This approach allows for greater scalability and flexibility. Each participant sends a single uplink stream, which the SFU distributes to other users as needed. Although SFU-based systems require higher downstream bandwidth compared to MCU setups, they provide full control over individual audio and video streams, enabling dynamic UI configurations and real-time optimizations.

To put it simply, an SFU is a broker that connects two or more participants in a namespace (room). This architecture lets AI agents and participants scale independently.

LiveKit’s SFU is built in Go. It supports horizontal scaling, allowing deployment across multiple nodes with Redis-based routing between them. While a single-node setup requires no external dependencies, multi-node deployments leverage Redis for efficient room-based connection management. This architecture ensures that LiveKit can efficiently handle large-scale, real-time communications with minimal latency.

LiveKit architecture describing how a client and agent interact with its SFU architecture.

Additional Features

  • Scalability: Building scalable realtime communication systems is challenging. LiveKit is an opinionated, horizontally-scaling WebRTC SFU (Selective Forwarding Unit) that allows applications to scale to support large numbers of concurrent users.
  • Complexity: Implementing WebRTC from scratch is complex, involving signaling, media handling, networking, and more. LiveKit solves these undifferentiated problems, presenting a consistent set of API primitives across platforms.
  • Fragmented Solutions: Many WebRTC solutions lack advanced features like simulcast, selective subscription, SVC codecs, and end-to-end encryption. LiveKit provides these and more out-of-the-box.
  • Vendor Lock-in: Proprietary WebRTC solutions can lead to vendor lock-in. As an open-source platform, LiveKit makes it easy to switch between its self-hosted and Cloud offerings.

Key Concepts in LiveKit

Rooms

Rooms serve as virtual spaces where participants can join and interact. A room acts as an isolated session where users/participants (agents and real people) exchange audio, video, and data. Rooms are dynamically created and managed through LiveKit’s APIs, allowing flexible session handling.
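
For example, here is a minimal sketch of creating a room ahead of time with the server API (this assumes the livekit-api Python package and that LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET are set in the environment; the room name and limits are illustrative):

import asyncio
from livekit import api

async def create_room():
    lkapi = api.LiveKitAPI()  # credentials are read from the environment
    # Rooms can also be created implicitly when the first participant joins.
    room = await lkapi.room.create_room(
        api.CreateRoomRequest(
            name="my-room",
            empty_timeout=300,      # close the room 5 minutes after it empties
            max_participants=10,
        )
    )
    print("Created room:", room.name, room.sid)
    await lkapi.aclose()

asyncio.run(create_room())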

Participants

Participants are users who join a room to communicate. They can either publish media tracks (audio/video) or subscribe to other participants' streams. LiveKit distinguishes between different types of participants based on their roles. Following are the types of participants that can join a room:

  1. Publishers – These participants transmit media streams (audio, video, or data) to the room. They can be end users broadcasting their media or AI agents generating synthesized speech and video.
  2. Subscribers – These participants consume media streams published by others in the room. Subscribers only receive data and do not necessarily publish any content themselves.
  3. Moderators – Special participants who have administrative control over a room, such as muting participants, removing users, or managing stream permissions.

Tracks

A track represents a single media stream in LiveKit, such as an audio or video feed. Each track is identified separately and can be individually managed. A track is generated by a publisher and consumed by the subscribers in the room.

  • Audio Track: Carries voice data from a participant.
  • Video Track: Represents a visual feed from a participant’s camera.
  • Data Track: Allows arbitrary data transmission between users in a room, useful for synchronized interactions beyond audio and video.

Track Publication: When a participant wants to share a media stream, they publish a track to the room. This process involves encoding and sending media data to LiveKit’s SFU, making it available for others to subscribe to. Publishers can control their tracks, such as enabling/disabling video or changing resolutions dynamically.

Track Subscription: Subscribers receive and render tracks published by others in the room. The SFU optimizes delivery based on network conditions, allowing subscribers to receive lower-resolution streams if bandwidth is constrained. Subscriptions can be dynamically managed, meaning participants can choose which tracks to receive, optimizing bandwidth usage.
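
As a rough sketch (assuming the livekit Python rtc SDK; the track name and synthetic audio source are illustrative), publishing and subscribing look roughly like this:

from livekit import rtc

SAMPLE_RATE = 48000
NUM_CHANNELS = 1

async def publish_audio(room: rtc.Room):
    # Publisher side: create an audio source, wrap it in a local track,
    # and publish it to the room via the SFU.
    source = rtc.AudioSource(SAMPLE_RATE, NUM_CHANNELS)
    track = rtc.LocalAudioTrack.create_audio_track("agent-voice", source)
    await room.local_participant.publish_track(
        track,
        rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE),
    )

def watch_subscriptions(room: rtc.Room):
    # Subscriber side: react whenever a remote participant's track is subscribed.
    @room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.RemoteTrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            print(f"Subscribed to audio from {participant.identity}")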

Clients

Clients are real-world users who interact with agents. When a client joins, the LiveKit server either creates a room for that client automatically or the client specifies the room it wants to join. Once joined, LiveKit dispatches an agent automatically, or the client instructs LiveKit via the API to communicate with a particular agent.
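
To join a room, a client needs an access token that encodes its identity, the room, and its permissions. Here is a minimal sketch of issuing one from your backend (assuming the livekit-api Python package; the identity, room name, and key placeholders are illustrative):

from livekit import api

def build_join_token(identity: str, room_name: str) -> str:
    # Normally generated server-side; the client uses the resulting JWT to connect.
    token = (
        api.AccessToken(api_key="<LIVEKIT_API_KEY>", api_secret="<LIVEKIT_API_SECRET>")
        .with_identity(identity)
        .with_name(identity)
        .with_grants(
            api.VideoGrants(
                room_join=True,
                room=room_name,      # the room this client is allowed to join
                can_publish=True,    # publisher role
                can_subscribe=True,  # subscriber role
            )
        )
    )
    return token.to_jwt()

jwt = build_join_token("end-user-42", "my-room")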

Building Agents

INFORMATION ℹ️
Throughout this article, the examples use LiveKit’s Python SDK. The remaining code snippets are presented solely for educational purposes; the primary objective is instruction rather than a working demonstration.

Now that we understand the overall architecture, let’s dig deeper into each of its components. The framework (SDKs for various languages) turns your program into an "agent" that can join a LiveKit room and interact with other participants. Here's a high-level overview of the lifecycle:

  1. Worker Registration: When you run an agent, it connects to the LiveKit server (specified by you; it can be self-hosted or the SaaS offering) and registers itself as a "worker" via a persistent WebSocket connection. Once registered, the app is on standby, waiting for rooms (sessions with end users) to be created. It exchanges availability and capacity information with the LiveKit server automatically, allowing for correct load balancing of incoming requests.
  2. Agent Dispatch: When an end-user connects to a room, LiveKit server selects an available worker and sends it information about that session. The first worker to accept that request will instantiate your program and join the room. A worker can host multiple instances of your agent simultaneously, running each in its own process for isolation.
  3. Your Program: This is where you take over. Your program can use most features of the LiveKit Python SDK. Agents can also leverage the agent plugin ecosystem to process or synthesize voice and video data.
  4. Room Close: The room will automatically close when the last non-agent participant has left. Remaining agents will be disconnected.

Implementing Agent Functionality

We define a callable (function or method) that handles the tracks (requests) coming from publishers. These tracks are processed by LLMs, and LiveKit offers a diverse range of LLM integrations that you can use to power your agents.

Agents are how users interact with the LLM; they wrap the LLM-handling functionality. Following are the main APIs the framework offers for implementing an agent's functionality:

  1. MultimodalAgent: Uses OpenAI’s multimodal model and realtime API to directly process user audio and generate audio responses, similar to OpenAI’s advanced voice mode, producing more natural-sounding speech.

    User and agent interaction in a `MultimodalAgent`-based workflow.

  2. VoicePipelineAgent: Uses a pipeline of STT, LLM, and TTS models, providing greater control over the conversation flow by allowing applications to modify the text returned by the LLM. (A rough sketch of constructing both agent types follows this list.)

    User and agent interaction in a `VoicePipelineAgent`-based workflow.
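
As referenced above, here is a rough sketch of how the two APIs are constructed inside an agent's entrypoint (this assumes the livekit-agents framework with the OpenAI, Deepgram, and Silero plugins installed; the specific models and instructions are illustrative, not OpenAI's production configuration):

from livekit.agents import JobContext
from livekit.agents.multimodal import MultimodalAgent
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

async def start_multimodal(ctx: JobContext):
    # A single realtime model consumes user audio and produces audio responses.
    model = openai.realtime.RealtimeModel(
        instructions="You are a helpful voice assistant.",
        voice="alloy",
    )
    agent = MultimodalAgent(model=model)
    agent.start(ctx.room)

async def start_pipeline(ctx: JobContext):
    # Separate VAD -> STT -> LLM -> TTS stages; each stage can be swapped,
    # and the LLM's text can be modified before it is spoken.
    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=openai.TTS(),
    )
    agent.start(ctx.room)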

Agent/Worker Registration

The following code registers a single worker (agent) instance that remains idle until it is dispatched to a room, for example when a publisher joins.

from livekit.agents import WorkerOptions, WorkerType, cli

opts = WorkerOptions(
    entrypoint_fnc=entrypoint_fnc,
    request_fnc=request_fnc,
    prewarm_fnc=prewarm_fnc,
    load_fnc=load_fnc,
    load_threshold=load_threshold,
    permissions=permissions,
    worker_type=WorkerType.ROOM,
)

cli.run_app(opts)

This snippet initializes a worker for a LiveKit server with a custom configuration. The worker is configured using WorkerOptions, which accepts several parameters:

  1. entrypoint_fnc: The entrypoint function, which is the main function that gets called when a job is assigned to the worker. It will contain one of the APIs discussed above, VoicePipelineAgent or MultimodalAgent.
  2. request_fnc: A function that inspects incoming requests to decide if the current worker should handle them.
  3. prewarm_fnc: A function used to perform any necessary initialization when a new process is spawned.
  4. load_fnc: A function that monitors and reports the system load, such as CPU or RAM usage.
  5. load_threshold: The maximum load value above which new worker processes will not be spawned.
  6. permissions: The permissions granted to the worker, such as subscribing to tracks, publishing data, or updating metadata.
  7. worker_type: The type of worker to create. It can be either WorkerType.ROOM (the agent is dispatched once per room) or WorkerType.PUBLISHER (the agent is dispatched once for each publisher in the room). In this example, it's set to WorkerType.ROOM.

Finally, the worker is started by calling cli.run_app(opts), which uses the configuration defined in opts.

This configuration is typical for setting up a worker in a LiveKit server environment to handle real-time communications.
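
To make these parameters concrete, here is an illustrative sketch of what the callables passed to WorkerOptions might look like (the bodies below are assumptions for demonstration, not LiveKit's reference implementations):

from livekit.agents import (
    AutoSubscribe,
    JobContext,
    JobProcess,
    JobRequest,
    WorkerPermissions,
)

async def entrypoint_fnc(ctx: JobContext):
    # Called once a job is assigned: connect to the room, then start a
    # MultimodalAgent or VoicePipelineAgent (see the previous section).
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

async def request_fnc(req: JobRequest):
    # Inspect the incoming job and decide whether this worker should take it.
    await req.accept(identity="test-agent")

def prewarm_fnc(proc: JobProcess):
    # One-time, per-process initialization (e.g. loading a VAD model).
    proc.userdata["prewarmed"] = True

def load_fnc(worker) -> float:
    # Report the current load as a value between 0 and 1; by default LiveKit
    # uses CPU utilization.
    return 0.3

load_threshold = 0.75  # stop accepting new jobs above 75% load
permissions = WorkerPermissions(can_publish=True, can_subscribe=True)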

Agent Dispatching

As part of an agent's lifecycle, agent workers register with the LiveKit server and remain idle until they are assigned to a room to interact with end users. The process of assigning an agent to a room is called dispatching an agent.

LiveKit’s dispatch system is optimized for concurrency and low latency. It supports hundreds of thousands of new connections per second and dispatches an agent to a room in under 150ms.

Automatic Agent Dispatching

By default, agents are automatically dispatched when rooms are created. If a participant connects to LiveKit and the room does not already exist, it is created automatically, and an agent is assigned to it.

Automatic dispatch is the best option if the same agent needs to be assigned to new participants consistently.

The key difference between the two dispatch modes is the agent_name field in WorkerOptions: setting it when registering the agent opts the worker out of automatic dispatch.

opts = WorkerOptions(
    ...
    agent_name="test-agent",
)

IMPORTANT 🔴
The agent will not be automatically dispatched to any newly created rooms when the agent_name is set.

If the agent isn’t automatically dispatched to a room when a participant joins, how do you explicitly call it into the room instead? This is done through the API provided by the framework.

Dispatching Agents via API

For greater control over when and how agents join rooms, explicit dispatch is available. This approach leverages the same worker system, so you run agent workers in the same way. The only difference is that dispatching an agent is now the responsibility of the participant (publisher) joining the room.

# Participant: Publisher
import asyncio
from livekit import api

room_name = "my-room"
agent_name = "test-agent" # Explicit (chosen by the participant)

async def create_explicit_dispatch():
    lkapi = api.LiveKitAPI()  # reads LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET from the environment
    dispatch = await lkapi.agent_dispatch.create_dispatch(
        api.CreateAgentDispatchRequest(
            agent_name=agent_name, room=room_name, metadata="my_job_metadata"
        )
    )
    print("Created:", dispatch)

    dispatches = await lkapi.agent_dispatch.list_dispatch(room_name=room_name)
    print(f"There are {len(dispatches)} dispatches in {room_name}.")
    await lkapi.aclose()

asyncio.run(create_explicit_dispatch())
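
On the agent side, the metadata passed in the dispatch request is available on the job, so the entrypoint can use it to parameterize the session. A small sketch under the same assumptions as above (the metadata value mirrors the dispatch example; the logging is illustrative):

from livekit.agents import JobContext

async def entrypoint_fnc(ctx: JobContext):
    await ctx.connect()
    # "my_job_metadata" from CreateAgentDispatchRequest arrives on the job.
    print("Dispatched to", ctx.room.name, "with metadata:", ctx.job.metadata)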

Conclusion

With that, you now know what you need to get started building solutions around LiveKit. In summary, LiveKit represents a groundbreaking open-source framework that redefines real-time communication by combining a scalable SFU architecture with modular design principles. By efficiently managing media streams and offering robust room and participant management, it empowers developers to create innovative applications—from simple conferencing tools to advanced Speech-To-Speech systems—without the overhead of traditional MCU architectures. With a deep understanding of its core concepts, from agent registration and dispatching to track publication and subscription, you are now equipped to harness LiveKit's full potential.