Creating a Real-Time Audio Service Using OpenAI

힘센캥거루
February 7, 2025 (edited)
nextjs

Even though Deepseek R1 has been released, there's no denying that the APIs provided by OpenAI are still excellent and appealing.

Today, we're going to create a real-time audio web service using OpenAI's Realtime API.

1. What is the Realtime API?

[Image 1]

A service released by OpenAI on October 1, 2024, supporting real-time voice input and output.

Previously, to talk to ChatGPT by voice, you had to run a speech recognition model like Whisper to turn the audio into text, send that text to the model, and then convert the model's text response back into speech with text-to-speech.

This chained approach adds noticeably more latency than you might expect.
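For reference, that chained pipeline looks roughly like this (a minimal sketch assuming the official openai Node SDK; the file names and prompt are placeholders):

import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// 1. Speech to text with Whisper
const transcription = await openai.audio.transcriptions.create({
  model: "whisper-1",
  file: fs.createReadStream("question.wav"),
});

// 2. Text in, text out with a chat model
const chat = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: transcription.text }],
});

// 3. Text back to speech with TTS
const speech = await openai.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: chat.choices[0].message.content ?? "",
});
fs.writeFileSync("answer.mp3", Buffer.from(await speech.arrayBuffer()));

Three sequential round trips, each waiting on the previous one, which is exactly where the delay comes from.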

The Realtime API implements audio input and output directly.

You can implement real-time audio input and output using the GPT-4o model over either WebSocket or WebRTC.
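To give a feel for the WebSocket option (everything below uses WebRTC), a rough server-side sketch with the ws package might look like the following; the endpoint, beta header, and event names are taken from the official docs as of this writing, so treat them as assumptions and double-check:

import WebSocket from "ws";

const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview";
const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", () => {
  // Ask the model for a response; audio would be streamed in via input_audio_buffer.append events
  ws.send(
    JSON.stringify({
      type: "response.create",
      response: { modalities: ["text"], instructions: "Say hello." },
    })
  );
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  console.log(event.type); // e.g. response.text.delta, response.done
});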

For more details, please check the official website.

2. OpenAI Realtime Blocks

[Image 2]

It hasn't been long since the release, yet someone has already built on the API and published it as open source on GitHub.

The homepage is so polished that I initially thought Vercel had built the SDK.

Here's the creator's homepage.

[Image 3]

When you go to install it, there's no yarn or npm package; you're simply told to take the pieces you need.

The code is provided as well.

Just skim through it and copy the necessary parts into your project.

There are various formats like Classic, Dock, Siri, etc., and I liked the ChatGPT version the most.

[Image 4]

3. Implementation (Ctrl + C & V)

Calling this an implementation is a bit embarrassing; it's really just copy-pasting.

First, install all dependencies.

The component I selected had only one dependency.

yarn add framer-motion

And add one hook.

You can find the full code in the documentation under "Create the WebRTC Hook".

[Image 5]

"use client";
 
import { useState, useRef, useEffect } from "react";
import { Tool } from "@/lib/tools";

const useWebRTCAudioSession = (voice: string, tools?: Tool[]) => {
  const [status, setStatus] = useState("");
  const [isSessionActive, setIsSessionActive] = useState(false);
  const audioIndicatorRef = useRef<HTMLDivElement | null>(null);
  const audioContextRef = useRef<AudioContext | null>(null);
  const audioStreamRef = useRef<MediaStream | null>(null);
  const peerConnectionRef = useRef<RTCPeerConnection | null>(null);

... omitted

To be honest, I just dropped it in as-is for the sake of quick development...

Depending on the model you use, add or remove "audio" and "text" in the modalities field as appropriate.

If you don't want anything being logged to the console after development, feel free to remove all the console.log calls.
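For context, the core of what the hook does under the hood is roughly the following (a simplified sketch based on OpenAI's WebRTC guide, not the hook's actual code; it assumes the /api/session route created in the next step):

// 1. Get an ephemeral key from our own backend
const tokenResponse = await fetch("/api/session", { method: "POST" });
const sessionData = await tokenResponse.json();
const ephemeralKey = sessionData.client_secret.value;

// 2. Create the peer connection and play whatever audio the model sends back
const pc = new RTCPeerConnection();
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

// 3. Send the microphone to the model
const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(mic.getTracks()[0], mic);

// 4. Data channel for JSON events (transcripts, tool calls, ...)
const dc = pc.createDataChannel("oai-events");
dc.onmessage = (e) => console.log(JSON.parse(e.data));

// 5. Standard SDP offer/answer exchange against the Realtime endpoint
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
const sdpResponse = await fetch(
  "https://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview",
  {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
  }
);
await pc.setRemoteDescription({ type: "answer", sdp: await sdpResponse.text() });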

Next, create a session API route that creates the Realtime session the browser connects with.

import { NextResponse } from 'next/server';

export async function POST() {
    try {
        if (!process.env.OPENAI_API_KEY) {
            throw new Error(`OPENAI_API_KEY is not set`);
        }
        const response = await fetch("https://api.openai.com/v1/realtime/sessions", {
            method: "POST",
            headers: {
                Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
                "Content-Type": "application/json",
            },
            body: JSON.stringify({
                model: "gpt-4o-mini-realtime-preview",
                voice: "alloy",
                modalities: ["audio", "text"],
                // instructions:"You are a helpful assistant for the website named OpenAI Realtime Blocks, a UI Library for Nextjs developers who want to integrate pre-made UI components using TailwindCSS, Framer Motion into their web projects. It works using an OpenAI API Key and the pre-defined 'use-webrtc' hook that developers can copy and paste easily into any Nextjs app. There are a variety of UI components that look beautiful and react to AI Voice, which should be a delight on any modern web app.",
                tools: tools,
                tool_choice: "auto",
            }),
        });

        if (!response.ok) {
            throw new Error(`API request failed with status ${response.status}`);
        }

        const data = await response.json();

        // Return the JSON response to the client
        return NextResponse.json(data);
    } catch (error) {
        console.error("Error fetching session data:", error);
        return NextResponse.json({ error: "Failed to fetch session data" }, { status: 500 });
    }
}

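// Tool definitions referenced by the session body above. If this lives in the same file
// as the POST handler, declaring the const below the handler still works, because POST
// only runs after the whole module has been evaluated.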
const tools = [
    {
        "type": "function",
        "name": "getPageHTML",
        "description": "Gets the HTML for the current page",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
    {
        "type": "function", 
        "name": "getWeather",
        "description": "Gets the current weather",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
    {
        "type": "function",
        "name": "getCurrentTime",
        "description": "Gets the current time",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
];

A quick reminder: don't forget to set the API key in the .env.local file.
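For what it's worth, the file is just a single line (the variable name has to match what the session route reads):

# .env.local (keep this file out of version control)
OPENAI_API_KEY=sk-...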

Go to that page of the documentation and you'll find chatgpt.tsx and page.tsx ready to use as they are.

Copy them over.

I configured it as follows:

import React, { useEffect, useState } from "react";
import { motion } from "framer-motion";
import useWebRTCAudioSession from "@/hooks/use-webrtc";
 
const ChatGPT: React.FC = () => {
  const { currentVolume, isSessionActive, handleStartStopClick, msgs } =
    useWebRTCAudioSession("alloy");
 
  const silenceTimeoutRef = React.useRef<NodeJS.Timeout | null>(null);
 
  const [mode, setMode] = useState<"idle" | "thinking" | "responding" | "volume" | "">(
    ""
  );
  const [volumeLevels, setVolumeLevels] = useState([0, 0, 0, 0]);

... omitted

And I also created a page.tsx.

import ChatGPT from "@/components/aiChat/AudioChatGPT";
 
export default function Page() {
  return (
    <main className="flex items-center justify-center h-screen">
      <ChatGPT />
    </main>
  );
}

When you test it now, you can see that it works well.

I changed the SVG color to gray for now as I don't have dark mode yet.

[Image 6]

4. Review

The cost is $0.06 per minute for audio input and $0.24 per minute for audio output.
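As a back-of-the-envelope example, a ten-minute conversation with roughly seven minutes billed as audio input and three minutes billed as audio output would come to about 7 × $0.06 + 3 × $0.24 = $1.14 (text tokens are billed separately, so the real figure will be a little higher).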

It's hard to gauge, but after testing with short conversations today, the cost came out as follows.

[Image 7]

Over the last two months I spent $2 on streamText, but I spent $1.50 just today...

Obviously, it's much cheaper than using a paid service, but it's more expensive than expected.

My wife asked me to implement GPT for learning English, so I might try it with this.
