Even though DeepSeek R1 has been released, there's no denying that the APIs OpenAI provides are still excellent and appealing.
Today, we're going to create a real-time audio web service using OpenAI's Realtime API.
1. What is the Realtime API?

The Realtime API is a service OpenAI released on October 1, 2024 that supports real-time voice input and output.
Previously, to talk to ChatGPT by voice, you had to run a speech recognition model such as Whisper to turn your audio into text, send that text to the model, and then convert the model's reply back into speech with text-to-speech.
That round trip adds more latency than you might expect.
The Realtime API handles audio input and output directly.
You can implement real-time audio in and out with the GPT-4o model over WebSocket or WebRTC.
For more details, please check the official website.
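To make the latency point concrete, here's a minimal sketch of the older round-trip pipeline using the openai Node SDK (the file names, prompt, and model choices are just placeholders, not part of this project):

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The old round trip: speech -> text -> chat completion -> text -> speech.
async function voiceRoundTrip() {
  // 1. Transcribe the user's recorded audio with Whisper.
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream("user-question.wav"), // placeholder file
    model: "whisper-1",
  });

  // 2. Send the transcript to a chat model.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: transcription.text }],
  });
  const reply = completion.choices[0].message.content ?? "";

  // 3. Turn the reply back into audio with text-to-speech.
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: reply,
  });
  fs.writeFileSync("reply.mp3", Buffer.from(await speech.arrayBuffer()));
}

Every one of those three network calls has to finish before the user hears anything, which is where the delay comes from.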
2. OpenAI Realtime Blocks

It hasn't been long since the release, and someone has already built on the API and published it as open source on GitHub.
The homepage is so polished that I initially thought Vercel had created the SDK.
Here's the creator's homepage.

There's nothing to install with yarn or npm; you're simply advised to take what you need.
The code is right there in the docs.
Just skim through it and copy the necessary parts into your project.
There are various styles like Classic, Dock, Siri, etc., and I liked the ChatGPT version the most.

3. Implementation (Ctrl + C & V)
Calling this an implementation is a bit embarrassing; it's closer to copy-pasting.
First, install the dependencies.
The component I chose has only one:

yarn add framer-motion

Then add one hook.
The full hook code is in the documentation, under "Create the WebRTC Hook".

"use client";
import { useState, useRef, useEffect } from "react";
import { Tool } from "@/lib/tools";
const useWebRTCAudioSession = (voice: string, tools?: Tool[]) => {
const [status, setStatus] = useState("");
const [isSessionActive, setIsSessionActive] = useState(false);
const audioIndicatorRef = useRef<HTMLDivElement | null>(null);
const audioContextRef = useRef<AudioContext | null>(null);
const audioStreamRef = useRef<MediaStream | null>(null);
const peerConnectionRef = useRef<RTCPeerConnection | null>(null);
... omittedTo be honest, I added it mindlessly for quick development...
Depending on the model you use, add or remove "audio" or "text" in the modalities array as needed.
If you don't want anything logged to the console once you're done developing, feel free to remove all the console.log calls.
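If you're curious what the omitted part of the hook actually does, it follows OpenAI's documented WebRTC connection flow. The rough sketch below is not the library's exact code; the /api/session path and the variable names are assumptions based on this setup:

// Simplified sketch of the connection flow inside the hook.
async function startSession() {
  // 1. Ask our own backend for an ephemeral Realtime session.
  const sessionRes = await fetch("/api/session", { method: "POST" });
  const session = await sessionRes.json();
  const ephemeralKey: string = session.client_secret.value;

  // 2. Set up the peer connection and play remote audio when it arrives.
  const pc = new RTCPeerConnection();
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (event) => {
    audioEl.srcObject = event.streams[0];
  };

  // 3. Send the microphone track to the model.
  const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  micStream.getTracks().forEach((track) => pc.addTrack(track, micStream));

  // 4. A data channel carries Realtime API events (transcripts, tool calls, ...).
  const dataChannel = pc.createDataChannel("oai-events");

  // 5. Standard SDP offer/answer exchange with the Realtime endpoint.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const sdpRes = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-mini-realtime-preview",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
    },
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await sdpRes.text() });
}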
Next, create a session API route that issues the Realtime session the connection uses.
import { NextResponse } from 'next/server';

export async function POST() {
  try {
    if (!process.env.OPENAI_API_KEY) {
      throw new Error(`OPENAI_API_KEY is not set`);
    }
    const response = await fetch("https://api.openai.com/v1/realtime/sessions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini-realtime-preview",
        voice: "alloy",
        modalities: ["audio", "text"],
        // instructions:"You are a helpful assistant for the website named OpenAI Realtime Blocks, a UI Library for Nextjs developers who want to integrate pre-made UI components using TailwindCSS, Framer Motion into their web projects. It works using an OpenAI API Key and the pre-defined 'use-webrtc' hook that developers can copy and paste easily into any Nextjs app. There are a variety of UI components that look beautiful and react to AI Voice, which should be a delight on any modern web app.",
        tools: tools,
        tool_choice: "auto",
      }),
    });
    if (!response.ok) {
      throw new Error(`API request failed with status ${response.status}`);
    }
    const data = await response.json();
    // Return the JSON response to the client
    return NextResponse.json(data);
  } catch (error) {
    console.error("Error fetching session data:", error);
    return NextResponse.json({ error: "Failed to fetch session data" }, { status: 500 });
  }
}
const tools = [
  {
    "type": "function",
    "name": "getPageHTML",
    "description": "Gets the HTML for the current page",
    "parameters": {
      "type": "object",
      "properties": {}
    }
  },
  {
    "type": "function",
    "name": "getWeather",
    "description": "Gets the current weather",
    "parameters": {
      "type": "object",
      "properties": {}
    }
  },
  {
    "type": "function",
    "name": "getCurrentTime",
    "description": "Gets the current time",
    "parameters": {
      "type": "object",
      "properties": {}
    }
  },
];

A quick reminder: don't forget to set the API key in the .env.local file.
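For reference, .env.local only needs the one key (the value below is obviously a placeholder):

# .env.local
OPENAI_API_KEY=sk-...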
Open that page of the documentation and you'll find chatgpt.tsx and page.tsx as they are.
Copy them over.
I configured mine as follows:
import React, { useEffect, useState } from "react";
import { motion } from "framer-motion";
import useWebRTCAudioSession from "@/hooks/use-webrtc";

const ChatGPT: React.FC = () => {
  const { currentVolume, isSessionActive, handleStartStopClick, msgs } =
    useWebRTCAudioSession("alloy");
  const silenceTimeoutRef = React.useRef<NodeJS.Timeout | null>(null);
  const [mode, setMode] = useState<"idle" | "thinking" | "responding" | "volume" | "">(
    ""
  );
  const [volumeLevels, setVolumeLevels] = useState([0, 0, 0, 0]);
  // ... omitted
And I also created a page.tsx.
import ChatGPT from "@/components/aiChat/AudioChatGPT";

export default function Page() {
  return (
    <main className="flex items-center justify-center h-screen">
      <ChatGPT />
    </main>
  );
}

When you test it now, you can see that it works well.
I changed the SVG color to gray for now as I don't have dark mode yet.
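One note on the tools defined earlier: the model only announces that it wants to call them; the client has to execute them and send the result back over the data channel. The library's hook takes care of this, but conceptually it looks something like the sketch below (the event and message types are the Realtime API's; the handler map and function are my own hypothetical names):

// Hypothetical: wire up client-side tool execution on the Realtime data channel.
function handleToolCalls(dataChannel: RTCDataChannel) {
  const toolHandlers: Record<string, () => unknown> = {
    getPageHTML: () => document.documentElement.outerHTML,
    getWeather: () => ({ note: "placeholder result" }),
    getCurrentTime: () => new Date().toISOString(),
  };

  dataChannel.addEventListener("message", (e) => {
    const event = JSON.parse(e.data);

    // The model has finished emitting a function call and its arguments.
    if (event.type === "response.function_call_arguments.done") {
      const handler = toolHandlers[event.name];
      const output = handler ? handler() : { error: "unknown tool" };

      // Return the result to the conversation, then ask the model to respond again.
      dataChannel.send(
        JSON.stringify({
          type: "conversation.item.create",
          item: {
            type: "function_call_output",
            call_id: event.call_id,
            output: JSON.stringify(output),
          },
        }),
      );
      dataChannel.send(JSON.stringify({ type: "response.create" }));
    }
  });
}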

4. Review
The cost is $0.06 per minute for audio input and $0.24 per minute for audio output.
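For a rough sense of scale: at those rates, a five-minute conversation split evenly between you talking and the model talking works out to about 2.5 × $0.06 + 2.5 × $0.24 ≈ $0.75.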
It's hard to gauge, but after testing with short conversations today, the cost came out as follows.

Over the last two months I spent about $2 on streamText, but I spent $1.50 today alone...
Obviously, it's much cheaper than using a paid service, but it's more expensive than expected.
My wife asked me to put together something with GPT for learning English, so I might try it with this.