Building a Real-Time Audio Service with OpenAI

힘센캥거루·2025-02-07

DeepSeek R1 has been released, but it goes without saying that the API OpenAI provides is still excellent and appealing.

Today I'm going to build a real-time audio web service using OpenAI's Realtime API.

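1. What is the Realtime API?

It is a service OpenAI released on October 1, 2024, and it supports real-time voice input and output.

Previously, to interact with ChatGPT by voice, you had to use a speech recognition model such as Whisper to turn audio into text, send that text to the model, and then run the model's response through text-to-speech to play it back.

Chaining these steps adds more latency than you might expect.

The Realtime API handles audio input and output directly.

Using a GPT-4o model over WebSocket or WebRTC, you can implement real-time audio input and output.

For details, check the official site.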

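2. OpenAI Realtime Blocks

The API hasn't been out long, yet some angel has already built an implementation on top of it and published it as open source on GitHub.

The homepage itself is so pretty that I thought it was an SDK made by Vercel.

Here is the creator's homepage.

The installation section has no yarn or npm package; it just tells you to take the parts you need and use them.

The code is all right there.

Skim it and copy only the parts you need into your own project.

There are several styles, such as Classic, Dock, and Siri, and I liked the ChatGPT version the most.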

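3. Implementation (Ctrl + C & V)

Calling this an implementation is almost embarrassing; it is closer to copy and paste.

First, install all the dependencies.

The variant I chose had only one dependency.

yarn add framer-motion

Then add a hook.

For the full code, see "Create the WebRTC Hook" in the documentation.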

/src/hooks/use-webrtc.ts

"use client";
 
import { useState, useRef, useEffect } from "react";
import { Tool } from "@/lib/tools";
 
const useWebRTCAudioSession = (voice: string, tools?: Tool[]) => {
  const [status, setStatus] = useState("");
  const [isSessionActive, setIsSessionActive] = useState(false);
  const audioIndicatorRef = useRef<HTMLDivElement | null>(null);
  const audioContextRef = useRef<AudioContext | null>(null);
  const audioStreamRef = useRef<MediaStream | null>(null);
  const peerConnectionRef = useRef<RTCPeerConnection | null>(null);
 
(omitted)...

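Honestly, for the sake of fast development, I just added it without thinking...

Depending on the model you use, add or remove audio or text as appropriate in the modalities setting partway through the hook.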
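As a rough illustration of that setting, the sketch below builds a `session.update` event carrying a chosen list of modalities. `buildSessionUpdate` and the `Modality` type are hypothetical names for illustration and are not taken from the hook in the documentation, which may set modalities in a different place.

```typescript
// Hypothetical helper: builds a Realtime "session.update" event
// that enables only the requested modalities.
type Modality = "audio" | "text";

interface SessionUpdateEvent {
  type: "session.update";
  session: { modalities: Modality[] };
}

function buildSessionUpdate(modalities: Modality[]): SessionUpdateEvent {
  // Drop duplicates while keeping the caller's order.
  const unique = modalities.filter((m, i) => modalities.indexOf(m) === i);
  if (unique.length === 0) {
    throw new Error("At least one modality is required");
  }
  return { type: "session.update", session: { modalities: unique } };
}

// Example: a text-only session.
const textOnly = buildSessionUpdate(["text"]);
console.log(JSON.stringify(textOnly));
```

Keeping the list in one helper means switching between voice and text-only behavior is a one-line change at the call site.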

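If you don't want logs showing up in the console once development is done, you can also remove all the console.log calls.

Then create the session route used to set up the realtime connection.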

/src/app/api/session/route.ts

import { NextResponse } from 'next/server';
 
export async function POST() {
    try {        
        if (!process.env.OPENAI_API_KEY){
            throw new Error(`OPENAI_API_KEY is not set`);
 
        }
        const response = await fetch("https://api.openai.com/v1/realtime/sessions", {
            method: "POST",
            headers: {
                Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
                "Content-Type": "application/json",
            },
            body: JSON.stringify({
                model: "gpt-4o-mini-realtime-preview",
                voice: "alloy",
                modalities: ["audio", "text"],
                // instructions:"You are a helpful assistant for the website named OpenAI Realtime Blocks, a UI Library for Nextjs developers who want to integrate pre-made UI components using TailwindCSS, Framer Motion into their web projects. It works using an OpenAI API Key and the pre-defined 'use-webrtc' hook that developers can copy and paste easily into any Nextjs app. There are a variety of UI components that look beautiful and react to AI Voice, which should be a delight on any modern web app.",
                tools: tools,
                tool_choice: "auto",
            }),
        });
 
        if (!response.ok) {
            throw new Error(`API request failed with status ${response.status}`);
        }
 
        const data = await response.json();
 
        // Return the JSON response to the client
        return NextResponse.json(data);
    } catch (error) {
        console.error("Error fetching session data:", error);
        return NextResponse.json({ error: "Failed to fetch session data" }, { status: 500 });
    }
}
 
const tools = [
    {
        "type": "function",
        "name": "getPageHTML",
        "description": "Gets the HTML for the current page",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
    {
        "type": "function", 
        "name": "getWeather",
        "description": "Gets the current weather",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
    {
        "type": "function",
        "name": "getCurrentTime",
        "description": "Gets the current time",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
];
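
The route above only declares the tools; when the model calls one, something on the client still has to run it. A minimal sketch of such a dispatcher follows, assuming the call arguments arrive as a JSON-encoded string. `toolHandlers`, `runTool`, and the handler body are hypothetical names for illustration, not part of the Realtime Blocks code, and how the hook itself wires this up is not shown here.

```typescript
// Hypothetical client-side registry mapping tool names to handlers.
type ToolHandler = (args: Record<string, unknown>) => unknown;

const toolHandlers: Record<string, ToolHandler> = {
  // Mirrors the "getCurrentTime" tool declared in the session route.
  getCurrentTime: () => ({ time: new Date().toISOString() }),
};

interface ToolResult {
  ok: boolean;
  result?: unknown;
  error?: string;
}

function runTool(name: string, rawArgs: string): ToolResult {
  // Only accept names registered directly on the handler map.
  if (!Object.prototype.hasOwnProperty.call(toolHandlers, name)) {
    return { ok: false, error: `Unknown tool: ${name}` };
  }
  try {
    // Assumption: arguments are delivered as a JSON-encoded string.
    const args = rawArgs.trim() === "" ? {} : JSON.parse(rawArgs);
    return { ok: true, result: toolHandlers[name](args) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

console.log(runTool("getCurrentTime", "{}").ok);
```

Returning a result object instead of throwing lets the caller report a failed tool call back to the model rather than breaking the session.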

At the risk of stating the obvious, don't forget to set your API key in the .env.local file.

The documentation page has chatgpt.tsx and page.tsx ready to copy.

Copy them over.

I set mine up as follows.

/src/components/aiChat/AudioChatGPT.tsx

"use client";

import React, { useEffect, useState } from "react";
import { motion } from "framer-motion";
import useWebRTCAudioSession from "@/hooks/use-webrtc";
 
const ChatGPT: React.FC = () => {
  const { currentVolume, isSessionActive, handleStartStopClick, msgs } =
    useWebRTCAudioSession("alloy");
 
  const silenceTimeoutRef = React.useRef<NodeJS.Timeout | null>(null);
 
  const [mode, setMode] = useState<"idle" | "thinking" | "responding" | "volume" | "">(
    ""
  );
  const [volumeLevels, setVolumeLevels] = useState([0, 0, 0, 0]);
 
  ... (omitted)

I also created a page.tsx.

/src/app/services/audio/page.tsx

import ChatGPT from "@/components/aiChat/AudioChatGPT";
 
export default function Page() {
  return (
    <main className="flex items-center justify-center h-screen">
      <ChatGPT />
    </main>
  );
}

Now when you test it, you can see that it works well.

My site doesn't have a dark mode yet, so I changed the SVG color to gray.

4. Afterthoughts

The cost is $0.06 per minute of audio input and $0.24 per minute of audio output.
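
To make those rates more concrete, here is a small estimator. It is only a sketch: the two constants are the per-minute figures quoted above, and `estimateAudioCostUsd` is a hypothetical helper name, so check current pricing before relying on the numbers.

```typescript
// Per-minute rates as quoted in this article (USD); actual pricing may differ.
const INPUT_USD_PER_MIN = 0.06;
const OUTPUT_USD_PER_MIN = 0.24;

function estimateAudioCostUsd(inputMinutes: number, outputMinutes: number): number {
  if (inputMinutes < 0 || outputMinutes < 0) {
    throw new RangeError("Minutes must be non-negative");
  }
  return inputMinutes * INPUT_USD_PER_MIN + outputMinutes * OUTPUT_USD_PER_MIN;
}

// Example: 10 minutes of user speech and 5 minutes of model speech.
console.log(estimateAudioCostUsd(10, 5).toFixed(2)); // "1.80"
```

At these rates a minute of output costs four times as much as a minute of input, so how long the model talks dominates the bill.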

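That is hard to get a feel for, so here is what I was charged after testing it in short conversations throughout today.

Over the past two months I spent $2 on streamText, but today alone I spent $1.50...

It is of course still much cheaper than paying for a subscription, but it costs more than I expected.

My wife asked me to build a GPT for studying English, so I should give it a try with this.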