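Building a Real-Time Audio Service with OpenAI

Even with DeepSeek R1 out, there is no arguing that OpenAI's APIs are still excellent and appealing.

Today we will use OpenAI's Realtime API to build a real-time audio web service.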

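1. What is the Realtime API?

It is a service OpenAI launched on October 1, 2024 that supports real-time voice input and output.

Previously, talking to ChatGPT by voice meant converting the audio to text with a speech recognition model such as Whisper, sending that text, and then running the model's response through text-to-speech.

That chain of steps tends to add noticeable latency.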
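To make the source of that delay concrete, here is a minimal sketch of the older cascaded flow. The three stage functions are hypothetical stand-ins (not real SDK calls) for a speech-to-text model such as Whisper, the chat model, and a text-to-speech step. The only point is that each stage has to finish before the next one can start.

```typescript
// Each stage is an async function; in a real app each would be a separate network request.
type Stage<I, O> = (input: I) => Promise<O>;

// Runs one voice turn through the cascaded pipeline: audio -> text -> reply text -> audio.
async function cascadedVoiceTurn(
  audio: Uint8Array,
  transcribe: Stage<Uint8Array, string>,
  chat: Stage<string, string>,
  synthesize: Stage<string, Uint8Array>,
): Promise<Uint8Array> {
  const userText = await transcribe(audio); // wait for speech recognition
  const replyText = await chat(userText); // then wait for the model's text reply
  return synthesize(replyText); // then wait for text-to-speech
}
```

Because the three awaits run strictly one after another, the user hears nothing until all of them have completed.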

The Realtime API handles audio input and output directly.

It uses GPT-4o models and streams audio in real time over WebSocket or WebRTC.

See the official site for details.

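2. OpenAI Realtime Blocks

The API has not been out for long, but some angel has already built on it and published the result on GitHub as an open-source project.

The homepage is so polished that I assumed it was an SDK from Vercel.

Below is the creator's homepage.

The installation section has no yarn or npm package; you simply take the parts you need.

All of the code is there as well.

Skim through it and move only the parts you need into your own project.

There are several styles, including Classic, Dock, and Siri. My favorite is the ChatGPT version.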

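3. Implementation (Ctrl + C & V)

This barely counts as implementation; it is closer to plain copy and paste.

First, install the dependencies.

The style I chose needs only one.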

yarn add framer-motion

Next, add a hook.

For the full code, see the section on creating the WebRTC hook in the documentation.

"use client";
 
import { useState, useRef, useEffect } from "react";
import { Tool } from "@/lib/tools";

const useWebRTCAudioSession = (voice: string, tools?: Tool[]) => {
  const [status, setStatus] = useState("");
  const [isSessionActive, setIsSessionActive] = useState(false);
  const audioIndicatorRef = useRef<HTMLDivElement | null>(null);
  const audioContextRef = useRef<AudioContext | null>(null);
  const audioStreamRef = useRef<MediaStream | null>(null);
  const peerConnectionRef = useRef<RTCPeerConnection | null>(null);

(rest omitted...)

Honestly, to move fast, I just added it blindly...

Depending on the model you use, add or remove audio or text in modalities as appropriate.
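As a rough illustration of what changing modalities looks like, here is a sketch that assumes the setting is sent as a Realtime API `session.update` event. `buildSessionUpdate` is a hypothetical helper of mine, not something from the hook, and I am assuming text stays enabled in both cases.

```typescript
type Modality = "audio" | "text";

interface SessionUpdateEvent {
  type: "session.update";
  session: { modalities: Modality[] };
}

// Hypothetical helper: text is always on; audio is added only when spoken replies are wanted.
function buildSessionUpdate(wantAudio: boolean): SessionUpdateEvent {
  const modalities: Modality[] = wantAudio ? ["audio", "text"] : ["text"];
  return { type: "session.update", session: { modalities } };
}

// The event is serialized to JSON before being sent over the open connection.
const textOnlyPayload = JSON.stringify(buildSessionUpdate(false));
const audioAndTextPayload = JSON.stringify(buildSessionUpdate(true));
```

Keeping the event construction in one small function makes it easy to switch between a text-only session and a spoken one without touching the rest of the hook.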

If you do not like the log output, you can remove all of the console.log calls.

Next, create a session route for setting up the realtime connection.

import { NextResponse } from 'next/server';

export async function POST() {
    try {
        if (!process.env.OPENAI_API_KEY) {
            throw new Error(`OPENAI_API_KEY is not set`);
        }
        const response = await fetch("https://api.openai.com/v1/realtime/sessions", {
            method: "POST",
            headers: {
                Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
                "Content-Type": "application/json",
            },
            body: JSON.stringify({
                model: "gpt-4o-mini-realtime-preview",
                voice: "alloy",
                modalities: ["audio", "text"],
                // instructions:"You are a helpful assistant for the website named OpenAI Realtime Blocks, a UI Library for Nextjs developers who want to integrate pre-made UI components using TailwindCSS, Framer Motion into their web projects. It works using an OpenAI API Key and the pre-defined 'use-webrtc' hook that developers can copy and paste easily into any Nextjs app. There are a variety of UI components that look beautiful and react to AI Voice, which should be a delight on any modern web app.",
                tools: tools,
                tool_choice: "auto",
            }),
        });

        if (!response.ok) {
            throw new Error(`API request failed with status ${response.status}`);
        }

        const data = await response.json();

        // Return the JSON response to the client
        return NextResponse.json(data);
    } catch (error) {
        console.error("Error fetching session data:", error);
        return NextResponse.json({ error: "Failed to fetch session data" }, { status: 500 });
    }
}

// Tool definitions passed to the session above (none of them take parameters)
const tools = [
    {
        "type": "function",
        "name": "getPageHTML",
        "description": "Get the HTML of the current page",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
    {
        "type": "function",
        "name": "getWeather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
    {
        "type": "function",
        "name": "getCurrentTime",
        "description": "Get the current time",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    },
];
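Declaring tools in the session only tells the model what it may call; something on the client still has to run them and send the result back. Below is a rough sketch of that step, assuming the Realtime API's `response.function_call_arguments.done` server event and the `conversation.item.create` and `response.create` client events. `toolRegistry` and `handleFunctionCall` are hypothetical names and are not taken from the hook. `getPageHTML` is left out because it needs the browser DOM.

```typescript
// Only the fields this sketch reads from the server event.
interface FunctionCallDoneEvent {
  type: "response.function_call_arguments.done";
  call_id: string;
  name: string;
  arguments: string; // JSON-encoded arguments chosen by the model
}

type ToolImpl = (args: Record<string, unknown>) => unknown;

// Hypothetical local implementations, keyed by the names declared in the session.
const toolRegistry: Record<string, ToolImpl> = {
  getCurrentTime: () => ({ time: new Date().toISOString() }),
  getWeather: () => ({ error: "not implemented in this sketch" }),
};

// Returns the client events to send back: the tool output, then a request for a new response.
function handleFunctionCall(
  event: FunctionCallDoneEvent,
  registry: Record<string, ToolImpl>,
): object[] {
  const impl = registry[event.name];
  let output: unknown;
  if (!impl) {
    output = { error: `Unknown tool: ${event.name}` };
  } else {
    try {
      output = impl(JSON.parse(event.arguments || "{}"));
    } catch (err) {
      output = { error: String(err) };
    }
  }
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(output),
      },
    },
    { type: "response.create" },
  ];
}
```

Returning the events instead of sending them directly keeps the function independent of the transport, so the same logic works whether the connection is a WebRTC data channel or a WebSocket.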

Just in case, a reminder: do not forget to set the API key in your .env.local file.
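The real API key stays on the server; what the browser needs from the session route is the short-lived client secret in the response. Here is a small sketch of reading it defensively, assuming the response keeps the `client_secret.value` shape returned by the sessions endpoint. `extractEphemeralKey` is a hypothetical helper, and the `/api/session` path in the comment is my assumption about where the route is mounted.

```typescript
// Only the part of the session response this sketch cares about.
interface RealtimeSessionResponse {
  client_secret?: { value?: string; expires_at?: number };
}

// Pulls the ephemeral key out of the session response, failing loudly if it is missing
// (for example when the route returned its { error: ... } body instead).
function extractEphemeralKey(data: unknown): string {
  const secret = (data as RealtimeSessionResponse | null)?.client_secret?.value;
  if (typeof secret !== "string" || secret.length === 0) {
    throw new Error("Session response did not include client_secret.value");
  }
  return secret;
}

// Possible usage in the browser (path is an assumption):
//   const res = await fetch("/api/session", { method: "POST" });
//   const key = extractEphemeralKey(await res.json());
```

Failing early here gives a clearer error than letting an undefined key reach the connection setup.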

The documentation includes the full source for chatgpt.tsx and page.tsx.

Let's copy those.

Set them up as follows.

import React, { useEffect, useState } from "react";
import { motion } from "framer-motion";
import useWebRTCAudioSession from "@/hooks/use-webrtc";
 
const ChatGPT: React.FC = () => {
  const { currentVolume, isSessionActive, handleStartStopClick, msgs } =
    useWebRTCAudioSession("alloy");
 
  const silenceTimeoutRef = React.useRef<NodeJS.Timeout | null>(null);
 
  const [mode, setMode] = useState<"idle" | "thinking" | "responding" | "volume" | "">(
    ""
  );
  const [volumeLevels, setVolumeLevels] = useState([0, 0, 0, 0]);

  (rest omitted...)

I also created a page.tsx.

import ChatGPT from "@/components/aiChat/AudioChatGPT";
 
export default function Page() {
  return (
    <main className="flex items-center justify-center h-screen">
      <ChatGPT />
    </main>
  );
}

Run it now and you can see that it works.

I have not implemented dark mode yet, so I changed the SVG color to gray.

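4. Takeaways

The price is roughly $0.06 per minute of audio input and $0.24 per minute of audio output.

That is not very intuitive, so here is what a quick test today actually cost.

Over the past two months I spent $2 using streamText, but today alone I spent $1.50...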

Of course, that is still much cheaper than paying directly, but it is more than I expected.
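To get a feel for how quickly this adds up, here is a back-of-the-envelope estimator. It assumes both quoted rates apply per minute of audio ($0.06 in, $0.24 out); actual billing may differ, so treat the result as a rough guide only. `estimateAudioCostUSD` is a hypothetical helper.

```typescript
// Rates from the text, kept in cents to avoid floating-point drift.
const INPUT_CENTS_PER_MIN = 6;
const OUTPUT_CENTS_PER_MIN = 24;

// Estimates the cost in USD for the given minutes of audio input and output.
function estimateAudioCostUSD(inputMinutes: number, outputMinutes: number): number {
  const cents =
    inputMinutes * INPUT_CENTS_PER_MIN + outputMinutes * OUTPUT_CENTS_PER_MIN;
  return Math.round(cents) / 100;
}

// Example: 10 minutes of speaking and 3 minutes of model speech.
// 10 * 6 + 3 * 24 = 132 cents
const exampleCost = estimateAudioCostUSD(10, 3); // 1.32
```

Output audio costs four times as much per minute as input, so long spoken replies dominate the bill.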

My wife wants me to build a GPT for studying English, so this might be worth trying for that.
