Analyze audio files using the Gemini API

You can ask a Gemini model to analyze audio files that you provide either inline (base64-encoded) or via URL. When you use Vertex AI in Firebase, you can make this request directly from your app.

With this capability, you can do things like:

  • Describe, summarize, or answer questions about the audio content
  • Transcribe the audio content
  • Analyze specific segments of the audio using timestamps (see the example prompt after this list)
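
For instance, here is a minimal sketch of a timestamp-based prompt, reusing the same Kotlin `content` builder shown later in this guide; the `audioBytes` variable and the time range are illustrative assumptions, not part of the API:

// Ask about a specific segment of the audio by referencing timestamps in the prompt
// (`audioBytes` is assumed to hold the audio file's bytes; the time range is illustrative)
val prompt = content {
    inlineData(audioBytes, "audio/mpeg")
    text("Summarize what is said between 02:30 and 03:29 in this recording.")
}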

See the other guides for additional options for working with audio:
Generate structured output · Multi-turn chat · Bidirectional streaming

Before you begin

If you haven't already, complete the getting started guide, which describes how to set up your Firebase project, connect your app to Firebase, add the SDK, initialize the Vertex AI service, and create a GenerativeModel instance.

To test and iterate on your prompts, and even to get a generated code snippet, we recommend using Vertex AI Studio.

Send audio files (base64-encoded) and receive text

Make sure you've completed the "Before you begin" section of this guide before trying this sample.

You can ask a Gemini model to generate text from a prompt that includes text and audio, by providing the input file's mimeType and the file itself. Find requirements and recommendations for input files later on this page.

Swift

You can call generateContent() to generate text from multimodal input of text and a single audio file.

import FirebaseVertexAI

// Initialize the Vertex AI service
let vertex = VertexAI.vertexAI()

// Create a `GenerativeModel` instance with a model that supports your use case
let model = vertex.generativeModel(modelName: "gemini-2.0-flash")

// Provide the audio as `Data`
guard let audioData = try? Data(contentsOf: audioURL) else {
    print("Error loading audio data.")
    return // Or handle the error appropriately
}

// Specify the appropriate audio MIME type
let audio = InlineDataPart(data: audioData, mimeType: "audio/mpeg")


// Provide a text prompt to include with the audio
let prompt = "Transcribe what's said in this audio recording."

// To generate text output, call `generateContent` with the audio and text prompt
let response = try await model.generateContent(audio, prompt)

// Print the generated text, handling the case where it might be nil
print(response.text ?? "No text in response.")

Kotlin

You can call generateContent() to generate text from multimodal input of text and a single audio file.

For Kotlin, the methods in this SDK are suspend functions and need to be called from a coroutine scope.

// Initialize the Vertex AI service and create a `GenerativeModel` instance
// Specify a model that supports your use case
val generativeModel = Firebase.vertexAI.generativeModel("gemini-2.0-flash")

val contentResolver = applicationContext.contentResolver

val inputStream = contentResolver.openInputStream(audioUri)

if (inputStream != null) {  // Check if the audio loaded successfully
    inputStream.use { stream ->
        val bytes = stream.readBytes()

        // Provide a prompt that includes the audio specified above and text
        val prompt = content {
            inlineData(bytes, "audio/mpeg")  // Specify the appropriate audio MIME type
            text("Transcribe what's said in this audio recording.")
        }

        // To generate text output, call `generateContent` with the prompt
        val response = generativeModel.generateContent(prompt)

        // Log the generated text, handling the case where it might be null
        Log.d(TAG, response.text?: "")
    }
} else {
    Log.e(TAG, "Error getting input stream for audio.")
    // Handle the error appropriately
}

Java

You can call generateContent() to generate text from multimodal input of text and a single audio file.

For Java, the methods in this SDK return a ListenableFuture.

// Initialize the Vertex AI service and create a `GenerativeModel` instance
// Specify a model that supports your use case
GenerativeModel gm = FirebaseVertexAI.getInstance()
        .generativeModel("gemini-2.0-flash");
GenerativeModelFutures model = GenerativeModelFutures.from(gm);

ContentResolver resolver = getApplicationContext().getContentResolver();

try (InputStream stream = resolver.openInputStream(audioUri)) {
    File audioFile = new File(new URI(audioUri.toString()));
    int audioSize = (int) audioFile.length();
    byte[] audioBytes = new byte[audioSize];
    if (stream != null) {
        stream.read(audioBytes, 0, audioBytes.length);
        stream.close();

        // Provide a prompt that includes the audio specified above and text
        Content prompt = new Content.Builder()
              .addInlineData(audioBytes, "audio/mpeg")  // Specify the appropriate audio MIME type
              .addText("Transcribe what's said in this audio recording.")
              .build();

        // To generate text output, call `generateContent` with the prompt
        ListenableFuture<GenerateContentResponse> response = model.generateContent(prompt);
        Futures.addCallback(response, new FutureCallback<GenerateContentResponse>() {
            @Override
            public void onSuccess(GenerateContentResponse result) {
                String text = result.getText();
                Log.d(TAG, (text == null) ? "" : text);
            }
            @Override
            public void onFailure(Throwable t) {
                Log.e(TAG, "Failed to generate a response", t);
            }
        }, executor);
    } else {
        Log.e(TAG, "Error getting input stream for file.");
        // Handle the error appropriately
    }
} catch (IOException e) {
    Log.e(TAG, "Failed to read the audio file", e);
} catch (URISyntaxException e) {
    Log.e(TAG, "Invalid audio file", e);
}

Web

You can call generateContent() to generate text from multimodal input of text and a single audio file.

import { initializeApp } from "firebase/app";
import { getVertexAI, getGenerativeModel } from "firebase/vertexai";

// TODO(developer) Replace the following with your app's Firebase configuration
// See: https://firebase.google.com/docs/web/learn-more#config-object
const firebaseConfig = {
  // ...
};

// Initialize FirebaseApp
const firebaseApp = initializeApp(firebaseConfig);

// Initialize the Vertex AI service
const vertexAI = getVertexAI(firebaseApp);

// Create a `GenerativeModel` instance with a model that supports your use case
const model = getGenerativeModel(vertexAI, { model: "gemini-2.0-flash" });

// Converts a File object to a Part object.
async function fileToGenerativePart(file) {
  const base64EncodedDataPromise = new Promise((resolve) => {
    const reader = new FileReader();
    // Keep only the base64 payload after the data-URL prefix
    reader.onloadend = () => resolve(reader.result.split(',')[1]);
    reader.readAsDataURL(file);
  });
  return {
    inlineData: { data: await base64EncodedDataPromise, mimeType: file.type },
  };
}

async function run() {
  // Provide a text prompt to include with the audio
  const prompt = "Transcribe what's said in this audio recording.";

  // Prepare audio for input
  const fileInputEl = document.querySelector("input[type=file]");
  const audioPart = await fileToGenerativePart(fileInputEl.files[0]);

  // To generate text output, call `generateContent` with the text and audio
  const result = await model.generateContent([prompt, audioPart]);

  // Log the generated text, handling the case where it might be undefined
  console.log(result.response.text() ?? "No text in response.");
}

run();

Dart

You can call generateContent() to generate text from multimodal input of text and a single audio file.

import 'package:firebase_vertexai/firebase_vertexai.dart';
import 'package:firebase_core/firebase_core.dart';
import 'firebase_options.dart';

await Firebase.initializeApp(
  options: DefaultFirebaseOptions.currentPlatform,
);

// Initialize the Vertex AI service and create a `GenerativeModel` instance
// Specify a model that supports your use case
final model =
      FirebaseVertexAI.instance.generativeModel(model: 'gemini-2.0-flash');

// Provide a text prompt to include with the audio
final prompt = TextPart("Transcribe what's said in this audio recording.");

// Prepare audio for input
final audio = await File('audio0.mp3').readAsBytes();

// Provide the audio as `Data` with the appropriate audio MIME type
final audioPart = InlineDataPart('audio/mpeg', audio);

// To generate text output, call `generateContent` with the text and audio
final response = await model.generateContent([
  Content.multi([prompt, audioPart])
]);

// Print the generated text
print(response.text);

Learn how to choose a model and, optionally, a location that's appropriate for your use case and app.

Stream the response

Make sure you've completed the "Before you begin" section of this guide before trying this sample.

You can achieve faster interactions by not waiting for the entire result from the model generation, and instead using streaming to handle partial results. To stream the response, call generateContentStream.
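
As a minimal sketch in Kotlin, reusing the `generativeModel` instance and audio `bytes` from the Kotlin example above (those names are assumptions carried over from that example):

// Build the same multimodal prompt as in the non-streaming example
val prompt = content {
    inlineData(bytes, "audio/mpeg")  // Specify the appropriate audio MIME type
    text("Transcribe what's said in this audio recording.")
}

// Handle partial results as they arrive instead of waiting for the full response
generativeModel.generateContentStream(prompt).collect { chunk ->
    print(chunk.text ?: "")
}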



Requirements and recommendations for input audio files

See "Supported input files and requirements for the Vertex AI Gemini API" for detailed information about the following:

Supported audio MIME types

The Gemini multimodal models (Gemini 2.0 Flash and Gemini 2.0 Flash‑Lite) support the following audio MIME types:

  • AAC - audio/aac
  • FLAC - audio/flac
  • MP3 - audio/mp3
  • MPA - audio/m4a
  • MPEG - audio/mpeg
  • MPGA - audio/mpga
  • MP4 - audio/mp4
  • OPUS - audio/opus
  • PCM - audio/pcm
  • WAV - audio/wav
  • WEBM - audio/webm

Limits per request

You can include a maximum of 1 audio file in a prompt request.



What else can you do?

Try out other capabilities

Learn how to control content generation

You can also experiment with prompts and model configurations using Vertex AI Studio.
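
In code, generation settings can be applied when you create the model instance. Here is a hedged Kotlin sketch using the SDK's generationConfig builder; the parameter values are illustrative assumptions, not recommendations:

// Create a model instance with an explicit generation configuration
// (temperature and token-limit values here are illustrative only)
val configuredModel = Firebase.vertexAI.generativeModel(
    modelName = "gemini-2.0-flash",
    generationConfig = generationConfig {
        temperature = 0.7f
        maxOutputTokens = 1024
    }
)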

Learn more about the supported models

Learn about the models available for various use cases, as well as their quotas and pricing.

