Whisper

Tier	Price	Features
Whisper	$0.006 / minute	High accuracy, Multi-language, Translation to English
GPT-4o Transcribe	$0.006 / minute	Better accuracy, Speaker diarization, Better punctuation
GPT-4o Mini Transcribe	$0.003 / minute	50% cheaper, Good accuracy, Best for bulk

Application Tips

Self-Host for High Volume

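Whisper is open source, so you can run it on your own hardware instead of paying per minute. Break-even against the API is around 500 hours of audio per month, which is about $180/month at $0.006/minute.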
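The break-even arithmetic can be sketched as below. This is a rough estimator, not an official calculator: the prices are copied from the pricing table, the model identifiers are the API names assumed to correspond to each tier, and `selfHostMonthlyCostUsd` is a hypothetical figure you supply for your own hardware and operations cost.

```typescript
// Per-minute API prices copied from the pricing table in this guide.
// The keys are assumed API model identifiers for each tier.
const PRICE_PER_MINUTE_USD = {
  "whisper-1": 0.006,
  "gpt-4o-transcribe": 0.006,
  "gpt-4o-mini-transcribe": 0.003,
} as const;

type TranscriptionModel = keyof typeof PRICE_PER_MINUTE_USD;

// Estimated monthly API spend for a given audio volume, rounded to cents.
function monthlyApiCostUsd(hoursPerMonth: number, model: TranscriptionModel): number {
  const minutes = hoursPerMonth * 60;
  return Math.round(minutes * PRICE_PER_MINUTE_USD[model] * 100) / 100;
}

// True when a (hypothetical) fixed self-hosting cost undercuts the API bill.
function selfHostingIsCheaper(
  hoursPerMonth: number,
  model: TranscriptionModel,
  selfHostMonthlyCostUsd: number,
): boolean {
  return selfHostMonthlyCostUsd < monthlyApiCostUsd(hoursPerMonth, model);
}

console.log(monthlyApiCostUsd(500, "whisper-1"));
console.log(monthlyApiCostUsd(500, "gpt-4o-mini-transcribe"));
```

At 500 hours/month the Whisper tier comes to 30,000 minutes × $0.006, so self-hosting only pays off if its total monthly cost stays below that figure.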

Use GPT-4o Mini for Savings

GPT-4o Mini Transcribe costs $0.003/minute, half the price of Whisper and GPT-4o Transcribe, with good accuracy.

Max 25MB File Size

Supported formats include mp3, mp4, mpeg, mpga, m4a, wav, and webm. Uploads are capped at 25MB per file, so split larger files into chunks before sending them.
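One way to plan the split is to work out how many chunks a file needs and how long each chunk can be at a given bitrate. The sketch below is only a planning helper under stated assumptions: it treats 25MB as 25 × 1024 × 1024 bytes, keeps a 5% safety margin for container overhead, and assumes constant-bitrate audio. The function names are illustrative.

```typescript
// Assumption: the 25MB limit is interpreted as 25 * 1024 * 1024 bytes.
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;
// Assumption: leave 5% headroom for container and header overhead.
const SAFETY_MARGIN = 0.95;

// Number of chunks needed so that each chunk stays under the limit.
function chunksNeeded(fileSizeBytes: number): number {
  const usableBytes = MAX_UPLOAD_BYTES * SAFETY_MARGIN;
  return Math.max(1, Math.ceil(fileSizeBytes / usableBytes));
}

// Longest chunk duration (whole seconds) that fits under the limit
// for constant-bitrate audio at the given bitrate in kbit/s.
function maxChunkSeconds(bitrateKbps: number): number {
  const bytesPerSecond = (bitrateKbps * 1000) / 8;
  return Math.floor((MAX_UPLOAD_BYTES * SAFETY_MARGIN) / bytesPerSecond);
}

console.log(chunksNeeded(60 * 1024 * 1024)); // chunks for a 60MB file
console.log(maxChunkSeconds(128)); // seconds per chunk at 128 kbps
```

In Node, the file size could come from `fs.statSync(path).size`. The actual cutting is left to an audio tool of your choice, ideally at silent points so words are not split across chunks.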

China Access Solutions

API Proxy

Route Whisper API requests through a third-party proxy service that forwards them to OpenAI.
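With the official Node SDK, pointing the client at a proxy usually comes down to setting its `baseURL` option. The helper below is a minimal sketch: `OPENAI_PROXY_BASE_URL` is a hypothetical environment variable name chosen for this example, and the proxy URL in the usage line is a placeholder, not a real service.

```typescript
// Shape of the options this sketch passes to the OpenAI client constructor.
interface ClientOptions {
  apiKey: string;
  baseURL?: string;
}

// Builds client options from environment variables.
// OPENAI_PROXY_BASE_URL is a hypothetical name used only in this example.
function buildClientOptions(env: Record<string, string | undefined>): ClientOptions {
  const apiKey = env["OPENAI_API_KEY"];
  if (!apiKey) {
    throw new Error("OPENAI_API_KEY is not set");
  }
  const proxy = env["OPENAI_PROXY_BASE_URL"]?.trim();
  if (!proxy) {
    return { apiKey }; // no proxy configured: use the SDK default endpoint
  }
  // Strip trailing slashes so request paths are joined cleanly.
  return { apiKey, baseURL: proxy.replace(/\/+$/, "") };
}

const example = buildClientOptions({
  OPENAI_API_KEY: "sk-placeholder",
  OPENAI_PROXY_BASE_URL: "https://proxy.example.com/v1/",
});
console.log(example.baseURL);
```

The result can then be passed to the client, for example `new OpenAI(buildClientOptions(process.env))`. Keep in mind that a proxy sees both your API key and your audio, so choose one you trust.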

Self-Host Whisper

Deploy open-source Whisper locally. No API needed.
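A local deployment is often driven through the `whisper` command-line tool that ships with the open-source Python package. The sketch below only assembles the argument list for that command. It assumes the tool is installed and on your PATH, and that the flags shown (`--model`, `--language`, `--output_format`, `--output_dir`) match your installed version, so confirm them with `whisper --help`. The `LocalWhisperOptions` type and the `small` default are choices made for this example.

```typescript
// Options for this sketch; names are illustrative, not part of any library.
interface LocalWhisperOptions {
  model?: string; // e.g. "small"; larger models are slower but more accurate
  language?: string; // spoken language, if known
  outputFormat?: "txt" | "vtt" | "srt" | "tsv" | "json";
  outputDir?: string;
}

// Builds the argument list for the `whisper` CLI (flags assumed, see lead-in).
function buildWhisperArgs(audioPath: string, opts: LocalWhisperOptions = {}): string[] {
  const args = [audioPath, "--model", opts.model ?? "small"];
  if (opts.language) args.push("--language", opts.language);
  if (opts.outputFormat) args.push("--output_format", opts.outputFormat);
  if (opts.outputDir) args.push("--output_dir", opts.outputDir);
  return args;
}

console.log(["whisper", ...buildWhisperArgs("audio.mp3", { language: "en", outputFormat: "srt" })].join(" "));
```

From Node, the resulting array can be handed to `execFile("whisper", args)` from `child_process`, which avoids shell quoting issues with file names.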

Code Example

JavaScript / TypeScript

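import OpenAI from 'openai';
import fs from 'fs';

// Reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function transcribeAudio() {
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream('audio.mp3'), // must be under the 25MB limit
    model: 'whisper-1',
    language: 'en', // language spoken in the audio
    response_format: 'verbose_json', // returns extra metadata alongside the text
  });
  console.log('Text:', transcription.text);
}

// Report API or file errors instead of leaving the rejection unhandled.
transcribeAudio().catch((err) => {
  console.error('Transcription failed:', err);
  process.exitCode = 1;
});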

Rate Limits

Tier	Limits
Tier 1	50 RPM
Tier 3+	500+ RPM
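RPM here means requests per minute. To stay under a limit when transcribing in bulk, a client can space requests out and back off when the API rejects one. The sketch below shows both ideas under assumptions: the 1-second base delay, 30-second cap, and five attempts are arbitrary defaults, and the retry check assumes the thrown error exposes an HTTP `status` field equal to 429 when rate limited.

```typescript
// Minimum spacing between requests (ms) to stay at or under a given RPM.
function minIntervalMs(requestsPerMinute: number): number {
  return Math.ceil(60000 / requestsPerMinute);
}

// Exponential backoff delay for a zero-based attempt number, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Assumption: rate-limit errors carry `status === 429`.
function isRateLimitError(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: number }).status === 429;
}

// Retries `task` on rate-limit errors, waiting longer after each failure.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Give up on non-rate-limit errors or once attempts are exhausted.
      if (!isRateLimitError(err) || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}

console.log(minIntervalMs(50)); // spacing for the Tier 1 limit in the table
```

A transcription call can then be wrapped as `withRetry(() => openai.audio.transcriptions.create({ ... }))`, with `minIntervalMs` used to pace how often new calls are started.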

Recommended Use Cases

Meeting transcription, Podcast transcription, Subtitle generation, Voice note conversion, Multi-language audio processing

Last Updated: 2025-02
