OpenAI · Specialized
Whisper
OpenAI's open-source automatic speech recognition model trained on 680,000 hours of multilingual audio for robust transcription and translation.
Overview
Whisper is OpenAI's open-source automatic speech recognition (ASR) model, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It approaches human-level accuracy on standard English speech recognition benchmarks while remaining robust to accents, background noise, and technical language. Whisper transcribes speech in 99 languages and translates it to English, making it one of the most versatile open-source ASR models available and a foundation for a wide range of audio processing applications.
Parameters
39M (tiny) to 1.5B (large-v3)
Languages
99 languages supported
Training Data
680,000 hours of audio
Architecture
Encoder-decoder transformer
License
MIT
Capabilities
Multilingual speech-to-text transcription (99 languages)
Speech translation to English from any supported language
Robust handling of accents, noise, and technical terminology
Timestamp generation at word and segment level
Language detection and identification
Use Cases
Transcribing meetings, interviews, and podcasts automatically
Adding subtitles and captions to video content
Building multilingual voice interfaces and voice search
Creating accessible content through automated transcription
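The subtitle use case, for example, largely reduces to formatting Whisper's segment timestamps as an SRT file. A sketch (the segment tuples mirror the `start`/`end`/`text` fields Whisper returns; the helper name is ours):

```python
def to_srt(segments) -> str:
    """Render (start_sec, end_sec, text) tuples as SRT subtitle blocks."""

    def stamp(sec: float) -> str:
        # SRT timestamps use the form HH:MM:SS,mmm
        ms = int(round(sec * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{stamp(start)} --> {stamp(end)}\n{text.strip()}\n")
    return "\n".join(blocks)


# Hand-written segments in the shape Whisper's transcribe() returns:
print(to_srt([(0.0, 2.5, "Hello and welcome."), (2.5, 5.0, "Let's begin.")]))
```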
Pros
- Near-human accuracy on English speech recognition
- Open-source with an MIT license for unrestricted use
- Supports 99 languages, among the broadest coverage of any open ASR model
- Robust to real-world audio conditions and diverse accents
Cons
- Large models require significant GPU memory for real-time use
- Can hallucinate text on silent or low-quality audio segments
- Real-time streaming requires additional engineering effort
- Translation is limited to English as the target language
Pricing
Free and open-source for self-hosting. OpenAI API: $0.006/minute of audio. Runs on consumer GPUs; Tiny model runs on CPU.