How Whisper Works: OpenAI's Speech Model Explained for Mac Users (2026)
OpenAI Whisper powers on-device dictation on Mac. Learn how the model architecture works, which size to choose, and why Apple Silicon makes it fast and private.
How Whisper Works: The AI Model Behind On-Device Dictation
TL;DR: OpenAI Whisper is an open-source speech recognition model — the latest large-v3 version was trained on 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio. It converts spoken language into text using a transformer architecture — the same type of AI architecture that powers ChatGPT. On Apple Silicon Macs, Whisper runs entirely on-device, accelerated by the chip's Neural Engine, processing speech locally without sending any audio to servers. This is the technology that enables private, offline dictation apps like Voibe.
Understanding how Whisper works helps explain why on-device dictation in 2026 is fast, accurate, and private. This guide breaks down the model architecture, explains the different model sizes and their trade-offs, and covers how Apple Silicon optimization enables real-time local processing.
Key Takeaway
Whisper is open-source, runs fully on-device on Apple Silicon, and processes speech without any network connection. It achieves accuracy comparable to cloud services for English.
Key Takeaways: Whisper at a Glance
| Aspect | Detail | Why It Matters |
|---|---|---|
| Creator | OpenAI (released September 2022) | Open-source, publicly auditable |
| Training Data | 1M hours labeled + 4M pseudo-labeled (v3) | Broad vocabulary and accent coverage |
| Architecture | Encoder-decoder transformer | Same proven architecture as ChatGPT |
| Model Sizes | Tiny (39M) to Large (1.55B parameters) | Choose accuracy vs. speed for your hardware |
| Languages | 99 languages supported | Strongest for English, good for major languages |
| Apple Silicon | Optimized via whisper.cpp and MLX | Real-time processing on M1–M4 Neural Engine |
| Privacy | Runs 100% on-device when used locally | No audio sent to servers, fully offline |
Disclosure: Voibe is our product and uses Whisper for on-device dictation. We explain the technology factually.
Whisper's Architecture: How Speech Becomes Text
Whisper uses an encoder-decoder transformer architecture — the same family of AI architectures behind large language models like GPT-4 and Claude. Here is how the pipeline works:
Step 1: Audio preprocessing — Raw audio from your microphone is converted into a mel spectrogram, which is a visual representation of sound frequencies over time. The audio is processed in 30-second chunks. This spectrogram is the model's "input image" of your speech.
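The fixed 30-second window can be illustrated with a minimal pure-Python sketch of the pad-or-trim step (the real implementation operates on 16 kHz sample arrays and then computes the mel spectrogram; this toy version only shows the windowing logic):

```python
SAMPLE_RATE = 16_000                          # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 30                            # fixed 30-second input window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per chunk

def pad_or_trim(samples: list[float]) -> list[float]:
    """Force an audio buffer to exactly one 30-second window:
    truncate clips that are too long, zero-pad clips that are too short."""
    if len(samples) >= CHUNK_SAMPLES:
        return samples[:CHUNK_SAMPLES]
    return samples + [0.0] * (CHUNK_SAMPLES - len(samples))

# A 5-second clip is zero-padded out to the full window:
clip = [0.1] * (5 * SAMPLE_RATE)
print(len(pad_or_trim(clip)))  # 480000
```

Whatever the length of your utterance, the model always sees the same input shape, which is part of why its latency is so predictable.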
Step 2: Encoder — The encoder is a stack of transformer layers that processes the mel spectrogram and creates a rich representation of the audio. Each layer learns different aspects of the sound: lower layers capture basic acoustic features (pitch, volume), while higher layers capture linguistic features (phonemes, word boundaries).
Step 3: Decoder — The decoder takes the encoder's representation and generates text token by token. It predicts the next word based on the audio representation and the words it has already generated. This is the same autoregressive generation process used by text AI models.
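The decoder's token-by-token loop can be sketched with a toy stand-in for the network (the scoring table below is invented purely for illustration; the real decoder scores every vocabulary token against the encoder's audio representation at each step):

```python
# Toy next-token table standing in for the real decoder network.
# Keys are the tokens generated so far; values are candidate scores.
NEXT_TOKEN_SCORES = {
    (): {"hello": 0.9, "<eot>": 0.1},
    ("hello",): {"world": 0.8, "<eot>": 0.2},
    ("hello", "world"): {"<eot>": 0.95, "again": 0.05},
}

def greedy_decode(max_tokens: int = 10) -> list[str]:
    """Autoregressive greedy decoding: repeatedly pick the highest-scoring
    next token, feed the growing transcript back in, stop at end-of-text."""
    tokens: list[str] = []
    for _ in range(max_tokens):
        scores = NEXT_TOKEN_SCORES[tuple(tokens)]
        best = max(scores, key=scores.get)
        if best == "<eot>":
            break
        tokens.append(best)
    return tokens

print(greedy_decode())  # ['hello', 'world']
```

The key property this loop captures is that each prediction is conditioned on everything generated so far, which is how the model keeps transcripts grammatically coherent.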
Step 4: Output — The generated tokens are assembled into the final transcript. Whisper can also output timestamps and detect the spoken language; separating multiple speakers is not built in and requires an additional diarization model layered on top by the implementation.
The entire process — from audio input to text output — runs on your Mac's processor when using local implementations. No step requires network connectivity.
Whisper Model Sizes: Choosing the Right One
Whisper comes in six model sizes (including the newer Turbo variant), each trading accuracy for speed and resource usage. Choosing the right size depends on your Mac's hardware and your accuracy requirements.
| Model | Parameters | Disk Size | Relative Speed | Best For |
|---|---|---|---|---|
| Tiny | 39 million | ~75 MB | Fastest (~32x real-time) | Quick drafts, low-power devices |
| Base | 74 million | ~142 MB | Very fast (~16x real-time) | Casual dictation, older hardware |
| Small | 244 million | ~461 MB | Fast (~6x real-time) | Daily use — best accuracy/speed balance |
| Medium | 769 million | ~1.5 GB | Moderate (~2x real-time) | Professional dictation, higher accuracy |
| Large | 1.55 billion | ~2.9 GB | Slower (~1x real-time) | Maximum accuracy, multilingual |
| Turbo | 809 million | ~1.6 GB | Fast (~4x real-time) | Near-large accuracy, optimized speed |
On Apple Silicon Macs, the Small model handles most English dictation well. It processes speech approximately 6 times faster than real-time on M1 chips. On the M2, the large-v3-turbo model transcribes 10 minutes of audio in approximately 63 seconds. On M3 and M4, processing is faster still, meaning a 10-second audio clip is transcribed in under 2 seconds. For professional work requiring the highest accuracy, the Medium model is the sweet spot — it runs at approximately 2x real-time on M1 and faster on newer chips.
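The real-time factors in the table translate directly into expected processing time. Here is a small helper using the approximate figures from the table above (they vary by hardware, so treat the estimates as rough):

```python
# Approximate real-time speed multiples from the table above
# (hardware dependent; M1-class figures).
REALTIME_FACTOR = {
    "tiny": 32, "base": 16, "small": 6,
    "medium": 2, "large": 1, "turbo": 4,
}

def estimated_seconds(model: str, audio_seconds: float) -> float:
    """Estimate transcription time: audio duration divided by the
    model's real-time speed multiple."""
    return audio_seconds / REALTIME_FACTOR[model]

# A 10-minute recording with the Small model:
print(estimated_seconds("small", 600))  # 100.0
```

In practice this is why Small feels instantaneous for short dictation bursts: a 10-second utterance takes well under 2 seconds even on an M1.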
Voibe automatically selects the optimal model size based on your Mac's Apple Silicon generation, balancing accuracy and responsiveness without manual configuration.
Why Apple Silicon Makes Whisper Fast and Private
Apple Silicon's architecture is uniquely suited for running Whisper locally. Three hardware features make on-device speech recognition practical:
Neural Engine — Every Apple Silicon chip (M1 through M4) includes a dedicated Neural Engine designed specifically for machine learning workloads. The Neural Engine handles the matrix multiplication operations that dominate transformer computations, offloading this work from the CPU and GPU. The M4's Neural Engine can perform up to 38 trillion operations per second.
Unified Memory Architecture — Unlike traditional computers where CPU, GPU, and memory are separate components connected by buses, Apple Silicon uses a unified memory pool shared by all processors. This means Whisper model weights, audio data, and intermediate computations can be accessed by the Neural Engine without copying data between memory regions — eliminating a major bottleneck in AI inference.
Efficient inference frameworks — Two open-source projects optimize Whisper specifically for Apple Silicon:
- whisper.cpp — A C/C++ implementation by Georgi Gerganov that uses Apple's Accelerate and Core ML frameworks. Core ML execution on the Neural Engine delivers more than 3x faster inference compared to CPU-only processing.
- MLX Whisper — Built on Apple's MLX framework and designed from the ground up for Apple Silicon's unified memory architecture.
These optimizations mean that Whisper models run efficiently on even the base M1 chip with 8 GB of unified memory. No external GPU, no cloud server, and no internet connection required.
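Why even a base 8 GB M1 can hold these models comes down to simple arithmetic: weight count times bytes per weight. A rough estimator (weights only; runtime activations and the rest of macOS need headroom on top, and quantized formats add small per-block overhead not counted here):

```python
# Bytes per weight for common precisions; 4-bit quantization is
# what formats like whisper.cpp's quantized GGML files approximate.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "q4": 0.5}

def weight_gb(params: float, precision: str) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    return params * BYTES_PER_WEIGHT[precision] / 1e9

# Large (1.55 billion parameters) at fp16 vs int8:
print(weight_gb(1.55e9, "fp16"))  # 3.1
print(weight_gb(1.55e9, "int8"))  # 1.55
```

The fp16 figure lines up with the ~2.9 GB disk size in the table above, and quantization is how even the Large model leaves room for the rest of the system in 8 GB of unified memory.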
For a broader look at how local processing compares to cloud dictation, see our guide on cloud vs. local dictation.
Whisper vs. Cloud Speech APIs: Privacy and Performance
Running Whisper locally and using cloud speech APIs (including OpenAI's own Whisper API) are fundamentally different experiences, despite using the same underlying model. Here is how they compare:
| Factor | Local Whisper (e.g., Voibe) | Cloud Whisper API | Google Speech-to-Text |
|---|---|---|---|
| Audio leaves device? | No | Yes (sent to OpenAI servers) | Yes (sent to Google servers) |
| Internet required? | No | Yes | Yes |
| Latency | Low (local processing) | Variable (network + server) | Variable (network + server) |
| Privacy | Maximum (no data transmitted) | Audio on OpenAI servers | Audio on Google servers |
| Cost model | One-time / subscription | Per-minute API pricing | Per-minute API pricing |
| Model inspectable? | Yes (open-source weights) | No (hosted service) | No (proprietary) |
The key distinction: local Whisper gives you the accuracy of OpenAI's speech recognition without sending a single byte of audio to any server. The model runs on your hardware, the audio stays on your hardware, and no external party has access to your voice data.
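The cost-model row above can be made concrete with a break-even calculation. The $0.006-per-minute figure below is an assumption based on OpenAI's published Whisper API rate at the time of writing; check current pricing before relying on it:

```python
def breakeven_minutes(flat_monthly_usd: float, per_minute_usd: float) -> float:
    """Minutes of audio per month at which a flat-rate local app
    costs the same as a metered cloud API."""
    return flat_monthly_usd / per_minute_usd

# $4.90/month flat rate vs an assumed $0.006/minute API rate:
print(round(breakeven_minutes(4.90, 0.006)))  # 817
```

Past roughly 13–14 hours of dictation per month, metered cloud transcription costs more than a flat-rate local app, on top of the privacy difference.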
A further distinction for multilingual users: some cloud dictation tools pass the raw Whisper output through a large language model to apply formatting or corrections, a step that has been reported to corrupt non-English text by silently replacing or rewriting words in the target language. On-device Whisper has no such post-processing stage: it delivers what you speak, unmodified by any downstream model.
For a broader privacy comparison of dictation approaches, see our dictation privacy guide. For details on how voice data is handled by different services, see our voice data privacy guide.
Getting Started with On-Device Whisper Dictation
The easiest way to use Whisper for on-device dictation on Mac is through a dedicated app that handles model management, optimization, and system-wide integration:
Voibe bundles optimized Whisper models and runs them on your Apple Silicon Mac's Neural Engine. It works system-wide (any app), requires no account, and costs $4.90 per month or $99 for a lifetime license. Download, install, and start dictating — all processing stays on your Mac.
Requirements: Apple Silicon Mac (M1, M2, M3, or M4) running macOS 13 or later.
For a step-by-step setup guide, see how to use dictation on Mac. For a comparison of all on-device dictation options, see our roundup of the best offline dictation apps. Developers building speech-to-text features can explore our OpenAI Whisper alternatives guide comparing managed APIs (Deepgram, AssemblyAI) with optimized self-hosted options (faster-whisper, whisper.cpp).
Ready to type 3x faster?
Voibe is the fastest, most private dictation app for Mac. Try it today.
Related Articles
7 Best Offline Dictation Apps for Mac in 2026
Compare the best offline dictation software for Mac that processes speech locally. Covers Voibe, SuperWhisper, MacWhisper, VoiceInk, and more with pricing, privacy, and features.
Apple Dictation Privacy: What Data Apple Collects and How to Stop It
Apple Dictation on Mac processes most speech on-device but can still share audio with Apple. Learn exactly what data is sent, how to disable sharing, and limitations.
Cloud vs. Local Dictation: Privacy, Speed, and Accuracy Compared (2026)
Cloud dictation sends audio to servers. Local dictation processes on your device. Compare privacy, latency, accuracy, and cost to choose the right approach.

