The preview model is separate from the main engine. You get instant visual feedback without any hit to final accuracy.
An NVIDIA End-of-Utterance model on the Neural Engine processes 320ms audio windows and produces word-level partials with ~300ms latency.
If the EOU model isn't available, preview falls back to the main engine processing 2-second chunks during recording.
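As a rough sketch of what those two window sizes mean in samples (assuming a 16 kHz mono stream; the actual sample rate isn't stated here):

```python
# Sketch: split a mono PCM stream into fixed-size windows.
# The 16 kHz sample rate is an assumption; only the durations are documented.
SAMPLE_RATE = 16_000                      # assumed 16 kHz mono
EOU_WINDOW = int(0.320 * SAMPLE_RATE)     # 320 ms -> 5120 samples
FALLBACK_CHUNK = int(2.0 * SAMPLE_RATE)   # 2 s    -> 32000 samples

def windows(samples, size):
    """Yield consecutive complete windows of `size` samples (no partial tail)."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]
```

The fallback path is the same loop with `FALLBACK_CHUNK` instead of `EOU_WINDOW`, which is why it feels chunkier: each preview update waits on roughly six times as much audio.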

Preview runs on the Neural Engine. Main transcription runs on Metal GPU. They don't fight over resources.
Words appear progressively with a subtle typewriter effect. Feels natural, not jarring.
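One way such progressive display can work (a sketch, not necessarily the app's actual implementation) is to diff each new partial against the previous one and append only the newly arrived words:

```python
def new_words(prev_partial: str, partial: str) -> list[str]:
    """Return the words in `partial` beyond its shared word-prefix
    with `prev_partial`. Words the engine revises mid-stream come back
    as part of the new tail, so the UI can redraw from that point."""
    prev, cur = prev_partial.split(), partial.split()
    i = 0
    while i < min(len(prev), len(cur)) and prev[i] == cur[i]:
        i += 1
    return cur[i:]
```

Feeding successive partials through this and animating only the returned tail is what produces the typewriter feel: stable words never flicker, and only fresh material is drawn in.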
Dictionary corrections show up with colored highlights. Click one to see what the original transcription was.
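A minimal shape for the data such a highlight needs (hypothetical names; the app's internal types aren't documented here) is each correction's original text kept alongside the replacement and its position:

```python
from dataclasses import dataclass

@dataclass
class Correction:
    start: int        # character offset in the displayed text
    original: str     # what the engine actually transcribed
    replacement: str  # dictionary-corrected form shown on screen

def apply_corrections(text: str, corrections: list[Correction]) -> str:
    """Apply corrections right-to-left so earlier offsets stay valid."""
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        text = text[:c.start] + c.replacement + text[c.start + len(c.original):]
    return text
```

Because each `Correction` retains `original`, a click on a highlight can show the pre-correction transcription without re-running the engine.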
About 300ms. The EOU engine processes 320ms audio windows directly on the Neural Engine, with no network round-trip, so it typically beats cloud transcription services.
No. It's a separate lightweight model. The main engine runs independently and produces the final result when you release the button.
Yes. Toggle it off in Settings. The main transcription engine still works; you just won't see words appearing in real-time.
The primary EOU preview needs the Neural Engine (Apple Silicon). If unavailable, it falls back to the main engine with 2-second chunks.
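The selection logic described above can be sketched as a single hypothetical helper (names and return shape are illustrative, not the app's API):

```python
def pick_preview_source(neural_engine_available: bool) -> tuple[str, float]:
    """Choose the preview path: the EOU model on the Neural Engine
    with 320 ms windows when available, otherwise the main engine
    fed 2-second chunks during recording."""
    if neural_engine_available:
        return ("eou", 0.320)
    return ("main", 2.0)
```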