Founder dev log
How Cleft's transcription got fast, private, and accurate
Cleft has transcribed your voice on-device since the first builds. The cloud only ever handled the summary step. A look at why we made that choice early, and the work to make on-device transcription fast and accurate.
- Author: Jonathan Cosgrove
- Read: 3 min
People think of transcription as the product. It is really the first step, the thing that has to work before any of the rest matters. I wanted to write down how it has changed, because what we held fixed and what we kept improving both say a lot about how we think.
On-device from the first build
Transcription in Cleft has run on your device since the first builds in early 2024. When you stop recording, a local Whisper model turns your audio into text on the phone or the Mac. The audio never has to leave the device, and recording works on a plane with the Wi-Fi off.
The only step that talks to a server is the one after transcription: turning a raw transcript into a clean, structured note. That summary runs on a cloud model. The audio, and the transcription of it, stay on the device. So "your voice never leaves your device" describes the real architecture, and it has been that way since the first build.
ON YOUR DEVICE
record → whisper.cpp → transcript
│
only the text is sent
▼
IN THE CLOUD
summarise + format → structured note
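The boundary in the diagram can be sketched in a few lines of Python. This is an illustration, not Cleft's code: `Recording`, `Transcript`, `make_note`, and the two stand-in functions are hypothetical names. The point is only that the cloud step is typed to receive text, so the audio has no path off the device.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Recording:
    """Raw audio captured on the device. Never handed to the cloud step."""
    samples: bytes


@dataclass(frozen=True)
class Transcript:
    """Plain text produced on the device from a Recording."""
    text: str


def make_note(
    recording: Recording,
    transcribe_locally: Callable[[Recording], Transcript],
    summarise_in_cloud: Callable[[str], str],
) -> str:
    """Transcribe on-device, then hand only the text to the cloud step."""
    transcript = transcribe_locally(recording)
    # The cloud callable is typed to take a str, so only text is handed over.
    return summarise_in_cloud(transcript.text)


# Stand-ins for illustration only (hypothetical, not a real model or service).
def fake_local_whisper(recording: Recording) -> Transcript:
    return Transcript(text="buy milk and call sam")


sent_to_cloud: List[str] = []


def fake_cloud_summary(text: str) -> str:
    sent_to_cloud.append(text)  # record exactly what left the device
    return "- " + text.capitalize()


note = make_note(Recording(samples=b"\x00\x01"), fake_local_whisper, fake_cloud_summary)
```

After this runs, `sent_to_cloud` holds the transcript string and nothing else. The `Recording` object is only ever passed to the local function.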
We did move around inside the on-device world. Early builds ran whisper.cpp. We tried WhisperKit for about a week in April 2024, then went back to whisper.cpp with an Intel Mac fallback so older machines kept working. Different engines, same arrangement: the audio is transcribed where it was recorded.
Making it fast
On-device transcription gave us that privacy and cost us speed. Early versions made you record, stop, and wait while the model caught up, sometimes for close to twenty seconds. That is a long time when you just want to get a thought down and move on.
In 1.11 we upgraded the transcription engine and got record-to-ready down to about three seconds, still entirely on-device. Quiet recordings are handled better too: low-volume audio is cleaned up before transcription instead of coming out garbled.
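The post does not say how low-volume audio is cleaned up, so the following is only one common approach, peak normalisation, shown as a sketch. The function name, the `target_peak` and `max_gain` values, and the float sample format are all assumptions made for the example.

```python
from typing import List


def normalise_peak(samples: List[float], target_peak: float = 0.9,
                   max_gain: float = 20.0) -> List[float]:
    """Scale samples in [-1.0, 1.0] so the loudest one reaches target_peak.

    max_gain caps the boost so near-silent input is not amplified into noise.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence or empty input: nothing to scale
    gain = min(target_peak / peak, max_gain)
    return [s * gain for s in samples]


quiet = [0.0, 0.03, -0.045, 0.015]
boosted = normalise_peak(quiet)
```

The gain cap matters for the next section: boosting true silence without a limit would hand the model louder noise to hallucinate on.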
When the model makes things up
On-device Whisper has a specific failure anyone who has used it will recognise. On silence, the model does not stay quiet. It hallucinates. The classic one is a phantom "Thank you" or "Thanks for watching" showing up in a note where you said nothing, because the model has seen a lot of video captions.
So we added silence filtering to catch those, and kept tightening it. For longer recordings we fixed the assembly so every section comes back in the right order, with nothing repeated and nothing skipped.
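A silence filter of this kind might look roughly like the sketch below. It is not Cleft's filter: the `Segment` type, the RMS threshold, and the two-phrase list (taken from the examples above) are assumptions. The sketch drops a segment only when it is both near-silent and a known phantom phrase, so a spoken "thank you" survives.

```python
import math
import re
from dataclasses import dataclass
from typing import List

# Phrases named in the text above; a real list would likely be longer.
PHANTOM_PHRASES = {"thank you", "thanks for watching"}


@dataclass
class Segment:
    text: str
    samples: List[float]  # audio for this segment, floats in [-1.0, 1.0]


def rms(samples: List[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalise_text(text: str) -> str:
    """Lowercase and strip punctuation so 'Thank you.' matches 'thank you'."""
    return re.sub(r"[^a-z ]", "", text.lower()).strip()


def drop_phantoms(segments: List[Segment], silence_rms: float = 0.01) -> List[Segment]:
    """Drop a segment only if it is both near-silent and a known phantom phrase."""
    kept = []
    for seg in segments:
        is_silent = rms(seg.samples) < silence_rms
        is_phantom = normalise_text(seg.text) in PHANTOM_PHRASES
        if not (is_silent and is_phantom):
            kept.append(seg)
    return kept


segments = [
    Segment("Remember to email the landlord.", [0.2, -0.3, 0.25]),
    Segment("Thank you.", [0.001, -0.002, 0.001]),   # silence plus phantom text
    Segment("Thank you", [0.3, -0.2, 0.4]),          # actually spoken
]
kept_texts = [seg.text for seg in drop_phantoms(segments)]
```

Requiring both conditions is the conservative choice: a text-only blocklist would delete real speech, and an energy-only gate would drop quiet but genuine words.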
The subtlest one was the background gap. If you started a recording and then switched to another app, the words you spoke while Cleft was in the background were captured to the audio file but could go missing from the transcript. The app now notices when part of a recording happened in the background and re-transcribes the full audio, so the note has everything you said. Fully foreground recordings are untouched and stay just as fast.
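The decision described above comes down to one branch. The sketch below assumes the app keeps a first-pass transcript plus a list of time spans during which it was in the background. Both fields and all names are hypothetical, chosen only to make the logic concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RecordingSession:
    first_pass_transcript: str
    # (start, end) spans in seconds when the app was backgrounded mid-recording
    background_spans: List[Tuple[float, float]] = field(default_factory=list)


def final_transcript(session: RecordingSession,
                     transcribe_full_audio: Callable[[], str]) -> str:
    """Reuse the first-pass transcript unless part of the recording was backgrounded."""
    had_background = any(end > start for start, end in session.background_spans)
    if had_background:
        # Slower path: run the whole saved audio file through the model again.
        return transcribe_full_audio()
    # Fast path: fully foreground recordings are left untouched.
    return session.first_pass_transcript


calls = []


def fake_full_pass() -> str:
    calls.append("full")  # count how often the slow path runs
    return "idea one idea two idea three"


foreground = RecordingSession("idea one idea two idea three")
interrupted = RecordingSession("idea one idea three", background_spans=[(4.0, 9.5)])

foreground_text = final_transcript(foreground, fake_full_pass)
interrupted_text = final_transcript(interrupted, fake_full_pass)
```

Only the interrupted session pays for the second pass, which matches the claim that foreground recordings stay just as fast.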
Open-sourcing the native path
The newest piece is native. Justin built and open-sourced Liquid Speech, an MIT-licensed Flutter bridge to Apple's on-device SpeechAnalyzer API, with a runtime availability check so apps can use the native path on iOS 26 and macOS 26 and fall back cleanly on older versions. The same principle that shaped our transcription, keeping the audio on the device, is now a small piece of public infrastructure other people can build on.
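The availability check is the part most worth copying in other apps. The sketch below is a language-neutral illustration in Python, not Liquid Speech's actual API (which is Dart): `pick_engine` and the engine labels are invented for the example, and the version threshold comes from the "26" mentioned above.

```python
NATIVE_MIN_MAJOR = 26  # native path needs OS major version 26, per the text above


def pick_engine(platform: str, os_major: int) -> str:
    """Return 'native' when the platform API should exist, else 'fallback'."""
    if platform in ("ios", "macos") and os_major >= NATIVE_MIN_MAJOR:
        return "native"
    # Older OS versions and other platforms keep the existing engine.
    return "fallback"


choices = {
    ("ios", 26): pick_engine("ios", 26),
    ("macos", 15): pick_engine("macos", 15),
    ("android", 30): pick_engine("android", 30),
}
```

Checking at runtime rather than at build time lets one binary serve both old and new OS versions, with the caller never needing to know which engine answered.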
The priorities have stayed in that order the whole time: keep the audio on the device, make the device fast enough to keep up, and make sure the transcript only holds what you actually said.
