Vext 1.2.0: Two-Pass Diarization, a Self-Healing Event Tap, and Five Languages

Vext 1.1 shipped three weeks ago. Vext 1.2.0 lands today.

getvext.app · 1.1.0 → 1.2.0

The user-facing 1.2.0 announcement covers what's new in the app — a Speakers tab, full multi-language UI, sharper meeting transcripts. This post is the engineering counterpart: the problems that were genuinely hard to ship, and what they look like in the code now.

Streaming diarization isn't enough — even when it's "right"

Real-time meeting transcription has one constraint we can't work around: it sees audio once, in order. The streaming diarizer assigns each VAD chunk a single speaker label using one embedding per chunk. That's a good approximation when people take turns. When they don't, a fast back-and-forth collapses into a single, often wrong, label.

Two speakers stepping on each other for ten seconds should not become "Speaker 1 for ten seconds."

1.2.0 keeps the streaming pass — meeting transcripts still appear live, chunk by chunk — but after the meeting ends and the provisional transcript is saved, Vext runs a second diarization pass on the per-stream WAV archives. The offline pipeline is:

pyannote Community-1 for segmentation
WeSpeaker embeddings with overlap-frame masking
VBx Bayesian refinement to consolidate clusters

The offline pass re-attributes every chunk using clusters computed over the whole meeting rather than one chunk at a time. When it recognizes a known speaker, it updates that speaker's embedding in the database, so the next meeting picks them up faster. Vext deletes the temporary WAV archives once refinement completes.
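One way to picture the re-attribution step is a maximum-overlap vote: each streaming chunk takes the offline speaker whose segments cover most of its duration. This is a minimal sketch, not Vext's implementation. The `Span` type, the `reattribute` function, and the rule that an uncovered chunk keeps its provisional label are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: float   # seconds from the start of the recording
    end: float     # seconds from the start of the recording
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def reattribute(chunks, offline_segments):
    """Relabel each streaming chunk with the offline speaker covering most of it.

    Assumption: a chunk with no overlapping offline segment keeps its
    provisional streaming label.
    """
    refined = []
    for chunk in chunks:
        totals = {}
        for seg in offline_segments:
            shared = overlap(chunk.start, chunk.end, seg.start, seg.end)
            if shared > 0:
                totals[seg.speaker] = totals.get(seg.speaker, 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else chunk.speaker
        refined.append(Span(chunk.start, chunk.end, speaker))
    return refined
```

The streaming labels are only a fallback here. Wherever the offline pass has an opinion about a stretch of audio, that opinion wins.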

You see the streaming result during the meeting. You read the refined result afterwards. They're not the same artifact — and that's deliberate.

Multi-speaker chunks: slice instead of label

The other half of the diarization story happens inside a single chunk.

Sortformer emits a per-frame speaker timeline. If two or more distinct speaker indices appear inside one VAD chunk, transcribing it as a single block forces the model to attribute everything to one voice. A 30-second chunk containing a brisk exchange becomes one transcribed block with one speaker label — two voices collapsed into one.

1.2.0 slices the audio at the speaker change-points and transcribes each turn independently. One chunk in becomes N chunks out, each with its own speaker label, each transcribed as a discrete utterance.

One detail that took longer than it should have: Sortformer fires noisy sub-300ms flickers — a single frame attributed to a different speaker mid-utterance. Splitting on every flicker fragments the transcript and produces phantom turns. Sub-300ms regions are now absorbed into the longest adjacent run before the slicer runs, so the splits we make are the splits that exist.
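The absorb-then-slice order can be sketched as follows. This is an illustrative sketch under stated assumptions, not the shipped code: the 80 ms frame duration, the function names, and the tie-break (prefer the earlier neighbour when both adjacent runs are equally long) are hypothetical. Only the 300 ms threshold and the "longest adjacent run" rule come from the description above.

```python
from itertools import groupby


def runs_of(frames):
    """Run-length encode per-frame speaker indices into [speaker, length] pairs."""
    return [[spk, len(list(group))] for spk, group in groupby(frames)]


def absorb_flickers(frames, frame_ms=80, min_ms=300):
    """Merge runs shorter than min_ms into their longest adjacent run."""
    runs = runs_of(frames)
    while len(runs) > 1:
        # Handle the shortest run first; stop once it is long enough.
        i = min(range(len(runs)), key=lambda k: runs[k][1])
        if runs[i][1] * frame_ms >= min_ms:
            break
        neighbours = [k for k in (i - 1, i + 1) if 0 <= k < len(runs)]
        target = max(neighbours, key=lambda k: runs[k][1])
        runs[target][1] += runs[i][1]
        del runs[i]
        # Absorbing a flicker can leave two runs of the same speaker side by side.
        merged = []
        for spk, length in runs:
            if merged and merged[-1][0] == spk:
                merged[-1][1] += length
            else:
                merged.append([spk, length])
        runs = merged
    return runs


def slice_turns(runs, frame_ms=80):
    """Turn cleaned runs into (start_ms, end_ms, speaker) slices to transcribe."""
    turns, cursor = [], 0
    for spk, length in runs:
        turns.append((cursor, cursor + length * frame_ms, spk))
        cursor += length * frame_ms
    return turns
```

With a one-frame flicker in the middle of speaker 0's turn, `absorb_flickers([0]*10 + [1] + [0]*10 + [1]*12)` yields two runs rather than four, so the slicer makes one cut instead of three.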

The microphone-poisoning bug

Apple's setVoiceProcessingEnabled on AVAudioInputNode does what it says: AGC, noise suppression, echo cancellation. It also does something the docs don't emphasize — it mutates shared HAL state on the input device.

Turn it on in Vext, and every other app reading the same microphone — Zoom, FaceTime, OBS, every recorder — sees AGC and noise suppression applied to their feed too. The user's voice sounds distant and gain-reduced in the call they're actually in. Turn it off, and you get the same problem in reverse the next time another app turns it on.

The instinct is to fight the API — push it off, then back on, hold a lock, restore state. The correct answer is "don't use it here at all." Vext captures meeting participants via a separate system-audio process tap, not via the microphone path. The mic stream and the system stream are physically distinct. Echo cancellation across them was never necessary; it was solving a problem that doesn't exist in this architecture.

1.2.0 removes the call. The shared HAL state is no longer disturbed.

An event tap that lies

The global keyboard event tap — the thing that makes hold-a-hotkey-and-speak work — has a failure mode worth describing because it took a while to track down.

After display sleep, system sleep, or fast user switching, the mach port backing the tap can become stale. CGEventTapIsEnabled keeps returning true. Events are silently dropped. The user holds the hotkey; nothing happens. Restarting the app fixes it. Nothing in the logs explains why.

1.2.0 self-heals:

We now subscribe to NSWorkspace.didWakeNotification, screensDidWakeNotification, and sessionDidBecomeActiveNotification. Each triggers a full tap reinstall — not a re-enable, a recreate.
When the system fires tapDisabledByTimeout, we verify the re-enable actually took hold. If it didn't, the same full reinstall runs.
The health-check timer now runs in .common run-loop modes. Scheduled in the default mode, it did not fire during menu tracking and drag operations, which are exactly the moments when a stale tap is most likely to break the user's next interaction.
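Stripped of the macOS APIs, the recovery policy above reduces to two handlers. The sketch below is platform-neutral and purely illustrative: `TapSupervisor`, `make_tap`, and the tap object's `enable()`, `is_enabled()`, and `close()` methods are hypothetical stand-ins, not Quartz Event Services calls. How the real code verifies that a re-enable took hold is not described here, so the check is reduced to a single boolean.

```python
class TapSupervisor:
    """Recovery policy for a global event tap, independent of any OS API.

    `make_tap` is a hypothetical factory that returns an object with
    enable(), is_enabled() and close() methods.
    """

    def __init__(self, make_tap):
        self._make_tap = make_tap
        self.tap = make_tap()
        self.reinstalls = 0

    def reinstall(self):
        """Tear the tap down and build a new one instead of re-enabling it."""
        self.tap.close()
        self.tap = self._make_tap()
        self.reinstalls += 1

    def on_wake_or_session_change(self):
        # After sleep or user switching the enabled flag cannot be trusted,
        # so skip the check and recreate unconditionally.
        self.reinstall()

    def on_disabled_by_timeout(self):
        # Try the cheap fix first, then verify it before trusting it.
        self.tap.enable()
        if not self.tap.is_enabled():
            self.reinstall()
```

The asymmetry is the point of the design: a timeout gets a cheap re-enable with verification, while wake and session events always pay for a full recreate.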

It's not glamorous code. It's the kind of code that decides whether the app is reliable enough to live in your menu bar.

One trigger path

Vext can start dictation, a note, or a meeting from a keyboard hotkey or from the menu bar. In 1.1, those were two code paths. They had drifted.

The hotkey path went through the coordinator: cursor bubble shown, license checked, state callbacks fired, paste serialization enforced. The menu bar path bypassed most of that and called the recording layer directly. Subtle bugs surfaced — a session that started from the menu with a stale license state, a missing cursor bubble that left users wondering whether anything was recording.

1.2.0 routes the menu actions through the same code path as the hotkey, with one declared difference: menu-driven dictation is treated as hands-free (toggle to start, toggle to stop), since there's no physical key to hold. Everything else is identical because it's the same function call.
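As a rough illustration of what "the same function call" means, both triggers below go through one entry point and differ only in a flag derived from the source. Every name here (`Coordinator`, `Source`, `SessionRequest`, the log strings) is hypothetical, and the checks are reduced to a license gate and a cursor bubble for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    HOTKEY = "hotkey"
    MENU = "menu"


@dataclass(frozen=True)
class SessionRequest:
    kind: str          # "dictation", "note" or "meeting"
    hands_free: bool   # toggle to start and stop, instead of holding a key


class Coordinator:
    def __init__(self, license_valid):
        self.license_valid = license_valid   # callable returning bool
        self.log = []

    def start(self, kind, source):
        """Single entry point for every trigger source."""
        if not self.license_valid():
            self.log.append("license_rejected")
            return None
        self.log.append("cursor_bubble_shown")
        # The one declared difference: there is no key to hold in a menu.
        request = SessionRequest(kind, hands_free=(source is Source.MENU))
        self.log.append(f"started:{kind}")
        return request
```

Because the menu no longer has its own path, a new check added to `start` applies to both triggers without anyone having to remember the second call site.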

Speaker labels that don't drift

Two paths were producing names for the same audio. The per-meeting speaker snapshot — the list of voices and their assigned names you see in the meeting detail — used to be reconstructed from DiarizerManager.getSpeakerList() after the meeting ended, then mapped to display names separately from the chunk labels.

1.2.0 builds the snapshot progressively during recording via liveSnapshot, using the same diarizeSpeaker() call that labels chunks. Same source of truth, by construction. Speakers already in the global KnownSpeakerRepository are excluded from the per-meeting snapshot, since their embeddings live globally and don't need to be re-listed per meeting.
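The "same source of truth" idea can be sketched as a snapshot that is fed from the labelling call itself instead of being rebuilt afterwards. This is a simplified illustration: the `LiveSnapshot` class, the `label_chunk` helper, and the `(speaker_id, name)` return shape of the `diarize` callable are assumptions, not the actual signatures of liveSnapshot or diarizeSpeaker().

```python
class LiveSnapshot:
    """Per-meeting speaker list built while chunks are being labelled."""

    def __init__(self, known_speaker_ids):
        self._known = set(known_speaker_ids)   # already stored globally
        self._speakers = {}                    # id -> name, first-seen order

    def record(self, speaker_id, display_name):
        """Call with the same label that was just attached to a chunk."""
        if speaker_id in self._known:
            return   # globally known speakers are not re-listed per meeting
        self._speakers.setdefault(speaker_id, display_name)

    def speakers(self):
        return list(self._speakers.items())


def label_chunk(chunk_audio, diarize, snapshot):
    """Label one chunk and update the snapshot from that same result."""
    speaker_id, name = diarize(chunk_audio)
    snapshot.record(speaker_id, name)
    return speaker_id, name
```

Since `record` only ever sees what `diarize` returned for a chunk, the snapshot cannot name a speaker that no chunk was labelled with.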

Five languages, one table

The localization story is short because the implementation is dull on purpose.

Every user-visible string — sidebar labels, menu items, empty states, onboarding prompts, permission descriptions, toolbar tooltips, About text, model picker labels — goes through a single centralized translation table. Five languages: English, Spanish, Russian, Hindi, Thai. Missing keys fall back to English silently.

The language picker (Settings → General, and in onboarding) has an AUTO option that follows the macOS system locale. Picking a specific language switches without a restart — no app relaunch, no view reload, no flash. That's possible for the same reason the table is cheap to extend: every visible string reads from the table at render time, not at app launch.
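A minimal sketch of render-time lookup with English fallback follows. The table contents, key names, and the `Localizer` class are invented for illustration, and returning the key itself when even English lacks an entry is an assumption rather than documented Vext behaviour.

```python
TABLE = {
    "en": {"sidebar.meetings": "Meetings", "menu.quit": "Quit"},
    "es": {"sidebar.meetings": "Reuniones"},
}


class Localizer:
    def __init__(self, table, system_locale, choice="auto"):
        self.table = table
        self.system_locale = system_locale   # callable, e.g. returns "es_ES"
        self.choice = choice                 # "auto" or a language code

    def language(self):
        """Resolve the active language on every call, never cached."""
        lang = self.choice
        if lang == "auto":
            lang = self.system_locale().replace("-", "_").split("_")[0].lower()
        return lang if lang in self.table else "en"

    def t(self, key):
        """Look up a string at render time, falling back to English silently."""
        english = self.table["en"]
        return self.table[self.language()].get(key) or english.get(key, key)
```

Changing `choice` takes effect on the next `t()` call, which is the property that makes a restart unnecessary: nothing holds on to a resolved string.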

If we add a sixth language, the work is translation, not engineering.

How to upgrade

If you're on 1.1.0:

brew upgrade muvon/tap/vext

Or grab the DMG from getvext.app/download.

If you're new:

brew install muvon/tap/vext

Existing speakers, dictations, notes, and meetings are preserved. The first meeting recorded under 1.2.0 is the first one with the two-pass diarization in effect.

Full Vext 1.2.0 release notes →

— Don
