✦ ASR Benchmark — Cantonese Meeting Transcription

Citibank 高德 onboarding call · 3 min clip · 3 speakers · 5 providers · Apr 2026

Each dimension below lists five cells, one per provider, in this column order:

cantonese.ai · Cantonese-native · API · diarize · RECOMMENDED

whisper.cpp v3 · Local M4 Max · lang=yue · Metal · FREE · SURPRISE

whisper-v3 API · Fireworks · lang=yue · CURRENT PIPELINE

whisper.cpp v2 · Local M4 Max · lang=zh · Metal · TOM'S REC

MLX Whisper v3 · Local M4 Max · lang=yue · MLX · yue BROKEN

Cantonese Fidelity
Same passage across all 5 providers.

7 / 10

① 「噉誒封面圖就係呢一個啦, 哩個冇問題」
② 「當中我哋有啲咩service可以provide到嘅」
③ 「啫係first watch我哋未必有咁多時間」
Cantonese particles: 噉/哩/嘅/啫/冇/囉 ✓
Garbles: 銀河→銀行 · 落雪樓 · ｂ通amana

7 / 10

① 「咁，封面圖就係呢一個啦，呢個冇問題」
② 「當中我哋有啲咩service可以provide到嘅」
③ 「因為我係驚，啫係first launch我哋未必有咁多時間」
Same Cantonese particles · cleaner than cantonese.ai
No garbles, but no diarization either

4 / 10

① 「封面圖就係呢一個啦 呢個冇問題」
② 「當中我們有什麼服務可以提供」
③ 「因為我們未必有那麼多時間」
First 30s Cantonese, then Mandarin drift
我哋→我們 · 啲咩→什麼 · provide→提供

2 / 10

① 「封面圖就是這個 這個沒問題」
② 「當中我們有什麼服務可以提供」
③ 「first launch我們未必有這個功能」
Pure Mandarin · 冇→沒 · 嘅→的 · 噉→那
v2 hears Cantonese but writes Mandarin (no yue token)

2 / 10

① 「封面圖就是這個 這個沒有問題」
② 「當中我們有什麼Service可以提供」
③ 「因為我的First Launch 可能沒有那麼多時間」
Pure Mandarin despite lang=yue · MLX port ignores yue token
Mandarin output like the v2 zh run · yue flag has no effect
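The Mandarin drift noted in this row can be roughly quantified by counting Cantonese-specific characters against their written-Mandarin counterparts. The following is a minimal sketch only: the marker sets are illustrative, drawn from the substitutions listed in this row, and `cantonese_ratio` is a hypothetical helper, not part of any provider's API.

```python
# Hypothetical helper: the marker sets are illustrative, drawn from the
# substitutions listed in this row (我哋→我們, 冇→沒, 嘅→的, 噉→那, ...).
CANTONESE_MARKERS = set("噉咁哩嘅啫冇囉哋啲咩係嗰")
MANDARIN_MARKERS = set("們什麼沒的那是這")


def cantonese_ratio(text: str) -> float:
    """Share of marker characters that are Cantonese-specific (0.0 to 1.0)."""
    yue = sum(ch in CANTONESE_MARKERS for ch in text)
    zh = sum(ch in MANDARIN_MARKERS for ch in text)
    total = yue + zh
    # With no marker characters at all there is nothing to judge.
    return yue / total if total else 0.0


samples = {
    "whisper.cpp v3 (yue)": "咁，封面圖就係呢一個啦，呢個冇問題",
    "whisper.cpp v2 (zh)": "封面圖就是這個 這個沒問題",
}
for name, text in samples.items():
    print(f"{name}: {cantonese_ratio(text):.2f}")
```

A ratio near 1 suggests written Cantonese and a ratio near 0 suggests Mandarin script. Running it per segment would show where a transcript starts to drift.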

Speaker Diarization
Who said what? Gates action attribution.

8 / 10

SPEAKER_00: 嗱in general呢我通常for一啲clients…
SPEAKER_01: 啫係Let's say你個特色服務係…
SPEAKER_02: 噉誒封面圖就係呢一個啦…
3 speakers correctly separated throughout

0 / 10

All 3 speakers merged into one continuous stream with timestamps but no labels
Whisper architecture has no diarization

0 / 10

Flat paragraph, no speaker labels, no segment timestamps
Same model as local but even less structure

0 / 10

Timestamped segments but all 3 speakers merged into one stream
v2 also has no diarization capability

0 / 10

94 timestamped segments, all speakers merged into one stream
Same Whisper architecture, no diarization
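Diarized output in the `SPEAKER_NN: text` shape shown in the cantonese.ai cell can be grouped per speaker before any action extraction. The sketch below assumes that line layout from the excerpt above. The actual API response format is not documented here, and `group_by_speaker` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpt above: "SPEAKER_00: text".
# The real cantonese.ai response format may differ.
LINE_RE = re.compile(r"^(SPEAKER_\d+):\s*(.+)$")


def group_by_speaker(transcript: str) -> dict[str, list[str]]:
    """Collect utterances per speaker label, preserving order of appearance."""
    turns: dict[str, list[str]] = defaultdict(list)
    for line in transcript.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines without a label are skipped, not guessed
            speaker, text = match.groups()
            turns[speaker].append(text)
    return dict(turns)


excerpt = """\
SPEAKER_00: 嗱in general呢我通常for一啲clients…
SPEAKER_01: 啫係Let's say你個特色服務係…
SPEAKER_02: 噉誒封面圖就係呢一個啦…
"""
turns = group_by_speaker(excerpt)
print(len(turns), "speakers:", sorted(turns))
```

The four Whisper outputs carry no such labels, so the same function returns an empty mapping for them, which is why they score 0 on this dimension.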

Entities + Code-Switch
Same passage: the "5 buttons" proposal.

7 / 10

「你先訂立咗可能呢好似滙豐噉樣有五個button啊，你畀咗我字我先，跟住我再tie in返你哋個look and feel for citibank本身嘅ｂ通amana…brand tone啦brand color」
citibank ✓ · 滙豐 (trad.) ✓ · Cantonese grammar ✓
"brand tone and manner" → "ｂ通amana" garble

6 / 10

「你先定立咗,可能呢,寫好似匯豐咁樣有五個button,你畀咗個字我先,跟住我再tie in返你哋個look and feel for city band本身嘅tone and manner…brand tone,brand colour」
tone and manner ✓ · Cantonese grammar ✓ · clean
Citibank → "city band" ✗ (consistent error)

4 / 10

「你先定立了可能像匯豐那樣有五個Button 你先給我一個字然後我再Tie in你們的Look and Feel for City Band本身的Tone and Manner…Brand Tone Brand Color」
English terms ✓ but Cantonese → Mandarin grammar
Citibank → City Band ✗ · 會訪→匯豐 (another segment)

3 / 10

「你先訂立了好像匯豐那樣有5個button 你先給我一個字然後我再tie in你們的look and feel for city band本身的tone and manner…brand tone brand color」
English ✓ but pure Mandarin · 噉→那 · 嘅→的
Citibank → city band ✗ · "Spring Cabinet" elsewhere

3 / 10

「你先訂立了像匯豐那樣有五個Button 你先給我一個字然後我再Tie in你們的Look and Feel For CityBand本身的Tone and Manner…Brand Tone啦Brand Color」
English preserved (Title Case) · Mandarin grammar
CityBand ✗ · Core Relay→correlate garble elsewhere

Readability
Same passage: emoji/button discussion.

5 / 10

「in general呢我通常for一啲clients我哋都會上返去搵一啲類似emoji feel嘅一啲button嘅擺上去嘅，但係會cooperate返嚟嗰個文字嘅一個surface」
cooperate (should be correlate) · surface (should be service)
~15% word-level errors disrupt LLM extraction

8 / 10

「in general,我通常for嗰啲clients,我哋都會上返去搵一啲類似emoji feel嘅一啲button嘅擺上去嘅，但係會correlate返嚟嗰個文字嘅一個service」
correlate ✓ · service ✓ · natural sentence flow
Cleanest output, directly usable for LLM extraction

5 / 10

「In general 我通常for我的client 我們都會上去找一些類似emoji feel的button 放上去但會correlate到你的文字的一個service」
correlate ✓ · but 我哋→我們 drift mid-sentence
Flat paragraph, no segmentation at all

7 / 10

「在一般來說我通常給一些客戶我們都會上去找一些類似emoji feel的一些button 擺上去的但是會coordinate你那個文字的一個service」
Clean Mandarin · coordinate ≈ correct · readable
Wrong language for Cantonese context extraction

4 / 10

「In general 我通常For一些Client 我們都會上去找一些類似Emoji Feel的一些Button 放上去的但會Core Relay你那個文字的一個Service」
Title Case English, readable but non-standard
Core Relay→correlate garble · Mandarin throughout
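Error rates like the "~15%" quoted in this row are usually computed as edit distance over reference length. The sketch below uses character error rate, which is an assumption: the scorecard says "word-level" but does not state its method, and no ground-truth transcript is included, so the whisper.cpp v3 fragment stands in as the reference purely for illustration.

```python
def edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis character
                            curr[j - 1] + 1,      # add a reference character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(hyp: str, ref: str) -> float:
    """Edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(hyp, ref) / len(ref)


# Illustration only: whisper.cpp v3 output used as a stand-in reference.
reference = "但係會correlate返嚟嗰個文字嘅一個service"
hypothesis = "但係會cooperate返嚟嗰個文字嘅一個surface"
print(f"CER on this fragment: {char_error_rate(hypothesis, reference):.2%}")
```

A real measurement would need a hand-corrected transcript of the full clip as the reference.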

Completeness
Volume + structure of captured content.

6 / 10

~2,556 chars · 48 SRT segments · 3 diarized blocks
Richest structure: speaker + timestamp + text
~15% garbled segments reduce usable content

8 / 10

~2,500 chars · 90 segments (2s each) · clean timestamps
Most complete: every utterance captured cleanly
No filler skipped, no garbles

5 / 10

~903 chars · 6 timestamp blocks · flat paragraph
Same model, ~64% less content than local run
Fireworks decoding params sacrifice detail

7 / 10

~2,200 chars · clean segments · good timestamps
Complete content in Mandarin, nothing lost
But wrong language misleads extraction

5 / 10

~984 chars · 94 segments · word-level timestamps
Less content than whisper.cpp v3 (984 vs 2,500)
MLX conversion may lose decoding quality

Speed & Cost
Measured wall-clock time on 3-min clip.

8 / 10

8.75s single pass · ~18s w/ diarization (2 passes)
Measured: process_time=8.75s · fastest quality option
Pro plan HK$299.9/mo · 30 hrs · ~US$0.02/min · 60 min max/file

5 / 10

44s total (whisper_print_timings) · Free
Measured: M4 Max Metal GPU · 14.7s per audio min
5× slower than cantonese.ai server

9 / 10

~1-3s estimated · ~$0.004/min (Fireworks pricing)
Not measured: no timing in response JSON
Near-instant but quality penalty is steep

6 / 10

28s total (whisper_print_timings) · Free
Measured: M4 Max Metal GPU · 9.3s per audio min
Faster than whisper.cpp v3 (28s vs 44s) · but wrong language

6 / 10

~19s transcription (excl. 42s model download) · Free
Measured: M4 Max MLX · 6.3s per audio min
Faster than whisper.cpp but quality much worse
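The per-audio-minute figures in this row follow from dividing wall-clock time by the 3-minute clip length. The sketch below reproduces that arithmetic from the numbers quoted above. The function and variable names are illustrative, and Fireworks is left out because its timing was not measured.

```python
CLIP_MINUTES = 3  # clip length stated in the header

# Wall-clock seconds as reported in this row (Fireworks omitted: not measured).
measured_seconds = {
    "cantonese.ai (single pass)": 8.75,
    "whisper.cpp v3": 44.0,
    "whisper.cpp v2": 28.0,
    "MLX Whisper v3": 19.0,
}


def seconds_per_audio_minute(wall_seconds: float,
                             clip_minutes: float = CLIP_MINUTES) -> float:
    """Processing seconds spent per minute of audio."""
    return wall_seconds / clip_minutes


for name, secs in measured_seconds.items():
    print(f"{name}: {seconds_per_audio_minute(secs):.1f}s per audio min")

# Slowdown of local whisper.cpp v3 relative to the cantonese.ai single pass.
slowdown = (measured_seconds["whisper.cpp v3"]
            / measured_seconds["cantonese.ai (single pass)"])
print(f"whisper.cpp v3 vs cantonese.ai: {slowdown:.1f}x")

# Plan cost per included minute in HKD (HK$299.9 for 30 hours).
# The US$ figure in the row also depends on an exchange rate not given here.
plan_hkd_per_min = 299.9 / (30 * 60)
print(f"cantonese.ai Pro: HK${plan_hkd_per_min:.3f} per included minute")
```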

Intelligence Yield
Can we extract actions, decisions for Bob?

8 / 10

SPEAKER_00 asked Citibank to provide 5 feature buttons + brand tone
Speaker + action + context = attributable intel

5 / 10

Can extract: "5 buttons needed", "brand tone", "海報管理"
Content-rich but no speaker → can't attribute

3 / 10

Same topics but less detail + entity errors reduce confidence
Same model, worse output than local version

3 / 10

Content complete in Mandarin, but wrong language + no speakers
Mandarin misleads Cantonese-context extraction

2 / 10

Mandarin + no speakers + garbles (Core Relay) + less content
Worst overall: MLX port degrades both yue handling and quality

Total

7.0

BEST FOR MEETINGS

5.9

BEST FREE OPTION

4.3

FAST BUT LOSSY

4.6

HEARS YUE · WRITES ZH

3.1

yue TOKEN BROKEN IN MLX
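The Total row does not state how it is derived. As a sanity check, the sketch below averages the seven dimension scores per provider, assuming equal weight per dimension (an assumption, not something the scorecard states). Under that assumption the mean reproduces 7.0, 4.3 and 3.1, but gives 5.6 and 4.0 where the card shows 5.9 and 4.6, so the published totals may apply a weighting that is not described here.

```python
# Per-dimension scores copied from the rows above, in row order:
# fidelity, diarization, entities, readability, completeness, speed, yield.
scores = {
    "cantonese.ai": [7, 8, 7, 5, 6, 8, 8],
    "whisper.cpp v3": [7, 0, 6, 8, 8, 5, 5],
    "whisper-v3 API": [4, 0, 4, 5, 5, 9, 3],
    "whisper.cpp v2": [2, 0, 3, 7, 7, 6, 3],
    "MLX Whisper v3": [2, 0, 3, 4, 5, 6, 2],
}

# Assumption: equal weight per dimension. The scorecard may weight differently.
unweighted = {
    name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()
}
for name, total in unweighted.items():
    print(f"{name}: {total}")
```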

Key Finding — Same Model, Different Quality

whisper.cpp v3 running locally produces ~2× more content than the Fireworks API, although both use the same model (whisper-large-v3). Local: ~2,500 chars. API: ~1,300 chars. MLX Whisper v3 was tested as a faster local alternative (19s vs 44s), but the MLX port ignores the yue language token and outputs pure Mandarin, as v2 does. MLX is faster but useless for Cantonese. whisper.cpp remains the only working local option for yue.

Recommendation for Hopeman

cantonese.ai Pro (HK$299.9/mo): diarization enables attribution such as "Crystal said X, Veaky proposed Y". 30 hrs/mo covers Hopeman volume at 1.5% of retainer cost.
whisper.cpp v3 local as fallback: free, good Cantonese, runs on M4 Max.
MLX Whisper eliminated: faster (19s vs 44s), but the yue token is broken, so output is Mandarin.
v2 hears Cantonese (Tom correct per OpenAI) but writes Mandarin. The v3 yue token controls output script, not recognition.

Timings: cantonese.ai process_time from the API, whisper.cpp from whisper_print_timings, MLX from Python time.time(), Fireworks estimated. v2 yue support per OpenAI Common Voice 15 benchmark: it processes the audio but outputs zh text. MLX Whisper 0.4.3 (mlx-community/whisper-large-v3-mlx) ignores the yue token. Also tested: Deepgram Nova-3 (zero), Google STT (not enabled). Local: M4 Max, Metal, whisper.cpp 1.8.4, mlx-whisper 0.4.3.

Prepared by Eric San