Cantonese Fidelity
Same passage across all 5 providers.
7 / 10
① 「噉誒封面圖就係呢一個啦, 哩個冇問題」
② 「當中我哋有啲咩service可以provide到嘅」
③ 「啫係first watch我哋未必有咁多時間」
粵語 particles: 噉/哩/嘅/啫/冇/囉 ✓
Garbles: 銀河→銀行 · 落雪樓 · b通amana
7 / 10
① 「咁,封面圖就係呢一個啦,呢個冇問題」
② 「當中我哋有啲咩service可以provide到嘅」
③ 「因為我係驚,啫係first launch我哋未必有咁多時間」
Same 粵語 particles · cleaner than cantonese.ai
No garbles, but no diarization either
4 / 10
① 「封面圖就係呢一個啦 呢個冇問題」
② 「當中我們有什麼服務可以提供」
③ 「因為我們未必有那麼多時間」
First 30s Cantonese → then Mandarin drift
我哋→我們 · 啲咩→什麼 · provide→提供
2 / 10
① 「封面圖就是這個 這個沒問題」
② 「當中我們有什麼服務可以提供」
③ 「first launch我們未必有這個功能」
Pure Mandarin · 冇→沒 · 嘅→的 · 噉→那
v2 hears Cantonese but writes Mandarin (no yue token)
2 / 10
① 「封面圖就是這個 這個沒有問題」
② 「當中我們有什麼Service可以提供」
③ 「因為我的First Launch 可能沒有那麼多時間」
Pure Mandarin despite lang=yue · MLX port ignores yue token
Identical to v2 zh output: the yue flag has no effect
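The Cantonese-versus-Mandarin distinction scored above can be roughly approximated by counting marker words. The sketch below is a minimal illustration, not the method used to produce these scores: the marker lists are taken only from the particles and substitutions noted in this comparison, and the function names are hypothetical.

```python
# Written-Cantonese markers and the Mandarin forms they were replaced with,
# taken from the substitutions noted in this comparison (not an exhaustive list).
CANTONESE_MARKERS = ["我哋", "啲咩", "噉", "哩", "嘅", "啫", "冇", "囉"]
MANDARIN_MARKERS = ["我們", "什麼", "沒", "的", "那"]


def marker_counts(text: str) -> tuple[int, int]:
    """Return (cantonese_hits, mandarin_hits) for a transcript string."""
    yue = sum(text.count(marker) for marker in CANTONESE_MARKERS)
    zh = sum(text.count(marker) for marker in MANDARIN_MARKERS)
    return yue, zh


def looks_cantonese(text: str) -> bool:
    """Crude heuristic: more Cantonese markers than Mandarin markers."""
    yue, zh = marker_counts(text)
    return yue > zh


# Two renderings of the same utterance, copied from the samples in this section.
cantonese_sample = "噉誒封面圖就係呢一個啦, 哩個冇問題"
mandarin_sample = "封面圖就是這個 這個沒問題"

print(marker_counts(cantonese_sample), looks_cantonese(cantonese_sample))
print(marker_counts(mandarin_sample), looks_cantonese(mandarin_sample))
```

A check like this could flag Mandarin drift per segment, but it says nothing about transcription accuracy, so it complements the manual scoring rather than replacing it.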
Speaker Diarization
Who said what? Gates action attribution.
All 3 speakers merged into one continuous stream with timestamps but no labels
Whisper architecture has no diarization
0 / 10
Flat paragraph, no speaker labels, no segment timestamps
Same model as local but even less structure
0 / 10
Timestamped segments but all 3 speakers merged into one stream
v2 also has no diarization capability
0 / 10
94 timestamped segments, all speakers merged into one stream
Same Whisper architecture: no diarization
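Timestamped segments without speaker labels can only be attributed if a separate diarization pass supplies speaker turns. The sketch below shows one common way to combine the two, by assigning each segment to the turn it overlaps most. It is an illustration under assumptions: the `Segment` and `Turn` types, the `attribute` function, and the sample data are all hypothetical, and nothing here describes how any provider in this comparison does it internally.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped transcript segment (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One diarization turn: who was speaking, and when."""
    start: float
    end: float
    speaker: str


def overlap_seconds(seg: Segment, turn: Turn) -> float:
    """Length of the time interval shared by a segment and a turn."""
    return max(0.0, min(seg.end, turn.end) - max(seg.start, turn.start))


def attribute(segments: list[Segment], turns: list[Turn]) -> list[tuple[str, str]]:
    """Label each segment with the speaker whose turn overlaps it most."""
    labelled = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap_seconds(seg, t), default=None)
        if best is None or overlap_seconds(seg, best) == 0.0:
            labelled.append(("UNKNOWN", seg.text))  # no turn covers this segment
        else:
            labelled.append((best.speaker, seg.text))
    return labelled


# Hypothetical data, for illustration only.
turns = [Turn(0.0, 5.0, "SPEAKER_00"), Turn(5.0, 9.0, "SPEAKER_01")]
segments = [
    Segment(0.0, 2.0, "first segment"),
    Segment(4.0, 7.0, "second segment"),
    Segment(10.0, 12.0, "third segment"),
]
print(attribute(segments, turns))
```

Segments that straddle a speaker change are assigned to whichever turn covers more of them, which is a simplification; word-level timestamps would allow a finer split.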
Entities + Code-Switch
Same passage: the "5 buttons" proposal.
7 / 10
「你先訂立咗可能呢好似滙豐噉樣有五個button啊,你畀咗我字我先,跟住我再tie in返你哋個look and feel for citibank本身嘅b通amana…brand tone啦brand color」
citibank ✓ · 滙豐 (trad.) ✓ · 粵語 grammar ✓
"brand tone and manner" → "b通amana" garble
6 / 10
「你先定立咗,可能呢,寫好似匯豐咁樣有五個button,你畀咗個字我先,跟住我再tie in返你哋個look and feel for city band本身嘅tone and manner…brand tone,brand colour」
tone and manner ✓ · 粵語 grammar ✓ · clean
Citibank → "city band" ✗ (consistent error)
4 / 10
「你先定立了 可能像匯豐那樣有五個Button 你先給我一個字 然後我再Tie in你們的Look and Feel for City Band本身的Tone and Manner…Brand Tone Brand Color」
English terms ✓ but 粵語→Mandarin grammar
Citibank → City Band ✗ · 會訪→匯豐 (another seg)
3 / 10
「你先訂立了好像匯豐那樣有5個button 你先給我一個字 然後我再tie in你們的look and feel for city band本身的tone and manner…brand tone brand color」
English ✓ but pure Mandarin · 噉→那 · 嘅→的
Citibank → city band ✗ · "Spring Cabinet" elsewhere
3 / 10
「你先訂立了像匯豐那樣有五個Button 你先給我一個字 然後我再Tie in你們的Look and Feel For CityBand本身的Tone and Manner…Brand Tone啦Brand Color」
English preserved (Title Case) · Mandarin grammar
CityBand ✗ · Core Relay→correlate garble elsewhere
「in general,我通常for嗰啲clients,我哋都會上返去搵一啲類似emoji feel嘅一啲button嘅擺上去嘅,但係會correlate返嚟嗰個文字嘅一個service」
correlate ✓ · service ✓ · natural sentence flow
Cleanest output, directly usable for LLM
5 / 10
「In general 我通常for我的client 我們都會上去找一些類似emoji feel的button 放上去 但會correlate到你的文字的一個service」
correlate ✓ · but 我哋→我們 drift mid-sentence
Flat paragraph, no segmentation at all
7 / 10
「在一般來說我通常給一些客戶 我們都會上去找一些類似emoji feel的一些button 擺上去的 但是會coordinate你那個文字的一個service」
Clean Mandarin · coordinate~correct · readable
Wrong language for Cantonese context extraction
4 / 10
「In general 我通常For一些Client 我們都會上去找一些 類似Emoji Feel的一些Button 放上去的 但會Core Relay你那個 文字的一個Service」
Title Case English: readable but non-standard
Core Relay→correlate garble · Mandarin throughout
Completeness
Volume + structure of captured content.
~2,500 chars · 90 segments (2s each) · clean timestamps
Most complete: every utterance captured cleanly
No filler skipped, no garbles
5 / 10
~903 chars · 6 timestamp blocks · flat paragraph
Same model, ~64% less content than local run
Fireworks decoding params sacrifice detail
7 / 10
~2,200 chars · clean segments · good timestamps
Complete content in Mandarin, nothing lost
But wrong language misleads extraction
5 / 10
~984 chars · 94 segments · word-level timestamps
Less content than whisper.cpp v3 (984 vs 2,500)
MLX conversion may lose decoding quality
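The "~64% less content" figure follows from the approximate character counts reported in this section. A small sketch of that arithmetic, using a hypothetical helper name and the counts as stated here:

```python
def percent_less(reference_chars: int, other_chars: int) -> float:
    """How much shorter `other_chars` is than `reference_chars`, in percent."""
    return (reference_chars - other_chars) / reference_chars * 100


# Approximate character counts as reported in this section.
LOCAL_V3_CHARS = 2500  # whisper.cpp v3 local run

print(round(percent_less(LOCAL_V3_CHARS, 903)))   # ~903-char output
print(round(percent_less(LOCAL_V3_CHARS, 984)))   # ~984-char output
print(round(percent_less(LOCAL_V3_CHARS, 2200)))  # ~2,200-char output
```

Character count is only a proxy for completeness, since a Mandarin rendering can be shorter or longer than the Cantonese original without losing content.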
Speed & Cost
Measured wall-clock time on 3-min clip.
8 / 10
8.75s single pass · ~18s w/ diarization (2 passes)
Measured: process_time=8.75s · fastest quality option
Pro plan HK$299.9/mo · 30 hrs · ~US$0.02/min · 60 min max/file
5 / 10
44s total (whisper_print_timings) · Free
Measured: M4 Max Metal GPU · 14.7s per audio min
5× slower than cantonese.ai server
9 / 10
~1-3s estimated · ~$0.004/min (Fireworks pricing)
Not measured: no timing in response JSON
Near-instant but quality penalty is steep
6 / 10
28s total (whisper_print_timings) · Free
Measured: M4 Max Metal GPU · 9.3s per audio min
Fastest whisper.cpp run (28s vs 44s for v3) · but wrong language
6 / 10
~19s transcription (excl 42s model download) · Free
Measured: M4 Max MLX · 6.3s per audio min
Faster than whisper.cpp but quality much worse
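The per-audio-minute and per-minute cost figures above can be reproduced from the reported totals. The sketch below is a rough check rather than the original calculation: the helper names are hypothetical, and the HKD to USD rate of 7.8 is an assumption that does not come from this comparison.

```python
CLIP_MINUTES = 3      # length of the test clip
HKD_PER_USD = 7.8     # assumed exchange rate, not stated in the source


def seconds_per_audio_minute(wall_clock_s: float, clip_minutes: float = CLIP_MINUTES) -> float:
    """Processing time needed for each minute of audio."""
    return wall_clock_s / clip_minutes


def usd_per_audio_minute(monthly_hkd: float, included_hours: float) -> float:
    """Effective price per audio minute if the monthly quota is fully used."""
    return monthly_hkd / (included_hours * 60) / HKD_PER_USD


# Wall-clock totals as reported in this section.
for label, seconds in [("44s run", 44), ("28s run", 28), ("19s run", 19)]:
    print(label, round(seconds_per_audio_minute(seconds), 1), "s per audio min")

# Ratio between the 44s local run and the 8.75s cantonese.ai pass.
print(round(44 / 8.75, 1), "x slower")

# cantonese.ai Pro: HK$299.9 per month for 30 hours.
print(round(usd_per_audio_minute(299.9, 30), 2), "USD per audio min")
```

The cost figure assumes the full 30 hours are used each month; lower usage raises the effective per-minute price.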
Intelligence Yield
Can we extract actions and decisions for Bob?
8 / 10
SPEAKER_00 asked Citibank to provide 5 feature buttons + brand tone
Speaker + action + context = attributable intel
5 / 10
Can extract: "5 buttons needed", "brand tone", "海報管理"
Content-rich but no speaker → can't attribute
3 / 10
Same topics but less detail + entity errors reduce confidence
Same model, worse output than local version
3 / 10
Content complete in Mandarin, but wrong language + no speakers
Mandarin misleads Cantonese-context extraction
2 / 10
Mandarin + no speakers + garbles (Core Relay) + less content
Worst overall: MLX port degrades both yue handling and quality
Total
7.0
BEST FOR MEETINGS
5.9
BEST FREE OPTION
4.3
FAST BUT LOSSY
4.6
HEARS YUE · WRITES ZH
3.1
yue TOKEN BROKEN IN MLX
Key Finding — Same Model, Different Quality
whisper.cpp v3 local produces ~2× more content than Fireworks API v3 on the same model (whisper-large-v3). Local: ~2,500 chars. API: ~1,300 chars. MLX Whisper v3 was tested as a faster local alternative — 19s vs 44s — but the MLX port ignores the yue language token, outputting pure Mandarin identical to v2. MLX is faster but useless for Cantonese. whisper.cpp remains the only working local option for yue.
Recommendation for Hopeman
cantonese.ai Pro (HK$299.9/mo): diarization enables "Crystal said X, Veaky proposed Y". 30 hrs/mo covers Hopeman volume, at 1.5% of retainer cost.
whisper.cpp v3 local as fallback: free, good Cantonese, runs on M4 Max.
MLX Whisper eliminated: faster (19s vs 44s), but the yue token is broken, so the output is Mandarin.
v2 hears Cantonese (Tom correct per OpenAI) but writes Mandarin. The v3 yue token controls output script, not recognition.
Timings: cantonese.ai process_time from API, whisper.cpp from whisper_print_timings, MLX from Python time.time(), Fireworks estimated. v2 yue support per OpenAI Common Voice 15 benchmark: processes audio but outputs zh text. MLX Whisper 0.4.3 (mlx-community/whisper-large-v3-mlx) ignores yue token. Also tested: Deepgram Nova-3 (zero), Google STT (not enabled). Local: M4 Max, Metal, whisper.cpp 1.8.4, mlx-whisper 0.4.3.
Prepared by Eric San