Previous benchmarks had a confound: the same model played both the bot and the supplier simulator. Quality changes in the simulator could mask or inflate bot scores. This clean benchmark fixes that by separating three roles.
- Bot model: **varies**
- Supplier simulator quality: **constant**
- Scoring calibration: **constant**
Each model ran the same V7 prompt against the same 20 synthetic suppliers (all archetypes) across 2 products: shoes and phone-holder. That’s 40 conversations per model, 160 total, all scored on the same 9-dimension rubric.
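The run matrix works out as follows. The model names, products, and counts are from this report; the harness code itself is a hypothetical sketch, and the fixed simulator/judge models are placeholders:

```python
from itertools import product as cartesian

# Variable role: the bot model under test
BOT_MODELS = ["Gemini 3.1 Pro", "Kimi K2.5", "Claude Sonnet", "Gemini 3 Flash"]

# Held constant across all runs (the fix for the old confound)
SIM_MODEL = "<fixed supplier-simulator model>"
JUDGE_MODEL = "<fixed scoring model>"

PRODUCTS = ["shoes", "phone-holder"]
SUPPLIERS = [f"supplier_{i:02d}" for i in range(20)]  # 20 synthetic archetypes

runs = list(cartesian(BOT_MODELS, PRODUCTS, SUPPLIERS))
print(len(runs))                     # 160 conversations total
print(len(runs) // len(BOT_MODELS))  # 40 per bot model
```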
Average across both products (shoes + phone-holder), 40 conversations per model.
| Model | Avg Score | Time / Product |
|---|---|---|
| Gemini 3.1 Pro | 6.8/9 | ~55m |
| Kimi K2.5 | 6.5/9 | ~17m |
| Claude Sonnet | 6.4/9 | ~35m |
| Gemini 3 Flash | 6.0/9 | ~19m |
Per-product detail — shoes:
| Model | Score | % | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | S1 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 6.9 | 77 | .68 | .93 | .80 | 1.0 | .72 | .95 | .95 | .80 | .57 | 4/20 | 51m |
| Kimi K2.5 | 6.3 | 70 | .63 | .63 | .75 | 1.0 | .75 | .93 | .88 | .80 | .57 | 3/20 | 16m |
| Claude Sonnet | 6.3 | 70 | .53 | .87 | .79 | 1.0 | .63 | .95 | .95 | .84 | .45 | 1/19 | 33m |
| Gemini 3 Flash | 6.3 | 71 | .60 | .80 | .78 | 1.0 | .75 | .95 | .80 | .75 | .53 | 1/20 | 18m |
Per-product detail — phone-holder:
| Model | Score | % | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | S1 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 6.7 | 74 | .60 | .95 | .78 | 1.0 | .60 | .95 | .97 | .75 | .65 | 4/20 | 60m |
| Kimi K2.5 | 6.7 | 74 | .72 | .60 | .72 | 1.0 | .78 | .95 | .88 | .95 | .68 | 1/20 | 17m |
| Claude Sonnet | 6.5 | 73 | .60 | .75 | .75 | 1.0 | .80 | .95 | .93 | .80 | .55 | 0/20 | 36m |
| Gemini 3 Flash | 5.7 | 63 | .53 | .82 | .75 | .85 | .72 | .90 | .68 | .63 | .50 | 1/20 | 20m |
Combined scores (average of shoes + phone-holder) per dimension per model:
| Dimension | Kimi K2.5 | Sonnet | Flash | Pro |
|---|---|---|---|---|
| E1 Goal Completion | .68 | .57 | .57 | .64 |
| E2 One-Question | .62 | .81 | .81 | .94 |
| E3 Turn Efficiency | .74 | .77 | .77 | .79 |
| E4 No Hallucination | 1.0 | 1.0 | .93 | 1.0 |
| E5 Extractability | .77 | .72 | .74 | .66 |
| E6 Auto-Response | .94 | .95 | .93 | .95 |
| E7 Naturalness | .88 | .94 | .74 | .96 |
| E8 Rejection Recovery | .88 | .82 | .69 | .78 |
| E9 Customization | .63 | .50 | .52 | .61 |
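The combined figures are plain means of the two per-product tables, rounded to two decimals. Taking E7 (Naturalness) as an example, with the scores from the tables above:

```python
# E7 (Naturalness) scores from the two per-product tables above
per_product_e7 = {
    "Kimi K2.5":      (0.88, 0.88),
    "Claude Sonnet":  (0.95, 0.93),
    "Gemini 3 Flash": (0.80, 0.68),
    "Gemini 3.1 Pro": (0.95, 0.97),
}

# Combined score = mean of the two products
combined_e7 = {m: sum(v) / len(v) for m, v in per_product_e7.items()}
for model, score in combined_e7.items():
    print(f"{model}: {score:.2f}")  # matches the .88 / .94 / .74 / .96 row
```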
Sourcy has Gemini API credits but no Kimi or Sonnet credits, which changes the production calculus significantly: the cost argument for Kimi and Sonnet weakens, and Gemini 3.1 Pro is both the best performer and effectively free.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Credits? |
|---|---|---|---|
| Gemini 3 Flash | $0.50 | $3.00 | Covered |
| Gemini 3.1 Pro | $2.00 | $12.00 | Covered |
| Kimi K2.5 | $0.60 | $2.00 | Not covered |
| Claude Sonnet | $3.00 | $15.00 | Not covered |
Projected bot cost for 1,000 supplier conversations/month, based on the V7 prompt (~890 tokens), an average of 10.5 turns per conversation, and a growing context window (supplier-sim cost is separate, ~$0.01/mo with Kimi):
| Model | Bot Cost/mo | Quality | Credits? |
|---|---|---|---|
| Gemini 3.1 Pro | $0.04 | 76% | FREE |
| Gemini 3 Flash | $0.02 | 67% | FREE |
| Kimi K2.5 | $0.01 | 72% | Paid |
| Claude Sonnet | $0.05 | 71% | Paid |
| Scenario | Bot | Sim | Total | Coverage |
|---|---|---|---|---|
| Pro bot + Flash sim | $0.04 | $0.01 | $0.05 | 100% credits |
| Flash bot + Flash sim | $0.02 | $0.01 | $0.03 | 100% credits |
| Pro bot + Kimi sim | $0.04 | $0.01 | $0.05 | ~80% credits |
| Kimi bot + Kimi sim | $0.01 | $0.01 | $0.02 | 0% credits |
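The cost projections above can be sketched as a growing-context cost model. The ~890-token prompt, turn count, and per-1M-token prices come from this report; the per-message token sizes are assumptions, so treat the result as an order-of-magnitude estimate rather than a reproduction of the tables:

```python
def bot_cost_per_conversation(
    price_in: float, price_out: float,  # $ per 1M tokens
    prompt_tokens: int = 890,           # V7 prompt size (from the report)
    bot_turns: int = 10,                # ~10.5 avg turns, truncated here
    bot_msg_tokens: int = 80,           # assumed avg bot message size
    sup_msg_tokens: int = 60,           # assumed avg supplier message size
) -> float:
    """Bot-side API cost (USD) for one conversation with a growing context
    window: every turn resends the prompt plus the history so far."""
    input_tok = sum(prompt_tokens + t * (bot_msg_tokens + sup_msg_tokens)
                    for t in range(bot_turns))
    output_tok = bot_turns * bot_msg_tokens
    return (input_tok * price_in + output_tok * price_out) / 1e6

# Gemini 3.1 Pro at $2.00 / $12.00 per 1M tokens
print(f"${bot_cost_per_conversation(2.00, 12.00):.2f}")  # $0.04
```

Multiply by the monthly conversation volume for a monthly total. Note that the growing input context dominates the estimate, which is why the assumed message sizes matter more than the output price.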
| Code | Dimension | What it Measures |
|---|---|---|
| E1 | Goal Completion | Bot collected all 6 data points (MOQ, price, lead time, customization, packing, sample) |
| E2 | One-Question Discipline | Each bot message asks exactly one question — avoids overwhelming suppliers |
| E3 | Turn Efficiency | Completed in ≤8 bot messages with no wasted or repeated turns |
| E4 | No Hallucination | All information traceable to supplier’s actual words — nothing fabricated |
| E5 | Extractability | A complete supplier card can be filled from the conversation transcript |
| E6 | Auto-Response Handling | Bot extracts data from auto-replies, ignores pure platform greetings |
| E7 | Naturalness | Reads like a real sourcing agent on 1688 — tone, rhythm, cultural fit |
| E8 | Rejection Recovery | Re-asks once in different words, then moves on. No 3+ loops. |
| E9 | Customization | Collects method, custom MOQ, price impact, artwork requirements |
| S1 | Price Negotiation | Stretch: bot attempts any form of price discussion (not required for pass) |
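As an illustration of how E1/E5 can be checked mechanically — the six data points are from the rubric above, while the field names, partial-credit scoring, and function itself are hypothetical:

```python
# The six data points E1 requires (field names are illustrative)
REQUIRED_FIELDS = ("moq", "price", "lead_time", "customization", "packing", "sample")

def goal_completion(card: dict) -> float:
    """Fraction of required data points filled on the extracted supplier card.
    Partial credit is an assumption; the actual rubric may score differently."""
    filled = sum(1 for f in REQUIRED_FIELDS if card.get(f))
    return filled / len(REQUIRED_FIELDS)

card = {"moq": "500 pairs", "price": "$4.20/unit", "lead_time": "25 days",
        "customization": "logo print, MOQ 1000", "packing": "1 pc/polybag"}
print(goal_completion(card))  # 5 of 6 points filled -> 0.8333...
```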