Supplier Bot V7 — Clean Benchmark

4-model comparison with controlled methodology
16 March 2026

I. Methodology — 3-Layer Separation

Previous benchmarks had a confound: the same model played both the bot and the supplier simulator. Quality changes in the simulator could mask or inflate bot scores. This clean benchmark fixes that by separating three roles.

| Role                 | Model          |
|----------------------|----------------|
| Bot (varies)         | 4 models tested |
| Supplier Sim (fixed) | Kimi K2.5      |
| Judge (fixed)        | Claude Sonnet  |

Bot model varies • supplier simulator quality is constant • scoring calibration is constant

Each model ran the same V7 prompt against the same 20 synthetic suppliers (all archetypes) across 2 products: shoes and phone-holder. That’s 40 conversations per model, 160 total, all scored on the same 9-dimension rubric.
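The three-role separation can be sketched as a simple harness loop. This is an illustrative sketch only — the function names, model identifiers, and stubbed scoring are assumptions, not the actual benchmark code:

```python
# Minimal sketch of the 3-layer benchmark harness.
# All names are illustrative; real prompts and API clients are not shown here.

BOT_MODELS = ["gemini-3.1-pro", "kimi-k2.5", "claude-sonnet", "gemini-3-flash"]
SIM_MODEL = "kimi-k2.5"        # fixed: supplier simulator
JUDGE_MODEL = "claude-sonnet"  # fixed: 9-dimension rubric judge
PRODUCTS = ["shoes", "phone-holder"]
SUPPLIERS = [f"supplier_{i}" for i in range(20)]  # 20 synthetic archetypes

def call_model(model, prompt):
    """Placeholder for a real chat-completion call."""
    return f"<{model} reply>"

def run_conversation(bot_model, supplier, product):
    """Alternate bot turns (variable model) with supplier turns (fixed sim)."""
    transcript = []
    for turn in range(8):  # bot is allowed at most 8 messages (see E3)
        transcript.append(call_model(bot_model, f"bot turn {turn}"))
        transcript.append(call_model(SIM_MODEL, f"{supplier} selling {product}"))
    return transcript

def judge(transcript):
    """Fixed judge scores each conversation on E1-E9 (0-1 each); stubbed here."""
    _ = call_model(JUDGE_MODEL, "\n".join(transcript))
    return {f"E{i}": 1.0 for i in range(1, 10)}

results = [
    (bot, product, supplier, judge(run_conversation(bot, supplier, product)))
    for bot in BOT_MODELS
    for product in PRODUCTS
    for supplier in SUPPLIERS
]
# 4 bots x 2 products x 20 suppliers = 160 scored conversations
```

Because the simulator and judge models never vary, any score difference between rows of the results tables is attributable to the bot model alone.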


II. Combined Results

Average across both products (shoes + phone-holder), 40 conversations per model.

| Model           | Avg Score | Performance | Time / Product |
|-----------------|-----------|-------------|----------------|
| Gemini 3.1 Pro  | 6.8/9     | 76%         | ~55m           |
| Kimi K2.5       | 6.5/9     | 72%         | ~17m           |
| Claude Sonnet   | 6.4/9     | 71%         | ~35m           |
| Gemini 3 Flash  | 6.0/9     | 67%         | ~19m           |

III. Per-Product Results

Shoes (20 suppliers, all archetypes)

| Model          | Score | %  | E1  | E2  | E3  | E4  | E5  | E6  | E7  | E8  | E9  | S1   | Time |
|----------------|-------|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|
| Gemini 3.1 Pro | 6.9   | 77 | .68 | .93 | .80 | 1.0 | .72 | .95 | .95 | .80 | .57 | 4/20 | 51m  |
| Kimi K2.5      | 6.3   | 70 | .63 | .63 | .75 | 1.0 | .75 | .93 | .88 | .80 | .57 | 3/20 | 16m  |
| Claude Sonnet  | 6.3   | 70 | .53 | .87 | .79 | 1.0 | .63 | .95 | .95 | .84 | .45 | 1/19 | 33m  |
| Gemini 3 Flash | 6.3   | 71 | .60 | .80 | .78 | 1.0 | .75 | .95 | .80 | .75 | .53 | 1/20 | 18m  |

Phone-Holder (20 suppliers, all archetypes)

| Model          | Score | %  | E1  | E2  | E3  | E4  | E5  | E6  | E7  | E8  | E9  | S1   | Time |
|----------------|-------|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|
| Gemini 3.1 Pro | 6.7   | 74 | .60 | .95 | .78 | 1.0 | .60 | .95 | .97 | .75 | .65 | 4/20 | 60m  |
| Kimi K2.5      | 6.7   | 74 | .72 | .60 | .72 | 1.0 | .78 | .95 | .88 | .95 | .68 | 1/20 | 17m  |
| Claude Sonnet  | 6.5   | 73 | .60 | .75 | .75 | 1.0 | .80 | .95 | .93 | .80 | .55 | 0/20 | 36m  |
| Gemini 3 Flash | 5.7   | 63 | .53 | .82 | .75 | .85 | .72 | .90 | .68 | .63 | .50 | 1/20 | 20m  |

IV. Dimension Heatmap

Combined scores (average of shoes + phone-holder) per dimension per model. Score bands: ≥0.90 / 0.70–0.89 / 0.50–0.69 / <0.50.

| Dimension              | Kimi K2.5 | Sonnet | Flash | Pro |
|------------------------|-----------|--------|-------|-----|
| E1 Goal Completion     | .68       | .57    | .57   | .64 |
| E2 One-Question        | .62       | .81    | .81   | .94 |
| E3 Turn Efficiency     | .74       | .77    | .77   | .79 |
| E4 No Hallucination    | 1.0       | 1.0    | .93   | 1.0 |
| E5 Extractability      | .77       | .72    | .74   | .66 |
| E6 Auto-Response       | .94       | .95    | .93   | .95 |
| E7 Naturalness         | .88       | .94    | .74   | .96 |
| E8 Rejection Recovery  | .88       | .82    | .69   | .78 |
| E9 Customization       | .63       | .50    | .52   | .61 |
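Each heatmap cell is simply the mean of the two per-product dimension scores from Section III. A minimal sketch, shown here on Gemini 3.1 Pro's E1/E2 rows:

```python
# Each combined heatmap cell is the mean of the two per-product scores.
def combine(shoes: dict, phone_holder: dict) -> dict:
    return {d: round((shoes[d] + phone_holder[d]) / 2, 2) for d in shoes}

# Gemini 3.1 Pro's E1/E2 values from the Section III tables:
shoes = {"E1": 0.68, "E2": 0.93}
phone_holder = {"E1": 0.60, "E2": 0.95}
print(combine(shoes, phone_holder))  # {'E1': 0.64, 'E2': 0.94}
```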

V. Key Findings

  1. Gemini Pro is the clear winner (76% avg), leading on naturalness (.96) and one-question discipline (.94). It reads most like a real sourcing agent.
  2. Kimi is the best value — 72% at roughly 3× Pro’s speed (~17m vs ~55m per product). It matches Pro on phone-holder (74% vs 74%) and costs far less per conversation.
  3. No hallucination is nearly universal (combined E4 ≥ 0.93 for every model), thanks to the V7 prompt’s grounding instructions.
  4. Customization (E9) is the weakest dimension across all models — top score is just .63. Top priority for V8 prompt revision.
  5. Pro negotiates most aggressively (S1: 4/20 on both products vs others’ 0–3). It proactively seeks discounts without being prompted.
  6. One-question discipline varies widely — Kimi .62 vs Pro .94. Kimi front-loads questions; Pro asks one at a time.

VI. Recommendation

Given Sourcy’s Gemini credits, the cost argument for Kimi/Sonnet weakens. Gemini Pro is both the best performer and free.

| Recommendation       | Choice          | Notes                                         |
|----------------------|-----------------|-----------------------------------------------|
| Recommended default  | Gemini 3.1 Pro  | Best quality (76%) + free with credits        |
| Speed fallback       | Gemini 3 Flash  | 3× faster, free with credits, 67% quality     |
| If credits exhausted | Kimi K2.5       | 72% quality, fastest, $0.01/mo at 1K vol      |
| V8 priorities        | E9 • E1 • E2    | Customization, goal completion, one-question  |
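The recommendation amounts to a simple routing policy: Pro by default, Flash when latency matters, Kimi once credits run out. A minimal sketch — the `credits_left` and `latency_sensitive` signals are hypothetical inputs, not part of the report:

```python
def pick_bot_model(credits_left: bool, latency_sensitive: bool) -> str:
    """Routing policy implied by the recommendation above (hypothetical signals)."""
    if not credits_left:
        return "kimi-k2.5"       # 72% quality, fastest, cheapest without credits
    if latency_sensitive:
        return "gemini-3-flash"  # ~3x faster than Pro, still covered by credits
    return "gemini-3.1-pro"      # best quality (76%), covered by credits
```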

VII. Cost Projection

Sourcy has granted Gemini API credits but no Kimi or Sonnet credits. This changes the production calculus significantly.

API Pricing (per 1M tokens, March 2026)

| Model          | Input | Output | Credits?    |
|----------------|-------|--------|-------------|
| Gemini 3 Flash | $0.50 | $3.00  | Covered     |
| Gemini 3.1 Pro | $2.00 | $12.00 | Covered     |
| Kimi K2.5      | $0.60 | $2.00  | Not covered |
| Claude Sonnet  | $3.00 | $15.00 | Not covered |

Per-Conversation Token Estimate

Based on the V7 prompt (~890 tokens), an average of 10.5 turns/conversation, and a context window that grows each turn:

| Estimate       | Tokens | Breakdown                   |
|----------------|--------|-----------------------------|
| Input / convo  | ~22K   | 12K bot + 10K supplier sim  |
| Output / convo | ~1.5K  | 1K bot + 500 supplier sim   |

Monthly Cost at Scale

Projected bot cost for 1,000 supplier conversations/month (supplier sim cost separate, ~$0.01/mo with Kimi):

| Model          | Bot Cost/mo | Quality | Credits? |
|----------------|-------------|---------|----------|
| Gemini 3.1 Pro | $0.04       | 76%     | Free     |
| Gemini 3 Flash | $0.02       | 67%     | Free     |
| Kimi K2.5      | $0.01       | 72%     | Paid     |
| Claude Sonnet  | $0.05       | 71%     | Paid     |

At 10K convos/mo (10× scale): Gemini Pro bot + Flash sim = ~$0.46/mo — still effectively zero, still 100% covered by credits.

Total Monthly Scenarios

| Scenario             | Bot   | Sim   | Total | Coverage     |
|----------------------|-------|-------|-------|--------------|
| Pro bot + Flash sim  | $0.04 | $0.01 | $0.05 | 100% credits |
| Flash bot + Flash sim| $0.02 | $0.01 | $0.03 | 100% credits |
| Pro bot + Kimi sim   | $0.04 | $0.01 | $0.05 | ~80% credits |
| Kimi bot + Kimi sim  | $0.01 | $0.01 | $0.02 | 0% credits   |
Bottom line: At 1K–10K convos/mo, LLM cost is effectively zero for all models (<$1/mo). Model choice should be driven by quality, not cost. Gemini Pro is both the best performer and free with credits — making Kimi’s speed advantage the only counterargument.

VIII. Eval Dimensions Reference

| Code | Dimension               | What it Measures                                                                        |
|------|-------------------------|-----------------------------------------------------------------------------------------|
| E1   | Goal Completion         | Bot collected all 6 data points (MOQ, price, lead time, customization, packing, sample)  |
| E2   | One-Question Discipline | Each bot message asks exactly one question — avoids overwhelming suppliers               |
| E3   | Turn Efficiency         | Completed in ≤8 bot messages with no wasted or repeated turns                            |
| E4   | No Hallucination        | All information traceable to supplier’s actual words — nothing fabricated                |
| E5   | Extractability          | A complete supplier card can be filled from the conversation transcript                  |
| E6   | Auto-Response Handling  | Bot extracts data from auto-replies, ignores pure platform greetings                     |
| E7   | Naturalness             | Reads like a real sourcing agent on 1688 — tone, rhythm, cultural fit                    |
| E8   | Rejection Recovery      | Re-asks once in different words, then moves on. No 3+ loops.                             |
| S1   | Price Negotiation       | Stretch: bot attempts any form of price discussion (not required for pass)               |
| E9   | Customization           | Collects method, custom MOQ, price impact, artwork requirements                          |
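Under this rubric, a conversation’s headline score reads as the sum of its nine E-dimension scores out of 9, with S1 tracked separately as a stretch metric. A sketch, assuming the judge scores each dimension on a 0–1 scale (the example values are illustrative, not an actual judged conversation):

```python
def headline_score(dims: dict) -> tuple:
    """Sum E1-E9 (each judged 0-1) into an X/9 score; S1 is reported separately."""
    core = [v for k, v in dims.items() if k.startswith("E")]
    total = sum(core)
    return total, round(total / len(core) * 100)

# Illustrative per-dimension judgments for one conversation:
dims = {"E1": 1, "E2": 1, "E3": 0.5, "E4": 1, "E5": 1,
        "E6": 1, "E7": 0.5, "E8": 1, "E9": 0, "S1": 1}
print(headline_score(dims))  # (7.0, 78) -> 7.0/9 ~ 78%
```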