Previous benchmarks had a confound: the same model played both the bot and the supplier simulator. Quality changes in the simulator could mask or inflate bot scores. This clean benchmark fixes that by separating three roles.
- Bot model: **varies**
- Supplier simulator quality: **constant**
- Scoring calibration: **constant**
Each model ran the same V7 prompt against the same 20 synthetic suppliers (all archetypes) across 2 products: shoes and phone-holder. That’s 40 conversations per model, 160 total, all scored on the same 9-dimension rubric.
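The run matrix works out as follows. The model names, products, and counts are from this report; the harness code itself is a hypothetical sketch, and the fixed simulator/judge models are placeholders:

```python
from itertools import product as cartesian

# Variable role: the bot model under test
BOT_MODELS = ["Gemini 3.1 Pro", "Kimi K2.5", "Claude Sonnet", "Gemini 3 Flash"]

# Held constant across all runs (the fix for the old confound)
SIM_MODEL = "<fixed supplier-simulator model>"
JUDGE_MODEL = "<fixed scoring model>"

PRODUCTS = ["shoes", "phone-holder"]
SUPPLIERS = [f"supplier_{i:02d}" for i in range(20)]  # 20 synthetic archetypes

runs = list(cartesian(BOT_MODELS, PRODUCTS, SUPPLIERS))
print(len(runs))                     # 160 conversations total
print(len(runs) // len(BOT_MODELS))  # 40 per bot model
```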
Average across both products (shoes + phone-holder), 40 conversations per model.
| Model | Avg Score | Time / Product |
|---|---|---|
| Gemini 3.1 Pro | 6.8/9 | ~55m |
| Kimi K2.5 | 6.5/9 | ~17m |
| Claude Sonnet | 6.4/9 | ~35m |
| Gemini 3 Flash | 6.0/9 | ~19m |
Per-product detail — shoes:
| Model | Score | % | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | S1 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 6.9 | 77 | .68 | .93 | .80 | 1.0 | .72 | .95 | .95 | .80 | .57 | 4/20 | 51m |
| Kimi K2.5 | 6.3 | 70 | .63 | .63 | .75 | 1.0 | .75 | .93 | .88 | .80 | .57 | 3/20 | 16m |
| Claude Sonnet | 6.3 | 70 | .53 | .87 | .79 | 1.0 | .63 | .95 | .95 | .84 | .45 | 1/19 | 33m |
| Gemini 3 Flash | 6.3 | 71 | .60 | .80 | .78 | 1.0 | .75 | .95 | .80 | .75 | .53 | 1/20 | 18m |
Per-product detail — phone-holder:
| Model | Score | % | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | S1 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 6.7 | 74 | .60 | .95 | .78 | 1.0 | .60 | .95 | .97 | .75 | .65 | 4/20 | 60m |
| Kimi K2.5 | 6.7 | 74 | .72 | .60 | .72 | 1.0 | .78 | .95 | .88 | .95 | .68 | 1/20 | 17m |
| Claude Sonnet | 6.5 | 73 | .60 | .75 | .75 | 1.0 | .80 | .95 | .93 | .80 | .55 | 0/20 | 36m |
| Gemini 3 Flash | 5.7 | 63 | .53 | .82 | .75 | .85 | .72 | .90 | .68 | .63 | .50 | 1/20 | 20m |
Combined scores (average of shoes + phone-holder) per dimension per model:
| Dimension | Kimi K2.5 | Sonnet | Flash | Pro |
|---|---|---|---|---|
| E1 Goal Completion | .68 | .57 | .57 | .64 |
| E2 One-Question | .62 | .81 | .81 | .94 |
| E3 Turn Efficiency | .74 | .77 | .77 | .79 |
| E4 No Hallucination | 1.0 | 1.0 | .93 | 1.0 |
| E5 Extractability | .77 | .72 | .74 | .66 |
| E6 Auto-Response | .94 | .95 | .93 | .95 |
| E7 Naturalness | .88 | .94 | .74 | .96 |
| E8 Rejection Recovery | .88 | .82 | .69 | .78 |
| E9 Customization | .63 | .50 | .52 | .61 |
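The combined figures are plain means of the two per-product tables, rounded to two decimals. Taking E7 (Naturalness) as an example, with the scores from the tables above:

```python
# E7 (Naturalness) scores from the two per-product tables above
per_product_e7 = {
    "Kimi K2.5":      (0.88, 0.88),
    "Claude Sonnet":  (0.95, 0.93),
    "Gemini 3 Flash": (0.80, 0.68),
    "Gemini 3.1 Pro": (0.95, 0.97),
}

# Combined score = mean of the two products
combined_e7 = {m: sum(v) / len(v) for m, v in per_product_e7.items()}
for model, score in combined_e7.items():
    print(f"{model}: {score:.2f}")  # matches the .88 / .94 / .74 / .96 row
```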
Sourcy has Gemini API credits but no Kimi or Sonnet credits, which changes the production calculus significantly: the cost argument for Kimi and Sonnet weakens, and Gemini 3.1 Pro is both the best performer and effectively free.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Credits? |
|---|---|---|---|
| Gemini 3 Flash | $0.50 | $3.00 | Covered |
| Gemini 3.1 Pro | $2.00 | $12.00 | Covered |
| Kimi K2.5 | $0.60 | $2.00 | Not covered |
| Claude Sonnet | $3.00 | $15.00 | Not covered |
Projected bot cost for 1,000 supplier conversations/month, based on the V7 prompt (~890 tokens), an average of 10.5 turns per conversation, and a growing context window (supplier-sim cost is separate, ~$0.01/mo with Kimi):
| Model | Bot Cost/mo | Quality | Credits? |
|---|---|---|---|
| Gemini 3.1 Pro | $0.04 | 76% | FREE |
| Gemini 3 Flash | $0.02 | 67% | FREE |
| Kimi K2.5 | $0.01 | 72% | Paid |
| Claude Sonnet | $0.05 | 71% | Paid |
| Scenario | Bot | Sim | Total | Coverage |
|---|---|---|---|---|
| Pro bot + Flash sim | $0.04 | $0.01 | $0.05 | 100% credits |
| Flash bot + Flash sim | $0.02 | $0.01 | $0.03 | 100% credits |
| Pro bot + Kimi sim | $0.04 | $0.01 | $0.05 | ~80% credits |
| Kimi bot + Kimi sim | $0.01 | $0.01 | $0.02 | 0% credits |
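The cost projections above can be sketched as a growing-context cost model. The ~890-token prompt, turn count, and per-1M-token prices come from this report; the per-message token sizes are assumptions, so treat the result as an order-of-magnitude estimate rather than a reproduction of the tables:

```python
def bot_cost_per_conversation(
    price_in: float, price_out: float,  # $ per 1M tokens
    prompt_tokens: int = 890,           # V7 prompt size (from the report)
    bot_turns: int = 10,                # ~10.5 avg turns, truncated here
    bot_msg_tokens: int = 80,           # assumed avg bot message size
    sup_msg_tokens: int = 60,           # assumed avg supplier message size
) -> float:
    """Bot-side API cost (USD) for one conversation with a growing context
    window: every turn resends the prompt plus the history so far."""
    input_tok = sum(prompt_tokens + t * (bot_msg_tokens + sup_msg_tokens)
                    for t in range(bot_turns))
    output_tok = bot_turns * bot_msg_tokens
    return (input_tok * price_in + output_tok * price_out) / 1e6

# Gemini 3.1 Pro at $2.00 / $12.00 per 1M tokens
print(f"${bot_cost_per_conversation(2.00, 12.00):.2f}")  # $0.04
```

Multiply by the monthly conversation volume for a monthly total. Note that the growing input context dominates the estimate, which is why the assumed message sizes matter more than the output price.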
| Code | Dimension | What it Measures |
|---|---|---|
| E1 | Goal Completion | Bot collected all 6 data points (MOQ, price, lead time, customization, packing, sample) |
| E2 | One-Question Discipline | Each bot message asks exactly one question — avoids overwhelming suppliers |
| E3 | Turn Efficiency | Completed in ≤8 bot messages with no wasted or repeated turns |
| E4 | No Hallucination | All information traceable to supplier’s actual words — nothing fabricated |
| E5 | Extractability | A complete supplier card can be filled from the conversation transcript |
| E6 | Auto-Response Handling | Bot extracts data from auto-replies, ignores pure platform greetings |
| E7 | Naturalness | Reads like a real sourcing agent on 1688 — tone, rhythm, cultural fit |
| E8 | Rejection Recovery | Re-asks once in different words, then moves on. No 3+ loops. |
| E9 | Customization | Collects method, custom MOQ, price impact, artwork requirements |
| S1 | Price Negotiation | Stretch: bot attempts any form of price discussion (not required for pass) |
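As an illustration of how E1/E5 can be checked mechanically — the six data points are from the rubric above, while the field names, partial-credit scoring, and function itself are hypothetical:

```python
# The six data points E1 requires (field names are illustrative)
REQUIRED_FIELDS = ("moq", "price", "lead_time", "customization", "packing", "sample")

def goal_completion(card: dict) -> float:
    """Fraction of required data points filled on the extracted supplier card.
    Partial credit is an assumption; the actual rubric may score differently."""
    filled = sum(1 for f in REQUIRED_FIELDS if card.get(f))
    return filled / len(REQUIRED_FIELDS)

card = {"moq": "500 pairs", "price": "$4.20/unit", "lead_time": "25 days",
        "customization": "logo print, MOQ 1000", "packing": "1 pc/polybag"}
print(goal_completion(card))  # 5 of 6 points filled -> 0.8333...
```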