The supplier bot has been re-evaluated under a new framework — the Dynamic Goal Series (dyn-v1). This series replaces the fixed 6-goal checklist with a 17-type goal taxonomy dynamically generated per sourcing request, uses Gemini 3.1 Pro as the bot model, and Opus 4.6 as an independent judge. For how goals are generated, see the companion report: Goal Generation v1.
Across 20 simulated conversations spanning 8 product categories, the bot scored 8.3/9 (92%) on responsive suppliers whose conversations ran to completion (n=12). Five of those twelve conversations achieved a perfect 9/9. Core strengths are near-zero hallucination (E4: 0.96), perfect turn efficiency (E3: 1.00), and a natural conversational tone (E7: 1.00). The primary area for improvement is one-question-per-message discipline (E2: 0.75).
| Change | Fixed-Goal Series (V1–V7) | Dynamic-Goal Series (dyn-v1) |
|---|---|---|
| Bot model | Kimi K2.5 | Gemini 3.1 Pro |
| Judge model | Kimi K2.5 (same as bot) | Opus 4.6 (independent) |
| Goal definition | 6 hardcoded data points | 8–11 dynamic goals from 17-type taxonomy |
| Goal injection | Appended as afterthought | Integrated via {{GOALS}} placeholder |
| Turn budget | Fixed: max 10 messages | Scaled: min(goal_count + 4, 14) |
| E1 scoring | All 6 covered = pass | “Bot must ask” rule + T1/T2 tier weighting |
| E3 thresholds | Fixed: ≤8 pass, 13+ fail | Scaled: ≤G+2 pass, >G+5 fail |
| Supplier sim | Kimi K2.5 | Kimi K2.5 (unchanged) |
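The goal injection and turn budget rows above can be illustrated with a minimal Python sketch. Only the `{{GOALS}}` placeholder and the formula `min(goal_count + 4, 14)` come from the table; the function names, the numbered-list rendering of goals, and the sample template are hypothetical and are not taken from the actual pipeline code.

```python
GOALS_PLACEHOLDER = "{{GOALS}}"


def turn_budget(goal_count: int) -> int:
    """Scaled turn budget from the comparison table: min(goal_count + 4, 14)."""
    return min(goal_count + 4, 14)


def render_prompt(template: str, goals: list[str]) -> str:
    """Substitute the goals into the prompt template.

    Assumption: goals are rendered as a numbered list. The real prompt
    may format them differently.
    """
    if GOALS_PLACEHOLDER not in template:
        raise ValueError("template has no {{GOALS}} placeholder")
    goal_lines = "\n".join(f"{i}. {goal}" for i, goal in enumerate(goals, start=1))
    return template.replace(GOALS_PLACEHOLDER, goal_lines)


if __name__ == "__main__":
    # Hypothetical template text, for illustration only.
    template = "Collect the following from the supplier:\n{{GOALS}}"
    goals = ["MOQ", "Unit Price", "Lead Time"]
    print(render_prompt(template, goals))
    print("turn budget:", turn_budget(len(goals)))
```

Under this formula the cap of 14 is reached at 10 goals, so the 8 to 11 goal range used in dyn-v1 yields budgets of 12, 13, 14, and 14 messages.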
Below are four representative conversations from the dyn-v1 evaluation. Read them, form your own judgment, then check the scores and annotations. These are raw transcripts — every message is reproduced verbatim from the simulation.
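Conversation A • Product: Catholic resin statue, 6-inch • Supplier: 泉州圣艺树脂工艺品厂 • 9 goals (6 T1 + 3 T2) • 9/9
| Tier | Goal | Question |
|---|---|---|
| T1 | MOQ | What is the minimum order quantity for this 6-inch resin statue? |
| T1 | Unit Price | What is the unit price at a quantity of 100? Is there tiered pricing? |
| T1 | Customization | Can you make a custom mold based on our design? |
| T1 | Lead Time | What is the production lead time for a custom order of 100 pieces? |
| T1 | Sample Terms | Can you make a sample? What is the sample fee? |
| T1 | Packing Specs | How is the product packed? What are the outer carton dimensions and weight? |
| T2 | Tooling/Mold Fees | How much is the mold fee? |
| T2 | Artwork/Design Files | What file format do you need: AI, CDR, or 3D? |
| T2 | Color/Finish Options | Which surface finishes are available: hand-painted, antique, gold-plated? |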
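Conversation B • Product: Men's indoor soccer shoes • Supplier: 温州市龙腾体育用品厂 • 10 goals (6 T1 + 4 T2) • 6.5/9
| Tier | Goal | Question |
|---|---|---|
| T1 | MOQ | Can you accept a first order of 500 pairs? |
| T1 | Unit Price | What is the unit price for 500 pairs? Our target price is 60-90 yuan. |
| T1 | Customization | Can you add a custom text logo on the tongue and the side? |
| T1 | Lead Time | How many days is the production lead time for 500 pairs with a custom logo? |
| T1 | Sample Terms | Can you provide a pre-production sample first? What is the sampling fee? |
| T1 | Packing Specs | How many pairs per carton? What are the outer carton dimensions and gross weight? |
| T2 | Artwork/Design Files | What file format do you need for the custom logo? |
| T2 | Material Spec | What material is the breathable mesh upper? Is the sole rubber? |
| T2 | Size Variants | Are sizes 39-44 all available? What is the size ratio? |
| T2 | Tooling/Mold Fees | Does a custom logo require a mold? What is the plate fee? |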
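Conversation C • Product: Baby play gym with play mat • Supplier: 金华市悦婴母婴用品厂 • 9 goals (6 T1 + 3 T2) • 7.5/9
| Tier | Goal | Question |
|---|---|---|
| T1 | MOQ | Can you accept a first order of 200 sets? |
| T1 | Unit Price | What is the price for 200 sets? Is there tiered pricing? |
| T1 | Lead Time | How many days from order to shipment for 200 sets? |
| T1 | Customization | Can the colors and hanging-toy styles be customized? |
| T1 | Packing Specs | Can you vacuum-pack? What are the outer carton dimensions and gross weight? |
| T1 | Sample Terms | Can you provide a sample first? What are the sample fee and sampling lead time? |
| T2 | Certification | Do you hold CPC or EN71 baby-toy safety certification? |
| T2 | Material Spec | What are the density and thickness of the XPE mat? What wood is the frame made of? |
| T2 | Color/Finish Options | Which colors does the wooden frame come in? Can it be matched to a color card? |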
Conversation D • Product: Catholic resin statue, 6-inch • Supplier: 莆田市艺雕工艺品有限公司 • 9 goals (same as Conversation A) • 4.5/9
This conversation uses the same sourcing request (SR) and the same 9 goals as Conversation A: same product, different supplier. The only variable that changes is the supplier.
Overall: 7.1/9 (79%, n=20). Non-responsive, evasive, and AI-bot supplier archetypes drag the overall score down; these are adversarial scenarios in which partial failure is expected. Performance on responsive suppliers is therefore a more meaningful indicator of bot quality.
Responsive suppliers (completed conversations): 8.3/9 (92%, n=12).
| Product | Supplier | Score | E1 | T1 | T2 | E2 | Msgs |
|---|---|---|---|---|---|---|---|
| catholic-statue | 泉州圣艺树脂工艺品厂 | 9/9 | Pass | 6/6 | 3/3 | Pass | 7 |
| oat-jars | 义乌市晶美玻璃制品有限公司 | 8.5/9 | Pass | 6/6 | 3/4 | Partial | 6 |
| paper-cups | 杭州印尚包装有限公司 | 8.5/9 | Pass | 6/6 | 4/5 | Partial | 10 |
| phone-holder | 深圳市锐驰车品有限公司 | 9/9 | Pass | 6/6 | 2/2 | Pass | 6 |
| phone-holder | 义乌市鼎盛汽车用品厂 | 9/9 | Pass | 6/6 | 2/2 | Pass | 9 |
| phone-holder | 广州创驰电子科技有限公司 | 9/9 | Pass | 6/6 | 2/2 | Pass | 7 |
| play-gym | 义乌乐萌婴童用品有限公司 | 7.5/9 | Pass | 6/6 | 2/3 | Pass | 6 |
| pump-bottle | 广州美妆包装制品有限公司 | 9/9 | Pass | 6/6 | 3/3 | Pass | 7 |
| shoes | 晋江飞越运动鞋业有限公司 | 8.5/9 | Pass | 6/6 | 4/4 | Partial | 8 |
| shoes | 莆田市恒泰鞋业科技有限公司 | 8.5/9 | Pass | 6/6 | 4/4 | Partial | 10 |
| shoes | 温州市龙腾体育用品厂 | 6.5/9 | Partial | 6/6 | 2/4 | Pass | 7 |
| soap | 沧州路逍遥塑业有限公司 | 8.5/9 | Pass | 6/6 | 5/5 | Partial | 10 |
Of the 12 responsive, completed conversations, 11 achieved E1 Pass (all Tier 1 goals plus ≥75% of Tier 2 goals), and all 12 covered every Tier 1 goal (6/6). The one E1 Partial (温州市龙腾体育用品厂) covered all Tier 1 goals but only 2 of 4 Tier 2 goals.
Footnote, excluded responsive conversations: Two additional conversations with responsive-archetype suppliers did not complete normally and are excluded from the n=12 set: (1) oat-jars (wechat_redirect archetype), which scored 8.5/9 but ended in a WeChat redirect; (2) soap (wechat_redirect archetype), which scored 5.5/9 after an early redirect. Both are excluded because a truncated conversation does not show the bot's full capability.
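The E1 pass rule stated above (all Tier 1 goals plus at least 75% of Tier 2 goals, with the "bot must ask" condition) can be written out as a small Python sketch. This is one reading of the rule, not the judge's actual logic, which is carried out by an LLM against the rubric. The function names, the `info_obtained` flag, and the treatment of an empty Tier 2 set are assumptions made for illustration.

```python
def goal_achieved(info_obtained: bool, bot_asked: bool, bot_acknowledged: bool) -> bool:
    """'Bot must ask' rule, as read from the rubric text.

    Assumption: information the supplier volunteered counts only if the
    bot explicitly asked for it or acknowledged it.
    """
    return info_obtained and (bot_asked or bot_acknowledged)


def e1_passes(t1_hit: int, t1_total: int, t2_hit: int, t2_total: int,
              t2_threshold: float = 0.75) -> bool:
    """E1 Pass: every Tier 1 goal achieved and at least 75% of Tier 2 goals."""
    # Assumption: a request with no Tier 2 goals trivially meets the Tier 2 bar.
    t2_ratio = 1.0 if t2_total == 0 else t2_hit / t2_total
    return t1_hit == t1_total and t2_ratio >= t2_threshold


if __name__ == "__main__":
    # All Tier 1 goals but only 2 of 4 Tier 2 goals does not reach E1 Pass.
    print(e1_passes(6, 6, 2, 4))
    # All Tier 1 goals and 3 of 4 Tier 2 goals sits exactly on the threshold.
    print(e1_passes(6, 6, 3, 4))
```

The sketch only separates Pass from non-Pass; how the rubric splits the remainder into Partial and Fail is not stated here and is left out.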
| Risk | Severity | Detail |
|---|---|---|
| Simulated suppliers | HIGH | All 20 supplier conversations are LLM-simulated. No real 1688 supplier data has been tested. Real suppliers may behave unpredictably — slow responses, off-topic messages, dialect variations, platform interruptions. |
| Sample size | HIGH | n=20 total, n=12 responsive + completed. Statistically insufficient for production confidence intervals. Results indicate direction, not production readiness. |
| Goal generation not human-validated | MED | Goals are LLM-generated from conversation context. While the 17-type taxonomy constrains output, individual goal assignments have not been reviewed by a sourcing expert. |
| Supplier sim model consistency | MED | Supplier behavior is generated by Kimi K2.5 while the bot uses Gemini 3.1 Pro. Different models may create subtly cooperative dynamics not present in real conversations. |
| Single prompt version | LOW | This is dyn-v1 — the first prompt in the series. No iteration has been performed yet. E2 (one-question discipline) at 0.75 suggests room for prompt-level improvement. |
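To make the sample-size risk concrete, the sketch below computes a 95% Wilson score interval for a pass rate observed on a small sample, using the 11 of 12 E1 Pass count as the example. The Wilson interval is a standard method chosen here for illustration; it is not part of the evaluation pipeline, and the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(11, 12)
    print(f"E1 pass rate 11/12: 95% interval {low:.2f} to {high:.2f}")
```

With n=12 the interval spans roughly 0.65 to 0.99, which is why the results are better read as directional than as a production-grade estimate.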
| Code | Dimension | What it Measures |
|---|---|---|
| E1 | Goal Completion | Bot collected all Tier 1 goals + ≥75% Tier 2 goals. “Bot must ask” rule: a goal is achieved only if the bot explicitly asked or acknowledged supplier-volunteered info. |
| E2 | One-Question Discipline | Each bot message asks exactly one question — avoids overwhelming suppliers. |
| E3 | Turn Efficiency | Completed in ≤G+2 bot messages (where G = goal count) with no wasted turns. Scaled to goal count. |
| E4 | No Hallucination | All information traceable to supplier’s actual words — nothing fabricated. |
| E5 | Extractability | A complete supplier card can be filled from the conversation transcript. |
| E6 | Auto-Response Handling | Bot extracts data from auto-replies, ignores pure platform greetings. |
| E7 | Naturalness | Reads like a real sourcing agent on 1688 — tone, rhythm, cultural fit. |
| E8 | Rejection Recovery | Re-asks once in different words, then moves on. No 3+ loops. Explores alternatives when offered. |
| E9 | Customization | Collects method, custom MOQ, price impact, artwork requirements. |
| S1 | Price Negotiation | Stretch: bot attempts any form of price discussion (not required for pass). |
Bot model: Gemini 3.1 Pro (gemini-3.1-pro-preview via Google AI API). Supplier sim: Kimi K2.5 (kimi-k2.5 via Kimi API). Judge: Opus 4.6 (claude-opus-4-6 via Cursor CLI). Goal generation: Gemini 3.1 Pro against the 17-type taxonomy.
Rubric: eval-rubric-dyn.md, with 9 core dimensions (E1–E9) plus 1 stretch dimension (S1). E1 uses a “bot must ask” rule: a goal is achieved only if the bot explicitly asked or acknowledged supplier-volunteered information. E3 uses thresholds scaled to the goal count G: at most G + 2 bot messages is a pass, and more than G + 5 is a fail.
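The scaled E3 thresholds can be sketched in Python as follows. The pass and fail cut-offs (at most G+2, more than G+5) come from the rubric description; treating everything in between as "partial" is an assumption, and the "no wasted turns" condition is not modeled because it requires a judgment on message content. The function name is hypothetical.

```python
def e3_verdict(bot_messages: int, goal_count: int) -> str:
    """Classify turn efficiency from the number of bot messages.

    pass:    bot_messages <= goal_count + 2
    fail:    bot_messages >  goal_count + 5
    partial: anything in between (assumed, not stated explicitly)
    """
    if bot_messages <= goal_count + 2:
        return "pass"
    if bot_messages <= goal_count + 5:
        return "partial"
    return "fail"


if __name__ == "__main__":
    # For a 9-goal request, show where the verdict changes.
    for messages in (7, 11, 12, 14, 15):
        print(messages, e3_verdict(messages, goal_count=9))
```

For a 9-goal request this gives a pass up to 11 bot messages, a partial from 12 to 14, and a fail from 15 onward.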
Corpus: 8 product categories (catholic-statue, oat-jars, paper-cups, phone-holder, play-gym, pump-bottle, shoes, soap). 3 suppliers per product for a total of 20 conversations, with a mix of responsive, non-responsive, evasive, and AI-bot archetypes. Total run time: ~51 minutes.
The {{GOALS}} placeholder architecture is production-ready. Dynamic goal injection can be connected to the chatServer API for real supplier conversations.

| Artifact | Location | Description |
|---|---|---|
| dyn-v1 prompt | benchmark/prompts/dyn-v1.md | Dynamic goal prompt with {{GOALS}} placeholder |
| Eval rubric (dyn) | benchmark/judge/eval-rubric-dyn.md | 9 dimensions, “bot must ask” rule, scaled thresholds |
| Goal taxonomy | pipeline/goal-types.json | 17-type tiered goal definitions |
| Evaluation results | pipeline/output/tune-dyn-v1-gemini-pro-clean-2026-03-18.json | Full conversation transcripts + Opus judge scores |
| Goal generator | pipeline/goal-generator.js | Taxonomy-aware goal generation from SR context |
| Conversation engine | pipeline/conversation-engine.js | Bot-supplier simulation with {{GOALS}} injection |
| This report | output/sourcy_supplier_bot_dyn_v1.html | dyn-v1 baseline evaluation (this document) |
| Companion: Goal Generation | output/sourcy_supplier_bot_goal_gen_v1.html | Goal generation methodology, taxonomy, and 8-product demo |