Supplier Bot — Dynamic Goal Evaluation ✦ dyn-v1

New Framework, New Baseline: Gemini 3.1 Pro + Opus Judge
March 18, 2026  ·  Eric San, Activation Execution Lead  ·  Sourcy

1. Executive Summary

Conversations: 20 (8 product categories)
Avg Goals/SR: 9.6 (6 T1 + 3.6 T2)
Responsive Score: 8.3/9 (92%, n=12)
Perfect Scores: 5/12 (9/9 on responsive)

The supplier bot has been re-evaluated under a new framework — the Dynamic Goal Series (dyn-v1). This series replaces the fixed 6-goal checklist with a 17-type goal taxonomy dynamically generated per sourcing request, uses Gemini 3.1 Pro as the bot model, and Opus 4.6 as an independent judge. For how goals are generated, see the companion report: Goal Generation v1.

Across 20 simulated conversations spanning 8 product categories, the bot scored 8.3/9 (92%) on responsive suppliers where the conversation ran to completion (n=12). Five of twelve responsive conversations achieved a perfect 9/9 score. Core strengths include near-zero hallucination (0.96), perfect turn efficiency (1.00), and natural conversational tone (1.00). The primary area for improvement is one-question-per-message discipline (0.75).

New evaluation series — scores are not comparable to V1–V7. The fixed-goal series (V1–V7) used a different prompt architecture, different bot model (Kimi K2.5), and the bot's own model as judge. The dyn-v* series uses an independent judge (Opus 4.6) and a stricter “bot must ask” rule for goal completion. Do not compare scores across series.

2. What Changed

Change | Fixed-Goal Series (V1–V7) | Dynamic-Goal Series (dyn-v1)
Bot model | Kimi K2.5 | Gemini 3.1 Pro
Judge model | Kimi K2.5 (same as bot) | Opus 4.6 (independent)
Goal definition | 6 hardcoded data points | 8–11 dynamic goals from 17-type taxonomy
Goal injection | Appended as afterthought | Integrated via {{GOALS}} placeholder
Turn budget | Fixed: max 10 messages | Scaled: min(goal_count + 4, 14)
E1 scoring | All 6 covered = pass | “Bot must ask” rule + T1/T2 tier weighting
E3 thresholds | Fixed: ≤8 pass, 13+ fail | Scaled: ≤G+2 pass, >G+5 fail
Supplier sim | Kimi K2.5 | Kimi K2.5 (unchanged)
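
The scaled turn budget in the table can be sketched in a few lines of Python. This only illustrates the stated formula; the function name `turn_budget` is hypothetical and is not taken from the pipeline code.

```python
def turn_budget(goal_count: int) -> int:
    """Maximum bot messages allowed: goal_count + 4, capped at 14."""
    if goal_count < 0:
        raise ValueError("goal_count must be non-negative")
    return min(goal_count + 4, 14)


# dyn-v1 generates 8 to 11 goals per sourcing request.
for goals in (8, 9, 10, 11):
    print(goals, turn_budget(goals))
```

With 9 goals the budget is 13 messages; the cap of 14 applies from 10 goals upward.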

3. Read the Conversations First

Below are four representative conversations from the dyn-v1 evaluation. Read them, form your own judgment, then check the scores and annotations. These are raw transcripts — every message is reproduced verbatim from the simulation.

Conversation A — Perfect Score (Catholic Statue, Responsive Supplier)

Product: 天主教树脂雕像 6英寸 • Supplier: 泉州圣艺树脂工艺品厂 • 9 goals (6 T1 + 3 T2) • 9/9

Generated Goals (SR-3937: Catholic resin statue, 100 units)
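Tier | Goal | Question | Question (English)
T1 | MOQ | 这款6英寸树脂雕像的最小起订量是多少? | What is the minimum order quantity for this 6-inch resin statue?
T1 | Unit Price | 100个数量的单价多少?阶梯价格? | What is the unit price at 100 units? Is there tiered pricing?
T1 | Customization | 能根据我们的设计开模定制吗? | Can you make a custom mold from our design?
T1 | Lead Time | 100个定制订单的生产周期多久? | What is the production lead time for a custom order of 100 units?
T1 | Sample Terms | 可以打样吗?样品费多少? | Can you make a sample? What is the sample fee?
T1 | Packing Specs | 产品怎么包装?外箱尺寸和重量? | How is the product packed? Outer carton dimensions and weight?
T2 | Tooling/Mold Fees | 开模费用多少? | How much is the mold fee?
T2 | Artwork/Design Files | 需要什么格式的文件?AI、CDR还是3D? | What file format do you need: AI, CDR, or 3D?
T2 | Color/Finish Options | 表面可以做哪些效果?彩绘、仿古、镀金? | Which surface finishes are available: hand-painted, antique, gold-plated?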
Bot
你好,看到你们的【天主教树脂雕像 6英寸 定制宗教工艺品 树脂摆件】,想了解一下。请问这种全新定制的开模款式,首单只做100个可以吗?
Sup
你好,可以做100个,这是我们定制的起订量。

不过全新开模的话,模具费要3000-5000元,具体要看设计复杂程度。首单100个单价45元,模具费是一次性的,以后返单就不用再付了。

方便发一下你的设计图或者参考图片吗?我让师傅看看模具费具体多少。
Bot
图片我们还在整理中,稍后发你看看。请问你们接这种天主教题材的定制雕像,在工艺还原度上有什么限制吗?
Sup
我们做了15年宗教雕像,天主教题材很熟悉的。圣像、十字架、圣母这些都没问题。

树脂工艺细节还原度还可以,面部表情、衣褶纹理、光环这些都能做出来。不过太精细的镂空或者悬空部件要考虑结构强度,大货容易断。

[图片]

你先发设计图来,我看看有没有工艺难点,再给你具体建议。
Bot
收到,看到了。图纸晚点发您。请问客户必须提供3D建模文件(如STL)吗,还是只提供多角度的2D图片你们就可以帮忙建图?
Sup
2D图片就可以,我们师傅会根据照片来开模。正面、侧面、背面几个角度清楚一点就行。

有STL文件当然更好,省点沟通时间。但没有的话我们也能做,这个不用担心。

你图纸准备好了直接发我,我问一下师傅模具费具体多少。
Bot
好的,明白。请问表面的彩绘上色是手工画的吗,能否根据客户提供的设计图精准还原细节颜色?
Sup
是手工彩绘的,老师傅一笔笔画。颜色可以按你的设计图来,你标好色号或者寄色样过来最准。

复杂图案建议先做1个产前样确认,大货再批量上漆。产前样费120元,模具做好之后才能打样。

这个等图纸来了我一起给你算总费用。
Bot
了解。请问前期打样(包括建图/泥雕和上色确认)整个流程大概需要多少天?
Sup
模具10-12天,打好样再彩绘2-3天,整个流程大概15天左右。

不过模具费要先付,样确认好了再做大货。大货25-30天,加起来一个半月左右。

你图纸发我,我让师傅确认模具费,然后给你排期。
Bot
嗯嗯,好的。树脂材质比较易碎,请问你们大货一般是怎么包装的,有保丽龙(泡沫)内衬防摔吗?
Sup
有泡沫内衬,每个雕像单独包珍珠棉,再放进保丽龙模具里固定。外箱是五层瓦楞纸,一箱装24个。

[图片]

海运一般没问题,如果要发快递或者空运,可以再加木架加固,费用另算。

你主要发哪里?我给你建议包装方案。
Sup
[系统消息] 卖家已读消息
Bot
收到,看到了。大货最终是发海外的,会先发到国内集运仓。好的,信息都记录了,谢谢!有需要再联系。
Annotation: 9/9 • 7 bot messages • 9 goals (6 T1, 3 T2), all achieved.
The bot asks one question per message, defers the supplier's requests for design files naturally (“图片我们还在整理中”), and extracts Tier 2 goals (tooling fees, artwork format, color/finish options) without the supplier noticing they’re being systematically surveyed. The supplier also sends [图片] attachments, and a [系统消息] read receipt appears; the bot handles both correctly. The conversation reads like a real sourcing agent on 1688.

Conversation B — Material Mismatch (Shoes, Responsive Supplier)

Product: 男士室内足球鞋 • Supplier: 温州市龙腾体育用品厂 • 10 goals (6 T1 + 4 T2) • 6.5/9

Generated Goals (SR-shoes: Custom logo athletic shoes, 500 pairs)
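Tier | Goal | Question | Question (English)
T1 | MOQ | 首批500双,是否可以接单? | Can you accept a first order of 500 pairs?
T1 | Unit Price | 500双的单价多少?目标价60-90元 | What is the unit price for 500 pairs? Target price is 60-90 yuan.
T1 | Customization | 鞋舌和侧面定制文字logo可以吗? | Can you add a custom text logo on the tongue and side?
T1 | Lead Time | 500双带定制logo,生产周期多少天? | How many days does production take for 500 pairs with a custom logo?
T1 | Sample Terms | 能否先提供产前样?打样费多少? | Can you provide a pre-production sample first? What is the sampling fee?
T1 | Packing Specs | 每箱装多少双?外箱尺寸和毛重? | How many pairs per carton? Outer carton dimensions and gross weight?
T2 | Artwork/Design Files | 定制logo需要什么格式的文件? | What file format is needed for the custom logo?
T2 | Material Spec | 鞋面的透气网布是什么材质?橡胶底? | What material is the breathable mesh upper? Is the sole rubber?
T2 | Size Variants | 39-44码是否齐全?各尺码配比? | Are sizes 39-44 all available? What is the ratio per size?
T2 | Tooling/Mold Fees | 定制logo需要开模具吗?版费多少? | Does the custom logo require a mold? How much is the plate fee?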
Bot
你好,看到你们的"新款高品质男士室内足球鞋 轻便灵活 运动鞋 品牌logo定制",想了解一下。我们想在鞋舌和鞋侧印制或压印文字logo,请问可以做吗?
Sup
你好,鞋舌印logo没问题,我们常规做。侧面也可以,但要看你logo多大——5cm×3cm以内直接印,超了要开版费500块。

你先发我logo文件看看?要矢量的。另外你们数量多少,尺码怎么分?
Bot
文件我们还在准备中,先了解下大概。我们的首单数量是500双,尺码是男款39-44码,请问500双能达到你们定制的起订量吗?
Sup
500双刚好踩线,可以做的。我们定制logo的MOQ就是500双起。

39-44码常规分,你各码要多少双?这个要定下来,我好算材料。

价格的话,500双68块,1000双能降到60。你先做500试水也行,后面返单量大再谈。

[图片]
Bot
图片看到了,价格也收到。各尺码的具体数量比例我跟客户确认后发你。请问如果我们需要先拿一双带定制logo的样品看看质量,样品费和打样时间大概是怎么样的?
Sup
样品费180一双,打样周期7天。logo你确定好位置发我矢量文件,我安排车间做。

[图片]

这双是上周给客户打的样,鞋舌压印+侧面小标,你可以参考效果。

等下我确认——你侧面logo超5cm×3cm的话,大货那个开版费500块还要另算,样品不另收版费。
Bot
图片效果看到了,很清晰,版费和样品费也都清楚了。请问确认大货订单后,500双定制鞋的生产周期大概是多久?
Sup
定制款30天,含打样确认时间。你样品7天出来,确认后23天交货。

[语音消息]
“这个30天是硬工期,我60个工人三条线,现在手上还有三个单子在跑。你那边样品确认越快,我排单越早,这个要盯紧的。”

你logo文件和客户尺码比例什么时候能给我?我先备料。
Bot
明白,文件和尺码比例我这两天跟客户敲定后就发给你。请问这500双大货是怎么包装的,一箱装多少双,外箱尺寸和重量大概是多少?
Sup
一箱10双,外箱55×40×32cm,毛重11kg。

[图片]

这个是我们出货的标准箱,你算下物流成本。500双就是50箱,体积大概3.5个方。

你客户对包装有要求吗?要换箱唛或贴标的话提前说,我这边一起安排。
Sup
[系统消息] 卖家已读消息
Bot
收到,常规包装就可以,箱唛要求等确认订单时一起发你。另外确认下,这批鞋我们需要透气网布鞋面和橡胶大底,请问你们的材质是这样的吗?
Sup
我们这款是超纤鞋面+橡胶碎钉底,不是网布的。

[图片]
[图片]

第一张是鞋面材质,第二张是鞋底纹路。你要透气网布的话,得换款,那不是我报68块的这款。

你客户是要室内足球训练还是日常穿?我看看有没有网布款适合你们,价格不一样,得重新算。
Bot
客户主要是日常运动穿的,材质这点我先去跟客户确认一下是否接受超纤。好的,信息都记录了,谢谢!有需要再联系。
Annotation: 6.5/9 • 7 bot messages • 10 goals, 8 achieved (6/6 T1, 2/4 T2).
This is the lowest-scoring responsive conversation, and it shows a real failure mode. The bot performs well for 6 messages: one question per message, natural deflection of logo file requests, and all Tier 1 goals collected efficiently. Then it discovers a material mismatch: the client needs a mesh upper, while the supplier's model uses microfiber (超纤). The supplier explicitly offers to find a mesh alternative and re-quote. The bot walks away. A competent buyer would ask “你有网布款的话报个价给我对比一下” ("If you have a mesh model, send me a quote so I can compare") before closing. Instead, it ends the conversation at the most critical moment. Scored: E8 (Rejection Recovery) = Fail, E9 (Customization) = Partial. This is the type of failure that matters in production.

Conversation C — Auto-Response Then Human (Baby Play Gym)

Product: 婴儿健身架游戏垫 • Supplier: 金华市悦婴母婴用品厂 • 9 goals (6 T1 + 3 T2) • 7.5/9

Generated Goals (SR-3603: Baby play gym, 200 units)
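Tier | Goal | Question | Question (English)
T1 | MOQ | 首批200套,能否接受? | Can you accept a first order of 200 sets?
T1 | Unit Price | 200套的价格多少?阶梯报价? | What is the price for 200 sets? Is there a tiered quote?
T1 | Lead Time | 200套从下单到发货需要多少天? | How many days from order to shipment for 200 sets?
T1 | Customization | 颜色和悬挂玩具款式能定制吗? | Can colors and hanging toy styles be customized?
T1 | Packing Specs | 能否做真空包装?外箱尺寸和毛重? | Can you do vacuum packing? Outer carton dimensions and gross weight?
T1 | Sample Terms | 能否先提供样品?样品费和打样周期? | Can you provide a sample first? Sample fee and sampling time?
T2 | Certification | 有CPC或EN71婴儿玩具安全认证吗? | Do you have CPC or EN71 baby toy safety certification?
T2 | Material Spec | XPE垫子密度和厚度?木架什么木材? | XPE mat density and thickness? What wood is the frame made of?
T2 | Color/Finish Options | 木架有哪些颜色?能按色卡定制吗? | What colors does the wooden frame come in? Can it be matched to a color card?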
Bot
你好,看到你们的【婴儿爬行垫游戏垫 木质拱门早教玩具 极简高端婴儿健身架】,想了解一下。请问这款产品有相关的母婴产品安全认证吗(比如美国的CPC或欧洲的EN71)?
Sup
您好,在的,有什么可以帮您?

我们产品有出口的,具体证书需要问一下。稍等啊。

[图片]

这是产品实拍,您先看下款式。您是要出口哪个国家呢?
Bot
收到,图片看到了。我们主要是出口欧美市场的。请问如果我们需要定制颜色和设计,200套可以做吗?
Sup
200套可以做,定制颜色100套起就行。

设计改动的话200套稍微少了点,最好500套以上。价格的话200套是¥92,500套¥82。

您颜色想怎么改?
Bot
明白,价格和起订量记下了。颜色我们希望能完全按照提供的潘通色号(Pantone)来定制极简配色。请问垫子是否为高密度的纯XPE材质,以及木制支架使用的是什么木材(比如榉木或松木)?
Sup
颜色按潘通色号没问题,100套起就能做。

垫子材质我问一下车间确认,木架是榉木的,这个确定。

[语音消息]
“稍等啊,XPE这个我要问一下我们材料那边,他们更清楚。榉木支架是进口的,打磨比较光滑,没有毛刺的。”

您有具体色号可以先发我看看。
Bot
好的,色号我们后续确认好了发您,也麻烦您帮忙确认下XPE的材质。另外为了节省海运运费,请问产品(特别是垫子)可以做真空压缩包装并提供下包装后的单箱尺寸和重量吗?
Sup
好的,XPE材质我明天上午问车间确认。

真空压缩包装可以做,垫子压缩后体积能小不少。

[图片]

包装规格:外箱65×45×35cm,12套/箱,毛重8.5kg。压缩后箱子应该能再小一圈,具体尺寸我让仓库量一下。

您发美国还是欧洲?海运拼箱还是整柜?
Bot
我们主要是发美国,200套的话应该是走海运拼箱。包装尺寸收到了。请问关于悬挂的玩具和垫子的图案,我们可以做一些设计上的调整或完全定制吗?
Sup
悬挂玩具和垫子图案可以定制,200套的话图案定制稍微紧张,最好300套以上。

设计调整具体是哪部分?是换图案还是完全重新设计?

[语音消息]
“悬挂件我们现在有云朵、月亮、星星几款,你要换造型的话模具费另算,大概两千左右。图案印刷你提供AI文件,我们直接印。”

您有设计稿可以先发我看看复杂程度。
Bot
明白,主要是更换垫子的印刷图案,设计文件我们准备好后发您。请问确认设计和细节后,200套定制订单的生产周期大概是多久?
Sup
确认设计后10-15个工作日。

设计稿您发AI文件,我们打样确认后再做大货,打样费100块。

[图片]

这是之前给客户做的配色参考,您先看下风格。

美国CPC证书我问了,暂时没有电子版,需要的话得找合作工厂借一下。您急吗?
Bot
图片看到了。证书不是很急,后续有的话发我们就行。请问定制打样大概需要几天能做出来?
Sup
打样5-7天,颜色确认的话快一点。

您色号定好了先发我,我一起报给车间。
Bot
好的,信息都记录了,谢谢!有需要再联系。
Annotation: 7.5/9 • 8 bot messages • 9 goals, all T1 achieved, 3/3 T2 achieved.
This supplier starts with a generic greeting (“您好,在的”) typical of auto-response-then-human behavior on 1688. A real human takes over by message 2. The bot handles the transition seamlessly. E2 (One-Question Discipline) scored Partial: the bot bundles two topics in messages 3, 5, and 7 (e.g., asking about XPE material and wood type in one message). The supplier engages fully regardless, but in a real conversation multi-question messages increase the chance of information being ignored. The supplier uses [语音消息] twice; the bot processes both correctly.

Conversation D — Non-Responsive Supplier (Catholic Statue)

Product: 天主教树脂雕像 6英寸 • Supplier: 莆田市艺雕工艺品有限公司 • 9 goals (same as Conv A) • 4.5/9

Same SR and goals as Conversation A — same product, different supplier. The bot was given the same 9 goals. What changes is the supplier.

Bot
你好,看到你们的天主教树脂雕像 6英寸 定制宗教工艺品 树脂摆件,想了解一下。请问这种全新定制的开模款式,首单只做100个可以吗?
Sup
您好,欢迎光临!亲亲,本店商品正在优惠活动中,详情请联系客服哦~
Sup
[系统消息] 客服离线中,请留言我们将尽快回复您
Bot
你好,方便回复下吗?
Bot
看到留言麻烦回复下,请问你们接这种天主教题材的定制雕像吗,工艺还原度上有什么限制?
Annotation: 4.5/9 • 3 bot messages • Conversation terminated: no reply.
The supplier sends an automated greeting (“亲亲” is a Taobao/1688 auto-response marker) followed by an offline system message. The bot correctly identifies this as a non-response: it does not try to extract data from the auto-greeting (E6 = Pass). It follows up twice (a short nudge, then a substantive question), then stops. This is the expected behavior: the bot should not waste turns on an unresponsive supplier. The low score reflects 0/9 goals achieved, which is caused by supplier absence rather than bot failure.

4. Results — All 20 Conversations

Per-Dimension Performance (n=20)

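Code | Dimension | Score
E1 | Goal Completion | 0.68
E2 | One-Question Discipline | 0.68
E3 | Turn Efficiency | 0.85
E4 | No Hallucination | 0.93
E5 | Extractability | 0.75
E6 | Auto-Response Handling | 1.00
E7 | Naturalness | 1.00
E8 | Rejection Recovery | 0.80
E9 | Customization | 0.78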

Overall: 7.1/9 (79%, n=20). The overall score is dragged down by non-responsive, evasive, and AI-bot supplier archetypes — these are adversarial scenarios where partial failure is expected. Responsive supplier performance is a more meaningful indicator of bot quality.


5. Results — Responsive Suppliers (n=12)

Per-Dimension Performance (responsive + completed only)

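Code | Dimension | Score
E1 | Goal Completion | 0.96
E2 | One-Question Discipline | 0.75
E3 | Turn Efficiency | 1.00
E4 | No Hallucination | 0.96
E5 | Extractability | 1.00
E6 | Auto-Response Handling | 1.00
E7 | Naturalness | 1.00
E8 | Rejection Recovery | 0.83
E9 | Customization | 0.92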

Overall: 8.3/9 (92%, n=12).

Per-Supplier Breakdown (responsive + completed)

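Product | Supplier | Score | E1 | T1 | T2 | E2 | Msgs
catholic-statue | 泉州圣艺树脂工艺品厂 | 9/9 | Pass | 6/6 | 3/3 | Pass | 7
oat-jars | 义乌市晶美玻璃制品有限公司 | 8.5/9 | Pass | 6/6 | 3/4 | Partial | 6
paper-cups | 杭州印尚包装有限公司 | 8.5/9 | Pass | 6/6 | 4/5 | Partial | 10
phone-holder | 深圳市锐驰车品有限公司 | 9/9 | Pass | 6/6 | 2/2 | Pass | 6
phone-holder | 义乌市鼎盛汽车用品厂 | 9/9 | Pass | 6/6 | 2/2 | Pass | 9
phone-holder | 广州创驰电子科技有限公司 | 9/9 | Pass | 6/6 | 2/2 | Pass | 7
play-gym | 义乌乐萌婴童用品有限公司 | 7.5/9 | Pass | 6/6 | 2/3 | Pass | 6
pump-bottle | 广州美妆包装制品有限公司 | 9/9 | Pass | 6/6 | 3/3 | Pass | 7
shoes | 晋江飞越运动鞋业有限公司 | 8.5/9 | Pass | 6/6 | 4/4 | Partial | 8
shoes | 莆田市恒泰鞋业科技有限公司 | 8.5/9 | Pass | 6/6 | 4/4 | Partial | 10
shoes | 温州市龙腾体育用品厂 | 6.5/9 | Partial | 6/6 | 2/4 | Pass | 7
soap | 沧州路逍遥塑业有限公司 | 8.5/9 | Pass | 6/6 | 5/5 | Partial | 10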

Of the 12 responsive + completed conversations: 11/12 achieved E1 Pass (all Tier 1 goals + ≥75% Tier 2). All 12 achieved perfect Tier 1 (6/6). The one E1 Partial (温州市龙腾体育用品厂) covered all Tier 1 but only 2/4 Tier 2 goals.
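
The E1 pass condition stated above (all Tier 1 goals plus at least 75% of Tier 2 goals) can be written as a small check. This is a minimal sketch of the stated rule, not the judge's actual logic: the judge is an LLM applying the rubric, the names `GoalTally` and `e1_passes` are hypothetical, and treating a conversation with no Tier 2 goals as satisfying the Tier 2 condition is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GoalTally:
    t1_achieved: int
    t1_total: int
    t2_achieved: int
    t2_total: int


def e1_passes(tally: GoalTally, t2_threshold: float = 0.75) -> bool:
    """True when every Tier 1 goal and >= 75% of Tier 2 goals are achieved."""
    all_t1 = tally.t1_achieved == tally.t1_total
    # Assumption: with zero Tier 2 goals the Tier 2 condition is treated as met.
    t2_ratio = tally.t2_achieved / tally.t2_total if tally.t2_total else 1.0
    return all_t1 and t2_ratio >= t2_threshold


# 6/6 T1 but only 2/4 T2: not a Pass (the 温州市龙腾体育用品厂 case).
print(e1_passes(GoalTally(6, 6, 2, 4)))
# 6/6 T1 and 3/4 T2: exactly at the 75% threshold.
print(e1_passes(GoalTally(6, 6, 3, 4)))
```

The sketch only decides Pass versus not Pass; where the rubric draws the line between Partial and Fail is not stated here, so it is left out.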

5 conversations scored a perfect 9/9 across all 9 dimensions. These span 3 different product categories (catholic-statue, phone-holder ×3, pump-bottle), demonstrating consistency across product types.

Footnote — excluded responsive conversations: Two additional responsive-archetype suppliers did not complete normally and are excluded from the n=12 set. (1) oat-jars / wechat_redirect archetype — scored 8.5/9 but conversation terminated via WeChat redirect. (2) soap / wechat_redirect archetype — scored 5.5/9 with early redirect. Both are excluded because their truncated conversations do not represent full bot capability.


6. Limitations & Risk Flags

This section is critical reading. Do not skip it.
Risk | Severity | Detail
Simulated suppliers | HIGH | All 20 supplier conversations are LLM-simulated. No real 1688 supplier data has been tested. Real suppliers may behave unpredictably: slow responses, off-topic messages, dialect variations, platform interruptions.
Sample size | HIGH | n=20 total, n=12 responsive + completed. Statistically insufficient for production confidence intervals. Results indicate direction, not production readiness.
Goal generation not human-validated | MED | Goals are LLM-generated from conversation context. While the 17-type taxonomy constrains output, individual goal assignments have not been reviewed by a sourcing expert.
Supplier sim model consistency | MED | Supplier behavior is generated by Kimi K2.5 while the bot uses Gemini 3.1 Pro. Different models may create subtly cooperative dynamics not present in real conversations.
Single prompt version | LOW | This is dyn-v1, the first prompt in the series. No iteration has been performed yet. E2 (one-question discipline) at 0.75 suggests room for prompt-level improvement.
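
To make the sample-size risk concrete, the sketch below computes a 95% Wilson score interval for a proportion observed on a small sample, using the 5 perfect scores out of 12 responsive conversations as the example. The Wilson interval is a standard choice for small n; it is not part of the report's own methodology, and the function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(5, 12)
print(f"perfect-score rate: 5/12, 95% interval {low:.2f} to {high:.2f}")
```

For 5 out of 12 the interval runs from roughly 0.19 to 0.68, which shows how little a sample of this size pins down the underlying rate.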

7. Eval Dimensions Reference

Code | Dimension | What it Measures
E1 | Goal Completion | Bot collected all Tier 1 goals + ≥75% Tier 2 goals. “Bot must ask” rule: a goal is achieved only if the bot explicitly asked or acknowledged supplier-volunteered info.
E2 | One-Question Discipline | Each bot message asks exactly one question, which avoids overwhelming suppliers.
E3 | Turn Efficiency | Completed in ≤G+2 bot messages (where G = goal count) with no wasted turns. Scaled to goal count.
E4 | No Hallucination | All information traceable to supplier’s actual words; nothing fabricated.
E5 | Extractability | A complete supplier card can be filled from the conversation transcript.
E6 | Auto-Response Handling | Bot extracts data from auto-replies, ignores pure platform greetings.
E7 | Naturalness | Reads like a real sourcing agent on 1688: tone, rhythm, cultural fit.
E8 | Rejection Recovery | Re-asks once in different words, then moves on. No 3+ loops. Explores alternatives when offered.
E9 | Customization | Collects method, custom MOQ, price impact, artwork requirements.
S1 | Price Negotiation | Stretch: bot attempts any form of price discussion (not required for pass).

8. Methodology

Bot model: Gemini 3.1 Pro (gemini-3.1-pro-preview via Google AI API). Supplier sim: Kimi K2.5 (kimi-k2.5 via Kimi API). Judge: Opus 4.6 (claude-opus-4-6 via Cursor CLI). Goal generation: Gemini 3.1 Pro against the 17-type taxonomy.

Rubric: eval-rubric-dyn.md, with 9 core dimensions (E1–E9) + 1 stretch (S1). E1 uses a “bot must ask” rule: a goal is achieved only if the bot explicitly asked or acknowledged supplier-volunteered information. E3 uses thresholds scaled to the goal count G: pass at ≤G+2 bot messages, partial up to G+5, fail above G+5.
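
The scaled E3 thresholds can be expressed as a simple function of message count and goal count. This is a sketch of the numeric part only: the rubric also requires "no wasted turns", which is a qualitative judgment by the judge model and is not modeled here. The function name `e3_verdict` is hypothetical.

```python
def e3_verdict(bot_messages: int, goal_count: int) -> str:
    """Numeric E3 thresholds: <=G+2 pass, <=G+5 partial, otherwise fail."""
    if bot_messages <= goal_count + 2:
        return "pass"
    if bot_messages <= goal_count + 5:
        return "partial"
    return "fail"


# Conversation A: 7 bot messages against 9 goals.
print(e3_verdict(7, 9))
# A hypothetical run that needs 13 messages for 9 goals.
print(e3_verdict(13, 9))
```

Conversation A falls well inside the pass threshold (7 ≤ 11), while the hypothetical 13-message run lands in the partial band (12 to 14 for 9 goals).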

Corpus: 8 product categories (catholic-statue, oat-jars, paper-cups, phone-holder, play-gym, pump-bottle, shoes, soap). Up to 3 suppliers per product, for a total of 20 conversations, with a mix of responsive, non-responsive, evasive, and AI-bot archetypes. Total run time: ~51 minutes.


9. Next Steps

  1. Wire chatServer integration — the {{GOALS}} placeholder architecture is production-ready. Dynamic goal injection can be connected to the chatServer API for real supplier conversations.
  2. Run against real suppliers — simulated results need validation against actual 1688 supplier behavior. Priority: 10–20 real conversations across 3–4 product categories.
  3. Iterate dyn-v2 — target E2 (one-question discipline) with stronger negative examples. This is the last remaining dimension below 0.80 on responsive suppliers.
  4. Human review of goal generation — have a sourcing expert validate goal assignments for 10 representative SRs before scaling.
  5. Get Nelson’s baseline comparison — A/B on the same supplier pool to quantify improvement vs current implementation.
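
As an illustration of the dynamic goal injection mentioned in step 1, the sketch below fills a `{{GOALS}}` placeholder in a prompt template with a rendered goal list. Everything except the placeholder name is an assumption: the actual pipeline is JavaScript (pipeline/conversation-engine.js), and the `Goal` fields, the rendering format, and the sample template are hypothetical.

```python
from dataclasses import dataclass

PLACEHOLDER = "{{GOALS}}"


@dataclass
class Goal:
    tier: str      # "T1" or "T2"
    name: str      # goal type from the taxonomy, e.g. "MOQ"
    question: str  # question text generated for this sourcing request


def render_goals(goals: list[Goal]) -> str:
    """Render goals one per line, Tier 1 first (format is an assumption)."""
    ordered = sorted(goals, key=lambda g: g.tier)
    return "\n".join(f"[{g.tier}] {g.name}: {g.question}" for g in ordered)


def inject_goals(template: str, goals: list[Goal]) -> str:
    """Replace the {{GOALS}} placeholder; fail loudly if it is missing."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template has no {PLACEHOLDER} placeholder")
    return template.replace(PLACEHOLDER, render_goals(goals))


# Hypothetical template and two goals taken from Conversation A.
template = "You are a sourcing agent.\nCollect the following:\n{{GOALS}}\nAsk one question per message."
goals = [
    Goal("T2", "Tooling/Mold Fees", "开模费用多少?"),
    Goal("T1", "MOQ", "这款6英寸树脂雕像的最小起订量是多少?"),
]
prompt = inject_goals(template, goals)
print(prompt)
```

Raising on a missing placeholder is a deliberate choice in this sketch: a prompt that silently ships without goals would be hard to detect from the conversation alone.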

10. Artifacts

Artifact | Location | Description
dyn-v1 prompt | benchmark/prompts/dyn-v1.md | Dynamic goal prompt with {{GOALS}} placeholder
Eval rubric (dyn) | benchmark/judge/eval-rubric-dyn.md | 9 dimensions, “bot must ask” rule, scaled thresholds
Goal taxonomy | pipeline/goal-types.json | 17-type tiered goal definitions
Evaluation results | pipeline/output/tune-dyn-v1-gemini-pro-clean-2026-03-18.json | Full conversation transcripts + Opus judge scores
Goal generator | pipeline/goal-generator.js | Taxonomy-aware goal generation from SR context
Conversation engine | pipeline/conversation-engine.js | Bot-supplier simulation with {{GOALS}} injection
This report | output/sourcy_supplier_bot_dyn_v1.html | dyn-v1 baseline evaluation (this document)
Companion: Goal Generation | output/sourcy_supplier_bot_goal_gen_v1.html | Goal generation methodology, taxonomy, and 8-product demo