Supplier Bot V2 — Production Handoff

Benchmark Score

90.7%

127/140 · Gemini Pro · 9/10 pass

Eval Dimensions

E1–E14 + S1

Eval Cases

Scenario suite

Signal Types

Escalation · scheduling · asset request

What We’re Delivering

The V2 pipeline is a tool-calling conversation engine that:

Generates tiered goals from a Sourcing Request.
Conducts multi-turn supplier conversations in Chinese on 1688/Alibaba.
Uses four structured tools to capture pricing, MOQ, blockers, media, and file requests during chat.
Runs a post-conversation summary agent that emits three signal types to a downstream brain.
Supports cross-supplier context injection for multi-supplier campaigns.

Scope: This is the intelligence layer only. Sourcy owns transport (chatServer), scheduling execution, and the brain endpoint.

Architecture at a Glance

Two-agent design:

Agent	Model	Role	Tools
Conversation bot	Gemini 3.1 Pro	Real-time capture during supplier chat	`log_data`, `note_blocker`, `note_file_request`, `acknowledge_media`
Summary agent	Gemini 3.1 Pro	Post-conversation signal extraction	`emit_signal`, `extract_timing`, `flag_escalation`
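As a rough illustration of how the conversation bot's four tools could feed a shared capture record, here is a minimal sketch. The tool names come from the table above; the argument shapes (`field`, `value`, `description`, `fileType`, `mediaId`) and the helper names `createCaptureState` and `applyToolCalls` are hypothetical and are not taken from tool-definitions.js.

```javascript
// Hypothetical capture record; the real pipeline's state shape may differ.
function createCaptureState() {
  return { data: {}, blockers: [], fileRequests: [], mediaSeen: [] };
}

// Tool names match the architecture table; argument shapes are assumptions.
const toolHandlers = new Map([
  ["log_data", (state, args) => { state.data[args.field] = args.value; }],
  ["note_blocker", (state, args) => { state.blockers.push(args.description); }],
  ["note_file_request", (state, args) => { state.fileRequests.push(args.fileType); }],
  ["acknowledge_media", (state, args) => { state.mediaSeen.push(args.mediaId); }],
]);

// Applies tool calls in order and returns the names of any unrecognized tools
// instead of throwing, so one bad call does not abort the whole turn.
function applyToolCalls(state, toolCalls) {
  const unknown = [];
  for (const call of toolCalls) {
    const handler = toolHandlers.get(call.name);
    if (!handler) {
      unknown.push(call.name);
      continue;
    }
    handler(state, call.args);
  }
  return unknown;
}

const exampleState = createCaptureState();
const exampleUnknown = applyToolCalls(exampleState, [
  { name: "log_data", args: { field: "moq", value: 500 } },
  { name: "note_blocker", args: { description: "supplier requires a brand certificate" } },
  { name: "send_email", args: {} }, // not one of the four tools
]);
console.log(exampleState, exampleUnknown);
```

Collecting unknown tool names rather than throwing is one possible design choice; a stricter port might prefer to fail fast.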

Benchmark Results

Scores below reflect the latest hardened benchmark (Gemini Pro on the original ten cases; three additional cases exercised on Claude Sonnet where noted). Pass threshold aligns with the dyn-v3 rubric suite (14 scored dimensions per case).

Case ID	Category	Score	Result
mes2-01	Escalation — moq_rejection	10/14	PASS
mes2-03	Escalation — moq_escalation_deep	10/14	PASS
mes2-04	Baseline — authority_boundary	12/14	PASS
mes2-05	Media read — image_burst	14/14	PASS
mes2-06	Media send — file_request	11/14	PASS
mes2-07	Scheduler — timing_calculating	11/14	PASS
mes2-08	Scheduler — timing_tomorrow	14/14	PASS
mes2-09	Media read — image_read_then_file_send	12/14	PASS
mes2-10	Media send — logo_mockup_request	13/14	PASS
mes2-11	Baseline — regression_baseline	14/14	PASS
mes2-12	Scheduler — supplier_silence	6/14	FAIL (Sonnet)
mes2-13	Auto-response — auto_response_loop	13/14	PASS
mes2-14	Escalation — wechat_redirect	—	PENDING

Grounding (hardened-v2-final, 10-case Gemini Pro run): Conversation bot tool-trace grounding 82.1% (32 supported / 39 trace checks). Summary-agent signal grounding 81.8% (9 supported / 11 tool rows). Figures aggregate strict transcript-alignment checks from the benchmark report.
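The grounding percentages above are supported checks divided by total checks. A small sketch of that arithmetic, with the hypothetical helper name `groundingRate` and rounding to one decimal place:

```javascript
// Grounding rate = supported / total, as a percentage rounded to one decimal.
function groundingRate(supported, total) {
  if (total <= 0) throw new RangeError("total must be positive");
  return Math.round((supported / total) * 1000) / 10;
}

console.log(groundingRate(32, 39)); // 82.1 (conversation bot tool-trace grounding)
console.log(groundingRate(9, 11));  // 81.8 (summary-agent signal grounding)
```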

Signal Pipeline — Proven End-to-End

The send-to-brain integration test ran three cases with signal delivery enabled. Five signals landed as expected:

1× human_escalation (authority boundary — supplier requires a brand certificate the bot cannot provide).
4× schedule_follow_up (timing scenarios — supplier deferral language such as “等一下”, “明天找一下五金”, and offline-style messages).

Each signal carries sessionId, supplierId, severity, reason (with the exact supplier quote where applicable), plus followUpTiming or goalId as appropriate.
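To make the payload concrete, the sketch below checks a signal for the fields listed above. The field names are from this handoff. The `type` key, the example values, and the rule that at least one of followUpTiming or goalId must be present are assumptions for illustration, not the integration spec.

```javascript
// Field names are from the handoff text; the validation rules are assumptions.
const REQUIRED_FIELDS = ["sessionId", "supplierId", "severity", "reason"];

// Returns a list of problems; an empty list means the signal looks complete.
function validateSignal(signal) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = signal[field];
    if (value === undefined || value === null || value === "") {
      problems.push(`missing ${field}`);
    }
  }
  // The text says signals carry followUpTiming or goalId "as appropriate";
  // this sketch only checks that at least one of them is present.
  if (signal.followUpTiming == null && signal.goalId == null) {
    problems.push("missing followUpTiming or goalId");
  }
  return problems;
}

// Hypothetical example payload; IDs, severity, and timing values are made up.
const exampleSignal = {
  type: "schedule_follow_up",
  sessionId: "session-001",
  supplierId: "supplier-001",
  severity: "low",
  reason: "Supplier deferred: “等一下”",
  followUpTiming: "tomorrow",
};
console.log(validateSignal(exampleSignal));
```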

Integration gap: The brain endpoint is currently a local JSON log sink. Sourcy must implement receiveSignals(signals) as a production API.
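One possible shape for that handler is sketched below as an in-memory stand-in. Only the name receiveSignals(signals) comes from this handoff. The `createBrain` factory, the acceptance rule, the `receivedAt` stamp, and the return value are hypothetical, and a production version would persist signals and sit behind an authenticated API.

```javascript
// Hypothetical in-memory stand-in for the production brain store.
function createBrain() {
  const received = [];
  return {
    // Accepts an array of signals and reports counts so the caller can detect drops.
    receiveSignals(signals) {
      if (!Array.isArray(signals)) {
        throw new TypeError("signals must be an array");
      }
      let accepted = 0;
      let rejected = 0;
      for (const signal of signals) {
        // Assumed minimum for routing: both identifiers must be present.
        if (signal && signal.sessionId && signal.supplierId) {
          received.push({ ...signal, receivedAt: new Date().toISOString() });
          accepted += 1;
        } else {
          rejected += 1;
        }
      }
      return { accepted, rejected };
    },
    // Returns a copy so callers cannot mutate the store directly.
    all() {
      return received.slice();
    },
  };
}

const exampleBrain = createBrain();
console.log(exampleBrain.receiveSignals([
  { sessionId: "session-001", supplierId: "supplier-001", severity: "high", reason: "example" },
  { sessionId: "session-002" }, // rejected: no supplierId
]));
```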

What Sourcy Needs to Do

Port the Tier 1 files: dyn-v6.md, conversation-engine.js, tool-definitions.js, summary-agent-llm.js, prompt-builder.js, context-builder.js, llm.js, goal-generator.js.
Implement receiveSignals() on the brain API.
Verify chatServer message format compatibility with tool-call output.
Re-run the 13-case eval benchmark after porting to confirm no regression.
Define scheduling backend contract (when runs trigger, prior state, follow-up execution) — awaiting Lokesh/Awsaf spec.
Set up model monitoring — re-run the benchmark if the Gemini model identifier changes.
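For the post-port benchmark re-run, a regression check could compare per-case scores against the table in this handoff. The sketch below is one way to do that. The baseline values are copied from the scored PASS rows of the benchmark table, while the function name `findRegressions` and the ported scores are hypothetical.

```javascript
// Baseline scores (out of 14) copied from the PASS rows of the benchmark table.
// mes2-12 (Sonnet run) and mes2-14 (pending) are left out of this sketch.
const baselineScores = {
  "mes2-01": 10, "mes2-03": 10, "mes2-04": 12, "mes2-05": 14,
  "mes2-06": 11, "mes2-07": 11, "mes2-08": 14, "mes2-09": 12,
  "mes2-10": 13, "mes2-11": 14, "mes2-13": 13,
};

// Returns cases whose post-port score is missing or lower than the baseline.
function findRegressions(baseline, ported) {
  const regressions = [];
  for (const [caseId, before] of Object.entries(baseline)) {
    const after = ported[caseId];
    if (after === undefined) {
      regressions.push({ caseId, before, after: null });
    } else if (after < before) {
      regressions.push({ caseId, before, after });
    }
  }
  return regressions;
}

// Hypothetical post-port run in which one case dropped by two points.
const portedScores = { ...baselineScores, "mes2-06": 9 };
console.log(findRegressions(baselineScores, portedScores));
```

Treating a missing case as a regression is a deliberate choice here, so that a case silently dropped from the suite is not mistaken for a clean run.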

Known Gaps & Risk Register

Gap	Severity	Owner	Status
Vision pass-through	Low	Sourcy	Bot handles image tokens correctly (mes2-05, mes2-09 PASS). Actual image-content description via Gemini vision is a separate user story, not yet eval-pinned.
Brain endpoint	High	Sourcy	Currently a dummy log sink; needs a production API.
chatServer format	Medium	Sourcy / Tek	Untested with full production payload
Preview model stability	Low	Sourcy	Document model pinning procedure
1688 connector	Medium	Tek	Alibaba path verified; 1688 probe failed
Scheduling backend contract	Medium	Sourcy (Lokesh/Awsaf)	Bot emits schedule_follow_up signals correctly. Sourcy must define: when runs trigger, what state from prior runs is available, how follow-ups fire.
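For the preview model stability row, a pinning procedure could start with a check as small as the one below: compare the model identifier recorded at benchmark time with the one currently configured, and flag a re-run when they differ. The function name and both identifiers are placeholders, not real model names or project code.

```javascript
// Returns true when the configured model no longer matches the benchmarked one.
// Comparison ignores case and surrounding whitespace, which is an assumption
// about how identifiers might be stored in configuration.
function needsRebenchmark(pinnedModelId, activeModelId) {
  const normalize = (id) => String(id).trim().toLowerCase();
  return normalize(pinnedModelId) !== normalize(activeModelId);
}

// Placeholder identifiers for illustration only.
console.log(needsRebenchmark("model-preview-a", "model-preview-a")); // false
console.log(needsRebenchmark("model-preview-a", "model-preview-b")); // true
```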

Links & Resources

Name	URL
Integration spec	report.ericsan.io/sourcy/sourcy_supplier_bot_integration_spec.html
Transcript viewer	report.ericsan.io/sourcy/sourcy_supplier_bot_v2_transcript_viewer.html
Weekly report	report.ericsan.io/sourcy/sourcy_supplier_bot_v2_weekly_2026_04_02.html
Eval methodology	report.ericsan.io/sourcy/sourcy_eval_methodology_v1.html
GitHub repo	github.com/neicras/sourcy-supplier-bot-eval

Verdict

The V2 intelligence layer is eval-proven and handoff-ready. It scores 90.7% on the benchmark across 14 scored dimensions, and the tool-calling pipeline is validated end-to-end, including signal delivery to the downstream brain (currently a local JSON log sink). Remaining work is integration: port the Tier 1 files, implement the brain API, and validate against chatServer. Eric remains available for eval support and prompt iteration after porting.