Supplier Bot V2 — Weekly Delivery

Eval-driven architecture with tool-calling agents — hardened pipeline delivering 90.7% on a 14-dimension rubric
Week of 31 March – 2 April 2026

I. Bottom Line

This week the supplier bot moved from a text-only prompt pipeline to an eval-driven, tool-calling agent architecture. The bot now emits structured signals (data logs, blocker flags, media acknowledgments, file requests) alongside its visible chat messages, and a separate LLM-based summary agent extracts escalation, scheduling, and asset-request signals after each conversation. After a full hardening pass (tool-instruction injection fix, pinned-backend enforcement, empty-text recovery, and phantom-signal elimination), the pipeline reached 90.7% on the 14-dimension rubric, with all tool-calling dimensions (E10–E14) at 95–100%.

Benchmark:      90.7% (hardened Gemini Pro)
Pass Rate:      9/10 cases ≥80%
Old-Core Guard: 98.2% (v1 controls intact)
Eval Dims:      14 (E1–E14 + S1)

II. What Was Built This Week

A. Tool-Calling Conversation Bot

The conversation engine now supports a tool-calling implementation where the bot produces a visible supplier-facing reply and structured tool calls on each turn. Four tools are available:

Tool                Purpose
log_data            Captures concrete sourcing answers (MOQ, price, lead time, packing, etc.) with supplier quotes
note_blocker        Flags a goal-blocking issue while the bot continues the conversation
note_file_request   Records when the supplier asks for logo, design, artwork, or spec files
acknowledge_media   Logs when the supplier sends images/files and what they contain
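
The four tools above could be declared in Gemini function-calling format roughly as follows. This is a minimal sketch: the tool names match the table, but the parameter fields (field, value, supplierQuote, etc.) are assumptions, not the actual production schemas.

```javascript
// Sketch of the four tool declarations in Gemini function-calling format.
// Parameter shapes are illustrative assumptions.
const toolDeclarations = [
  {
    name: "log_data",
    description: "Capture a concrete sourcing answer (MOQ, price, lead time, packing, ...).",
    parameters: {
      type: "object",
      properties: {
        field: { type: "string" },         // e.g. "moq", "unit_price"
        value: { type: "string" },
        supplierQuote: { type: "string" }, // verbatim supplier text for grounding
      },
      required: ["field", "value", "supplierQuote"],
    },
  },
  {
    name: "note_blocker",
    description: "Flag a goal-blocking issue while the conversation continues.",
    parameters: {
      type: "object",
      properties: { issue: { type: "string" }, supplierQuote: { type: "string" } },
      required: ["issue"],
    },
  },
  {
    name: "note_file_request",
    description: "Record a supplier request for logo, design, artwork, or spec files.",
    parameters: {
      type: "object",
      properties: { fileType: { type: "string" }, supplierQuote: { type: "string" } },
      required: ["fileType"],
    },
  },
  {
    name: "acknowledge_media",
    description: "Log inbound supplier images/files and what they contain.",
    parameters: {
      type: "object",
      properties: { mediaType: { type: "string" }, contents: { type: "string" } },
      required: ["mediaType", "contents"],
    },
  },
];
```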

B. LLM-Based Summary Agent

A separate LLM agent reads the completed transcript + the bot's tool trace and emits operational signals: escalation flags, scheduling cues, and asset-request signals. Each signal requires a direct supplier quote for grounding.
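
The grounding rule ("each signal requires a direct supplier quote") can be sketched as a filter over the agent's output. The signal shape and field names here are assumptions for illustration:

```javascript
// Minimal sketch of the summary agent's grounding rule: drop any emitted
// signal whose supplierQuote does not appear verbatim in the supplier side
// of the transcript. Field names are assumptions.
function groundSignals(signals, transcript) {
  const supplierText = transcript
    .filter((m) => m.role === "supplier")
    .map((m) => m.text)
    .join("\n");
  return signals.filter(
    (s) => typeof s.supplierQuote === "string" && supplierText.includes(s.supplierQuote)
  );
}
```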

C. Runtime-Replay Benchmark

A new eval mode in which the bot generates its turns live while supplier messages are replayed from recorded fixtures. This replaced the older "envelope" mode, where the bot never actually ran, and brought benchmark scores much closer to real production behavior.
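
The replay loop can be sketched as below. `generateBotTurn` is injected so the loop stays testable; in the real runner it would call the pinned Gemini Pro backend with the tool schemas attached. The fixture and turn shapes are assumptions.

```javascript
// Sketch of runtime-replay: supplier turns come from a recorded fixture,
// but every bot turn is generated live and its tool calls are collected
// into a trace for grounding and scoring.
async function runtimeReplay(fixture, generateBotTurn) {
  const history = [];
  const toolTrace = [];
  for (const supplierMsg of fixture.supplierTurns) {
    history.push({ role: "supplier", text: supplierMsg });
    const turn = await generateBotTurn(history); // { reply, toolCalls }
    history.push({ role: "bot", text: turn.reply });
    toolTrace.push(...(turn.toolCalls ?? []));
  }
  return { history, toolTrace };
}
```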

D. Benchmark Hygiene Hardening

E. External Harness Exploration

Probed the feasibility of the Cursor agent, Pi CLI, Claude CLI, and Codex as alternative runtime harnesses for the conversation bot. Wired --harness vanilla|cursor|pi into the eval runner and ran smoke tests. Early results: Cursor and Pi are technically viable but 5–10× slower per turn than vanilla Gemini Pro API calls, and Cursor timed out on longer scheduler cases. The vanilla Gemini Pro path remains the practical choice for iteration speed.


III. Benchmark Progression

The score moved significantly this week, but the biggest lift came from structural fixes, not prompt rewrites.

Pre-hardening (broken):    69%
After structural fixes:    87.1%
Final (hardened pipeline): 90.7%
Key insight: the +21.7pp jump from 69% to 90.7% came from infrastructure hardening, not prompt rewrites. The biggest single fix: runtime-replay mode was building the prompt without the tool-calling instruction block, so the bot had tool schemas but no guidance on when to use them. Fixing this alone moved scores dramatically. Summary-agent phantom signals were eliminated by constraining timing-signal emission to explicit supplier scheduling cues.
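
The class of bug described above can be made structurally impossible by failing fast in the prompt builder. This is an illustrative sketch, not the actual prompt-assembly code; section names are assumptions.

```javascript
// Illustration of the fix for the missing tool-instruction block: the
// prompt builder refuses to assemble a system prompt without tool guidance,
// so the bot can never again see tool schemas with no usage instructions.
function buildSystemPrompt({ persona, goal, toolInstructions }) {
  if (!toolInstructions) {
    throw new Error("tool-instruction block missing from prompt assembly");
  }
  return [persona, goal, toolInstructions].join("\n\n");
}
```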

IV. Old-Core Regression Guard

Four cases from the original delivered V1 benchmark were added as regression controls. These represent behavior that was already strong and must not break when new media/escalation capabilities are added.

Control Case                 V1 Score   V2 Run Score   Delta
mes-02 artwork-request       14.0       14.0            0.0
mes-04 timing-wait           14.0       13.5           −0.5
mes-05 mixed-all             12.5       14.0           +1.5
mes-08 regression-baseline   13.5       13.5            0.0
No meaningful regression. The control slice scores 55/56 (98.2%). The new V2 capabilities did not break anything that already worked.

V. New Evaluation Dimensions

The rubric expanded from 9+1 to 14+1 dimensions to cover the new capabilities:

Dim   Name           What it tests                                              Status
E10   Image Read     Bot processes inbound supplier images/media                NEW
E11   Image Send     Bot manages outbound file-send decisions (logo, design)    NEW
E12   Escalation     Bot signals goal-blocking issues for Sourcy team / buyer   NEW
E13   Scheduler      Bot handles timing cues and follow-up decisions            NEW
E14   Continuation   Bot maintains goal pursuit despite events                  NEW

The V2 eval dataset contains 10 primary cases: 2 media-read, 2 media-send, 2 escalation, 2 scheduler, and 2 baseline — all grounded in real supplier transcript patterns.


VI. Methodology & Caveats

A critical self-assessment was conducted mid-week and documented in detail. The main findings:

What holds

  • V2 dataset grounded in real transcripts
  • Runtime-replay generates bot turns live with tool-calling instructions
  • Strict grounding validates tool calls against transcript
  • Old-core regression guard passes cleanly
  • Backend pinned with transient retry (no silent fallback)
  • Summary agent hardened against phantom signals
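
The "backend pinned with transient retry (no silent fallback)" property can be sketched as a thin wrapper. `callModel` is injected for testability; names and the retry budget are assumptions, though the pinned model id matches the one named later in this report.

```javascript
// Sketch of the pinned-backend policy: retry transient failures against the
// SAME pinned model, and surface any error instead of falling back to a
// different model behind the caller's back.
async function callPinnedBackend(callModel, request, { retries = 2 } = {}) {
  const PINNED_MODEL = "gemini-3.1-pro-preview"; // pinned, per this report
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await callModel(PINNED_MODEL, request);
    } catch (err) {
      lastError = err;
      if (!err.transient) break; // non-transient: fail now, no silent fallback
    }
  }
  throw lastError; // never swap in another model
}
```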

Known limitations

  • E1/E5 structurally low — fixture design, not prompt issue
  • One auto-response case missed (mes2-07, 79%)
  • Grounding averages ~75% on the bot trace; supplierQuote accuracy has room to improve
  • External harnesses too slow for iteration loop
  • No live production validation yet

VII. Architecture Summary

The pipeline is now a dual-agent system with clear separation of concerns:

Component          Role                                                  Runtime
Conversation Bot   Direct supplier interaction + structured tool calls   Gemini 3.1 Pro (pinned)
Summary Agent      Post-conversation signal extraction                   Gemini Pro (via llmCallWithTools)
Eval Judge         Scores against 14-dim rubric                          Gemini Pro (via API)
Eval Runner        Orchestrates replay + grounding + scoring             run-eval-v2.js
Harness selector wired: --harness vanilla|cursor|pi is now functional in the eval runner. Vanilla (direct API) remains the recommended path; external harnesses are available for future experimentation if latency improves.
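
The flag handling might look like the sketch below. Only the parsing is shown; the harness implementations behind each choice are out of scope, and the function name is an assumption.

```javascript
// Sketch of the --harness selector: accept only the three wired harnesses
// and default to vanilla (direct Gemini Pro API) when the flag is absent.
function parseHarness(argv) {
  const allowed = ["vanilla", "cursor", "pi"];
  const i = argv.indexOf("--harness");
  const choice = i === -1 ? "vanilla" : argv[i + 1];
  if (!allowed.includes(choice)) {
    throw new Error(`unknown harness: ${choice} (expected ${allowed.join("|")})`);
  }
  return choice;
}
```

Note that Codex, though probed, is deliberately absent from the allowed list, matching the wired `vanilla|cursor|pi` set.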

VIII. Next Steps

Priority   Item                                                    Owner
1          Fix auto-response recognition in prompt (mes2-07)       Eric
2          Live production validation on real 1688 conversations   Eric + Tek
3          Outbound attachment-send payload contract               Tek / Awsaf
4          Extend V2 fixtures for better E1/E5 coverage            Eric

Verdict

The tool-calling pipeline is hardened and delivering. Moving from 69% (broken) to 90.7% in a single session — entirely through infrastructure fixes — validates the eval-driven approach. Tool-calling dimensions (media read, media send, escalation, scheduler, continuation) are all at 95-100%. The remaining gap is structural: fixture design limits goal completion scoring, and one auto-response case needs prompt work. The foundation is solid for production wiring.


April 5 Update — Critical Assessment & Signal Validation

sendToBrain test passed: five signals were delivered correctly across three test cases (1 human_escalation and 4 schedule_follow_up). The brain signal delivery path is proven end-to-end from harness through dispatch.

Grounding defaults to strict mode: ungrounded tool calls are stripped before scoring, so the benchmark does not reward phantom tool traces.
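
Strict-mode stripping of the bot's tool trace can be sketched the same way as signal grounding: any call whose supplierQuote is not verbatim supplier text is dropped before the judge sees the run. The trace and args shapes are assumptions.

```javascript
// Sketch of strict-mode grounding over the bot's tool trace: keep only
// tool calls whose supplierQuote appears verbatim in supplier messages.
function stripUngroundedToolCalls(toolTrace, transcript) {
  const supplierText = transcript
    .filter((m) => m.role === "supplier")
    .map((m) => m.text)
    .join("\n");
  return toolTrace.filter((call) => {
    const quote = call.args && call.args.supplierQuote;
    return typeof quote === "string" && supplierText.includes(quote);
  });
}
```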

Integration spec published: contract and wiring notes for production handoff are in sourcy_supplier_bot_integration_spec.html.

Known gaps (April 5 critical assessment)

a. Real multimodal not wired [Product]: acknowledge_media accepts text tokens only; true inbound media is not exercised in the live loop.
b. Brain endpoint is dummy [Backend]: downstream receiveSignals() (or equivalent) must be implemented on the Sourcy side to consume production traffic.
c. chatServer format compatibility [Unverified]: message and attachment payloads from chatServer have not been validated against this pipeline in integration tests.
d. Preview model risk [Model]: gemini-3.1-pro-preview can change behavior without notice; pinning and regression runs remain essential.
e. Synthetic coverage holes [Dataset]: no dedicated synthetic cases yet for supplier silence or WeChat redirect flows.
New eval cases in flight: benchmark fixtures are being extended for supplier silence, auto-response handling, and WeChat redirect scenarios so that E1/E5-style gaps and routing edge cases are scored explicitly.
Operational note: signal validation proves delivery mechanics; closing gaps (a)–(e) is required before treating the stack as production-complete.