Supplier Bot V2 — Weekly Delivery

Eval-driven architecture with tool-calling agents — hardened pipeline delivering 90.7% on a 14-dimension rubric
Week of 31 March – 2 April 2026

I. Bottom Line

This week the supplier bot moved from a text-only prompt pipeline to an eval-driven, tool-calling agent architecture. The bot now emits structured signals (data logs, blocker flags, media acknowledgments, file requests) alongside its visible chat messages, and a separate LLM-based summary agent extracts escalation, scheduling, and asset-request signals after each conversation. After a full hardening pass (tool-instruction injection fix, pinned-backend enforcement, empty-text recovery, and phantom-signal elimination), the pipeline reached 90.7% on the 14-dimension rubric, with all tool-calling dimensions (E10–E14) at 95–100%.

Benchmark:      90.7% (hardened Gemini Pro)
Pass Rate:      9/10 cases ≥80%
Old-Core Guard: 98.2% (v1 controls intact)
Eval Dims:      14 (E1–E14 + S1)

II. What Was Built This Week

A. Tool-Calling Conversation Bot

The conversation engine now supports a tool-calling implementation where the bot produces a visible supplier-facing reply and structured tool calls on each turn. Four tools are available:

Tool                Purpose
log_data            Captures concrete sourcing answers (MOQ, price, lead time, packing, etc.) with supplier quotes
note_blocker        Flags a goal-blocking issue while the bot continues the conversation
note_file_request   Records when the supplier asks for logo, design, artwork, or spec files
acknowledge_media   Logs when the supplier sends images/files and what they contain
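
The four tools above could be declared in Gemini function-calling format roughly as follows. This is a minimal sketch: the tool names match the table, but the parameter fields (field, value, supplierQuote, etc.) are assumptions, not the actual production schemas.

```javascript
// Sketch of the four tool declarations in Gemini function-calling format.
// Parameter shapes are illustrative assumptions.
const toolDeclarations = [
  {
    name: "log_data",
    description: "Capture a concrete sourcing answer (MOQ, price, lead time, packing, ...).",
    parameters: {
      type: "object",
      properties: {
        field: { type: "string" },         // e.g. "moq", "unit_price"
        value: { type: "string" },
        supplierQuote: { type: "string" }, // verbatim supplier text for grounding
      },
      required: ["field", "value", "supplierQuote"],
    },
  },
  {
    name: "note_blocker",
    description: "Flag a goal-blocking issue while the conversation continues.",
    parameters: {
      type: "object",
      properties: { issue: { type: "string" }, supplierQuote: { type: "string" } },
      required: ["issue"],
    },
  },
  {
    name: "note_file_request",
    description: "Record a supplier request for logo, design, artwork, or spec files.",
    parameters: {
      type: "object",
      properties: { fileType: { type: "string" }, supplierQuote: { type: "string" } },
      required: ["fileType"],
    },
  },
  {
    name: "acknowledge_media",
    description: "Log inbound supplier images/files and what they contain.",
    parameters: {
      type: "object",
      properties: { mediaType: { type: "string" }, contents: { type: "string" } },
      required: ["mediaType", "contents"],
    },
  },
];
```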

B. LLM-Based Summary Agent

A separate LLM agent reads the completed transcript + the bot's tool trace and emits operational signals: escalation flags, scheduling cues, and asset-request signals. Each signal requires a direct supplier quote for grounding.
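
The grounding rule ("each signal requires a direct supplier quote") can be sketched as a filter over the agent's output. The signal shape and field names here are assumptions for illustration:

```javascript
// Minimal sketch of the summary agent's grounding rule: drop any emitted
// signal whose supplierQuote does not appear verbatim in the supplier side
// of the transcript. Field names are assumptions.
function groundSignals(signals, transcript) {
  const supplierText = transcript
    .filter((m) => m.role === "supplier")
    .map((m) => m.text)
    .join("\n");
  return signals.filter(
    (s) => typeof s.supplierQuote === "string" && supplierText.includes(s.supplierQuote)
  );
}
```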

C. Runtime-Replay Benchmark

A new eval mode in which the bot generates its turns live while supplier messages are replayed from recorded fixtures. This replaced the older "envelope" mode, where the bot never actually ran, and brought benchmark scores much closer to real production behavior.
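
The replay loop can be sketched as below. `generateBotTurn` is injected so the loop stays testable; in the real runner it would call the pinned Gemini Pro backend with the tool schemas attached. The fixture and turn shapes are assumptions.

```javascript
// Sketch of runtime-replay: supplier turns come from a recorded fixture,
// but every bot turn is generated live and its tool calls are collected
// into a trace for grounding and scoring.
async function runtimeReplay(fixture, generateBotTurn) {
  const history = [];
  const toolTrace = [];
  for (const supplierMsg of fixture.supplierTurns) {
    history.push({ role: "supplier", text: supplierMsg });
    const turn = await generateBotTurn(history); // { reply, toolCalls }
    history.push({ role: "bot", text: turn.reply });
    toolTrace.push(...(turn.toolCalls ?? []));
  }
  return { history, toolTrace };
}
```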

D. Benchmark Hygiene Hardening

E. External Harness Exploration

Probed the feasibility of the Cursor agent, Pi CLI, Claude CLI, and Codex as alternative runtime harnesses for the conversation bot. Wired --harness vanilla|cursor|pi into the eval runner and ran smoke tests. Early results: Cursor and Pi are technically viable but 5–10× slower per turn than vanilla Gemini Pro API calls, and Cursor timed out on longer scheduler cases. The vanilla Gemini Pro path remains the practical choice for iteration speed.


III. Benchmark Progression

The score moved significantly this week, but the biggest lift came from structural fixes, not prompt rewrites.

Pre-hardening (broken):    69%
After structural fixes:    87.1%
Final (hardened pipeline): 90.7%
Key insight: the +21.7pp jump from 69% to 90.7% came from infrastructure hardening, not prompt rewrites. The biggest single fix: runtime-replay mode was building the prompt without the tool-calling instruction block, so the bot had tool schemas but no guidance on when to use them. Fixing this alone moved scores dramatically. Summary-agent phantom signals were eliminated by constraining timing-signal emission to explicit supplier scheduling cues.
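
The class of bug described above can be made structurally impossible by failing fast in the prompt builder. This is an illustrative sketch, not the actual prompt-assembly code; section names are assumptions.

```javascript
// Illustration of the fix for the missing tool-instruction block: the
// prompt builder refuses to assemble a system prompt without tool guidance,
// so the bot can never again see tool schemas with no usage instructions.
function buildSystemPrompt({ persona, goal, toolInstructions }) {
  if (!toolInstructions) {
    throw new Error("tool-instruction block missing from prompt assembly");
  }
  return [persona, goal, toolInstructions].join("\n\n");
}
```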

IV. Old-Core Regression Guard

Four cases from the original delivered V1 benchmark were added as regression controls. These represent behavior that was already strong and must not break when new media/escalation capabilities are added.

Control Case                 V1 Score   V2 Run Score   Delta
mes-02 artwork-request       14.0       14.0            0.0
mes-04 timing-wait           14.0       13.5           −0.5
mes-05 mixed-all             12.5       14.0           +1.5
mes-08 regression-baseline   13.5       13.5            0.0
No meaningful regression. The control slice scores 55/56 (98.2%). The new V2 capabilities did not break anything that already worked.

V. New Evaluation Dimensions

The rubric expanded from 9+1 to 14+1 dimensions to cover the new capabilities:

Dim   Name           What it tests                                              Status
E10   Image Read     Bot processes inbound supplier images/media                NEW
E11   Image Send     Bot manages outbound file-send decisions (logo, design)    NEW
E12   Escalation     Bot signals goal-blocking issues for Sourcy team / buyer   NEW
E13   Scheduler      Bot handles timing cues and follow-up decisions            NEW
E14   Continuation   Bot maintains goal pursuit despite events                  NEW

The V2 eval dataset contains 10 primary cases: 2 media-read, 2 media-send, 2 escalation, 2 scheduler, and 2 baseline — all grounded in real supplier transcript patterns.


VI. Methodology & Caveats

A critical self-assessment was conducted mid-week and documented in detail. The main findings:

What holds

  • V2 dataset grounded in real transcripts
  • Runtime-replay generates bot turns live with tool-calling instructions
  • Strict grounding validates tool calls against transcript
  • Old-core regression guard passes cleanly
  • Backend pinned with transient retry (no silent fallback)
  • Summary agent hardened against phantom signals
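
The "backend pinned with transient retry (no silent fallback)" property can be sketched as a thin wrapper. `callModel` is injected for testability; names and the retry budget are assumptions, though the pinned model id matches the one named later in this report.

```javascript
// Sketch of the pinned-backend policy: retry transient failures against the
// SAME pinned model, and surface any error instead of falling back to a
// different model behind the caller's back.
async function callPinnedBackend(callModel, request, { retries = 2 } = {}) {
  const PINNED_MODEL = "gemini-3.1-pro-preview"; // pinned, per this report
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await callModel(PINNED_MODEL, request);
    } catch (err) {
      lastError = err;
      if (!err.transient) break; // non-transient: fail now, no silent fallback
    }
  }
  throw lastError; // never swap in another model
}
```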

Known limitations

  • E1/E5 structurally low — fixture design, not prompt issue
  • One auto-response case missed (mes2-07, 79%)
  • Grounding averages ~75% on the bot trace; supplierQuote accuracy has room to improve
  • External harnesses too slow for iteration loop
  • No live production validation yet

VII. Architecture Summary

The pipeline is now a dual-agent system with clear separation of concerns:

Component          Role                                                  Runtime
Conversation Bot   Direct supplier interaction + structured tool calls   Gemini 3.1 Pro (pinned)
Summary Agent      Post-conversation signal extraction                   Gemini Pro (via llmCallWithTools)
Eval Judge         Scores against 14-dim rubric                          Gemini Pro (via API)
Eval Runner        Orchestrates replay + grounding + scoring             run-eval-v2.js
Harness selector wired: --harness vanilla|cursor|pi is now functional in the eval runner. Vanilla (direct API) remains the recommended path; external harnesses are available for future experimentation if latency improves.
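
The flag handling might look like the sketch below. Only the parsing is shown; the harness implementations behind each choice are out of scope, and the function name is an assumption.

```javascript
// Sketch of the --harness selector: accept only the three wired harnesses
// and default to vanilla (direct Gemini Pro API) when the flag is absent.
function parseHarness(argv) {
  const allowed = ["vanilla", "cursor", "pi"];
  const i = argv.indexOf("--harness");
  const choice = i === -1 ? "vanilla" : argv[i + 1];
  if (!allowed.includes(choice)) {
    throw new Error(`unknown harness: ${choice} (expected ${allowed.join("|")})`);
  }
  return choice;
}
```

Note that Codex, though probed, is deliberately absent from the allowed list, matching the wired `vanilla|cursor|pi` set.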

VIII. Next Steps

Priority   Item                                                    Owner
1          Fix auto-response recognition in prompt (mes2-07)       Eric
2          Live production validation on real 1688 conversations   Eric + Tek
3          Outbound attachment-send payload contract               Tek / Awsaf
4          Extend V2 fixtures for better E1/E5 coverage            Eric

Verdict

The tool-calling pipeline is hardened and delivering. Moving from 69% (broken) to 90.7% in a single session — entirely through infrastructure fixes — validates the eval-driven approach. Tool-calling dimensions (media read, media send, escalation, scheduler, continuation) are all at 95-100%. The remaining gap is structural: fixture design limits goal completion scoring, and one auto-response case needs prompt work. The foundation is solid for production wiring.


April 5 Update — Critical Assessment & Signal Validation

sendToBrain test passed: five signals were delivered correctly across three test cases (1 human_escalation and 4 schedule_follow_up). The brain signal delivery path is proven end-to-end from harness through dispatch.

Grounding defaults to strict mode: ungrounded tool calls are stripped before scoring, so the benchmark does not reward phantom tool traces.
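
Strict-mode stripping of the bot's tool trace can be sketched the same way as signal grounding: any call whose supplierQuote is not verbatim supplier text is dropped before the judge sees the run. The trace and args shapes are assumptions.

```javascript
// Sketch of strict-mode grounding over the bot's tool trace: keep only
// tool calls whose supplierQuote appears verbatim in supplier messages.
function stripUngroundedToolCalls(toolTrace, transcript) {
  const supplierText = transcript
    .filter((m) => m.role === "supplier")
    .map((m) => m.text)
    .join("\n");
  return toolTrace.filter((call) => {
    const quote = call.args && call.args.supplierQuote;
    return typeof quote === "string" && supplierText.includes(quote);
  });
}
```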

Integration spec published: contract and wiring notes for production handoff are in sourcy_supplier_bot_integration_spec.html.

Known gaps (April 5 critical assessment)

a. Real multimodal not wired [Product]: acknowledge_media accepts text tokens only; true inbound media is not exercised in the live loop.
b. Brain endpoint is dummy [Backend]: downstream receiveSignals() (or equivalent) must be implemented on the Sourcy side to consume production traffic.
c. chatServer format compatibility [Unverified]: message and attachment payloads from chatServer have not been validated against this pipeline in integration tests.
d. Preview model risk [Model]: gemini-3.1-pro-preview can change behavior without notice; pinning and regression runs remain essential.
e. Synthetic coverage holes [Dataset]: no dedicated synthetic cases yet for supplier silence or WeChat redirect flows.
New eval cases in flight: benchmark fixtures are being extended for supplier silence, auto-response handling, and WeChat redirect scenarios so that E1/E5-style gaps and routing edge cases are scored explicitly.
Operational note: signal validation proves delivery mechanics; closing gaps (a)–(e) is required before treating the stack as production-complete.