This week the supplier bot moved from a text-only prompt pipeline to an eval-driven, tool-calling agent architecture. The bot now emits structured signals (data logs, blocker flags, media acknowledgments, file requests) alongside its visible chat messages, and a separate LLM-based summary agent extracts escalation, scheduling, and asset-request signals after each conversation. After a full hardening pass (fixing tool-instruction injection, enforcing the pinned backend, recovering from empty-text responses, and eliminating phantom signals), the pipeline reached 90.7% on a 14-dimension rubric, with all tool-calling dimensions (E10–E14) at 95–100%.
The conversation engine now supports tool calling: on each turn, the bot produces a visible supplier-facing reply plus structured tool calls. Four tools are available:
| Tool | Purpose |
|---|---|
log_data | Captures concrete sourcing answers (MOQ, price, lead time, packing, etc.) with supplier quotes |
note_blocker | Flags a goal-blocking issue while the bot continues the conversation |
note_file_request | Records when the supplier asks for logo, design, artwork, or spec files |
acknowledge_media | Logs when the supplier sends images/files and what they contain |
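The four tools above can be sketched as function declarations in the JSON-schema shape that Gemini-style function calling accepts. This is a minimal sketch: the parameter fields (detail, supplier_quote) are assumptions for illustration, not the production schema.

```javascript
// Illustrative declarations for the four bot tools. Every tool shares a
// supplier_quote parameter, matching the report's grounding requirement.
const quoted = (desc) => ({
  type: "object",
  properties: {
    detail: { type: "string", description: desc },
    supplier_quote: {
      type: "string",
      description: "Verbatim supplier text grounding this call",
    },
  },
  required: ["detail", "supplier_quote"],
});

const toolDeclarations = [
  {
    name: "log_data",
    description: "Capture a concrete sourcing answer (MOQ, price, lead time, packing, ...)",
    parameters: quoted("The field and value captured"),
  },
  {
    name: "note_blocker",
    description: "Flag a goal-blocking issue while the conversation continues",
    parameters: quoted("What is blocking the goal"),
  },
  {
    name: "note_file_request",
    description: "Record a supplier request for logo, design, artwork, or spec files",
    parameters: quoted("Which asset was requested"),
  },
  {
    name: "acknowledge_media",
    description: "Log inbound supplier images/files and what they contain",
    parameters: quoted("What the media shows"),
  },
];
```

Keeping supplier_quote required on every tool is what makes the strict-grounding strip described later mechanically checkable.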
A separate LLM agent reads the completed transcript + the bot's tool trace and emits operational signals: escalation flags, scheduling cues, and asset-request signals. Each signal requires a direct supplier quote for grounding.
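The "direct supplier quote" requirement can be enforced mechanically after the summary agent runs. A minimal sketch, assuming each signal carries a supplier_quote field and the transcript is available as plain text (both names are assumptions):

```javascript
// Keep only signals whose supplier_quote appears verbatim in the
// transcript; everything else is considered ungrounded and dropped.
function groundSignals(signals, transcript) {
  return signals.filter(
    (s) =>
      typeof s.supplier_quote === "string" &&
      s.supplier_quote.length > 0 &&
      transcript.includes(s.supplier_quote)
  );
}
```

A verbatim-substring check is deliberately strict: it rejects paraphrased quotes, trading some recall for zero tolerance of invented evidence.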
A new eval mode was introduced in which the bot generates its turns live while supplier messages are replayed from recorded fixtures. This replaces the older "envelope" mode, where the bot never actually ran, and brings benchmark scores much closer to real production behavior.
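The replay mode reduces to a simple loop: fixture supplier turns are injected in order, and the bot generates a fresh turn after each one. A sketch under assumed names (fixture shape, generateBotTurn; the real model call would be async, kept synchronous here for brevity):

```javascript
// Live-bot replay: supplier turns come from a recorded fixture, bot
// turns are generated live against the growing transcript.
function replayCase(fixture, generateBotTurn) {
  const transcript = [];
  for (const supplierMsg of fixture.supplierTurns) {
    transcript.push({ role: "supplier", text: supplierMsg });
    const bot = generateBotTurn(transcript); // → { text, toolCalls }
    transcript.push({ role: "bot", text: bot.text, toolCalls: bot.toolCalls });
  }
  return transcript;
}
```

Because the bot sees the full transcript so far, its tool calls and replies reflect real inference behavior rather than the pre-recorded "envelope" turns of the old mode.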
Cursor agent, Pi CLI, Claude CLI, and Codex were feasibility-probed as alternative runtime harnesses for the conversation bot: --harness vanilla|cursor|pi was wired into the eval runner and smoke tests were run. Early results: Cursor and Pi are technically viable but 5–10× slower per turn than vanilla Gemini Pro API calls, and Cursor timed out on longer scheduler cases. The vanilla Gemini Pro path remains the practical choice for iteration speed.
The score moved significantly this week, but the biggest lift came from structural fixes, not prompt rewrites.
Four cases from the originally delivered V1 benchmark were added as regression controls. These represent behavior that was already strong and must not break as new media/escalation capabilities are added.
| Control Case | V1 Score | V2 Run Score | Delta |
|---|---|---|---|
| mes-02 artwork-request | 14.0 | 14.0 | 0.0 |
| mes-04 timing-wait | 14.0 | 13.5 | −0.5 |
| mes-05 mixed-all | 12.5 | 14.0 | +1.5 |
| mes-08 regression-baseline | 13.5 | 13.5 | 0.0 |
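The control-case deltas above lend themselves to a mechanical regression gate in the eval runner. A sketch (the gate, its tolerance, and the field names are illustrative, not existing runner code):

```javascript
// Return the IDs of control cases that regressed beyond the allowed
// tolerance; an empty array means the run passes the regression gate.
function checkControls(controls, tolerance = 0.5) {
  return controls.filter((c) => c.v1 - c.v2 > tolerance).map((c) => c.id);
}

// The four V1 control cases from this week's run.
const controls = [
  { id: "mes-02", v1: 14.0, v2: 14.0 },
  { id: "mes-04", v1: 14.0, v2: 13.5 },
  { id: "mes-05", v1: 12.5, v2: 14.0 },
  { id: "mes-08", v1: 13.5, v2: 13.5 },
];
```

With a 0.5 tolerance, this run passes: mes-04's −0.5 sits exactly at the boundary and everything else held or improved.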
The rubric expanded from 9+1 to 14+1 dimensions to cover the new capabilities:
| Dim | Name | What it tests | Status |
|---|---|---|---|
| E10 | Image Read | Bot processes inbound supplier images/media | NEW |
| E11 | Image Send | Bot manages outbound file-send decisions (logo, design) | NEW |
| E12 | Escalation | Bot signals goal-blocking issues for Sourcy team / buyer | NEW |
| E13 | Scheduler | Bot handles timing cues and follow-up decisions | NEW |
| E14 | Continuation | Bot maintains goal pursuit despite events | NEW |
The V2 eval dataset contains 10 primary cases: 2 media-read, 2 media-send, 2 escalation, 2 scheduler, and 2 baseline — all grounded in real supplier transcript patterns.
A critical self-assessment was conducted mid-week and documented in detail. The main findings:
The pipeline is now a dual-agent system with clear separation of concerns:
| Component | Role | Runtime |
|---|---|---|
| Conversation Bot | Direct supplier interaction + structured tool calls | Gemini 3.1 Pro (pinned) |
| Summary Agent | Post-conversation signal extraction | Gemini Pro (via llmCallWithTools) |
| Eval Judge | Score against 14-dim rubric | Gemini Pro (via API) |
| Eval Runner | Orchestrates replay + grounding + scoring | run-eval-v2.js |
--harness vanilla|cursor|pi is now functional in the eval runner. Vanilla (direct API) remains the recommended path. External harnesses are available for future experimentation if latency improves.
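Flag handling for the harness switch can be sketched as follows; the function name and error text are illustrative, only the flag and its three allowed values come from the runner:

```javascript
// Parse --harness from argv; vanilla (direct API) is the default, and
// unknown values fail fast rather than silently falling back.
const HARNESSES = new Set(["vanilla", "cursor", "pi"]);

function parseHarness(argv) {
  const i = argv.indexOf("--harness");
  if (i === -1) return "vanilla";
  const value = argv[i + 1];
  if (!HARNESSES.has(value)) throw new Error(`unknown harness: ${value}`);
  return value;
}
```

Failing fast on unknown values matters here because Codex and Claude CLI were probed but not wired in; a typo should not quietly run the vanilla path while the operator believes another harness is active.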
| Priority | Item | Owner |
|---|---|---|
| 1 | Fix auto-response recognition in prompt (mes2-07) | Eric |
| 2 | Live production validation on real 1688 conversations | Eric + Tek |
| 3 | Outbound attachment-send payload contract | Tek / Awsaf |
| 4 | Extend V2 fixtures for better E1/E5 coverage | Eric |
The tool-calling pipeline is hardened and delivering. Moving from 69% (broken) to 90.7% in a single session — entirely through infrastructure fixes — validates the eval-driven approach. Tool-calling dimensions (media read, media send, escalation, scheduler, continuation) are all at 95-100%. The remaining gap is structural: fixture design limits goal completion scoring, and one auto-response case needs prompt work. The foundation is solid for production wiring.
The run emitted human_escalation and 4 schedule_follow_up signals. The brain signal delivery path is proven end-to-end from harness through dispatch.
Grounding default is strict mode. Ungrounded tool calls are stripped before scoring, so the benchmark does not reward phantom tool traces.
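The strict-mode strip can be sketched as a filter over the bot's tool trace that also reports how many calls were removed, so phantom traces stay visible in run logs. Field names (args, supplier_quote) are assumptions consistent with the grounding rule above:

```javascript
// Strict grounding: drop tool calls whose supplier_quote is not found
// verbatim in the transcript, and count what was stripped.
function stripUngrounded(toolCalls, transcript) {
  const kept = [];
  let stripped = 0;
  for (const call of toolCalls) {
    const quote = call.args && call.args.supplier_quote;
    if (quote && transcript.includes(quote)) kept.push(call);
    else stripped++;
  }
  return { kept, stripped };
}
```

Returning the stripped count (rather than silently filtering) is what lets the benchmark distinguish "bot made no tool calls" from "bot made ungrounded tool calls" when diagnosing a low score.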
| ID | Gap | Detail | Tag |
|---|---|---|---|
| a | Real multimodal not wired | acknowledge_media accepts text tokens only; true inbound media is not exercised in the live loop. | Product |
| b | Brain endpoint is dummy | Downstream receiveSignals() (or equivalent) must be implemented on the Sourcy side to consume production traffic. | Backend |
| c | chatServer format compatibility | Message and attachment payloads from chatServer have not been validated against this pipeline in integration tests. | Unverified |
| d | Preview model risk | gemini-3.1-pro-preview can change behavior without notice; pinning and regression runs remain essential. | Model |
| e | Synthetic coverage holes | No dedicated synthetic cases yet for supplier silence or WeChat redirect flows. | Dataset |