The supplier bot's data collection framework has been expanded from 6 hardcoded goals to a 17-type tiered taxonomy, with goals generated dynamically for each sourcing request (SR). Goal generation achieved a 100% processing rate across 33 SRs drawn from real client conversations. A pilot evaluation of 10 conversations (5 products), judged by Opus, returned an average score of 7.5/9 on responsive suppliers, with individual scores ranging from 5/9 to 9/9.
Cross-judge validation using Opus 4.6 as an independent judge alongside Kimi K2.5 revealed meaningful score divergence on E1 (goal completion rigor). The framework is structurally ready for integration; a V8 prompt is required for production deployment.
The V7 bot prompt hardcoded 6 data collection points: MOQ, price, lead time, customization, packing, and sample. This was sufficient for generic product inquiries but blind to category-specific sourcing needs.
Analysis of 20 real client conversations containing 33 sourcing requests revealed that real SRs frequently require 8–12 data points — including certifications, material specifications, artwork requirements, and shipping terms. The bot systematically missed these, leading to incomplete supplier qualification and repeated manual follow-up.
Goals are organized into three tiers based on universality. Tier 1 goals apply to every SR. Tier 2 goals are conditionally assigned based on product category and client requirements. Tier 3 is a catch-all for edge cases.
| Tier | Goal Type | Coverage (% of SRs) | Description |
|---|---|---|---|
| TIER 1 | moq | 100% | Minimum order quantity & volume breaks |
| TIER 1 | price | 100% | Unit pricing, currency, volume tiers |
| TIER 1 | lead_time | 100% | Production lead time & delivery schedule |
| TIER 1 | customization | 100% | Logo, branding, custom spec capability |
| TIER 1 | packing | 100% | Packaging options & requirements |
| TIER 1 | sample | 100% | Sample availability, cost, timeline |
| TIER 2 | material_spec | 94% | Material composition, grade, thickness |
| TIER 2 | color_finish | 73% | Color matching, surface finish, texture |
| TIER 2 | artwork | 70% | Printing method, file format, plate costs |
| TIER 2 | certification | 67% | FDA, CE, ISO, BSCI, or category-specific |
| TIER 2 | shipping | 64% | FOB/CIF, freight estimate, port |
| TIER 2 | size_variant | 33% | Size options, variant matrix |
| TIER 2 | payment | 24% | Payment terms (T/T, L/C, deposit %) |
| TIER 2 | verification | 9% | Factory audit, trade assurance status |
| TIER 2 | stock | 9% | Ready-to-ship inventory availability |
| TIER 2 | tooling | 3% | Mold/tooling costs, existing mold reuse |
| TIER 3 | other | 3% | Catch-all for edge-case requirements |
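
The tier rule above (every Tier 1 goal on every SR, Tier 2 goals added conditionally) can be sketched as a small lookup. This is a minimal illustration, not the pipeline's actual schema: the goal names and tiers are copied from the table, while `Tier`, `TAXONOMY`, and `build_goal_set` are hypothetical names introduced here.

```python
from enum import Enum


class Tier(Enum):
    T1 = 1  # applies to every SR
    T2 = 2  # assigned conditionally per product category / client requirements
    T3 = 3  # catch-all for edge cases


# Goal types and tiers as listed in the taxonomy table (17 entries).
TAXONOMY: dict[str, Tier] = {
    "moq": Tier.T1, "price": Tier.T1, "lead_time": Tier.T1,
    "customization": Tier.T1, "packing": Tier.T1, "sample": Tier.T1,
    "material_spec": Tier.T2, "color_finish": Tier.T2, "artwork": Tier.T2,
    "certification": Tier.T2, "shipping": Tier.T2, "size_variant": Tier.T2,
    "payment": Tier.T2, "verification": Tier.T2, "stock": Tier.T2,
    "tooling": Tier.T2,
    "other": Tier.T3,
}


def build_goal_set(conditional_goals: set[str]) -> list[str]:
    """Return all Tier 1 goals plus the requested conditional goals, in taxonomy order."""
    unknown = conditional_goals - set(TAXONOMY)
    if unknown:
        raise ValueError(f"unknown goal types: {sorted(unknown)}")
    return [
        goal for goal, tier in TAXONOMY.items()
        if tier is Tier.T1 or goal in conditional_goals
    ]


# Hypothetical SR that needs material and certification details.
goals = build_goal_set({"certification", "material_spec"})
print(len(goals), goals)
```

How the conditional goals are chosen for a given SR is outside this sketch; here they are simply passed in.
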
The primary change is to E1 (Goal Completion), which now scores against a variable goal count per SR rather than the fixed 6-point checklist.
| Change | V1 (Fixed) | V2 (Dynamic) |
|---|---|---|
| E1 goal count | 6 fixed goals | Variable (6–14 per SR) |
| E1 pass threshold | All 6 covered | All T1 + ≥75% T2 |
| E1 fail threshold | <4 covered | <80% T1 covered |
| E2–E9 | As defined in V1 | Unchanged from V1 |
| Judge models | Kimi K2.5 only | Kimi K2.5 + Opus 4.6 cross-validation |
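
The dynamic E1 thresholds in the table can be expressed as a short function. This is a sketch under stated assumptions: the table defines only the pass and fail conditions, so the middle band is labeled `"partial"` here as a hypothetical name, and an SR with no Tier 2 goals is assumed to satisfy the Tier 2 condition trivially.

```python
def e1_verdict(t1_covered: int, t1_total: int, t2_covered: int, t2_total: int) -> str:
    """Classify E1 (goal completion) for one SR using the V2 thresholds.

    pass:    all T1 goals covered AND at least 75% of T2 goals covered
    fail:    fewer than 80% of T1 goals covered
    partial: everything in between (assumed label, not from the rubric)
    """
    if t1_total <= 0:
        raise ValueError("every SR is expected to carry Tier 1 goals")
    t1_ratio = t1_covered / t1_total
    # Assumption: with no T2 goals assigned, the T2 condition is met by default.
    t2_ratio = t2_covered / t2_total if t2_total else 1.0

    if t1_ratio < 0.80:
        return "fail"
    if t1_covered == t1_total and t2_ratio >= 0.75:
        return "pass"
    return "partial"


# Example: 6 of 6 T1 goals and 3 of 4 T2 goals covered.
print(e1_verdict(6, 6, 3, 4))
```

With 6 Tier 1 goals, the fail condition (<80%) triggers at 4 or fewer covered, since 5 of 6 is about 83%.
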
| Product | Kimi K2.5 | Opus 4.6 | Delta (Opus − Kimi) |
|---|---|---|---|
| Play Gym | 6 / 9 | 9 / 9 | +3.0 |
| Oat Jars | 8.5 / 9 | 5 / 9 | −3.5 |
| Paper Cups | 7.5 / 9 | 8.5 / 9 | +1.0 |
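
The divergence in the table can be summarized with a few lines of arithmetic. The scores below are copied from the table; the function name `judge_divergence` and the simple two-judge average are illustrative assumptions, not the benchmark's calibration method.

```python
from statistics import mean

# (Kimi K2.5, Opus 4.6) scores out of 9, copied from the cross-judge table.
SCORES: dict[str, tuple[float, float]] = {
    "Play Gym": (6.0, 9.0),
    "Oat Jars": (8.5, 5.0),
    "Paper Cups": (7.5, 8.5),
}


def judge_divergence(scores: dict[str, tuple[float, float]]) -> dict:
    """Compute per-product deltas (Opus minus Kimi) and simple summary statistics."""
    deltas = {product: opus - kimi for product, (kimi, opus) in scores.items()}
    return {
        "deltas": deltas,
        "mean_abs_delta": mean(abs(d) for d in deltas.values()),
        "max_abs_delta": max(abs(d) for d in deltas.values()),
        # Averaging the two judges is one simple way to damp single-model variance.
        "averaged": {p: (kimi + opus) / 2 for p, (kimi, opus) in scores.items()},
    }


report = judge_divergence(SCORES)
print(report)
```

For the three products shown, the mean absolute delta is 2.5 points and the largest is 3.5 points, on a 9-point scale.
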
E4 (tone/politeness) and E6 (Chinese language) are strengths. E9 (wrap-up/summary) and E1 (goal completion) are the weakest dimensions — both are directly addressable via prompt tuning.
Checkmarks indicate the goal was assigned to that SR; dashes indicate the goal was not applicable. Row labels carry the tier prefix (T1 or T2).
| Goal | Play Gym | Oat Jars | Paper Cups | Phone Case | Baby Toys | Pet Bowl | Candle Jar | Tote Bag | Label Stk | Gift Box |
|---|---|---|---|---|---|---|---|---|---|---|
| T1 moq | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| T1 price | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| T1 lead_time | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| T1 custom. | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| T1 packing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| T1 sample | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| T2 material | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — |
| T2 color | ✓ | — | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | — | ✓ |
| T2 artwork | — | ✓ | ✓ | ✓ | — | — | ✓ | ✓ | ✓ | ✓ |
| T2 cert. | ✓ | ✓ | ✓ | — | ✓ | ✓ | — | — | ✓ | — |
| T2 shipping | ✓ | ✓ | — | — | ✓ | — | ✓ | ✓ | — | ✓ |
| T2 size | ✓ | — | — | ✓ | ✓ | ✓ | — | — | — | — |
| T2 payment | — | — | — | — | — | — | ✓ | ✓ | — | — |
| T2 verify | — | — | — | — | ✓ | — | — | — | — | — |
| T2 stock | — | — | — | — | — | — | — | — | ✓ | — |
| T2 tooling | — | — | — | — | — | — | — | — | — | — |
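
The matrix can be used to count goals per SR and flag heavy goal sets. The sketch below copies three columns from the matrix as sample data and uses 11 goals as the flag threshold; the names `T2_ASSIGNED`, `goal_count`, and `OVERLOAD_THRESHOLD` are hypothetical.

```python
T1_GOAL_COUNT = 6          # every SR carries all six Tier 1 goals
OVERLOAD_THRESHOLD = 11    # assumed flag level: goal sets with 11+ goals

# Tier 2 goals assigned per SR, copied from three columns of the matrix.
T2_ASSIGNED: dict[str, list[str]] = {
    "Play Gym": ["material", "color", "cert.", "shipping", "size"],
    "Baby Toys": ["material", "color", "cert.", "shipping", "size", "verify"],
    "Gift Box": ["color", "artwork", "shipping"],
}


def goal_count(t2_goals: list[str]) -> int:
    """Total goals for one SR: the fixed Tier 1 set plus its Tier 2 goals."""
    return T1_GOAL_COUNT + len(t2_goals)


counts = {sr: goal_count(t2) for sr, t2 in T2_ASSIGNED.items()}
overloaded = [sr for sr, n in counts.items() if n >= OVERLOAD_THRESHOLD]
print(counts, overloaded)
```

A check like this could run at goal-generation time so that long goal sets are prioritized or split before the bot starts the conversation.
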
| Risk | Severity | Detail |
|---|---|---|
| Tiny sample size | HIGH | n=10 conversations is a pilot, not a benchmark. At 95% confidence, roughly 350–400 samples are needed for a margin of error near ±5%; a ±2% margin would need about 2,400. Current results indicate direction, not production readiness. |
| Simulated suppliers | HIGH | All supplier behavior is LLM-simulated. No real 1688 supplier data has been tested. Real suppliers may be unresponsive, evasive, or off-topic in ways simulation doesn't capture. |
| Cross-judge variance | HIGH | Opus and Kimi scores differ by up to 3.5 points (out of 9) on the same conversation. Single-model scores are unreliable without calibration or averaging. |
| Prompt contradiction | MED | V7 prompt says "do not ask beyond 6 points" while the injected context block contains 10+ dynamic goals. This creates conflicting instructions for the model. |
| Overloaded goal sets | MED | 8 of 33 SRs flagged with 11+ goals. A bot trying to collect 13 data points in one conversation may feel interrogative or exhaust supplier patience. |
| Inferred goals | MED | 30 of 33 goal sets include goals inferred beyond explicit conversation evidence. Accuracy of inference has not been validated by human review. |
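
The sample-size risk can be made concrete with the standard normal-approximation formula for a proportion. This is a rough sketch that assumes the metric of interest is a pass/fail rate, uses the worst case p = 0.5, and takes z = 1.96 for 95% confidence; the 9-point scores themselves would need a different treatment.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the confidence interval for a proportion estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)


def required_n(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Smallest sample size whose margin of error is at most `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


for n in (10, 350):
    print(f"n={n}: margin of error ~ +/-{margin_of_error(n):.1%}")
print("samples needed for +/-5%:", required_n(0.05))
```

Under these assumptions, n=10 gives a margin of roughly ±31 percentage points, and about 385 samples are needed for ±5 points.
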
| Artifact | Location | Description |
|---|---|---|
| Goal taxonomy schema | workdir/supplier-bot/pipeline/ | 17-type goal type definitions with tier assignments |
| Goal generation output | workdir/supplier-bot/pipeline/output/ | 33 SR goal sets (JSON), per-case scoring |
| Prompt V7 (current prod) | workdir/supplier-bot/benchmark/prompts/v7-media.md | Production prompt — to be superseded by V8 |
| Eval rubric V2 | workdir/supplier-bot/benchmark/judge/eval-rubric-v1.md | 9 dimensions + dynamic E1 scoring |
| Pilot eval results | workdir/supplier-bot/benchmark/results/ | 10 conversation transcripts + Kimi/Opus judge scores |
| Conversation raw data | context/conversations_raw_all.md | 20 client conversations, 33 SRs with trigger points |
| This report | output/sourcy_supplier_bot_expanded_eval_v2.html | Expanded eval V2 summary (this document) |