Supplier Bot — Expanded Goal Evaluation ✦ V2

From 6 Fixed Goals to 17-Type Dynamic Taxonomy
March 18, 2026  ·  Eric San, Activation Execution Lead  ·  Sourcy

1. Executive Summary

- Goal Types: 6 → 17 (tiered taxonomy)
- SRs Processed: 33/33 (100% success)
- Avg Goals/SR: 10.5 (6.0 T1 + 4.5 T2)
- Pilot Score: 7.5/9 (responsive suppliers)

The supplier bot's data collection framework has been expanded from 6 hardcoded goals to a 17-type tiered taxonomy, generated dynamically for each sourcing request (SR). Goal generation achieved a 100% processing rate across 33 SRs drawn from real client conversations. A pilot evaluation of 10 conversations spanning 5 products averaged 7.5/9 on responsive suppliers (Opus-judged; per-conversation range 5–9/9).

Cross-judge validation using Opus 4.6 as an independent judge alongside Kimi K2.5 revealed meaningful score divergence on E1 (goal completion rigor). The framework is structurally ready for integration; a V8 prompt is required for production deployment.


2. The Problem

The V7 bot prompt hardcoded 6 data collection points: MOQ, price, lead time, customization, packing, and sample. This was sufficient for generic product inquiries but blind to category-specific sourcing needs.

Analysis of 20 real client conversations containing 33 sourcing requests revealed that real SRs frequently require 8–12 data points — including certifications, material specifications, artwork requirements, and shipping terms. The bot systematically missed these, leading to incomplete supplier qualification and repeated manual follow-up.

The core gap: A bot asking only about MOQ and price for a food-grade packaging SR will never surface the FDA certification or material spec information the client actually needs to make a sourcing decision.

3. Goal Taxonomy — 17 Types

Goals are organized into three tiers based on universality. Tier 1 goals apply to every SR. Tier 2 goals are conditionally assigned based on product category and client requirements. Tier 3 is a catch-all for edge cases.

| Tier | Goal Type | Coverage | Description |
|------|-----------|----------|-------------|
| T1 | moq | 100% | Minimum order quantity & volume breaks |
| T1 | price | 100% | Unit pricing, currency, volume tiers |
| T1 | lead_time | 100% | Production lead time & delivery schedule |
| T1 | customization | 100% | Logo, branding, custom spec capability |
| T1 | packing | 100% | Packaging options & requirements |
| T1 | sample | 100% | Sample availability, cost, timeline |
| T2 | material_spec | 94% | Material composition, grade, thickness |
| T2 | color_finish | 73% | Color matching, surface finish, texture |
| T2 | artwork | 70% | Printing method, file format, plate costs |
| T2 | certification | 67% | FDA, CE, ISO, BSCI, or category-specific |
| T2 | shipping | 64% | FOB/CIF, freight estimate, port |
| T2 | size_variant | 33% | Size options, variant matrix |
| T2 | payment | 24% | Payment terms (T/T, L/C, deposit %) |
| T2 | verification | 9% | Factory audit, trade assurance status |
| T2 | stock | 9% | Ready-to-ship inventory availability |
| T2 | tooling | 3% | Mold/tooling costs, existing mold reuse |
| T3 | other | 3% | Catch-all for edge-case requirements |
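For reference, the tier lookup for this taxonomy can be sketched as a small Python helper. The list names and the `tier_of` function are illustrative only, not the actual pipeline schema:

```python
# Illustrative encoding of the 17-type taxonomy; names are assumptions,
# not the real pipeline's data structures.
TIER_1 = ["moq", "price", "lead_time", "customization", "packing", "sample"]
TIER_2 = ["material_spec", "color_finish", "artwork", "certification",
          "shipping", "size_variant", "payment", "verification", "stock", "tooling"]
TIER_3 = ["other"]

def tier_of(goal: str) -> int:
    """Return the tier (1, 2, or 3) a goal type belongs to."""
    if goal in TIER_1:
        return 1
    if goal in TIER_2:
        return 2
    if goal in TIER_3:
        return 3
    raise ValueError(f"unknown goal type: {goal}")
```

A lookup like this is what lets the eval apply tier-weighted scoring (Section 5) without hardcoding goals into the prompt.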

4. Goal Generation Results

- Cases: 33 (100% processed)
- Elapsed: 29 min (~53 s per SR)
- Avg Goals: 10.5 (6.0 T1 + 4.5 T2)
- Overloaded: 8/33 (flagged with 11+ goals)

Goal Type Frequency (across 33 SRs)

| Goal Type | SRs (of 33) |
|-----------|-------------|
| moq | 33 (100%) |
| price | 33 (100%) |
| lead_time | 33 (100%) |
| customization | 33 (100%) |
| packing | 33 (100%) |
| sample | 33 (100%) |
| material_spec | 31 (94%) |
| color_finish | 24 (73%) |
| artwork | 23 (70%) |
| certification | 22 (67%) |
| shipping | 21 (64%) |
| size_variant | 11 (33%) |
| payment | 8 (24%) |
| verification | 3 (9%) |
| stock | 3 (9%) |
| tooling | 1 (3%) |
| other | 1 (3%) |
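A minimal sketch of how these frequency, average, and overload statistics can be derived from per-SR goal sets. The function name, input shape (a list of goal-type lists, one per SR), and the 11-goal overload threshold are assumptions about the pipeline, not its real interface:

```python
from collections import Counter

def goal_stats(goal_sets: list[list[str]], overload_threshold: int = 11):
    """Compute per-goal-type frequency across SRs, the average goal count,
    and how many SRs exceed the overload threshold."""
    # Count each goal type at most once per SR.
    freq = Counter(g for goals in goal_sets for g in set(goals))
    avg = sum(len(goals) for goals in goal_sets) / len(goal_sets)
    overloaded = sum(1 for goals in goal_sets if len(goals) >= overload_threshold)
    return freq, avg, overloaded
```

Run over the 33 generated goal sets, a helper like this would reproduce the table above plus the 8/33 overloaded figure.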

5. Eval Framework Changes — Rubric V2

The primary change is to E1 (Goal Completion), which now scores against a variable goal count per SR rather than the fixed 6-point checklist.

| Change | V1 (Fixed) | V2 (Dynamic) |
|--------|------------|--------------|
| E1 goal count | 6 fixed goals | Variable (6–14 per SR) |
| E1 pass threshold | All 6 covered | All T1 + ≥75% T2 |
| E1 fail threshold | <4 covered | <80% T1 covered |
| E2–E9 | — | Unchanged from V1 |
| Judge models | Kimi K2.5 only | Kimi K2.5 + Opus 4.6 cross-validation |
Why Tier 1 weighting? Tier 1 goals are universal — missing MOQ or price on any SR is a hard failure. Tier 2 goals are conditional; a bot that covers all T1 and most T2 is performing well. This prevents the eval from penalizing the bot for missing a rarely-needed goal like tooling.
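The V2 E1 thresholds above can be sketched as a small decision function. The name `score_e1` and the three-way pass/partial/fail output are illustrative choices (the rubric only defines the pass and fail boundaries; "partial" is assumed for the middle band):

```python
def score_e1(assigned_t1: list[str], assigned_t2: list[str],
             covered: list[str]) -> str:
    """Apply rubric V2's E1 rule: pass = all T1 covered and >=75% of
    assigned T2 covered; fail = <80% of T1 covered; partial otherwise."""
    hit = set(covered)
    t1_ratio = sum(g in hit for g in assigned_t1) / len(assigned_t1)
    # An SR with no T2 goals trivially satisfies the T2 condition.
    t2_ratio = (sum(g in hit for g in assigned_t2) / len(assigned_t2)
                if assigned_t2 else 1.0)
    if t1_ratio == 1.0 and t2_ratio >= 0.75:
        return "pass"
    if t1_ratio < 0.80:
        return "fail"
    return "partial"
```

Note that under this rule missing two of six T1 goals (0.667 coverage) is an automatic fail regardless of T2 performance, which is exactly the Tier 1 weighting rationale.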

6. Pilot Results

Cross-Judge Comparison (3 Responsive Suppliers)

| Product | Kimi K2.5 | Opus 4.6 | Delta |
|---------|-----------|----------|-------|
| Play Gym | 6 / 9 | 9 / 9 | +3.0 |
| Oat Jars | 8.5 / 9 | 5 / 9 | −3.5 |
| Paper Cups | 7.5 / 9 | 8.5 / 9 | +1.0 |
Key divergence: Opus is stricter on E1 goal completion — it does not credit supplier-volunteered information that the bot didn't actively ask for. This is the correct interpretation per rubric V2. Kimi tends to be more lenient, crediting information surfaced by either party. The ±3.5 delta means single-model scores are unreliable at this sample size.
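One way to operationalize cross-judge validation is to average the two judges and flag large deltas for human review. The function and the 1.5-point review threshold below are an illustration, not a calibrated part of the rubric:

```python
def reconcile(kimi: float, opus: float, max_delta: float = 1.5) -> dict:
    """Average two judges' scores and flag divergent pairs for review.
    max_delta is an arbitrary illustrative threshold, not a tuned value."""
    delta = abs(kimi - opus)
    return {
        "mean": (kimi + opus) / 2,
        "delta": delta,
        "needs_review": delta > max_delta,
    }
```

Under this scheme both Play Gym (delta 3.0) and Oat Jars (delta 3.5) would be routed to a human, while Paper Cups (delta 1.0) would not.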

Per-Dimension Averages (10 Conversations, Kimi-Judged)

| E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 |
|----|----|----|----|----|----|----|----|----|
| 0.55 | 0.75 | 0.50 | 0.95 | 0.70 | 0.90 | 0.75 | 0.80 | 0.45 |

E4 (tone/politeness) and E6 (Chinese language) are strengths. E9 (wrap-up/summary) and E1 (goal completion) are the weakest dimensions — both are directly addressable via prompt tuning.


7. Coverage Matrix — 10 Representative SRs

Checkmarks indicate the goal was assigned to that SR. Colored by tier. Dashes indicate not applicable.

(Matrix of 16 goal types × 10 SRs: Play Gym, Oat Jars, Paper Cups, Phone Case, Baby Toys, Pet Bowl, Candle Jar, Tote Bag, Label Stk, Gift Box. The per-cell checkmarks were rendered graphically and are not reproduced in this text export; all six T1 goals apply to every SR, while T2 assignments vary by product.)

8. Limitations & Risk Flags

This section is critical reading. Do not skip it.
| Risk | Severity | Detail |
|------|----------|--------|
| Tiny sample size | HIGH | n=10 conversations is a pilot, not a benchmark; roughly 350+ samples are needed for a ±2% margin of error. Current results indicate direction, not production readiness. |
| Simulated suppliers | HIGH | All supplier behavior is LLM-simulated; no real 1688 supplier data has been tested. Real suppliers may be unresponsive, evasive, or off-topic in ways simulation doesn't capture. |
| Cross-judge variance | HIGH | Up to ±3.5 between Opus and Kimi on the same conversation. Single-model scores are unreliable without calibration or averaging. |
| Prompt contradiction | MED | The V7 prompt says "do not ask beyond 6 points" while the injected context block contains 10+ dynamic goals, giving the model conflicting instructions. |
| Overloaded goal sets | MED | 8 of 33 SRs were flagged with 11+ goals. A bot trying to collect 13 data points in one conversation may feel interrogative and exhaust supplier patience. |
| Inferred goals | MED | 30 of 33 goal sets include goals inferred beyond explicit conversation evidence; inference accuracy has not been validated by human review. |

9. Recommendations & Next Steps

  1. Write the V8 prompt: natively support a variable goal count and remove the V7 "max 6 points" contradiction. This is the #1 blocker for production.
  2. Increase the eval sample to 50+: 10 conversations is directional only; at least 50 are needed to make any claims about prompt quality.
  3. Human review of goal generation: 30/33 SRs contain inferred goals. Lokesh or Shen should validate a representative sample before the taxonomy is trusted at scale.
  4. Wire up chatServer integration: the dynamic goal injection mechanism must pass goals from the generation pipeline into the bot's context at conversation start.
  5. Get Nelson's baseline: run an A/B comparison between the current bot and the dynamic-goal bot on the same supplier pool. Without a baseline, improvement claims are ungrounded.
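As a sketch of step 4, goal injection could render the generated goal set into a context block prepended to the bot's prompt at conversation start. The function name and block format below are hypothetical, not the actual chatServer interface:

```python
def build_context_block(goals: list[dict]) -> str:
    """Render a per-SR goal set (dicts with 'tier', 'type', 'description')
    into a prompt context block. Format is illustrative only."""
    lines = ["## Data collection goals for this sourcing request:"]
    # List universal T1 goals first, then conditional T2/T3 goals.
    for g in sorted(goals, key=lambda g: g["tier"]):
        lines.append(f"- [T{g['tier']}] {g['type']}: {g['description']}")
    return "\n".join(lines)
```

Whatever the final format, the V8 prompt must treat this injected block as the authoritative goal list, which is why removing the V7 "max 6 points" instruction has to land in the same change.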

10. Artifacts

| Artifact | Location | Description |
|----------|----------|-------------|
| Goal taxonomy schema | workdir/supplier-bot/pipeline/ | 17-type goal type definitions with tier assignments |
| Goal generation output | workdir/supplier-bot/pipeline/output/ | 33 SR goal sets (JSON), per-case scoring |
| Prompt V7 (current prod) | workdir/supplier-bot/benchmark/prompts/v7-media.md | Production prompt, to be superseded by V8 |
| Eval rubric V2 | workdir/supplier-bot/benchmark/judge/eval-rubric-v1.md | 9 dimensions + dynamic E1 scoring |
| Pilot eval results | workdir/supplier-bot/benchmark/results/ | 10 conversation transcripts + Kimi/Opus judge scores |
| Conversation raw data | context/conversations_raw_all.md | 20 client conversations, 33 SRs with trigger points |
| This report | output/sourcy_supplier_bot_expanded_eval_v2.html | Expanded eval V2 summary (this document) |