Supplier Bot — Expanded Goal Evaluation ✦ V2

From 6 Fixed Goals to 17-Type Dynamic Taxonomy
March 18, 2026  ·  Eric San, Activation Execution Lead  ·  Sourcy

1. Executive Summary

- Goal Types: 6 → 17 (tiered taxonomy)
- SRs Processed: 33/33 (100% success)
- Avg Goals/SR: 10.5 (6.0 T1 + 4.5 T2)
- Pilot Score: 7.5/9 (responsive suppliers)

The supplier bot's data collection framework has been expanded from 6 hardcoded goals to a 17-type tiered taxonomy, generated dynamically for each sourcing request (SR). Goal generation achieved a 100% processing rate across 33 SRs drawn from real client conversations. A pilot evaluation of 10 conversations spanning 5 products averaged 7.5/9 on responsive suppliers (Opus-judged; per-conversation range 5–9/9).

Cross-judge validation using Opus 4.6 as an independent judge alongside Kimi K2.5 revealed meaningful score divergence on E1 (goal completion rigor). The framework is structurally ready for integration; a V8 prompt is required for production deployment.


2. The Problem

The V7 bot prompt hardcoded 6 data collection points: MOQ, price, lead time, customization, packing, and sample. This was sufficient for generic product inquiries but blind to category-specific sourcing needs.

Analysis of 20 real client conversations containing 33 sourcing requests revealed that real SRs frequently require 8–12 data points — including certifications, material specifications, artwork requirements, and shipping terms. The bot systematically missed these, leading to incomplete supplier qualification and repeated manual follow-up.

The core gap: A bot asking only about MOQ and price for a food-grade packaging SR will never surface the FDA certification or material spec information the client actually needs to make a sourcing decision.

3. Goal Taxonomy — 17 Types

Goals are organized into three tiers based on universality. Tier 1 goals apply to every SR. Tier 2 goals are conditionally assigned based on product category and client requirements. Tier 3 is a catch-all for edge cases.

| Tier | Goal Type | Coverage | Description |
|------|-----------|----------|-------------|
| T1 | moq | 100% | Minimum order quantity & volume breaks |
| T1 | price | 100% | Unit pricing, currency, volume tiers |
| T1 | lead_time | 100% | Production lead time & delivery schedule |
| T1 | customization | 100% | Logo, branding, custom spec capability |
| T1 | packing | 100% | Packaging options & requirements |
| T1 | sample | 100% | Sample availability, cost, timeline |
| T2 | material_spec | 94% | Material composition, grade, thickness |
| T2 | color_finish | 73% | Color matching, surface finish, texture |
| T2 | artwork | 70% | Printing method, file format, plate costs |
| T2 | certification | 67% | FDA, CE, ISO, BSCI, or category-specific |
| T2 | shipping | 64% | FOB/CIF, freight estimate, port |
| T2 | size_variant | 33% | Size options, variant matrix |
| T2 | payment | 24% | Payment terms (T/T, L/C, deposit %) |
| T2 | verification | 9% | Factory audit, trade assurance status |
| T2 | stock | 9% | Ready-to-ship inventory availability |
| T2 | tooling | 3% | Mold/tooling costs, existing mold reuse |
| T3 | other | 3% | Catch-all for edge-case requirements |
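For reference, the tier lookup for this taxonomy can be sketched as a small Python helper. The list names and the `tier_of` function are illustrative only, not the actual pipeline schema:

```python
# Illustrative encoding of the 17-type taxonomy; names are assumptions,
# not the real pipeline's data structures.
TIER_1 = ["moq", "price", "lead_time", "customization", "packing", "sample"]
TIER_2 = ["material_spec", "color_finish", "artwork", "certification",
          "shipping", "size_variant", "payment", "verification", "stock", "tooling"]
TIER_3 = ["other"]

def tier_of(goal: str) -> int:
    """Return the tier (1, 2, or 3) a goal type belongs to."""
    if goal in TIER_1:
        return 1
    if goal in TIER_2:
        return 2
    if goal in TIER_3:
        return 3
    raise ValueError(f"unknown goal type: {goal}")
```

A lookup like this is what lets the eval apply tier-weighted scoring (Section 5) without hardcoding goals into the prompt.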

4. Goal Generation Results

- Cases: 33 (100% processed)
- Elapsed: 29 min (~53 s per SR)
- Avg Goals: 10.5 (6.0 T1 + 4.5 T2)
- Overloaded: 8/33 (flagged with 11+ goals)

Goal Type Frequency (across 33 SRs)

| Goal Type | SRs (of 33) |
|-----------|-------------|
| moq | 33 (100%) |
| price | 33 (100%) |
| lead_time | 33 (100%) |
| customization | 33 (100%) |
| packing | 33 (100%) |
| sample | 33 (100%) |
| material_spec | 31 (94%) |
| color_finish | 24 (73%) |
| artwork | 23 (70%) |
| certification | 22 (67%) |
| shipping | 21 (64%) |
| size_variant | 11 (33%) |
| payment | 8 (24%) |
| verification | 3 (9%) |
| stock | 3 (9%) |
| tooling | 1 (3%) |
| other | 1 (3%) |
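A minimal sketch of how these frequency, average, and overload statistics can be derived from per-SR goal sets. The function name, input shape (a list of goal-type lists, one per SR), and the 11-goal overload threshold are assumptions about the pipeline, not its real interface:

```python
from collections import Counter

def goal_stats(goal_sets: list[list[str]], overload_threshold: int = 11):
    """Compute per-goal-type frequency across SRs, the average goal count,
    and how many SRs exceed the overload threshold."""
    # Count each goal type at most once per SR.
    freq = Counter(g for goals in goal_sets for g in set(goals))
    avg = sum(len(goals) for goals in goal_sets) / len(goal_sets)
    overloaded = sum(1 for goals in goal_sets if len(goals) >= overload_threshold)
    return freq, avg, overloaded
```

Run over the 33 generated goal sets, a helper like this would reproduce the table above plus the 8/33 overloaded figure.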

5. Eval Framework Changes — Rubric V2

The primary change is to E1 (Goal Completion), which now scores against a variable goal count per SR rather than the fixed 6-point checklist.

| Change | V1 (Fixed) | V2 (Dynamic) |
|--------|------------|--------------|
| E1 goal count | 6 fixed goals | Variable (6–14 per SR) |
| E1 pass threshold | All 6 covered | All T1 + ≥75% T2 |
| E1 fail threshold | <4 covered | <80% T1 covered |
| E2–E9 | — | Unchanged from V1 |
| Judge models | Kimi K2.5 only | Kimi K2.5 + Opus 4.6 cross-validation |
Why Tier 1 weighting? Tier 1 goals are universal — missing MOQ or price on any SR is a hard failure. Tier 2 goals are conditional; a bot that covers all T1 and most T2 is performing well. This prevents the eval from penalizing the bot for missing a rarely-needed goal like tooling.
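The V2 E1 thresholds above can be sketched as a small decision function. The name `score_e1` and the three-way pass/partial/fail output are illustrative choices (the rubric only defines the pass and fail boundaries; "partial" is assumed for the middle band):

```python
def score_e1(assigned_t1: list[str], assigned_t2: list[str],
             covered: list[str]) -> str:
    """Apply rubric V2's E1 rule: pass = all T1 covered and >=75% of
    assigned T2 covered; fail = <80% of T1 covered; partial otherwise."""
    hit = set(covered)
    t1_ratio = sum(g in hit for g in assigned_t1) / len(assigned_t1)
    # An SR with no T2 goals trivially satisfies the T2 condition.
    t2_ratio = (sum(g in hit for g in assigned_t2) / len(assigned_t2)
                if assigned_t2 else 1.0)
    if t1_ratio == 1.0 and t2_ratio >= 0.75:
        return "pass"
    if t1_ratio < 0.80:
        return "fail"
    return "partial"
```

Note that under this rule missing two of six T1 goals (0.667 coverage) is an automatic fail regardless of T2 performance, which is exactly the Tier 1 weighting rationale.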

6. Pilot Results

Cross-Judge Comparison (3 Responsive Suppliers)

| Product | Kimi K2.5 | Opus 4.6 | Delta |
|---------|-----------|----------|-------|
| Play Gym | 6 / 9 | 9 / 9 | +3.0 |
| Oat Jars | 8.5 / 9 | 5 / 9 | −3.5 |
| Paper Cups | 7.5 / 9 | 8.5 / 9 | +1.0 |
Key divergence: Opus is stricter on E1 goal completion — it does not credit supplier-volunteered information that the bot didn't actively ask for. This is the correct interpretation per rubric V2. Kimi tends to be more lenient, crediting information surfaced by either party. The ±3.5 delta means single-model scores are unreliable at this sample size.
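One way to operationalize cross-judge validation is to average the two judges and flag large deltas for human review. The function and the 1.5-point review threshold below are an illustration, not a calibrated part of the rubric:

```python
def reconcile(kimi: float, opus: float, max_delta: float = 1.5) -> dict:
    """Average two judges' scores and flag divergent pairs for review.
    max_delta is an arbitrary illustrative threshold, not a tuned value."""
    delta = abs(kimi - opus)
    return {
        "mean": (kimi + opus) / 2,
        "delta": delta,
        "needs_review": delta > max_delta,
    }
```

Under this scheme both Play Gym (delta 3.0) and Oat Jars (delta 3.5) would be routed to a human, while Paper Cups (delta 1.0) would not.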

Per-Dimension Averages (10 Conversations, Kimi-Judged)

| E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 |
|----|----|----|----|----|----|----|----|----|
| 0.55 | 0.75 | 0.50 | 0.95 | 0.70 | 0.90 | 0.75 | 0.80 | 0.45 |

E4 (tone/politeness) and E6 (Chinese language) are strengths. E9 (wrap-up/summary) and E1 (goal completion) are the weakest dimensions — both are directly addressable via prompt tuning.


7. Coverage Matrix — 10 Representative SRs

Checkmarks indicate the goal was assigned to that SR. Colored by tier. Dashes indicate not applicable.

(Matrix of 16 goal types × 10 SRs: Play Gym, Oat Jars, Paper Cups, Phone Case, Baby Toys, Pet Bowl, Candle Jar, Tote Bag, Label Stk, Gift Box. The per-cell checkmarks were rendered graphically and are not reproduced in this text export; all six T1 goals apply to every SR, while T2 assignments vary by product.)

8. Limitations & Risk Flags

This section is critical reading. Do not skip it.
| Risk | Severity | Detail |
|------|----------|--------|
| Tiny sample size | HIGH | n=10 conversations is a pilot, not a benchmark; roughly 350+ samples are needed for a ±2% margin of error. Current results indicate direction, not production readiness. |
| Simulated suppliers | HIGH | All supplier behavior is LLM-simulated; no real 1688 supplier data has been tested. Real suppliers may be unresponsive, evasive, or off-topic in ways simulation doesn't capture. |
| Cross-judge variance | HIGH | Up to ±3.5 between Opus and Kimi on the same conversation. Single-model scores are unreliable without calibration or averaging. |
| Prompt contradiction | MED | The V7 prompt says "do not ask beyond 6 points" while the injected context block contains 10+ dynamic goals, giving the model conflicting instructions. |
| Overloaded goal sets | MED | 8 of 33 SRs were flagged with 11+ goals. A bot trying to collect 13 data points in one conversation may feel interrogative and exhaust supplier patience. |
| Inferred goals | MED | 30 of 33 goal sets include goals inferred beyond explicit conversation evidence; inference accuracy has not been validated by human review. |

9. Recommendations & Next Steps

  1. Write the V8 prompt: natively support a variable goal count and remove the V7 "max 6 points" contradiction. This is the #1 blocker for production.
  2. Increase the eval sample to 50+: 10 conversations is directional only; at least 50 are needed to make any claims about prompt quality.
  3. Human review of goal generation: 30/33 SRs contain inferred goals. Lokesh or Shen should validate a representative sample before the taxonomy is trusted at scale.
  4. Wire up chatServer integration: the dynamic goal injection mechanism must pass goals from the generation pipeline into the bot's context at conversation start.
  5. Get Nelson's baseline: run an A/B comparison between the current bot and the dynamic-goal bot on the same supplier pool. Without a baseline, improvement claims are ungrounded.
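As a sketch of step 4, goal injection could render the generated goal set into a context block prepended to the bot's prompt at conversation start. The function name and block format below are hypothetical, not the actual chatServer interface:

```python
def build_context_block(goals: list[dict]) -> str:
    """Render a per-SR goal set (dicts with 'tier', 'type', 'description')
    into a prompt context block. Format is illustrative only."""
    lines = ["## Data collection goals for this sourcing request:"]
    # List universal T1 goals first, then conditional T2/T3 goals.
    for g in sorted(goals, key=lambda g: g["tier"]):
        lines.append(f"- [T{g['tier']}] {g['type']}: {g['description']}")
    return "\n".join(lines)
```

Whatever the final format, the V8 prompt must treat this injected block as the authoritative goal list, which is why removing the V7 "max 6 points" instruction has to land in the same change.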

10. Artifacts

| Artifact | Location | Description |
|----------|----------|-------------|
| Goal taxonomy schema | workdir/supplier-bot/pipeline/ | 17-type goal type definitions with tier assignments |
| Goal generation output | workdir/supplier-bot/pipeline/output/ | 33 SR goal sets (JSON), per-case scoring |
| Prompt V7 (current prod) | workdir/supplier-bot/benchmark/prompts/v7-media.md | Production prompt, to be superseded by V8 |
| Eval rubric V2 | workdir/supplier-bot/benchmark/judge/eval-rubric-v1.md | 9 dimensions + dynamic E1 scoring |
| Pilot eval results | workdir/supplier-bot/benchmark/results/ | 10 conversation transcripts + Kimi/Opus judge scores |
| Conversation raw data | context/conversations_raw_all.md | 20 client conversations, 33 SRs with trigger points |
| This report | output/sourcy_supplier_bot_expanded_eval_v2.html | Expanded eval V2 summary (this document) |