Eric built an eval rubric for his v7 prototype — a plain-text WhatsApp bot where expertise lived entirely in the message text.4 Eugene extended it to V3 — a tool-aware, channel-aware framework with a Trust Score track designed for the production web application.5 Neither was designed for the other's context. This review applies both to the same 8 real production sessions from raw_chat_user.json.1
The goal is not to crown one eval. It's to identify what each catches, what each misses, and where they agree — because agreement on failure is the strongest signal we have.
All claims in this report are grounded in the following materials:
| # | Source | Description |
|---|---|---|
| 1 | raw_chat_user.json | 468 sessions from activation bot, Feb 25 – Mar 13 |
| 2 | activation_bot_analysis.docx | Eugene's PostHog cross-referenced analysis |
| 3 | activation_bot_analysis_v2.docx | Eugene's plain English analysis |
| 4 | ASSESSMENT_RUBRIC.md | Eric's original D1-D5 rubric, used for v7 run |
| 5 | EVAL_FRAMEWORK.md (V3) | Eugene's channel-aware, path-aware, Trust Score framework |
| 6 | GPT_JUDGE_PROMPT.md (V3) | V3 judge scoring prompt |
| 7 | GPT_JUDGE_PROMPT.md (V2) | Eric's golden transcript eval V2 (tool-aware adaptation) |
Of 468 total sessions, 190 (41%) were button-click-only bounces — zero product input. The remaining 278 engaged users form the analysis denominator.1
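The denominator filter is mechanical; a minimal sketch, assuming each record in raw_chat_user.json carries a list of typed user events (the field names here are illustrative, not the real schema):

```python
import json

def is_bounce(session: dict) -> bool:
    """Button-click-only bounce: no free-text product input from the user."""
    # "events" / "type" / "free_text" are assumed field names for illustration.
    return not any(e["type"] == "free_text" for e in session["events"])

with open("raw_chat_user.json") as f:
    sessions = json.load(f)

engaged = [s for s in sessions if not is_bounce(s)]
print(f"{len(sessions)} total, {len(engaged)} engaged")  # expected: 468 total, 278 engaged
```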
| # | ID | Product | Stage | Msgs | Selection Rationale |
|---|---|---|---|---|---|
| 1 | f645035c | Custom Leather Backpack | Complete | 22 | Happy path, has SR. Also V3 real_chat case V3-0015 |
| 2 | 98ced3ab | Wallet + Multitool | 4 | 34 | Highest-engagement S4 dropout. Passive close pattern |
| 3 | 32a76c73 | Coffee Bags + Glass Bottles | 2 | 40 | Longest S2. Image upload, multi-product, form-stuck |
| 4 | 962896d6 | Custom Logo Tumbler | 2 | 10 | Typical quick mid-funnel exit |
| 5 | b0dcf410 | Ceramic Perfume Bottle | 1 | 22 | Deep engagement despite S1 classification. Best pricing turn |
| 6 | 2db27f98 | 2K Computer Monitor | 3 | 28 | Feasibility loop — qty changed 7 times |
| 7 | 4d340dad | Market Advice (Chinese) | 1 | 16 | "What should I source" dead-end. Full Chinese session |
| 8 | fa0892e5 | Blind Box | 1 | 8 | Product misunderstanding, no recovery |
PostHog patterns were identified from 10–11 watched session recordings.8
This is important context: the agent's tool pipeline is working correctly. The failures Eugene observed are display-layer bugs. This means the V3 eval's TQ scores (which measure tool execution quality) may be more accurate than the raw user experience suggests.5,9
| Session | Product | Stage | Eric | V3 | Delta | Key Divergence |
|---|---|---|---|---|---|---|
| f645035c | Leather Backpack | Complete | 4.45 | 6.7 | +2.25 | D2: "pick options above" = 0 (Eric) vs 2 (V3) |
| 98ced3ab | Wallet | 4 | 3.41 | 5.3 | +1.89 | Same D2 gap + TQ credit for tool execution |
| 32a76c73 | Coffee Bags | 2 | 4.05 | 5.3 | +1.25 | V3 credits tool outputs; Eric penalizes deferred value |
| 962896d6 | Tumbler | 2 | 4.60 | 6.5 | +1.90 | V3 tool-aware D2 lifts every turn |
| b0dcf410 | Perfume | 1 | 4.18 | 6.5 | +2.32 | V3 credits tool-delivered pricing + DDP data in TR |
| 2db27f98 | Monitor | 3 | 4.36 | 6.6 | +2.24 | V3 credits feasibility data; Eric sees repetitive text |
| 4d340dad | Chinese | 1 | 3.63 | 3.1 | −0.53 | V3 penalizes harder: TQ=2.0 drags composite |
| fa0892e5 | Blind Box | 1 | 3.75 | 4.1 | +0.35 | Both catch product misunderstanding |
Mean Eric: 4.05 | Mean V3: 5.50 | Mean Delta: +1.45 — Neither eval produced a passing score (≥7.0).
| Session | CQ | TQ | TR | CS | Composite |
|---|---|---|---|---|---|
| f645035c | 6.2 | 7.5 | 5.0 | 8.3 | 6.7 |
| 98ced3ab | 5.4 | 6.0 | 3.8 | 5.7 | 5.3 |
| 32a76c73 | 5.5 | 5.5 | 3.8 | 5.0 | 5.3 |
| 962896d6 | 6.8 | 7.0 | 5.0 | 7.1 | 6.5 |
| b0dcf410 | 6.6 | 7.0 | 6.3 | 7.1 | 6.5 |
| 2db27f98 | 6.1 | 6.0 | 5.0 | 7.1 | 6.0 |
| 4d340dad | 3.8 | 2.0 | 2.5 | 4.3 | 3.1 |
| fa0892e5 | 5.3 | 3.0 | 2.5 | 5.7 | 4.1 |
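For reference, a minimal sketch of how a composite like this can be computed. Several rows above land within rounding of an unweighted mean of the four sub-tracks, but not all, so the equal weights here are an assumption rather than the V3 framework's actual definition:

```python
WEIGHTS = {"CQ": 0.25, "TQ": 0.25, "TR": 0.25, "CS": 0.25}  # assumed, not V3's spec
PASS_THRESHOLD = 7.0

def composite(scores: dict[str, float]) -> float:
    """Weighted mean of the four V3 sub-track scores."""
    return sum(WEIGHTS[track] * scores[track] for track in WEIGHTS)

f645035c = {"CQ": 6.2, "TQ": 7.5, "TR": 5.0, "CS": 8.3}
c = composite(f645035c)
print(f"composite={c:.2f}, passing={c >= PASS_THRESHOLD}")  # composite=6.75, passing=False
```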
Across the 89 assistant turns in the 8 sessions, the bot delivered specific, category-relevant value in its message text in just 2 turns (2.2%).4
| Turn | Message (truncated) | D1 | D2 | D3 | D4 | D5 | Score |
|---|---|---|---|---|---|---|---|
| 1 | "Hey! I'm your sourcing assistant..." | 2 | 0 | 0 | 2 | 1 | 5 |
| 2 | "I need a bit more to work with..." | 2 | 0 | 1 | 2 | 0 | 5 |
| 3 | "I need a bit more detail..." | 2 | 0 | 1 | 2 | 0 | 5 |
| 4 | "Got it! Pick your options above..." | 2 | 0 | 0 | 2 | 0 | 4 |
| 5 | "Building your offering now... Leather..." | 2 | 1 | 0 | 2 | 1 | 6 |
| 6 | "Pick the concept you like best..." | 2 | 0 | 0 | 2 | 1 | 5 |
| 7 | "Estimated pricing above..." | 2 | 0 | 0 | 2 | 0 | 4 |
| 8 | "Great choice! Fill in details..." | 2 | 0 | 1 | 2 | 0 | 5 |
| 9 | "Feasibility looks good..." | 2 | 0 | 0 | 2 | 1 | 5 |
| 10 | "Let's finalize..." | 2 | 0 | 0 | 2 | 0 | 4 |
| 11 | "Your SR is confirmed..." | 2 | 0 | 0 | 2 | 1 | 5 |
Eric avg: 4.45. V3 composite: 6.7. The gap is entirely in D2 — V3 credits "pick options above" as tool-delivered value (score 2), Eric scores it as zero text value.4,5
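The per-turn mechanics implied by the table: five dimensions, each scored 0–2, summed to a turn score out of 10. A sketch of that shape (inferred from the rows above; ASSESSMENT_RUBRIC.md4 is the authoritative definition):

```python
from dataclasses import dataclass

@dataclass
class TurnScore:
    # Dimension semantics live in the rubric doc; only the 0-2 scale is shown here.
    d1: int
    d2: int
    d3: int
    d4: int
    d5: int

    def total(self) -> int:
        dims = (self.d1, self.d2, self.d3, self.d4, self.d5)
        assert all(0 <= d <= 2 for d in dims), "each dimension is scored 0-2"
        return sum(dims)

turn_4 = TurnScore(d1=2, d2=0, d3=0, d4=2, d5=0)  # "Got it! Pick your options above..."
print(turn_4.total())  # 4, matching the table row
```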
Mechanically clean — full flow from clarification through SR submit. But the text layer delivers zero sourcing expertise. No leather-specific insights, no price ranges, no material tradeoffs. The user got to SR through the UI pipeline, not through conversation quality.
The most instructive session. 34 messages, two products, trust questions, hostile input — and the bot lost the user at the finish line.
Key moment: the Stage 4 close was passive ("let me know"), and the user never returned. This is the finish-line pattern Eugene traced across 15 users.1,3
Longest S2 session at 40 messages. User uploaded product images and had two products in mind. The bot processed both but got stuck looping on form options — repeatedly presenting chip selectors the user had already answered. Multi-product coherence broke down when the bot treated each product as a separate conversation restart.1
10 messages, typical mid-funnel exit. The bot gathered product details efficiently but the user disengaged after seeing the concept options. Short session with clean mechanics — no errors, no hostility. The user simply wasn't ready to commit. Scores near sample mean (Eric 4.60, V3 6.5).1
The one session where the bot actually delivered pricing in text, at Turn 7.1
This proves the bot CAN surface specific pricing in text; it just doesn't by default. If every pricing turn surfaced DDP/FOB numbers the way this one did, mean Trust Score across the other seven sessions would improve from 3.9 to an estimated 5.5–6.0.
28 messages. User changed quantity 7 times across the conversation, each change triggering a new feasibility check. The bot handled every iteration without error (V3 TQ=6.0), but the text became repetitive — near-identical responses each round. Eric's rubric penalizes the repetition; V3 credits the correctly executing tool pipeline.1,4,5
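Neither framework scores repetition directly today. A near-duplicate check over consecutive bot turns would catch this pattern automatically; a minimal sketch using stdlib difflib (a suggested addition, not part of either rubric):

```python
from difflib import SequenceMatcher

def repetition_flags(bot_turns: list[str], threshold: float = 0.9) -> list[bool]:
    """True where a bot turn is near-identical to the previous one."""
    flags = [False]  # first turn has no predecessor to compare against
    for prev, cur in zip(bot_turns, bot_turns[1:]):
        flags.append(SequenceMatcher(None, prev, cur).ratio() >= threshold)
    return flags

turns = [
    "Feasibility looks good for 500 units...",
    "Feasibility looks good for 750 units...",
]
print(repetition_flags(turns))  # [False, True]
```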
Full Chinese session — 16 messages, user wants market recommendations. Bot responded in Chinese correctly (LANGUAGE_MATCH pass).1
Key failure at Turn 8 — user asked for certified manufacturers. Bot said "我无法直接提供具体厂家" (I can't directly provide specific manufacturers) and directed them to Alibaba, Global Sources, Made-in-China.1
Eugene identified 22 users in this "what should I source" pattern across the full 468 sessions.2,3
User said "customized blind box" (mystery collectible toys). Bot interpreted it as a packaging product (cardboard vs acrylic). User corrected: "I'm actually referring to sealed, mystery packages containing a random collectible." Bot acknowledged but repeated "please choose from the options above" — no re-clarification.1
Eugene flagged this exact pattern in his analysis: "When a user explains the product concept instead of just picking options, the bot needs to move on instead of asking again."3
This is the most consequential scoring disagreement between the two frameworks.
Eric's position: the bot's text should deliver value. "Pick your options above" is a navigation prompt, not expertise. If you read only the bot's messages (ignoring the UI), you learn nothing. A sourcing expert would say: "Leather backpacks run $12–30/unit. Full-grain ages beautifully but costs 2× more than top-grain. Which matters more to you?" That's D2=2.
Eugene's position (V3): the production bot is a web application. When it says "pick your options above," the user sees a chip selector with real options. The VALUE is delivered — through UI, not text. Scoring "pick options above" as D2=0 penalizes the bot for having a richer UX than a plain-text WhatsApp prototype.
"Pick options" response vs other responses: conversion rate 7.7% vs 7.8%. Statistically identical — the response type doesn't predict dropout.1
But: the Stage 4 passive close ("let me know") DOES correlate with dropout — 15 users lost at the finish line. That's a text-quality issue, not a tool issue.1
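The 7.7% vs 7.8% equivalence is easy to sanity-check with a two-proportion z-test. The report gives only the rates, so the per-group counts below are placeholders, not the real PostHog denominators:

```python
from math import sqrt

def two_prop_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z-statistic for x1/n1 vs x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Placeholder counts: 10/130 = 7.7% "pick options" vs 11/141 = 7.8% others.
z = two_prop_z(10, 130, 11, 141)
print(f"z = {z:.2f}")  # well inside +/-1.96: no detectable difference
```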
Ziya's V2 launch addresses the UI rendering bugs — tool results not displaying, SR card not showing. Once fixed, a meaningful portion of Stage 2–4 dropouts should resolve because users will actually SEE the data the agent is generating.
Both evals found 0/8 sessions passing. The production bot's tool pipeline works — completers follow a clean path averaging 19.5 messages over ~85 minutes.2 The agent calls tools correctly; the failures users see are UI rendering bugs being fixed in V2.9
The three highest-leverage improvements are: (1) active close-driving at Stage 4, (2) trust content in every pricing response, and (3) calibrating the eval on real production sessions instead of synthetic happy paths. These are achievable prompt-level changes, not architecture rewrites.