Activation Bot — Dual Eval Review

8 production sessions scored with Eric's original rubric and Eugene's V3 framework
16 March 2026 · Sourcy Internal
Sessions Analyzed: 468 (Feb 25 – Mar 13, 2026)
Sample Scored: 8 (representative subset)
Eric's Rubric: 4.05 mean score (0/8 pass)
V3 Eval: 5.50 mean score (0/8 pass)

I. Background & Method

Two Evals, One Dataset

Eric built an eval rubric for his v7 prototype — a plain-text WhatsApp bot where expertise lived entirely in the message text [4]. Eugene extended it to V3 — a tool-aware, channel-aware framework with a Trust Score track designed for the production web application [5]. Neither was designed for the other's context. This review applies both to the same 8 real production sessions from raw_chat_user.json [1].

The goal is not to crown one eval. It's to identify what each catches, what each misses, and where they agree — because agreement on failure is the strongest signal we have.

Data Sources

All claims in this report are grounded in the following materials:

| # | Source | Description |
|---|--------|-------------|
| 1 | raw_chat_user.json | 468 sessions from activation bot, Feb 25 – Mar 13 |
| 2 | activation_bot_analysis.docx | Eugene's PostHog cross-referenced analysis |
| 3 | activation_bot_analysis_v2.docx | Eugene's plain English analysis |
| 4 | ASSESSMENT_RUBRIC.md | Eric's original D1–D5 rubric, used for the v7 run |
| 5 | EVAL_FRAMEWORK.md (V3) | Eugene's channel-aware, path-aware, Trust Score framework |
| 6 | GPT_JUDGE_PROMPT.md (V3) | V3 judge scoring prompt |
| 7 | GPT_JUDGE_PROMPT.md (V2) | Eric's golden transcript eval V2 (tool-aware adaptation) |

The Production Funnel

Of 468 total sessions, 190 (41%) were button-click-only bounces — zero product input. The remaining 278 engaged users form the analysis denominator [1].

Entered: 278 (100%)
Past S1: 194 (70%)
Past S2: 65 (23%)
Past S3: 44 (16%)
Completed: 29 (10.4%)
Denominator note: the 190 button-click-only bounces (41% of 468) are landing page / CTA design issues, not bot conversation quality. All analysis below uses the 278-engaged denominator [1].
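The funnel above reduces to a few lines of counting. A minimal sketch, assuming each session record exposes a `furthest_stage` field — a hypothetical name; the actual schema of raw_chat_user.json may differ:

```python
# Minimal funnel computation over the engaged denominator.
# ASSUMPTION: each session dict carries `furthest_stage` in
# {"bounce", "S1", "S2", "S3", "S4", "complete"} -- a hypothetical
# schema; raw_chat_user.json's real field names may differ.
import json

STAGE_ORDER = ["S1", "S2", "S3", "S4", "complete"]

def funnel(sessions: list[dict]) -> dict[str, tuple[int, float]]:
    # Drop button-click-only bounces: landing-page issues, not
    # conversation quality, so they leave the denominator.
    engaged = [s for s in sessions if s["furthest_stage"] != "bounce"]
    n = len(engaged)
    rates = {"entered": (n, 100.0)}
    for i, stage in enumerate(STAGE_ORDER[:-1]):
        # "Past Sk" = sessions whose furthest stage is beyond Sk.
        past = sum(1 for s in engaged
                   if STAGE_ORDER.index(s["furthest_stage"]) > i)
        rates[f"past_{stage}"] = (past, round(100 * past / n, 1))
    done = sum(1 for s in engaged if s["furthest_stage"] == "complete")
    rates["completed"] = (done, round(100 * done / n, 1))
    return rates

if __name__ == "__main__":
    with open("raw_chat_user.json") as f:
        print(funnel(json.load(f)))
```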

The 8 Sessions

| # | ID | Product | Stage | Msgs | Selection Rationale |
|---|----|---------|-------|------|---------------------|
| 1 | f645035c | Custom Leather Backpack | Complete | 22 | Happy path, has SR. Also V3 real_chat case V3-0015 |
| 2 | 98ced3ab | Wallet + Multitool | 4 | 34 | Highest-engagement S4 dropout. Passive close pattern |
| 3 | 32a76c73 | Coffee Bags + Glass Bottles | 2 | 40 | Longest S2. Image upload, multi-product, form-stuck |
| 4 | 962896d6 | Custom Logo Tumbler | 2 | 10 | Typical quick mid-funnel exit |
| 5 | b0dcf410 | Ceramic Perfume Bottle | 1 | 22 | Deep engagement despite S1 classification. Best pricing turn |
| 6 | 2db27f98 | 2K Computer Monitor | 3 | 28 | Feasibility loop — qty changed 7 times |
| 7 | 4d340dad | Market Advice (Chinese) | 1 | 16 | "What should I source" dead-end. Full Chinese session |
| 8 | fa0892e5 | Blind Box | 1 | 8 | Product misunderstanding, no recovery |
Representativeness — Population: S1=30%, S2=46%, S3=8%, S4=5%, Complete=10%. The sample overweights the tails deliberately — the middle of the funnel is where both evals agree; the edges reveal divergence.
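That tail-overweighting amounts to fixed per-stratum quotas rather than proportional allocation. A sketch of the selection logic, with quotas that mirror this sample's stage mix — the actual 8 sessions were hand-picked for rationale, not drawn randomly, so this is a reconstruction:

```python
# Stratified sampling with deliberate tail overweighting: fixed quotas
# per funnel stage instead of proportional allocation. The quotas below
# mirror this sample's stage mix (S1=3, S2=2, S3=1, S4=1, Complete=1);
# the real 8 sessions were hand-picked, so this is illustrative only.
import random

QUOTAS = {"S1": 3, "S2": 2, "S3": 1, "S4": 1, "complete": 1}

def sample_sessions(sessions: list[dict], quotas=QUOTAS, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    picked = []
    for stage, k in quotas.items():
        stratum = [s for s in sessions if s["furthest_stage"] == stage]
        picked += rng.sample(stratum, min(k, len(stratum)))
    return picked
```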

II. Team Observations

Eugene's Assessment (Mar 16) [8]

What's good:
  1. The majority of chats follow the activation bot flow and principles — the eval even passes them
  2. The language the bot uses follows the prompt
What's bad:
  1. The bot only calls the tool; there's no message summarizing the result
  2. The customer wants to submit the SR (finalize), but the SRCard doesn't show

PostHog patterns (10–11 session recordings watched) [8]:

  1. ~50% of users don't continue after the first chat message
  2. Invested users get stuck at pricing/visual steps — tool-call failures
  3. Drop-off after 3–4 pricing/visual iterations
  4. RequestCard not showing on finalize

Ziya's Clarification [9]

UI rendering bugs, not agent failures — Ziya El Arief (engineering): Issues #2 and #4 above are UI rendering bugs, not agent failures. The agent calls tools correctly; the results don't display due to a frontend logic issue. Same root cause. Being fixed in the V2 launch.

This is important context: the agent's tool pipeline is working correctly. The failures Eugene observed are display-layer bugs. This means the V3 eval's TQ scores (which measure tool execution quality) may be more accurate than the raw user experience suggests [5,9].

III. Side-by-Side Scores

The Core Comparison

| Session | Product | Stage | Eric | V3 | Delta | Key Divergence |
|---------|---------|-------|------|----|-------|----------------|
| f645035c | Leather Backpack | Complete | 4.45 | 6.7 | +2.25 | D2: "pick options above" = 0 (Eric) vs 2 (V3) |
| 98ced3ab | Wallet | 4 | 3.41 | 5.3 | +1.89 | Same D2 gap + TQ credit for tool execution |
| 32a76c73 | Coffee Bags | 2 | 4.05 | 5.3 | +1.25 | V3 credits tool outputs; Eric penalizes deferred value |
| 962896d6 | Tumbler | 2 | 4.60 | 6.5 | +1.90 | V3 tool-aware D2 lifts every turn |
| b0dcf410 | Perfume | 1 | 4.18 | 6.5 | +2.32 | V3 credits tool-delivered pricing + DDP data in TR |
| 2db27f98 | Monitor | 3 | 4.36 | 6.6 | +2.24 | V3 credits feasibility data; Eric sees repetitive text |
| 4d340dad | Chinese | 1 | 3.63 | 3.1 | −0.53 | V3 penalizes harder: TQ=2.0 drags composite |
| fa0892e5 | Blind Box | 1 | 3.75 | 4.1 | +0.35 | Both catch product misunderstanding |

Mean Eric: 4.05 | Mean V3: 5.50 | Mean Delta: +1.45 — Neither eval produced a passing score (≥7.0).
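The headline numbers recompute directly from the table. A quick check in Python, with values copied from above — the small discrepancies against the reported figures come down to rounding order:

```python
# Recomputing the headline numbers from the comparison table above.
eric = {"f645035c": 4.45, "98ced3ab": 3.41, "32a76c73": 4.05,
        "962896d6": 4.60, "b0dcf410": 4.18, "2db27f98": 4.36,
        "4d340dad": 3.63, "fa0892e5": 3.75}
v3 = {"f645035c": 6.7, "98ced3ab": 5.3, "32a76c73": 5.3,
      "962896d6": 6.5, "b0dcf410": 6.5, "2db27f98": 6.6,
      "4d340dad": 3.1, "fa0892e5": 4.1}

PASS_THRESHOLD = 7.0  # neither eval passed any session at >= 7.0

def mean(xs):
    return sum(xs) / len(xs)

deltas = [v3[s] - eric[s] for s in eric]
print(round(mean(eric.values()), 2))  # 4.05
print(round(mean(v3.values()), 2))    # 5.51 (reported: 5.50)
print(round(mean(deltas), 2))         # 1.46 (reported: +1.45)
print(sum(x >= PASS_THRESHOLD for x in v3.values()))  # 0 passing
```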

V3 Track Breakdown

| Session | CQ | TQ | TR | CS | Composite |
|---------|----|----|----|----|-----------|
| f645035c | 6.2 | 7.5 | 5.0 | 8.3 | 6.7 |
| 98ced3ab | 5.4 | 6.0 | 3.8 | 5.7 | 5.3 |
| 32a76c73 | 5.5 | 5.5 | 3.8 | 5.0 | 5.3 |
| 962896d6 | 6.8 | 7.0 | 5.0 | 7.1 | 6.5 |
| b0dcf410 | 6.6 | 7.0 | 6.3 | 7.1 | 6.5 |
| 2db27f98 | 6.1 | 6.0 | 5.0 | 7.1 | 6.0 |
| 4d340dad | 3.8 | 2.0 | 2.5 | 4.3 | 3.1 |
| fa0892e5 | 5.3 | 3.0 | 2.5 | 5.7 | 4.1 |
Trust Score (TR) is the systemic weak point — mean TR: 3.9/10. QC/samples mentioned in 1/8 sessions. DDP transparency in 1/8. Sourcy vs DIY comparison in 1/8 [5,6].
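EVAL_FRAMEWORK.md's exact track weights aren't reproduced in this report; an unweighted mean recovers most of the published composites to within about 0.1, so the sketch below uses equal weights as an explicit assumption:

```python
# Composite from the four V3 tracks. ASSUMPTION: equal track weights --
# EVAL_FRAMEWORK.md may weight them differently. An unweighted mean
# reproduces most published composites to ~0.1 (e.g. 962896d6 -> 6.5)
# but not all (f645035c comes out 6.8 vs the published 6.7).
WEIGHTS = {"CQ": 0.25, "TQ": 0.25, "TR": 0.25, "CS": 0.25}

def composite(tracks: dict[str, float]) -> float:
    return round(sum(WEIGHTS[t] * v for t, v in tracks.items()), 1)

print(composite({"CQ": 6.8, "TQ": 7.0, "TR": 5.0, "CS": 7.1}))  # 6.5
```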

Eric's D2 Distribution

Across 89 assistant turns in 8 sessions [4]:

D2 = 0 (no value): 70 (78.7%)
D2 = 1 (generic): 17 (19.1%)
D2 = 2 (specific): 2 (2.2%)

The bot delivers specific, category-relevant value in its text in 2.2% of turns [4].
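The distribution is a straight tally over scored turns. A sketch, assuming each scored turn is a dict carrying its D2 value — a hypothetical shape for the rubric's per-turn output:

```python
# Straight tally of Eric's D2 over assistant turns. ASSUMPTION: each
# scored turn is a dict with a "D2" key in {0, 1, 2}.
from collections import Counter

def d2_distribution(turns: list[dict]) -> dict[int, tuple[int, float]]:
    counts = Counter(t["D2"] for t in turns)
    n = len(turns)
    return {k: (counts[k], round(100 * counts[k] / n, 1)) for k in (0, 1, 2)}

# On the 89 turns scored here: {0: (70, 78.7), 1: (17, 19.1), 2: (2, 2.2)}
```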

IV. Session Deep Dives

Session 1: f645035c — Leather Backpack (COMPLETE)

| Turn | Message (truncated) | D1 | D2 | D3 | D4 | D5 | Score |
|------|---------------------|----|----|----|----|----|-------|
| 1 | "Hey! I'm your sourcing assistant..." | 2 | 0 | 0 | 2 | 1 | 5 |
| 2 | "I need a bit more to work with..." | 2 | 0 | 1 | 2 | 0 | 5 |
| 3 | "I need a bit more detail..." | 2 | 0 | 1 | 2 | 0 | 5 |
| 4 | "Got it! Pick your options above..." | 2 | 0 | 0 | 2 | 0 | 4 |
| 5 | "Building your offering now... Leather..." | 2 | 1 | 0 | 2 | 1 | 6 |
| 6 | "Pick the concept you like best..." | 2 | 0 | 0 | 2 | 1 | 5 |
| 7 | "Estimated pricing above..." | 2 | 0 | 0 | 2 | 0 | 4 |
| 8 | "Great choice! Fill in details..." | 2 | 0 | 1 | 2 | 0 | 5 |
| 9 | "Feasibility looks good..." | 2 | 0 | 0 | 2 | 1 | 5 |
| 10 | "Let's finalize..." | 2 | 0 | 0 | 2 | 0 | 4 |
| 11 | "Your SR is confirmed..." | 2 | 0 | 0 | 2 | 1 | 5 |

Eric avg: 4.45. V3 composite: 6.7. The gap is entirely in D2 — V3 credits "pick options above" as tool-delivered value (score 2); Eric scores it as zero text value [4,5].

Mechanically clean — full flow from clarification through SR submit. But the text layer delivers zero sourcing expertise. No leather-specific insights, no price ranges, no material tradeoffs. The user got to SR through the UI pipeline, not through conversation quality.
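In the table, each turn's score is the plain sum of its five dimension scores — every row checks out against this. A sketch; note the reported session average of 4.45 is not the flat mean of the 11 sums shown (~4.8), so the rubric presumably weights or aggregates turns differently, which is an assumption, not something the report confirms:

```python
# Per-turn score as the sum of the five rubric dimensions, consistent
# with every row in the table above. The session-level 4.45 is not the
# flat mean of these 11 sums (~4.8), so ASSESSMENT_RUBRIC.md presumably
# aggregates differently -- an assumption, not confirmed by the report.
DIMS = ("D1", "D2", "D3", "D4", "D5")

def turn_score(turn: dict) -> int:
    return sum(turn[d] for d in DIMS)

# Turn 1 from the table:
print(turn_score({"D1": 2, "D2": 0, "D3": 0, "D4": 2, "D5": 1}))  # -> 5
```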


Session 2: 98ced3ab — Wallet (S4 DROP)

The most instructive session. 34 messages, two products, trust questions, hostile input — and the bot lost the user at the finish line.

Key moment:

Passive close pattern — 15 Stage 4 users reached finalization with zero contact details captured. The bot's last message was passive ("let me know") in every case [1,2].

Session 3: 32a76c73 — Coffee Bags + Glass Bottles (S2 STALL)

Longest S2 session at 40 messages. User uploaded product images and had two products in mind. The bot processed both but got stuck looping on form options — repeatedly presenting chip selectors the user had already answered. Multi-product coherence broke down when the bot treated each product as a separate conversation restart [1].


Session 4: 962896d6 — Custom Logo Tumbler (QUICK EXIT)

10 messages, a typical mid-funnel exit. The bot gathered product details efficiently but the user disengaged after seeing the concept options. Short session with clean mechanics — no errors, no hostility. The user simply wasn't ready to commit. Scores sit above the sample means (Eric 4.60 vs 4.05; V3 6.5 vs 5.5) [1].


Session 5: b0dcf410 — Perfume (BEST VALUE)

The one session where the bot actually delivered pricing in text. Turn 7 [1]:

D2=2 under both evals — "Sourcy DDP Range: $2.62 – $5.14 per unit | Alibaba FOB Range: $1.50 – $4.00 per unit", plus 3 Alibaba listing links. This is the only session with DDP transparency (TR-03 pass under V3) [5].

This proves the bot CAN surface specific pricing in text — it just doesn't by default. If every pricing turn surfaced DDP/FOB numbers like this one did, Trust Score would improve from 3.9 to an estimated 5.5–6.0 across the board.


Session 6: 2db27f98 — Monitor (FEASIBILITY LOOP)

28 messages. The user changed quantity 7 times across the conversation, each change triggering a new feasibility check. The bot handled each iteration without error (V3 TQ=6.0), but the text became repetitive — nearly identical responses each time. Eric's rubric penalizes the repetition; V3 credits the correctly executing tool pipeline [1,4,5].
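Neither eval flags this stall pattern today, but it is cheap to detect. A heuristic sketch using stdlib difflib; the 0.9 similarity threshold is an assumption, not a calibrated value:

```python
# Heuristic stall detector: flag consecutive bot turns that are
# near-duplicates, the pattern this feasibility loop exhibits.
# ASSUMPTION: a 0.9 similarity threshold; not calibrated on real data.
from difflib import SequenceMatcher

def repetitive_turns(bot_messages: list[str], threshold: float = 0.9) -> list[int]:
    flagged = []
    for i in range(1, len(bot_messages)):
        sim = SequenceMatcher(None, bot_messages[i - 1], bot_messages[i]).ratio()
        if sim >= threshold:  # nearly identical wording to the prior turn
            flagged.append(i)
    return flagged
```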


Session 7: 4d340dad — Chinese Exploration (DEAD END)

Full Chinese session — 16 messages, user wants market recommendations. Bot responded in Chinese correctly (LANGUAGE_MATCH pass) [1].

Key failure at Turn 8 — user asked for certified manufacturers. Bot said "我无法直接提供具体厂家" ("I can't directly provide specific manufacturers") and directed them to Alibaba, Global Sources, and Made-in-China [1].

Competitor referral anti-pattern — A sourcing platform's bot told a user to go search Alibaba themselves. The bot should route to Sourcy's sourcing team, not to competitors [1].

Eugene identified 22 users in this "what should I source" pattern across the full 468 sessions [2,3].


Session 8: fa0892e5 — Blind Box (MISIDENTIFIED)

The user said "customized blind box" (mystery collectible toys). The bot interpreted it as a packaging product (cardboard vs acrylic). The user corrected: "I'm actually referring to sealed, mystery packages containing a random collectible." The bot acknowledged but repeated "please choose from the options above" — no re-clarification [1].

Eugene flagged this exact pattern in his analysis: "When a user explains the product concept instead of just picking options, the bot needs to move on instead of asking again." [3]

V. What Each Eval Catches

Eric's Rubric [4]

Catches

  • Empty text turns: 78.7% of turns score D2=0 [4]
  • Format violations: 5/8 sessions fail FORMAT_OK [4]
  • Missing expertise: only 2/89 turns deliver specific value [4]
  • Restricted product miss: perfume not flagged [4]

Misses

  • No tool quality measurement — can't assess pipeline
  • No trust tracking across conversation
  • Can't distinguish good vs bad tool outputs
  • Penalizes tool-driven UX that users may actually prefer

V3 Framework [5,6]

Catches

  • Trust deficit: mean TR 3.9/10 [5]
  • Tool quality: blind box TQ=3.0, exploration TQ=2.0 [5]
  • Adaptive clarification failures named explicitly [5]
  • Channel/path awareness for WhatsApp vs Web [5]

Misses

  • Can't distinguish a "pricing above" turn backed by real data from one backed by vague data
  • No close-driving quality dimension
  • No conversation stall detection
  • 36/36 synthetic pass vs 0/8 real pass — calibration gap [5]
Neither eval catches these — Close-driving quality, conversation stall detection, multi-product coherence, and production dropout prediction are not covered by either framework. These represent the gap between eval scores and real-world SR conversion.

VI. The D2 Question

This is the most consequential scoring disagreement between the two frameworks.

Eric's Position (D2 Strict) [4]

The bot's text should deliver value. "Pick your options above" is a navigation prompt, not expertise. If you read only the bot's messages (ignoring UI), you learn nothing. A sourcing expert would say: "Leather backpacks run $12–30/unit. Full-grain ages beautifully but costs 2× more than top-grain. Which matters more to you?" That's D2=2.

V3's Position (D2 Tool-Aware)5,6

The production bot is a web application. When it says "pick your options above," the user sees a chip selector with real options. The VALUE is delivered — through UI, not text. Scoring "pick options above" as D2=0 penalizes the bot for having a richer UX than a plain-text WhatsApp prototype.

What the Data Says

"Pick options" response vs other responses: conversion rate 7.7% vs 7.8%. Statistically identical — the response type doesn't predict dropout.1

But: the Stage 4 passive close ("let me know") DOES correlate with dropout — 15 users lost at the finish line. That's a text-quality issue, not a tool issue [1].

Our view — V3's tool-aware D2 is reasonable for evaluating the production bot. But it creates a blind spot: it can't distinguish between high-quality and low-quality tool outputs from the transcript alone. The text layer is the only thing both evals can score — and it's currently doing very little work.

VII. Implications

What's Already Being Fixed (V2 Launch) [9]

Ziya's V2 launch addresses the UI rendering bugs — tool results not displaying, SR card not showing. Once fixed, a meaningful portion of Stage 2–4 dropouts should resolve because users will actually SEE the data the agent is generating.

What Needs Prompt-Level Changes

  1. Close-driving at Stage 4. The bot should recap and ask for contact details when the user is ready, not say "let me know." This is a prompt change — no architecture needed [1]. (A lint-style check for this and for item 3 is sketched after this list.)
  2. Trust content. QC/samples, DDP transparency, Sourcy vs DIY advantages. Session b0dcf410 proves the bot can do it. Make it the default, not the exception [1,5].
  3. Competitor referral removal. The bot should never suggest Alibaba/Global Sources as alternatives. Route to Sourcy's sourcing team instead [1].
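A minimal sketch of that check — the phrase lists are illustrative starting points drawn from the sessions above, not a complete taxonomy:

```python
# Prompt-regression lint for the two text-level anti-patterns above:
# passive closes at finalization and competitor referrals. The phrase
# lists are illustrative starting points, not a complete taxonomy.
import re

PASSIVE_CLOSES = [r"let me know", r"feel free to", r"whenever you'?re ready"]
COMPETITORS = [r"alibaba", r"global sources", r"made-in-china"]

def lint_final_bot_message(message: str) -> list[str]:
    issues = []
    text = message.lower()
    if any(re.search(p, text) for p in PASSIVE_CLOSES):
        issues.append("passive close: recap and ask for contact details instead")
    if any(re.search(p, text) for p in COMPETITORS):
        issues.append("competitor referral: route to Sourcy's sourcing team")
    return issues

print(lint_final_bot_message("Let me know if you'd like to proceed!"))
```

Run against the final bot turn of every Stage 4 session, a check like this should have flagged all 15 passive closes noted above, since each ended with "let me know".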

What Needs Eval Framework Evolution

  1. Add a "Close Quality" dimension for Stage 4+ sessions
  2. Calibrate V3 on real production data (currently 36/36 pass on synthetic, 0/8 on real) [5]
  3. Expand ADAPTIVE_CLARIFICATION to cover correction-recovery
  4. Consider a catalog-browse tool path for users without specific products (22 users in this pattern) [2,3]

Verdict

Both evals found 0/8 sessions passing. The production bot's tool pipeline works — completers follow a clean path averaging 19.5 messages over ~85 minutes [2]. The agent calls tools correctly; the failures users see are UI rendering bugs being fixed in V2 [9].

The three highest-leverage improvements are: (1) active close-driving at Stage 4, (2) trust content in every pricing response, and (3) calibrating the eval on real production sessions instead of synthetic happy paths. These are achievable prompt-level changes, not architecture rewrites.

References

[1] raw_chat_user.json — 468 activation bot sessions, Feb 25 – Mar 13, 2026. Sourcy internal.
[2] activation_bot_analysis.docx — Eugene Clarance. "Deep Dive Chat Analysis, Cross-referenced with PostHog Funnel." Mar 13, 2026.
[3] activation_bot_analysis_v2.docx — Eugene Clarance. "Plain English Analysis." Mar 13, 2026.
[4] ASSESSMENT_RUBRIC.md — Eric San. Original D1-D5 rubric for v7 prototype. tests/. Feb 2026.
[5] EVAL_FRAMEWORK.md — Eugene Clarance. V3 Channel-Aware, Path-Aware Scoring. V3.0.0. Mar 13, 2026.
[6] GPT_JUDGE_PROMPT.md — Eugene Clarance. V3 Judge Prompt. Mar 13, 2026.
[7] GPT_JUDGE_PROMPT.md — Eric San. Golden Transcript Eval V2. eval-tests/golden-eugene-v1/. Feb 2026.
[8] WhatsApp group chat — Eugene Clarance observations. Mar 16, 2026.
[9] WhatsApp group chat — Ziya El Arief clarification on UI bugs. Mar 16, 2026.
[10] run_summary_v7.md — Eric San. v7 test run: 8/8 pass, avg 8.8/10. Feb 16, 2026.
[11] cases.seed.json — Eugene Clarance. V3 golden dataset: 36 cases, 5 real_chat. Mar 2026.
[12] EVAL_V2_SUMMARY.md — Eugene Clarance. V2 results: 94% pass, CQ 7.76, TQ 9.10. Mar 2026.