Graphic designer turned agent engineer

Builds agentic systems. Evaluates them honestly.

Cost-aware routing, adversarial verification, deterministic scoring, honest nulls. Four artifacts built on a shared kernel, each result reproducible offline.

The thesis

Built AND evaluated. Not vibed.

Most agent demos route to one model, skip evals, and report the best run. Every artifact here uses deterministic scoring (no LLM judge in the success path), adversarial multi-agent verification, cost-gated reproducible runs, and honest null results when the data says null. The four artifacts below share a common kernel, proving the substrate on multiple problem classes.

Shared kernel (Quorum core/)

  • Cost-aware 4-tier model routing (DeepSeek, Haiku, Sonnet, Opus)
  • Adversarial K-skeptic verification per finding
  • Full tracing, fan-out concurrency cap
  • Deterministic eval (span-IoU, exact-match, McNemar)
  • 58 + 78 + 47 tests, CI green across all three projects

Four artifacts on a shared kernel.

Flagship artifact

Quorum

Task-aware agent orchestrator: cost routing + adversarial verification + trace UI

Honest result

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps. Held-out real target: 3/3 genuine bugs found, 0 surviving false positives. Cost-routing claim is operator-gated on an Anthropic key. Reported honestly as "harness committed, live multi-tier number gated."

27.8% to 0.0%False positive reduction
(95% CI [11.1, 50.0] to [0, 0])
3/3Held-out real bugs found,
0 surviving false positives
~$0.25Total cost per run,
concurrency cap 8
58 testsruff + mypy + CI green,
make eval-dry reproduces offline

Quorum fans out a finder agent per source file, then routes each finding to K skeptic agents that independently try to disprove it. Only findings that survive all K challenges ship. The result: a sharp precision improvement at a moderate recall trade-off, on a benchmark that includes adversarial prompt- injection traps to catch sycophantic verification.

The trace UI is a real deployed product at quorum.thomaspeng.ca, browsable below. Model routing is structured as a cost-first 4-tier ladder: DeepSeek for extraction, Haiku for mechanical checks, Sonnet for judgment, Opus only when convergence requires it.

StackPython, DeepSeek-v4-pro, Anthropic SDK
Eval36-snippet labeled set, McNemar, held-out target
ConcurrencyFan-out cap 8, K=3 skeptics per finding
Reproducemake eval-dry (offline, no API key)

Artifact 2

Aegis

Adaptive red-team gauntlet: attacker vs. layered defenses, deterministic scoring

Honest result

A reasoning model is significantly more robust (injection ASR 49.3% vs 68.1%, p=0.0012; canary 10.4% vs 21.5%, p=0.010; overall p=0.0002). But the full defense stack erases the gap completely (1.7% vs 2.8%, p=0.40, not significant). Adaptation lift 24.0% to 29.9% became significant only after scaling the benchmark: McNemar b=17/c=0, p approx 0. Was a null at small n. Framed honestly: scaling is the legitimate power lever.

49.3% vs 68.1%Injection ASR: reasoning vs standard
(p=0.0012)
1.7% vs 2.8%With full defense stack
(p=0.40, gap erased)
-25%Defense reduction 29.2% to 4.2%,
input-classifier workhorse
78 testsCI + GitHub Pages green

Aegis pits an adaptive attacker agent against two harmless proxies: canary-string extraction and prompt-injection sentinel. Scoring is deterministic (exact-match, no LLM judge). The study measures whether reasoning models resist attacks better, and whether layered defenses close any gap.

The adaptation lift result (24% to 29.9%) was a null at small sample size, then became significant after scaling. The write-up frames this explicitly: "scaling is the legitimate power lever, not p-hacking." Vendors Quorum core/.

ProxiesCanary extraction + prompt-injection sentinel
ScoringExact-match, McNemar, no LLM judge
Defense stackInput-classifier workhorse, layered
Live demo7p3ng.github.io/aegis/ (embeddable)

Artifact 3

FieldAgent

CUAD contract red-flag finder: span-IoU graded, honest null on lift claim

Honest result

Detection F1 = 0.548 (P = 0.741 / R = 0.435), 95% CI [0.460, 0.637] on 20 held-out CUAD contracts. +0.21 F1 over a keyword floor (robust, baseline-independent). The "agentic chunking lift" looked like +0.45 on DeepSeek due to a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap), tying on Claude Sonnet. The honest null is the point.

F1 = 0.548P = 0.741 / R = 0.435
95% CI [0.460, 0.637]
+0.21 F1Over keyword floor,
baseline-independent
+0.07 (fair)Agentic lift (truncation artifact corrected,
CIs overlap)
47 testsCI green, party names redacted in demo

FieldAgent reads a real commercial contract and flags risk-bearing clauses with span, severity, and plain-English risk descriptions. Graded span-IoU against CUAD gold labels on 20 held-out contracts. No LLM judge in the success path.

The initial "+0.45 agentic lift" result was compelling but wrong: DeepSeek was hitting its output token limit and returning truncated results, making the baseline look worse. A fair rerun with adequate token budgets collapses the lift to +0.07 with overlapping confidence intervals. Reported exactly as measured. Vendors Quorum core/.

DatasetCUAD (Contract Understanding Atticus Dataset)
EvalSpan-IoU, 20 held-out contracts, no LLM judge
PrivacyParty names + $ figures redacted in live demo
NullAgentic lift = truncation artifact, corrected + reported

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary then editors then merger then council, escalating on disagreement. 576 tests. Internal infrastructure, no public URL. Methodology and systems-design piece.

09:14:02ADVERSARYattempting to inject off-brief behavior into skill/critique
09:14:04ADVERSARYinjection attempt rejected -- no foothold found
09:14:08EDITOR-1draft proposal: add buyer-journey-fit axis to rubric
09:14:11EDITOR-2counter: scope too broad, narrows to funnel-stage B
09:14:14MERGERmerged: funnel-B + specificity-gate retained
09:14:19COUNCIL-TASTEPASS -- improvement is on-brief, no drift
09:14:21COUNCIL-PRAGMAPASS -- token cost +120, marginal
09:14:23COUNCIL-INTENTPASS -- aligns with conversion brief
09:14:25COUNCIL-DRIFTPASS -- no regression in 576 test suite
09:14:26OUTCOME4/4 council vote -- improvement ships
09:14:26TESTSrunning 576 regression tests...
09:14:41TESTS576/576 OK -- committed

Taste proxy

Is the output on-brand, non-generic, coherent?

Pragmatism proxy

Does the token/cost cost justify the gain?

Intent proxy

Does it align with the conversion brief?

Anti-drift proxy

Does the full 576-test suite still pass?

Eval discipline

The principles that make the results trustworthy, and make a hiring manager at a frontier lab comfortable handing over a production eval harness.

Deterministic scoring

No LLM judge in the success path. Span-IoU, exact-match, McNemar where applicable. The ground truth does not flex.

Adversarial verification

K skeptic agents try to disprove every finding before it ships. Sycophantic validators are a known failure mode. We test for that.

Cost-gated reproducibility

make eval-dry reproduces the benchmark offline at ~$0 (no live API). The number can be checked.

Honest nulls by default

When the data says the lift is a truncation artifact, report it. The FieldAgent story is more credible because the null is in the write-up.

Shared kernel across artifacts

Three distinct problem classes (verification, red-team, contract) share one core/. That is not a portfolio narrative. It is a substrate proven across domains.

Confidence intervals, not point estimates

Every F1 / ASR / FP-rate ships with its 95% CI. A single run result is a hint, not a finding.

Let's talk.

Applied AI / Agent Engineer / Forward-Deployed Engineer / Design Engineer roles. I build things and show you the numbers.