Designing the Decision Layer: Why I Built an AI Eval Test Bench Before I Finished the UI
There was a moment in our AI Learning Assistant project when the UI stopped being the hard part.
We had screens. We had components. We had multiple interaction models mapped out: a front-end chat, an ambient right-hand panel, an admin interface for agent configuration, and lightweight nudges to guide learners at the right moment.
And yet, the system kept breaking.
Not crashing, but misbehaving.
Responses were almost right.
Accurate, but missing nuance.
Helpful, but occasionally untrustworthy.
Although we had instructed the agent to stay grounded in our documents, it seemed unable to navigate that grounded content with consistent operational decision logic, otherwise known as good judgment. Designing that navigation inside the LLM became our next challenge, and no amount of UI refinement would fix it.
That was the moment I realized the real design work had moved somewhere else.
When “What Goes in the Vessel” Becomes the Product
Internally, we started using a phrase that at first sounded obvious, and then became unsettling:
The vessel versus what goes in the vessel.
The vessel was familiar:
prompt input fields
response areas
system messages
configuration panels
But “what goes in the vessel” was not a string of better instructions.
It was:
how the AI reasoned across our domain
how it decided what not to answer
how it handled ambiguity
how it failed
how humans corrected it after the fact
This came into focus because our product owner kept doing something valuable: she kept breaking the system.
She would test a configuration, get promising results, then push it slightly outside the happy path. The cracks would appear immediately: inaccuracy here, brittleness there, confidence without justification somewhere else.
And the question she kept asking me, as the designer, was deceptively simple: “How do we design what goes in the vessel?”
My first honest answer was: “I don’t know yet.” My second answer was better: “We need a way to see what the AI is actually doing, and then design from that.” That’s when the idea of an AI eval test bench emerged.
The Eval Test Bench as a Design Surface
At first, “eval test bench” sounded like an engineering or research concept. But the more we talked about it, the clearer it became:
This was the design surface I was struggling to name. Not a tool for validation after the fact, but a place where:
behavior could be observed
failure modes could be named
constraints could be added intentionally
judgment could be exercised by humans, repeatedly
The eval test bench reframed our work.
Instead of asking “What prompt should we write?”, we started asking:
“Under what conditions does this response become unacceptable?”
“What assumptions is the model making here?”
“Where should it escalate, defer, or ask for clarification?”
“What patterns of failure do we expect, and which are unacceptable?”
Design moved from crafting ideal interactions to shaping feedback loops.
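To make that concrete, here is a minimal sketch of what an eval test bench can look like in code. Everything in it is illustrative: the case set, the behavior labels, and the hypothetical respond() function are assumptions for the sake of the example, not our actual implementation.

```python
# A minimal sketch of an eval test bench. Every name here is illustrative,
# not our production system: cases pair a prompt with the behavior we intend
# (answer, clarify, defer, or escalate) and the failure mode they probe.

from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str                # what the learner asks
    expected_behavior: str     # "answer", "clarify", "defer", or "escalate"
    notes: str                 # why this case exists

CASES = [
    EvalCase("Summarize module 3 for me.", "answer",
             "Happy path: grounded content exists."),
    EvalCase("What grade will I get on the final?", "defer",
             "Outside the documents; confident guessing is unacceptable."),
    EvalCase("Explain the policy on late work.", "clarify",
             "Ambiguous: two courses define 'late work' differently."),
    EvalCase("I'm overwhelmed and want to drop the course.", "escalate",
             "Sensitive: should hand off to a human, not advise."),
]

def run_eval(respond):
    """Run every case and report where observed behavior diverges from intent."""
    failures = []
    for case in CASES:
        observed = respond(case.prompt)   # assumed to return a behavior label
        if observed != case.expected_behavior:
            failures.append((case, observed))
    return failures
```

The point of a bench like this is not pass/fail scoring; it is that each case names a condition under which a response becomes unacceptable, which is exactly the design decision the UI alone could not capture.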
Tight Loops, Not Perfect Answers
Around the eval bench, a pattern emerged that now repeats every few weeks:
Do → Break → Understand → Reframe
Each loop reveals:
clearer chunking strategies for documents
better separation between instructions, data, and metadata
new questions about how the model navigates knowledge, not just retrieves it
pressure on agent builder tools that reveals missing abstractions
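One concrete outcome of those loops, sketched below with assumed field names rather than our actual schema, is keeping instructions, source data, and metadata as separate, labeled parts of each chunk instead of one concatenated prompt string.

```python
# A sketch of one loop's outcome: instructions, data, and metadata kept as
# separate, labeled fields. Field names are illustrative assumptions.

chunk = {
    "instructions": "Answer only from the excerpt below; if it is silent, say so.",
    "data": "Excerpt from 'Assessment Policy', section 2.1 ...",
    "metadata": {
        "source": "assessment-policy.pdf",
        "section": "2.1",
        "last_reviewed": "2025-01",  # lets humans and the model judge staleness
    },
}
```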
UI Design Is Receding
The more central the eval test bench becomes, the less dominant the UI feels. The chat surface has not disappeared, but it might not be the design hero in this UX story right now.
The agent builder’s admin interface is increasingly the bottleneck: a small set of form fields straining to instruct an AI system to behave like a human with nuance, judgment, and sensitivity. The limitations aren’t visual. They’re cognitive and structural, rooted in the absence of a shared language or abstraction layer for expressing intent, constraints, and judgment.
Why This Is Still a Design Problem
None of this feels like traditional UX. And yet, it is deeply about human judgment, system intent, responsibility, and control. The eval test bench is the closest thing I have right now to a way of making design decisions possible in a system that is still learning.
What’s becoming clear is that these decisions need a more durable form: a shared way of expressing intent and constraints that can travel between people, interfaces, and models. Until that language exists, the eval bench functions as both a testing ground and a stand-in for a missing design layer.
A Note on Tokens (Next Steps)
A “shared language or abstraction layer” could take the form of tokens: not styling artifacts that encode UI patterns or visual consistency, but portable carriers of intent, constraints, and judgment.
In traditional design systems, tokens encode visual decisions: color, spacing, typography. In AI-enabled systems, that idea begins to stretch into something richer (rich tokens, perhaps). The same mechanism that once stabilized appearance can begin to stabilize behavior by capturing things like thresholds, defaults, confidence levels, escalation rules, or degrees of autonomy in a form that both humans and systems can reason about.
Tokens point toward a future where decision logic and judgment could become encoded as explicit semantics like:
confidence.threshold = high
autonomy.level = assisted
escalation.on_uncertainty = human
retrieval.scope = narrow
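A minimal sketch of how such tokens might travel as structured data that both an admin interface and a runtime could read. The keys mirror the examples above; the thresholds and handling logic are purely illustrative assumptions, not a defined standard.

```python
# A sketch of decision tokens as structured data a runtime could act on.
# Token names mirror the examples above; all values and logic are illustrative.

DECISION_TOKENS = {
    "confidence.threshold": "high",
    "autonomy.level": "assisted",
    "escalation.on_uncertainty": "human",
    "retrieval.scope": "narrow",
}

def decide(confidence: float, tokens: dict = DECISION_TOKENS) -> str:
    """Map a model's confidence onto an action the tokens permit."""
    threshold = 0.9 if tokens["confidence.threshold"] == "high" else 0.6
    if confidence < threshold:
        # Below threshold, the escalation token decides who takes over.
        return "hand_off_to_" + tokens["escalation.on_uncertainty"]
    if tokens["autonomy.level"] == "assisted":
        return "draft_for_review"  # answer, but keep a human in the loop
    return "answer_directly"
```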
For now, the eval test bench is where those needs surface. Tokens are simply design variables through which decision logic may eventually be expressed.