Designing the Decision Layer: Why I Built an AI Eval Test Bench Before I Finished the UI

There was a moment in our AI Learning Assistant project when the UI stopped being the hard part.

We had screens. We had components. We had multiple interaction models mapped out: a front-end chat, an ambient right-hand panel, an admin interface for agent configuration, and lightweight nudges to guide learners at the right moment.

And yet, the system kept breaking. Not crashing, but misbehaving:

  • Responses were almost right.

  • Accurate, but missing nuance.

  • Helpful, but occasionally untrustworthy.

Although we instructed the agent to stay grounded in our documents, it seemed unable to navigate that grounded content with consistent operational decision logic, otherwise known as good judgment. Designing that navigation within the LLM became our next challenge, and no amount of UI refinement would fix it.

That was the moment I realized the real design work had moved somewhere else.

When “What Goes in the Vessel” Becomes the Product

Internally, we started using a phrase that at first sounded obvious, and then became unsettling:

The vessel versus what goes in the vessel.

The vessel was familiar:

  • prompt input fields

  • response areas

  • system messages

  • configuration panels

But “what goes in the vessel” was not a string of better instructions.

It was:

  • how the AI reasoned across our domain

  • how it decided what not to answer

  • how it handled ambiguity

  • how it failed

  • how humans corrected it after the fact

This came into focus because our product owner kept doing something valuable: she kept breaking the system.

She would test a configuration, get promising results, then push it slightly outside the happy path. The cracks would appear immediately: inaccuracy here, brittleness there, confidence without justification somewhere else.

And the question she kept asking me, as the designer, was deceptively simple: “How do we design what goes in the vessel?”

My first honest answer was: “I don’t know yet.” My second answer was better: “We need a way to see what the AI is actually doing, and then design from that.” That’s when the idea of an AI eval test bench emerged.

The Eval Test Bench as a Design Surface

At first, “eval test bench” sounded like an engineering or research concept. But the more we talked about it, the clearer it became:

This was the design surface I was struggling to name. Not a tool for validation after the fact, but a place where:

  • behavior could be observed

  • failure modes could be named

  • constraints could be added intentionally

  • judgment could be exercised by humans, repeatedly

The eval test bench reframed our work.

Instead of asking “What prompt should we write?” we started asking:

  • “Under what conditions does this response become unacceptable?”

  • “What assumptions is the model making here?”

  • “Where should it escalate, defer, or ask for clarification?”

  • “What patterns of failure do we expect, and which are unacceptable?”

Design moved from crafting ideal interactions to shaping feedback loops.
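
To make that shift concrete, here is a minimal sketch of what an eval-bench loop could look like in code. It is illustrative only: the case fields, the judging heuristics, and the run_agent callable are assumptions, not our actual implementation.

from dataclasses import dataclass, field

# A hypothetical eval case: an input plus the conditions under which a
# response becomes unacceptable, rather than a single "ideal" answer.
@dataclass
class EvalCase:
    name: str
    prompt: str
    must_include: list[str] = field(default_factory=list)  # grounding we expect to see
    must_not_do: list[str] = field(default_factory=list)   # named failure modes
    should_escalate: bool = False                           # should it defer to a human?

def judge(case: EvalCase, response: str) -> dict:
    """Record observations, not just pass/fail, so failure modes get names."""
    lowered = response.lower()
    return {
        "case": case.name,
        "missing_grounding": [s for s in case.must_include if s.lower() not in lowered],
        "violations": [s for s in case.must_not_do if s.lower() in lowered],
        "escalated": "not sure" in lowered or "check with" in lowered,
        "expected_escalation": case.should_escalate,
    }

def run_bench(cases: list[EvalCase], run_agent) -> list[dict]:
    """run_agent is whatever calls the model; its existence is assumed here."""
    return [judge(case, run_agent(case.prompt)) for case in cases]

Even a toy harness like this changes the question from “is this prompt good?” to “which cases break, and how?”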

Tight Loops, Not Perfect Answers

Around the eval bench, a pattern emerged that now repeats every few weeks:

Do → Break → Understand → Reframe

Each loop reveals:

  • clearer chunking strategies for documents

  • better separation between instructions, data, and metadata (see the sketch after this list)

  • new questions about how the model navigates knowledge, not just retrieves it

  • pressure on agent builder tools that reveals missing abstractions
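
A hedged sketch of that separation, with hypothetical document names and content, might look like this. The point is only the shape: source text, metadata, and instructions kept apart until the moment of assembly.

# Hypothetical shape for "what goes in the vessel": instructions, data, and
# metadata held as separate fields instead of one concatenated prompt string.
chunk = {
    "data": "Learners must complete Module 2 before scheduling the lab.",  # source text only
    "metadata": {
        "source": "facilitator-guide.pdf",  # assumed document name
        "section": "Scheduling",
        "last_reviewed": "2025-01-15",
        "audience": "facilitators",
    },
}

instructions = (
    "Answer only from the supplied chunks. "
    "If the chunks do not cover the question, say so and suggest who to ask."
)

# At assembly time the three pieces come together explicitly, so each one can
# be evaluated and revised on its own rather than hidden inside a prompt blob.
prompt = (
    f"{instructions}\n\n"
    f"[{chunk['metadata']['source']} / {chunk['metadata']['section']}]\n"
    f"{chunk['data']}"
)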

UI Design Is Receding

The more central the eval test bench becomes, the less dominant the UI feels. The chat surface has not disappeared, but it might not be the design hero in this UX story right now.

The agent builder’s admin interface is increasingly the bottleneck: a small set of form fields straining to instruct an AI system to behave like a human with nuance, judgment, and sensitivity. The limitations aren’t visual. They’re cognitive and structural, rooted in the absence of a shared language or abstraction layer for expressing intent, constraints, and judgment.

Why This Is Still a Design Problem

None of this feels like traditional UX. And yet, it is deeply about human judgment, system intent, responsibility, and control. The eval test bench is the closest thing I have right now to a way of making design decisions possible in a system that is still learning.

What’s becoming clear is that these decisions need a more durable form: a shared way of expressing intent and constraints that can travel between people, interfaces, and models. Until that language exists, the eval bench functions as both a testing ground and a stand-in for a missing design layer.

A Note on Tokens (Next Steps)

A “shared language or abstraction layer” could take the form of tokens: not styling artifacts in the service of UI patterns or visual consistency, but portable carriers of intent, constraints, and judgment.

In traditional design systems, tokens encode visual decisions: color, spacing, typography. In AI-enabled systems, that idea begins to stretch into something richer: call them rich tokens. The same mechanism that once stabilized appearance can begin to stabilize behavior by capturing things like thresholds, defaults, confidence levels, escalation rules, or degrees of autonomy in a form that both humans and systems can reason about.

Tokens point toward a future where decision logic and judgment could be encoded as explicit semantics like:

  • confidence.threshold = high

  • autonomy.level = assisted

  • escalation.on_uncertainty = human

  • retrieval.scope = narrow

For now, the eval test bench is where those needs surface. Tokens are simply design variables through which decision logic may eventually be expressed.
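
As a sketch of where that could lead, the same tokens could be read by code as well as by people. Everything below is hypothetical: the token names mirror the list above, and the escalation rule is one possible interpretation, not a real agent-builder API.

# Hypothetical encoding of the tokens above as explicit, machine-readable settings.
BEHAVIOR_TOKENS = {
    "confidence.threshold": "high",    # minimum confidence before answering directly
    "autonomy.level": "assisted",      # the agent suggests; a human confirms
    "escalation.on_uncertainty": "human",
    "retrieval.scope": "narrow",
}

CONFIDENCE_ORDER = ["low", "medium", "high"]

def should_escalate(model_confidence: str, tokens: dict = BEHAVIOR_TOKENS) -> bool:
    """Escalate to a human whenever confidence falls below the token threshold."""
    below = CONFIDENCE_ORDER.index(model_confidence) < CONFIDENCE_ORDER.index(tokens["confidence.threshold"])
    return below and tokens["escalation.on_uncertainty"] == "human"

# Example: a "medium"-confidence answer under a "high" threshold defers to a person.
assert should_escalate("medium") is True

The value is not the syntax; it is that a threshold or an escalation rule becomes a named, portable decision rather than an implicit behavior buried in a prompt.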

Next

Evolving the Customer Persona: From Slide Decks to Living Conversations