How Claude Is Challenging Traditional Design Systems


The design system, in its classical form, is a contract. Design defines the tokens, the components, the permitted variations, and the rules for how everything connects. Engineering implements them. Every surface in the product honours the contract. The user experiences consistency, and consistency — the logic goes — is what makes an interface trustworthy.

It's a model that served us well for roughly two decades. Then Claude arrived, and with it the wider class of large language models that generate interface content, adapt to context, and respond to natural language in real time. Quietly and thoroughly, they are breaking the assumptions the traditional design system was built on.

The consistency problem

Traditional design systems achieve consistency through constraint. A button looks the same everywhere because the component is the same everywhere. An error message follows the same structure because it's drawn from the same pattern library. The designer's job is to define the rules; the system enforces them.

Now consider what happens when Claude writes the error message. The text adapts to what the user just tried to do, their apparent level of technical sophistication, and the conversational tone established earlier in the session. Two users encountering the same error state will read different messages — both accurate, both helpful, neither identical. The output is probabilistic by design.

This is not a bug. It's often a better user experience than a static, one-size-fits-all message. But it fundamentally challenges the consistency model. You cannot put a generative AI response into a component library, and you cannot version-control or audit it the way you would a fixed string. The traditional governance model has no good answer for outputs it didn't specify in advance.

The design system used to define what the interface would say. Now it has to define how the interface should think — and those are very different documents.

Tokens, components, and the problem of behavior

Design tokens — the named values for colour, spacing, typography — were a genuine leap forward in design system thinking. They abstracted visual decisions from their implementations, making it possible to change a product's visual language without touching every individual component.

In an AI-native product, the equivalent abstraction is behavioral. What tone should the product take when a user is frustrated? How much should it explain versus assume? When should it be direct and when should it hedge? These are not visual decisions, but they shape the user experience just as powerfully as colour and spacing.

I've started calling these "behavior tokens" — named, reusable decisions about how an AI-powered interface should act in different contexts. They live in system prompts and evaluation criteria rather than Figma files and code variables. But they serve the same function: they make behavioral decisions consistent, auditable, and changeable without rewriting everything from scratch.
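
To make the analogy concrete, here is a minimal sketch of a behavior token as a named record that resolves into system prompt text, much as a colour token resolves into a hex value. Everything in it is hypothetical: the `BehaviorToken` type, the token names, and the `build_system_prompt` helper are assumptions made up for this illustration, not part of Claude's API or of any real library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorToken:
    """A named, reusable decision about how an AI surface should act (hypothetical)."""
    name: str         # stable identifier, like a design token name
    context: str      # when the decision applies
    instruction: str  # what the model is asked to do in that context


# Illustrative tokens only; a real product would define its own.
TOKENS = {
    "tone.frustrated-user": BehaviorToken(
        name="tone.frustrated-user",
        context="the user shows signs of frustration",
        instruction="acknowledge the problem briefly, then give the next step",
    ),
    "detail.expert-user": BehaviorToken(
        name="detail.expert-user",
        context="the user appears technically experienced",
        instruction="skip basic explanations and include exact error details",
    ),
}


def build_system_prompt(token_names: list[str]) -> str:
    """Resolve token names into system prompt lines, failing on unknown names."""
    lines = []
    for name in token_names:
        token = TOKENS[name]  # a KeyError here means the token was never defined
        lines.append(f"When {token.context}: {token.instruction}.")
    return "\n".join(lines)


prompt = build_system_prompt(["tone.frustrated-user", "detail.expert-user"])
print(prompt)
```

The point of the indirection is the same as with visual tokens: changing the wording of `tone.frustrated-user` in one place changes it for every surface that references the name, and the change shows up in a diff that can be reviewed.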

Evaluation as a design artifact

One of the most underappreciated shifts in AI-native product work is the rise of evaluation as a design discipline. When your interface generates its own content, you cannot manually review every output. Instead, you define what "good" looks like — and build systems to check whether the AI is meeting that bar.

This is design work. Deciding what constitutes a high-quality AI response, what failure modes are unacceptable, what the edge cases look like — these are design decisions that require the same kind of rigorous thinking that goes into defining component states or writing a content style guide. They just live in a different place.

Design system teams that aren't yet thinking about evaluation rubrics as part of their remit are missing a significant part of what their product's quality actually depends on.
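
As a rough sketch of what an evaluation rubric could look like once it is written down, the example below pairs named criteria with an example and a counter-example. The criterion names and the `evaluate` helper are hypothetical, and the simple string checks are stand-ins chosen to keep the example runnable; a real team might rely on human review or model-graded checks instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name plus a check that returns True when the output passes."""
    name: str
    check: Callable[[str], bool]


# Hypothetical criteria for a generated error message.
RUBRIC = [
    Criterion("states_next_step", lambda text: "try" in text.lower()),
    Criterion("no_blame", lambda text: "you failed" not in text.lower()),
    Criterion("short_enough", lambda text: len(text.split()) <= 40),
]


def evaluate(output: str) -> dict[str, bool]:
    """Run every criterion against one output and report pass/fail by name."""
    return {criterion.name: criterion.check(output) for criterion in RUBRIC}


# An example and a counter-example, as a rubric document might record them.
good = "The upload timed out. Try again on a stable connection."
bad = "You failed to upload the file."

good_result = evaluate(good)
bad_result = evaluate(bad)
print(good_result)
print(bad_result)
```

The checks themselves matter less than the fact that each criterion has a name, an owner can change it, and the examples make the intent legible to someone who was not in the room when it was decided.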

What a design system looks like in an AI-native product

The design system isn't going away. Visual consistency still matters. Accessibility still requires defined standards. Brand expression still needs to be coherent across surfaces. But the scope of what a design system needs to cover has expanded significantly:

Visual layer: tokens, components, layout patterns — same as before, still essential.
Content layer: voice and tone guidelines written as behavioral constraints for the AI, not just advice for human copywriters.
Behavior layer: named decisions about how AI-powered surfaces should respond to context, including explicit rules for edge cases and failure states.
Evaluation layer: criteria for assessing whether AI outputs meet the product's quality and safety standards, with examples and counter-examples.

None of this replaces what came before. But the design system teams who will be most valuable in the next few years are the ones who understand that their scope now extends into territory that wasn't theirs to own two years ago.

A practical starting point

If you're working on a product that already has AI-generated content somewhere (a recommendation, a generated message, a piece of adaptive UI copy), ask yourself: who defined how that content should behave? Is that definition written down anywhere? Could a new team member understand it?

In most cases, the answer is no. Someone made a decision in a system prompt, or in a model fine-tuning run, or in a product requirements document that nobody has looked at since the feature shipped. That's the gap your design system needs to close.

Start there. Document the behavioral decisions that already exist. Make them visible, reviewable, and changeable through a process. That's not a complete answer to the challenge conversational AI poses — but it's a more honest starting point than pretending the traditional model still covers everything it used to.
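
One possible shape for that documentation is a small register of decisions, each with a location, an owner, and a review date. The sketch below assumes that shape; the `BehaviorDecision` fields, the entries, and the one-year review window are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BehaviorDecision:
    """One documented decision about how AI-generated content should behave."""
    surface: str        # where the generated content appears
    decision: str       # what was decided, in plain language
    defined_in: str     # where the decision currently lives
    owner: str          # who can approve a change
    last_reviewed: date


# Hypothetical entries; surfaces, owners, and dates are invented.
REGISTER = [
    BehaviorDecision(
        surface="checkout error message",
        decision="explain the failure in one sentence, then offer one next step",
        defined_in="system prompt",
        owner="design systems team",
        last_reviewed=date(2024, 1, 15),
    ),
    BehaviorDecision(
        surface="product recommendations",
        decision="never imply urgency or scarcity",
        defined_in="product requirements document",
        owner="content design",
        last_reviewed=date(2023, 3, 1),
    ),
]


def stale_decisions(register, today: date, max_age_days: int = 365):
    """Return decisions that have not been reviewed within max_age_days."""
    return [d for d in register if (today - d.last_reviewed).days > max_age_days]


overdue = stale_decisions(REGISTER, today=date(2024, 6, 1))
print([d.surface for d in overdue])
```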