The AI Isn't Broken.
The Collaboration Is.
When AI conversations go wrong, we blame the model. But after years designing these systems, I've started to wonder if the real failure isn't intelligence at all but the fact that nobody designed for how humans actually communicate. Anthropic's latest AI Fluency research is starting to confirm what practitioners have been seeing all along.
When a conversational AI system fails in production, the diagnostic conversation that follows tends to circle the same verdict. The model is underperforming. The prompt is too loose. We need a bigger context window, a smarter retrieval layer, a more capable model. The disappointment is real, and so are the metrics that track it. After years of shipping conversational AI to millions of users across eight markets, I have grown skeptical of the reflex that hands the failure to the model.
The more interesting question is not whether the model is intelligent enough. It is whether the collaboration we built around the model ever had a chance to succeed. That reframing matters because what most teams are missing right now is not a better model. It is a better theory of how people and machines actually cooperate at the interface.
The misattribution reflex
The history of new technologies is a history of misattribution. Early automobile accidents were often blamed on the animal the driver swerved to avoid; early aviation disasters on weather that would have been predictable if anyone had had a standard way of recording it. The pattern repeats whenever a system grows complex enough that the failure surface spans several disciplines. When no single person can hold all the moving pieces in view, causation gets pinned to the most visible component.
In conversational AI, the most visible component is the model. The model is the thing that produces the output the user reads. When the output is wrong, strange, or absent, the model earns the blame. The underlying prompt, the retrieval pipeline, the tool schema, the fallback policy, the handoff contract, and the knowledge base that feeds all of them sit behind the curtain. None of those components is as narratively satisfying to indict. The model is.
The cost of this reflex is specific. Teams iterate on the wrong variable. They evaluate models when they should be evaluating their evaluation harness. They tune prompts when the retrieval layer is returning irrelevant documents. They escalate to a larger context window when the issue is that no one defined what the model should carry between turns. Donald Norman, writing about ordinary door handles long before anyone was arguing about context windows, called this an affordance problem[1]: the system tells the user what kind of interaction is possible, and the failures live in the mismatch between what it offers and what is required. Conversational AI has the same problem, at a higher temperature.
What the research is beginning to say
Anthropic's recent work on what it calls AI Fluency[2] offers one of the cleaner articulations of the shift. The people who consistently get useful work out of AI systems, the research suggests, are not the ones with the deepest technical knowledge. They are the ones who have learned to communicate with these systems as collaborators. They give context. They specify constraints. They treat the response as input to the next turn rather than as a final judgment. They know when to stop and when to iterate. The skill looks less like programming and more like editorial judgment applied in real time.
That framing is important because it relocates the design problem. If fluency is a collaborative practice, then the instruments of collaboration (the prompt, the tool surface, the memory structure, the handoff policy) are where design leverage sits. Anthropic's work on constitutional AI points in the same direction from the opposite side of the interaction: the behavior of a capable model is shaped by the principles the system encodes, and those principles live outside the model itself. Design is not prompt phrasing. It is the architecture of what the system treats as given, what it treats as variable, and what it treats as out of bounds.[3]
Adjacent disciplines have been saying similar things for a long time. Conversation analysis, going back to Harvey Sacks and his contemporaries,[4] has documented how ordinary human talk depends on shared repair strategies. When one party mishears or misspeaks, a functioning conversation has well-worn mechanisms to recover: paraphrase, clarification, partial repetition, confirmation. Conversational AI inherits the same problem and has to solve it without the social cues humans rely on. The system that lacks a repair strategy is not unintelligent. It is underdesigned.
Three collaboration failures that masquerade as model failures
The failure modes I see most often, across enterprise engagements and post-mortems, cluster into three categories. None of them is a problem the model can solve alone.
Context failures
The model is asked to do a task without the information it would need to do the task well. The user knows something the system does not. The system has access to a database that the model cannot query. A previous turn established a constraint, but the current turn does not carry it. In each case, the model either fabricates to fill the gap or answers thinly. Neither outcome is a model problem. It is a context engineering problem: the discipline of deciding what the system remembers, what it retrieves, what it hands across, and what it lets go. A great deal of what reads as hallucination in production logs is, on inspection, the model confidently completing a pattern in the absence of the one fact that would have corrected it.
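To make that concrete, here is a minimal sketch in Python, with hypothetical names, of what an explicit carrying policy might look like: constraints established in earlier turns survive first, retrieval fills the remaining budget, and working notes are the first thing dropped.

```python
from dataclasses import dataclass, field

@dataclass
class TurnContext:
    """What the system chooses to carry into the next model call."""
    constraints: list[str] = field(default_factory=list)  # facts the user established in earlier turns
    retrieved: list[str] = field(default_factory=list)    # documents pulled for the current turn
    scratch: list[str] = field(default_factory=list)      # working notes, safe to drop

    def build_prompt_context(self, token_budget: int) -> str:
        """Constraints survive first, retrieval fills what is left, scratch never makes the cut."""
        ordered = ([("constraint", c) for c in self.constraints]
                   + [("retrieved", r) for r in self.retrieved])
        kept, used = [], 0
        for label, text in ordered:
            cost = len(text.split())  # crude word-count proxy for tokens, good enough for a sketch
            if used + cost > token_budget:
                break
            kept.append(f"[{label}] {text}")
            used += cost
        return "\n".join(kept)
```

The point is not this particular data structure. The point is that someone wrote down what survives between turns, and the decision is now inspectable.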
Feedback failures
The system produces an output. The user does something with it. Some of what they do (the click, the follow-up question, the abandonment) contains information about whether the output served them. In many production systems, that information is never routed back into the loop that would improve the next output. Reinforcement learning from human feedback has become a popular phrase, but its day-to-day version (capturing labeled moments, promoting useful examples into an evaluation set, deliberately closing the improvement loop) is often absent. The result is a system that stops learning from its own operation. Practitioners mistake this for the model plateauing. The model is not plateauing. The feedback architecture is missing.
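Closing that loop does not require exotic machinery. Here is a rough sketch, with hypothetical file names and signal labels, of what capturing labeled moments and promoting the useful ones into an evaluation set might look like.

```python
import json
import time
from pathlib import Path

FEEDBACK_LOG = Path("feedback_events.jsonl")  # hypothetical locations, for illustration only
EVAL_SET = Path("eval_cases.jsonl")

def record_feedback(turn_id: str, prompt: str, response: str, signal: str) -> None:
    """Append one labeled moment; the signal might be 'thumbs_up', 'abandoned', or 'rephrased'."""
    event = {"turn_id": turn_id, "prompt": prompt, "response": response,
             "signal": signal, "ts": time.time()}
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

def promote_to_eval_set(keep_signal: str = "thumbs_up") -> int:
    """Promote clearly labeled turns into the evaluation set the team actually runs."""
    promoted = 0
    with FEEDBACK_LOG.open() as src, EVAL_SET.open("a") as dst:
        for line in src:
            event = json.loads(line)
            if event["signal"] == keep_signal:
                dst.write(json.dumps({"input": event["prompt"],
                                      "reference": event["response"]}) + "\n")
                promoted += 1
    return promoted
```

Nothing here is sophisticated. That is the argument: the loop tends to be missing not because it is hard to build, but because nobody owns it.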
Handoff failures
The collaboration is never only between the user and the model. Somewhere downstream a tool has to be called, a record has to be written, an agent has to step in, a human has to take over. Each of these transitions is a handoff, and each handoff is a point of friction where context is supposed to travel across an interface. When the handoff schema is vague, context leaks. The human agent picks up a conversation they cannot reconstruct. The downstream tool receives a payload it cannot validate. The second model in the orchestration has no record of what the first one established. These are not AI problems. They are integration and specification problems that the AI cannot patch over.
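Here is what a tighter handoff contract might look like, sketched with illustrative field names rather than anyone's production schema. The payload names what the receiver needs, and a validation step refuses to pass along a conversation the receiver could not reconstruct.

```python
from dataclasses import dataclass

@dataclass
class HandoffPayload:
    """The contract a downstream tool, agent, or human receives. Fields are illustrative."""
    conversation_id: str
    user_goal: str                  # what the user is trying to accomplish, in one sentence
    established_facts: list[str]    # constraints confirmed in earlier turns
    open_questions: list[str]       # what the upstream system could not resolve
    confidence: float               # how sure the upstream system is about user_goal

def validate_handoff(payload: HandoffPayload) -> list[str]:
    """Return the reasons this handoff would force the receiver to reconstruct the conversation."""
    problems = []
    if not payload.user_goal.strip():
        problems.append("user_goal is empty: the receiver cannot know what to do")
    if not payload.established_facts:
        problems.append("no established_facts: confirmed constraints will be asked again")
    if payload.confidence < 0.5 and not payload.open_questions:
        problems.append("low confidence with no open_questions: the uncertainty is hidden")
    return problems
```

A schema this small will not cover every case. The discipline of writing it down, and refusing the handoff when it fails validation, is what keeps context from leaking silently.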
The orchestration frame
The industry has converged, somewhat awkwardly, on the word "agent" to describe the systems that handle these multi-step, multi-tool tasks. The word carries more weight than it can support. Most of what gets called an agent is, in practice, an orchestrated workflow with language-model-backed decision points. That is a useful thing to build. It is not, strictly speaking, an intelligent entity pursuing goals in the world. Anthropic's own essay on building effective agents[5] makes the architectural distinction clear: the useful unit is an orchestration pattern, and the design work sits in the choreography, not in the personality of the component.
Conversation design, rightly understood, is orchestration design. The practitioner's job is to decide which role the model plays at which moment, what each role knows, what tools each role can reach for, and how state flows between them. The system prompt is a role specification. The tool schema is a contract. The memory strategy is a policy. None of these artifacts are visible to the user. All of them are what the user is actually experiencing. When we invest in the model and neglect the orchestration, we are investing in an actor and neglecting the play.
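A sketch of that choreography as a declarative artifact, with invented role names and tools: each decision point is a role specification, a tool contract, and a memory policy, and none of it lives inside the model.

```python
from dataclasses import dataclass

@dataclass
class RoleSpec:
    """One decision point in the workflow: a role, not a personality."""
    name: str
    system_prompt: str          # the role specification
    tools: list[str]            # the contract: what this role may reach for
    reads_state: list[str]      # the memory policy: which state keys it may see

# Hypothetical two-step orchestration: triage decides, resolver acts,
# and neither role sees more context than its job requires.
WORKFLOW = [
    RoleSpec(name="triage",
             system_prompt="Classify the request and state the user's goal in one sentence.",
             tools=[],
             reads_state=["latest_user_message"]),
    RoleSpec(name="resolver",
             system_prompt="Resolve the stated goal with the tools available. Ask before guessing.",
             tools=["search_kb", "create_ticket"],
             reads_state=["user_goal", "established_facts"]),
]
```

Reviewing a change to an artifact like this is reviewing the play, not auditioning the actor.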
What conversation design actually designs
The term conversation design invites a narrow reading: the craft of composing what the bot says. The Conversation Design Institute[6] and the practitioner community around it have been expanding the definition for years, and the broader reading is the one that holds up in production. The work is the design of the collaboration, which includes but exceeds the utterance. What can the user say? What does the system understand? What does it do when it does not understand? What does it carry forward? When does it ask for help? When does it hand off? What does it show the user about its own confidence? What does it learn from what happens next?
A useful test: if two competent practitioners disagree about a system's behavior, the correct response is almost never to change the prompt. It is to find the place in the specification where the behavior was underdetermined, and to tighten it. That discipline looks more like spec writing than copywriting. It is why the best conversational AI work sits uncomfortably between product, engineering, research, and content. None of those functions on their own can specify the collaboration, because the collaboration is what they produce together.
Shipping the glue
The craft question that follows from all of this is practical. If the failures we blame on the model are really failures of collaboration, how does a team start building the collaboration on purpose? The short answer, reflected in the eval harnesses and prompt specifications and handoff contracts my engagements tend to produce, is that the team starts taking the scaffolding seriously as a first class artifact. Prompt architecture gets versioned like code. Tool schemas get reviewed in pull requests. Memory policy becomes a written document. Handoff payloads get their own evaluation. The system prompt is edited in the same room as the retrieval strategy, with the same care, by people who understand what each change costs the user.
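One small example of what "handoff payloads get their own evaluation" might mean in practice: a contract check that runs in CI like any other test. The field names are illustrative, not anyone's production schema.

```python
# A sketch of treating the handoff contract as a first-class, testable artifact.
REQUIRED_HANDOFF_FIELDS = {"conversation_id", "user_goal", "established_facts"}

def test_handoff_payload_contract():
    sample = {
        "conversation_id": "abc-123",
        "user_goal": "change the delivery address on order 4417",
        "established_facts": ["customer identity verified", "order has not shipped"],
    }
    missing = REQUIRED_HANDOFF_FIELDS - sample.keys()
    assert not missing, f"handoff payload is missing required fields: {missing}"
    assert sample["user_goal"].strip(), "user_goal must not be empty"
```

When a check like this fails, the conversation about the failure happens in a pull request, which is exactly where the design of the collaboration belongs.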
This is not a call to build more infrastructure for its own sake. It is a call to recognize that the infrastructure is where the design lives. The model will get better. It already has, many times over. The collaboration around the model will get better only if someone is doing the work of designing it. That person does not have a single job title. They sit in conversation design, prompt engineering, LLM product strategy, AI orchestration, conversational AI architecture, and increasingly in a role that does not have a clean name yet. The label matters less than the work.
The failure modes are fixable. That is the point I keep returning to. The teams I have seen recover from bad launches have rarely done it by switching models. They have done it by fixing the collaboration. They specified what the system knows. They built the feedback loop. They rewrote the handoff. They stopped asking the model to solve a problem the model was never holding.
The AI is not broken. It was never the thing that broke. The collaboration we designed around it was thin, and thin collaboration produces thin outputs. That is a design problem, and design problems have the decency to be solvable.
References
- [1] Donald A. Norman. The Design of Everyday Things. Basic Books, 2013 (revised ed.)
- [2] Anthropic. AI Fluency: Framework & Foundations. Anthropic, 2025.
- [3] Yuntao Bai et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, 2022.
- [4] Harvey Sacks. Lectures on Conversation, Vols. I & II. Blackwell, 1992. Edited by Gail Jefferson; foundational work on conversation analysis and repair.
- [5] Anthropic. Building effective agents. Anthropic Engineering, 2024.
- [6] Conversation Design Institute. Conversation Design Curriculum & Practitioner Resources. CDI.