The AI Isn't Broken. The Collaboration Is.
Christi Akinwumi

When a conversational AI system fails in production, the diagnostic conversation that follows tends to circle the same verdict. The model's underperforming. The prompt is too loose. We need a bigger context window, a smarter retrieval layer, a more capable model. The disappointment is real, and so are the metrics that track it. After years of building conversational AI for millions of users across eight markets, I've grown skeptical of the reflex that hands the failure to the model.

The more interesting question isn't whether the model is intelligent enough. It's whether the collaboration we built around the model ever had a chance to succeed. That reframing matters, because the design discipline most teams need right now isn't a better model. It's a better theory of how people and machines actually cooperate at the interface.

The misattribution reflex

The history of new technologies is a history of misattribution. Early automobile accidents were often blamed on the animal the driver swerved to avoid; early aviation disasters on weather that would have been predictable if anyone had had a standard way of recording it. The pattern repeats whenever a system grows complex enough that the failure surface spans several disciplines. When no single person can hold all the moving pieces in view, causation gets pinned to the most visible component.

In conversational AI, the most visible component is the model. It's the thing that produces the output the user reads. When that output is wrong, strange, or absent, the model earns the blame. The underlying prompt, the retrieval pipeline, the tool schema, the fallback policy, the handoff contract, and the knowledge base that feeds all of them sit behind the curtain. None of those is as narratively satisfying to indict. The model is.

The cost of this reflex is specific. Teams iterate on the wrong variable. They evaluate models when they should be evaluating their evaluation harness. They tune prompts when the retrieval layer is returning irrelevant documents. They escalate to a larger context window when the real issue is that no one defined what the model should carry between turns. Donald Norman, writing about ordinary door handles long before anyone was arguing about context windows, called this an affordance problem^[1]: the system tells the user what kind of interaction is possible, and the failures live in the mismatch between what it offers and what's required. Conversational AI has the same problem, at a higher temperature.

What the research is beginning to say

Anthropic's recent work on what it calls AI Fluency^[2] offers one of the cleaner articulations of the shift. The people who consistently get useful work out of AI systems, the research suggests, aren't the ones with the deepest technical knowledge. They're the ones who have learned to communicate with these systems as collaborators. They give context. They specify constraints. They treat the response as input to the next turn rather than as a final judgment. They know when to stop and when to iterate. The skill looks less like programming and more like editorial judgment applied in real time.

That framing is important because it relocates the design problem. If fluency is a collaborative practice, then the instrument of collaboration (the prompt, the tool surface, the memory structure, the handoff policy) is where design leverage sits. Anthropic's work on constitutional AI^[3] points in the same direction from the opposite side of the interaction: the behavior of a capable model is shaped by the principles the system encodes, and those principles live outside the model itself. Design isn't prompt phrasing. It's the architecture of what the system treats as given, what it treats as variable, and what it treats as out of bounds.

Adjacent disciplines have been saying similar things for a long time. Conversation analysis, going back to Harvey Sacks and his contemporaries,^[4] has documented how ordinary human talk depends on shared repair strategies. When one party mishears or misspeaks, a functioning conversation has well-worn mechanisms to recover: paraphrase, clarification, partial repetition, confirmation. Conversational AI inherits the same problem and has to solve it without the social cues humans rely on. A system that lacks a repair strategy isn't unintelligent. It's underdesigned.
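One way to stop a repair strategy from being implicit is to write it down as a policy. What follows is a minimal sketch, not a description of any particular system: the `choose_repair` function, the move names, and every threshold are hypothetical, and it assumes the system can produce some interpretation confidence between 0 and 1.

```python
from enum import Enum


class Repair(Enum):
    PROCEED = "proceed"    # confident enough to act on the interpretation
    CONFIRM = "confirm"    # restate the interpretation and ask for a yes/no
    CLARIFY = "clarify"    # ask the user to choose between readings or rephrase
    ESCALATE = "escalate"  # repair has already failed repeatedly: hand off


def choose_repair(confidence: float, failed_attempts: int,
                  confirm_below: float = 0.8, clarify_below: float = 0.5,
                  max_attempts: int = 2) -> Repair:
    """Pick a repair move. All thresholds here are illustrative assumptions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if failed_attempts >= max_attempts:
        return Repair.ESCALATE
    if confidence < clarify_below:
        return Repair.CLARIFY
    if confidence < confirm_below:
        return Repair.CONFIRM
    return Repair.PROCEED
```

The specific numbers matter less than the fact that they exist somewhere a team can read, argue about, and change.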

Three collaboration failures that masquerade as model failures

The failure modes I see most often, across enterprise engagements and post-mortems, cluster into three categories. None of them is a problem the model can solve on its own.

Context failures

The model is asked to do a task without the information it would need to do the task well. The user knows something the system does not. The system has access to a database that the model cannot query. A previous turn established a constraint, but the current turn does not carry it. In each case, the model either fabricates to fill the gap or answers thinly. Neither outcome is a model problem. It's a context engineering problem: the discipline of deciding what the system remembers, what it retrieves, what it hands across, and what it lets go. A great deal of what reads as hallucination in production logs is, on inspection, the model confidently completing a pattern in the absence of the one fact that would have corrected it.
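The "previous turn established a constraint" case is the easiest to make concrete. The sketch below assumes a deliberately simple design in which constraints are stored as key-value pairs and prepended to every turn; `ConversationState` and `build_turn_context` are hypothetical names, and a production system would also need a policy for what to forget and when.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    """Constraints established in earlier turns (hypothetical structure)."""
    constraints: dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        self.constraints[key] = value

    def forget(self, key: str) -> None:
        # Letting go is an explicit decision, not an accident of truncation.
        self.constraints.pop(key, None)


def build_turn_context(state: ConversationState, user_message: str) -> str:
    """Prepend carried constraints so the current turn does not lose them."""
    lines = [f"- {k}: {v}" for k, v in sorted(state.constraints.items())]
    carried = "\n".join(lines) if lines else "- (none)"
    return f"Known constraints:\n{carried}\n\nUser: {user_message}"
```

A usage example: after `state.remember("budget", "under 500 USD")`, every later call to `build_turn_context(state, ...)` carries that line, whether or not the user repeats it.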

Feedback failures

The system produces an output. The user does something with it. Some of what they do (the click, the follow-up question, the abandonment) contains information about whether the output served them. In many production systems, that information is never routed back into the loop that would improve the next output. Reinforcement learning from human feedback has become a popular phrase, but the day-to-day version of it (the capturing of labeled moments, the promotion of useful examples into an evaluation set, the deliberate closing of the improvement loop) is often absent. The result is a system that stops learning from its own operation. Practitioners mistake this for the model plateauing. The model isn't plateauing. The feedback architecture is missing.
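The unglamorous version of that loop can be small. The sketch below assumes interactions are already logged with a behavioral signal, and that a crude mapping from signal to label is an acceptable starting point; the signal names, the `SIGNAL_LABELS` mapping, and `promote_to_eval_set` are all hypothetical, and real labels would usually need human review before anyone trusted them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    prompt: str
    output: str
    signal: str  # e.g. "clicked", "follow_up", "abandoned" (assumed names)


# Assumed mapping from observed behavior to a coarse label.
SIGNAL_LABELS = {
    "clicked": "helpful",
    "follow_up": "unclear",
    "abandoned": "unhelpful",
}


def label(interaction: Interaction) -> str:
    return SIGNAL_LABELS.get(interaction.signal, "unknown")


def promote_to_eval_set(log: list[Interaction], eval_set: list[dict]) -> int:
    """Append labeled, not-yet-seen interactions to eval_set; return how many were added."""
    seen = {(case["prompt"], case["output"]) for case in eval_set}
    added = 0
    for item in log:
        verdict = label(item)
        key = (item.prompt, item.output)
        if verdict == "unknown" or key in seen:
            continue  # skip unlabeled signals and duplicates
        eval_set.append({"prompt": item.prompt, "output": item.output, "label": verdict})
        seen.add(key)
        added += 1
    return added
```

Running this on a schedule doesn't improve anything by itself. It just guarantees that the evidence exists when someone sits down to improve the system.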

Handoff failures

The collaboration is never only between the user and the model. Somewhere downstream a tool has to be called, a record has to be written, an agent has to step in, a human has to take over. Each of these transitions is a handoff, and each handoff is a point of friction where context is supposed to travel across an interface. When the handoff schema is vague, context leaks. The human agent picks up a conversation they cannot reconstruct. The downstream tool receives a payload it can't validate. The second model in the orchestration has no record of what the first one established. These aren't AI problems. They're integration and specification problems the AI can't patch over.
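A handoff schema stops being vague the moment it can reject a payload. The sketch below is one possible shape, under the assumption that a handoff should carry at least an identifier, the user's goal, a summary, and either established facts or open questions; the field names and the `validate_handoff` rules are hypothetical, not a standard.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffPayload:
    """What travels to the human agent or downstream tool (hypothetical fields)."""
    conversation_id: str
    user_goal: str
    summary: str
    established_facts: dict[str, str] = field(default_factory=dict)
    open_questions: list[str] = field(default_factory=list)


def validate_handoff(payload: HandoffPayload) -> list[str]:
    """Return a list of problems; an empty list means the payload is usable."""
    problems = []
    for name in ("conversation_id", "user_goal", "summary"):
        if not getattr(payload, name).strip():
            problems.append(f"{name} is empty")
    if not payload.established_facts and not payload.open_questions:
        problems.append("no established facts or open questions to carry across")
    return problems
```

The design choice worth noticing is that the validator returns every problem rather than failing on the first, so a rejected handoff tells the upstream step everything it left out.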

The orchestration frame

The industry has converged, somewhat awkwardly, on the word "agent" to describe the systems that handle these multi-step, multi-tool tasks. The word carries more weight than it can support. Most of what gets called an agent is, in practice, an orchestrated workflow with language-model-backed decision points. That's a useful thing to build. It isn't, strictly speaking, an intelligent entity pursuing goals in the world. Anthropic's own essay on building effective agents^[5] makes the architectural distinction clear: the useful unit is an orchestration pattern, and the design work sits in the choreography, not in the personality of the component.

Conversation design, rightly understood, is orchestration design. The practitioner's job is to decide which role the model plays at which moment, what each role knows, what tools each role can reach for, and how state flows between them. The system prompt is a role specification. The tool schema is a contract. The memory strategy is a policy. None of these artifacts is visible to the user. All of them are what the user is actually experiencing. Invest in the model and neglect the orchestration, and you've hired an actor and skipped the play.

What conversation design actually designs

The term conversation design invites a narrow reading: the craft of composing what the bot says. The Conversation Design Institute^[6] and the practitioner community around it have been expanding the definition for years, and the broader reading is the one that holds up in production. The work is the design of the collaboration, which includes but exceeds the utterance. What can the user say? What does the system understand? What does it do when it doesn't understand? What does it carry forward? When does it ask for help? When does it hand off? What does it show the user about its own confidence? What does it learn from what happens next?

A useful test: if two competent practitioners disagree about a system's behavior, the right response is almost never to change the prompt. It's to find the place in the specification where the behavior was underdetermined, and tighten it. That discipline looks more like spec writing than copywriting. It's why the best conversational AI work sits uncomfortably between product, engineering, research, and content. None of those functions on its own can specify the collaboration, because the collaboration is what they produce together.

Building the glue

The craft question that follows from all of this is practical. If the failures we blame on the model are really failures of collaboration, how does a team start building the collaboration on purpose? The short answer, reflected in the eval harnesses and prompt specifications and handoff contracts my engagements tend to produce, is that you start taking the scaffolding seriously as a first-class artifact. Prompt architecture gets versioned like code. Tool schemas get reviewed in pull requests. Memory policy becomes a written document. Handoff payloads get their own evaluation. The system prompt gets edited in the same room as the retrieval strategy, with the same care, by people who understand what each change costs the user.
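"Versioned like code" can start as something very small. The sketch below assumes prompts are stored as named, versioned records and that a check in review or CI compares the old and new record; `PromptSpec` and `changed_without_bump` are hypothetical names for illustration, not part of any existing tool.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    name: str
    version: str
    text: str

    def fingerprint(self) -> str:
        """Short content hash, so a silent edit to the text is detectable."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


def changed_without_bump(old: PromptSpec, new: PromptSpec) -> bool:
    """True when the prompt text changed but the version string did not."""
    return old.version == new.version and old.fingerprint() != new.fingerprint()
```

A check like this doesn't say whether a prompt change was good. It only makes sure the change was announced, which is the precondition for evaluating it.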

This isn't a call to build more infrastructure for its own sake. It's a call to recognize that the infrastructure is where the design lives. The model will get better. It already has, many times over. The collaboration around the model will get better only if someone is doing the work of designing it. That person doesn't have a single job title. They sit in conversation design, prompt engineering, LLM product strategy, AI orchestration, conversational AI architecture, and increasingly in a role that doesn't have a clean name yet. The label matters less than the work.

The failure modes are fixable. That's the point I keep returning to. The teams I've seen recover from bad launches have rarely done it by switching models. They've done it by fixing the collaboration. They specified what the system knows. They built the feedback loop. They rewrote the handoff. They stopped asking the model to solve a problem it was never holding.

The AI isn't broken. It was never the thing that broke. The collaboration we designed around it was thin, and thin collaboration produces thin outputs. That's a design problem, and design problems have the decency to be solvable.

References

[1] Donald A. Norman. The Design of Everyday Things. Basic Books, 2013 (revised ed.).
[2] Anthropic. AI Fluency: Framework & Foundations. Anthropic, 2025.
[3] Yuntao Bai et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, 2022.
[4] Harvey Sacks. Lectures on Conversation, Vols. I & II. Edited by Gail Jefferson. Blackwell, 1992. Foundational work on conversation analysis and repair.
[5] Anthropic. Building effective agents. Anthropic Engineering, 2024.
[6] Conversation Design Institute. Conversation Design Curriculum & Practitioner Resources. CDI.
