
Customer service is where most mid-market companies try AI first. It is also where it most visibly fails. The problem is almost never the model. It is that nobody decided, on purpose, which work the AI should touch and which work it should never go near.
By Skip Marshall & Chuck Griess
When an operator tells me they are "doing AI," the next sentence is almost always about customer service.
It makes sense. Support is high volume, it is measurable, and it shows up on a budget line the CFO already wants to shrink. So the chatbot goes on the website, the email triage tool gets switched on, and for about a quarter everyone feels like the future arrived on schedule. Then the wheels start to wobble. The bot confidently tells a customer something that is not true. A refund gets approved that should have been escalated. The CSAT line, the one nobody was watching because deflection looked so good, has quietly dropped half a point.
I was on a call last month with the COO of a mid-market services business who had been through exactly this. She opened with a number she was proud of: the AI was now closing sixty-eight percent of inbound tickets without a human. Impressive. Then I asked her what happened to the customers in that sixty-eight percent. Did their problem actually get solved, or did the ticket just close? She did not have that number. Nobody did. The system measured whether the conversation ended, not whether the customer was helped.
That gap is the whole article. The companies getting AI customer service wrong are not picking bad tools. They are skipping the decision that comes before the tool. They are treating "support" as one thing to be automated, when it is actually three different kinds of work that each deserve a different answer.
This one is a joint piece because the decision has two halves. I can tell you which work to put the AI on and which work to keep away from it. Chuck can tell you how to wire it so the line between those two never blurs by accident, and how to measure whether any of it is actually working. We have been making this call on our own client engagements for six months now, and we have a framework for it.
The question I get asked is "can AI handle our support?" It is the wrong question, the same way "should we automate this" was the wrong question in Series 1 when we wrote about delivery.
The better question is the one Skip just framed. Support is not one workflow. It is a portfolio of interactions that differ on two axes that actually matter: how clear the correct outcome is, and how much it costs you when the AI gets it wrong. Sort your support volume on those two axes and three tiers fall out on their own.
Automate. Clean rules, clear outcomes, low cost of error. "Where is my order." "Reset my password." "What are your hours." The correct answer exists, it is retrievable, and a wrong answer is a minor annoyance, not a lawsuit. This is where agents earn their keep, and it is a bigger slice of your volume than your team wants to admit.
Augment. The outcome is knowable but not mechanical, and the cost of a wrong answer is real. A billing dispute. A product question where the answer depends on the customer's configuration. Here the AI drafts, retrieves, and proposes, and a human approves before anything goes out the door. The human is faster because the AI did the assembly. The human is still accountable because the AI does not press send.
Leave alone. High stakes, judgment-heavy, often low volume. A churning enterprise account. A grieving customer. A complaint that is one bad reply away from becoming a regulatory problem or a screenshot on social media. The volume is small, so the efficiency upside is small, and the downside is enormous. The math is not close. Keep a human on it, fully, from the first word.
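One way to make that sort concrete is to score each interaction type on the two axes and let the tier fall out of the scores. The sketch below is illustrative only: the 0-to-1 scoring scale, the threshold values, and the name `assign_tier` are assumptions made for the example, not numbers from our engagements. The one design choice worth noting is that cost of error is checked first, so a high-stakes interaction lands in leave-alone no matter how clear its answer looks.

```python
from enum import Enum


class Tier(Enum):
    AUTOMATE = "automate"
    AUGMENT = "augment"
    LEAVE_ALONE = "leave_alone"


def assign_tier(
    outcome_clarity: float,
    cost_of_error: float,
    high_cost: float = 0.8,   # hypothetical cutoff: above this, humans only
    low_cost: float = 0.3,    # hypothetical cutoff: below this, errors are minor
    clear: float = 0.7,       # hypothetical cutoff: answer is retrievable
) -> Tier:
    """Sort an interaction type onto the two axes.

    Both scores are assumed to be in [0, 1] and set by the team, not the vendor.
    """
    # Stakes dominate: an expensive mistake is off limits even if the answer is clear.
    if cost_of_error >= high_cost:
        return Tier.LEAVE_ALONE
    # Clean rules, clear outcome, cheap mistakes: let the agent run.
    if cost_of_error <= low_cost and outcome_clarity >= clear:
        return Tier.AUTOMATE
    # Everything in between gets AI assembly plus human approval.
    return Tier.AUGMENT
```

Scored this way, a password reset (very clear, cheap to get wrong) sorts to automate, a billing dispute sorts to augment, and a churning enterprise account sorts to leave-alone. Any interaction type you cannot score confidently defaults to augment rather than automate.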
Most vendors will not tell you about the third tier, because their pricing model rewards deflection, and the third tier is the volume they cannot deflect. But the third tier is exactly where a careless rollout does the damage that wipes out the savings from the first two.
I want to dwell on that third tier, because it is the one operators are most tempted to skip and the one that has burned the most companies I have talked to.
The instinct, when you have paid for the AI, is to point it at everything. You want the utilization. But customer service is not just a cost center. It is the place where your relationship with a customer is either deepened or destroyed, usually in a single interaction, usually on a bad day for that customer. The interactions in the leave-alone tier are low volume precisely because they are rare and serious. Putting an agent on them does not save you much, because there were not many to begin with. And when one of them goes wrong, it does not cost you a ticket. It costs you the account, or the reputation, or both.
I talked to a founder who learned this the expensive way. Their support AI handled a cancellation request from what turned out to be their third-largest customer as if it were a routine churn ticket. Clean, efficient, automated, and completely deaf to the fact that a quarter-million-dollar relationship was walking out the door over something a human would have escalated in the first thirty seconds. The AI did its job. The job was the wrong one to give it.
Drawing the leave-alone line is not a failure of ambition. It is the thing that lets you be aggressive everywhere else. Once you know which interactions are off limits, you can let the AI run hard on the first two tiers without lying awake about the rare interaction that turns into a crisis. The boundary is what makes the speed safe.
Skip can draw the line in a strategy meeting. My job is to make sure the system actually respects it at two in the morning when the founder is asleep and the volume is real.
This is where the Fortify Gate comes in. In Series 2 we wrote that nothing ships on our side until it passes the Fortify Gate, the checkpoint that asks whether a change is safe to release, not just whether it works. The same gate belongs in customer service, except here it does not run once at release. It runs on every interaction, in real time, deciding which tier this conversation is in and what the AI is allowed to do inside it.
Concretely, that means the routing logic is a first-class part of the system, not a setting buried in a vendor dashboard. Before the AI answers, something classifies the interaction. Is this a tier-one question with a clean answer? The agent handles it. Is there a refund above a threshold, a sentiment signal that says the customer is angry, an account flagged as high-value, a topic on the do-not-automate list? The interaction gets gated. It goes to a human, or it goes to draft-and-approve, and the AI is structurally prevented from acting on its own. The classification is owned by you, lives in your system, and is auditable. It is not a prompt you hope the model honors. It is a gate the model cannot walk around.
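A minimal sketch of a gate like that, in plain code that runs before the model is allowed to reply. Everything named here is hypothetical: the `Interaction` fields, the topic lists, the refund threshold, and the sentiment cutoff are placeholders you would replace with your own rules. Which signals send a conversation to a human and which send it to draft-and-approve is also an assumption made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AI_HANDLES = "ai_handles"
    DRAFT_AND_APPROVE = "draft_and_approve"
    HUMAN_ONLY = "human_only"


@dataclass(frozen=True)
class Interaction:
    topic: str
    refund_amount: float = 0.0
    sentiment: float = 0.0            # assumed scale: -1.0 (angry) to 1.0 (happy)
    high_value_account: bool = False


# Hypothetical rule values; the point is that they live in your code, not a prompt.
DO_NOT_AUTOMATE = frozenset({"cancellation", "legal", "bereavement"})
TIER_ONE_TOPICS = frozenset({"order_status", "password_reset", "hours"})
REFUND_THRESHOLD = 100.0
ANGRY_BELOW = -0.5


def gate(i: Interaction) -> tuple[Route, list[str]]:
    """Classify one interaction and return the route plus auditable reasons."""
    reasons: list[str] = []
    if i.topic in DO_NOT_AUTOMATE:
        reasons.append("topic on do-not-automate list")
    if i.high_value_account:
        reasons.append("account flagged high-value")
    if reasons:
        return Route.HUMAN_ONLY, reasons

    if i.refund_amount > REFUND_THRESHOLD:
        reasons.append("refund above threshold")
    if i.sentiment < ANGRY_BELOW:
        reasons.append("negative sentiment signal")
    if i.topic not in TIER_ONE_TOPICS:
        reasons.append("topic not on tier-one allowlist")
    if reasons:
        return Route.DRAFT_AND_APPROVE, reasons

    return Route.AI_HANDLES, ["tier-one topic, no gate signals"]


def ai_may_send(route: Route) -> bool:
    """The send path checks this; the model never decides it for itself."""
    return route is Route.AI_HANDLES
```

The returned reasons are what make the gate auditable: log them with every interaction and you can later ask why a given conversation was, or was not, stopped. Note that the allowlist fails closed, so a topic nobody has classified goes to draft-and-approve instead of straight to the agent.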
The reason this matters is the same reason it mattered for code. Speed without a gate is just a faster way to ship the wrong thing. We learned in Series 1 that when time-to-delivery collapses, the failures move upstream to the decisions, not the typing. Customer service AI is the same tension, applied to customer interactions instead of releases. The gate is what lets you go fast on the safe volume without the unsafe volume slipping through in the rush.
There is a second engineering problem, and it is the one the COO in Skip's opening call ran into. Most teams measure their support AI on activity. Deflection rate. Tickets closed. Average handle time. Those numbers go up the moment you turn the AI on, which is exactly why they are dangerous. They tell you the AI is busy. They do not tell you the customer was helped.
This is the Telemetry point, and it is the part vendor demos skip. Telemetry is not "did the agent answer faster." It is "did the customer outcome improve." A deflection rate of ninety percent with a true resolution rate of forty percent is not a success. It is containment dressed up as resolution, and the difference shows up two weeks later as repeat contacts, escalations, and churn that nobody traces back to the support bot.
So measure the things that are actually outcomes. Did the issue stay resolved, or did the customer come back within ten days about the same problem? Did CSAT hold on AI-handled interactions, or did it quietly slide while deflection climbed? When the gate routed something to a human, was it the right call, and when it did not, what got through that should have been stopped? Those are the numbers that tell you whether the system is working, and they are the numbers you have to instrument yourself, because the platform you bought is optimized to show you the flattering ones.
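To show how far the two numbers can diverge, here is a rough sketch that computes deflection and a repeat-contact-adjusted resolution rate from the same ticket log. The `Ticket` shape and the `issue_key` field are assumptions made for the example; your help desk export will look different, and matching "the same problem" is usually the hard part. The ten-day window comes from the question above, and the sketch uses the ticket's open time as a stand-in for when it closed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Ticket:
    customer_id: str
    issue_key: str        # hypothetical label for "the same problem"
    opened_at: datetime
    closed_by_ai: bool


def deflection_rate(tickets: Sequence[Ticket]) -> float:
    """Share of all tickets the AI closed without a human. An activity number."""
    if not tickets:
        return 0.0
    return sum(t.closed_by_ai for t in tickets) / len(tickets)


def came_back(ticket: Ticket, tickets: Sequence[Ticket], window: timedelta) -> bool:
    """True if the same customer reopened the same issue inside the window."""
    deadline = ticket.opened_at + window
    return any(
        other.customer_id == ticket.customer_id
        and other.issue_key == ticket.issue_key
        and ticket.opened_at < other.opened_at <= deadline
        for other in tickets
    )


def true_resolution_rate(
    tickets: Sequence[Ticket], window: timedelta = timedelta(days=10)
) -> float:
    """Share of AI-closed tickets with no repeat contact inside the window."""
    ai_closed = [t for t in tickets if t.closed_by_ai]
    if not ai_closed:
        return 0.0
    stayed_resolved = sum(not came_back(t, tickets, window) for t in ai_closed)
    return stayed_resolved / len(ai_closed)
```

Run both functions over the same month of tickets and put the results side by side. The gap between them is the volume that looked deflected but was only contained.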
The whole framework comes down to this. Tiers tell the AI what it is allowed to touch. The gate enforces the boundary in real time. Telemetry tells you the truth about whether any of it helped. Leave out the third and you are flying on the instruments the vendor chose for you.
If you are a founder reading this and thinking customer service is not your problem, look again, because the same three-tier decision shows up the moment you put an AI feature inside your product.
Every AI feature you ship to your users sorts onto the same axes. The autocomplete, the summary, the suggested reply: clean outcome, low cost of error, automate it and let it run. The feature that drafts something the user will send under their own name, or makes a change to their data, or spends their money: the outcome matters and a mistake is expensive, so build the human approval step in on purpose rather than bolting it on after the first angry support ticket. And the thing that touches a user's money, their legal exposure, or their safety: that is your leave-alone tier, and "the model is usually right" is not the standard that protects you there.
Founders get this wrong in the same way operators do. They point the AI at everything because the demo was impressive, and they find the boundary the hard way, in production, with a real user on the other end. The discipline is identical whether the interaction is your support queue or a feature inside your app. Decide the tiers on purpose. Gate the boundary in the system. Measure the outcome, not the activity.
That is the difference between AI that runs a piece of your business and AI you are quietly cleaning up after. The companies pulling ahead are not the ones that automated the most. They are the ones that decided, deliberately, what to automate, what to augment, and what to leave alone, and then built the system to hold that line.
Next week we get to the part of this series I have been wanting to write since we started. The conventional wisdom says big enterprises will win the AI race because they have the budgets and the data. I think that is exactly backwards, and that owner-operators are structurally better positioned for this shift than the enterprises that are supposed to dominate it. We will make that case.
AI that runs your business. Not the other way around.