May 28, 2026

AI chatbots for customer service: what actually works after 90 days in production

Ninety days of running customer service bots in production. What survived, what got ripped out, and the design choices that decide whether your bot helps or annoys.

KEY TAKEAWAYS

↳The bot that works is boring: tight scope, real escalation, logs you actually read every Monday.
↳Retrieval quality matters more than model choice. Bad chunks will sink a good model every time.
↳Free tiers and open-source repos get you to a demo, not to 90 days. Plan for the boring infra.
↳Measure deflection and CSAT together, or you'll optimize for ignoring customers faster.
↳Most failures we saw were product problems wearing a chatbot costume.

We’ve had a handful of customer service bots running in production for clients for about three months now. Some are quietly handling 60% of tickets. One we turned off after six weeks. The difference between those outcomes had almost nothing to do with which model we picked.

This is the honest debrief. What worked, what didn’t, and the stuff nobody puts in the demo video.

The first 30 days are a lie

Every AI chatbot for customer service looks great in week one. You wire it up to your docs, ask it five questions you already know the answers to, and it nails them. Stakeholders are happy. The CEO posts on LinkedIn.

Then real users show up.

Real users ask things like “hey is my order from last tuesday the one with the blue thing” with no order number, no email, no context. They paste screenshots. They write in three languages in one message. They yell. The clean Q&A demo dataset you tested with does not survive contact with a Tuesday afternoon.

What we found around day 20-30 across three deployments: the bot’s accuracy on real conversations is roughly 40-50% of its accuracy on the eval set you built before launch. Plan for that gap or you’ll be in a meeting explaining why the bot told a customer to email a support address that hasn’t existed since 2022.

What actually moved the needle

Four things, in order of how much they mattered:

1. Retrieval quality. Not the model. The chunks.
2. A real escalation path with context handoff.
3. Scope discipline. Say no to questions you can’t answer well.
4. A weekly review of failed conversations by a human who can fix things.

The model choice was almost a tiebreaker. We ran the same support bot on Claude 3.5 Sonnet and GPT-4o for a month on parallel traffic. The CSAT difference was inside the noise. The conversations where the bot failed were failing for the same reason on both models: it didn’t have the right context, or the right context was buried in a 40-page PDF that got chunked badly.
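For readers who want a concrete way to judge whether a CSAT gap between two arms is "inside the noise", one common check is a two-proportion z-test on the share of positive ratings. The sketch below is an assumption about how such a check could look, not a description of how the comparison above was scored. The function name and the counts are made up for illustration.

```javascript
// Two-proportion z-test on "share of positive CSAT ratings" for two arms.
// One common way to check for noise; not necessarily the method used in the post.
function twoProportionZ(positiveA, totalA, positiveB, totalB) {
  const pA = positiveA / totalA;
  const pB = positiveB / totalB;
  // Pooled rate, assuming both arms share one true underlying rate.
  const pooled = (positiveA + positiveB) / (totalA + totalB);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / totalA + 1 / totalB)
  );
  return (pA - pB) / standardError;
}

// Made-up counts for illustration only; these are not figures from the deployments.
const z = twoProportionZ(412, 500, 401, 500);

// |z| below 1.96 means the gap is not significant at the usual 5% level.
const insideNoise = Math.abs(z) < 1.96;
console.log({ z: z.toFixed(2), insideNoise });
```

With a few hundred ratings per arm, a gap of two or three CSAT points will usually fail this test, which is why a month of parallel traffic can still end in a tie.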

If you’re picking between a customer service AI chatbot from a vendor and building your own, this is the actual decision point. Do you have someone who will own the retrieval layer and look at logs every week? If yes, build. If no, buy something with a managed knowledge base and a human in the dashboard.

Retrieval is the whole game

The boring truth: most “the AI hallucinated” complaints we investigated were retrieval failures, not generation failures. The model dutifully answered based on the chunks it got handed. The chunks were wrong.

A few things that helped:

Chunking by semantic section, not by token count. We use markdown headings as natural boundaries where possible.
Storing the source URL with every chunk and surfacing it in the response. Users trust answers with a link. Engineers can debug.
A small reranker pass before the final generation. We’ve used Cohere’s Rerank 3 for this. It’s cheap and it noticeably reduces “close but wrong” answers.
Keeping a separate index for evergreen docs vs. policy that changes. Pricing and refund policy especially. You do not want the bot quoting last quarter’s return window.
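The first two points in the list above can be sketched in a few lines. This is a minimal illustration, assuming the docs are markdown with ATX-style `#` headings. The function name `chunkByHeadings` and the chunk fields are hypothetical, not the code running in these deployments.

```javascript
// Hypothetical helper: split a markdown document into one chunk per heading
// section, and attach the source URL to every chunk so it can be cited later.
function chunkByHeadings(markdown, sourceUrl) {
  const result = [];
  let current = { heading: "(intro)", lines: [] };

  // Push the section collected so far, skipping sections with no body text.
  const flush = () => {
    const text = current.lines.join("\n").trim();
    if (text.length > 0) {
      result.push({ heading: current.heading, text, sourceUrl });
    }
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section before starting a new one
      current = { heading: match[2].trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return result;
}

const doc = [
  "# Returns",
  "Items can be returned within the posted window.",
  "",
  "## Refund timing",
  "Refunds go back to the original payment method.",
].join("\n");

const returnChunks = chunkByHeadings(doc, "https://example.com/help/returns");
console.log(returnChunks.map((c) => c.heading));
```

One caveat with this naive version: a line starting with `#` inside a fenced code sample would be treated as a heading, so real docs need a guard for that. Very long sections may also need a secondary split.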

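// rough shape of what runs on every message
// 1. Cast a wide net: pull the 20 nearest chunks from the vector index.
const chunks = await vectorSearch(query, { topK: 20 });
// 2. Rerank those against the query and keep only the 5 best.
const reranked = await rerank(query, chunks, { topK: 5 });
// 3. Generate from the reranked context, recent history, and the tools.
const answer = await generate({
  system: SUPPORT_SYSTEM_PROMPT,
  context: reranked,
  history: conversation.slice(-6), // last 6 messages only
  tools: [lookupOrder, escalateToHuman, scheduleCallback]
});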

That escalateToHuman tool is not decoration. It gets called on roughly 1 in 5 conversations in the bots we’ve shipped, and getting that handoff right is what makes the difference between “the bot is helpful” and “the bot is a wall.”
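What "getting the handoff right" might look like in data terms: the human agent should receive the open question, the recent transcript, and the docs the bot was reading, so the customer never has to start over. The sketch below is one possible payload, assuming messages shaped like `{ role, text }` and chunks that carry a `sourceUrl`. Every field name and reason code is hypothetical.

```javascript
// Hypothetical payload an escalateToHuman tool might hand to the ticketing
// system. All field names and reason codes are assumptions for illustration.
function buildHandoff(conversation, reasonCode, retrievedChunks) {
  const userMessages = conversation.filter((m) => m.role === "user");
  return {
    reasonCode, // e.g. "out_of_scope", "user_requested", "low_confidence"
    // The customer's latest message, so the agent sees the open question first.
    lastUserMessage:
      userMessages.length > 0 ? userMessages[userMessages.length - 1].text : null,
    // Recent turns verbatim, so the customer is not asked to repeat themselves.
    transcript: conversation.slice(-6),
    // Deduplicated docs the bot consulted, so a stale source is easy to spot.
    sourcesShown: [...new Set(retrievedChunks.map((c) => c.sourceUrl))],
    escalatedAt: new Date().toISOString(),
  };
}

const sampleConversation = [
  { role: "user", text: "Where is my order?" },
  { role: "assistant", text: "Can you share the order number?" },
  { role: "user", text: "I don't have it. Can I talk to a person?" },
];
const sampleRetrieved = [
  { sourceUrl: "https://example.com/help/shipping", text: "..." },
  { sourceUrl: "https://example.com/help/shipping", text: "..." },
  { sourceUrl: "https://example.com/help/orders", text: "..." },
];

const handoff = buildHandoff(sampleConversation, "user_requested", sampleRetrieved);
console.log(handoff.reasonCode, handoff.sourcesShown);
```

Recording a reason code at the moment of escalation is also what makes a breakdown of escalations by cause possible later.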

The free and open-source question

People keep asking whether a free tier will do for a customer service chatbot, or pointing at an open-source repo on GitHub and asking if they can just run that. Short answer: yes, you can get to a demo in a weekend with Chatwoot plus an LLM, or with one of the LangChain support templates, or with Botpress’s free tier.

Getting to 90 days in production is a different sport.

The free version doesn’t include the part where someone notices the bot has been confidently telling Brazilian customers the wrong shipping times for a week. It doesn’t include the queue when your traffic spikes during a launch. It doesn’t include the SOC 2 review your enterprise customer is going to ask for. For a small business with 50 tickets a week, a managed tool with a free or cheap tier is genuinely fine. For anything past that, the cost is in the operations, not the software license.

What we measure now

We used to track deflection rate alone. Bad idea. A bot that confidently says “I can’t help with that, goodbye” has a 100% deflection rate and a 0% satisfaction rate.
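The arithmetic behind that trap is easy to show. In the sketch below, each conversation record has two hypothetical flags: `escalated` (a human took over) and `resolved` (the user confirmed the problem was solved). Resolution stands in for satisfaction here, which is a simplifying assumption.

```javascript
// Compute deflection and resolution side by side from conversation logs.
// The record shape ({ escalated, resolved }) is an assumption for illustration.
function deflectionAndResolution(conversations) {
  const total = conversations.length;
  if (total === 0) return { deflectionRate: 0, resolutionRate: 0 };
  // Deflected: no human was involved, whether or not the user got help.
  const deflected = conversations.filter((c) => !c.escalated).length;
  // Resolved by the bot: no human involved AND the user confirmed the fix.
  const resolved = conversations.filter((c) => !c.escalated && c.resolved).length;
  return { deflectionRate: deflected / total, resolutionRate: resolved / total };
}

// A "wall" bot: never escalates, never solves anything.
const wallBot = deflectionAndResolution([
  { escalated: false, resolved: false },
  { escalated: false, resolved: false },
  { escalated: false, resolved: false },
]);

// A more honest bot: solves two, fails one, hands one to a human.
const honestBot = deflectionAndResolution([
  { escalated: false, resolved: true },
  { escalated: false, resolved: true },
  { escalated: false, resolved: false },
  { escalated: true, resolved: false },
]);

console.log({ wallBot, honestBot });
```

The wall bot wins on deflection and loses on everything that matters, which is why the two numbers only mean something when read together.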

The dashboard we actually look at every Monday:

Resolution rate (user confirmed the answer solved their problem, via a thumbs up or a follow-up classifier)
Escalation rate, with reason codes
Time-to-first-response on escalated tickets (the bot makes this worse if you’re not careful)
CSAT on bot-only conversations vs. bot-then-human conversations
A sampled set of 20 random conversations a human reads end to end

That last one is the most valuable thing on the list and the easiest to skip. You will find product bugs, broken links, and policy contradictions every single week. The chatbot is the most honest user research tool you’ve ever deployed, because it logs everything verbatim and never gets tired.
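Two of the dashboard items are simple enough to sketch: the escalation tally by reason code and the random sample of 20. This is a minimal version, assuming conversation records with hypothetical `escalated` and `reasonCode` fields. The sampler takes an injectable random source so it can be tested.

```javascript
// Count escalated conversations per reason code (codes are hypothetical).
function escalationsByReason(conversations) {
  const counts = {};
  for (const c of conversations) {
    if (!c.escalated) continue;
    const reason = c.reasonCode || "unspecified";
    counts[reason] = (counts[reason] || 0) + 1;
  }
  return counts;
}

// Pick n conversations uniformly at random, without replacement, using a
// partial Fisher-Yates shuffle on a copy so the input array is not mutated.
function sampleForReview(conversations, n = 20, rng = Math.random) {
  const pool = conversations.slice();
  const take = Math.min(n, pool.length);
  for (let i = 0; i < take; i++) {
    // Choose a random index from the not-yet-picked tail [i, pool.length).
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, take);
}

// Fifty fake conversations; every fifth one escalated with a reason code.
const weekOfLogs = Array.from({ length: 50 }, (_, id) => ({
  id,
  escalated: id % 5 === 0,
  reasonCode: id % 10 === 0 ? "out_of_scope" : "user_requested",
}));

const reasonCounts = escalationsByReason(weekOfLogs);
const reviewSet = sampleForReview(weekOfLogs);
console.log(reasonCounts, reviewSet.length);
```

Sampling at random matters here. Reading only the escalated or thumbs-down conversations hides the cases where the bot was confidently wrong and nobody complained.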

The honest unresolved part

I still don’t know how to feel about voice. We’ve prototyped a couple of voice support agents on top of the same retrieval stack and they’re impressive in the demo and exhausting in practice. Latency matters more than I expected, interruption handling is harder than it looks, and customers seem to either love it or hang up in 15 seconds with no middle ground. We’re still figuring out where it actually belongs versus where it’s just a party trick.

The other thing I keep going back and forth on: how much personality to give these bots. Too little and they feel like a worse search box. Too much and they feel uncanny when they have to escalate. The clients whose bots people actually like landed somewhere boring and competent. Not friends. Not robots. Just a useful coworker who knows the docs.

If you’re thinking about putting one of these in front of your customers and want a second opinion before you do, come talk to us. We’ll tell you if we think you should build it, buy it, or wait six months.

FREQUENTLY ASKED

Common questions

▸How long does it take to get an AI customer service chatbot to actually work well?

Plan for a real 60-90 day curve. The first two weeks are easy: connect docs, run evals, ship. The next eight are where you find the gaps. Expect resolution rate to start around 30-40% on real traffic and climb into the 60s with weekly review of failed conversations, better chunking, and a real escalation path. Anyone promising production-quality in a week is showing you a demo, not a deployment.

▸Is there a free AI chatbot for customer service that's good enough for a small business?

For genuinely small volume, 20-50 tickets a week, the free tiers of Chatwoot, Tidio, or Intercom's Fin starter, paired with an LLM key, can work. You'll outgrow it the moment you need custom logic, real order lookups, or multilingual support. The software is rarely the bottleneck. The bottleneck is whoever has to read the logs and improve the answers.

▸Should I build on top of an open-source GitHub project or use a vendor?

Build if you have an engineer who will own retrieval, evals, and weekly conversation review. Repos like Chatwoot, Botpress, and LangChain templates get you to a working prototype fast. Buy if you don't have that person. The hidden cost of building is not the code, it's the ongoing operations: knowledge base updates, prompt tuning, monitoring escalations, and fixing the product bugs your bot is going to surface.

▸What's the realistic deflection rate I should expect?

For a well-scoped support bot with decent docs, 40-65% of conversations can end without a human, depending on how complex your product is. SaaS with strong docs lands higher. Anything with billing, account-specific data, or shipping tends to land lower. If a vendor quotes 80%+ deflection in their pitch, ask exactly how they're counting it. "User stopped replying" is not the same as "problem solved."

▸What kills these projects most often?

Three things, in order. One: nobody owns the bot after launch, so the docs go stale and quality decays. Two: the escalation path is broken, so frustrated users blame the company, not the bot. Three: the team optimized for deflection instead of resolution, and customers learned the bot is a wall to get around. The model choice is rarely in the top five.

6/27/2026