Klarna: when AI customer support was half-right

The headline opinion

Klarna's customer-support announcement on 27 February 2024 was the cleanest AI launch press release of the year. The numbers were specific. The framing was confident. The OpenAI logo on the co-marketing collateral did half the work. For about ninety days it was the case study every AI vendor pointed at when a buyer asked "where has this actually shipped at scale?"

The 2025 walk-back was quieter, slower, and considerably more instructive. CEO Sebastian Siemiatkowski sat in front of journalists and admitted, in plain language, that going too far on AI had hurt service quality and that Klarna was hiring humans back. That admission is the part of the case that matters. The launch told you what an AI deflection feature can do in its first month. The walk-back tells you what it does in its second year, and what a PM is supposed to do in between.

This case is not a victory lap and it is not a takedown. Klarna shipped something real, measured some of it honestly, measured some of it loosely, and then — to the company's credit — corrected the loose parts in public. The lesson is in the gap between the two halves of the story. That gap is where the judgment lives.

The decision

In early 2024 Klarna's customer-service operation was an obvious cost line. The company was preparing to file for a US IPO (a process that would eventually open in 2025), and its cost structure looked unattractive on a public-markets income statement: a global consumer fintech operating in 23 markets, supporting customers in more than 35 languages, with seasonal volume that peaked viciously around Black Friday and the December holidays. Customer support was outsourced to a major BPO. The contract was visible. The unit economics were not flattering.

Klarna's leadership made a deliberate choice. Rather than tune the BPO contract, ship a marginal self-service overhaul, or wait for the technology to mature, they would put OpenAI's models directly on the customer-service front line — as the first point of contact, in every supported language, across every consumer-facing surface where a customer might open a chat. The AI assistant would not be positioned as a help-centre widget or a deflection wedge in front of a contact form. It would be positioned as the agent. Humans would sit behind it, not in front of it.

The announcement on 27 February 2024, published on klarna.com, put numbers around what that decision had produced in the first month of live operation:

The AI assistant had handled 2.3 million customer-service conversations — two-thirds of all customer-service chats in the period.
It was doing the equivalent work of 700 full-time agents, in Klarna's framing.
It was live in 23 markets, available 24/7, and conversing in more than 35 languages.
Klarna estimated the assistant would drive a $40 million USD improvement in profit in 2024.
Average resolution time on a chat dropped from 11 minutes to under 2 minutes.
Repeat inquiries — the share of customers who came back with the same problem within a window — were down 25%.
Customer satisfaction was described as being "on par with human agents."

This is a specific list. It is also, on close reading, two lists smuggled into one. The first six bullets are operational metrics that Klarna's data warehouse can answer with a SQL query: conversation count, languages, resolution time, repeat-inquiry rate, projected cost savings. The seventh bullet — customer satisfaction "on par with human agents" — is the one that does almost all of the narrative work in the press release and is, by some distance, the softest claim in the document. Hold on to that asymmetry. It is the entire case.

What was actually strong

Several of the things Klarna shipped were genuinely good product work and should not be discounted by the later walk-back.

The scoping decision was right. Klarna did not ask the model to do everything. They sent the routine, well-known classes of question to the AI (order status, refund timeline, payment plan changes, "how do I update my card," "where is my parcel") and kept humans in the loop for the cases where the answer was not in a help-centre article. That is the textbook way to ship an AI deflection feature, and most companies that tried to copy Klarna in 2024 got that part wrong. The scoping is the judgment. The model is the commodity.

The multilingual reach was real. Customer support in 35+ languages, 24/7, across 23 markets is a hard product to staff with humans even if you have the budget. The marginal cost of adding a language to a model-backed assistant is close to zero once the system prompt and the retrieval corpus are translated, while adding a language with humans means a recruiting funnel, a training programme, and a shift schedule. This is the kind of job where AI's economics are genuinely different in kind, not just in degree.

The 11-minutes-to-2-minutes resolution drop is credible. This is the metric that most directly maps to a measurable user benefit. A customer seeking a refund clarification at 11pm in Stockholm now has the chat resolved in under two minutes instead of eleven or, far more likely at that hour, instead of waiting until tomorrow for an email reply. Time to a meaningful answer is the dimension of customer support most worth optimising, and the AI clearly moved it.

The 25% reduction in repeat inquiries is the metric I would have led with. Repeat-inquiry rate is the closest thing customer support has to a true quality signal. If a customer comes back within a week with the same problem, the first contact did not resolve it — regardless of what the customer marked in the post-chat survey. A 25% drop, if measured cleanly, is a meaningful claim about resolution quality, not just deflection. Klarna's communications team buried this number behind the 700-agents headline. The PM-side reading of the same document leads with this number and treats the agent-equivalent figure as the cost-savings annotation underneath.
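
"If measured cleanly" depends on a definition Klarna did not publish. As a minimal sketch of one possible definition, assume a contact log keyed by customer and issue, and a seven-day window. The `Contact` fields, the `issue_key` label, and the window length are all illustrative assumptions, not Klarna's method.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Contact(NamedTuple):
    customer_id: str
    issue_key: str        # hypothetical label identifying "the same problem"
    opened_at: datetime


def repeat_inquiry_rate(contacts, window=timedelta(days=7)):
    """Share of contacts followed by another contact from the same customer
    about the same issue within `window` of the earlier one."""
    if not contacts:
        return 0.0
    # Group contact times by (customer, issue).
    grouped = {}
    for c in contacts:
        grouped.setdefault((c.customer_id, c.issue_key), []).append(c.opened_at)
    repeated = 0
    for times in grouped.values():
        times.sort()
        # A contact counts as unresolved if the next one arrives inside the window.
        repeated += sum(later - earlier <= window
                        for earlier, later in zip(times, times[1:]))
    return repeated / len(contacts)
```

The choice that matters is the window and the matching rule for "same problem": a shorter window or a stricter match lowers the rate without any change in service quality, which is why the definition belongs next to the number.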

What was studiously soft

The "customer satisfaction on par with human agents" claim is where the press release stops being a product document and starts being an IPO document.

Klarna did not, at any point, publish: the survey instrument, the response rate, the CSAT scale (4-point, 5-point, NPS-style), the response-bias correction, the segmentation by issue type, the comparison cohort that "human agents" referred to (Klarna's own pre-AI baseline? the BPO's contracted SLA? the industry average?), or the breakdown by language. "On par with" is a phrase that, in the absence of any of that supporting detail, is doing very heavy lifting. A diligent reader of the press release in February 2024 could have noted that every other number had a definition attached and only this one did not. That is usually a signal.

This matters because the comparison the AI was being measured against — humans answering on an off-the-shelf BPO contract under cost pressure — is not a high bar. Beating that baseline on a satisfaction survey is genuinely possible, particularly on speed and availability, while still leaving substantial unhappiness on the resolution-quality dimension that the survey instrument may not be picking up. "On par" can mean both "as good as humans on every dimension that matters" and "as good as humans on the dimensions we measured, which were the easy ones." The press release was compatible with both readings and committed to neither.
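
The two readings of "on par" can be made concrete with a small sketch. The numbers below are invented for illustration on an assumed 5-point scale; nothing here is Klarna data. It shows how a blended satisfaction score can match the human baseline while one issue type lags badly.

```python
def blended_mean(segments):
    """segments maps issue type -> (responses, score_sum) on one survey scale."""
    responses = sum(n for n, _ in segments.values())
    return sum(total for _, total in segments.values()) / responses


def per_segment_mean(segments):
    """Mean score for each issue type separately."""
    return {issue: total / n for issue, (n, total) in segments.items()}


# Hypothetical 5-point CSAT data, invented for illustration only.
human = {"routine": (100, 400), "complaint": (100, 400)}
ai = {"routine": (180, 756), "complaint": (20, 44)}

print(blended_mean(human), blended_mean(ai))          # same blended score
print(per_segment_mean(human)["complaint"],
      per_segment_mean(ai)["complaint"])              # very different on complaints
```

In this made-up example the AI's survey responses are dominated by routine chats, so the blended figure hides the complaint segment. A published segmentation by issue type is what would let a reader tell the two readings apart.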

The other quietly missing number is the escalation rate — what percentage of AI-handled chats were eventually routed to a human, with what latency, and at what satisfaction cost to the customer who had now been on the platform for fifteen minutes before reaching the person who could actually help. Klarna reported the two-thirds-handled figure but not the two-thirds-resolved figure. In customer support, those are different numbers. Handled means the conversation closed in the AI's session. Resolved means the underlying customer problem went away and did not come back. The repeat-inquiry metric is the closest proxy, and it improved — but a clean public eval would have published the escalation funnel end to end. Klarna did not.
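
A minimal sketch of what an end-to-end escalation funnel could report, assuming each chat record carries an escalation flag, the minutes spent before escalation, and an optional survey score. The `ChatRecord` fields and the output keys are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class ChatRecord:
    escalated: bool
    minutes_before_escalation: Optional[float] = None  # set only for escalated chats
    csat: Optional[int] = None                         # None when the survey went unanswered


def _mean_csat(records):
    scores = [r.csat for r in records if r.csat is not None]
    return mean(scores) if scores else None


def escalation_funnel(records):
    """Summarise handled vs escalated chats; assumes `records` is non-empty."""
    escalated = [r for r in records if r.escalated]
    contained = [r for r in records if not r.escalated]
    esc_csat, con_csat = _mean_csat(escalated), _mean_csat(contained)
    latencies = [r.minutes_before_escalation for r in escalated
                 if r.minutes_before_escalation is not None]
    return {
        "handled_rate": len(contained) / len(records),
        "escalation_rate": len(escalated) / len(records),
        "mean_minutes_before_escalation": mean(latencies) if latencies else None,
        # A negative delta means escalated customers were less satisfied.
        "csat_delta_escalated_vs_contained": (
            esc_csat - con_csat
            if esc_csat is not None and con_csat is not None else None
        ),
    }
```

Note that "handled" here still only means the chat closed inside the AI's session. Whether the problem stayed solved needs the repeat-inquiry signal on top of this funnel.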

What the press said vs. what the room was saying

The press cycle in February and March 2024 was extraordinarily favourable. The TechCrunch and Bloomberg pieces led with the 700-agents number. The Wall Street Journal framed it as the first major enterprise demonstration of OpenAI's models in a frontline customer-facing role at scale. OpenAI's own blog amplified the same numbers. The figure of "$40M in 2024 profit improvement" travelled to every "what AI will do to white-collar work" essay published over the following six weeks.

What was quieter, and what insiders close to Klarna's customer-experience function were saying, was a second, more nuanced version of the story. Some of it was structural. Klarna had laid off a meaningful portion of its workforce in 2022 as the company restructured ahead of public-markets readiness, and a hiring freeze had been in place since. The AI launch was not primarily replacing humans who had been there a month earlier. It was filling a gap that the cost-cutting had already created. That is a different story from "AI replaces 700 people," and it travels less well in a headline.

There was also the IPO incentive. Investors evaluating a fintech IPO in 2024 wanted to see two things on the deck: an AI story and an operating-leverage story. The Klarna announcement, intentionally or not, served both. The company was telling a credible story about variable cost lines being converted to fixed model-inference costs that scale sub-linearly with volume — exactly the financial transformation public-market investors reward. This does not mean the numbers were wrong. It means the numbers had a beneficiary inside the company beyond the product team, and that beneficiary had a strong say in how the numbers were framed.

The walk-back

By May 2025, in interviews with Bloomberg and reporting that ran in the Financial Times, Siemiatkowski had publicly changed his tune. The phrasing was careful but the substance was clear. Cost-cutting "had gone too far." Service quality was suffering. Klarna was hiring humans back into customer-support roles. The framing shifted from "AI replaces agents" to "AI handles tier-1 questions; humans handle anything that requires judgment, empathy, or a complaint that has gone past the first round."

A few things are worth being precise about, because the walk-back was misreported as a retreat from AI and it was not.

Klarna did not turn the AI assistant off. The deflection bot stayed. The volume it handles is still a real fraction of total contacts. What changed is the routing logic and the threshold at which a human is brought in. The system in 2025 is closer to "AI as triage, human as the resolution layer for anything non-trivial" than to the 2024 framing of "AI as the agent."

Klarna did not reverse the cost savings narrative. The company has continued to claim meaningful operating-cost reduction from AI in customer support and across other functions. The walk-back was specifically on the customer-experience claim, not the unit-economics claim. Those are separable, and Klarna kept them separate in the climbdown.

The judgment-level statement matters more than the operational change. Siemiatkowski's "I'm a bit worried about the quality" is the rarest thing in enterprise-AI communications: a senior leader at a public-facing tech-adjacent company publicly stating that they over-rotated on AI, in a market environment that punishes that admission. The candour is unusual. The fact that it took roughly a year for the data to force it into the open is the part that should make every PM uncomfortable.

What a PM should take from this

The Klarna case is the cleanest available example of the gap between deflection rate and resolution rate — and of what happens when an organisation, an IPO timeline, and a press cycle all reward the first metric and quietly under-measure the second.

The first lesson is mechanical. Deflection rate is the number you can measure on day one. Resolution rate is the number that determines whether the feature is good. A chatbot can deflect 100% of inbound chats by responding to every one with "thank you for contacting us, your issue has been logged." Deflection is trivially gameable; resolution is not. The metric you publish in the press release is, in most cases, the metric that the AI feature is most likely to be over-optimising for. Build the eval discipline that measures the harder number before you ship the easier one. (See Eval Before Launch Rule ai-30 on offline-vs-online evals, and Rule ai-29 on regression evals for every prompt change.)
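
The canned-reply thought experiment can be written down directly. This is a sketch under assumed definitions: a chat is "deflected" if it never reaches a human, and "resolved" if it never reaches a human and the problem does not come back. The `Chat` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chat:
    escalated_to_human: bool   # the chat left the AI's session
    problem_came_back: bool    # the customer returned with the same issue


def deflection_rate(chats):
    """Share of chats that closed without reaching a human."""
    return sum(not c.escalated_to_human for c in chats) / len(chats)


def resolution_rate(chats):
    """Share of chats closed by the AI where the problem did not come back."""
    return sum(not c.escalated_to_human and not c.problem_came_back
               for c in chats) / len(chats)


# A bot that answers everything with a canned acknowledgement:
# nothing escalates, and every customer has to come back.
canned_bot = [Chat(escalated_to_human=False, problem_came_back=True)] * 10
print(deflection_rate(canned_bot), resolution_rate(canned_bot))
```

Both functions read the same log. The only difference is the second condition, and that condition cannot be evaluated on day one because it needs the follow-up window to elapse.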

The second lesson is about the soft claim in a sea of hard claims. Klarna's press release had six measurable bullets and one studiously hedged one. The hedged one — "on par with human agents" on satisfaction — was the load-bearing claim for the narrative. A diligent product person reading any AI launch document should, as a habit, identify the bullet that has no definition under it and assume that is the bullet doing the political work. If you are writing the launch document yourself: either publish the definition or do not make the claim. The third option — making the claim without the definition because you do not yet have it — is the option that will get you a walk-back twelve months later.

The third lesson is about AI as a cost-saving wedge versus AI as a customer-experience wedge. These are different products built by different teams answerable to different stakeholders. Klarna's 2024 announcement attempted to be both. The cost-saving story was probably true. The customer-experience story was probably oversold. A PM who has been asked to ship "AI in support" should know, going in, which of those two products they are actually building, and should resist the executive temptation to claim both outcomes from a single deployment. If the budget came from the CFO, you are building the first product. If the budget came from the head of customer experience, you are building the second. They have different success metrics and they should be measured against different baselines.

The fourth lesson is about the announcement incentive in a pre-IPO window. Klarna's announcement was not a lie. It was a real product launch with real metrics. But the timing, the framing, and the choice of which number led the headline were calibrated for an audience that included investors as well as customers. AI initiatives at any company within twelve months of a fundraise, an IPO, or a quarter-end earnings call carry that incentive whether anyone says it out loud. The PM's job is to make sure the product numbers and the narrative numbers are the same set, even when the communications team would prefer otherwise.

The fifth lesson is about the eval gap on edge cases. The publicly documented eval methodology for Klarna's launch consisted, as far as the press release reveals, of an internal CSAT survey and operational telemetry. There is no public mention of an adversarial eval set, a refusal eval, a hallucination-rate measurement, a multilingual quality breakdown, or a regression suite catching prompt-change drift across the 35 supported languages. Edge cases are where customer-support AI fails ugliest — fraud claims, complaints about charging errors, regulatory-sensitive questions in different jurisdictions, customers in genuine financial distress. A clean public eval covering these classes would have made the launch claim defensible. Its absence is what made the walk-back inevitable.

What you would do differently

If a PM came to me in 2026 about to launch an AI customer-support deflection feature, the Klarna case generates a short list of moves to make and not make. None of them are exotic. All of them are uncomfortable on the timeline most companies want to ship on.

Publish the escalation funnel, not just the handled rate. The two-thirds-handled number is a vanity metric without the share that escalated, the average latency before escalation, and the satisfaction delta between escalated and non-escalated chats. If you cannot publish those, you are not ready to claim the headline number.

Define "satisfaction" before you measure it. Decide whether you are measuring CSAT, NPS, or first-contact resolution, and decide whether your baseline is your own pre-AI numbers or an external benchmark, before any data starts flowing. Comparison cohorts chosen after the fact are the most common source of misleading AI claims.

Build the wave-two eval set across languages. A hundred cases per supported language, stratified by routine / edge / refusal / adversarial, with the regression suite running on every prompt change. This is non-negotiable for a 35-language deployment. Without it you will silently regress on Polish and Portuguese for six weeks and only notice when the support inbox fills up with the same complaint in two languages. (See Eval Before Launch for the three-wave structure.)
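
One way to keep that discipline honest is a coverage check that runs before the regression suite does. The sketch below assumes eval cases are tagged with a language code and one of the four strata named above. The thresholds and the function name are illustrative assumptions.

```python
from collections import Counter

STRATA = ("routine", "edge", "refusal", "adversarial")


def coverage_gaps(cases, languages, min_per_language=100, min_per_stratum=1):
    """cases: iterable of (language, stratum) pairs.
    Returns {language: [problem descriptions]} for under-covered languages."""
    counts = Counter(cases)
    gaps = {}
    for lang in languages:
        problems = []
        total = sum(counts[(lang, s)] for s in STRATA)
        if total < min_per_language:
            problems.append(f"only {total} cases, need {min_per_language}")
        for s in STRATA:
            if counts[(lang, s)] < min_per_stratum:
                problems.append(f"stratum '{s}' has {counts[(lang, s)]} cases")
        if problems:
            gaps[lang] = problems
    return gaps
```

A language with a hundred routine cases and nothing else passes a raw count and fails this check, which is the point: the total alone says nothing about whether the edge, refusal, and adversarial classes are represented.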

Publish the cost story and the quality story separately. Two press releases. One for the CFO audience: "AI customer support has reduced our contact-centre cost by X% with the following quality controls in place." One for the customer audience: "Here is how our AI assistant works, here are the questions it handles well, here are the questions where you will always get a human, and here is how to reach a human at any time." Conflating them is what creates the conditions for a walk-back.

Decide the rollback trigger in writing, before launch. "If repeat-inquiry rate goes up rather than down, we roll back to human-first routing within seven days" is a decision a leadership team can make calmly in February. It is a decision they cannot make calmly in May, when the press has already written the launch story. Pre-committing to the trigger is the single highest-leverage move in this entire playbook.
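
"In writing" can be as literal as a frozen record checked into the repository before launch. A minimal sketch follows, using the example trigger from the paragraph above. The baseline value is a placeholder, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    metric: str
    baseline: float            # pre-launch value, agreed and written down
    rollback_within_days: int
    fallback: str

    def breached(self, observed: float) -> bool:
        # The trigger fires when the metric is worse (higher) than the baseline.
        return observed > self.baseline


# Baseline is a placeholder; metric, deadline, and fallback follow the example in the text.
TRIGGER = RollbackTrigger(
    metric="repeat_inquiry_rate",
    baseline=0.20,
    rollback_within_days=7,
    fallback="human-first routing",
)
```

The dataclass is frozen so the threshold cannot be quietly adjusted at runtime once the numbers come in. Changing it requires a visible edit to the record itself.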

// from the field

The Klarna walk-back coincided with, and was reinforced by, a broader industry recalibration in 2025 about AI customer-support deployments. Air Canada had been ordered by a tribunal to honour a refund the airline's chatbot had hallucinated. DPD had taken its support chatbot offline after the bot swore at customers and wrote a poem about how bad DPD was. McDonald's had ended its drive-through AI pilot with IBM after viral failure videos. None of these are Klarna's failure mode (Klarna's bot did not melt down in public), but they reset buyer-side expectations: "AI handles X% of contacts" is now read as a launch metric rather than a finish line. The companies still telling clean AI-support success stories in 2026 are the ones who published their eval methodology alongside their deflection rate. The ones that told deflection-only stories in 2024 are no longer telling them.

What this case teaches

A condensed list of the rules this case reinforces in the AI Manual. Each one is referenced where it is treated in depth.

Deflection rate is not resolution rate. Reinforces When AI Is the Right Answer (and When It Isn't) Rule ai-4 (define the cost of a wrong answer before you ship) and Eval Before Launch Rule ai-30 (offline evals predict, online evals confirm; you need both).
The soft claim in a hard-claim list is the load-bearing claim. Reinforces Eval Before Launch Rule ai-31 (write down the launch threshold before you measure).
Scoping is the judgment; the model is the commodity. Reinforces When AI Is the Right Answer (and When It Isn't) Rule ai-1 (use AI for unstructured inputs that require judgment) — Klarna's scoping was textbook, which is what made the parts that did work, work.
Cost-saving wedges and customer-experience wedges are different products. Reinforces When AI Is the Right Answer (and When It Isn't) Rule ai-6 (know whether you are in AI-as-feature or AI-as-product territory) and Rule ai-2 (translate every AI request into a specific user job before you commit a sprint).
Multilingual AI without a per-language eval set is a regression waiting to happen. Reinforces Eval Before Launch Rule ai-26 (build evals in three waves — ten, hundred, thousand).
The pre-IPO / pre-fundraise window distorts the launch narrative. Reinforces When AI Is the Right Answer (and When It Isn't) Rule ai-8 (the removal test — would a single customer complain if the AI were removed?). A press release that reads better than the underlying numbers is its own warning sign.
Pre-commit the rollback trigger in writing. Reinforces Eval Before Launch Rule ai-31 — the threshold and the rollback are the same artifact, written down before the launch ships.
Public candour about over-rotation is a reputational asset, not a liability. Siemiatkowski's 2025 admission is the part of the case that will age best. The leaders who get caught up in the AI cycle and refuse to publicly correct will look worse, on a five-year horizon, than the ones who publicly recalibrate within twelve months of shipping.

The compressed version: Klarna shipped the right product, measured half of it honestly, oversold the other half, and corrected in public. That is a better outcome than most AI launches you will see in 2026. It is also a worse outcome than the one a disciplined eval-and-rollback regime would have produced. The gap between those two outcomes is the chapter your team has not written yet. Go write it before you ship.