Harvey: vertical AI for a high-stakes profession (case study)

Legal work is among the places where AI failure costs the most. A wrong answer from a customer service bot produces a refund. A wrong answer from a legal AI produces a malpractice claim, a sanctions motion, or a client who loses a case. The cost of a confident wrong output in law is measured in hundreds of thousands of dollars and in professional reputations.

Harvey AI built a legal AI product that partners at major law firms — not just associates doing research — actually trust. By 2025 it was valued at ~$1.5B with partnerships at Allen & Overy, PwC Legal, and other major international firms. This case is about how you earn trust in a high-stakes profession, and what the product architecture looks like when you can't afford confident errors.

Why general AI fails in legal

The standard objection to legal AI is that LLMs hallucinate. This is correct but insufficient as an explanation for why general-purpose AI tools fail in legal practice.

The more precise problem is that general models lack doctrinal precision. Legal reasoning involves specific terms, specific hierarchies of authority, specific jurisdictional nuances, and specific citation formats. A model that says "under the UCC, the seller must provide reasonable notice before cancellation" may be generally accurate but useless or harmful if the relevant case turns on whether Article 2 or Article 2A applies, or whether the contract was formed under a specific state's variant.

This is not a hallucination in the everyday sense — the model didn't invent a case. It applied a rule correctly in general but failed to apply the correct specific rule. In law, this distinction is not a quality difference — it's the difference between a correct analysis and a wrong one.

The product implication: legal AI requires domain-specific quality standards that general AI benchmarks don't capture. A model that scores in the 90th percentile on general legal reasoning questions may score much lower on the specific task a litigator or transactional lawyer actually needs. The benchmark problem is a product problem: if your eval system doesn't measure the task your users actually perform, a high benchmark score is misleading.

The fine-tuning bet

Harvey's core technical bet was fine-tuning. Rather than relying on general-purpose models (GPT-4, Claude 3) with legal-domain prompts, Harvey trained specialized models on legal corpora: case law, regulatory text, contracts, briefs, legislative history.

The bet was sound for three reasons:

Doctrinal precision requires domain distribution. A model trained predominantly on internet text has seen some legal text, but not enough to internalize the precise patterns of legal reasoning. A model trained primarily on legal text has the right distributional prior for the task.

Consistency of citation format matters. Legal work requires specific citation standards (Bluebook in the US, OSCOLA in the UK). Fine-tuning on correctly cited legal documents produces more consistent citation behavior than prompting a general model.

Jurisdictional specialization is possible at fine-tune time. Harvey fine-tuned separate or specialized models for specific jurisdictions and practice areas (M&A, litigation, employment law), allowing the product to deliver better results for narrow tasks than a general model with practice-area prompts.

The cost: fine-tuning requires a training data pipeline, ongoing model maintenance, and re-evaluation every time a new base model releases. Harvey built this capability as a core competency rather than treating it as a one-time task.

Eval design for high-stakes outputs

Harvey's eval system had to solve a problem that most consumer AI eval systems don't face: ground truth in legal analysis is often contested. Two experienced partners may analyze the same contract and reach different conclusions about a clause's interpretation. This is not a quality failure — it's the nature of legal judgment.

Standard golden-set evals (where the correct answer is known and fixed) work poorly here. Harvey's eval approach involved:

Expert panel review. A panel of practicing lawyers (not just annotators) evaluated model outputs against rubrics developed with domain experts. Panels were calibrated so that inter-rater agreement was measured and maintained — if two lawyers agreed on an evaluation, it was treated as ground truth; if they disagreed, the case was escalated for discussion rather than used in training.

Jurisdiction-specific test suites. Rather than a single global eval, Harvey maintained eval suites by practice area and jurisdiction. A model could perform well on US M&A analysis and poorly on UK employment law — and the product team needed to see that distinction, not an averaged score.

Factual accuracy vs. analytical quality. These were evaluated separately. Factual accuracy (did the model correctly state the holding of a case? did it correctly describe a statute?) was measurable and verifiable. Analytical quality (was the analysis sound? did it identify the right issues?) required expert judgment and was scored with explicit rubrics rather than binary right/wrong.

The adversarial test set. Harvey maintained a set of deliberately tricky inputs designed to elicit hallucinations, doctrinal errors, or citation fabrication — the specific failure modes that would be most harmful in practice. Performance on the adversarial set was a mandatory quality gate, not just a monitoring metric.
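
The eval practices above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Harvey's implementation: the function names, the SuiteResult fields, and every threshold are hypothetical, and Cohen's kappa is used here only as one common way to measure inter-rater agreement between two reviewers. The sketch shows three things: measuring panel agreement and escalating disagreements, reporting factual accuracy and rubric scores separately per suite, and treating the adversarial set as a stricter mandatory gate.

```python
from collections import Counter
from dataclasses import dataclass


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def split_by_agreement(rater_a: list[str], rater_b: list[str]):
    """Return (indices usable as ground truth, indices to escalate)."""
    pairs = list(zip(rater_a, rater_b))
    agreed = [i for i, (a, b) in enumerate(pairs) if a == b]
    escalated = [i for i, (a, b) in enumerate(pairs) if a != b]
    return agreed, escalated


@dataclass(frozen=True)
class SuiteResult:
    suite: str                 # e.g. "US M&A" or "UK employment"
    factual_accuracy: float    # fraction of verifiable claims that were correct
    rubric_score: float        # mean expert rubric score, normalized to 0..1
    adversarial: bool = False  # True for the deliberately tricky set


def release_gate(results, min_factual=0.95, min_rubric=0.80, min_adversarial=0.99):
    """Return the names of failing suites; an empty list means the gate passes.

    All thresholds are illustrative assumptions, not published figures.
    """
    failures = []
    for r in results:
        factual_floor = min_adversarial if r.adversarial else min_factual
        if r.factual_accuracy < factual_floor or r.rubric_score < min_rubric:
            failures.append(r.suite)
    return failures


# Two reviewers grade the same six outputs; item 1 is a disagreement.
lawyer_a = ["pass", "pass", "fail", "pass", "fail", "pass"]
lawyer_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohen_kappa(lawyer_a, lawyer_b)
agreed, escalated = split_by_agreement(lawyer_a, lawyer_b)

# Hypothetical per-suite scores for one candidate model.
results = [
    SuiteResult("US M&A", factual_accuracy=0.97, rubric_score=0.86),
    SuiteResult("UK employment", factual_accuracy=0.91, rubric_score=0.84),
    SuiteResult("adversarial", factual_accuracy=0.98, rubric_score=0.90, adversarial=True),
]
failing = release_gate(results)
mean_factual = sum(r.factual_accuracy for r in results) / len(results)

print(round(kappa, 3), escalated, failing, round(mean_factual, 3))
```

With these made-up numbers the mean factual accuracy clears the 0.95 bar, yet the gate still fails on the UK employment suite and on the adversarial set. That is the distinction an averaged score would hide.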

Auditability as a first-class feature

Harvey's legal customers operate under professional responsibility rules that require attorneys to supervise any work product, including AI-assisted work product. This is not optional compliance — it is an ethical obligation of the profession. An attorney who files a brief with AI-generated content they did not verify can face professional discipline.

Harvey designed auditability into the product from the start rather than bolting it on later as a compliance checkbox. Every AI output in Harvey includes:

Source citation down to the paragraph. Not just "citing Smith v. Jones" — citing the specific passage in Smith v. Jones that supports the claim, with a direct link to the text. The attorney can verify the citation is accurate and that the model hasn't overstated or misrepresented it.

Confidence differentiation. Rather than uniform confident prose, Harvey surfaces explicit uncertainty signals when the model's confidence is lower — flagging areas where the attorney should do additional research rather than relying on the AI's analysis.

Version tracking. Outputs are logged with model version and configuration at generation time. If a discrepancy arises between what the AI said and what the attorney relied on, there is a reconstruction path.
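
One way to picture the three properties above is as a data structure in which provenance travels with every claim. The sketch below is an assumption for illustration only: the class names, fields, the numeric confidence score, the 0.7 threshold, and the example.com link are all hypothetical and do not describe Harvey's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Citation:
    source: str       # e.g. "Smith v. Jones"
    paragraph: int    # paragraph within the source that supports the claim
    quoted_text: str  # the exact passage, so the reviewer can compare wording
    url: str          # direct link to the passage


@dataclass(frozen=True)
class Claim:
    text: str
    citations: tuple[Citation, ...]
    confidence: float  # hypothetical 0..1 score attached to this claim


@dataclass(frozen=True)
class AuditedOutput:
    claims: tuple[Claim, ...]
    model_version: str      # recorded at generation time
    config: dict[str, str]  # generation settings, kept for later reconstruction
    generated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def needs_review(self, threshold: float = 0.7) -> list[Claim]:
        """Claims to research further: low confidence or no citation."""
        return [
            c for c in self.claims
            if c.confidence < threshold or not c.citations
        ]


cited = Claim(
    text="The notice period runs from receipt.",
    citations=(
        Citation(
            source="Smith v. Jones",
            paragraph=14,
            quoted_text="(placeholder passage)",
            url="https://example.com/smith-v-jones#p14",
        ),
    ),
    confidence=0.92,
)
uncertain = Claim(
    text="The same rule likely applies to leases.",
    citations=(),
    confidence=0.55,
)
output = AuditedOutput(
    claims=(cited, uncertain),
    model_version="legal-model-v3",  # hypothetical identifier
    config={"jurisdiction": "US", "practice_area": "litigation"},
)

for claim in output.needs_review():
    print("verify:", claim.text)
```

The design choice worth noting is that citations and confidence attach to individual claims rather than to the whole answer, so the attorney can see which sentences are supported and which need independent research.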

The product lesson: auditability is not something you bolt onto a high-stakes AI product. It is the foundation you build the product on. An auditable output is not automatically a correct one, but it is one the attorney can check, and only an output that can be checked can be trusted. The attorney who can verify the analysis end-to-end is the attorney who will use the tool confidently.

The buyer trust problem in legal AI

Law firm technology purchasing is different from most enterprise sales. The decision-makers are attorneys who have spent careers developing professional judgment they trust more than any software tool. The sales objection is not usually "this is too expensive" or "this doesn't integrate with our systems" — it's "I don't trust this tool not to embarrass me in front of a client or a judge."

Harvey's go-to-market strategy addressed this directly:

Pilot with the skeptics. Rather than starting with enthusiastic early adopters in law firms (usually younger associates), Harvey specifically sought to pilot with senior partners — the people who would most aggressively probe the limits of the tool and whose endorsement would carry the most weight. A managing partner who says "I've tested this thoroughly and I trust it" is more valuable than twenty associate testimonials.

Transparent about limitations. Harvey's sales materials were explicit about what the tool couldn't do: it was not a substitute for attorney judgment, it couldn't give advice in novel legal areas, and its outputs required verification. This transparency built credibility with buyers who had been burned by overconfident AI vendors before.

Malpractice insurance implications. Harvey actively engaged with professional liability insurers to develop guidance on how attorney use of Harvey affected malpractice coverage. This addressed a practical concern (does using AI expose me to malpractice liability?) that traditional software vendors don't face. Addressing it proactively rather than leaving it as an unanswered question removed a significant barrier to adoption.

The vertical AI pattern

Harvey is a clear example of what "vertical AI" looks like when it's done well — and the pattern generalizes beyond legal:

Invest in domain-specific fine-tuning when the domain has enough documented knowledge and the performance gap between general and specialized models is real and large. Legal, medical, financial, engineering, and scientific domains typically meet this bar.

Design your eval system for domain-specific failure modes, not generic LLM benchmarks. Know what "wrong" looks like in your domain in a way that is more precise than "hallucinated."

Make auditability first-class. High-stakes professional users don't just want good outputs — they want outputs they can sign their name to. That requires visible provenance.

Sell to the most skeptical buyers first. In high-accountability professions, the buyer who trusts you after thorough examination is more valuable than ten buyers who adopted casually.

PM takeaway

Harvey's product story is about what it takes to earn professional trust in a domain where the cost of AI failure is measured in professional consequences, not user satisfaction scores.

The three structural requirements for vertical AI in high-stakes domains:

Domain-specific quality standards. General benchmarks are insufficient. Build your eval for the actual failure modes of your domain.

Auditability by design. Every output must carry its provenance. Trust requires verifiability, not just accuracy.

Sales designed for skeptics. In high-accountability domains, the credibility of your endorsers matters as much as the quality of your product. Win the most demanding users first.

These requirements cost more than a horizontal AI feature. They are also the things that create the moat. A general AI tool can be copied. A specialized tool with an auditable track record at 50 major law firms is far harder to replicate.