The build/buy/wrap decision for AI is not a technical question. It's a business question about where your moat is, what you're willing to own, and how much risk you can absorb. Most teams treat it as a technical one anyway.
After this page, you’ll understand:
- The four paths: API wrapper, fine-tuned hosted model, self-hosted model, and capability layer
- The five factors that determine which path is right for your product
- Why 'fine-tuning for moat' is usually wrong and 'fine-tuning for cost at scale' is usually right
- The lock-in risks most teams don't price into the decision
- How to make this decision before the sprint starts, not after six months of engineering
Every AI feature starts with the same question: should we build this on top of someone else's model, adapt someone else's model to our specific task, or train something ourselves? Most teams decide this based on what the engineering team is comfortable with, not based on a structured analysis of what the product actually needs.
This page gives you the framework. Make the decision before the sprint starts. It's expensive to reverse.
The four paths
Path 1: API wrapper (buy). You call an external model API (OpenAI, Anthropic, Google, Cohere) with a well-crafted prompt. The model provider owns the infrastructure, the model weights, the serving, and the maintenance. You own the prompt, the application logic, and the user experience.
This is the starting point for almost every AI feature in 2026. The major APIs are capable, fast, and reliable enough for most production use cases. The cost of getting started is a few hundred dollars in API credits and a week of prompt engineering.
Path 2: Fine-tuned hosted model (adapt). You take a base model from a provider (OpenAI, Anthropic, Google, or an open-source model via a fine-tuning service like Together AI, Modal, or Replicate) and train it further on your own data. The provider or a third party hosts the fine-tuned model. You own the training data and the fine-tuning process.
This is the right call when you have labeled data, a narrow task, and a cost or latency problem that a well-prompted large model can't solve. It is not the call you make to create a data moat — the fine-tuned weights don't contain your users' data, they contain a model biased toward your task.
Path 3: Self-hosted model (build). You download open-source model weights (Llama 4, Mistral, Falcon) and run inference on infrastructure you control. You own the compute, the serving, the maintenance, the upgrades, and the uptime. You have complete control over the model and your data never leaves your infrastructure.
This is the right call when: data residency or sovereignty requirements prevent external API use, inference volume is high enough that self-hosting is cheaper than API costs, or you need to modify model behavior in ways APIs don't allow. It is NOT the right call for teams without ML infrastructure experience. Self-hosting a production-grade inference service is hard engineering work.
Path 4: Capability layer (build from scratch). You train a model on your proprietary data from a base checkpoint or from scratch. This is what OpenAI, Anthropic, and Google do. It requires tens of millions of dollars in compute, months of training time, and a research team. Almost no product team should be doing this. If you are asking whether to do it, the answer is no.
The five deciding factors
Factor 1: Data residency and privacy requirements.
If your users are in regulated industries (healthcare, finance, legal, government) or jurisdictions with strict data residency requirements (EU, India's DPDP, certain enterprise contracts), external APIs may be off the table entirely. Check before designing your architecture. Anthropic, Azure OpenAI, and Google Cloud all offer VPC deployment or EU-region hosting, but at higher cost.
Factor 2: Quality on your specific task.
Benchmark before deciding. The default assumption — "we need a fine-tuned model because our domain is specialized" — is usually wrong. Frontier models (GPT-4o, Claude 3.5 Sonnet) perform well on specialized tasks with good prompts. Fine-tuning improves quality when: the task requires consistent format/style the base model doesn't follow by default, the domain vocabulary is genuinely rare in training data, or you have hundreds of thousands of labeled examples.
Test with a well-crafted prompt on the large model first. Fine-tune only if the quality gap is real and large.
Factor 3: Cost at projected scale.
This is where fine-tuning and self-hosting become genuinely competitive. The math: API costs scale linearly with volume. Fine-tuning amortizes the training cost across inference volume, but you still pay per token. Self-hosting pays for compute upfront and has near-zero marginal cost per inference.
At low volume (< 1M tokens/month): API wins. At medium volume (1M-100M tokens/month): API with caching and cascading is usually competitive. At high volume (>100M tokens/month), especially with a narrow task: evaluate self-hosting carefully — the compute cost may be 1/5th the API cost.
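The volume tiers above can be sanity-checked with a small break-even calculation. This is a minimal sketch, not a pricing model: `API_PRICE_PER_MILLION` and `SELFHOST_FIXED_MONTHLY` are hypothetical placeholder figures, and self-hosting is simplified to a flat monthly bill with no marginal cost per token.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """API cost scales linearly with token volume."""
    return tokens_per_month / 1_000_000 * price_per_million


def selfhost_monthly_cost(fixed_monthly: float) -> float:
    """Self-hosting modelled as a flat monthly bill (compute plus ops time).
    Marginal cost per token is treated as zero until capacity runs out."""
    return fixed_monthly


def break_even_tokens(price_per_million: float, fixed_monthly: float) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    return fixed_monthly / price_per_million * 1_000_000


# Hypothetical inputs for illustration only; substitute your own quotes.
API_PRICE_PER_MILLION = 5.0       # assumed blended USD per 1M tokens
SELFHOST_FIXED_MONTHLY = 2_000.0  # assumed USD per month for GPUs and on-call

for volume in (1_000_000, 100_000_000, 2_000_000_000):
    api = api_monthly_cost(volume, API_PRICE_PER_MILLION)
    hosted = selfhost_monthly_cost(SELFHOST_FIXED_MONTHLY)
    cheaper = "api" if api <= hosted else "self-host"
    print(f"{volume:>13,} tokens/month: api=${api:,.0f} self-host=${hosted:,.0f} -> {cheaper}")

print(f"break-even: {break_even_tokens(API_PRICE_PER_MILLION, SELFHOST_FIXED_MONTHLY):,.0f} tokens/month")
```

With these placeholder numbers the crossover sits well above 100M tokens/month, which is why that tier is a prompt to evaluate rather than a rule to switch. The result is only as good as the fixed-cost estimate, so include engineering and on-call time in it.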
Factor 4: Latency requirements.
API providers add network latency (typically 100-300ms overhead) on top of inference latency. For most use cases this is irrelevant. For latency-sensitive applications (real-time voice, interactive coding, fast search autocomplete), 100ms matters. Self-hosted inference on co-located hardware can cut this overhead to < 10ms.
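A rough latency budget makes the trade-off concrete. The overhead figures below come from the paragraph above; `INFERENCE_MS` and `BUDGET_MS` are assumed values chosen for illustration, so treat the output as a sketch rather than a measurement.

```python
def total_latency_ms(network_overhead_ms: float, inference_ms: float) -> float:
    """End-to-end latency is network overhead plus model inference time."""
    return network_overhead_ms + inference_ms


INFERENCE_MS = 150.0  # assumed inference time for the task
BUDGET_MS = 200.0     # assumed budget for an interactive feature

# Overheads taken from the ranges quoted in the text.
scenarios = {
    "api (low overhead)": 100.0,
    "api (high overhead)": 300.0,
    "self-hosted, co-located": 10.0,
}

for name, overhead in scenarios.items():
    total = total_latency_ms(overhead, INFERENCE_MS)
    verdict = "within budget" if total <= BUDGET_MS else "over budget"
    print(f"{name}: {total:.0f} ms ({verdict})")
```

If inference time alone already exceeds the budget, moving the model closer will not help; a smaller or faster model is the first lever in that case.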
Factor 5: Team capability and maintenance capacity.
This is the factor teams most often ignore. Self-hosting requires: ML infra engineers who know CUDA, ONNX, and container orchestration; on-call rotation for the inference service; a process for evaluating and upgrading model versions; monitoring and observability tooling. If your team doesn't have this capability today, estimate the time and cost to build it before adding self-hosting to the decision matrix.
Fine-tuning requires: labeled data (at least 1,000-10,000 examples for useful improvement), a training pipeline, and ongoing maintenance as new base models release. When a new Llama version ships, your fine-tune on the old version doesn't automatically transfer — you're either re-fine-tuning on the new base or staying on an older model.
The lock-in risks that most teams under-price
API lock-in. If you build deeply on GPT-5's specific behavior (prompt formats tuned to GPT-5's instruction-following quirks, features that depend on GPT-5-specific capabilities), switching to Claude or Gemini is non-trivial. Mitigation: abstract the model call behind a provider-agnostic interface layer from the start. Test on multiple providers during development. The cost of abstraction is low; the cost of deep entanglement is very high when pricing changes.
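One way to sketch the interface layer described above is shown here. All names (`ModelClient`, `Completion`, `EchoProviderA`, `EchoProviderB`, `review_resume`) are hypothetical, and the two provider classes are stand-ins that echo their input rather than calling any real vendor SDK.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Completion:
    """Vendor-neutral response shape used by application code."""
    text: str
    provider: str


class ModelClient(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, system: str, user: str) -> Completion: ...


class EchoProviderA:
    """Stand-in adapter. A real adapter would call one vendor's SDK here
    and translate its response into a Completion."""

    def complete(self, system: str, user: str) -> Completion:
        return Completion(text=f"[A] {user}", provider="provider-a")


class EchoProviderB:
    """Second stand-in adapter, to show that callers do not change."""

    def complete(self, system: str, user: str) -> Completion:
        return Completion(text=f"[B] {user}", provider="provider-b")


def review_resume(client: ModelClient, resume_text: str) -> Completion:
    """Application logic: knows the task, not the vendor."""
    system = "You are a resume reviewer. Give three concrete improvements."
    return client.complete(system=system, user=resume_text)


# Running the same task against both adapters is the multi-provider test.
for client in (EchoProviderA(), EchoProviderB()):
    result = review_resume(client, "10 years of Python experience")
    print(result.provider, "->", result.text)
```

The design choice is that vendor-specific prompt formatting and response parsing live inside each adapter, so swapping providers means writing one new class instead of touching every call site.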
Fine-tune staleness. Fine-tuned models don't benefit from base model improvements. When GPT-4o-mini improves at the next release, your fine-tuned GPT-3.5-turbo doesn't. You will periodically need to re-evaluate whether your fine-tune on an older model still beats prompting a newer base model. Plan for this evaluation cycle.
Abstract the model call behind a provider-agnostic interface from day one. Deep entanglement with one provider's prompt quirks or specific capabilities is cheap to avoid and expensive to undo.
Data pipeline lock-in. Fine-tuning creates a training data pipeline: collect examples, label them, format them, run training, evaluate, deploy. This pipeline is an ongoing maintenance obligation. Teams that fine-tune often underestimate the cost of keeping the training data current as their product evolves.
The decision matrix
| Factor | API wrapper | Fine-tuned hosted | Self-hosted | When to use |
|---|---|---|---|---|
| Data privacy | External API (some risk) | External training run | Your infra only | Self-host if HIPAA, GDPR, or enterprise contract terms require it |
| Task quality | General; may need prompt work | High for narrow tasks with data | Depends on base model | Fine-tune if prompt quality insufficient after real testing |
| Cost at scale | Linear with tokens | Training amortized | Compute only | Evaluate at 100M tokens/month+ |
| Latency | Network overhead | Network overhead | Sub-10ms possible | Self-host for real-time voice / live autocomplete |
| Maintenance | Minimal | Moderate (retraining cycle) | High (infra + ops) | Honest team capacity assessment |
| Time to ship | Days | Weeks to months | Months | API almost always wins for time-to-market |
Default recommendation for 2026: Start with an API wrapper. It ships faster, requires no ML infrastructure, and in most cases provides sufficient quality. Evaluate fine-tuning when you have a production quality signal that your prompts have hit their ceiling. Evaluate self-hosting only when cost or privacy requirements create a hard constraint.
Start with an API wrapper. Evaluate fine-tuning only after production quality signals show that prompt engineering has hit its ceiling. Evaluate self-hosting only when privacy or cost at scale creates a hard constraint. The default is not the lazy path; it is the correct one.
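The matrix and default recommendation can be expressed as a small rule function. This is a sketch under assumptions: the field names, the ordering of the checks, and the returned labels are illustrative, and only the 100M tokens/month threshold comes from the matrix itself.

```python
from dataclasses import dataclass


@dataclass
class ProductFactors:
    data_must_stay_in_house: bool    # Factor 1: residency or privacy hard constraint
    prompt_quality_sufficient: bool  # Factor 2: measured on a real eval set
    has_labeled_data: bool           # prerequisite for fine-tuning
    tokens_per_month: int            # Factor 3: projected volume
    needs_realtime_latency: bool     # Factor 4: voice, live autocomplete
    has_ml_infra_team: bool          # Factor 5: can the team run inference in production


HIGH_VOLUME_TOKENS = 100_000_000  # threshold taken from the matrix


def recommend_path(f: ProductFactors) -> str:
    """Return a starting recommendation; hard constraints are checked first."""
    if f.data_must_stay_in_house:
        return "self-hosted"
    wants_selfhost = f.needs_realtime_latency or f.tokens_per_month >= HIGH_VOLUME_TOKENS
    if wants_selfhost and f.has_ml_infra_team:
        return "evaluate self-hosted"
    if not f.prompt_quality_sufficient and f.has_labeled_data:
        return "evaluate fine-tuned hosted"
    return "api wrapper"


typical_feature = ProductFactors(
    data_must_stay_in_house=False,
    prompt_quality_sufficient=True,
    has_labeled_data=True,
    tokens_per_month=2_000_000,
    needs_realtime_latency=False,
    has_ml_infra_team=False,
)
print(recommend_path(typical_feature))
```

Writing the rules down this way forces the team to state each factor as a yes/no answer or a number before the sprint starts, which is the point of the exercise; the output is a conversation starter, not a verdict.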
Architecture review. PM, ML engineer, and CTO reviewing the plan for an AI resume review feature.
ML Engineer: “I want to fine-tune a Llama-4 model on our historical resume feedback data. We have 5,000 past feedback examples. It'll be more consistent and cheaper than calling GPT-4o.”
CTO: “How long does the fine-tune take to be production-ready?”
ML Engineer: “Two to three weeks, plus evaluation time.”
PM: “Have we tested whether GPT-4o-mini with a well-crafted prompt matches the quality of those historical feedback examples?”
ML Engineer: “No, I assumed we'd need fine-tuning for consistency.”
PM: “Let's test that assumption. Give me 48 hours — I'll run 50 examples through GPT-4o-mini with a structured prompt and rate quality against the golden set. If mini matches our historical quality, we ship that. If it doesn't, we have the data to justify the fine-tuning investment.”
The mini test matched quality on 44 of 50 examples. The team shipped an API wrapper in two weeks instead of six.
Fine-tuning is expensive to build and maintain. Always test the cheaper hypothesis first.
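The 48-hour test in the dialogue amounts to tallying per-example ratings against a bar agreed in advance. A minimal sketch follows; the `threshold` of 0.85 is an assumed value (the dialogue does not state the bar the team used), and `ratings` stands for a human reviewer's pass/fail judgement on each example.

```python
def match_rate(ratings: list[bool]) -> float:
    """Fraction of examples where the prompted model matched golden-set quality."""
    if not ratings:
        raise ValueError("need at least one rated example")
    return sum(ratings) / len(ratings)


def decide(ratings: list[bool], threshold: float = 0.85) -> str:
    """Ship the cheaper path if it clears the bar; otherwise the gap is evidence
    for the fine-tuning investment. The threshold is a hypothetical default."""
    if match_rate(ratings) >= threshold:
        return "ship prompted model"
    return "justify fine-tuning"


# The result from the dialogue: 44 of 50 examples matched.
ratings = [True] * 44 + [False] * 6
print(f"match rate: {match_rate(ratings):.0%} -> {decide(ratings)}")
```

Fixing the threshold before running the test matters more than its exact value, because it keeps the decision from being argued backwards from the result.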
When fine-tuning is the right call
Fine-tuning is genuinely the right call in these scenarios:
Format consistency at scale. Your task requires precise output formatting (structured JSON, specific field names, consistent report sections) and your prompting has not been able to achieve reliable consistency. Fine-tuning can bake in format adherence that system prompts can't reliably achieve.
Domain vocabulary. Your domain uses terms, abbreviations, or concepts that are systematically underrepresented in pre-training data (specific medical subfields, niche legal frameworks, proprietary industry jargon). A fine-tuned model that has seen your corpus will handle these terms correctly.
Cost at scale with quality evidence. You have benchmarked prompt-only approaches and found a quality gap. You have labeled data. You have run the cost model and the fine-tuned smaller model's cost savings justify the training and maintenance investment. All three conditions must be true.
What fine-tuning is not for: creating a "data moat" from user inputs (the weights don't store users' data in an extractable form), achieving reasoning capabilities the base model doesn't have (fine-tuning reinforces existing capabilities, it doesn't add new ones), or making a small model match the general reasoning quality of a large model on diverse tasks.
Test the task against a well-prompted large model before committing to fine-tuning. If the quality gap is not real and measurable, you do not have a fine-tuning problem — you have a prompt engineering problem.
What to do this week
- Name your current AI capability approach. For each AI feature, write down which path you're on: API wrapper, fine-tuned, or self-hosted. If you're planning one but haven't shipped, write down why that path was chosen.
- Run the quality benchmark. If you're planning to fine-tune, test your task against GPT-4o-mini and GPT-4o with a well-crafted prompt first. Score quality on your eval dimensions. Get a real baseline before committing to the fine-tuning path.
- Build the cost model at 100x volume. Use the framework from Latency and Cost. At your projected scale, what does each path cost per month? Let the economics inform the architecture conversation.
Where to go next
- Latency and Cost — the quantitative model behind the cost factor
- LLM Fundamentals — fine-tuning, LoRA, and quantisation explained
- Eval Design — the quality benchmarking required before any architecture decision