Reliability does not come from one magical sentence. Reliability comes from architecture. If your prompt is a single paragraph with mixed intent, mixed constraints, and mixed tone guidance, you are relying on luck.
The stable structure that works in production is straightforward: role, task, constraints, examples, format. Not because it is fashionable, but because it makes behavior controllable.
Ground yourself in prompt design as product design and pair it with eval before launch. Prompt structure without evaluation is theater. Evaluation without structure is whack-a-mole. The two are joined at the hip.
Let’s break the structure down.
Role block.
Role is not a costume. Role is a prioritization rule. It tells the assistant what kind of judgment to favor under ambiguity. A role that is too broad creates generic output. A role that is too narrow overfits and collapses outside known cases. The sweet spot is role plus context: who the user is, what decision they face, and which tradeoffs matter.
Task block.
Task defines the job to be done in one sentence. One job. If the line contains “and” three times, you likely have multiple tasks pretending to be one. Split them or sequence them. Reliability drops when the assistant has to infer which verb matters most.
Constraints block.
This is the highest-leverage block in the whole prompt. Constraints define forbidden behavior, mandatory checks, uncertainty handling, and domain boundaries. Many teams under-specify this section because they worry the assistant will become too rigid. The opposite usually happens in production: constraints reduce embarrassing behavior while preserving enough room for useful variation.
Examples block.
Examples show what success looks like in practice. They are especially useful when style, structure, or caution level matters. But examples are expensive. Bad examples teach bad habits. Overly narrow examples teach imitation instead of reasoning. Keep examples representative and minimal.
Format block.
Format is a product decision, not an aesthetic one. The best answer in the wrong shape is still a bad user experience. Format lines should encode output sections, ordering, and limits. If your users need action, force an actionable format. If they need confidence cues, require explicit confidence language.
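As a minimal sketch of keeping the five blocks separate, the snippet below stores each block in its own field and joins them only at render time. The class name `PromptSpec`, its field names, the section labels, and the refund scenario are all assumptions made for illustration, not a prescribed layout.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Five separate control blocks. All names here are illustrative."""
    role: str
    task: str
    constraints: list[str]
    format_rules: list[str]
    examples: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the blocks in a fixed order under labeled headers."""
        sections = [
            ("ROLE", self.role),
            ("TASK", self.task),
            ("CONSTRAINTS", "\n".join(f"- {c}" for c in self.constraints)),
            ("EXAMPLES", "\n\n".join(self.examples)),
            ("FORMAT", "\n".join(f"- {r}" for r in self.format_rules)),
        ]
        # Skip empty blocks so a spec without examples still renders cleanly.
        return "\n\n".join(f"{name}\n{body}" for name, body in sections if body)


# Hypothetical refund-review prompt, written without examples at first.
spec = PromptSpec(
    role="You are a support analyst helping a billing manager decide on a refund.",
    task="Recommend approve, deny, or escalate for the refund request.",
    constraints=[
        "If evidence is missing, say what is unknown and ask one clarifying question.",
        "Never promise a refund amount.",
    ],
    format_rules=["Use at most five bullets, each under 20 words."],
)
prompt = spec.render()
print(prompt)
```

Because each block lives in its own field, a reviewer can see which block a prompt edit touched instead of diffing one blended paragraph.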
Now the reliability trap: teams write this structure once and assume they are done.
You are not done until each block has a corresponding evaluation target.
Role evaluation asks: does output stay in the intended decision frame?
Task evaluation asks: did the assistant complete the correct job, not a nearby one?
Constraint evaluation asks: did any forbidden behavior appear?
Example evaluation asks: did the assistant generalize correctly rather than mimic blindly?
Format evaluation asks: is output consistently usable in product context?
Without this mapping, prompt edits become untraceable. With it, you can run tight regression checks and learn quickly.
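One way to keep that mapping explicit is a small registry from block name to check function, so a failing output points back at a block. The sketch below covers only the checks that are easy to automate with string rules; role and example checks typically need a human or model grader and are left out. The check functions, the forbidden phrase, and the allowed decision words are hypothetical.

```python
import re
from typing import Callable


def check_task(output: str) -> bool:
    """Task check: the output names exactly one of the allowed decisions."""
    found = re.findall(r"\b(approve|deny|escalate)\b", output.lower())
    return len(set(found)) == 1


def check_constraints(output: str) -> bool:
    """Constraint check: a forbidden phrase must not appear."""
    return "guaranteed refund" not in output.lower()


def check_format(output: str) -> bool:
    """Format check: between one and five bullet lines."""
    bullets = [ln for ln in output.splitlines() if ln.lstrip().startswith("- ")]
    return 0 < len(bullets) <= 5


# Registry that ties each prompt block to its evaluation target.
BLOCK_CHECKS: dict[str, Callable[[str], bool]] = {
    "task": check_task,
    "constraints": check_constraints,
    "format": check_format,
}


def failing_blocks(output: str) -> list[str]:
    """Return the names of blocks whose check failed for this output."""
    return [block for block, check in BLOCK_CHECKS.items() if not check(output)]


sample = "- Escalate: receipt is missing.\n- You get a guaranteed refund."
report = failing_blocks(sample)
print(report)
```

Here the sample output names one decision and uses two bullets, so only the constraints check fails, which is the kind of traceability the mapping is meant to provide.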
A practical way to implement this is to maintain a small golden set. A golden set is a curated set of test inputs that represent your real failure surface. It is not a giant benchmark. It is a sharp one. This is exactly the discipline argued for in eval before launch.
For prompt reliability, include at least four golden set categories.
Baseline cases: common, uncontroversial inputs.
Edge cases: ambiguous or underspecified inputs.
Adversarial cases: prompts likely to induce overconfidence or fabricated detail.
Policy cases: sensitive contexts where escalation and refusal behavior matter.
Each prompt revision should be judged against this set. If a change improves one slice and harms another, you discuss tradeoffs explicitly instead of discovering damage after launch.
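The comparison between two prompt revisions can be sketched as a per-category pass rate over the golden set. Everything specific below is assumed for illustration: the case ids, the `GOLDEN_SET` mapping, and the pass/fail results, which in practice would come from running each revision and grading its outputs.

```python
from collections import defaultdict

CATEGORIES = ("baseline", "edge", "adversarial", "policy")

# Hypothetical golden set: case id -> category.
GOLDEN_SET = {
    "refund-simple": "baseline",
    "refund-no-receipt": "edge",
    "invented-order-id": "adversarial",
    "chargeback-threat": "policy",
}


def pass_rate_by_category(results: dict[str, bool]) -> dict[str, float]:
    """Aggregate per-case pass/fail results into a pass rate per category."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for case_id, ok in results.items():
        category = GOLDEN_SET[case_id]
        totals[category] += 1
        passed[category] += int(ok)
    return {c: passed[c] / totals[c] for c in CATEGORIES if totals[c]}


def regressed_slices(old: dict[str, bool], new: dict[str, bool]) -> list[str]:
    """List categories whose pass rate dropped from the old revision to the new."""
    before = pass_rate_by_category(old)
    after = pass_rate_by_category(new)
    return [c for c in CATEGORIES if after.get(c, 0.0) < before.get(c, 0.0)]


# Made-up results: the new revision fixes the edge case but breaks the adversarial one.
old_results = {"refund-simple": True, "refund-no-receipt": False,
               "invented-order-id": True, "chargeback-threat": True}
new_results = {"refund-simple": True, "refund-no-receipt": True,
               "invented-order-id": False, "chargeback-threat": True}

regressions = regressed_slices(old_results, new_results)
print(regressions)
```

Reporting by slice rather than as one overall score is what surfaces the tradeoff: the overall pass rate is identical for both revisions here, yet one category got worse.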
Case studies make this concrete.
Harvey’s legal AI model worked because it constrained scope and audit expectations around a domain where mistakes are expensive. That product posture is reflected in prompt design: explicit boundaries, explicit structure, explicit evidence handling.
Perplexity’s search rewrite shows the opposite pressure: users want fast synthesized answers, but trust depends on visible grounding. Prompt structure must enforce citation behavior and uncertainty language; otherwise, fluency outruns truth.
In both examples, the structure is doing product work. The model is only half the story.
There is also a sequencing rule that teams miss.
You should add blocks in this order: role and task first, constraints second, format third, examples last.
Why examples last? Because examples can mask structural weaknesses. A prompt that only works with heavy examples often has an unclear task or missing constraints.
Another practical warning: avoid unstable adjectives in core instructions.
Words like “insightful,” “professional,” or “thoughtful” are too elastic unless tied to concrete format and criteria. Reliability improves when abstract adjectives are translated into explicit output requirements.
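A simple lint over instruction text can flag these words before review. This is a sketch under the assumption that a team keeps its own word list; the three entries below are just the examples named above, and the function name is hypothetical.

```python
import re

# Assumed starter list taken from the words named above; extend it per team.
ELASTIC_WORDS = {"insightful", "professional", "thoughtful"}


def elastic_words_in(instruction: str) -> list[str]:
    """Return the elastic adjectives found in an instruction, sorted."""
    words = re.findall(r"[a-z]+", instruction.lower())
    return sorted(set(words) & ELASTIC_WORDS)


flagged = elastic_words_in("Write a professional, insightful summary.")
clean = elastic_words_in("Write three bullets, each citing one source.")
print(flagged, clean)
```

A flagged word is not automatically wrong. It is a prompt to ask which concrete format rule or criterion stands behind it.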
You should also write negative constraints with care.
“Do not be verbose” is weak.
“Use at most five bullets, each under 20 words” is reviewable.
“Do not hallucinate” is wishful.
“If evidence is missing, say what is unknown and ask one clarifying question” is actionable.
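The bullet rule is reviewable precisely because it can be checked mechanically. The sketch below assumes bullets are lines that start with "- " and counts words by splitting on whitespace; the function name and message wording are illustrative.

```python
def bullet_violations(output: str, max_bullets: int = 5, word_limit: int = 20) -> list[str]:
    """Check 'at most five bullets, each under 20 words' and list any violations."""
    # Assumption: a bullet is a line whose stripped form starts with "- ".
    bullets = [ln.strip()[2:] for ln in output.splitlines() if ln.strip().startswith("- ")]
    problems = []
    if len(bullets) > max_bullets:
        problems.append(f"{len(bullets)} bullets, limit is {max_bullets}")
    for i, text in enumerate(bullets, start=1):
        count = len(text.split())
        if count >= word_limit:  # "under 20 words" means 19 at most
            problems.append(f"bullet {i} has {count} words, must be under {word_limit}")
    return problems


ok_output = "- Refund approved.\n- Receipt verified."
long_output = "- " + " ".join(["word"] * 20)

print(bullet_violations(ok_output))
print(bullet_violations(long_output))
```

No comparable check can be written for "Do not be verbose", which is the practical difference between the weak phrasing and the reviewable one.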
Finally, treat structure as an interface contract between prompt authors and product teams. If you change structure, you are changing behavior expectations. That deserves review and regression testing.
The goal is not to eliminate all variance. Some variance is healthy, especially in creative surfaces. The goal is to eliminate harmful variance: the kind that creates trust debt, rework, or policy risk.
Role, task, constraints, examples, format is not a template to copy mindlessly. It is a control system. Use it to make quality legible.
When your team can explain which block fixed which defect, you have moved from prompting as art to prompting as product craft.
Rules from this lesson
- Use role, task, constraints, examples, and format as distinct control blocks, not one blended paragraph.
- Map each block to evaluation checks using a small, sharp golden set.
- Add examples last so they do not hide weak task or constraint design.
- Rewrite vague adjectives into concrete, reviewable output requirements.
- Every structural prompt change must run against regression cases before shipping.