Safety and Auditability — the pm manual

The model can't decide what's acceptable. That's a product decision, and it's yours. Every AI feature you ship has a policy layer. The question is whether you designed it or inherited it by accident.

Talvinder Singh, Pragmatic Leaders

When an AI feature produces a harmful output, the question "why did the model do that?" is the wrong question. The right question is "what did the product design allow?" The model's behavior within the scope you gave it is a product decision. The scope you gave it is also a product decision. If you didn't make those decisions deliberately, you made them by default — and default decisions are usually bad ones.

What a PM owns vs. what others own

Most teams don't have a clear version of this table.

What a PM owns	What engineering owns	What legal/policy owns	What the model provider owns
Defining acceptable use — what the feature should and shouldn't do	Implementing the content filters and policy guards	Regulatory compliance obligations (GDPR, AI Act, liability)	Model-level safety (RLHF tuning, hard refusals, toxicity limits)
Designing the human-in-the-loop gates and escalation paths	Building the audit log infrastructure	Reviewing outputs that create legal exposure	Publishing the acceptable use policy you must comply with
Speccing the "model is wrong" UX	Monitoring outputs at scale (LLM-as-judge, classifiers)	Contracting with the model provider	Providing API-level filtering options
Communicating risk and limitations to users	Rate limiting, abuse detection	User agreement and terms of service	Responding to model capability changes
Writing the incident response playbook	Incident tooling and retrospective process	Handling user complaints and regulatory inquiries	Safety research and model updates

The most important column is the first one. PMs routinely defer "AI safety" to engineering or legal, treating it as a compliance obligation rather than a product design discipline. This is wrong for two reasons: (1) the policy layer is deeply entangled with UX decisions that only a PM can make, and (2) the reputational and user-trust damage from a high-profile AI failure hits the product, not the legal team.

Designing the policy layer

A policy layer is the set of decisions about what your AI feature will and won't do. It has three components:

1. Capability scope. What can the user ask the AI to do? For a customer support bot, the scope might be: "Answer questions about our product, pricing, and policies. Do not give legal advice, medical advice, or financial advice. Do not discuss competitors. Do not engage with off-topic requests." Capability scope must be written before the system prompt is written — the system prompt is the implementation of the scope, not the place to discover what the scope is.

2. Output guardrails. What outputs are never acceptable, regardless of what the user asks? For most products: personally identifiable information (PII) about other users, content that violates your terms of service, outputs that impersonate a specific real person. These are usually implemented as output classifiers or system prompt restrictions.

3. Escalation triggers. Under what conditions does the AI output go to a human for review before being delivered? Common triggers: outputs that contain financial figures above a threshold, legal language, medical guidance, any output flagged by a safety classifier above a confidence threshold.
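The three components can be written down as data before any system prompt exists. The Python sketch below is one hypothetical way to do that. The names PolicyLayer and decide, the topic and label strings, and the 0.8 threshold are illustrative assumptions, not part of any library. It also assumes upstream classifiers have already labeled the request topic and scored the draft output.

```python
from dataclasses import dataclass
from enum import Enum
from typing import AbstractSet, FrozenSet


class Decision(Enum):
    DELIVER = "deliver"
    REFUSE_OUT_OF_SCOPE = "refuse_out_of_scope"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class PolicyLayer:
    # 1. Capability scope: topics the feature is allowed to handle.
    allowed_topics: FrozenSet[str]
    # 2. Output guardrails: labels that are never acceptable in an output.
    blocked_output_labels: FrozenSet[str]
    # 3. Escalation triggers: labels that send the output to human review.
    escalation_labels: FrozenSet[str]
    # Safety-classifier score at or above which the output is escalated.
    safety_threshold: float = 0.8  # illustrative value, not a recommendation


def decide(
    policy: PolicyLayer,
    topic: str,
    output_labels: AbstractSet[str],
    safety_score: float,
) -> Decision:
    """Apply the policy in order: scope, then guardrails, then escalation."""
    if topic not in policy.allowed_topics:
        return Decision.REFUSE_OUT_OF_SCOPE
    if policy.blocked_output_labels & output_labels:
        return Decision.BLOCK
    if (
        policy.escalation_labels & output_labels
        or safety_score >= policy.safety_threshold
    ):
        return Decision.ESCALATE
    return Decision.DELIVER


# Hypothetical policy for the customer support bot described above.
support_bot = PolicyLayer(
    allowed_topics=frozenset({"product", "pricing", "policies"}),
    blocked_output_labels=frozenset({"pii_other_user", "impersonation"}),
    escalation_labels=frozenset({"legal_language", "medical_guidance"}),
)
```

The order of checks is a design choice in this sketch: scope is tested first so that an out-of-scope request never reaches the guardrail or escalation logic.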

// learn the judgment

Your product is an AI tutor for school-age students (ages 12-18). A student asks the AI: 'My parents are fighting a lot and I'm really stressed. Can you help me?' The AI has been designed to answer academic questions only. It deflects with 'I'm only able to help with schoolwork.' The student's parents file a complaint saying the AI was 'cold and uncaring' during a mental health moment. You're reviewing the incident.

The call: Was the AI response correct? How do you redesign the policy layer?

Hallucination handling — the response playbook

Hallucinations will occur. The question is what happens when they do.

Tier 1: Low-stakes hallucination. The model states an incorrect fact that is easily checkable and has no material consequence (a wrong release date, an incorrect statistic). Response: design inline fact-checking cues ("Always verify important information"). Log the query type for your eval improvement backlog. No immediate escalation required.

Tier 2: Medium-stakes hallucination. The model states something incorrect that a user might act on — a wrong price, an incorrect policy detail, an incomplete procedure. Response: design explicit uncertainty signals in the UX ("This information is AI-generated — confirm with [authoritative source] before acting"). Add the query type to your golden set and eval regression suite. Consider adding a verification step for the specific topic category.

Tier 3: High-stakes hallucination. The model states something that could cause real harm if acted upon — incorrect medical dosing, wrong legal advice, a fabricated financial figure in a report. Response: human-in-the-loop gate before the output is delivered. Output classifier to flag potential medical/legal/financial content. User-visible disclaimer with mandatory verification prompt. Escalation to human review for flagged outputs. Consider whether this task is appropriate for AI delivery at all.

The general rule for PM decision-making: before shipping any AI feature, explicitly assign every output category to a tier. Don't discover tier 3 hallucinations after launch.
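One way to make the tier assignment checkable is to keep it as a small table next to the feature spec. The sketch below is a minimal illustration in Python. The category names and the helper names unassigned and requires_human_gate are hypothetical, and treating an untiered category as tier 3 is an assumption of this sketch rather than a rule stated above.

```python
from enum import IntEnum


class Tier(IntEnum):
    LOW = 1     # easily checkable, no material consequence
    MEDIUM = 2  # a user might act on it
    HIGH = 3    # could cause real harm if acted upon


# Hypothetical categories for illustration; replace with your feature's own.
TIER_BY_CATEGORY = {
    "release_date": Tier.LOW,
    "pricing": Tier.MEDIUM,
    "policy_detail": Tier.MEDIUM,
    "medical_dosing": Tier.HIGH,
}


def unassigned(categories, tiers=TIER_BY_CATEGORY):
    """Return the output categories that have not been given a tier yet."""
    return sorted(c for c in categories if c not in tiers)


def requires_human_gate(category, tiers=TIER_BY_CATEGORY):
    """Tier 3 needs a gate; an untiered category is treated as tier 3."""
    return tiers.get(category, Tier.HIGH) == Tier.HIGH
```

Running unassigned over the list of categories your feature can produce before launch gives you the gaps to close, instead of finding them in production.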

Audit trails

An audit trail is the record that tells you, after the fact: what did a user ask, what did the model say, when, and based on what retrieved context?

You need this for five reasons:

Debugging. When a user reports a bad output, you need to replay exactly what happened.
Eval improvement. Every production failure is a potential golden set entry. You can't use it if you didn't log it.
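Regulatory compliance. The EU AI Act (whose obligations phase in over several years, with most scheduled to apply from 2026) and India's Digital Personal Data Protection (DPDP) Act require records of data processing, which can include AI-assisted decisions that affect users.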
Incident response. If an AI output causes harm, you need to reconstruct the event for your legal team, regulators, or press response.
Model improvement. Using outputs for model improvement requires explicit authorization and a record of what was shared with the LLM provider. If you're in a regulated industry, you may need to confirm that your provider does NOT use your outputs for training.

Minimum viable audit trail:

Timestamp of every AI interaction
User identifier (not necessarily direct PII; a pseudonymous ID is sufficient for debugging, though it still counts as personal data under GDPR)
The user's input, verbatim
The full prompt sent to the model (system prompt + retrieved context + user input)
The model's output, verbatim
Model version and configuration used
Any classifier outputs (content policy flags, confidence scores)
Whether the output was delivered, held for review, or blocked
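The list above maps directly onto a log record. The following Python sketch shows one possible shape, assuming an append-only log with one JSON object per line. The class name AuditRecord, the field names, and the three disposition strings are illustrative choices, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

DISPOSITIONS = {"delivered", "held_for_review", "blocked"}


@dataclass
class AuditRecord:
    user_id: str              # pseudonymous ID, not an email or a name
    user_input: str           # verbatim
    full_prompt: str          # system prompt + retrieved context + user input
    model_output: str         # verbatim
    model_version: str
    model_config: dict        # e.g. temperature, max tokens
    classifier_outputs: dict  # content policy flags, confidence scores
    disposition: str          # one of DISPOSITIONS
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # Reject records whose delivery status is not one of the known values.
        if self.disposition not in DISPOSITIONS:
            raise ValueError(f"disposition must be one of {sorted(DISPOSITIONS)}")

    def to_json_line(self) -> str:
        """Serialize as one JSON object per line for an append-only log."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)
```

Storing the full prompt as well as the user input is what makes replay possible: the same input can produce a different output if the system prompt or the retrieved context has changed.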

Retention policy: work with legal to define how long audit logs are retained. GDPR gives users a right to erasure, so audit logs containing personal data must be covered by your deletion workflow.
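A retention rule and a deletion workflow can be expressed as a single filter over the log. The sketch below is a simplified illustration in Python. The function name apply_retention and the record shape (dicts with "user_id" and an ISO 8601 "timestamp") are assumptions, the retention window is whatever legal decides, and real systems may need exceptions such as legal holds that this sketch does not model.

```python
from datetime import datetime, timedelta, timezone


def apply_retention(records, now, retention_days, erased_user_ids=frozenset()):
    """Return only the records that may still be kept.

    records: iterable of dicts with "user_id" and an ISO 8601 "timestamp"
    that includes a UTC offset. A record is dropped if it is older than the
    retention window or if its user has an outstanding deletion request.
    """
    cutoff = now - timedelta(days=retention_days)
    kept = []
    for record in records:
        recorded_at = datetime.fromisoformat(record["timestamp"])
        if recorded_at < cutoff:
            continue  # past the retention window
        if record["user_id"] in erased_user_ids:
            continue  # covered by a deletion request
        kept.append(record)
    return kept


# Usage with made-up data: a 90-day window and one deletion request.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
log = [
    {"user_id": "a", "timestamp": "2025-05-20T00:00:00+00:00"},
    {"user_id": "b", "timestamp": "2025-05-25T00:00:00+00:00"},
    {"user_id": "a", "timestamp": "2025-01-01T00:00:00+00:00"},
]
remaining = apply_retention(log, now, retention_days=90, erased_user_ids={"b"})
```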

Human-in-the-loop gates

A human-in-the-loop (HITL) gate is a checkpoint where a human reviews an AI output before it's delivered to a user or acted upon. HITL is the risk management tool for high-stakes AI features. It's also a UX tax — every gate adds latency and friction.

Design principle: HITL gates should be designed around the cost of a wrong AI output, not around the presence of AI. If the cost of an error is low (a slightly awkward customer support message), don't add a gate. If the cost is high (a medical recommendation, a loan rejection, a legal draft), add a gate and design the workflow for the human reviewer.

A HITL gate spec includes:

The trigger condition (what triggers human review — a classifier, a query category, a confidence threshold)
The reviewer workflow (who reviews, how they access the queue, what they see, what actions are available)
The latency SLA (how long until the human reviewer must act)
The fallback (what the user sees / experiences while waiting for review)
The override path (can the user proceed without review if they accept the risk?)
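The five parts of the spec can be captured in a small structure so that a missing part is visible in review. This is a minimal Python sketch. The class name HitlGateSpec, its field names, and the helper missing_fields are hypothetical, and the example values are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HitlGateSpec:
    trigger_condition: str    # what sends an output to review
    reviewer_workflow: str    # who reviews, where, and with what actions
    latency_sla_minutes: int  # how long the reviewer has to act
    fallback: str             # what the user sees while waiting
    override_allowed: bool    # may the user proceed without review?


def missing_fields(spec):
    """List text fields left blank, so an incomplete spec is caught early."""
    blank = []
    for f in fields(spec):
        value = getattr(spec, f.name)
        if isinstance(value, str) and not value.strip():
            blank.append(f.name)
    return blank


# Invented example: a gate for loan decisions with the fallback not yet written.
loan_gate = HitlGateSpec(
    trigger_condition="query category is loan_decision",
    reviewer_workflow="credit officer approves, edits, or rejects in the review queue",
    latency_sla_minutes=60,
    fallback="",
    override_allowed=False,
)
```

Here missing_fields(loan_gate) reports the blank fallback, which is the part of a gate spec that is easiest to forget because it describes the waiting user, not the reviewer.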

Synchronous HITL (the reviewer acts before delivery; the user waits) is appropriate for high-stakes one-time outputs (a detailed financial report, a legal document summary). Parallel HITL (deliver the AI output immediately and flag it for follow-up review) is appropriate for medium-stakes ongoing interactions where synchronous review would break the UX. Retrospective HITL (sample-based human review of past outputs) is appropriate for monitoring and improvement, not risk management.
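The choice between the three review modes described above (review before delivery, deliver and flag, sample-based review) can be sketched as a simple routing function. The Python below is an illustration under assumptions: the stakes labels "high", "medium", and "low", the names ReviewMode, review_mode, and sample_for_review, and the use of sampling for low-stakes output are choices made for this example.

```python
import random
from enum import Enum


class ReviewMode(Enum):
    BEFORE_DELIVERY = "reviewer acts before delivery; the user waits"
    PARALLEL = "deliver immediately; flag for review afterwards"
    SAMPLED = "sample-based review of past outputs"


def review_mode(stakes: str) -> ReviewMode:
    """Map a stakes level ("high", "medium", "low") to a review mode."""
    modes = {
        "high": ReviewMode.BEFORE_DELIVERY,
        "medium": ReviewMode.PARALLEL,
        "low": ReviewMode.SAMPLED,
    }
    if stakes not in modes:
        raise ValueError(f"unknown stakes level: {stakes!r}")
    return modes[stakes]


def sample_for_review(output_ids, rate, seed=None):
    """Pick a random fraction of past outputs for retrospective review."""
    rng = random.Random(seed)
    return [oid for oid in output_ids if rng.random() < rate]
```

Sampling is included only as a monitoring tool. It does not stop any individual output from reaching a user, so it is not a substitute for a gate on high-stakes categories.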

Briefing legal and policy on AI risk

Legal and policy teams are often either over-involved (reviewing every prompt before launch) or under-involved (not consulted until something goes wrong). Neither is right.

What legal needs from you:

A description of what the AI feature does in plain language, including what types of outputs it can generate
A list of categories of output that carry potential liability (medical, legal, financial, minor-targeted)
Your human-in-the-loop gates and what they catch
Your audit trail spec — what you log and how long you retain it
The model provider's acceptable use policy and whether your use case is compliant

What you need from legal:

Jurisdiction-specific requirements (if you operate in the EU, are you covered under the EU AI Act as a high-risk system?)
Terms of service language for AI-generated content (who owns it, what disclaimers are required, what you're not responsible for)
A rapid-response plan for high-profile failures (who is the spokesperson, what can you say before a full incident review)

The right time to brief legal: before launch, not after a complaint. A 30-minute briefing with a well-prepared PM (use the five-point list above) is far less expensive than a reactive legal review following an incident.

What to do this week

Write your capability scope. For one AI feature, write three sentences: what it can do, what it explicitly won't do, and what happens when a user asks for something out of scope. Share with your engineering lead and check whether the system prompt matches.
Tier your output categories. List five types of outputs your AI feature might generate. Assign each to tier 1, 2, or 3. For any tier-3 category, write down whether you have a human-in-the-loop gate and what it looks like.
Check your audit trail. Open your logging config. Are user inputs, full prompts, and model outputs being stored? How long are they retained? If you don't know, find out this week.

Where to go next

Eval Design — turning production failures into golden set entries
Ethical PM — the broader ethical PM framework applied to AI failure modes
Agent Design — HITL design for agentic systems with real-world side effects