The Sentry + Checkly + Playwright stack — PL's working production observability — Production Harnesses — observability, recovery, the bill

The move. Read a real production stack before you design your own. Borrow what fits; reject what does not.

Most writing about production observability describes a stack in the abstract: you should have error tracking, you should have synthetic monitoring, you should have end-to-end tests.

The lists are accurate. They are not useful.

What is useful is seeing a specific, operating stack and understanding why each layer is there, what it catches, and how the layers fail when they are not maintained correctly.

PL's observability stack is small, opinionated, and operating in production today. It does not cover everything. It has known gaps.

Reading it — including the gaps and the incidents that exposed them — is faster than reading a vendor whitepaper about what you should ideally have.

The lesson is not "use these tools."

It is "design your defense in layers, understand what each layer catches, and know what falls through the gaps between layers."

The picture

Three concentric circles. The outer layer is Sentry: continuous. It runs without being triggered. Every uncaught exception, every console error above threshold, and every performance degradation that crosses a defined budget is captured, tagged with environment (dev, staging, prod) and feature, and routed to the appropriate alert channel. Sentry also captures session replays — browser-side recordings of user interactions that led to an error. When a user encounters a failure, you can watch what they did.

The middle layer is Checkly: scheduled. It runs on a cron — every ten minutes for the high-frequency checks, every thirty minutes for the baseline. It probes the public surface of the application: can it load the homepage? Can it complete a login flow? Can it reach the API health endpoint? These are synthetic checks — they simulate user behavior rather than waiting for a real user to encounter a problem. Checkly is also wired as a post-deploy gate: when a deployment completes, Checkly runs a set of smoke checks and fails the deploy if they do not pass.

The inner layer is Playwright: triggered. It runs on every PR (to catch regressions before merge) and after every production deploy (to verify the deployed state). The Playwright suite is larger than Checkly's checks: it runs the user's golden path through the product — creating an account, enrolling in a course, completing a lesson, checking progress. These are the tests that would catch a broken enrollment flow or a missing lesson page.

Below the three layers: the gaps. What does not get caught? A failure that never surfaces as an exception, such as silent data corruption, so Sentry never sees it. A Checkly check that passes because it is pointed at the wrong URL. A Playwright test that is flaky because of timing assumptions and has been muted by the team.

The gaps are where the interesting incidents happen.

Why it matters now

Most teams talk about observability and ship without it. The reasons are always the same: "we'll add it after launch," "we're moving too fast right now," "the product is too simple to need this level of monitoring yet."

PL's stack demonstrates that "too simple to need this" is almost always wrong.

The incidents that surfaced during the polish-foundation-sprint — deploy failures masked by stale infrastructure, monitoring pointed at the wrong URL, QA reports going unactioned because nobody could tell which production they were testing — are not exotic failures.

They are the ordinary failures of a product that had monitoring gaps at the wrong moments.

The other thing PL's stack demonstrates is that layered monitoring catches things that no single layer would catch.

Sentry does not test user flows. Checkly does not capture exceptions. Playwright does not run continuously.

The union of the three catches most of the failure modes that matter.

A source you should trust

Sentry's documentation on environment tagging is the first thing to read. Without environment tags, Sentry errors from dev, staging, and prod are mixed in the same alert stream. The first step after wiring Sentry is tagging every event with NEXT_PUBLIC_APP_ENV — a variable that PL sets to dev, staging, or prod at deploy time and passes to Sentry at initialization.
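A minimal sketch of that tagging step, assuming a helper named resolveAppEnv (hypothetical, not taken from PL's codebase) that validates the raw variable before it reaches Sentry. The environment option on Sentry.init is part of Sentry's JavaScript SDK; failing loudly on an unknown value is a design choice of this sketch, not something the source prescribes.

```typescript
// Environments named in the text; anything else is treated as a misconfiguration.
const APP_ENVS = ["dev", "staging", "prod"] as const;
type AppEnv = (typeof APP_ENVS)[number];

// Type guard: is this string one of the known environments?
function isAppEnv(value: string): value is AppEnv {
  return (APP_ENVS as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: normalize and validate the raw NEXT_PUBLIC_APP_ENV value.
// Throwing on an unknown value keeps untagged or mistagged events out of the prod stream.
function resolveAppEnv(raw: string | undefined): AppEnv {
  const value = (raw ?? "").trim().toLowerCase();
  if (!isAppEnv(value)) {
    throw new Error(
      `NEXT_PUBLIC_APP_ENV must be one of ${APP_ENVS.join(", ")}; got "${raw ?? ""}"`
    );
  }
  return value;
}

// Intended use at initialization (commented out so the sketch has no dependencies):
//   Sentry.init({
//     dsn: "<your DSN>",
//     environment: resolveAppEnv(process.env.NEXT_PUBLIC_APP_ENV),
//   });
```

With the environment set once at initialization, alert rules can filter on it instead of relying on each call site to tag events.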

Checkly's documentation on deploy-triggered checks is the second. The default Checkly setup runs on a schedule; the valuable setup is the one that also runs on every deploy and blocks the deploy status from going green until the checks pass.

PL's prod-smoke.yml workflow and the pl-app/scripts/prod-e2e-validate.mjs script are working artifacts. Reading them is faster than reading documentation about what you should theoretically set up. The smoke workflow fires on workflow_run after deploy-prod succeeds, then runs the Playwright suite against the production URL. If any test fails, the workflow fails and the team is paged.
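The pass/fail rule of that workflow can be sketched as a pure function. This is not the content of prod-e2e-validate.mjs, which is not reproduced here; SmokeResult, evaluateSmokeGate, and the idea of checking the target origin are assumptions made for illustration.

```typescript
// One result per smoke test run against the deployed environment.
interface SmokeResult {
  name: string;      // test name, e.g. "login flow"
  targetUrl: string; // URL the test actually hit
  passed: boolean;
}

interface GateDecision {
  green: boolean;    // false means the workflow should fail and page the team
  reasons: string[]; // one entry per blocking problem
}

// Assumption of this sketch: a run with zero results counts as a failure,
// because an empty suite proves nothing about the deploy.
function evaluateSmokeGate(results: SmokeResult[], expectedOrigin: string): GateDecision {
  const reasons: string[] = [];
  if (results.length === 0) {
    reasons.push("no smoke tests ran");
  }
  for (const result of results) {
    const origin = new URL(result.targetUrl).origin;
    if (origin !== expectedOrigin) {
      // A pass against the wrong host says nothing about production.
      reasons.push(`${result.name}: ran against ${origin}, expected ${expectedOrigin}`);
    } else if (!result.passed) {
      reasons.push(`${result.name}: failed`);
    }
  }
  return { green: reasons.length === 0, reasons };
}
```

A wrapper script could exit with a non-zero status when green is false, which is what makes a CI workflow step fail.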

A recipe

A layered-observability protocol for any production web product:

Errors layer — wire Sentry before the first deploy. Set environment tags on initialization. Create one alert channel per environment that goes to a different audience: staging errors go to the dev team's Slack channel, prod errors go to the on-call rotation. Do not route staging noise to the on-call rotation; it trains people to ignore alerts.
Synthetic monitoring layer — set up Checkly (or equivalent) with two check types: a lightweight availability check (can the homepage load?) that runs every five minutes, and a flow check (can a user log in and see their dashboard?) that runs every thirty minutes. Wire Checkly as a post-deploy gate so new deploys do not go to green unless these checks pass.
E2E smoke layer — write Playwright tests for the user's most critical paths: account creation, key feature flow, and checkout if applicable. Run them on every PR via CI and after every production deploy via prod-smoke.yml. Keep the suite fast; a suite that takes thirty minutes will be disabled.
Maintain the URL list — record which URL each layer is monitoring in a document. Review the document when infrastructure changes. When you switch from Vercel to Fly, update the URL list. When you add a new environment, add it to the list.
Own the alert channels — each alert channel has a named owner. The owner reviews the channel daily, even when there are no incidents. When alerts go unreviewed for three days, the monitoring is effectively off.
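The routing and ownership rules in the list above can be sketched as a routing table plus one staleness check. The channel targets, owner names, and function names here are hypothetical; the dev entry in particular is an assumption, since the recipe only says where staging and prod errors go.

```typescript
type Environment = "dev" | "staging" | "prod";

interface AlertChannel {
  target: string;       // where alerts are delivered
  owner: string;        // the named reviewer for this channel
  lastReviewedAt: Date; // last time the owner read the channel
}

type ChannelMap = Record<Environment, AlertChannel>;

const DAY_MS = 24 * 60 * 60 * 1000;
const MAX_UNREVIEWED_DAYS = 3; // the three-day threshold from the recipe

// Hypothetical routing table: staging goes to the dev team, prod goes to on-call,
// so staging noise never reaches the on-call rotation.
function buildChannels(reviewedAt: Date): ChannelMap {
  return {
    dev: { target: "#dev-errors", owner: "dev-lead", lastReviewedAt: reviewedAt },
    staging: { target: "#dev-team-alerts", owner: "dev-lead", lastReviewedAt: reviewedAt },
    prod: { target: "on-call-rotation", owner: "on-call-primary", lastReviewedAt: reviewedAt },
  };
}

// Return the environments whose channel has no owner or has gone
// unreviewed for MAX_UNREVIEWED_DAYS or longer.
function findNeglectedChannels(channels: ChannelMap, now: Date): Environment[] {
  const envs: Environment[] = ["dev", "staging", "prod"];
  return envs.filter((env) => {
    const channel = channels[env];
    const idleDays = (now.getTime() - channel.lastReviewedAt.getTime()) / DAY_MS;
    return channel.owner.trim() === "" || idleDays >= MAX_UNREVIEWED_DAYS;
  });
}
```

Run on a schedule, a check like this turns "the channel is being ignored" from something the team notices late into something that is reported.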

The smell of it going wrong

Only one layer is wired. Sentry is set up because the CEO asked for error tracking, but Checkly and Playwright are "on the list." When the enrollment flow breaks silently — no exception, just a broken UI state — Sentry does not catch it because there is no exception to catch. Users start complaining on day three.

Sentry exists but is unowned. Alerts go to a Slack channel with 24 members. Nobody is assigned to the channel. The last five alerts were read by three people and resolved by zero. The team has learned that the alert channel is where noise goes; they have stopped reading it.

Playwright tests exist but are flaky. Three of the twelve tests in the smoke suite fail intermittently due to timing issues. The team added them to the flaky-test exception list six weeks ago. They have been red so long that nobody notices when a new real failure joins the red list.

A production incident is discovered by a customer. The customer contacts support. Support reads the ticket and escalates. Engineering investigates. The failure had been present for fourteen hours. The monitoring stack had been running. The failure did not match any of the monitoring patterns.

A judgment call from real work

PL's polish-foundation-sprint produced the canonical example of the monitoring-URL problem. The Checkly checks were passing. The deploy-dev workflow was failing. These two facts were not obviously contradictory until someone asked: what URL is Checkly monitoring?

The answer was the Vercel URL from before the migration to Fly. Vercel was still serving the previous version of the application. Checkly was confirming that the Vercel instance was up, which it was. The Fly staging instance — the one that was failing — was not in the Checkly URL list.

The failure had been invisible for several days. Talvinder's QA reports were landing in a context where the monitoring showed green, so the reports looked like user error rather than systemic failure. The incident surfaced only when someone traced the Checkly URL to its source.

Two changes followed. First, the Checkly URL list was updated to point at the Fly staging URL and the Fly production URL, with explicit notes indicating the Vercel URL was retired. Second, a review step was added to the deployment runbook: after every infrastructure change, verify that all monitoring tools are pointed at the new addresses. The step is simple; the cost of missing it had been days of confused QA.
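The runbook step can be partly automated. A minimal sketch, assuming the URL list is kept as structured data; MonitoredTarget, auditMonitoredUrls, and every hostname in the usage below are hypothetical and stand in for PL's real entries, which are not shown in the source.

```typescript
type DeployEnv = "staging" | "prod";

interface MonitoredTarget {
  layer: string;          // e.g. "checkly" or "playwright"
  environment: DeployEnv; // which environment this check claims to cover
  url: string;            // the URL the check actually hits
}

// Compare what each layer monitors with where each environment is actually served.
function auditMonitoredUrls(
  targets: MonitoredTarget[],
  liveHosts: Record<DeployEnv, string>,
  retiredHosts: string[]
): string[] {
  const findings: string[] = [];
  for (const target of targets) {
    const host = new URL(target.url).host;
    const expected = liveHosts[target.environment];
    if (retiredHosts.indexOf(host) !== -1) {
      findings.push(`${target.layer} (${target.environment}) still points at retired host ${host}`);
    } else if (host !== expected) {
      findings.push(`${target.layer} (${target.environment}) points at ${host}, expected ${expected}`);
    }
  }
  // A live environment that no check points at is the gap from the incident above.
  const envs: DeployEnv[] = ["staging", "prod"];
  for (const env of envs) {
    const covered = targets.some((t) => new URL(t.url).host === liveHosts[env]);
    if (!covered) {
      findings.push(`no layer monitors the live ${env} host ${liveHosts[env]}`);
    }
  }
  return findings;
}
```

Fed the state from the incident (one Checkly target on the retired host, nothing on the new hosts), the audit reports the stale target and both uncovered environments. After the fix it returns an empty list.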

A less obvious lesson from this incident: when QA reports conflict with monitoring, investigate the monitoring before dismissing the reports. The monitoring was wrong; the QA reports were right. Monitoring systems are trusted artifacts, but they are only as good as the URL they are pointed at.

Rules from this lesson

Observability is a layered defense; Sentry catches exceptions, Checkly catches availability failures, Playwright catches broken user flows — no single layer catches all three.
The monitored URL list is a maintained artifact, not a one-time setup; every infrastructure change requires a URL list review.
An unowned alert channel is no alert channel; assign a named reviewer before launch, not when you notice the channel is being ignored.
When QA reports conflict with monitoring, audit the monitoring first; the monitoring may be pointed at the wrong thing.