18. Recoverable Client Errors: Surface as GraphQLError + Anomaly Alerts

Date: 2026-06-23

Status

Accepted

Context

The application distinguishes two broad classes of failures for observability:

  • Bugs / unexpected conditions → reported to Sentry. Developers need to see these.
  • Expected, user-recoverable conditions (account deactivated, certain auth states, business-rule violations the UI can guide the user through) → surfaced as GraphQLError and intentionally not reported to Sentry.

Firebase ID token expiry (auth/id-token-expired, CODE_611) is a hybrid case:

  • The client already recovers on its own: a Relay network middleware inspects every response for the code and, on detection, redirects the browser to /token-refresh.
  • But the server raised this condition as a bug-class error, so it reached the Sentry reporting wrapper and produced noise — even though the client silently recovers.

This ADR establishes how to handle this whole class: surface it as a GraphQLError so the existing “do not report” path suppresses it, while still detecting genuine anomalies.

The goal for this class of error:

  1. Never treat a single (or even occasional) occurrence as a Sentry “error” event.
  2. Still let client code detect the condition via a stable code in the GraphQL errors[].message.
  3. Detect when the condition becomes an anomaly (sudden spike, sustained high rate across tenants, regression in the refresh machinery) and notify humans.

A Sentry-native, low-code approach was chosen over per-request rate counters or always-on exception sampling (see Alternatives considered below).

Decision

When a failure meets all of the following, apply the pattern below:

  • The client (browser / another service) can meaningfully recover without developer intervention (refresh a token and retry, navigate to a login/refresh page, show a friendly message and offer retry, etc.).
  • The condition is not a programming defect.
  • Client recovery logic (or observability) needs a stable machine-readable identifier; we put it in the GraphQL error message using the existing “[CODE_xxx]: Human message” format.
  • We still want to know when the rate of the condition becomes abnormal.

The pattern:

  1. Transport shape. Surface the error as a GraphQLError whose .message contains the familiar [CODE_xxx]: ... text, and attach no wrapped original error. The “no wrapped original error” part is load-bearing: only a bare GraphQLError is classified as expected and returned as HTTP 200. A GraphQLError wrapping a bug-class error yields HTTP 500, and on HTTP ≥ 400 the client’s response-inspecting recovery middleware never runs — so the recovery is lost. The 200 keeps the message in errors[].message and lets the client middleware act on it. (Construction and instance-safety details: see Implementation mechanics.)

  2. Reporting suppression. Because it is now a GraphQLError, it flows through the existing “do not report” machinery and the Sentry wrapper does not fire. (Specific predicates and call sites: see Implementation mechanics.)

  3. Visibility without noise.

    • Always emit a cheap Sentry.addBreadcrumb({ category: 'recoverable-error', message: '...', level: 'info', data: { code: '611', ... } }).
    • Optionally emit a sampled Sentry.captureMessage('recoverable-condition-observed', { level: 'info', tags: { code: '611', ... } }) (e.g. 1–5 %).
    • Never call captureException (or the reporter) for the normal path.
  4. Anomaly detection. Use Sentry’s first-class alerting:

    • Create an Alert (Issues or a custom metric alert) filtered on the tag / breadcrumb category / fingerprint / message pattern containing the code.
    • Condition examples: “event count > 30 in last 15 min”, “spike > 3× 7-day baseline”, “affects > N distinct users/tenants”.
    • Optionally give the family a stable fingerprint in beforeSend:
      event.fingerprint = ['recoverable', 'auth', code || 'unknown'];

    This turns normal occurrences into cheap, queryable signals and only raises a real actionable issue when the signal crosses a threshold.

  5. Consistency rules for implementers.

    • The same isExpected* / classify* predicate that decides “this is recoverable” must also decide “produce a GraphQLError here”.
    • Client-side recovery code must continue to be driven by the stable code in the message. The message format is the contract.
    • Add or update a test asserting: when the recoverable condition is raised from a GraphQL context / resolver, no reporter is invoked for the happy path.
    • Document the code in the relevant dev-manual / project-description section.
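
Step 3 and the fingerprint from step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the codebase's implementation: SentryLike is a locally declared subset of the SDK surface (assumed to be structurally compatible with the real Sentry.addBreadcrumb and Sentry.captureMessage), and reportRecoverable, recoverableFingerprint, sampleRate and random are hypothetical names introduced only for this example.

```typescript
// Minimal subset of the Sentry SDK surface used here, declared locally so the
// sketch is self-contained (assumption: the real SDK accepts these shapes).
interface SentryLike {
  addBreadcrumb(breadcrumb: {
    category: string;
    message: string;
    level: 'info';
    data: Record<string, string>;
  }): void;
  captureMessage(
    message: string,
    context: { level: 'info'; tags: Record<string, string> },
  ): void;
}

// Hypothetical helper. sampleRate and random are illustrative parameters,
// not existing configuration; random is injectable so tests are deterministic.
function reportRecoverable(
  sentry: SentryLike,
  code: string,
  message: string,
  sampleRate = 0.02,
  random: () => number = Math.random,
): void {
  // Always: a cheap breadcrumb that is attached to any later event in scope.
  sentry.addBreadcrumb({
    category: 'recoverable-error',
    message,
    level: 'info',
    data: { code },
  });
  // Sometimes: a sampled info-level message that alert rules can count by tag.
  if (random() < sampleRate) {
    sentry.captureMessage('recoverable-condition-observed', {
      level: 'info',
      tags: { code },
    });
  }
  // Never: captureException (or the reporter) on this path.
}

// Stable fingerprint for the family, meant to be assigned to
// event.fingerprint inside beforeSend.
function recoverableFingerprint(code?: string): string[] {
  return ['recoverable', 'auth', code || 'unknown'];
}
```

Injecting the Sentry object and the random source keeps the helper easy to unit-test, which also supports the "no reporter is invoked" assertion required by the consistency rules.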

Non-goals / alternatives considered (and why not chosen)

  • Custom RecoverableError class taught to every reporter: higher maintenance, leaks into many layers.
  • Always captureException + level: 'warning' + heavy sampling: still generates issue noise and costs.
  • In-process / Redis rate limiter that decides whether to report: useful as an additional guard for very high-severity families, but not required for the normal “rare but expected” case. Sentry alerts are the right tool for rate-based anomaly detection.
  • Suppressing in beforeSend only: too late; the event has already been created and may have consumed quota / created an issue.

Consequences

Benefits

  • Preserves the simple global invariant: “bug → Sentry; recoverable GraphQLError → no Sentry”.
  • The client recovery contract (stable code in the error message) is unchanged and now also satisfied for context-level errors.
  • Observability cost is near zero for normal volume; humans are only notified when something has actually gone wrong at scale.
  • Easy to apply consistently: any future “we can refresh / redirect / retry this” condition follows the same steps.

Risks & Mitigations

  • A GraphQLError that should have been reported as a bug is accidentally classified as recoverable.
    • Mitigation: the predicate lives in one place. Code review must verify the condition really is client-recoverable. Add a comment linking to this ADR next to the if.
  • Loss of stack trace for a legitimate rare case.
    • Mitigation: breadcrumbs still carry context; when the alert fires the full trace can be obtained from a sampled info message or by temporarily raising the sample rate.
  • Developers forget to emit the breadcrumb / sampled message.
    • Mitigation: the ADR + companion investigation provide copy-paste examples. A helper (reportRecoverable(...)) can be added later.

Checklist when applying this pattern

  1. Identify the predicate that says “this is recoverable for the client”.
  2. In every place that can produce the error for a GraphQL response (resolvers, context factory, middleware), raise a GraphQLError carrying the [CODE_xxx] message.
  3. Add Sentry.addBreadcrumb (and optional sampled captureMessage) at the point of detection.
  4. Verify the client recovery layer still sees the code.
  5. Add / update a unit test that the recoverable path does not cause a reporter call.
  6. Create or update the Sentry Alert for the code/family.
  7. Update relevant dev-manual / project-description docs with a one-line reference.

Implementation mechanics (this codebase)

These details are specific to the current stack. They explain how the pattern is wired, but are not needed to make the decision above.

Constructing the error. Build it with graphql-yoga’s createGraphQLError(message). Because the graphql package ships dual CJS/ESM builds, a GraphQLError built from a different module instance fails yoga’s internal instanceof check and gets re-wrapped into a 500. createGraphQLError builds from yoga’s own graphql instance, so yoga recognizes it as expected (isOriginalGraphQLError) and returns 200.
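
A minimal sketch of classify-then-construct, under stated assumptions. The error factory is injected so the snippet runs without dependencies; in this codebase the argument would be graphql-yoga's createGraphQLError. The names isExpiredTokenError and toClientFacingError are hypothetical stand-ins (the real predicate lives in classifyTokenError.ts and may differ), and the human-readable part of the message is illustrative.

```typescript
// Signature of the error factory. Assumption: createGraphQLError from
// graphql-yoga fits this shape when called with a message only.
type CreateCodedError = (message: string) => Error;

// Firebase reports an expired ID token with this error code.
const FIREBASE_TOKEN_EXPIRED = 'auth/id-token-expired';

// Hypothetical stand-in for the real classification predicate.
function isExpiredTokenError(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { code?: unknown }).code === FIREBASE_TOKEN_EXPIRED
  );
}

// Recoverable case: return a bare coded error with nothing wrapped inside it.
// Everything else: return the original untouched so it still reaches the
// bug-reporting path.
function toClientFacingError(
  err: unknown,
  createError: CreateCodedError,
): unknown {
  if (isExpiredTokenError(err)) {
    // Only the [CODE_611] prefix is the contract; the text after it is illustrative.
    return createError('[CODE_611]: ID token expired');
  }
  return err;
}
```

The original Firebase error is deliberately not passed to the factory: per the transport-shape rule, only a bare GraphQLError is classified as expected.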

Masking. maskError (lib/graphql/maskError.ts) must detect GraphQLErrors by shape (name === 'GraphQLError'), not instanceof, so an expected coded error is never masked into “Internal server error”.
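
The difference between the two checks can be illustrated with a small simulation. Two classes with the same name stand in for the CJS and ESM copies of GraphQLError; this is an assumption-laden sketch of the failure mode, not the code in maskError.ts, and isGraphQLErrorByShape is a hypothetical name.

```typescript
// Each call returns a distinct constructor, like two loaded copies of the
// same package would.
function makeGraphQLErrorClass(): new (message: string) => Error {
  return class GraphQLError extends Error {
    constructor(message: string) {
      super(message);
      this.name = 'GraphQLError';
    }
  };
}

const GraphQLErrorFromCopyA = makeGraphQLErrorClass();
const GraphQLErrorFromCopyB = makeGraphQLErrorClass();

// Shape-based check: true for an error created by any copy.
function isGraphQLErrorByShape(err: unknown): boolean {
  return err instanceof Error && err.name === 'GraphQLError';
}

const thrown = new GraphQLErrorFromCopyB('[CODE_611]: ID token expired');

// Identity-based check against the other copy fails...
const seenByInstanceof = thrown instanceof GraphQLErrorFromCopyA; // false
// ...while the shape-based check still recognizes the error.
const seenByShape = isGraphQLErrorByShape(thrown); // true
```

A shape check is more permissive than instanceof (any Error named GraphQLError passes), which is the accepted trade-off here: a false positive leaves a message unmasked, while a false negative breaks client recovery.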

Why HTTP ≥ 400 breaks recovery. On a ≥ 400 response the Relay network layer (react-relay-network-modern) throws inside convertResponse before any response-inspecting middleware runs, so expiredTokenMiddleware is bypassed and inner-query recovery is lost.

Reporting suppression — the actual call sites:

  • Try.throwThroughErrorTypes(['GraphQLError'])
  • Early if (result.error instanceof GraphQLError) throw ... paths
  • The outer handler Try in pages/api/graphql.ts does not trigger .report(...)
  • The useSentry envelop plugin treats it as an expected GraphQL error.

Client recovery contract. Recovery is driven by getCodeFromErrorMessage(message) === codes.tokenExpired. The token-expiry case classifies via isExpectedAuthError (server/graphql/helpers/auth/classifyTokenError.ts) and recovers in expiredTokenMiddleware (PR #6408) plus the legacy QueryRenderer path in _app.js.
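
The contract can be sketched as a pure function over the response body. This is a hypothetical re-implementation for illustration only: parseCodeFromMessage stands in for getCodeFromErrorMessage, TOKEN_EXPIRED_CODE is an assumed value for codes.tokenExpired, and needsTokenRefresh is an invented name for the decision that expiredTokenMiddleware makes before redirecting to /token-refresh.

```typescript
// Assumed value of codes.tokenExpired (the ADR refers to CODE_611).
const TOKEN_EXPIRED_CODE = '611';

// Extracts the numeric code from a "[CODE_xxx]: Human message" string,
// or returns null when the message carries no code.
function parseCodeFromMessage(message: string): string | null {
  const match = /\[CODE_(\d+)\]/.exec(message);
  return match ? (match[1] ?? null) : null;
}

// Only the part of a GraphQL response body this sketch needs.
interface GraphQLResponseBody {
  errors?: ReadonlyArray<{ message: string }>;
}

// True when any error in an HTTP 200 response body carries the
// token-expired code, i.e. when the client should start token refresh.
function needsTokenRefresh(body: GraphQLResponseBody): boolean {
  return (body.errors ?? []).some(
    (e) => parseCodeFromMessage(e.message) === TOKEN_EXPIRED_CODE,
  );
}
```

Because the decision depends only on errors[].message, it keeps working as long as the server preserves the message format and returns HTTP 200 for this class of error.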

References

  • Client recovery: lib/auth/expiredTokenMiddleware.ts, lib/createEnvironment/client.js
  • Context special handling: lib/graphql/yogaHandler.ts:createGraphQLResolversContext, pages/api/graphql.ts
  • Classification helper: server/graphql/helpers/auth/classifyTokenError.ts
  • Masking that must preserve the message: lib/graphql/maskError.ts
  • Companion investigation (concrete application to expired-token): docs/investigations/firebase-token-refresh/2026-06-23-expired-token-sentry-anomaly-handling.md
  • Prior investigation: docs/investigations/firebase-token-refresh/2026-04-27-firebase-token-refresh-findings.md
  • Token refresh user guide: docs/user-guides/dev-manual/token-refresh.md
  • PR that introduced per-request detection: #6408
  • Try.throwThroughErrorTypes and reporter behaviour: @power-rent/try-catch