Emergency bypass for policy self-lockout

A bad CEL policy can deny policies:write for the very role that needs to fix it. The save path’s three-stage preflight (saver subject, per-admin subject, and synthetic role-only subject) catches the obvious cases at write time, but a policy that is already live can still strand admins. For example, a valid_from may activate a deny earlier than intended.

This page documents the break-glass CLI bypass. Use it when you understand exactly what is happening and why; do not use it as a routine workaround for preflight failures.

Security note. The bypass requires direct shell access to a node running the Ironflow binary with platform credentials. It is not exposed via the dashboard or HTTP. There is no “remote bypass” — by design.

When to use the bypass

Bypass is the right tool only when:

A live policy denies policies:write for every admin role when evaluated at the current time (now).
Rolling forward (a corrective edit) is itself blocked by the self-lockout preflight.
Rolling back to a known-good version is also blocked (the rollback runs the same preflight).
You can produce a written incident note explaining the change.

If any of those four conditions is unmet, do not use bypass. The preflight is doing its job.

Common false alarms:

Preflight names one admin who is on PTO. Use a different admin’s credentials. The preflight blocks because at least one admin is locked out, not necessarily because all are.
Preflight names a synthetic role-only subject. This is the third preflight stage (OV4) and means the policy denies the role itself, not just specific people. This is rare, and the usual fix is to scope the policy more narrowly. Also consider whether the policy is correct and the role assignments are wrong.
You can wait for valid_until. If the offending policy has an upcoming expiry, waiting it out is safer than bypassing.

Procedure

Step 1 — Document the incident

Write the incident note first, before bypassing. The note must include:

The policy name and version that is blocking the corrective edit.
The audit row(s) demonstrating the lockout.
The corrective action you are about to take.
Your identity and timestamp.

This note goes into the change record. It is the audit trail that explains why a bypass exists.
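
As an illustration of the required fields, the note can be generated from a small shell helper. This is only a sketch: the function name incident_note, its argument order, and the output layout are hypothetical and not part of Ironflow, so adapt them to whatever your change record expects.

```shell
# Sketch: print a pre-bypass incident note skeleton to stdout.
# incident_note and its arguments are hypothetical, not an Ironflow command.
incident_note() {
  policy_name=$1      # policy blocking the corrective edit
  policy_version=$2   # version of that policy
  audit_rows=$3       # reference to the audit row(s) demonstrating the lockout
  action=$4           # corrective action about to be taken
  author=${5:-$(id -un)}   # defaults to the current shell user

  # Refuse to produce a note with a missing required field.
  for field in "$policy_name" "$policy_version" "$audit_rows" "$action"; do
    if [ -z "$field" ]; then
      echo "incident_note: policy, version, audit rows and action are required" >&2
      return 1
    fi
  done

  cat <<EOF
Blocking policy: $policy_name (version $policy_version)
Audit rows: $audit_rows
Corrective action: $action
Author: $author
Timestamp (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)
EOF
}
```

Redirect the output into the change record before moving on to Step 2, so the note exists before any bypass is attempted.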

Step 2 — Locate a node and platform key

Bypass is gated on a platform-tier credential (ifplatform_*), not a tenant API key. The bypass flag rejects tenant credentials with an explicit error, so a leaked tenant key cannot be used to bypass the preflight.

# On a node with the binary:
which ironflow
# Print only the 11-character prefix; no part of the secret is shown
printf '%s' "$IRONFLOW_PLATFORM_KEY" | head -c 11; echo  # expect: ifplatform_
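
If you would rather not print any part of the key, the same prefix check can be done with a pattern match. This is a local sanity check only, written as a sketch; the helper name is_platform_key is hypothetical, and the CLI performs its own rejection of tenant credentials regardless.

```shell
# Sketch: confirm a credential carries the platform-tier prefix without echoing it.
# is_platform_key is a hypothetical helper, not an Ironflow command.
# Returns 0 for ifplatform_<something>, 1 for anything else (including empty).
is_platform_key() {
  case "$1" in
    ifplatform_?*) return 0 ;;  # prefix followed by at least one character
    *) return 1 ;;
  esac
}
```

Typical use: is_platform_key "$IRONFLOW_PLATFORM_KEY" && echo "platform key present". A non-zero status means the variable is unset, empty, or holds some other kind of key.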

Step 3 — Validate the corrective policy first

Run the corrective condition through policy test against the same subjects the preflight is unhappy about. Make sure the bypassed save is actually correct: bypass disables the preflight’s safety check, but the outcome that check protects (admins keep policies:write) still has to hold.

ironflow policy test \
  --condition '<corrective condition>' \
  --request '{"action":"policies:write","resource":"irn:ironflow:org_acme:proj_default:policy:default:*"}' \
  --subject '{"id":"user_alice","roles":["admin"]}'
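
When the preflight names several subjects, the same command can be repeated for each of them. The loop below is a sketch built only from the flags shown above. It assumes, without confirmation from this page, that ironflow policy test exits non-zero when the request is denied or the test fails; check that behaviour before relying on the summary. The function name test_subjects is hypothetical.

```shell
# Sketch: run `ironflow policy test` once per subject and report failures.
# ASSUMPTION: a denied or failed test exits non-zero (not confirmed by this page).
# Usage: test_subjects <condition> <request-json> <subject-json>...
test_subjects() {
  condition=$1
  request=$2
  shift 2
  failures=0
  for subject in "$@"; do
    if ironflow policy test \
        --condition "$condition" \
        --request "$request" \
        --subject "$subject"; then
      echo "ok:   $subject"
    else
      echo "FAIL: $subject"
      failures=$((failures + 1))
    fi
  done
  # Succeed only if every subject passed.
  [ "$failures" -eq 0 ]
}
```

Pass one JSON subject per admin named by the preflight, plus a role-only subject if the third stage was the one that failed. Do not proceed to Step 4 until every subject passes.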

Step 4 — Save with bypass

ironflow policy update <policy-id> \
  --condition '<corrective condition>' \
  --bypass-self-lockout-preflight \
  --bypass-reason "incident <id>: original v4 denied admin policies:write at 14:00 UTC"

--bypass-reason is required when --bypass-self-lockout-preflight is set. The reason is forwarded as an HTTP header and logged to stderr by the CLI. Bypassed writes are still audited normally in the policy_decisions chain.
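
To make the required inputs harder to forget under incident pressure, the command can be wrapped in a small guard. This is a sketch, not an official tool: the function name bypass_update is hypothetical, and it assumes the platform key is exported as IRONFLOW_PLATFORM_KEY as in Step 2. The guard only checks local prerequisites; the server-side checks still apply.

```shell
# Sketch: refuse to attempt a bypassed save unless local prerequisites hold.
# bypass_update is a hypothetical wrapper around the command shown above.
bypass_update() {
  policy_id=$1
  condition=$2
  reason=$3

  # All three inputs are mandatory; an empty reason is rejected up front.
  if [ -z "$policy_id" ] || [ -z "$condition" ] || [ -z "$reason" ]; then
    echo "usage: bypass_update <policy-id> <condition> <reason>" >&2
    return 2
  fi

  # ASSUMPTION: the platform key lives in IRONFLOW_PLATFORM_KEY (see Step 2).
  case "${IRONFLOW_PLATFORM_KEY:-}" in
    ifplatform_*) ;;
    *) echo "IRONFLOW_PLATFORM_KEY is not a platform-tier key" >&2; return 2 ;;
  esac

  ironflow policy update "$policy_id" \
    --condition "$condition" \
    --bypass-self-lockout-preflight \
    --bypass-reason "$reason"
}
```

A call then looks like bypass_update <policy-id> '<corrective condition>' 'incident <id>: <summary>', with the reason matching the incident note written in Step 1.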

Step 5 — Verify

# Confirm the updated policy
ironflow policy get <policy-id>

# Attempt a benign update without bypass; if preflight passes, the lockout is cleared
ironflow policy update <policy-id> --name <policy-name>

Step 6 — Close the incident

Update the incident note with:

Confirmation the bypass succeeded.
Confirmation the lockout is cleared.
Any follow-up policy edits needed to prevent recurrence.
Whether the original policy was reverted, edited, or left in place.

What bypass does not disable

L1 RBAC. Bypass only skips the L2 self-lockout preflight. If your platform key doesn’t have policies:write at L1, bypass changes nothing and you still get the L1 deny.
CEL compilation. The bypassed save still runs Compile on the new condition. T1 (no empty conditions) still applies. Bypass is only about preflight, not about validity.
Audit chain. Bypassed writes are audited. The bypass reason is forwarded as an HTTP header and logged to stderr. The chain links normally; verifying the chain after a bypass should succeed.
Cache invalidation. Bypassed saves bump the tenant epoch the same way normal saves do. The cluster picks up the new policy on next lookup.

Postmortem expectations

Every bypass invocation is a small failure of the preflight model. Write a postmortem covering:

Why the original policy created the lockout (was it a valid_from typo, an unintended scope, a missing role exclusion?).
Why the per-admin preflight didn’t catch it at write time. The most common answer: the original save passed preflight as evaluated at save time, but the deny only took effect at a later valid_from activation. Consider whether the policy’s valid_from should have been simulated in the preflight (currently it isn’t; this is a known limitation).
What changes prevent recurrence: tighter resource patterns, narrower role assignments, additional preflight subjects.

Postmortems for bypass invocations feed the C6 anomaly-detection follow-up (deferred per ADR 0016). When that ships, repeated bypass patterns become alertable.

What bypass is not for

Production tenants whose admins forgot their credentials. That’s an account recovery problem, not a policy problem.
Tenant admins who don’t like the deny. L2 policies exist for a reason; if a deny is wrong, edit it normally — preflight only blocks self-lockouts, not edits in general.
Policy import or template install. Use the template bundle workflow (ironflow policy template install); templates are pre-vetted and the install path runs LintTemplate on the bundle.

See also

Conceptual model: Authorization Policies, self-lockout section
Architectural rationale: ADR 0016, decision S3
Investigating the deny that triggered the lockout: debug-deny.md
Rolling back instead of bypassing: manage-versions.md