Debug a surprising deny

Someone is hitting a 403 they didn’t expect. The first suspect is usually Layer 1 RBAC; the second-most-common cause is a Layer 2 CEL policy denying on a condition the caller can’t see. This guide walks the standard triage path.

For the conceptual model, read Authorization Policies.

A caller reports:

HTTP 403 Forbidden
{ "code": "permission_denied", "message": "denied by policy" }

…on a request that worked yesterday, or that they expected to work.

Every L2 deny writes a policy_decisions row. Query the table directly (or via your preferred SQL client) to locate the most recent deny for the caller.
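The exact query depends on your schema, so the following is only a sketch. It assumes the policy_decisions columns are named after the JSON fields in the sample row shown next, and it uses an in-memory SQLite table as a stand-in for the real Postgres database so it can run anywhere. The helper names and the policy IDs in the demo data are hypothetical.

```python
import sqlite3

# Assumption: column names mirror the JSON fields of the sample audit row.
# With a Postgres driver such as psycopg, the placeholder would be %s, not ?.
LATEST_DENY_SQL = """
SELECT seq, created_at, principal_id, action, resource, decision, policy_id
FROM policy_decisions
WHERE principal_id = ? AND decision = 'deny'
ORDER BY created_at DESC, seq DESC
LIMIT 1
"""


def latest_deny(conn, principal_id):
    """Return the most recent deny row for a principal as a dict, or None."""
    conn.row_factory = sqlite3.Row
    row = conn.execute(LATEST_DENY_SQL, (principal_id,)).fetchone()
    return dict(row) if row else None


def demo_connection():
    """Build an in-memory stand-in table with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE policy_decisions ("
        "seq INTEGER PRIMARY KEY, created_at TEXT, principal_id TEXT, "
        "action TEXT, resource TEXT, decision TEXT, policy_id TEXT)"
    )
    resource = "irn:ironflow:org_acme:proj_default:function:prod:fn_payments"
    rows = [
        (84101, "2026-05-06T09:15:02.000000Z", "user_alice",
         "functions:invoke", resource, "deny", "pol_example_old"),
        (84217, "2026-05-07T14:32:11.000000Z", "user_alice",
         "functions:invoke", resource, "deny", "pol_example"),
        (84230, "2026-05-07T15:00:00.000000Z", "user_bob",
         "functions:invoke", resource, "deny", "pol_example"),
    ]
    conn.executemany(
        "INSERT INTO policy_decisions VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


if __name__ == "__main__":
    print(latest_deny(demo_connection(), "user_alice"))
```

If several denies landed during the incident, drop the LIMIT and add a created_at range so each row can be matched to its own policy version.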

A row looks like (fields trimmed):

{
  "seq": 84217,
  "created_at": "2026-05-07T14:32:11.000000Z",
  "principal_id": "user_alice",
  "action": "functions:invoke",
  "resource": "irn:ironflow:org_acme:proj_default:function:prod:fn_payments",
  "decision": "deny",
  "policy_id": "pol_3F7K9...",
  "eval_millis": 3,
  "this_hash": "b3a1...",
  "prev_hash": "9e22..."
}

policy_id tells you which policy fired. Use ironflow policy versions list <policy_id> to see the version history and identify which version was active at decision time.

Step 2 — Pull the policy version history

ironflow policy versions list pol_3F7K9... --json | jq '.[] | select(.version_num == 4)'

The JSON output includes effect, actions, resources, condition, valid_from, valid_until, and the saver’s identity and timestamp. The condition field is the CEL expression that returned true for the audit row’s request.

Step 3 — Reproduce the deny with policy test

This is the smoking-gun step. Re-evaluate the same condition against the same subject and request:

ironflow policy test \
  --policy-id pol_3F7K9... \
  --request '{"action":"functions:invoke","resource":"irn:ironflow:org_acme:proj_default:function:prod:fn_payments","environment":"prod"}' \
  --subject '{"id":"user_alice","roles":["developer","oncall"]}'

Output:

Condition: request.environment == "prod" && !("oncall" in subject.roles)
Matched: false

Wait: the test says Matched: false (which means ALLOW for a deny policy), but the audit row shows decision: deny. That mismatch is the next clue. At decision time, user_alice held only the developer role; the test above ran against the current subject map, which includes "oncall", and that is why the results differ.
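To see why the two subjects give different answers, it can help to evaluate the condition by hand. The sketch below is a plain-Python mirror of the CEL expression from the output above, not the real evaluator; the function name is hypothetical, and the decision-time subject is reconstructed from the description in this guide.

```python
def deny_condition(request, subject):
    """Python mirror of the CEL condition:
    request.environment == "prod" && !("oncall" in subject.roles)
    """
    return request["environment"] == "prod" and "oncall" not in subject["roles"]


request = {
    "action": "functions:invoke",
    "resource": "irn:ironflow:org_acme:proj_default:function:prod:fn_payments",
    "environment": "prod",
}

# Subject as passed to policy test above (current roles).
subject_now = {"id": "user_alice", "roles": ["developer", "oncall"]}

# Subject as it was at decision time (developer only).
subject_at_decision = {"id": "user_alice", "roles": ["developer"]}

print(deny_condition(request, subject_now))          # False: no match, request allowed
print(deny_condition(request, subject_at_decision))  # True: deny policy matches
```

The only input that changed between the two calls is the roles list, which is exactly the difference between the test run and the audit row.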

If policy test and the audit row disagree, work through these in order:

  1. Wrong policy version. The audit row’s policy_id tells you which policy fired, but not which version. Run ironflow policy versions list <policy_id> --json to see the version history and identify the version active at the audit row’s created_at. A newer version may have already corrected the bug — but the row in front of you is from before the fix.
  2. valid_from / valid_until window. A policy that was active at decision time might be out of window now. ironflow policy versions list <policy_id> --json includes valid_from and valid_until — confirm audit.created_at falls inside [valid_from, valid_until).
  3. Subject map drift. L2 sees the subject as populated by the auth path. JWT/dashboard callers get subject.user_email populated; raw API keys do not. If the audit row’s subject differs from your policy test subject, copy the audit row’s subject fields verbatim into the test.
  4. Subject-state drift. The audit row records the subject as it was at decision time. If the caller’s roles changed since (added or removed oncall, role rename, group reassignment), a policy test with the current subject will disagree with the audit row. Copy the audit row’s subject fields verbatim into the test; do not look up “current” state.
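The window check in item 2 is easy to get wrong at the boundaries, so here is a minimal sketch of it. It assumes the versions JSON carries version_num, valid_from, and valid_until as RFC 3339 strings like the audit row's created_at, and that a null bound means the window is open on that side; both are assumptions about the output format. The helper names and the sample versions are hypothetical.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an RFC 3339 timestamp; 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def in_window(version, created_at):
    """True if created_at falls inside the half-open [valid_from, valid_until)."""
    at = parse_ts(created_at)
    start = version.get("valid_from")
    end = version.get("valid_until")
    if start is not None and at < parse_ts(start):
        return False
    if end is not None and at >= parse_ts(end):
        return False
    return True


def versions_in_window(versions, created_at):
    """Return the version_num of every version whose window covers created_at."""
    return [v["version_num"] for v in versions if in_window(v, created_at)]


# Hypothetical version history, shaped like the fields this guide names.
versions = [
    {"version_num": 3,
     "valid_from": "2026-04-01T00:00:00Z",
     "valid_until": "2026-05-01T00:00:00Z"},
    {"version_num": 4,
     "valid_from": "2026-05-01T00:00:00Z",
     "valid_until": None},
]

print(versions_in_window(versions, "2026-05-07T14:32:11.000000Z"))  # [4]
```

This covers only the validity window. Which version was the saved one at decision time still has to be read from the save timestamps in the version history.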

Once you’ve identified the policy and reproduced the deny:

  • Policy is wrong — edit it. The save creates a new version; subsequent denies cite the new version.
  • Policy is right, caller’s request is wrong — fix the caller. Communicate the time window or scope so the caller stops hitting the policy.
  • Policy was a mistake — roll back. See manage-versions.md. Audit rows from before the rollback retain the original policy_id; history is preserved.
  • Policy is right but blocking break-glass: use the CLI’s --bypass-self-lockout-preflight flag for the corrective edit, or revoke the policy entirely. See emergency-bypass.md.
  • Cache state. policy test reads policies fresh from PG; production reads from a layered cache. Every lookup checks the epoch, so a stale in-memory cache should not survive an epoch bump, but it remains theoretically possible. If you suspect cache staleness, look at ironflow_authz_cache_hit_ratio and ironflow_authz_decision_latency_seconds over the incident window.
  • L1 RBAC. policy test only evaluates the L2 condition. If L1 denies, L2 never runs and policy test won’t help. Check L1 RBAC through the server’s standard auth logs or API first if you suspect L1 is the source.
  • Concurrent edits. A policy edited mid-incident may have multiple versions matching different audit rows. Always identify the correct version from the audit row’s created_at via ironflow policy versions list, not “latest”.