Debug a surprising deny
Someone is hitting a 403 they didn’t expect. The first instinct is to blame Layer 1 RBAC; the second-most-common cause is a Layer 2 CEL policy denying with a condition the caller doesn’t see. This guide walks the standard triage path.
For the conceptual model, read Authorization Policies.
Symptom
A caller reports:
HTTP 403 Forbidden
{ "code": "permission_denied", "message": "denied by policy" }
…on a request that worked yesterday, or that they expected to work.
Step 1 — Locate the audit row
Every L2 deny writes a policy_decisions row. Query the table directly (or via your preferred SQL client) to locate the most recent deny for the caller.
A row looks like (fields trimmed):
{
  "seq": 84217,
  "created_at": "2026-05-07T14:32:11.000000Z",
  "principal_id": "user_alice",
  "action": "functions:invoke",
  "resource": "irn:ironflow:org_acme:proj_default:function:prod:fn_payments",
  "decision": "deny",
  "policy_id": "pol_3F7K9...",
  "eval_millis": 3,
  "this_hash": "b3a1...",
  "prev_hash": "9e22..."
}
policy_id tells you which policy fired. Use ironflow policy versions list <policy_id> to see the version history and identify which version was active at decision time.
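The lookup in Step 1 can be sketched as a parameterized query. This is a minimal illustration, not the product's schema: the column names are taken from the sample row above, the helper names (recent_denies, demo_connection) and the demo rows are hypothetical, and an in-memory SQLite database stands in for PG so the sketch runs anywhere. A PG driver would typically use %s placeholders instead of ?.

```python
import sqlite3

# Assumed query shape: columns mirror the sample audit row; the real table may differ.
RECENT_DENIES_SQL = """
SELECT seq, created_at, principal_id, action, resource, decision, policy_id
FROM policy_decisions
WHERE principal_id = ?
  AND decision = 'deny'
ORDER BY created_at DESC
LIMIT ?
"""


def recent_denies(conn, principal_id, limit=5):
    """Return the newest deny rows for one principal as a list of dicts."""
    cur = conn.execute(RECENT_DENIES_SQL, (principal_id, limit))
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def demo_connection():
    """Build a throwaway in-memory table with hypothetical rows for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE policy_decisions (seq INTEGER, created_at TEXT, "
        "principal_id TEXT, action TEXT, resource TEXT, decision TEXT, policy_id TEXT)"
    )
    rows = [
        (1, "2026-05-07T14:30:00.000000Z", "user_alice", "functions:invoke", "res_a", "allow", None),
        (2, "2026-05-07T14:31:00.000000Z", "user_alice", "functions:invoke", "res_a", "deny", "pol_example"),
        (3, "2026-05-07T14:32:11.000000Z", "user_alice", "functions:invoke", "res_a", "deny", "pol_example"),
        (4, "2026-05-07T14:33:00.000000Z", "user_bob", "functions:invoke", "res_a", "deny", "pol_example"),
    ]
    conn.executemany("INSERT INTO policy_decisions VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for row in recent_denies(demo_connection(), "user_alice"):
        print(row["seq"], row["created_at"], row["policy_id"])
```

Filtering on principal_id and decision, then sorting by created_at descending, puts the deny the caller just hit at the top of the result.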
Step 2 — Pull the policy version history
ironflow policy versions list pol_3F7K9... --json | jq '.[] | select(.version_num == 4)'
The JSON output includes effect, actions, resources, condition, valid_from, valid_until, and the saver’s identity and timestamp. The condition field is the CEL expression that returned true for the audit row’s request.
Step 3 — Reproduce the deny with policy test
This is the smoking-gun step. Re-evaluate the same condition against the same subject and request:
ironflow policy test \
  --policy-id pol_3F7K9... \
  --request '{"action":"functions:invoke","resource":"irn:ironflow:org_acme:proj_default:function:prod:fn_payments","environment":"prod"}' \
  --subject '{"id":"user_alice","roles":["developer","oncall"]}'
Output:
Condition: request.environment == "prod" && !("oncall" in subject.roles)
Matched: false
Wait: the test says Matched: false (which means ALLOW for a deny policy), but the audit row shows decision: deny. That mismatch is the next clue. At decision time, the audit row’s principal user_alice held only the developer role; the test above used the current subject map, which also includes "oncall", and that is why the result differs.
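To see why the two results differ, the condition can be hand-translated into plain Python and run against both subject states. This is only an illustration of the boolean logic, not a CEL evaluator; the function name deny_condition is hypothetical.

```python
def deny_condition(request, subject):
    """Python rendering of: request.environment == "prod" && !("oncall" in subject.roles)."""
    return request["environment"] == "prod" and "oncall" not in subject["roles"]


request = {
    "action": "functions:invoke",
    "resource": "irn:ironflow:org_acme:proj_default:function:prod:fn_payments",
    "environment": "prod",
}

# Subject as passed to the policy test above (current roles).
subject_now = {"id": "user_alice", "roles": ["developer", "oncall"]}
# Subject as it was when the audit row was written.
subject_at_decision = {"id": "user_alice", "roles": ["developer"]}

print(deny_condition(request, subject_now))          # False: deny policy does not match
print(deny_condition(request, subject_at_decision))  # True: deny policy matches
```

The same request produces opposite results because only the roles list changed, which is the pattern to look for when the test and the audit row disagree.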
Step 4 — Common mismatch causes
If policy test and the audit row disagree, work through these in order:
- Wrong policy version. The audit row’s policy_id tells you which policy fired, but not which version. Run ironflow policy versions list <policy_id> --json to see the version history and identify the version active at the audit row’s created_at. A newer version may have already corrected the bug, but the row in front of you is from before the fix.
- valid_from/valid_until window. A policy that was active at decision time might be out of window now. ironflow policy versions list <policy_id> --json includes valid_from and valid_until; confirm audit.created_at falls inside [valid_from, valid_until).
- Subject map drift. L2 sees the subject as populated by the auth path. JWT/dashboard callers get subject.user_email populated; raw API keys do not. If the audit row’s subject differs from your policy test subject, copy the audit row’s subject fields verbatim into the test.
- Subject-state drift. The audit row records the subject as it was at decision time. If the caller’s roles changed since (added or removed oncall, role rename, group reassignment), a policy test with the current subject will disagree with the audit row. Copy the audit row’s subject fields verbatim into the test; do not look up “current” state.
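The window check from the list above is a half-open interval test: valid_from is inclusive and valid_until is exclusive. A minimal sketch follows, assuming timestamps are ISO-8601 strings like the audit row’s created_at and that a missing (null) bound means the window is open on that side; the helper names are hypothetical and the null handling is an assumption, not confirmed behavior.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 timestamp; older Pythons reject a trailing 'Z', so normalize it."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def in_window(created_at, valid_from, valid_until):
    """True when created_at falls inside [valid_from, valid_until).

    Assumption: None for either bound means that side is unbounded.
    """
    ts = parse_ts(created_at)
    if valid_from is not None and ts < parse_ts(valid_from):
        return False
    if valid_until is not None and ts >= parse_ts(valid_until):
        return False
    return True


if __name__ == "__main__":
    created_at = "2026-05-07T14:32:11.000000Z"
    print(in_window(created_at, "2026-05-01T00:00:00Z", "2026-06-01T00:00:00Z"))
    print(in_window(created_at, "2026-05-01T00:00:00Z", "2026-05-07T14:32:11Z"))
```

A decision stamped exactly at valid_until falls outside the window, which is easy to miss when comparing timestamps by eye.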
Step 5 — Decide the fix
Once you’ve identified the policy and reproduced the deny:
- Policy is wrong — edit it. The save creates a new version; subsequent denies cite the new version.
- Policy is right, caller’s request is wrong — fix the caller. Communicate the time window or scope so the caller stops hitting the policy.
- Policy was a mistake: roll back. See manage-versions.md. Audit rows from before the rollback retain the original policy_id; history is preserved.
- Policy is right but blocking break-glass: use the CLI flag --bypass-self-lockout-preflight for the corrective edit, or revoke the policy entirely. See emergency-bypass.md.
What policy test does not test
- Cache state. policy test reads policies fresh from PG; production reads from a layered cache. A stale in-memory cache surviving an epoch bump is theoretically possible, though it should not happen because every lookup checks the epoch. If you suspect cache staleness, look at ironflow_authz_cache_hit_ratio and ironflow_authz_decision_latency_seconds over the incident window.
- L1 RBAC. policy test only evaluates the L2 condition. If L1 denies, L2 never runs and policy test won’t help. If you suspect L1 is the source, check L1 RBAC through the server’s standard auth logs or API first.
- Concurrent edits. A policy edited mid-incident may have multiple versions matching different audit rows. Always identify the correct version from the audit row’s created_at via ironflow policy versions list, not “latest”.
Next steps
- Roll back a bad policy: manage-versions.md
- Eval latency investigation: runbook-policy-eval-slow.md
- Audit batch backlog: runbook-audit-backlog.md