Debounce

Debounce lets a function ignore rapid-fire events until the source goes quiet. After a configured period elapses with no new events, the handler fires once with the most recent payload. Useful for webhook storms, search-as-you-type, noisy IoT sensors, and anywhere N physical events represent one logical change.

For the user-facing configuration reference, see the Debounce how-to. This page covers the “how and why it works” — what happens during an arm, who fires the run in a cluster, and how a crash is recovered.

events:     A   B   C   D                      E
            │   │   │   │                      │
            ▼   ▼   ▼   ▼                      ▼
time:   ────┼───┼───┼───┼──────────┼───────────┼──────────┼───►
                        ◀─ period ─▶           ◀─ period ─▶
                            fires(D)               fires(E)
            (B, C, D reset the timer)
  • First event arms. The first event for (function, key) creates a KV entry with FiresAt = now + period and the event payload.
  • Subsequent events reset. Each new event pushes FiresAt forward to now + period and replaces the payload. Latest-wins semantics.
  • Quiet period fires. When FiresAt <= now and no reset has landed, the scheduler claims the entry and creates a normal run with the queued payload.
  • Per-key isolation. Entries are keyed by (envID, functionID, debounceKey). Two keys debounce independently.

Debounce state lives in a NATS JetStream KV bucket named SYS_debounce_state:

  • Per-key TTL: 7 days (guards against orphaned entries from dropped events).
  • Replication: matches the NATS cluster replica count. In a multi-node deploy, every node has a local handle on the same logical bucket.
  • Bucket prefix SYS_ hides the bucket from the user-facing KV Store dashboard (which lists only APP_ buckets).

Each entry stores:

{
  environment_id, function_id, debounce_key,
  event_id, payload,      // for the firing run
  function_version,       // invalidates if fn config changed
  period_ms,
  armed_at, fires_at,
  max_fire_at             // optional starvation cap (#551)
}

max_fire_at is set once at first arm to armed_at + max_wait_ms and is preserved across resets — the cap is anchored to the first event in the window, not the latest. Without that anchor, resets would push the cap forward forever and the starvation guarantee would collapse.

The NATS KV revision on the entry — not a column inside the entry — is used for all CAS operations.

            ┌─────────────────────────┐
            │  event for (fn, key)    │
            └────────────┬────────────┘
                kv.Create(key, entry)
          ┌──────────────┴──────────────┐
          │                             │
  success: first event         ErrKeyExists: entry exists
          │                             │
  return nil (armed)            kv.Get(key) → rev
                                kv.Update(key, newEntry, rev)
                          ┌─────────────┴─────────────┐
                          │                           │
                  success: reset won          ErrKeyExists: another
                          │                       node also reset
                  return nil                          │
                                              retry Get+Update
                                              (bounded, default 5)
                                                      │
                                      exhausted: ErrCeilingHit
                                      caller falls through
                                      to normal run creation
                                      (event NOT lost)

Two nodes racing to arm the same key cannot both succeed at kv.Create — JetStream’s first-writer-wins contract guarantees exactly one creator. The loser falls into the CAS-update path and retries on stale-revision conflicts.

CAS-exhaustion fall-through. If a single key sees such violent contention that 5 CAS attempts all lose, Arm returns ErrCeilingHit and the caller (e.g., Emit in event_trigger.go) creates a regular run for that event instead. Events are never dropped; a pathological key just skips debounce. This is a per-key contention signal, not a per-function key-count cap — see #552 (closed YAGNI) for the history.
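The Create → Get+Update ladder can be sketched against an in-memory stand-in for the JetStream KV API. Everything here is illustrative (`memKV`, `Arm`); the single `ErrKeyExists` sentinel doubles for both key-exists and stale-revision, mirroring how NATS maps wrong-last-sequence:

```go
// Minimal sketch of the CAS arm/reset ladder, with bounded retries and the
// ErrCeilingHit fall-through. All names are illustrative stand-ins.
package main

import (
	"errors"
	"fmt"
)

var (
	ErrKeyExists  = errors.New("key exists / wrong revision")
	ErrCeilingHit = errors.New("CAS retries exhausted")
)

type kvEntry struct {
	value string
	rev   uint64
}

type memKV struct{ m map[string]kvEntry }

func (kv *memKV) Create(key, value string) error {
	if _, ok := kv.m[key]; ok {
		return ErrKeyExists // first-writer-wins: only one creator succeeds
	}
	kv.m[key] = kvEntry{value: value, rev: 1}
	return nil
}

func (kv *memKV) Get(key string) (string, uint64) {
	e := kv.m[key]
	return e.value, e.rev
}

func (kv *memKV) Update(key, value string, rev uint64) error {
	e, ok := kv.m[key]
	if !ok || e.rev != rev {
		return ErrKeyExists // stale revision: another node reset first
	}
	kv.m[key] = kvEntry{value: value, rev: rev + 1}
	return nil
}

// Arm tries Create first; on conflict it falls into a bounded Get+Update loop.
func Arm(kv *memKV, key, value string, maxRetries int) error {
	if err := kv.Create(key, value); err == nil {
		return nil // first event: armed
	}
	for i := 0; i < maxRetries; i++ {
		_, rev := kv.Get(key)
		if err := kv.Update(key, value, rev); err == nil {
			return nil // reset won
		}
	}
	return ErrCeilingHit // caller falls through to normal run creation
}

func main() {
	kv := &memKV{m: map[string]kvEntry{}}
	fmt.Println(Arm(kv, "fn1/key1", "payload-A", 5)) // armed
	fmt.Println(Arm(kv, "fn1/key1", "payload-B", 5)) // reset, latest wins
}
```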

Global key sentinel. If the event payload has no extractable value at the configured key path, the entry is stored under a __global__ sentinel. This collapses arbitrary empty-payload events into one debounce lane so a misconfigured source cannot flood the KV bucket with distinct-looking empty-payload entries.
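A minimal sketch of the sentinel fallback, assuming a one-level key path on an already-decoded payload (the real extraction logic is richer; `extractKey` is a hypothetical name):

```go
// Sketch of the __global__ sentinel: if the configured key path yields no
// value, all such events share one debounce lane.
package main

import "fmt"

const globalSentinel = "__global__"

// extractKey looks up a one-level key path in a decoded JSON payload and
// falls back to the sentinel when nothing usable is found.
func extractKey(payload map[string]any, keyPath string) string {
	v, ok := payload[keyPath]
	if !ok {
		return globalSentinel // missing path: collapse into the global lane
	}
	s, ok := v.(string)
	if !ok || s == "" {
		return globalSentinel // non-string or empty value: same lane
	}
	return s
}

func main() {
	fmt.Println(extractKey(map[string]any{"user_id": "u-42"}, "user_id")) // u-42
	fmt.Println(extractKey(map[string]any{}, "user_id"))                  // __global__
}
```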

A per-node scheduler ticks once per second (DefaultTickInterval = 1s):

  1. Recovery first. recoverPending lists stale rows from pending_debounce_fires (rows claimed more than 30s ago — see recovery section below) and retries run creation for each.
  2. Scan. ListEntries streams the KV bucket snapshot via WatchAll. Entries where FiresAt <= now are candidates for firing.
  3. Claim. For each expired entry, claimForFire writes a row to the pending_debounce_fires outbox table first, then issues kv.Delete(key, LastRevision(rev)). Classic outbox ordering — durable side (DB) before external mutation (KV).
  4. Fire. If the CAS delete succeeds, the sweep calls the run factory with the queued payload. On success, the pending row is deleted.

Two nodes racing to fire the same expired entry both pass the “is expired?” check, both write their own pending row, but only one DeleteRev succeeds. The loser’s CAS returns ErrKeyExists (NATS maps wrong-last-sequence to this sentinel) and best-effort deletes its own orphan pending row.
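The outbox ordering and the loser's cleanup can be simulated with in-memory stand-ins. `claimForFire`, the `kv`/`outbox` types, and the revision-gated `DeleteRev` below are illustrative, not the real API:

```go
// Sketch of the outbox-ordered claim: pending row first, then CAS delete.
// Only the node whose DeleteRev matches the observed revision fires.
package main

import (
	"errors"
	"fmt"
)

var ErrKeyExists = errors.New("wrong last revision")

type kv struct {
	present bool
	rev     uint64
}

// DeleteRev succeeds only against the revision the caller observed.
func (k *kv) DeleteRev(rev uint64) error {
	if !k.present || k.rev != rev {
		return ErrKeyExists
	}
	k.present = false
	return nil
}

type outbox struct{ rows map[string]bool }

func claimForFire(node string, entry *kv, rev uint64, ob *outbox) bool {
	ob.rows[node] = true // 1. durable side first: write the pending row
	if err := entry.DeleteRev(rev); err != nil {
		delete(ob.rows, node) // loser: best-effort orphan cleanup
		return false
	}
	// 2. fire: call the run factory, then delete the pending row
	delete(ob.rows, node)
	return true
}

func main() {
	entry := &kv{present: true, rev: 7}
	ob := &outbox{rows: map[string]bool{}}
	// Both nodes saw the same expired entry at revision 7.
	fmt.Println("node-a fired:", claimForFire("node-a", entry, 7, ob)) // true
	fmt.Println("node-b fired:", claimForFire("node-b", entry, 7, ob)) // false
}
```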

The moment between “DeleteRev succeeded” and “CreateRun returned” is the crash window. If the node dies here:

  • The KV entry is already gone (cannot re-arm on the same revision).
  • No run was created.
  • The pending_debounce_fires row remains.

On every sweep tick, every node runs recoverPending. Rows older than RecoveryStaleThreshold (30 seconds) are eligible for claim by any surviving node via ClaimPendingDebounceFire (Postgres: FOR UPDATE SKIP LOCKED + claimed_by/lease_expires_at lease; SQLite: single-writer claim with the same columns). The claimer loads the function, calls the factory, and deletes the row on success.

The 30s threshold is a multi-node duplicate-fire guard: a row freshly written by node A (in the middle of a normal fire) is invisible to node B’s recovery until A has had 30s to finish CreateRun. Normal operation stays dedup’d; only true crashes trigger recovery.

Multi-node claim (#561). Recovery uses a short lease (DefaultLeaseDuration = 60s) so surviving nodes don’t race to fire the same pending row. A node claims a row, performs recovery, and releases by deletion; if the claim holder dies mid-recovery, the lease expires and another node re-claims on the next tick.
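The Postgres claim path can be sketched roughly as below. The `pending_debounce_fires` table and the `claimed_by`/`lease_expires_at` columns come from the description above; the `id` and `claimed_at` column names are assumptions for illustration:

```sql
-- Hypothetical sketch of ClaimPendingDebounceFire (Postgres path).
WITH candidate AS (
    SELECT id
    FROM pending_debounce_fires
    WHERE claimed_at < now() - interval '30 seconds'           -- RecoveryStaleThreshold
      AND (lease_expires_at IS NULL OR lease_expires_at < now())
    ORDER BY claimed_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED      -- concurrent recoverers skip each other's rows
)
UPDATE pending_debounce_fires p
SET claimed_by       = $1,      -- this node's ID
    lease_expires_at = now() + interval '60 seconds'           -- DefaultLeaseDuration
FROM candidate
WHERE p.id = candidate.id
RETURNING p.id;
```

Release is by deletion on successful recovery; an expired lease simply makes the row visible to the `candidate` scan again on a later tick.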

A debounce entry records FunctionVersion at arm time. If the function is updated mid-window (triggers changed, timeout adjusted, handler redeployed), UpdateFunction bumps Function.Version. When the scheduler later fires the entry, it compares the armed version against the current version:

  • Match → fire normally.
  • Mismatch → drop the pending row, do not call the factory. The next event arms a fresh entry against the new version.

This prevents a handler redeployed with a narrower schema from crashing on a payload queued under the previous schema.
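The fire-time gate reduces to a version comparison; `shouldFire` is an illustrative name for the check the scheduler performs:

```go
// Sketch of the fire-time version gate: the version recorded at arm must
// match the function's current version, or the entry is dropped.
package main

import "fmt"

func shouldFire(armedVersion, currentVersion int) bool {
	return armedVersion == currentVersion
}

func main() {
	fmt.Println(shouldFire(3, 3)) // true: fire with the queued payload
	fmt.Println(shouldFire(3, 4)) // false: drop the pending row, skip the factory
}
```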

TriggerSync rejects debounced functions with FailedPrecondition. Sync callers block waiting for a result; debounce windows can be arbitrarily long; combining them means the sync caller either times out or holds a connection for the full period with no upside. The rejection happens at the handler call site (handler.go:TriggerSync), not at registration, because invocation mode is a per-call decision.
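A sketch of the per-call rejection, assuming a simplified `Function` type and a plain error in place of the real status machinery (the actual check lives in handler.go:TriggerSync):

```go
// Sketch of the sync-invocation gate: debounced functions are rejected at
// the call site because a debounce window can outlive any sync caller.
package main

import (
	"errors"
	"fmt"
)

var ErrFailedPrecondition = errors.New("FailedPrecondition: function is debounced; trigger it asynchronously")

type Function struct {
	ID        string
	Debounced bool
}

func TriggerSync(fn Function) error {
	if fn.Debounced {
		// Sync callers block for a result; a debounce window can be
		// arbitrarily long, so fail fast instead of holding a connection.
		return ErrFailedPrecondition
	}
	return nil // ... create the run and wait for its result
}

func main() {
	fmt.Println(TriggerSync(Function{ID: "fn1", Debounced: true}))
	fmt.Println(TriggerSync(Function{ID: "fn2", Debounced: false}))
}
```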

  • CLI ironflow debounce list (env-scoped) — inspect pending entries.
  • CLI ironflow debounce cancel <fn-id> <key> — drop an entry without firing.
  • Server structured log line on cancel: debounce: entry cancelled with env_id, function_id, debounce_key, source (cli/dashboard/unknown).
  • Audit row debounce.cancelled is written to audit_events (migration 015, 015_audit_debounce_cancelled.sql) on every cancel via emitDebounceCancelAudit in internal/server/debounce_handler.go.