#165 - feat: leader election for subscriptions and automations — Closes #26 - ash/eskit

ash commented

2026-03-02 21:16:36 +00:00

Owner

What

Leader election via Postgres advisory locks for multi-node deployments.

One line to enable: LeaderLock: pgstore.NewPgLockRegistry(pool)

How It Works

Start() → LeaderLock.Acquire(ctx, lockID) — blocks until acquired
Winner processes events, advances checkpoint
Others wait (zero CPU)
Graceful shutdown → defer release() → instant handoff (2ms)
Hard crash → Postgres releases lock (~30s) → new leader resumes from checkpoint

Critical Fix: Automation Checkpoint Safety

Previously, the subscription auto-checkpointed after the handler queued work to the TodoList. If the process crashed before execution, events were permanently lost.

Fix: Automation disables auto-checkpoint (AtomicCheckpoint=true) and saves checkpoint itself — only after executePending() succeeds. Crash before execute → events replay → side effects run.

Two Patterns

Simple (subscription + LeaderLock): trigger → execute inline in handler → checkpoint. For emails, webhooks, simple side effects.

Complex (automation + LeaderLock): trigger → TodoList → deferred execute → checkpoint. For completion tracking, batching, retry with backoff.

Benchmarks (Postgres, AMD EPYC)

Operation	Latency
Lock acquire+release	247µs
Graceful handoff	2.0ms
Steady-state overhead	0

Tests

Leader election: only one processes
Graceful handoff: instant lock transfer
Checkpoint never advances past unexecuted items
Checkpoint advances after execution
Retrying items block checkpoint
DLQ items unblock checkpoint
Handoff with no event loss
Backwards compatible (nil LeaderLock)
Red team: 10 attack vector tests
Race detector clean

Closes #26

## What Leader election via Postgres advisory locks for multi-node deployments. **One line to enable:** `LeaderLock: pgstore.NewPgLockRegistry(pool)` ## How It Works 1. `Start()` → `LeaderLock.Acquire(ctx, lockID)` — blocks until acquired 2. Winner processes events, advances checkpoint 3. Others wait (zero CPU) 4. Graceful shutdown → `defer release()` → instant handoff (2ms) 5. Hard crash → Postgres releases lock (~30s) → new leader resumes from checkpoint ## Critical Fix: Automation Checkpoint Safety Previously, the subscription auto-checkpointed after the handler queued work to the TodoList. If the process crashed before execution, events were permanently lost. **Fix:** Automation disables auto-checkpoint (`AtomicCheckpoint=true`) and saves checkpoint itself — only after `executePending()` succeeds. Crash before execute → events replay → side effects run. ## Two Patterns **Simple (subscription + LeaderLock):** trigger → execute inline in handler → checkpoint. For emails, webhooks, simple side effects. **Complex (automation + LeaderLock):** trigger → TodoList → deferred execute → checkpoint. For completion tracking, batching, retry with backoff. ## Benchmarks (Postgres, AMD EPYC) | Operation | Latency | |---|---| | Lock acquire+release | 247µs | | Graceful handoff | 2.0ms | | Steady-state overhead | 0 | ## Tests - Leader election: only one processes - Graceful handoff: instant lock transfer - Checkpoint never advances past unexecuted items - Checkpoint advances after execution - Retrying items block checkpoint - DLQ items unblock checkpoint - Handoff with no event loss - Backwards compatible (nil LeaderLock) - Red team: 10 attack vector tests - Race detector clean Closes #26

ash added 2 commits

2026-03-02 21:16:36 +00:00

feat: leader election + competing consumers for automation and subscription — Refs #26

CI / test-sqlitestore (push) Failing after 3s

Details

CI / test-pgstore (push) Failing after 3s

Details

CI / test-nats (push) Failing after 5s

Details

CI / test-integration (push) Failing after 4s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 11s

Details

CI / test-core (push) Successful in 1m14s

Details

1564a8da1a

- Add LeaderLock field to subscription.Config: acquires lock at Start(),
  only leader instance processes events, others block until leader releases
- Add LeaderLock field to automation.Config: passed through to subscription
- Add TodoStore interface for persistent competing-consumer execution
- Add MemoryTodoStore for single-node/testing
- Add TodoRecord type for cross-process serialization
- Add Marshal/Unmarshal funcs to automation.Config for TodoStore mode
- Hybrid mode: leader reads events + writes todos, ALL nodes execute via Claim
- Backwards compatible: nil LeaderLock/TodoStore = current behavior

docs: leader election & competing consumers — architecture, godoc, tests — Refs #26

CI / test-sqlitestore (push) Failing after 3s

Details

CI / test-pgstore (push) Failing after 3s

Details

CI / test-nats (push) Failing after 5s

Details

CI / test-integration (push) Failing after 4s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 12s

Details

CI / test-sqlitestore (pull_request) Failing after 6s

Details

CI / test-pgstore (pull_request) Failing after 5s

Details

CI / test-nats (pull_request) Failing after 6s

Details

CI / test-integration (pull_request) Failing after 4s

Details

CI / benchmarks (pull_request) Failing after 5s

Details

CI / test-codec-otelkit (pull_request) Successful in 14s

Details

CI / test-core (pull_request) Failing after 1m7s

Details

CI / test-core (push) Failing after 1m31s

Details

1675067734

- Add 'Leader Election & Competing Consumers' README section with:
  - Architecture decision: why hybrid (advisory lock + SKIP LOCKED)
  - Rolling deploy vs crash scenario walkthrough
  - Configuration examples: single-node, multi-node, subscription-only
- Update automation/doc.go with competing consumers overview
- Add subscription/leader_test.go: single-leader and graceful handoff tests
- Add automation/competing_test.go: TodoStore CRUD, claim timeout,
  no-duplicate competing executors, leader election, handoff, backwards compat

ash added 1 commit

2026-03-02 21:30:04 +00:00

feat: PgTodoStore + benchmarks — Postgres SKIP LOCKED competing consumers — Refs #26

CI / test-sqlitestore (push) Failing after 4s

Details

CI / test-pgstore (push) Failing after 4s

Details

CI / test-nats (push) Failing after 5s

Details

CI / test-integration (push) Failing after 4s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 16s

Details

CI / test-sqlitestore (pull_request) Failing after 10s

Details

CI / test-pgstore (pull_request) Failing after 9s

Details

CI / test-nats (pull_request) Failing after 10s

Details

CI / test-integration (pull_request) Failing after 10s

Details

CI / benchmarks (pull_request) Failing after 7s

Details

CI / test-codec-otelkit (pull_request) Successful in 23s

Details

CI / test-core (push) Has been cancelled

Details

CI / test-core (pull_request) Has been cancelled

Details

3b9d11e2d5

- PgTodoStore: Postgres-backed TodoStore using FOR UPDATE SKIP LOCKED
- EnsureTodoTable: idempotent table creation with partial index
- CTE-based atomic claim: select eligible + update in one round trip
- Claim timeout handles dead executors (reclaim expired items)
- Upsert on Put (idempotent, won't overwrite completed items)
- Delete on Complete (event store is the audit trail)
- Integration tests: put/claim/complete, fail/retry, claim timeout,
  concurrent 8-worker no-duplicate test (200 items), idempotent put
- Benchmarks: Postgres put, batch put, claim+complete, concurrent claim
- Memory TodoStore benchmarks for baseline comparison

ash added 1 commit

2026-03-02 21:31:05 +00:00

fix: PgTodoStore claim_timeout interval conversion — use make_interval — Refs #26

CI / test-sqlitestore (push) Failing after 7s

Details

CI / test-pgstore (push) Failing after 9s

Details

CI / test-nats (push) Failing after 12s

Details

CI / test-integration (push) Failing after 8s

Details

CI / benchmarks (push) Failing after 6s

Details

CI / test-codec-otelkit (push) Successful in 19s

Details

CI / test-sqlitestore (pull_request) Failing after 8s

Details

CI / test-pgstore (pull_request) Failing after 6s

Details

CI / test-nats (pull_request) Failing after 7s

Details

CI / test-integration (pull_request) Failing after 7s

Details

CI / benchmarks (pull_request) Failing after 6s

Details

CI / test-codec-otelkit (pull_request) Successful in 17s

Details

CI / test-core (push) Failing after 1m29s

Details

CI / test-core (pull_request) Failing after 1m22s

Details

39e7d09441

ash added 1 commit

2026-03-02 21:40:42 +00:00

refactor: strip TodoStore, keep only LeaderLock — simpler is better — Refs #26

CI / test-sqlitestore (push) Failing after 4s

Details

CI / test-pgstore (push) Failing after 4s

Details

CI / test-nats (push) Failing after 5s

Details

CI / test-integration (push) Failing after 4s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 12s

Details

CI / test-sqlitestore (pull_request) Failing after 5s

Details

CI / test-core (push) Has been cancelled

Details

CI / test-pgstore (pull_request) Failing after 5s

Details

CI / test-core (pull_request) Has been cancelled

Details

CI / test-nats (pull_request) Has been cancelled

Details

CI / test-integration (pull_request) Has been cancelled

Details

CI / benchmarks (pull_request) Has been cancelled

Details

CI / test-codec-otelkit (pull_request) Successful in 18s

Details

263dbc00b6

The checkpoint + TodoList already handles crash recovery. TodoStore added
complexity (extra table, Marshal/Unmarshal, SKIP LOCKED, claim timeouts)
for distributed execution that YoYoPass doesn't need.

What remains:
- LeaderLock on subscription.Config and automation.Config
- Advisory lock acquired once at startup, held for lifetime
- Graceful release on shutdown (defer) → instant handoff
- Standby nodes block on Acquire, zero CPU
- On failover: replay from shared checkpoint, rebuild TodoList, resume

Removed:
- TodoStore interface, MemoryTodoStore, PgTodoStore
- Marshal/Unmarshal funcs, ClaimLimit/ClaimInterval/ClaimTimeout config
- Competing consumer executor loop
- All TodoStore tests and benchmarks

ash added 1 commit

2026-03-02 21:41:07 +00:00

bench: Postgres advisory lock benchmarks — acquire/release, handoff, full cycle — Refs #26

CI / test-sqlitestore (push) Failing after 9s

Details

CI / test-pgstore (push) Failing after 9s

Details

CI / test-nats (push) Failing after 8s

Details

CI / test-integration (push) Failing after 6s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 16s

Details

CI / test-sqlitestore (pull_request) Failing after 5s

Details

CI / test-pgstore (pull_request) Failing after 5s

Details

CI / test-nats (pull_request) Failing after 7s

Details

CI / test-integration (pull_request) Failing after 4s

Details

CI / benchmarks (pull_request) Failing after 5s

Details

CI / test-codec-otelkit (pull_request) Successful in 14s

Details

CI / test-core (push) Failing after 1m14s

Details

CI / test-core (pull_request) Failing after 1m12s

Details

52954f347b

ash added 1 commit

2026-03-02 21:47:53 +00:00

redteam: leader election attack vector analysis with 10 tests — Refs #26

CI / test-sqlitestore (push) Failing after 4s

Details

CI / test-pgstore (push) Failing after 4s

Details

CI / test-nats (push) Failing after 6s

Details

CI / test-integration (push) Failing after 5s

Details

CI / benchmarks (push) Failing after 5s

Details

CI / test-codec-otelkit (push) Successful in 14s

Details

CI / test-sqlitestore (pull_request) Failing after 5s

Details

CI / test-pgstore (pull_request) Failing after 5s

Details

CI / test-nats (pull_request) Failing after 7s

Details

CI / test-integration (pull_request) Failing after 6s

Details

CI / benchmarks (pull_request) Failing after 6s

Details

CI / test-codec-otelkit (pull_request) Successful in 16s

Details

CI / test-core (push) Failing after 1m14s

Details

CI / test-core (pull_request) Failing after 1m13s

Details

445f4183b8

Findings:
- BUG: Double Start() panics (close of closed done channel) in both
  EventSubscription and Automation — not reusable after Stop/Wait
- BUG: Pre-existing race in TestLeaderLock_HandoffOnStop (not introduced by us)
- DESIGN: TodoList items silently lost on leader handoff when checkpoint
  has already advanced past trigger event (at-most-once execution)
- SOLID: Mutual exclusion works correctly (MemoryLockRegistry)
- SOLID: Lock released after checkpoint save on context cancel
- SOLID: Checkpoint advances independently of slow Execute
- SOLID: FNV-1a hash collision unlikely for reasonable automation names
- DOCUMENTED: At-least-once delivery on checkpoint save failure (by design)

ash added 1 commit

2026-03-02 22:06:22 +00:00

fix: automation checkpoint only advances after execution — prevents event loss — Refs #26

CI / test-sqlitestore (push) Failing after 3s

Details

CI / test-pgstore (push) Failing after 3s

Details

CI / test-nats (push) Failing after 4s

Details

CI / test-integration (push) Failing after 3s

Details

CI / benchmarks (push) Failing after 3s

Details

CI / test-codec-otelkit (push) Successful in 11s

Details

CI / test-core (push) Has been cancelled

Details

CI / test-sqlitestore (pull_request) Failing after 4s

Details

CI / test-core (pull_request) Has been cancelled

Details

CI / test-pgstore (pull_request) Has been cancelled

Details

CI / test-nats (pull_request) Has been cancelled

Details

CI / test-codec-otelkit (pull_request) Has been cancelled

Details

CI / test-integration (pull_request) Has been cancelled

Details

CI / benchmarks (pull_request) Has been cancelled

Details

d16fb4c0ea

CRITICAL FIX: Previously, the subscription auto-checkpointed after the handler
processed each event. But the handler only QUEUES work (adds to TodoList) —
it doesn't EXECUTE side effects. If the process crashed between handler and
execution, events were permanently lost (checkpoint advanced past unexecuted items).

Fix: automation sets AtomicCheckpoint=true on the subscription (disabling auto-save)
and manages checkpoint saving itself in executePending(). The checkpoint only
advances after ALL pending items execute successfully.

Guarantees:
- Crash before execute → events replay on restart → side effects run
- Failed items block checkpoint until retries exhausted → moves to DLQ → advances
- Graceful handoff → leader 2 replays from last executed position → no loss

Also fixes:
- Double Start() panic → returns error instead of closing done channel twice

Tests proving correctness:
- TestCheckpoint_NeverAdvancesPastUnexecuted: 5 events, crash before execute, checkpoint=0
- TestCheckpoint_AdvancesAfterExecution: checkpoint advances after all 3 items execute
- TestCheckpoint_DoesNotAdvanceWhileRetrying: transient failures block checkpoint
- TestCheckpoint_AdvancesAfterDLQ: exhausted items move to DLQ, checkpoint advances
- TestHandoff_NoEventLoss: leader 1 crashes, leader 2 re-processes all events

ash added 1 commit

2026-03-02 22:06:43 +00:00

docs: when to use subscription vs automation — Refs #26

CI / test-sqlitestore (push) Failing after 4s

Details

CI / test-pgstore (push) Failing after 6s

Details

CI / test-nats (push) Failing after 8s

Details

CI / test-integration (push) Failing after 5s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 14s

Details

CI / test-sqlitestore (pull_request) Failing after 5s

Details

CI / test-pgstore (pull_request) Failing after 3s

Details

CI / test-nats (pull_request) Failing after 6s

Details

CI / test-integration (pull_request) Failing after 4s

Details

CI / benchmarks (pull_request) Failing after 5s

Details

CI / test-codec-otelkit (pull_request) Successful in 13s

Details

CI / test-core (pull_request) Failing after 1m10s

Details

CI / test-core (push) Failing after 1m40s

Details

4ff4dc10c8

ash changed title from ~~feat: leader election + competing consumers for automations & subscriptions~~ to feat: leader election for subscriptions and automations — Closes #26

2026-03-02 22:08:10 +00:00

ash added 1 commit

2026-03-02 22:12:46 +00:00

fix: race condition — snapshot seq before pending in executePending — Refs #26

CI / test-sqlitestore (push) Failing after 3s

Details

CI / test-pgstore (push) Failing after 3s

Details

CI / test-nats (push) Failing after 4s

Details

CI / test-integration (push) Failing after 3s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 12s

Details

CI / test-sqlitestore (pull_request) Failing after 5s

Details

CI / test-pgstore (pull_request) Failing after 5s

Details

CI / test-nats (pull_request) Failing after 7s

Details

CI / test-integration (pull_request) Failing after 4s

Details

CI / benchmarks (pull_request) Failing after 4s

Details

CI / test-codec-otelkit (pull_request) Successful in 14s

Details

CI / test-core (push) Failing after 1m13s

Details

CI / test-core (pull_request) Failing after 1m11s

Details

4fe07e21e6

Without this fix, the handler could add an item BETWEEN executePending reading
pending (empty) and reading lastHandledSeq. This would advance the checkpoint
past an event whose side effect never executed.

Fix: read lastHandledSeq BEFORE reading pending. Worst case: checkpoint lags
by one tick (stale seq). Never skips ahead past unexecuted items.

Added test: TestCheckpoint_RaceBetweenHandlerAndExecutor (50 events with
staggered timing to stress the interleaving window).

ash added 1 commit

2026-03-02 22:15:04 +00:00

fix: update red team tests to reflect checkpoint safety fix — Refs #26

CI / test-sqlitestore (push) Failing after 3s

Details

CI / test-pgstore (push) Failing after 3s

Details

CI / test-nats (push) Failing after 5s

Details

CI / test-integration (push) Failing after 4s

Details

CI / benchmarks (push) Failing after 4s

Details

CI / test-sqlitestore (pull_request) Failing after 4s

Details

CI / test-codec-otelkit (push) Successful in 14s

Details

CI / test-pgstore (pull_request) Failing after 5s

Details

CI / test-nats (pull_request) Failing after 7s

Details

CI / test-integration (pull_request) Failing after 4s

Details

CI / benchmarks (pull_request) Failing after 5s

Details

CI / test-codec-otelkit (pull_request) Successful in 13s

Details

CI / test-core (push) Failing after 1m11s

Details

CI / test-core (pull_request) Failing after 1m38s

Details

eff3911631

Red team tests were written before the checkpoint fix and logged misleading
'FINDING: item lost' messages. After the fix, the behavior is correct:
- Leader 1 executes, checkpoint advances, leader 2 doesn't re-execute
- Items aren't 'lost' — they were successfully executed

ash merged commit 85dd0cda39 into main

2026-03-02 22:16:03 +00:00

ash referenced this pull request from a commit

2026-03-02 22:16:05 +00:00

Merge pull request 'feat: leader election for subscriptions and automations — Closes #26' (#165) from feat/leader-election-26 into main

Rows
Columns

feat: leader election for subscriptions and automations — Closes #26 #165

What

How It Works

Critical Fix: Automation Checkpoint Safety

Two Patterns

Benchmarks (Postgres, AMD EPYC)

Tests