Stateful Service

A service that maintains long-lived in-memory state: caches, in-memory aggregates, leader-elected coordinators, websocket gateways, real-time engines, sticky-session servers.

The hard problems are concurrency, recovery, and unbounded growth. Stateful services fail in ways stateless services do not.

What needs to be covered

| Layer | Concern | Test type |
| --- | --- | --- |
| State machine logic | Pure transitions | Solitary unit tests |
| Persistence and checkpointing | State survives restart or rebuilds correctly | Component tests with real persistence |
| Recovery from crash | Restart converges to a consistent state | Component tests that simulate crash mid-write |
| Leader election | Only one leader; transitions are observable; split-brain is impossible | Cluster tests with real consensus library |
| Replication | Followers stay in sync; backpressure is documented | Cluster tests |
| Memory bounds | State doesn’t grow unbounded; eviction policy holds | Long-running soak tests |
| Connection lifecycle | Sessions clean up on disconnect; reconnect is documented | Component tests |

Figure: Stateful Service: Layers and the Tests That Cover Each

Layered diagram of a stateful service with six architectural layers. The first five (state machine logic, persistence and recovery, single-node concurrency, replication and leader election, memory bounds and long-run behavior) are inside the component boundary. Below the dashed boundary, the persistence engine is drawn with a dashed border. Solitary unit tests cover state transitions. Component tests cover persistence, recovery, and single-node concurrency. Cluster tests exercise replication and leader election against a multi-node testcontainer setup. Out-of-band soak and chaos tests catch unbounded growth, slow leaks, and replication-lag drift against a deployed instance.
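
As an illustration of the top layer, here is a minimal sketch of a pure transition function and the solitary unit tests that cover it. The connection-lifecycle states, the event names, and the `transition` helper are all hypothetical and chosen only for this example; a real service would use its own documented machine.

```python
from enum import Enum


class State(Enum):
    CONNECTING = "connecting"
    OPEN = "open"
    CLOSING = "closing"
    CLOSED = "closed"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.CONNECTING, "connected"): State.OPEN,
    (State.CONNECTING, "failed"): State.CLOSED,
    (State.OPEN, "close_requested"): State.CLOSING,
    (State.OPEN, "dropped"): State.CLOSED,
    (State.CLOSING, "closed"): State.CLOSED,
}


class InvalidTransition(Exception):
    """Raised when an event is not allowed in the current state."""


def transition(state: State, event: str) -> State:
    """Pure function: no I/O, no clock, no shared mutable state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} is not allowed in {state.name}") from None


def test_happy_path():
    state = State.CONNECTING
    for event in ("connected", "close_requested", "closed"):
        state = transition(state, event)
    assert state is State.CLOSED


def test_rejects_undocumented_transition():
    try:
        transition(State.CLOSED, "connected")
    except InvalidTransition:
        return
    raise AssertionError("expected InvalidTransition")
```

Because the function takes the current state as an argument and returns the next one, these tests need no doubles, no persistence, and no threads, which is what lets them run as solitary unit tests.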

Positive test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

  • State transitions: follow the documented machine.
  • Restart: state rebuilds and behavior matches pre-restart.
  • Replication lag under expected load: stays within budget.
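
The restart case can be sketched as a component test that checkpoints state, discards the instance, rebuilds from disk, and then checks that both state and behavior match. The `CounterAggregate` class and its JSON checkpoint file are assumptions made for this example, not part of the pattern.

```python
import json
import tempfile
from pathlib import Path


class CounterAggregate:
    """Hypothetical in-memory aggregate: per-key running totals."""

    def __init__(self) -> None:
        self.totals = {}

    def apply(self, key: str, amount: int) -> None:
        self.totals[key] = self.totals.get(key, 0) + amount

    def checkpoint(self, path: Path) -> None:
        # Simplified: a production checkpoint would also need to be crash-safe.
        path.write_text(json.dumps(self.totals, sort_keys=True))

    @classmethod
    def restore(cls, path: Path) -> "CounterAggregate":
        instance = cls()
        if path.exists():
            instance.totals = json.loads(path.read_text())
        return instance


def test_restart_rebuilds_state_and_behavior():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "checkpoint.json"

        before = CounterAggregate()
        before.apply("orders", 3)
        before.apply("refunds", 1)
        before.checkpoint(path)

        # Simulate a restart: build a fresh instance from the checkpoint only.
        after = CounterAggregate.restore(path)
        assert after.totals == before.totals

        # Behavior matches pre-restart: the same input yields the same state.
        before.apply("orders", 2)
        after.apply("orders", 2)
        assert after.totals == before.totals == {"orders": 5, "refunds": 1}
```

The second assertion matters as much as the first: a rebuilt instance can hold the right data and still behave differently if some derived state was never checkpointed.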

Negative test cases

Common cases to consider, not an exhaustive list. Drop items that don’t apply and add ones the pattern doesn’t mention but your component needs.

  • Crash mid-write: consistent state on restart. No torn writes.
  • Network partition: minority replicas step down with documented reconciliation on heal.
  • Slow replication: applies backpressure rather than silent divergence.
  • Memory pressure: evicts oldest entries per policy without OOM.
  • Idle long-running connections: close cleanly with documented reconnect behavior.
  • Concurrent state mutations: serialize without lost updates.
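
To make the crash mid-write case concrete, the sketch below writes each checkpoint to a temporary file and renames it over the target only once the write is complete, then simulates a crash part-way through. The `write_checkpoint` and `read_checkpoint` helpers and the `crash_after_bytes` test hook are hypothetical. Write-then-rename is one common way to avoid torn writes and is assumed here, not prescribed by the pattern. The sketch also skips fsync on the containing directory, which a durable implementation would likely need.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


class SimulatedCrash(Exception):
    """Stands in for the process dying part-way through a write."""


def write_checkpoint(path: Path, state: dict,
                     crash_after_bytes: Optional[int] = None) -> None:
    """Write to a temp file, fsync, then rename over the target."""
    payload = json.dumps(state, sort_keys=True).encode()
    tmp_path = path.with_suffix(".tmp")
    with open(tmp_path, "wb") as f:
        if crash_after_bytes is not None:
            # Test hook: leave a partial temp file behind and "die".
            f.write(payload[:crash_after_bytes])
            f.flush()
            raise SimulatedCrash()
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    # The target is only ever replaced by a fully written file.
    os.replace(tmp_path, path)


def read_checkpoint(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {}


def test_crash_mid_write_leaves_previous_checkpoint_intact():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "checkpoint.json"
        write_checkpoint(path, {"version": 1})
        try:
            write_checkpoint(path, {"version": 2, "pad": "x" * 1000},
                             crash_after_bytes=10)
        except SimulatedCrash:
            pass
        # On restart the reader sees the last complete checkpoint, not a torn one.
        assert read_checkpoint(path) == {"version": 1}
```

A real component test would kill the process or inject a fault at the storage layer instead of raising an exception, but the assertion stays the same: after restart, the state is the last complete checkpoint.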

Test double validation

Persistence doubles are validated by adapter integration tests against the real production engine. Consensus library doubles are validated by cluster tests against a multi-node testcontainer setup. Soak tests run out of pipeline against a deployed instance to catch slow leaks and unbounded growth.
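
One way to validate a persistence double is to run a single contract check against both the double and the real adapter. In the sketch below, `StateStore`, `InMemoryStore`, `SqliteStore`, and `check_store_contract` are hypothetical names, and the standard-library `sqlite3` module stands in for the production engine purely so the example is self-contained.

```python
import sqlite3
from typing import Optional, Protocol


class StateStore(Protocol):
    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """The test double used in single-node component tests."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class SqliteStore:
    """Adapter over a real engine (sqlite3 is a stand-in for production)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)"
        )

    def put(self, key: str, value: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
            )

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else None


def check_store_contract(store: StateStore) -> None:
    """Behavior that both the double and the real adapter must share."""
    assert store.get("missing") is None
    store.put("session:1", "open")
    assert store.get("session:1") == "open"
    store.put("session:1", "closed")  # overwrite, not duplicate
    assert store.get("session:1") == "closed"


def test_double_honors_contract():
    check_store_contract(InMemoryStore())


def test_real_adapter_honors_contract():
    conn = sqlite3.connect(":memory:")
    try:
        check_store_contract(SqliteStore(conn))
    finally:
        conn.close()
```

If the double and the adapter ever disagree, the contract check fails on one of them, which surfaces drift before a component test built on the double gives a false pass.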

Pipeline placement

State machine unit tests, recovery component tests, and single-node concurrency tests run in CI Stage 1. Cluster tests with a real consensus library run in CI Stage 2. Soak and chaos tests run out of pipeline.