9 Sept 2025 · 1 min read
Pipelines/Infra

The case for resumable pipelines

Why I think long-running workflows should be able to fail without sending you back to the beginning.

Long-running pipelines fail. They fail because of bad inputs, transient API errors, infrastructure flakes, or rare upstream changes. The question is never whether a run will fail; it is how cheaply you recover.

Resumability is much cheaper to design in than to retrofit. The mental model is simple: every stage produces a content-addressed artifact. On restart, the pipeline checks which artifacts already exist and skips the stages that produced them. Failure is just the absence of an artifact.
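Here is a minimal sketch of that mental model, not a production design. The names `artifact_key` and `run_stage`, the JSON-on-disk store, and the choice to derive the address from the stage name plus its inputs are all my assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def artifact_key(stage: str, inputs: dict) -> str:
    """Derive a stable address from the stage name and its inputs (assumed scheme)."""
    payload = json.dumps({"stage": stage, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_stage(store: Path, stage: str, inputs: dict,
              fn: Callable[[dict], dict]) -> dict:
    """Return the stage's artifact, doing the work only if it is not stored yet."""
    path = store / f"{artifact_key(stage, inputs)}.json"
    if path.exists():
        # The artifact is already there: a restart skips the work entirely.
        return json.loads(path.read_text())

    result = fn(inputs)

    # Write to a temporary file and rename it into place, so a crash mid-write
    # leaves no artifact behind rather than a truncated one.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)
    return result
```

Calling `run_stage` twice with the same stage and inputs runs `fn` once. If the process dies between the two calls, the second call still finds the file and moves on, which is the whole point.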

Plan for failure first. Decide what is a checkpoint before you write a single worker.

The trap is enforcing idempotency at the wrong level. Job-level idempotency lies: two runs of the "same" job can still produce different outputs if the world has shifted. Artifact-level idempotency is honest: an artifact either exists at this hash, or it does not.
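To make the difference concrete, here is a small sketch using an in-memory stand-in for a content-addressed store. `ArtifactStore`, `digest`, and the sample payloads are hypothetical and exist only for this illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content address: the SHA-256 of the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


class ArtifactStore:
    """In-memory stand-in for a content-addressed store (hypothetical)."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = digest(data)
        # Storing identical bytes again lands on the same key, so it changes nothing.
        self._blobs[key] = data
        return key

    def has(self, key: str) -> bool:
        return key in self._blobs


store = ArtifactStore()

# Two runs of the "same" job, with the upstream data shifted in between.
first_run = store.put(b"rates: 1.08")
second_run_key = digest(b"rates: 1.09")

# A job-level "done" flag would cover both runs. The store does not:
# the second output has a different hash, and nothing exists there yet.
print(store.has(first_run))       # True
print(store.has(second_run_key))  # False
```

The store never has to decide whether two jobs are "the same". It only answers whether these exact bytes exist, which is a question with a definite answer.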
