Long-running pipelines fail. They fail because of bad inputs, transient API errors, infrastructure flakes, or rare upstream changes. The question is never whether — it's how cheaply you recover.
Resumability is much cheaper to design in than to retrofit. The mental model is simple: every stage produces a content-addressed artifact. Restarting reads what already exists and skips it. Failure is just the absence of an artifact.
Plan for failure first. Decide what is a checkpoint before you write a single worker.
The trap is enforcing idempotency at the wrong level. Job-level idempotency lies — two runs of the "same" job can still produce different outputs if the world has shifted. Artifact-level idempotency is honest: an artifact either exists at this hash, or it does not.