Long-form to short-form, automated
A Django, FastAPI, Celery, and RabbitMQ media pipeline that repurposes long-form video into near-ready short-form deliverables, using durable run/job/output records, artifact-first stage boundaries, PydanticAI-driven analysis, and Remotion on AWS Lambda for parallel rendering.
Manual social clipping took about four weeks per campaign. This pipeline turns a two-hour podcast into 30 to 40 branded social clips in roughly one hour.
Before this, enterprises were paying editors to sit through hours of recording, find the usable moments, cut them down, caption them, and export multiple aspect ratios by hand. The target here was not draft quality. It was near-ready output at production throughput.
The system now lands around 95% ready-to-post quality by pushing the work through twelve cooperating stages that can fail, resume, and reuse work without losing their place.
Multiple transcription providers produce timestamped text. Segment detection and scene analysis establish candidate boundaries. PydanticAI-driven LLM passes look for high-energy moments, speaker changes, topic completeness, hook strength, quotability, and likely engagement. Preparation and face tracking then shift to clip-local assets, with the heavier vision path deployed separately on Cloud Run. Composition writes the render index. Rendering fans out through Remotion on AWS Lambda, and the analysis boundary closes before delivery does.
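As a minimal sketch of one of those PydanticAI passes, the block below scores a candidate segment against the dimensions named above. The model id, field names, and prompt are illustrative rather than the production schema, and depending on the pydantic-ai release the result_type and .data names may instead be output_type and .output.

```python
from pydantic import BaseModel, Field
from pydantic_ai import Agent


class ClipScore(BaseModel):
    # Illustrative scoring schema; the real pipeline's fields are not published here.
    start_s: float
    end_s: float
    hook_strength: float = Field(ge=0, le=1)
    quotability: float = Field(ge=0, le=1)
    emotional_tone: str
    topic_complete: bool
    predicted_engagement: float = Field(ge=0, le=1)


scoring_agent = Agent(
    "openai:gpt-4o",                     # any model id the deployment supports
    result_type=list[ClipScore],         # structured output validated against the schema
    system_prompt="Score each candidate segment of the transcript for short-form potential.",
)

# transcript_window: timestamped text around one candidate boundary
# scores = scoring_agent.run_sync(transcript_window).data
```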
The run is the top-level control object. Each stage becomes a durable job. Each rendered short becomes its own output. The tight job state model — pending, in_progress, completed, failed — keeps orchestration predictable while still allowing retries, subtree resets, and reuse of completed work.
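A minimal sketch of those durable records as Django models follows. Table and field names are illustrative assumptions; only the four job states and the run/job/output split come from the design described above.

```python
from django.db import models


class Run(models.Model):
    source_asset_id = models.CharField(max_length=64)
    metadata = models.JSONField(default=dict)        # style pins, overrides, rerun flags
    created_at = models.DateTimeField(auto_now_add=True)


class Job(models.Model):
    class State(models.TextChoices):
        PENDING = "pending"
        IN_PROGRESS = "in_progress"
        COMPLETED = "completed"
        FAILED = "failed"

    run = models.ForeignKey(Run, on_delete=models.CASCADE, related_name="jobs")
    stage_type = models.CharField(max_length=64)      # e.g. "transcription", "composition"
    state = models.CharField(max_length=16, choices=State.choices, default=State.PENDING)
    artifact_uri = models.CharField(max_length=512, blank=True)   # where the stage published


class Output(models.Model):
    run = models.ForeignKey(Run, on_delete=models.CASCADE, related_name="outputs")
    external_render_id = models.CharField(max_length=128, blank=True)
    render_state = models.CharField(max_length=16, default="pending")
    media_uri = models.CharField(max_length=512, blank=True)
    updated_at = models.DateTimeField(auto_now=True)
```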
Each stage reads the artifacts it depends on, computes, writes new structured data or media to durable storage, and marks its job complete. The orchestrator triggers what comes next. Delegated work, retries, and reuse all fall out of that contract. Debugging becomes reading the directory.
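One way to express that contract is a Celery task like the sketch below, reusing the Job model above. The helpers read_artifact, write_artifact, upstream_jobs, compute_stage, and trigger_next_stages are hypothetical stand-ins for durable storage and the orchestrator.

```python
from celery import shared_task


@shared_task(bind=True, max_retries=3)
def run_stage(self, job_id: int):
    job = Job.objects.get(pk=job_id)
    job.state = Job.State.IN_PROGRESS
    job.save(update_fields=["state"])
    try:
        # Read upstream artifacts, compute, publish, then mark complete.
        inputs = [read_artifact(dep.artifact_uri) for dep in upstream_jobs(job)]
        result = compute_stage(job.stage_type, inputs)       # pure stage logic
        job.artifact_uri = write_artifact(job, result)       # publish before completing
        job.state = Job.State.COMPLETED
        job.save(update_fields=["artifact_uri", "state"])
        trigger_next_stages(job)                              # orchestrator decides what runs next
    except Exception as exc:
        job.state = Job.State.FAILED
        job.save(update_fields=["state"])
        raise self.retry(exc=exc)
```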
The composition index spawns one render job per short. Remotion packages client branding, captions, and aspect-ratio variants, then AWS Lambda fans those renders out in parallel. Each output completes on its own schedule, signalled either by an async webhook or by background polling. Finalization is modeled as a terminal-state transition so a missed callback is recovered by the poll, and a duplicate signal becomes a no-op.
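A sketch of that finalization step, assuming the Output model above: the webhook handler and the background poller both call the same conditional update, so whichever signal arrives second does nothing.

```python
from django.db import transaction

TERMINAL = {"completed", "failed"}


def finalize_output(output_id: int, outcome: str, media_uri: str = "") -> bool:
    """Apply a render outcome exactly once; return True if this call won the transition."""
    with transaction.atomic():
        updated = (
            Output.objects
            .filter(pk=output_id)
            .exclude(render_state__in=TERMINAL)   # only non-terminal rows can transition
            .update(render_state=outcome, media_uri=media_uri)
        )
    if updated:
        maybe_complete_delivery(output_id)        # hypothetical delivery rollup hook
    return bool(updated)
```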
The style and enhancement control plane is a separate subsystem. It authors and publishes immutable creative contracts: subtitle systems, overlays, audio treatment, and visual rules. The pipeline pins them by version at runtime. Creative quality control stays out of the production DAG.
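A sketch of how such a contract might be stored and pinned, assuming a StyleContract table published by the control plane; the unique (name, version) pair stands in for immutability, and the field names are illustrative.

```python
from django.db import models


class StyleContract(models.Model):
    name = models.CharField(max_length=64)        # e.g. "client-acme-shorts"
    version = models.PositiveIntegerField()
    spec = models.JSONField()                     # subtitle system, overlays, audio, visual rules
    published_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        unique_together = [("name", "version")]   # published versions are never mutated


def pin_style(run, name: str) -> StyleContract:
    """Resolve the latest published version once and record the pin on the run."""
    contract = StyleContract.objects.filter(name=name).order_by("-version").first()
    run.metadata["style_contract"] = {"name": contract.name, "version": contract.version}
    run.save(update_fields=["metadata"])
    return contract
```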
The trigger layer can attach a new execution to compatible completed jobs for the same source asset and stage type. No separate cache. The orchestrator keys off completed contracts, not task invocations. Latency drops, cost drops, and the graph continues from the reused boundary.
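One way that reuse lookup could be keyed, again against the Job model and the run_stage task sketched above; the query and the attach_or_schedule helper are illustrative.

```python
def find_reusable(source_asset_id: str, stage_type: str):
    # A completed contract for the same source asset and stage type is reusable as-is.
    return (
        Job.objects
        .filter(
            run__source_asset_id=source_asset_id,
            stage_type=stage_type,
            state=Job.State.COMPLETED,
        )
        .exclude(artifact_uri="")
        .order_by("-id")
        .first()
    )


def attach_or_schedule(run, stage_type: str):
    prior = find_reusable(run.source_asset_id, stage_type)
    if prior:
        # Attach the published artifact; the graph continues from this boundary.
        Job.objects.create(run=run, stage_type=stage_type,
                           state=Job.State.COMPLETED, artifact_uri=prior.artifact_uri)
    else:
        job = Job.objects.create(run=run, stage_type=stage_type)
        run_stage.delay(job.pk)
```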
Style override reruns enter at the enhancement-planning boundary rather than at the start. The override lives in the run metadata, and downstream composition and rendering regenerate. The DAG is not just executable. It is partially replayable.
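A sketch of that entry-point selection, with an illustrative stage order; the pipeline has twelve stages, but this exact list is an assumption.

```python
FULL_ORDER = [
    "transcoding", "transcription", "segment_detection", "analysis",
    "preparation", "face_tracking", "enhancement_planning",
    "composition", "rendering", "delivery",
]


def stages_to_run(run) -> list[str]:
    # A style-only rerun enters at the enhancement-planning boundary and
    # replays only the creative stages downstream of it.
    entry = "enhancement_planning" if run.metadata.get("style_override") else FULL_ORDER[0]
    return FULL_ORDER[FULL_ORDER.index(entry):]
```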
Workflow completion means analysis and composition are done; delivery completion means every output has reached a terminal render outcome. Any failed required stage marks the workflow failed. Per-clip render failures roll up into delivery state, not workflow state. A sketch of those two rollups follows.
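The rollup below assumes the models sketched earlier; the required-stage set and the delivery state names are illustrative.

```python
REQUIRED_STAGES = {"analysis", "composition"}     # illustrative required set


def workflow_state(run) -> str:
    jobs = {j.stage_type: j.state for j in run.jobs.all()}
    if any(jobs.get(s) == Job.State.FAILED for s in REQUIRED_STAGES):
        return "failed"
    if all(jobs.get(s) == Job.State.COMPLETED for s in REQUIRED_STAGES):
        return "completed"
    return "in_progress"


def delivery_state(run) -> str:
    states = list(run.outputs.values_list("render_state", flat=True))
    if states and all(s in ("completed", "failed") for s in states):
        # Per-clip failures only degrade delivery, never the workflow.
        return "partial" if "failed" in states else "delivered"
    return "in_progress"
```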
Strange codecs, broken containers, missing audio. Transcoding has to be defensive enough that downstream stages can assume a stable input.
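A sketch of what that defensive normalization could look like with ffmpeg and ffprobe on the path; the codec and container choices are illustrative, not the production profile.

```python
import json
import subprocess


def probe(path: str) -> dict:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)


def normalize(src: str, dst: str) -> None:
    has_audio = any(s["codec_type"] == "audio" for s in probe(src)["streams"])
    if has_audio:
        cmd = ["ffmpeg", "-y", "-i", src, "-map", "0:v:0", "-map", "0:a:0",
               "-c:v", "libx264", "-pix_fmt", "yuv420p",
               "-c:a", "aac", "-ar", "48000", "-movflags", "+faststart"]
    else:
        # Synthesize a silent track so later stages never special-case missing audio.
        cmd = ["ffmpeg", "-y", "-i", src,
               "-f", "lavfi", "-i", "anullsrc=channel_layout=stereo:sample_rate=48000",
               "-map", "0:v:0", "-map", "1:a:0", "-shortest",
               "-c:v", "libx264", "-pix_fmt", "yuv420p",
               "-c:a", "aac", "-movflags", "+faststart"]
    cmd.append(dst)
    subprocess.run(cmd, check=True)
```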
An hour of analysis cannot collapse on a render error. Resumability has to live at the artifact boundary, not at the task level.
Stitch Prepare and Rendering both depend on external systems. Polling is the safety net so a dropped webhook doesn't strand a run.
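That safety net can be sketched as a periodic Celery task like the one below, assuming a hypothetical render_client and the finalize_output transition shown earlier; field and attribute names are illustrative.

```python
from datetime import timedelta

from celery import shared_task
from django.utils import timezone


@shared_task
def reconcile_pending_renders():
    # Any output still pending past a threshold is reconciled against the
    # external render service, so a dropped webhook cannot strand the run.
    stale = timezone.now() - timedelta(minutes=5)
    for output in Output.objects.filter(render_state="pending", updated_at__lt=stale):
        status = render_client.get_status(output.external_render_id)   # hypothetical client
        if status.is_terminal:
            finalize_output(output.pk, status.outcome, status.media_uri)
```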
Plans can be technically valid but editorially uneven. Treating the plan as an explicit contract, with style binding and enhancement separation, keeps reruns cheap when only the creative layer needs to move.
What used to take a campaign-long editorial loop now compresses into a single processing run, because transcription, planning, preparation, and rendering all execute as resumable async stages.
Candidate clips are not just cut on timestamps. They are scored for hook strength, emotional tone, quotability, and completion, then rendered with client branding and captions already applied.
Parallel Lambda export turns per-clip rendering into fanout instead of a queue of serial renders, so dozens of outputs land in minutes rather than hours.
When stages publish to durable storage, retries, reuse, and delegated execution all become trivial. The orchestrator stops caring about who computed what.
Holding clip selection and creative treatment in different stages made enhancement-only reruns possible without rebuilding the whole graph.
Webhooks reduce time-to-finalize on the happy path; polling survives missed callbacks. Treat finalization as a terminal-state transition and the race resolves itself.
Style is not a render flag. It is a versioned contract published from a separate control plane. The pipeline pins it for consistency, auditability, and safer reruns.