Video intelligence platform
A video AI product for searching footage, asking questions against a library, and turning responses into edits people can actually work with.
A video upload triggers a parallel processing pipeline that produces preview assets, video derivatives, transcription, and indexed scene intelligence.
The system separates fast local work (metadata, thumbnails, scrubs) from heavy GPU-gated work (transcoding, AI analysis) using dedicated background workers. What leaves the pipeline is a structured, searchable asset, not just a stored file.
The engineering challenge was keeping these parallel pipelines observable, failure-isolated, and composable while preserving the ability to route to different analysis paths and turn branches on or off without touching the core upload path.
The upload finalizer fans out into independent background tasks. Local processing runs immediately. The GPU-gated AI path and optional branches run in parallel, and none of them block the upload response.
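A minimal sketch of that fanout. The write-up doesn't name the task runner; Celery is assumed here, and the task names are illustrative.

```python
from celery import Celery, group

app = Celery("pipeline", broker="redis://localhost:6379/0")

@app.task
def extract_technical_metadata(video_id): ...

@app.task
def generate_thumbnails(video_id): ...

@app.task
def run_ai_pipeline(video_id): ...

def finalize_upload(video_id: str) -> None:
    # Local work and the GPU-gated path go onto the queue together;
    # the upload response returns as soon as they're enqueued.
    group(
        extract_technical_metadata.si(video_id),
        generate_thumbnails.si(video_id),
        run_ai_pipeline.si(video_id),
    ).apply_async()
```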
The thumbnail generator samples frames across the usable middle of the video, scores candidates by visual quality, and picks the strongest one. The goal is a representative frame that avoids the weak openings and endings common in uploaded footage.
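An illustrative version of the sampling-and-scoring loop, using OpenCV with a sharpness-plus-exposure heuristic as a stand-in for the real scoring.

```python
import cv2

def pick_thumbnail_frame(path: str, samples: int = 12):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sample only the middle 70% of the video, skipping weak openings/endings.
    start, end = int(total * 0.15), int(total * 0.85)
    step = max((end - start) // samples, 1)

    best_frame, best_score = None, -1.0
    for idx in range(start, end, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
        # Penalize near-black or blown-out frames.
        exposure_penalty = abs(gray.mean() - 128) / 128
        score = sharpness * (1.0 - 0.5 * exposure_penalty)
        if score > best_score:
            best_frame, best_score = frame, score
    cap.release()
    return best_frame
```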
Evenly distributed frames are assembled into contact sheets and hover-preview sprites so the library can show visual timeline previews without loading the full video. Very short videos can skip the heavier preview outputs.
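A small sketch of sprite-sheet assembly from already-sampled frames, with illustrative tile and grid sizes.

```python
from PIL import Image

def build_sprite_sheet(frames: list[Image.Image], cols: int = 10,
                       tile_w: int = 160, tile_h: int = 90) -> Image.Image:
    # Tile evenly sampled frames left-to-right, top-to-bottom so the player
    # can show hover previews without loading the full video.
    rows = (len(frames) + cols - 1) // cols
    sheet = Image.new("RGB", (cols * tile_w, rows * tile_h))
    for i, frame in enumerate(frames):
        tile = frame.resize((tile_w, tile_h))
        sheet.paste(tile, ((i % cols) * tile_w, (i // cols) * tile_h))
    return sheet
```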
Technical metadata is extracted on first download so duration, dimensions, frame rate, and codec details appear immediately. That makes the asset usable in the library before any heavy downstream processing completes.
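A sketch of that first-pass extraction, assuming ffprobe is available on the worker.

```python
import json
import subprocess

def probe_video(path: str) -> dict:
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out)
    video = next(s for s in data["streams"] if s["codec_type"] == "video")
    num, den = video["r_frame_rate"].split("/")
    return {
        "duration_s": float(data["format"]["duration"]),
        "width": video["width"],
        "height": video["height"],
        "frame_rate": float(num) / float(den),
        "codec": video["codec_name"],
    }
```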
The orchestration task runs an integrity check before starting transcription, face detection, or segment detection. If the file is corrupted, the video is marked and the AI pipeline is skipped. No partial job state to clean up.
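The exact check isn't described; a full decode pass through ffmpeg's null muxer is a common stand-in and is what this sketch assumes. The persistence helper is hypothetical.

```python
import subprocess

def is_decodable(path: str) -> bool:
    result = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", path, "-f", "null", "-"],
        capture_output=True, text=True,
    )
    # Any decode error (or a non-zero exit) marks the file as corrupted.
    return result.returncode == 0 and not result.stderr.strip()

def orchestrate_ai_analysis(video_id: str, local_path: str) -> None:
    if not is_decodable(local_path):
        mark_video_corrupted(video_id)   # hypothetical persistence helper
        return                           # no AI jobs are ever created
    # ... fan out into transcription, face detection, segment detection ...
```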
When transcoding runs, downstream analysis switches to a normalized working copy instead of the raw upload. That keeps later stages consistent and reduces edge cases in chunked and GPU-heavy processing.
Runtime configuration decides which analysis branches the orchestration task creates. Organizations can be on different combinations without code changes or upload-path rewrites.
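One way to express that routing is a table from per-organization flags to branch tasks, read at fanout time. The flag and task names here are assumptions, not the product's real configuration keys.

```python
OPTIONAL_BRANCHES = {
    "ai_pipeline_enabled": "pipeline.heavy.run_ai_pipeline",
    "face_detection_enabled": "pipeline.heavy.fan_out_face_chunks",
    "content_categorization_enabled": "pipeline.local.categorize_content",
}

def enqueue_optional_branches(app, org_config: dict, video_id: str) -> None:
    for flag, task_name in OPTIONAL_BRANCHES.items():
        if org_config.get(flag, False):
            # send_task dispatches by name, so no branch code is imported here
            # and turning a branch off never touches the upload path.
            app.send_task(task_name, args=[video_id])
```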
After download and transcoding, the orchestration task fans out into four independent jobs: transcription, voice-activity detection, chunked face detection, and full-video segment detection. Scene assembly starts only after both transcription and segment detection complete.
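Sticking with the Celery assumption, the fanout plus the scene-assembly join could look like this, with illustrative task names.

```python
from celery import chord, group

def fan_out_analysis(video_id: str) -> None:
    # Independent branches: nothing downstream waits on these two.
    group(
        detect_voice_activity.si(video_id),
        fan_out_face_chunks.si(video_id),
    ).apply_async()

    # Scene assembly is the join: it runs only after BOTH transcription
    # and segment detection reach a terminal state.
    chord(
        [transcribe.si(video_id), detect_segments.si(video_id)]
    )(assemble_scenes.si(video_id))
```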
Each background step creates a durable job record with its type, status, progress context, and failure details. Active work is queryable, and completion updates the record rather than disappearing into the task runner.
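A sketch of what such a record might carry; the field names are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class ProcessingJob:
    video_id: str
    job_type: str                      # "transcription", "face_detection", ...
    status: JobStatus = JobStatus.PENDING
    progress: float = 0.0              # 0.0 .. 1.0, monotonically increasing
    context: dict = field(default_factory=dict)   # e.g. chunk range, provider ids
    error: Optional[str] = None        # populated on failure, queryable later
    parent_id: Optional[int] = None    # set for chunk jobs under a parent
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```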
Face detection fans out into independent time-based chunks so long videos can be processed in parallel. Each chunk carries its own tracking record, and the parent job only resolves when all chunk work completes cleanly.
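A sketch of the chunk fanout under the same Celery assumption, with hypothetical job-record helpers and an arbitrary chunk length. The duration would come from the metadata probe above.

```python
from celery import chord

CHUNK_SECONDS = 120  # illustrative chunk length

def fan_out_face_chunks(video_id: str, duration_s: float) -> None:
    parent = create_job(video_id, "face_detection")   # hypothetical helper
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + CHUNK_SECONDS, duration_s)
        child = create_job(video_id, "face_detection_chunk",
                           parent_id=parent.id)
        chunks.append(detect_faces_chunk.si(video_id, start, end, child.id))
        start = end

    # The parent job resolves only when every chunk task has completed;
    # a failed chunk surfaces as a chord error recorded against the parent.
    chord(chunks)(resolve_face_parent.si(parent.id))
```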
If the video has no audio track, no transcription request is sent. Instead, the pipeline creates placeholder scene structure so downstream assembly can still complete without failing on a missing dependency.
Scene assembly aligns transcript-derived scene boundaries to shot-detection boundaries, fills any gaps so the full video is covered, cuts each scene into its own clip, and runs per-scene structured extraction. The result is indexed for search and persisted into relational detection tables.
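A simplified version of the boundary alignment and gap filling, with an assumed snap tolerance.

```python
def align_scenes(transcript_bounds: list[float], shot_bounds: list[float],
                 duration: float, tolerance: float = 1.0) -> list[tuple[float, float]]:
    def snap(t: float) -> float:
        # Snap to the nearest shot boundary when one is close enough.
        nearest = min(shot_bounds, key=lambda s: abs(s - t), default=t)
        return nearest if abs(nearest - t) <= tolerance else t

    cuts = sorted({0.0, duration, *(snap(t) for t in transcript_bounds)})
    # Consecutive cut points become scenes; the full video is covered because
    # 0.0 and the total duration are always included.
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b - a > 0]
```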
The transcription processor identifies large silence gaps and uses them as scene chunk boundaries. This produces semantically coherent transcript scenes before shot detection is available, so the two inputs to scene assembly need careful alignment.
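A minimal sketch of that silence-gap split, assuming (start, end, text) transcript segments and an arbitrary gap threshold.

```python
def split_on_silence(segments: list[tuple[float, float, str]],
                     min_gap: float = 2.0) -> list[list[tuple[float, float, str]]]:
    scenes, current = [], []
    for seg in segments:
        if current and seg[0] - current[-1][1] >= min_gap:
            # A long silence between segments starts a new transcript scene.
            scenes.append(current)
            current = []
        current.append(seg)
    if current:
        scenes.append(current)
    return scenes
```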
Transcription completion also schedules subtitle generation as a side effect. Caption artifacts are persisted for downstream use without blocking the scene-assembly join.
The transcription completion path also triggers local content categorization. A frame grid is generated, classified, and written back to the video record. This runs independently of the main GPU path and needs no separate orchestration branch.
Transcription, transcoding, and face detection cannot sit on the upload request. The finalizer returns before any background task completes, and all heavy work moves to the queue.
Face detection dispatches multiple independent chunk jobs per video. Without parent-child job tracking, completion becomes ambiguous. Any chunk failure needs to propagate to the parent without losing the others.
Transcription and segment detection run independently. Scene assembly starts only when both complete. Without proper completion semantics, a failure in either would block scene assembly indefinitely.
Parallel progress reporters can race. The persisted progress model has to prevent later updates from making the asset look less complete than it already is.
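One way to make progress writes monotonic is to push the comparison into the update itself. The table and column names are illustrative, assuming Postgres-style SQL.

```python
def report_progress(conn, job_id: int, progress: float) -> None:
    # GREATEST() ensures a late, smaller update can never make the job
    # look less complete than it already is.
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE processing_jobs
               SET progress = GREATEST(progress, %s),
                   updated_at = NOW()
             WHERE id = %s
            """,
            (progress, job_id),
        )
    conn.commit()
```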
Local processing and heavy analysis run in separate worker pools. A slow transcode does not starve lightweight preview work.
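A sketch of that split using Celery task routing, with illustrative queue names and module paths.

```python
from celery import Celery

app = Celery("pipeline", broker="redis://localhost:6379/0")

# Route fast local work and GPU-gated work to separate queues so dedicated
# worker pools can consume them independently.
app.conf.task_routes = {
    "pipeline.local.*": {"queue": "local"},   # metadata, thumbnails, scrubs
    "pipeline.heavy.*": {"queue": "gpu"},     # transcoding, AI analysis
}

# Example worker invocations, one pool per queue:
#   celery -A pipeline worker -Q local --concurrency=8
#   celery -A pipeline worker -Q gpu   --concurrency=1
```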
Integrity validation happens before any AI job is created. A bad file fails cheaply without spawning orphaned work across multiple services.
Chunked analysis uses a parent-child job model with explicit completion semantics. The parent advances only when all chunk work resolves.
Videos without an audio track still produce placeholder scene structure. Scene assembly runs on the same code path with no special cases.
The main AI path is only added to the task list when the environment is configured for it. Routing decisions happen at fanout, not inside tasks.
Newer and older analysis approaches overlap conceptually. The current upload path is clear, but the codebase still shows its evolutionary history.
Transcription and segment detection must both complete before scene assembly starts. If one is slow, assembly waits. No partial scene output.
Some supporting processors exist in the codebase without being part of the default library pipeline. Scope decisions left them disconnected rather than removed.
Thumbnail, scrub, and technical metadata are available seconds after upload. The library shows a usable asset while the GPU path runs in the background.
Scene assembly produces structured, indexed scene payloads for every video that completes the AI path, supporting search, tagging, comments, and repurpose workflows.
Feature flags let organizations opt into different pipeline branches independently. New analysis paths ship without touching the upload path or affecting organizations that don't need them.
Routing local processing and GPU work to different queues is not overengineering. Without it, a single slow transcode job can delay thumbnail generation for every other upload.
A corruption check that skips the entire AI pipeline on a bad file is one of the highest-leverage things in the orchestration task. The alternative is orphaned jobs across four providers and no clean terminal state.
Reading flags when tasks are enqueued, not inside the task, keeps worker logic simple and makes the active pipeline visible from the scheduling call alone.
Scene assembly waiting on two independent async results is only tractable because each result has a clear terminal state. Without explicit job records, the join becomes a polling loop against unstable state.
Multi-track timeline in the browser