P / 03 · 2025
Case study

Tessact AI

Video intelligence platform

A video AI product for searching footage, asking questions against a library, and turning the answers into edits people can actually work with.

Product preview
(I) — Core premise

Video enters the queue.

A video upload triggers a parallel processing pipeline that produces preview assets, video derivatives, transcription, and indexed scene intelligence.

The system separates fast local work (metadata, thumbnails, scrubs) from heavy GPU-gated work (transcoding, AI analysis) using dedicated background workers. What leaves the pipeline is a structured, searchable asset, not just a stored file.

The engineering challenge was keeping these parallel pipelines observable, failure-isolated, and composable while preserving the ability to route to different analysis paths and turn branches on or off without touching the core upload path.

Output groups: 4
Processing tiers: 2
Join point: 1
Asset classes: 3
Execution model: Async
Branch control: Config
(II) — Pipeline outputs

What leaves the other end.

I. Preview assets
  • Thumbnail (quality-scored)
  • Scrub contact sheet
  • Hover-preview sprites
  • Lightweight web formats
II. Video derivatives
  • Normalized transcode
  • Watermarked copy
  • Streaming package
  • Playback-ready output
III. Transcription + timing
  • Audio extraction
  • Managed transcription
  • Speaker labels + subtitles
  • Voice activity segments
IV. Vision + intelligence
  • Face tracks (chunked detection)
  • Shot / segment boundaries
  • Per-scene structured extraction
  • Content categorization
(III) — Pipeline fan-out

One upload,
five parallel branches.

The upload finalizer fans out into independent background tasks. Local processing runs immediately. The GPU-gated AI path and optional branches run in parallel, and none of them block the upload response.

trigger: Upload complete → upload finalizer
  • Local processing (always-on): metadata · thumbnail · scrub · sprites
  • Analysis pipeline (gpu-gated): transcription · face · segment · scene
  • Watermark derivative (gpu-gated): runs parallel to the AI path
  • Streaming package (optional): adaptive streaming when enabled
  • External search index (optional): external indexing branch
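
A minimal sketch of the finalizer's fan-out, assuming a Celery-style queue with separate `local` and `gpu` worker pools; the task names, queue names, and flag keys here are illustrative, not the production identifiers.

```python
# Sketch only: Celery-style fan-out; all names are illustrative.
from celery import Celery

app = Celery("pipeline", broker="redis://localhost:6379/0")

@app.task
def extract_metadata(video_id): ...

@app.task
def generate_previews(video_id): ...          # thumbnail, scrub, sprites

@app.task
def run_analysis_pipeline(video_id): ...      # GPU-gated AI path

@app.task
def create_watermark_derivative(video_id): ...

def finalize_upload(video_id: str, org_flags: dict) -> None:
    """Runs when the upload completes; enqueues branches and returns fast."""
    # always-on local work goes to the lightweight pool
    extract_metadata.apply_async((video_id,), queue="local")
    generate_previews.apply_async((video_id,), queue="local")
    # GPU-gated and optional branches are decided here, not inside tasks
    if org_flags.get("ai_pipeline"):
        run_analysis_pipeline.apply_async((video_id,), queue="gpu")
    if org_flags.get("watermark"):
        create_watermark_derivative.apply_async((video_id,), queue="gpu")
    # streaming package and external indexing branch off the same way
```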
(IV) — Local processing

Fast work, always.

  1. 01

    Quality-scored thumbnails

    The thumbnail generator samples frames across the usable middle of the video, scores candidates by visual quality, and picks the strongest one. The goal is a representative frame that avoids the weak openings and endings common in uploaded footage. A scoring sketch follows this list.

  2. 02

    Scrub and sprite sheets

    Evenly distributed frames are assembled into contact sheets and hover-preview sprites so the library can show visual timeline previews without loading the full video. Very short videos can skip the heavier preview outputs.

  3. 03

    Technical metadata on arrival

    Technical metadata is extracted on first download so duration, dimensions, frame rate, and codec details appear immediately. That makes the asset usable in the library before any heavy downstream processing completes. A probe sketch follows this list.
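
A minimal sketch of the quality-scoring step (item 01), assuming OpenCV; the middle-60% window and the Laplacian sharpness metric are assumptions, since the case study only says candidates are scored by visual quality.

```python
# Sketch only: sample the middle of the video, score by sharpness.
import cv2

def pick_thumbnail(path: str, candidates: int = 12):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # sample the middle 60% to avoid weak openings and endings
    start, end = int(total * 0.2), int(total * 0.8)
    best_frame, best_score = None, -1.0
    for i in range(candidates):
        idx = start + (end - start) * i // max(candidates - 1, 1)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        score = cv2.Laplacian(gray, cv2.CV_64F).var()  # sharpness proxy
        if score > best_score:
            best_frame, best_score = frame, score
    cap.release()
    return best_frame
```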
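
And a sketch of metadata-on-arrival (item 03) using ffprobe, which is an assumption; the case study does not name the extractor.

```python
# Sketch only: pull container and stream details with ffprobe.
import json
import subprocess

def probe_metadata(path: str) -> dict:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    info = json.loads(out.stdout)
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    return {
        "duration": float(info["format"]["duration"]),
        "width": video["width"],
        "height": video["height"],
        "codec": video["codec_name"],
        "frame_rate": video["avg_frame_rate"],  # e.g. "30000/1001"
    }
```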

(V) — GPU-gated path

Heavy work, conditional.

  1. 01

    Corruption check before anything expensive

    The orchestration task runs an integrity check before starting transcription, face detection, or segment detection. If the file is corrupted, the video is marked and the AI pipeline is skipped. No partial job state to clean up. The gate is sketched after this list.

  2. 02

    Transcoding shifts the processing path

    When transcoding runs, downstream analysis switches to a normalized working copy instead of the raw upload. That keeps later stages consistent and reduces edge cases in chunked and GPU-heavy processing.

  3. 03

    Feature flags control the AI fan-out

    Runtime configuration decides which analysis branches the orchestration task creates. Organizations can be on different combinations without code changes or upload-path rewrites.
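
A minimal sketch of the orchestration task covering all three points, continuing the Celery-style setup from the fan-out sketch above and assuming ffmpeg for the integrity check; the flag keys and helper names (`mark_video_corrupted`, `transcode_working_copy`) are hypothetical.

```python
# Sketch only: integrity gate, working-copy switch, flag-driven fan-out.
import subprocess

def is_corrupted(path: str) -> bool:
    """Decode via ffmpeg's null muxer; any decode error counts as corruption."""
    result = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", path, "-f", "null", "-"],
        capture_output=True, text=True,
    )
    return result.returncode != 0 or bool(result.stderr.strip())

@app.task
def run_transcription(video_id, source): ...

@app.task
def run_face_detection(video_id, source): ...

@app.task
def run_segment_detection(video_id, source): ...

BRANCHES = {
    "transcription": run_transcription,
    "face_detection": run_face_detection,
    "segment_detection": run_segment_detection,
}

def orchestrate_analysis(video_id: str, raw_path: str, org_flags: dict) -> None:
    if is_corrupted(raw_path):
        mark_video_corrupted(video_id)             # hypothetical helper
        return                                     # no partial job state
    source = raw_path
    if org_flags.get("transcode"):
        source = transcode_working_copy(raw_path)  # hypothetical helper
    # flags are read here, at scheduling time, never inside the tasks
    for flag, task in BRANCHES.items():
        if org_flags.get(flag):
            task.apply_async((video_id, source), queue="gpu")
```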

(VI) — AI analysis

Four jobs, in parallel,
one join.

After download and transcoding, the orchestration task fans out into four independent jobs: transcription, voice-activity detection, chunked face detection, and full-video segment detection. Scene assembly starts only after both transcription and segment detection complete.

01 Download + integrity: corrupted → skip pipeline
02 Normalized transcode: working copy prepared
  • Transcription: language + speakers + subtitles
  • Speech timing: speech segments + silence gaps
  • Face analysis: chunked video windows
  • Shot detection: scene boundaries across the full video
all complete → join → Scene assembly: transcript + shots → structured extraction → index
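
A minimal sketch of the join in the same assumed Celery style: a chord fires scene assembly only when both header tasks complete, while speech timing and face analysis run free-standing.

```python
# Sketch only: two free-standing jobs plus a two-task chord into assembly.
from celery import chord

@app.task
def transcribe(video_id): ...

@app.task
def detect_segments(video_id): ...

@app.task
def detect_voice_activity(video_id): ...

@app.task
def detect_faces(video_id): ...

@app.task
def assemble_scenes(results, video_id):
    transcript, shots = results   # header results arrive in header order
    # align boundaries, cut clips, run structured extraction, index
    ...

def fan_out_analysis(video_id: str) -> None:
    detect_voice_activity.delay(video_id)     # independent
    detect_faces.delay(video_id)              # independent, chunked
    chord(
        [transcribe.s(video_id), detect_segments.s(video_id)]
    )(assemble_scenes.s(video_id))            # join: runs after both finish
```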
(VII) — Job orchestration

Every async step leaves a trail.

  1. 01

    One record per logical job

    Each background step creates a durable job record with its type, status, progress context, and failure details. Active work is queryable, and completion updates the record rather than disappearing into the task runner. Both the record shape and the chunk fan-out are sketched after this list.

  2. 02

    Chunked face processing

    Face detection fans out into independent time-based chunks so long videos can be processed in parallel. Each chunk carries its own tracking record, and the parent job only resolves when all chunk work completes cleanly.

  3. 03

    No-audio fallback by design

    If the video has no audio track, no transcription request is sent. Instead, the pipeline creates placeholder scene structure so downstream assembly can still complete without failing on a missing dependency. The fallback is sketched after this list.
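
A minimal sketch of the job record and the parent/child chunk model; the field names, the two-minute window, the `store` persistence layer, and the `detect_face_chunk` task handle are all assumptions.

```python
# Sketch only: durable job records with parent/child chunk tracking.
import enum
import uuid
from dataclasses import dataclass, field
from typing import Optional

class JobStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class JobRecord:
    job_type: str                          # "transcription", "face_chunk", ...
    video_id: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: JobStatus = JobStatus.PENDING
    progress: int = 0                      # persisted outside the worker
    error: Optional[str] = None
    parent_id: Optional[str] = None        # set on chunk jobs

CHUNK_SECONDS = 120                        # assumed window size

def dispatch_face_chunks(video_id: str, duration: float, store) -> str:
    parent = JobRecord(job_type="face_detection", video_id=video_id)
    store.save(parent)                     # hypothetical persistence API
    for start in range(0, int(duration), CHUNK_SECONDS):
        chunk = JobRecord(job_type="face_chunk", video_id=video_id,
                          parent_id=parent.id)
        store.save(chunk)
        end = min(start + CHUNK_SECONDS, duration)
        detect_face_chunk.apply_async(     # hypothetical task handle
            (video_id, start, end, chunk.id), queue="gpu")
    return parent.id

def on_chunk_terminal(chunk_id: str, store) -> None:
    """Resolve the parent only when every sibling reaches a terminal state."""
    chunk = store.get(chunk_id)
    siblings = store.children(chunk.parent_id)
    if any(s.status not in (JobStatus.COMPLETED, JobStatus.FAILED)
           for s in siblings):
        return                             # work still in flight
    failed = any(s.status == JobStatus.FAILED for s in siblings)
    store.set_status(chunk.parent_id,
                     JobStatus.FAILED if failed else JobStatus.COMPLETED)
```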
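
And a sketch of the no-audio fallback, reusing the `transcribe` task from the join sketch; the `store` calls and the placeholder shape are assumptions.

```python
# Sketch only: probe for audio; without it, write a placeholder scene
# and mark transcription terminal so the scene-assembly join still fires.
import json
import subprocess

def has_audio_track(path: str) -> bool:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    streams = json.loads(out.stdout)["streams"]
    return any(s["codec_type"] == "audio" for s in streams)

def schedule_transcription(video_id: str, path: str, duration: float, store):
    if has_audio_track(path):
        transcribe.apply_async((video_id,), queue="gpu")
        return
    # no audio: persist a placeholder scene and mark the step terminal
    store.save_transcript_scenes(video_id, [      # hypothetical call
        {"start": 0.0, "end": duration, "text": "", "placeholder": True},
    ])
    store.mark_transcription_complete(video_id)   # the join still fires
```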

(VIII) — Scene assembly

Transcript meets shots.
Structured extraction fills the gaps.

Scene assembly aligns transcript-derived scene boundaries to shot-detection boundaries, fills any gaps so the full video is covered, cuts each scene into its own clip, and runs per-scene structured extraction. The result is indexed for search and persisted into relational detection tables.

Transcript scenes (sentence chunks · silence gaps) + Shot boundaries (camera change detection) → join
01 Align + fill gaps: shots → scene boundaries
02 Per-scene clip: scene clips persisted
03 Structured extraction: scene metadata payload
04 Scene index: search index + detections
Scene assembly starts only after both transcription and segment detection complete.
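
A minimal sketch of the align-and-fill step; snapping each transcript boundary to the nearest shot boundary is an assumed rule, since the case study only says the two boundary sets are aligned and gaps are filled.

```python
# Sketch only: snap transcript boundaries to shots, then cover the gaps.
def align_scenes(transcript_bounds, shot_bounds, duration):
    """Both inputs are lists of boundary timestamps in seconds."""
    if not shot_bounds:
        return [{"start": 0.0, "end": duration}]
    snapped = {
        min(shot_bounds, key=lambda s: abs(s - t)) for t in transcript_bounds
    }
    # force full coverage from 0.0 to the end of the video
    bounds = sorted({0.0, duration, *(b for b in snapped if 0 < b < duration)})
    return [{"start": a, "end": b} for a, b in zip(bounds, bounds[1:])]
```

Each returned span then becomes its own clip and per-scene extraction unit.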
(IX) — Transcription pipeline

Silence as a boundary.

  1. 01

    Sentence chunks on silence gaps

    The transcription processor identifies large silence gaps and uses them as scene chunk boundaries. This produces semantically coherent transcript scenes before shot detection is available, so the two inputs to scene assembly need careful alignment. A chunking sketch follows this list.

  2. 02

    Parallel subtitle generation

    Transcription completion also schedules subtitle generation as a side effect. Caption artifacts are persisted for downstream use without blocking the scene-assembly join.

  3. 03

    Content categorization from frames

    The transcription completion path also triggers local content categorization. A frame grid is generated, classified, and written back to the video record. This runs outside the main GPU path and needs no separate orchestration branch.
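
A minimal sketch of the silence-gap rule over word-level timestamps; the 1.5-second threshold is an assumed value.

```python
# Sketch only: close a transcript scene whenever the silence gap is large.
GAP_SECONDS = 1.5

def chunk_on_silence(words):
    """words: [{'text': str, 'start': float, 'end': float}, ...] in order."""
    scenes, current = [], []
    for word in words:
        if current and word["start"] - current[-1]["end"] >= GAP_SECONDS:
            scenes.append(current)   # large silence gap: close the scene
            current = []
        current.append(word)
    if current:
        scenes.append(current)
    return [
        {"start": s[0]["start"], "end": s[-1]["end"],
         "text": " ".join(w["text"] for w in s)}
        for s in scenes
    ]
```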

(X) — Observability

Status all the way down.

File status progression
  • Upload accepted
  • Queued for processing
  • Transcoding or analysis in progress
  • Completed, failed, or corrupted
Job-level tracking
  • One record per logical job
  • Chunk-level tracking for parallel work
  • Progress persisted outside the worker
  • Live progress updates back to clients
Failure handling
  • Automatic retry on transient failures
  • Error tracking for background work
  • Operational alerts on hard failures
  • Clean bypass when corruption is detected
(XI) — Performance risks

Where it could break.

  1. 01

    Heavy work on the upload hot path

    Transcription, transcoding, and face detection cannot sit on the upload request. The finalizer returns before any background task completes, and all heavy work moves to the queue.

  2. 02

    Chunk completion coordination

    Face detection dispatches multiple independent chunk jobs per video. Without parent-child job tracking, completion becomes ambiguous. Any chunk failure needs to propagate to the parent without losing the others.

  3. 03

    Scene assembly joining two async results

    Transcription and segment detection run independently. Scene assembly starts only when both complete. A failure in one blocks scene assembly indefinitely without proper completion semantics.

  4. 04

    Progress moving backwards

    Parallel progress reporters can race. The persisted progress model has to prevent later updates from making the asset look less complete than it already is.
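
A minimal sketch of the monotonic guard; in SQL the same idea is `SET progress = GREATEST(progress, :new)`, so a late, smaller update can never win.

```python
# Sketch only: compare-and-set progress that can never move backwards.
import threading

class MonotonicProgress:
    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def report(self, new_value: int) -> int:
        """Racing reporters may call this out of order; max() absorbs that."""
        with self._lock:
            self._value = max(self._value, new_value)
            return self._value
```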

(XII) — Tradeoffs

What I locked, what I left.

Strong choices
  • Queue separation for independent pipelines

    Local processing and heavy analysis run in separate worker pools. A slow transcode does not starve lightweight preview work.

  • Corruption check before expensive work

    Integrity validation happens before any AI job is created. A bad file fails cheaply without spawning orphaned work across multiple services.

  • Parent/child job model for chunked work

    Chunked analysis uses a parent-child job model with explicit completion semantics. The parent advances only when all chunk work resolves.

  • No-audio fallback is a first-class path

    Videos without an audio track still produce placeholder scene structure. Scene assembly runs on the same code path with no special cases.

Deliberate tradeoffs
  • GPU gate at scheduling time, not runtime

    The main AI path is only added to the task list when the environment is configured for it. Routing decisions happen at fan-out, not inside tasks.

  • Multiple analysis paths coexist

    Newer and older analysis approaches overlap conceptually. The current upload path is clear, but the codebase still shows its evolutionary history.

  • Scene assembly requires both async results

    Transcription and segment detection must both complete before scene assembly starts. If one is slow, assembly waits. No partial scene output.

  • Not every processor is on the main upload path

    Some supporting processors exist in the codebase without being part of the default library pipeline. Scope decisions left them disconnected rather than removed.

(XIII) — Outcomes

What the pipeline produced.

Outcome
Assets usable before AI completes

Thumbnail, scrub, and technical metadata are available seconds after upload. The library shows a usable asset while the GPU path runs in the background.

Outcome
Scene-level search on every video

Scene assembly produces structured, indexed scene payloads for every video that completes the AI path, supporting search, tagging, comments, and repurpose workflows.

Outcome
Per-org AI feature rollout

Feature flags let organizations opt into different pipeline branches independently. New analysis paths ship without touching the upload path or affecting organizations that don't need them.

(XIV) — Learnings

What stayed true.

  1. 01

    Separate queues for separate concerns

    Routing local processing and GPU work to different queues is not overengineering. Without it, a single slow transcode job can delay thumbnail generation for every other upload.

  2. 02

    Validate before you spend

    A corruption check that skips the entire AI pipeline on a bad file is one of the highest-leverage things in the orchestration task. The alternative is orphaned jobs across four providers and no clean terminal state.

  3. 03

    Feature flags belong at scheduling time

    Reading flags when tasks are enqueued, not inside the task, keeps worker logic simple and makes the active pipeline visible from the scheduling call alone.

  4. 04

    Async joins need explicit completion semantics

    Scene assembly waiting on two independent async results is only tractable because each result has a clear terminal state. Without explicit job records, the join becomes a polling loop against unstable state.
