AI pipeline: incoherent scene generation — prompts generated blind, hardcoded count, no narrative arc #2

Open
opened 2026-05-19 14:19:34 +00:00 by casper-stevens · 0 comments

Problem

The current scene generation pipeline produces scenes that do not form a coherent story. Three distinct root causes combine to produce this:


1. Scene count is hardcoded to 5

In workers/mod.rs, generate_scenes_work calls:

```rust
providers.ai.generate_scenes(&intent, 5).await
```

The generate_scenes RPC method takes no count parameter, and the UI shows no control for it.

Users have no way to say "I want a 3-scene short" vs "I want a 12-scene video". All projects start with exactly 5 scenes regardless of intent length, complexity, or target duration.

Fix needed: accept a scene_count parameter from the client (reasonable range: 3–20); expose it as a number input in the New Project modal and in the Planning step.
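A minimal sketch of how the server side could resolve the client-supplied value, assuming the parameter arrives as an optional integer. The names `resolve_scene_count`, `SceneCountError`, and the constants are hypothetical and do not exist in the codebase; the bounds come from the range suggested above, and the default mirrors the current hardcoded 5.

```rust
use std::fmt;

// Bounds from the range proposed in this issue; the default keeps today's behaviour.
pub const MIN_SCENES: u32 = 3;
pub const MAX_SCENES: u32 = 20;
pub const DEFAULT_SCENES: u32 = 5;

/// Hypothetical error type returned when the requested count is out of range.
#[derive(Debug, PartialEq, Eq)]
pub struct SceneCountError {
    pub requested: u32,
}

impl fmt::Display for SceneCountError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "scene_count {} is outside the allowed range {}..={}",
            self.requested, MIN_SCENES, MAX_SCENES
        )
    }
}

impl std::error::Error for SceneCountError {}

/// Hypothetical helper: fall back to the default when the client omits
/// `scene_count`, and reject values outside the allowed range.
pub fn resolve_scene_count(requested: Option<u32>) -> Result<u32, SceneCountError> {
    match requested {
        None => Ok(DEFAULT_SCENES),
        Some(n) if (MIN_SCENES..=MAX_SCENES).contains(&n) => Ok(n),
        Some(n) => Err(SceneCountError { requested: n }),
    }
}
```

Rejecting out-of-range values instead of silently clamping them is one possible choice; it makes a misbehaving client visible, whereas clamping would be more forgiving for the UI.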


2. Image prompt and video prompt are generated in a single AI call — before any image exists

In providers/ai.rs, generate_scenes() sends one request to the LLM that returns both image_prompt and video_prompt in the same JSON array:

```rust
let system = format!(
    "Generate exactly {count} scenes … \
    Each scene needs an image_prompt … \
    and a video_prompt … \
    Return only JSON: {{\"scenes\": [{{\"image_prompt\": \"...\", \"video_prompt\": \"...\"}}]}}"
);
```

This means the video prompt (camera movement, subject action, atmosphere) is written before any image exists. The video prompt cannot account for:

  • What the image actually looks like (composition, lighting, subject position)
  • Whether the AI chose a wide angle or tight close-up
  • The actual colour palette or environment that was generated

When a video model receives a prompt that doesn't match the reference image, it tends to either produce visually incoherent motion or ignore the prompt altogether.

Fix needed: split generation into two separate AI calls:

  1. Generate image prompts only (one call, all scenes at once for narrative coherence — see below)
  2. After images are generated and a candidate is selected, generate the video prompt per scene, with the selected image URL or description as additional context

This also means the video_prompt field on Scene should not be populated during scene planning — it should be derived later.
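The deferred video prompt could be modelled roughly as follows. This is a sketch under assumptions: the real `Scene` type is not shown in this issue, so the struct below, the `selected_image` field, and the `video_prompt_request` helper are all hypothetical, and the instruction wording is only illustrative.

```rust
/// Hypothetical shape of a scene once video prompts are derived later.
#[derive(Debug, Clone, PartialEq)]
pub struct Scene {
    pub image_prompt: String,
    /// URL or description of the selected image candidate; `None` until one is chosen.
    pub selected_image: Option<String>,
    /// Left as `None` during planning; filled in by the second AI call.
    pub video_prompt: Option<String>,
}

impl Scene {
    /// A scene as it would exist right after the planning step.
    pub fn planned(image_prompt: impl Into<String>) -> Self {
        Scene {
            image_prompt: image_prompt.into(),
            selected_image: None,
            video_prompt: None,
        }
    }
}

/// Build the instruction for the second AI call (video prompt generation).
/// Returns `None` when no image has been selected yet, so the caller cannot
/// request a video prompt before an image exists.
pub fn video_prompt_request(scene: &Scene) -> Option<String> {
    let image = scene.selected_image.as_deref()?;
    Some(format!(
        "Write a video_prompt (camera movement, subject action, atmosphere) \
         for the reference image described below. Match its composition, \
         lighting and framing.\n\
         Image prompt: {}\nSelected image: {}",
        scene.image_prompt, image
    ))
}
```

Making `video_prompt` an `Option` encodes the ordering in the type: a planned scene simply has no video prompt, and the request builder refuses to run until an image is attached.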


3. No narrative coherence — scenes are generated as independent parallel items

The system prompt asks the LLM to "generate N scenes" in a single JSON array. The LLM treats each array entry as an independent creative unit with no enforced relationship to the others. There is no instruction to:

  • Maintain a consistent visual style, lighting, or location across scenes
  • Follow a narrative arc (setup → tension → resolution, or intro → body → CTA)
  • Ensure character or subject continuity between scenes
  • Reference what happened in the previous scene

The result is a set of scenes that look like they come from different videos.

Fix needed: the scene generation prompt needs explicit narrative structure:

  • A story arc instruction (e.g. "scenes should form a coherent [story / product demo / documentary sequence]")
  • A visual consistency directive (same environment, same subject, consistent style)
  • Either a chain-of-thought pass (generate the story arc first, then the scenes) or few-shot examples of coherent multi-scene outputs

Alternatively: generate a narrative outline first (one short sentence per scene describing its role in the story), then generate image prompts grounded in that outline. Because every image prompt is then written against the same outline, this two-pass approach should produce noticeably more coherent results than a single unstructured call.
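The two-pass idea can be sketched as two prompt builders, one per AI call. Both function names and the exact prompt wording are assumptions made for illustration, as is the `outline` JSON key; only the `scenes` / `image_prompt` response shape is taken from the existing system prompt.

```rust
/// Pass 1 (hypothetical): ask for a narrative outline, one sentence per scene.
pub fn outline_prompt(intent: &str, count: u32) -> String {
    format!(
        "Plan a {count}-scene video for this brief: {intent}\n\
         Write one short sentence per scene describing its role in the story \
         (for example setup, tension, resolution). Keep the same subject, \
         environment and visual style throughout.\n\
         Return only JSON: {{\"outline\": [\"...\"]}}"
    )
}

/// Pass 2 (hypothetical): ask for image prompts grounded in the outline
/// returned by pass 1. No video prompt is requested at this stage.
pub fn image_prompts_prompt(intent: &str, outline: &[String]) -> String {
    // Number the outline entries so the model can map scenes to beats in order.
    let beats = outline
        .iter()
        .enumerate()
        .map(|(i, beat)| format!("{}. {}", i + 1, beat))
        .collect::<Vec<_>>()
        .join("\n");
    format!(
        "Brief: {intent}\nStory outline:\n{beats}\n\
         Generate exactly {n} scenes, one per outline entry, in order. \
         Each scene needs an image prompt that fits its outline entry and \
         keeps subject, location and style consistent with the other scenes.\n\
         Return only JSON: {{\"scenes\": [{{\"image_prompt\": \"...\"}}]}}",
        n = outline.len()
    )
}
```

The scene count in the second pass is derived from the outline length rather than passed separately, so the two calls cannot disagree about how many scenes exist.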


Summary of required changes

| Area | Current | Needed |
|---|---|---|
| Scene count | Hardcoded 5 in worker | User-controlled input, passed through RPC |
| Prompt generation | Both prompts in one call, before images | Image prompts first (with narrative), video prompts after image selection |
| Narrative coherence | None (independent parallel items) | Story arc pass, visual consistency directive |
| Video prompt timing | Generated at scene creation | Generated after image is selected, using image as context |

Related

See issue #1 for the user-facing side of this (field naming, structured brief input). The improvements here are backend/pipeline; they will also require a UI change to remove the video prompt textarea from the Planning step.

Reference
lhumina_code/hero_videos#2