# Homework 7: Agentic Video Lecture Pipeline
In this assignment you will implement a multi-stage pipeline that, when run locally, turns a PDF slide deck into a single narrated video: one still image per slide, synchronized audio, all concatenated into one `.mp4`. You will use AI agents for the structured steps (style profile, slide descriptions, premise, arc, narration), then text-to-speech and video assembly (for example with ffmpeg). Your narration should match the speaking style derived from the lecture transcript.
## Deliverables
Create a Git repository with Python code that implements the following pipeline:
- **Style file** — Read the lecture transcript file and produce `style.json` in the repository root: a structured description of the instructor’s speaking style (tone, pacing, fillers, how they frame ideas — fields you choose). This file informs the narration agent.
- **`projects/` folder** — At the repo root, include a `projects/` directory. Each new project goes here.
- **Slide description agent** — Create a new project folder `projects/project_<current date and time>`. Rasterize the Lecture 17 slides to one PNG per slide under `slide_images/` inside that project. For each slide, call the AI model with the current slide image and all previous slide descriptions as input to generate the current slide description. Write every slide description to `slide_description.json` in the project folder.
- **Premise agent** — Call the AI model with `slide_description.json` as input and write `premise.json`: a structured lecture premise (thesis, scope, learning objectives, audience, or fields you define).
- **Arc agent** — Call the AI model with `premise.json` and `slide_description.json` as input and write `arc.json`: a structured arc (flow, phases or acts, how ideas build) consistent with the premise.
- **Narration agent** — For each slide, call the AI model with the current slide image, `style.json`, `premise.json`, `arc.json`, `slide_description.json`, and all prior slide narrations to generate the current slide narration. Write `slide_description_narration.json` in the project folder, containing both the narrations and the associated slide descriptions. On the title slide, the narration should have the speaker introduce themselves and give a short summary of the lecture topic.
- **Audio step** — Use a text-to-speech model (Gemini, ElevenLabs, etc.) to synthesize each narration into `audio/slide_001.mp3`, `audio/slide_002.mp3`, etc. (merge chunked API responses into one MP3 per slide if needed). Store the files under an `audio/` folder inside the project folder.
- **Video assembly step** — For each slide, mux the PNG with the matching MP3 into a video segment, then concatenate the segments into one `.mp4` whose basename matches the PDF (e.g. `Lecture_17_AI_screenplays.pdf` → `Lecture_17_AI_screenplays.mp4`). Each segment's duration should track its audio (avoid long silent tails).
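The rasterization step above can be sketched as follows. This is a sketch, not a required implementation: it assumes the PyMuPDF package (`import fitz`), a DPI of 150, and the `slide_NNN.png` naming convention; any rasterizer that yields one PNG per page is fine.

```python
import pathlib

def slide_png_name(index: int) -> str:
    """1-based slide index -> zero-padded file name (naming is an assumption)."""
    return f"slide_{index:03d}.png"

def rasterize_deck(pdf_path: str, out_dir: str, dpi: int = 150) -> list[str]:
    """Render each PDF page to one PNG under out_dir using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=dpi)  # rasterize this page
            path = out / slide_png_name(i)
            pix.save(str(path))
            written.append(str(path))
    return written
```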
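The prior-description chaining in the slide description agent can be structured like this. The model call itself is a placeholder (`call_model` stands in for whatever AI client you use); the JSON record fields are assumptions you may rename.

```python
import json
from pathlib import Path
from typing import Callable

def describe_slides(
    image_paths: list[str],
    # (image_path, prior descriptions) -> description; placeholder for your AI client
    call_model: Callable[[str, list[str]], str],
    out_path: str = "slide_description.json",
) -> list[dict]:
    """Describe each slide with ALL previous descriptions in the model context."""
    descriptions: list[dict] = []
    for i, image_path in enumerate(image_paths, start=1):
        prior = [d["description"] for d in descriptions]
        text = call_model(image_path, prior)  # the chaining: priors ride along
        descriptions.append({"slide": i, "image": image_path, "description": text})
    Path(out_path).write_text(json.dumps(descriptions, indent=2))
    return descriptions
```

The narration agent follows the same pattern, with `style.json`, `premise.json`, `arc.json`, and prior narrations added to the context.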
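For the audio step, merging chunked API responses can be as simple as byte concatenation, which generally works for raw MPEG audio streams; if your TTS provider returns WAV or PCM chunks, convert before merging. A minimal sketch (folder and naming per the spec):

```python
from pathlib import Path

def write_slide_mp3(chunks: list[bytes], slide_index: int, audio_dir: str = "audio") -> str:
    """Merge chunked TTS responses into one MP3 per slide by byte concatenation."""
    out_dir = Path(audio_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"slide_{slide_index:03d}.mp3"
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return str(path)
```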
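If you use ffmpeg for video assembly, the two steps above map to two commands: loop the still over the audio with `-shortest` so the segment ends with the speech, then join segments with the concat demuxer. A sketch of the command construction (codec choices and file names are assumptions; run each list with `subprocess.run(cmd, check=True)`):

```python
from pathlib import Path

def segment_cmd(png: str, mp3: str, out_mp4: str) -> list[str]:
    """ffmpeg command: loop the still image over the audio; -shortest stops
    the segment when the audio ends (no long silent tail)."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", png,   # still image as a looping video input
        "-i", mp3,                 # per-slide narration audio
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",     # widely compatible pixel format
        "-shortest",
        out_mp4,
    ]

def concat_cmd(segment_paths: list[str], list_file: str, out_mp4: str) -> list[str]:
    """Write a concat-demuxer list file and return the ffmpeg concat command."""
    Path(list_file).write_text("".join(f"file '{p}'\n" for p in segment_paths))
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", out_mp4]
```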
## Submission
Canvas: Submit the URL of a GitHub repository with your Python implementation. Do not include any image, MP3, or MP4 files in your repo (they are not needed for grading and the files are too large).
Your repository should have this structure (image, audio, and video files should not be included, but empty slide_images/ and audio/ folders are OK). Put Lecture_17_AI_screenplays.pdf in the repository root so the grader can run your pipeline without hunting for the deck.
```
your-repo/
├── README.md
├── style.json
├── Lecture_17_AI_screenplays.pdf
├── requirements.txt
├── run_lecture_pipeline.py   (your entrypoint for the agentic flow)
├── lecture_agents/           (your agent code)
└── projects/
    └── project_YYYYMMDD_HHMMSS/
        ├── premise.json
        ├── arc.json
        ├── slide_description.json
        └── slide_description_narration.json
```
`slide_images/`, `audio/`, and the final `.mp4` are produced when the pipeline runs; they should be listed in `.gitignore` to avoid committing large files.
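One possible set of `.gitignore` patterns for this layout (a suggestion, not a requirement; adjust to your actual paths):

```
# generated artifacts (large; not committed)
projects/*/slide_images/*.png
projects/*/audio/*.mp3
*.mp4
```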
## Grading breakdown (100 points)
Each row states what that stage takes as input and what it produces. Points may be deducted if required inputs are missing (per code review or when the grader runs the pipeline).
| Component | Points | Requirement |
|---|---|---|
| 1. Style file | 8 | Agent takes the instructor lecture caption/transcript file (the linked “Lecture 11 Section 2” captions) as input and produces `style.json` at the repo root; its content reflects that transcript. |
| 2. Slide description agent | 18 | Agent rasterizes the PDF deck to per-slide images and, for each slide, takes the current slide image plus all previous slide descriptions in the model context as input. Agent produces `slide_description.json` covering all slides; prior-description chaining must be real, not missing or trivial. |
| 3. Premise agent | 10 | Agent takes the entire `slide_description.json` document as input and produces `premise.json`: a structured premise grounded in the deck and used by later stages. |
| 4. Arc agent | 10 | Agent takes `premise.json` and `slide_description.json` as input and produces `arc.json`, consistent with the premise and slide content and supporting a coherent progression across the deck. |
| 5. Narration agent | 18 | Agent takes the current slide image, `style.json`, `premise.json`, `arc.json`, `slide_description.json`, and all prior slide narrations (none for slide 1) as input. Agent produces `slide_description_narration.json` with per-slide narration and the associated slide descriptions. The title slide uses a specialized narration in which the speaker introduces themselves and gives an overview of the lecture topic. |
| 6. Audio step | 14 | Takes the per-slide narration strings from `slide_description_narration.json` as input and produces `audio/slide_NNN.mp3` for each slide (merging chunked API responses into one file per slide if needed); any TTS provider is allowed. |
| 7. Video assembly | 12 | Takes matching per-slide PNGs under `slide_images/` and MP3s under `audio/` (same indices) as input and produces one `.mp4` for the whole lecture, with basename matching the PDF, one segment per slide, and segment duration following the audio (no long silent tail after speech). |
| 8. Repo & JSON package | 10 | GitHub repo contains all code, all project JSON files, and a `README.md` explaining how to set up and run the agents. The repo does not include any image, audio, or video files. |
Total: 100 points.