# Homework 7: Agentic Video Lecture Pipeline
In this assignment you will implement a multi-stage pipeline that, when run locally, turns a PDF slide deck into a single narrated video: one still image per slide, synchronized audio, all concatenated into one `.mp4`. You will use AI agents for the structured steps (style profile, slide descriptions, premise, arc, narration), then text-to-speech and video assembly (for example with ffmpeg). Your narration should match the speaking style derived from the lecture transcript.
## Deliverables
Create a Git repository with Python code that implements the following pipeline:
- **Style file** — Read the lecture transcript file and produce `style.json` in the repository root: a structured description of the instructor’s speaking style (tone, pacing, fillers, how they frame ideas — fields you choose). This file informs the narration agent.
- **`projects/` folder** — At the repo root, include a `projects/` directory. Each new project goes here.
- **Slide description agent** — Create a new project folder `projects/project_<current date and time>`. Rasterize the Lecture 17 slides to one PNG per slide under `slide_images/` inside that project. For each slide, call the AI model with the current slide image and all previous slide descriptions as input to generate the current slide description. Write every slide description to `slide_description.json` in the project folder.
- **Premise agent** — Call the AI model with `slide_description.json` as input and write `premise.json`: a structured lecture premise (thesis, scope, learning objectives, audience, or fields you define).
- **Arc agent** — Call the AI model with `premise.json` and `slide_description.json` as input and write `arc.json`: a structured arc (flow, phases or acts, how ideas build) consistent with the premise.
- **Narration agent** — For each slide, call the AI model with the current slide image, `style.json`, `premise.json`, `arc.json`, `slide_description.json`, and all prior slide narrations to generate the current slide narration. Write `slide_description_narration.json` in the project folder, containing both the narrations and the associated slide descriptions. On the title slide, the narration should have the speaker introduce themselves and give a short summary of the lecture topic.
- **Audio step** — Use a text-to-speech model (Gemini, ElevenLabs, etc.) to synthesize each narration into `audio/slide_001.mp3`, `audio/slide_002.mp3`, etc. (merge chunked API responses into one MP3 per slide if needed). Store the files under an `audio/` folder inside the project folder.
- **Video assembly step** — For each slide, mux the PNG with the matching MP3 into a video segment, then concatenate the segments into one `.mp4` whose basename matches the PDF (e.g. `Lecture_17_AI_screenplays.pdf` → `Lecture_17_AI_screenplays.mp4`). Each segment's duration should track its audio (avoid long silent tails).
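The rasterization step above can be sketched as follows. This is a sketch, not a required implementation: it assumes the PyMuPDF package (`import fitz`), a DPI of 150, and the `slide_NNN.png` naming convention; any rasterizer that yields one PNG per page is fine.

```python
import pathlib

def slide_png_name(index: int) -> str:
    """1-based slide index -> zero-padded file name (naming is an assumption)."""
    return f"slide_{index:03d}.png"

def rasterize_deck(pdf_path: str, out_dir: str, dpi: int = 150) -> list[str]:
    """Render each PDF page to one PNG under out_dir using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=dpi)  # rasterize this page
            path = out / slide_png_name(i)
            pix.save(str(path))
            written.append(str(path))
    return written
```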
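The prior-description chaining in the slide description agent can be structured like this. The model call itself is a placeholder (`call_model` stands in for whatever AI client you use); the JSON record fields are assumptions you may rename.

```python
import json
from pathlib import Path
from typing import Callable

def describe_slides(
    image_paths: list[str],
    # (image_path, prior descriptions) -> description; placeholder for your AI client
    call_model: Callable[[str, list[str]], str],
    out_path: str = "slide_description.json",
) -> list[dict]:
    """Describe each slide with ALL previous descriptions in the model context."""
    descriptions: list[dict] = []
    for i, image_path in enumerate(image_paths, start=1):
        prior = [d["description"] for d in descriptions]
        text = call_model(image_path, prior)  # the chaining: priors ride along
        descriptions.append({"slide": i, "image": image_path, "description": text})
    Path(out_path).write_text(json.dumps(descriptions, indent=2))
    return descriptions
```

The narration agent follows the same pattern, with `style.json`, `premise.json`, `arc.json`, and prior narrations added to the context.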
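For the audio step, merging chunked API responses can be as simple as byte concatenation, which generally works for raw MPEG audio streams; if your TTS provider returns WAV or PCM chunks, convert before merging. A minimal sketch (folder and naming per the spec):

```python
from pathlib import Path

def write_slide_mp3(chunks: list[bytes], slide_index: int, audio_dir: str = "audio") -> str:
    """Merge chunked TTS responses into one MP3 per slide by byte concatenation."""
    out_dir = Path(audio_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"slide_{slide_index:03d}.mp3"
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return str(path)
```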
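If you use ffmpeg for video assembly, the two steps above map to two commands: loop the still over the audio with `-shortest` so the segment ends with the speech, then join segments with the concat demuxer. A sketch of the command construction (codec choices and file names are assumptions; run each list with `subprocess.run(cmd, check=True)`):

```python
from pathlib import Path

def segment_cmd(png: str, mp3: str, out_mp4: str) -> list[str]:
    """ffmpeg command: loop the still image over the audio; -shortest stops
    the segment when the audio ends (no long silent tail)."""
    return [
        "ffmpeg", "-y",
        "-loop", "1", "-i", png,   # still image as a looping video input
        "-i", mp3,                 # per-slide narration audio
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",     # widely compatible pixel format
        "-shortest",
        out_mp4,
    ]

def concat_cmd(segment_paths: list[str], list_file: str, out_mp4: str) -> list[str]:
    """Write a concat-demuxer list file and return the ffmpeg concat command."""
    Path(list_file).write_text("".join(f"file '{p}'\n" for p in segment_paths))
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", out_mp4]
```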
## Submission
Canvas: Submit the URL of a GitHub repository with your Python implementation. Do not include any image, MP3, or MP4 files in your repo (they are not needed for grading and the files are too large).
Your repository should have this structure (image, audio, and video files should not be included, but empty slide_images/ and audio/ folders are OK). Put Lecture_17_AI_screenplays.pdf in the repository root so the grader can run your pipeline without hunting for the deck.
```
your-repo/
├── README.md
├── style.json
├── Lecture_17_AI_screenplays.pdf
├── requirements.txt
├── run_lecture_pipeline.py   (your entrypoint for the agentic flow)
├── lecture_agents/           (your agent code)
└── projects/
    └── project_YYYYMMDD_HHMMSS/
        ├── premise.json
        ├── arc.json
        ├── slide_description.json
        └── slide_description_narration.json
```
`slide_images/`, `audio/`, and the final `.mp4` are produced when the pipeline runs; they should be listed in `.gitignore` to avoid committing large files.
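One possible set of `.gitignore` patterns for this layout (a suggestion, not a requirement; adjust to your actual paths):

```
# generated artifacts (large; not committed)
projects/*/slide_images/*.png
projects/*/audio/*.mp3
*.mp4
```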
## Grading breakdown (100 points)
Each row states what that stage takes as input and what it produces. Points may be deducted if required inputs are missing (per code review or when the grader runs the pipeline).
| Component | Points | Requirement |
|---|---|---|
| 1. Style file | 8 | Agent takes the instructor lecture caption/transcript file (the linked “Lecture 11 Section 2” captions) as input and produces `style.json` at the repo root; its content reflects that transcript. |
| 2. Slide description agent | 18 | Agent rasterizes the PDF deck to per-slide images and, for each slide, takes the current slide image plus all previous slide descriptions in the model context as input. Agent produces `slide_description.json` covering all slides; prior-description chaining must be real, not missing or trivial. |
| 3. Premise agent | 10 | Agent takes the entire `slide_description.json` document as input and produces `premise.json`: a structured premise grounded in the deck and used by later stages. |
| 4. Arc agent | 10 | Agent takes `premise.json` and `slide_description.json` as input and produces `arc.json`, consistent with the premise and slide content and supporting a coherent progression across the deck. |
| 5. Narration agent | 18 | Agent takes the current slide image, `style.json`, `premise.json`, `arc.json`, `slide_description.json`, and all prior slide narrations (none for slide 1) as input. Agent produces `slide_description_narration.json` with per-slide narration and the associated slide descriptions. The title slide uses a specialized narration in which the speaker introduces themselves and gives an overview of the lecture topic. |
| 6. Audio step | 14 | Takes the per-slide narration strings from `slide_description_narration.json` as input and produces `audio/slide_NNN.mp3` for each slide (merging chunked API responses into one file per slide if needed); any TTS provider is allowed. |
| 7. Video assembly | 12 | Takes matching per-slide PNGs under `slide_images/` and MP3s under `audio/` (same indices) as input and produces one `.mp4` for the whole lecture, with basename matching the PDF, one segment per slide, and segment duration following the audio (no long silent tail after speech). |
| 8. Repo & JSON package | 10 | GitHub repo contains all code, all project JSON files, and a `README.md` explaining how to set up and run the agents. The repo does not include any image, audio, or video files. |
Total: 100 points.