Homework 2: Multimodal Insight Observer

In this assignment, you will build a sophisticated multimodal application that allows an AI to evaluate a user's response to a YouTube video through two data streams: real-time facial expressions and a qualitative interview. You are moving beyond simple data display into creating an interactive "feedback agent."

Instructions

Your app must allow a user to watch a specific video while the AI "watches" them. After the video ends, the AI should use its observations to conduct a live interview with the user, culminating in a final comprehensive sentiment report. The app must have the following features:

YouTube Video Metadata: The app must include an input box for a YouTube URL. When a URL is provided, the app must programmatically retrieve the Video Title, Duration (in seconds), Description, and Transcript.
Visual Evaluation: Use the webcam to capture the user's reactions while they watch the video.
- Model: Use gpt-5-nano for all AI processing.
- Sampling: Capture and send a maximum of 20 images to the AI for the initial visual evaluation. Display the visual evaluation in a nicely formatted way.
The Interviewer (Chatbot): After the video ends, initialize a chatbot interface through a button "Start Interview". The AI's System Prompt must be "injected" with the video metadata and the results of the visual evaluation. The AI should ask the user what they liked/disliked and reference their facial expressions (e.g., "I noticed you smiled at [Time X], what caused that?").
Final Synthesis: Include an "End Chat" button. When clicked, the app sends the full chat history, the video metadata, and the visual evaluation to the AI to write a final summary of how the user truly felt about the content. The report should be displayed in a nicely formatted way.

Test Video

Please test your application using the official trailer for The Mandalorian and Grogu:

https://www.youtube.com/watch?v=_pa1KLXuW0Y

Submission Checklist

Your submission should include the following files zipped into a file called hw2.zip:

app.py - Your full Streamlit application code.
requirements.txt - All dependencies (including libraries for YouTube scraping).
final_prompt.txt - A text file containing the exact final prompt used to generate the Final Synthesis Report (which should include the YouTube video metadata, the visual evaluation, and the chat history generated when you test the app on the test video).

Grading Breakdown (100 Points)

Component	Points	Description
YouTube Video Metadata	20	Successfully extracts Title, Duration, Description, and Transcript from the live YouTube URL.
Visual Evaluation	20	Correctly captures up to 20 frames and uses gpt-5-nano to create a visual evaluation.
Interview Logic	20	"Start Interview" button starts a chatbot that successfully uses the initial evaluation and video metadata within its system prompt.
Final Synthesis	20	"End Chat" button triggers a coherent report that integrates chat history with visual/video data. The report is displayed in a nicely formatted way.
Final Prompt Completeness	15	`final_prompt.txt` contains the exact final prompt used in the app and clearly incorporates YouTube video metadata, the visual evaluation, and the chat history, as described in the assignment.
Proper Submission Zip	5	`hw2.zip` is submitted and correctly contains `app.py`, `requirements.txt`, and `final_prompt.txt` with the expected structure and filenames.