Homework 2: Multimodal Insight Observer

In this assignment, you will build a sophisticated multimodal application that allows an AI to evaluate a user's response to a YouTube video through two data streams: real-time facial expressions and a qualitative interview. You are moving beyond simple data display into creating an interactive "feedback agent."

Instructions

Your app must allow a user to watch a specific video while the AI "watches" them. After the video ends, the AI should use its observations to conduct a live interview with the user, culminating in a final comprehensive sentiment report. The app must have the following features:

  1. YouTube Video Metadata: The app must include an input box for a YouTube URL. When a URL is provided, the app must programmatically retrieve the Video Title, Duration (in seconds), Description, and Transcript.
  2. Visual Evaluation: Use the webcam to capture the user's reactions while they watch the video.
    • Model: Use gpt-5-nano for all AI processing.
    • Sampling: Capture and send a maximum of 20 images to the AI for the initial visual evaluation. Display the visual evaluation in a nicely formatted way.
  3. The Interviewer (Chatbot): After the video ends, start a chatbot interface from a "Start Interview" button. The AI's system prompt must be injected with the video metadata and the results of the visual evaluation. The AI should ask the user what they liked and disliked, and it should reference their facial expressions (e.g., "I noticed you smiled at [Time X], what caused that?").
  4. Final Synthesis: Include an "End Chat" button. When clicked, the app sends the full chat history, the video metadata, and the visual evaluation to the AI to write a final summary of how the user truly felt about the content. The report should be displayed in a nicely formatted way.
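
The pure-logic parts of features 1 to 3 can be sketched without any network or model calls. The following is a minimal illustration, not a required design: the helper names (extract_video_id, sample_timestamps, build_system_prompt), the metadata dictionary keys, the one-frame-per-second floor, and the 4000-character transcript cap are all assumptions made for this example. Retrieving the metadata itself and calling gpt-5-nano are left out.

```python
from typing import Dict, List, Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical cap so a long transcript does not dominate the system prompt.
MAX_TRANSCRIPT_CHARS = 4000
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in {"embed", "shorts", "live"}:
            return parts[1]
    return None


def sample_timestamps(duration_s: float, max_frames: int = 20) -> List[float]:
    """Evenly spaced capture times (seconds), never more than max_frames."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    # Assumption: at most one frame per second for very short videos.
    n = min(max_frames, max(1, int(duration_s)))
    step = duration_s / n
    # Sample at the midpoint of each of the n equal segments.
    return [round(step * (i + 0.5), 2) for i in range(n)]


def build_system_prompt(metadata: Dict[str, object], visual_evaluation: str) -> str:
    """Inject video metadata and the visual evaluation into the system prompt."""
    transcript = str(metadata.get("transcript") or "")[:MAX_TRANSCRIPT_CHARS]
    return (
        "You are an interviewer asking a viewer how they felt about a video.\n"
        "Ask what they liked and disliked, and refer to their observed "
        "facial expressions with timestamps where possible.\n\n"
        f"Video title: {metadata.get('title', 'unknown')}\n"
        f"Duration (seconds): {metadata.get('duration', 'unknown')}\n"
        f"Description: {metadata.get('description', '')}\n"
        f"Transcript (possibly truncated): {transcript}\n\n"
        f"Visual evaluation of the viewer:\n{visual_evaluation}\n"
    )
```

With this sketch, a 100-second video gives 20 capture times starting at 2.5 s and spaced 5 s apart, which stays within the 20-image limit. Recording the capture time alongside each frame is one way to let the interviewer refer to a specific moment.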

Test Video

Please test your application using the official trailer for The Mandalorian and Grogu:

Submission Checklist

Your submission should include the following files zipped into a file called hw2.zip:

  • app.py
  • requirements.txt
  • final_prompt.txt

Grading Breakdown (100 Points)

Component | Points | Description
YouTube Video Metadata | 20 | Successfully extracts Title, Duration, Description, and Transcript from the live YouTube URL.
Visual Evaluation | 20 | Correctly captures up to 20 frames and uses gpt-5-nano to create a visual evaluation.
Interview Logic | 20 | "Start Interview" button starts a chatbot that successfully uses the initial evaluation and video metadata within its system prompt.
Final Synthesis | 20 | "End Chat" button triggers a coherent report that integrates chat history with visual/video data. The report is displayed in a nicely formatted way.
Final Prompt Completeness | 15 | final_prompt.txt contains the exact final prompt used in the app and clearly incorporates YouTube video metadata, the visual evaluation, and the chat history, as described in the assignment.
Proper Submission Zip | 5 | hw2.zip is submitted and correctly contains app.py, requirements.txt, and final_prompt.txt with the expected structure and filenames.
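
Before submitting, the contents of hw2.zip can be checked with a few lines of standard-library Python. This is an optional sketch: the helper name missing_from_zip is hypothetical, and the check compares file names only, ignoring any folder they sit in, because the exact folder layout expected inside the zip is an assumption here.

```python
import zipfile

# File names taken from the grading breakdown.
REQUIRED_FILES = {"app.py", "requirements.txt", "final_prompt.txt"}


def missing_from_zip(zip_source) -> set:
    """Return the required file names that are absent from the archive.

    zip_source may be a path such as "hw2.zip" or an open binary file object.
    """
    with zipfile.ZipFile(zip_source) as zf:
        # Compare base names only, skipping directory entries.
        names = {
            name.rsplit("/", 1)[-1]
            for name in zf.namelist()
            if not name.endswith("/")
        }
    return REQUIRED_FILES - names
```

Calling missing_from_zip("hw2.zip") returns an empty set when all three files are present, and otherwise the set of names still missing.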