基於LLM的互動式口述影像系統 LLM-based Interactive Video Description System
This study aims to convert video content into video descriptions using Large Language Models (LLMs) and explores various approaches to video processing and LLM-generated image and video descriptions, building an automated video description system. Traditionally, video descriptions are designed for visually impaired individuals; this study, however, adapts the system for broader public use, making it easier and more efficient for anyone to understand video content. The system developed in this study leverages the strengths of LLMs in processing visual information and natural language: it segments a video into multiple frames, generates an individual description for each frame, and then consolidates these into a cohesive narration of the entire video. Additionally, because LLMs can retain conversation history, users can ask follow-up questions and interact with the system for deeper clarification and detail. To identify the best solution for generating video descriptions, this study also explores the integration of different LLMs. By converting video content into video descriptions through LLMs, this study offers an interactive video description service that not only meets the needs of visually impaired audiences but also enables general users with limited time to quickly grasp the content and details of a video.