A Human-Machine Collaborative Prompt Model for Audio Description of Local Cultural Promotional Videos

  • Wenyan Shao College of Foreign Languages, Minjiang University, Fuzhou 350108, Fujian, China
  • Lingqian Zheng College of Foreign Languages, Minjiang University, Fuzhou 350108, Fujian, China
  • Xiaoshan Lin College of Foreign Languages, Minjiang University, Fuzhou 350108, Fujian, China
  • Lirong Yan College of Foreign Languages, Minjiang University, Fuzhou 350108, Fujian, China
Keywords: Audio description, Human-machine collaboration, Multimodal large language models, Cultural heritage

Abstract

This study explores the development of an automated audio description (AD) framework for local cultural promotional videos using a human-machine collaborative approach. The proposed framework integrates a multimodal large language model, Doubao, with human expertise to enhance AD production, particularly for videos featuring culturally rich content. Taking the Fujian-based promotional video “Where There Are Dreams, There Is Fu” as a case study, the research addresses two primary challenges in AD: cross-frame coherence and accurate interpretation of cultural symbols. Through iterative human-machine collaboration, the model generates coherent, culturally grounded AD scripts that align with the cognitive patterns of visually impaired audiences. The findings highlight the potential of GenAI-driven solutions for creating accessible content for public welfare organizations while maintaining cultural authenticity. The proposed framework offers a scalable, cost-effective approach to improving accessibility and promoting cultural heritage for visually impaired individuals.
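The abstract does not give implementation details, but the iterative human-machine workflow it describes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Scene` structure, the prompt fields, and the functions `generate_ad_draft` and `apply_reviewer_feedback` are hypothetical stand-ins, not the authors' pipeline or the Doubao API. The sketch shows one plausible way to address the two challenges named above: feeding each accepted description back into the next prompt (cross-frame coherence) and routing every draft through a human reviewer who checks cultural terms (cultural symbol interpretation).

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """One video segment to be described (timestamps plus key visual facts)."""
    start: str
    end: str
    visual_notes: str                                    # what the frame shows
    cultural_terms: list = field(default_factory=list)   # symbols needing an expert gloss

def build_prompt(scene: Scene, prior_description: str) -> str:
    """Compose a prompt that carries forward the previous scene's accepted
    description so the model can keep descriptions coherent across frames."""
    return (
        f"Previous description: {prior_description or 'none'}\n"
        f"Scene {scene.start}-{scene.end}: {scene.visual_notes}\n"
        f"Cultural terms to explain for a visually impaired audience: "
        f"{', '.join(scene.cultural_terms) or 'none'}\n"
        "Write a concise audio description that fits between stretches of dialogue."
    )

def generate_ad_draft(prompt: str) -> str:
    """Placeholder for a call to a multimodal LLM such as Doubao.
    The real API is not specified in the abstract; this stub just echoes the scene line."""
    return f"[DRAFT] {prompt.splitlines()[1]}"

def apply_reviewer_feedback(draft: str, feedback: str) -> str:
    """Placeholder for the human step: an AD expert corrects cultural references
    and timing before the script line is accepted."""
    return f"{draft} (revised: {feedback})" if feedback else draft

def produce_ad_script(scenes, reviewer):
    """Iterate scene by scene, alternating machine drafting and human review."""
    script, prior = [], ""
    for scene in scenes:
        draft = generate_ad_draft(build_prompt(scene, prior))
        final = apply_reviewer_feedback(draft, reviewer(scene, draft))
        script.append(final)
        prior = final   # accepted text becomes context for the next scene
    return script

if __name__ == "__main__":
    scenes = [
        Scene("00:00", "00:08", "Aerial view of Fuzhou's Sanfang Qixiang lanes",
              ["Sanfang Qixiang"]),
        Scene("00:08", "00:15", "Close-up of a 'Fu' character paper-cut",
              ["Fu character"]),
    ]
    # A trivial reviewer that asks for a gloss whenever cultural terms are present.
    demo_reviewer = lambda scene, draft: "add cultural gloss" if scene.cultural_terms else ""
    for line in produce_ad_script(scenes, demo_reviewer):
        print(line)
```

In a real deployment the stubbed functions would call the multimodal model and collect reviewer edits interactively; the loop structure, however, is the point of the sketch: the machine drafts, the human corrects, and the corrected text constrains the next draft.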


Published
2025-09-09