Yan, Wei Qi
Zheng, Anni
2025-09-15
2025-09-15
2025
http://hdl.handle.net/10292/19800

The challenge of achieving robust video understanding has become increasingly significant with the emergence of Multimodal Large Language Models (MLLMs). While MLLMs have demonstrated considerable promise, effectively capturing and reasoning about complex temporal dynamics and object-level interactions in videos remains an active area of research. This project introduces a novel framework designed to enhance video understanding capabilities. We propose a new model architecture featuring a Temporal Context Gated Attention (TCGA) encoder layer which, combined with a fine-tuned MLLM, demonstrates improved performance in video event retrieval and understanding tasks. Furthermore, we present the design and implementation of a real-time system application built upon our proposed model. This work contributes a specialized video processing module and system design insights, offering a valuable step towards more sophisticated and practical video understanding within MLLMs. We hope our findings provide a foundation for future research in temporal-aware multimodal learning.

en
Video Understanding with Attention Encoder and Multimodal Large Language Model
Thesis
OpenAccess
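
The abstract does not specify the internals of the TCGA encoder layer, so the following is only a minimal, speculative PyTorch sketch of one plausible reading of the name: multi-head self-attention over frame features, modulated by a sigmoid gate conditioned on a pooled temporal context. The class name `TemporalContextGatedAttention`, the mean-pooled context summary, and all hyperparameters are assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class TemporalContextGatedAttention(nn.Module):
    """Hypothetical sketch of a gated temporal attention encoder layer:
    self-attention across frames, scaled by a learned gate that is
    conditioned on a clip-level temporal context summary (assumed design)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate takes the attended frame feature plus the pooled clip context.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, dim) frame-level features from a visual encoder.
        attn_out, _ = self.attn(x, x, x)
        # Temporal context: mean-pooled clip summary, broadcast to every frame.
        ctx = x.mean(dim=1, keepdim=True).expand_as(x)
        g = self.gate(torch.cat([attn_out, ctx], dim=-1))
        # Gated residual: the gate decides how much attended context to mix in.
        return self.norm(x + g * attn_out)

# Example: 2 clips of 16 frames with 768-dim features.
layer = TemporalContextGatedAttention()
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```

In this reading, the layer's output would feed the fine-tuned MLLM as temporally contextualized frame tokens; the gated residual lets the model fall back to the raw frame features when attended temporal context is uninformative.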