
Video Understanding with Attention Encoder and Multimodal Large Language Model

Date

Supervisor

Yan, Wei Qi

Item type

Thesis

Degree name

Master of Computer and Information Sciences

Publisher

Auckland University of Technology

Abstract

The challenge of achieving robust video understanding has become increasingly significant with the emergence of Multimodal Large Language Models (MLLMs). While MLLMs have shown considerable promise, effectively capturing and reasoning about complex temporal dynamics and object-level interactions in videos remains an active area of research. This project introduces a novel framework designed to enhance video understanding capabilities. We propose a new model architecture featuring a Temporal Context Gated Attention (TCGA) encoder layer; combined with a fine-tuned MLLM, it demonstrates improved performance in video event retrieval and understanding tasks. Furthermore, we present the design and implementation of a real-time system application built upon the proposed model. This work contributes a specialized video processing module and system design insights, offering a step towards more sophisticated and applicable video understanding within MLLMs. We hope our findings provide a foundation for future research in temporal-aware multimodal learning.
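The thesis text itself is not reproduced on this record, so the following is only a minimal sketch of what a gated temporal attention encoder layer of this kind might look like in PyTorch; the class name, dimensions, gating form, and the assumption of one pooled visual token per frame are illustrative guesses, not the author's implementation.

```python
import torch
import torch.nn as nn

class TemporalContextGatedAttention(nn.Module):
    """Hypothetical sketch: temporal self-attention over frame tokens with a context gate."""

    def __init__(self, dim: int, num_heads: int = 8, max_frames: int = 64):
        super().__init__()
        # Learned temporal position embeddings (assumed; the thesis may use another scheme).
        self.temporal_pos = nn.Parameter(torch.zeros(1, max_frames, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate that decides how much attended temporal context to mix into each frame feature.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, dim), one pooled visual token per frame.
        x = frame_tokens + self.temporal_pos[:, : frame_tokens.size(1)]
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        # Blend attended temporal context with the per-frame features via the sigmoid gate.
        g = self.gate(torch.cat([x, attn_out], dim=-1))
        x = x + g * attn_out
        return x + self.ffn(self.norm2(x))

# Example usage: encode 16 frame tokens per clip before handing them to an MLLM.
encoder = TemporalContextGatedAttention(dim=768)
video_tokens = torch.randn(2, 16, 768)   # 2 clips, 16 frames, 768-dim features
encoded = encoder(video_tokens)          # shape (2, 16, 768)
```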

Description

Keywords

Source

DOI

Publisher's version

Rights statement

Collections