
Video Understanding with Attention Encoder and Multimodal Large Language Model

Date

Supervisor

Yan, Wei Qi

Item type

Thesis

Degree name

Master of Computer and Information Sciences

Publisher

Auckland University of Technology

Abstract

The challenge of achieving robust video understanding has become increasingly significant with the emergence of Multimodal Large Language Models (MLLMs). While MLLMs have shown considerable promise, effectively capturing and reasoning about complex temporal dynamics and object-level interactions in videos remains an active area of research. This project introduces a novel framework designed to enhance video understanding capabilities. We propose a new model architecture featuring a Temporal Context Gated Attention (TCGA) encoder layer; combined with a fine-tuned MLLM, it demonstrates improved performance in video event retrieval and understanding tasks. Furthermore, we present the design and implementation of a real-time system application built upon the proposed model. This work contributes a specialized video processing module and system design insights, offering a step towards more sophisticated and applicable video understanding within MLLMs. We hope our findings provide a foundation for future research in temporal-aware multimodal learning.
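The thesis text itself is not reproduced on this record, so the following is only a minimal sketch of what a gated temporal attention encoder layer of this kind might look like in PyTorch; the class name, dimensions, gating form, and the assumption of one pooled visual token per frame are illustrative guesses, not the author's implementation.

```python
import torch
import torch.nn as nn

class TemporalContextGatedAttention(nn.Module):
    """Hypothetical sketch: temporal self-attention over frame tokens with a context gate."""

    def __init__(self, dim: int, num_heads: int = 8, max_frames: int = 64):
        super().__init__()
        # Learned temporal position embeddings (assumed; the thesis may use another scheme).
        self.temporal_pos = nn.Parameter(torch.zeros(1, max_frames, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate that decides how much attended temporal context to mix into each frame feature.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, dim), one pooled visual token per frame.
        x = frame_tokens + self.temporal_pos[:, : frame_tokens.size(1)]
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        # Blend attended temporal context with the per-frame features via the sigmoid gate.
        g = self.gate(torch.cat([x, attn_out], dim=-1))
        x = x + g * attn_out
        return x + self.ffn(self.norm2(x))

# Example usage: encode 16 frame tokens per clip before handing them to an MLLM.
encoder = TemporalContextGatedAttention(dim=768)
video_tokens = torch.randn(2, 16, 768)   # 2 clips, 16 frames, 768-dim features
encoded = encoder(video_tokens)          # shape (2, 16, 768)
```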

Description

Keywords

Source

DOI

Publisher's version

Rights statement

Collections