
SG-DTAM: Joint Staged Generation and Dynamic Time Alignment for Missing and Unaligned Modalities in Sentiment Analysis

Item type

Journal Article

Publisher

Elsevier BV

Abstract

Multimodal Sentiment Analysis (MSA) aims to infer users’ emotional states by integrating information from multiple modalities, such as language, audio, and visual data. However, real-world multimodal data often presents two critical challenges: missing modalities and unaligned multimodal sequences. Missing sources lead to information loss, while temporal misalignment introduces inconsistencies; both significantly degrade analytical accuracy. While many existing approaches effectively address each challenge in isolation, few can tackle both simultaneously without resorting to complex architectures or incurring substantial computational costs. To overcome these limitations, we propose SG-DTAM, a novel framework that combines staged generation with multi-head dynamic temporal alignment. In the first stage, conditional mutual information guides a hierarchical series of cross-modal attention modules that sequentially reconstruct each missing modality. In the subsequent alignment stage, a set of attention heads with adaptive weighting reconciles temporal discrepancies across all modalities without relying on external synchronization labels. Throughout the process, we introduce a dual-supervision objective that combines an InfoNCE-based contrastive loss with a reconstruction loss, ensuring both precise modality synthesis and the development of resilient feature representations. We evaluate SG-DTAM on four benchmark MSA datasets: CMU-MOSI, CMU-MOSEI, IEMOCAP, and MELD. Experimental results demonstrate that our framework achieves competitive or state-of-the-art performance with relatively few learnable parameters. Notably, SG-DTAM remains robust in scenarios involving both missing and misaligned modalities, underscoring its effectiveness in real-world multimodal sentiment analysis tasks.
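To illustrate the flavor of the dual-supervision objective described above, the following is a minimal pure-Python sketch of an InfoNCE-style contrastive term combined with a mean-squared-error reconstruction term. The function names, the cosine-similarity scoring, the temperature, and the weighting factor `lam` are illustrative assumptions, not the paper's actual formulation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.1):
    # InfoNCE: each anchor's positive is the same-index row of `positives`;
    # all other rows act as in-batch negatives.
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(anchors)

def reconstruction_loss(pred, target):
    # Mean squared error between generated and ground-truth features.
    n = sum(len(p) for p in pred)
    return sum((x - y) ** 2
               for p, t in zip(pred, target)
               for x, y in zip(p, t)) / n

def dual_objective(anchors, positives, pred, target, lam=0.5):
    # Hypothetical weighting `lam` balancing the two supervision terms.
    return info_nce(anchors, positives) + lam * reconstruction_loss(pred, target)
```

With perfectly reconstructed features, the reconstruction term vanishes and only the contrastive term (which is always non-negative, since the log-sum-exp dominates any single logit) remains.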

Source

Expert Systems with Applications, ISSN: 0957-4174 (Print), Elsevier BV, Article 129750. doi: 10.1016/j.eswa.2025.129750

Rights statement

© 2025 Published by Elsevier Ltd. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.