Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition

Kumar, Himanshu, Aruldoss, Martin and Wynn, Martin G (ORCID: https://orcid.org/0000-0001-7619-6079) (2025) Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technologies and Interaction, 9 (116). pp. 1-33. doi:10.3390/mti9120116

Text: 15593 Kumar, h. et al (2025) Cross-Modal Attention Fusion - A Deep Learning and Affective Computing Model for Emotion Recognition.pdf (Draft Version, 2MB)
Available under License Creative Commons Attribution 4.0.

Abstract

Artificial emotional intelligence is a sub-domain of human–computer interaction research that aims to develop deep learning models capable of detecting and interpreting human emotional states through various modalities. A major challenge in this domain is identifying meaningful correlations between heterogeneous modalities, for example between audio and visual data, due to their distinct temporal and spatial properties. Traditional fusion techniques used in multimodal learning to combine data from different sources often fail to capture meaningful cross-modal interactions at acceptable computational cost, and struggle to adapt to varying modality reliability. Following a review of the relevant literature, this study adopts an experimental research method to develop and evaluate a mathematical cross-modal fusion model, thereby addressing a gap in the extant research. The framework uses Tucker tensor decomposition, which factorises a multi-dimensional data array into a core tensor and a set of factor matrices, to support the integration of temporal features from the audio modality and spatiotemporal features from the visual modality. A cross-attention mechanism is incorporated to enhance cross-modal interaction, enabling each modality to attend to the relevant information from the other. The efficacy of the model is rigorously evaluated on three publicly available datasets, and the results demonstrate that the proposed fusion technique outperforms conventional fusion methods and several more recent approaches. The findings break new ground in this field of study and will be of interest to researchers and developers in artificial emotional intelligence.
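
As a minimal sketch of the two mechanisms the abstract describes, the PyTorch code below pairs a cross-attention step, in which each modality attends to the other, with a low-rank bilinear fusion in the spirit of Tucker decomposition. All layer sizes, the seven-class output, the mean-pooling step and every identifier are hypothetical assumptions for illustration, not the authors' implementation.

# Hypothetical sketch of cross-attention plus Tucker-style fusion;
# dimensions and names are assumed, not taken from the paper.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=256, shared_dim=128,
                 core_rank=64, num_emotions=7, num_heads=4):
        super().__init__()
        # Project both modalities into a shared space so attention is well-defined.
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        self.visual_proj = nn.Linear(visual_dim, shared_dim)
        # Each modality attends to the other: queries from one stream,
        # keys/values from the other.
        self.audio_to_visual = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(shared_dim, num_heads, batch_first=True)
        # Tucker-style fusion: factor matrices map each modality onto the
        # modes of a small learned core tensor.
        self.audio_factor = nn.Linear(shared_dim, core_rank, bias=False)
        self.visual_factor = nn.Linear(shared_dim, core_rank, bias=False)
        self.core = nn.Parameter(torch.randn(core_rank, core_rank, shared_dim) * 0.01)
        self.classifier = nn.Linear(shared_dim, num_emotions)

    def forward(self, audio, visual):
        # audio: (batch, T_audio, audio_dim); visual: (batch, T_visual, visual_dim)
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        # Cross-attention: audio queries visual features, and vice versa.
        a_att, _ = self.audio_to_visual(a, v, v)
        v_att, _ = self.visual_to_audio(v, a, a)
        # Temporal pooling to one vector per modality (mean pooling is an
        # assumption; the paper may pool differently).
        a_vec = a_att.mean(dim=1)
        v_vec = v_att.mean(dim=1)
        # Bilinear interaction through the core tensor:
        # fused_k = sum_ij core[i, j, k] * (U_a a)_i * (U_v v)_j
        fa = self.audio_factor(a_vec)   # (batch, core_rank)
        fv = self.visual_factor(v_vec)  # (batch, core_rank)
        fused = torch.einsum('bi,bj,ijk->bk', fa, fv, self.core)
        return self.classifier(fused)

model = CrossModalAttentionFusion()
logits = model(torch.randn(2, 50, 128), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 7])

The einsum line realises the bilinear interaction through the learned core tensor, the low-rank idea behind Tucker-based fusion; the published article should be consulted for the actual architecture and training details.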

Item Type: Article
Article Type: Article
Uncontrolled Keywords: Artificial emotional intelligence; Human–computer interaction; Cross-attention mechanism; Categorical emotions; Spatiotemporal features; Tucker decomposition; Cross-modal framework
Subjects: T Technology > T Technology (General)
Divisions: Schools and Research Institutes > School of Business, Computing and Social Sciences
Depositing User: Martin Wynn
Date Deposited: 01 Dec 2025 12:29
Last Modified: 01 Dec 2025 12:30
URI: https://eprints.glos.ac.uk/id/eprint/15593
