Kumar, Himanshu, Aruldoss, Martin and Wynn, Martin G (ORCID: https://orcid.org/0000-0001-7619-6079) (2025) Cross-Modal Attention Fusion: A Deep Learning and Affective Computing Model for Emotion Recognition. Multimodal Technologies and Interaction, 9 (116), pp. 1-33. doi:10.3390/mti9120116
Text: 15593 Kumar, h. et al (2025) Cross-Modal Attention Fusion - A Deep Learning and Affective Computing Model for Emotion Recognition.pdf (Draft Version, 2MB). Available under License Creative Commons Attribution 4.0.
Abstract
Artificial emotional intelligence is a sub-domain of human–computer interaction research that aims to develop deep learning models capable of detecting and interpreting human emotional states through various modalities. A major challenge in this domain is identifying meaningful correlations between heterogeneous modalities (for example, between audio and visual data) owing to their distinct temporal and spatial properties. Traditional fusion techniques used in multimodal learning to combine data from different sources often fail to capture meaningful cross-modal interactions in a computationally efficient manner, and struggle to adapt to varying modality reliability. Following a review of the relevant literature, this study adopts an experimental research method to develop and evaluate a mathematical cross-modal fusion model, thereby addressing a gap in the extant research literature. The framework uses Tucker tensor decomposition to factorise the multi-dimensional data array into a core tensor and a set of factor matrices, supporting the integration of temporal features from the audio modality and spatiotemporal features from the visual modality. A cross-attention mechanism is incorporated to enhance cross-modal interaction, enabling each modality to attend to the relevant information in the other. The efficacy of the model is evaluated on three publicly available datasets, and the results demonstrate that the proposed fusion technique outperforms conventional fusion methods and several more recent approaches. The findings break new ground in this field of study and will be of interest to researchers and developers in artificial emotional intelligence.
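To make the two mechanisms named in the abstract concrete, the PyTorch sketch below shows one way a Tucker-style low-rank bilinear fusion and a cross-attention step between audio and visual features might be wired together. It is a minimal illustration, not the authors' implementation: all class names, dimensions, and parameters (`TuckerFusion`, `CrossModalAttention`, `rank=32`, 7 emotion classes) are hypothetical.

```python
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    """Low-rank Tucker-style bilinear fusion of two modality vectors.

    Each modality is projected through its own factor matrix into a shared
    low-rank space, and the interaction is combined through a learned core
    map, mirroring the factor-matrix/core-tensor structure of a Tucker
    decomposition. Hypothetical sketch only.
    """
    def __init__(self, audio_dim, visual_dim, rank, out_dim):
        super().__init__()
        self.audio_factor = nn.Linear(audio_dim, rank)    # factor matrix, audio mode
        self.visual_factor = nn.Linear(visual_dim, rank)  # factor matrix, visual mode
        self.core = nn.Linear(rank, out_dim)              # core acting on the interaction

    def forward(self, audio, visual):
        # Elementwise product of the low-rank projections captures the
        # bilinear audio-visual interaction without a full outer product.
        return self.core(self.audio_factor(audio) * self.visual_factor(visual))

class CrossModalAttention(nn.Module):
    """Each modality attends to the relevant information in the other."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio_seq, visual_seq):
        # Audio queries attend over visual keys/values, and vice versa.
        audio_att, _ = self.a2v(audio_seq, visual_seq, visual_seq)
        visual_att, _ = self.v2a(visual_seq, audio_seq, audio_seq)
        return audio_att, visual_att

# Toy usage with random features: 8 clips, 20 audio frames, 16 video frames.
audio = torch.randn(8, 20, 64)
visual = torch.randn(8, 16, 64)
attn = CrossModalAttention(dim=64)
a_att, v_att = attn(audio, visual)
# Pool each attended sequence to a clip-level vector, then fuse.
fusion = TuckerFusion(audio_dim=64, visual_dim=64, rank=32, out_dim=7)
logits = fusion(a_att.mean(dim=1), v_att.mean(dim=1))  # e.g. 7 emotion classes
print(logits.shape)  # torch.Size([8, 7])
```

Constraining the bilinear interaction to a low rank is the usual motivation for Tucker-based fusion: it keeps the cross-modal product tractable compared with a full outer-product tensor.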
| Field | Value |
|---|---|
| Item Type: | Article |
| Article Type: | Article |
| Uncontrolled Keywords: | Artificial emotional intelligence; Human–computer interaction; Cross-attention mechanism; Categorical emotions; Spatiotemporal features; Tucker decomposition; Cross-modal framework |
| Subjects: | T Technology > T Technology (General) |
| Divisions: | Schools and Research Institutes > School of Business, Computing and Social Sciences |
| Depositing User: | Martin Wynn |
| Date Deposited: | 01 Dec 2025 12:29 |
| Last Modified: | 01 Dec 2025 12:30 |
| URI: | https://eprints.glos.ac.uk/id/eprint/15593 |