A Multi-Dilated Convolution Network for Speech Emotion Recognition
| aut.relation.articlenumber | 8254 | |
| aut.relation.issue | 1 | |
| aut.relation.journal | Scientific Reports | |
| aut.relation.startpage | 8254 | |
| aut.relation.volume | 15 | |
| dc.contributor.author | Madanian, Samaneh | |
| dc.contributor.author | Adeleye, Olayinka | |
| dc.contributor.author | Templeton, John Michael | |
| dc.contributor.author | Chen, Talen | |
| dc.contributor.author | Poellabauer, Christian | |
| dc.contributor.author | Zhang, Enshi | |
| dc.contributor.author | Schneider, Sandra L | |
| dc.date.accessioned | 2025-03-19T22:46:43Z | |
| dc.date.available | 2025-03-19T22:46:43Z | |
| dc.date.issued | 2025-03-10 | |
| dc.description.abstract | Speech emotion recognition (SER) is an important application in Affective Computing and Artificial Intelligence. Recently, there has been a significant interest in Deep Neural Networks using speech spectrograms. As the two-dimensional representation of the spectrogram includes more speech characteristics, research interest in convolution neural networks (CNNs) or advanced image recognition models is leveraged to learn deep patterns in a spectrogram to effectively perform SER. Accordingly, in this study, we propose a novel SER model based on the learning of the utterance-level spectrogram. First, we use the Spatial Pyramid Pooling (SPP) strategy to remove the size constraint associated with the CNN-based image recognition task. Then, the SPP layer is deployed to extract both the global-level prominent feature vector and multi-local-level feature vector, followed by an attention model to weigh the feature vectors. Finally, we apply the ArcFace layer, typically used for face recognition, to the SER task, thereby obtaining improved SER performance. Our model achieved an unweighted accuracy of 67.9% on IEMOCAP and 77.6% on EMODB datasets. | |
| dc.identifier.citation | Scientific Reports, ISSN: 2045-2322 (Print); 2045-2322 (Online), Nature Portfolio, 15(1), 8254-. doi: 10.1038/s41598-025-92640-2 | |
| dc.identifier.doi | 10.1038/s41598-025-92640-2 | |
| dc.identifier.issn | 2045-2322 | |
| dc.identifier.uri | http://hdl.handle.net/10292/18922 | |
| dc.language | eng | |
| dc.publisher | Nature Portfolio | |
| dc.relation.uri | https://www.nature.com/articles/s41598-025-92640-2 | |
| dc.rights | Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. | |
| dc.rights.accessrights | OpenAccess | |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | |
| dc.subject | Convolution neural network | |
| dc.subject | Deep learning | |
| dc.subject | Emotion recognition | |
| dc.subject | Loss layer | |
| dc.subject | Spectrogram | |
| dc.subject | Speech emotion recognition | |
| dc.subject | Machine Learning and Artificial Intelligence | |
| dc.subject | Bioengineering | |
| dc.subject | Speech | |
| dc.subject | Neural Networks, Computer | |
| dc.subject | Humans | |
| dc.subject | Algorithms | |
| dc.subject.mesh | Algorithms | |
| dc.subject.mesh | Emotions | |
| dc.subject.mesh | Humans | |
| dc.subject.mesh | Neural Networks, Computer | |
| dc.subject.mesh | Speech | |
| dc.title | A Multi-Dilated Convolution Network for Speech Emotion Recognition | |
| dc.type | Journal Article | |
| pubs.elements-id | 595538 |
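
The abstract states that Spatial Pyramid Pooling (SPP) removes the input-size constraint of CNN-based recognition. The sketch below is a minimal illustration of that idea only, not the authors' implementation: the function name `spatial_pyramid_pool`, the pyramid levels `(1, 2, 4)`, and the toy input values are all assumptions made for this example. It max-pools a single-channel 2-D map over progressively finer grids, so maps of different sizes yield vectors of the same length. The single bin at level 1 plays the role of a global feature and the finer levels act as local features.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D map (list of equal-length rows) into a fixed-length list.

    `levels` is an assumed pyramid configuration, not taken from the paper.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Integer bin edges; each bin spans at least one row.
            r0 = i * height // n
            r1 = max((i + 1) * height // n, r0 + 1)
            for j in range(n):
                # Same rule for columns.
                c0 = j * width // n
                c1 = max((j + 1) * width // n, c0 + 1)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two toy maps of different sizes standing in for utterance-level
# spectrogram features (hypothetical values).
short_map = [[float(r * 5 + c) for c in range(5)] for r in range(4)]
long_map = [[float(r * 9 + c) for c in range(9)] for r in range(7)]

short_vec = spatial_pyramid_pool(short_map)
long_vec = spatial_pyramid_pool(long_map)
print(len(short_vec), len(long_vec))
```

Both vectors have 1 + 4 + 16 entries regardless of input size, which is why a fixed-size classifier can follow the pooling layer without cropping or padding the utterance.
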
Files
Original bundle
- Name: Madanian et al_2025_A multi-dilated convolution network.pdf
- Size: 2.31 MB
- Format: Adobe Portable Document Format
- Description: Journal article
