Deepfake Voice Detection: An Approach Using End-to-End Transformer With Acoustic Feature Fusion by Cross-Attention

aut.relation.endpage: 2040
aut.relation.issue: 10
aut.relation.journal: Electronics (Switzerland)
aut.relation.startpage: 2040
aut.relation.volume: 14
dc.contributor.author: Gong, LJ
dc.contributor.author: Li, XJ
dc.date.accessioned: 2025-06-17T02:41:47Z
dc.date.available: 2025-06-17T02:41:47Z
dc.date.issued: 2025-05-16
dc.description.abstract: Deepfake technology uses artificial intelligence to create highly realistic but fake audio, video, or images that are often difficult to distinguish from real content. Because of its potential use for misinformation, fraud, and identity theft, deepfake technology has gained a bad reputation in the digital world. Many recent works have reported on the detection of deepfake videos/images, but few studies have concentrated on developing robust deepfake voice detection systems. In most existing studies in this field, a deepfake voice detection system requires a large amount of training data and a robust backbone to distinguish real audio from logical access (LA) attack audio. For acoustic feature extraction, Mel-frequency Filter Bank (MFB)-based approaches are better suited to speech signals than using the raw spectrum as input. Recurrent Neural Networks (RNNs) have been successfully applied to Natural Language Processing (NLP), but these backbones suffer from vanishing or exploding gradients when processing long sequences. In addition, most deepfake voice detection systems perform poorly in cross-dataset evaluation, raising concerns about system robustness. To address these issues, we propose an acoustic feature-fusion method that combines Mel-spectrum and pitch representations via a cross-attention mechanism. We then combine a Transformer encoder with a convolutional neural network block to extract global and local features as a front end, and connect a back end consisting of a single linear layer for classification. We summarize the performance of several deepfake voice detectors on the silence-segment-processed ASVspoof 2019 dataset: our proposed method achieves an Equal Error Rate (EER) of 26.41%, while most existing methods yield EERs above 30%. On the ASVspoof 2021 dataset, our method achieves an EER as low as 28.52%, while the EERs of existing methods are all above 28.9%.
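
As an illustration of the cross-attention fusion the abstract describes, the following is a minimal sketch, assuming PyTorch and hypothetical dimensions and module names, of how a Mel-spectrogram stream can query a pitch stream so that pitch cues are injected into the Mel representation; it is not the authors' implementation.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    # Minimal sketch (assumption): queries come from the Mel stream,
    # keys/values from the pitch stream, followed by a residual add + norm.
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, mel, pitch):
        # mel, pitch: (batch, frames, d_model) frame-level embeddings
        fused, _ = self.attn(query=mel, key=pitch, value=pitch)
        return self.norm(mel + fused)

fusion = CrossAttentionFusion()
mel = torch.randn(2, 300, 128)    # e.g., 300 frames of Mel embeddings
pitch = torch.randn(2, 300, 128)  # pitch embeddings, same shape assumed
out = fusion(mel, pitch)          # (2, 300, 128) fused representation

In the paper's pipeline, a representation fused this way would feed the Transformer-encoder/CNN front end; the residual connection here is one common design choice, not necessarily the authors'.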
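
The reported numbers are Equal Error Rates, the operating point at which the false acceptance rate equals the false rejection rate. Below is a small sketch of how an EER can be computed from detection scores, assuming NumPy and scikit-learn; this is a generic evaluation utility, not the authors' scoring code.

import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 = bona fide, 0 = spoof; scores: higher means more bona fide
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FPR and FNR cross
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.6, 0.7])
print(f"EER = {compute_eer(labels, scores):.2%}")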
dc.identifier.citation: Electronics (Switzerland), ISSN: 2079-9292 (Print); 2079-9292 (Online), MDPI AG, 14(10), 2040-2040. doi: 10.3390/electronics14102040
dc.identifier.doi: 10.3390/electronics14102040
dc.identifier.issn: 2079-9292
dc.identifier.uri: http://hdl.handle.net/10292/19328
dc.language: en
dc.publisher: MDPI AG
dc.relation.uri: https://www.mdpi.com/2079-9292/14/10/2040
dc.rights: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
dc.rights.accessrights: OpenAccess
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: 40 Engineering
dc.subject: 4009 Electronics, Sensors and Digital Hardware
dc.subject: Networking and Information Technology R&D (NITRD)
dc.subject: Bioengineering
dc.subject: Machine Learning and Artificial Intelligence
dc.subject: 0906 Electrical and Electronic Engineering
dc.subject: end-to-end
dc.subject: transformer
dc.subject: cross attention
dc.subject: feature fusion
dc.subject: supervised learning
dc.subject: deepfake voice recognition
dc.title: Deepfake Voice Detection: An Approach Using End-to-End Transformer With Acoustic Feature Fusion by Cross-Attention
dc.type: Journal Article
pubs.elements-id: 607153

Files

Original bundle

Name: Deepfake voice detection.pdf
Size: 1.65 MB
Format: Adobe Portable Document Format
Description: Journal article