Deepfake Voice Detection: An Approach Using End-to-End Transformer With Acoustic Feature Fusion by Cross-Attention

Gong, LJ; Li, XJ

aut.relation.endpage	2040
aut.relation.issue	10
aut.relation.journal	Electronics Switzerland
aut.relation.startpage	2040
aut.relation.volume	14
dc.contributor.author	Gong, LJ
dc.contributor.author	Li, XJ
dc.date.accessioned	2025-06-17T02:41:47Z
dc.date.available	2025-06-17T02:41:47Z
dc.date.issued	2025-05-16
dc.description.abstract	Deepfake technology uses artificial intelligence to create highly realistic but fake audio, video, or images that are often difficult to distinguish from real content. Because of its potential use for misinformation, fraud, and identity theft, deepfake technology has gained a bad reputation in the digital world. Many recent works address the detection of deepfake videos and images, but few studies have concentrated on developing robust deepfake voice detection systems. In most existing studies in this field, a deepfake voice detection system requires a large amount of training data and a robust backbone to distinguish real audio from logistic attack audio. For acoustic feature extraction, Mel-frequency Filter Bank (MFB)-based approaches are better suited to speech signals than using the raw spectrum as input. Recurrent Neural Networks (RNNs) have been successfully applied to Natural Language Processing (NLP), but these backbones suffer from vanishing or exploding gradients when processing long sequences. In addition, most deepfake voice recognition systems perform poorly in cross-dataset evaluation, which indicates a robustness problem. To address these issues, we propose an acoustic feature-fusion method that combines Mel-spectrum and pitch representations using cross-attention mechanisms. We then combine a Transformer encoder with a convolutional neural network block as a front end to extract global and local features. Finally, a single linear layer serves as the back end for classification. We summarize the performance of several deepfake voice detectors on the silence-segment-processed ASVspoof 2019 dataset, where our proposed method achieves an Equal Error Rate (EER) of 26.41%, while most existing methods yield an EER higher than 30%. On the ASVspoof 2021 dataset, our proposed method achieves an EER as low as 28.52%, while the EER values of the existing methods are all higher than 28.9%.
dc.identifier.citation	Electronics Switzerland, ISSN: 2079-9292 (Print); 2079-9292 (Online), MDPI AG, 14(10), 2040-2040. doi: 10.3390/electronics14102040
dc.identifier.doi	10.3390/electronics14102040
dc.identifier.issn	2079-9292
dc.identifier.uri	http://hdl.handle.net/10292/19328
dc.language	en
dc.publisher	MDPI AG
dc.relation.uri	https://www.mdpi.com/2079-9292/14/10/2040
dc.rights	© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
dc.rights.accessrights	OpenAccess
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	40 Engineering
dc.subject	4009 Electronics, Sensors and Digital Hardware
dc.subject	Networking and Information Technology R&D (NITRD)
dc.subject	Bioengineering
dc.subject	Machine Learning and Artificial Intelligence
dc.subject	0906 Electrical and Electronic Engineering
dc.subject	end-to-end
dc.subject	transformer
dc.subject	cross attention
dc.subject	feature fusion
dc.subject	supervised learning
dc.subject	deepfake voice recognition
dc.title	Deepfake Voice Detection: An Approach Using End-to-End Transformer With Acoustic Feature Fusion by Cross-Attention
dc.type	Journal Article
pubs.elements-id	607153

Files

Name: Deepfake voice detection.pdf
Size: 1.65 MB
Format: Adobe Portable Document Format
Description: Journal article

Collections

School of Engineering, Computer and Mathematical Sciences - Te Kura Mātai Pūhanga, Rorohiko, Pāngarau