Deepfake Voice Detection: An Approach Using End-to-End Transformer With Acoustic Feature Fusion by Cross-Attention
| aut.relation.endpage | 2040 | |
| aut.relation.issue | 10 | |
| aut.relation.journal | Electronics Switzerland | |
| aut.relation.startpage | 2040 | |
| aut.relation.volume | 14 | |
| dc.contributor.author | Gong, LJ | |
| dc.contributor.author | Li, XJ | |
| dc.date.accessioned | 2025-06-17T02:41:47Z | |
| dc.date.available | 2025-06-17T02:41:47Z | |
| dc.date.issued | 2025-05-16 | |
| dc.description.abstract | Deepfake technology uses artificial intelligence to create highly realistic but fake audio, video, or images, often making it difficult to distinguish from real content. Due to its potential use for misinformation, fraud, and identity theft, deepfake technology has gained a bad reputation in the digital world. Recently, many works have reported on the detection of deepfake videos/images. However, few studies have concentrated on developing robust deepfake voice detection systems. Among most existing studies in this field, a deepfake voice detection system commonly requires a large amount of training data and a robust backbone to detect real and logistic attack audio. For acoustic feature extractions, Mel-frequency Filter Bank (MFB)-based approaches are more suitable for extracting speech signals than applying the raw spectrum as input. Recurrent Neural Networks (RNNs) have been successfully applied to Natural Language Processing (NLP), but these backbones suffer from gradient vanishing or explosion while processing long-term sequences. In addition, the cross-dataset evaluation of most deepfake voice recognition systems has weak performance, leading to a system robustness issue. To address these issues, we propose an acoustic feature-fusion method to combine Mel-spectrum and pitch representation based on cross-attention mechanisms. Then, we combine a Transformer encoder with a convolutional neural network block to extract global and local features as a front end. Finally, we connect the back end with one linear layer for classification. We summarized several deepfake voice detectors’ performances on the silence-segment processed ASVspoof 2019 dataset. Our proposed method can achieve an Equal Error Rate (EER) of 26.41%, while most of the existing methods result in EER higher than 30%. We also tested our proposed method on the ASVspoof 2021 dataset, and found that it can achieve an EER as low as 28.52%, while the EER values for existing methods are all higher than 28.9%. | |
| dc.identifier.citation | Electronics Switzerland, ISSN: 2079-9292 (Print); 2079-9292 (Online), MDPI AG, 14(10), 2040-2040. doi: 10.3390/electronics14102040 | |
| dc.identifier.doi | 10.3390/electronics14102040 | |
| dc.identifier.issn | 2079-9292 | |
| dc.identifier.uri | http://hdl.handle.net/10292/19328 | |
| dc.language | en | |
| dc.publisher | MDPI AG | |
| dc.relation.uri | https://www.mdpi.com/2079-9292/14/10/2040 | |
| dc.rights | © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). | |
| dc.rights.accessrights | OpenAccess | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | 40 Engineering | |
| dc.subject | 4009 Electronics, Sensors and Digital Hardware | |
| dc.subject | Networking and Information Technology R&D (NITRD) | |
| dc.subject | Bioengineering | |
| dc.subject | Machine Learning and Artificial Intelligence | |
| dc.subject | 0906 Electrical and Electronic Engineering | |
| dc.subject | end-to-end | |
| dc.subject | transformer | |
| dc.subject | cross attention | |
| dc.subject | feature fusion | |
| dc.subject | supervised learning | |
| dc.subject | deepfake voice recognition | |
| dc.title | Deepfake Voice Detection: An Approach Using End-to-End Transformer With Acoustic Feature Fusion by Cross-Attention | |
| dc.type | Journal Article | |
| pubs.elements-id | 607153 |
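
The abstract reports results as Equal Error Rate (EER), the operating point at which the false-acceptance rate on spoofed audio equals the false-rejection rate on bona fide audio. The snippet below is a minimal sketch of that metric, not the authors' evaluation code. It assumes higher scores mean "more likely bona fide"; the function name `equal_error_rate` and the toy scores are hypothetical and chosen for illustration only.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a trial is accepted as bona fide when score >= threshold.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False-acceptance rate: spoofed trials that pass the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False-rejection rate: bona fide trials that fail the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])

    # Take the threshold where the two rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    # Toy scores, not data from the paper.
    print(equal_error_rate([0.9, 0.8], [0.1, 0.2]))
    print(equal_error_rate([0.9, 0.4], [0.6, 0.1]))
```

With a finite set of trials the two error rates rarely coincide exactly, so this sketch averages them at the closest threshold; evaluation toolkits may interpolate instead, which can give slightly different values.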
Files
Original bundle
- Name: Deepfake voice detection.pdf
- Size: 1.65 MB
- Format: Adobe Portable Document Format
- Description: Journal article
