This repository contains a list of audio deepfake resources, together with a survey report on Audio Deepfake Detection (ADD). The sections on ADD Datasets, Audio Preprocessing, Feature Extraction and Network Training point beginners to carefully selected material for learning the ADD domain. We will endeavour to keep this repository maintained for a fixed period.

Audio Large Model
Datasets
Audio Preprocessing
- Commonly Used Noise Datasets
- Audio Enhancement Methods
Feature Extraction
Network Training
- Multi-task Learning-based Forgery Detection
Reference
Statement
Contact

Audio Large Model

Model	Publisher	Release Date	Achievable Tasks
AudioLM Paper Website Code	Google	2022.09	1. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. 2. Speech continuation, Acoustic generation, Unconditional generation, Generation without semantic tokens, and Piano continuation.
VALL-E Paper Website	Microsoft	2023.01	1. Synthesizes high-quality personalised speech from a 3-second enrolled recording of an unseen speaker. 2. VALL-E X: Cross-lingual speech synthesis.
USM Website	Google	2023.03	1. ASR beyond 100 languages. 2. Downstream ASR tasks. 3. Automated Speech Translation (AST).
SpeechGPT Website	Fudan University	2023.05	1. Perceive and generate multi-modal content. 2. Spoken dialogue LLM with strong instruction-following ability.
Pengi Paper Website	Microsoft	2023.05	1. An Audio Language Model that leverages transfer learning by framing all audio tasks as text-generation tasks. 2. The unified architecture of Pengi enables open-ended and close-ended tasks without any additional fine-tuning or task-specific extensions.
VoiceBox Website	Meta	2023.06	1. Synthesize speech across six languages. 2. Remove transient noise. 3. Edit content. 4. Transfer audio style within and across languages. 5. Generate diverse speech samples.
AudioPaLM Paper Website	Google	2023.06	1. Speech-to-speech translation. 2. Automatic Speech Recognition (ASR).

Datasets

Attack Types	Year	Dataset	Number of Audio Clips (Subset: Real/Fake)	Language
TTS	2021	WaveFake Paper Dataset	16283/117985	English, Japanese
TTS	2021	HAD Paper	53612/107224	Chinese
TTS	2022	ADD 2022 Paper	LF: 5619/46067 PF: 5319/46419 FG-D: 5319/46206	Chinese
TTS	2022	CMFD Paper Dataset	Chinese: 1800/1000 English: 1800/1000	English, Chinese
TTS	2022	In-the-Wild Paper Dataset	19963/11816	English
TTS	2022	FAD Paper Dataset	115800/115800	Chinese
Replay	2017	ASVspoof 2017 Paper Dataset	3565/14465	English
Replay	2019	ReMASC Paper Dataset	9240/45472	English, Chinese, Hindi
TTS and VC	2015	AVspoof Paper Dataset	LA: 15504/120480 PA: 15504/14465	English
TTS and VC	2015	ASVspoof 2015 Paper Dataset	16651/246500	English
TTS and VC	2021	FMFCC-A Paper Dataset	10000/40000	Chinese
TTS and VC	2022	SceneFake Paper Dataset	19838/64642	English
TTS and VC	2022	EmoFake Paper	35000/53200	English, Chinese
TTS and VC	2023	PartialSpoof Paper Dataset	12483/108978	English
TTS and VC	2023	ADD 2023 Paper	FG-D: 172819/113042 RL: 55468/65449 AR: 14907/95383	Chinese
TTS and VC	2023	DECRO Paper Dataset	Chinese: 21218/41880 English: 12484/42799	English, Chinese
TTS, VC and Replay	2019	ASVspoof 2019 Paper Dataset	LA: 12483/108978 PA: 28890/189540	English
TTS, VC and Replay	2021	ASVspoof 2021 Paper Dataset	LA: 18452/163114 PA: 126630/816480 PF: 14869/519059	English
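
Most of the datasets above contain far more fake than real clips, and several detection papers listed later in this document train with weighted cross-entropy (WCE). As a rough sketch, one common way to derive class weights is inverse class frequency. The function name `inverse_frequency_weights` is a hypothetical helper, and this weighting scheme is an assumption for illustration: the individual papers may choose their weights differently.

```python
def inverse_frequency_weights(n_real, n_fake):
    """Return (w_real, w_fake) so that both classes carry equal total weight.

    Assumption: weights are inversely proportional to class frequency,
    normalised so that a perfectly balanced dataset gives (1.0, 1.0).
    """
    total = n_real + n_fake
    return total / (2 * n_real), total / (2 * n_fake)


# Real/fake counts taken from the ASVspoof 2019 LA row of the table above.
w_real, w_fake = inverse_frequency_weights(12483, 108978)
print(f"real weight: {w_real:.3f}, fake weight: {w_fake:.3f}")
```

With these weights, the minority (real) class receives the larger weight, so misclassifying a real clip costs more than misclassifying a fake one.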

Audio Preprocessing

Commonly Used Noise Datasets

Dataset	Description
MUSAN Dataset	A corpus of music, speech and noise
RIR Dataset	A database of simulated and real room impulse responses, isotropic and point-source noises. All audio files in this dataset have a 16 kHz sampling rate and 16-bit precision.
NOIZEUS Dataset	Contains 30 IEEE sentences (generated by three male and three female speakers) corrupted by eight different real-world noises at different SNRs. Noises include suburban train noise, murmur, car, exhibition hall, restaurant, street, airport and train station noise.
NoiseX-92 Dataset	Each noise recording has a duration of 235 seconds and was acquired at a sampling rate of 19.98 kHz using a 16-bit analogue-to-digital (A/D) converter with an anti-aliasing filter and no pre-emphasis stage. Fifteen noise types are included.
DEMAND Dataset	Multi-channel acoustic noise database for diverse environments.
ESC-50 Dataset	A labelled collection of 2000 environmental audio recordings drawn from clips on Freesound.org, suitable for environmental sound classification. The dataset consists of 5-second-long recordings organised into 5 broad categories, each with 10 subcategories (40 examples per subcategory).
ESC Dataset	Including the ESC-50, ESC-10, and ESC-US.
FSD50K Dataset	An open dataset of human tagged sound events containing 51,197 Freesound clips totalling 108.3 hours of multi-labeled audio, unequally distributed across 200 classes from the AudioSet Ontology.
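
Noise corpora such as those above are usually applied by mixing a noise clip into clean speech at a chosen signal-to-noise ratio (SNR). The following is a minimal sketch of that step. The function name `mix_at_snr` and its parameters are illustrative assumptions rather than part of any listed dataset or toolkit, and reading audio files from disk is left out.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db, rng=None):
    """Return `speech` plus `noise`, with the noise scaled to the requested SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise if it is shorter than the speech, then take a random crop.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    start = int(rng.integers(0, len(noise) - len(speech) + 1))
    noise = noise[start:start + len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        return speech.copy()  # silent noise segment: nothing to add

    # Pick the scale so that speech_power / (scale**2 * noise_power) == 10**(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


# Usage with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
noisy = mix_at_snr(speech, noise, snr_db=15.0, rng=rng)
```

Power is measured over the whole clip here; a production pipeline might instead measure it only over voiced regions.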

Audio Enhancement Methods

Method	Description
SpecAugment Paper Code	Augmentation strategies include time warping, frequency masking and time masking
WavAugment Paper Code	Augmentation strategies include pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject and clipping
RawBoost Paper Code	Augmentation strategies include linear and non-linear convolutive noise, impulsive signal-dependent additive noise and stationary signal-independent additive noise
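
To make the masking strategies in the table concrete, here is a minimal sketch of frequency masking and time masking on a spectrogram, in the spirit of SpecAugment. It is not the reference implementation: the function name `mask_spectrogram`, the default mask widths, and the choice to fill masked regions with zeros are assumptions, and time warping is omitted.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Zero out one random frequency band and one random time span.

    `spec` has shape (n_freq_bins, n_frames). A masked copy is returned;
    the input array is left unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(spec, dtype=np.float64, copy=True)
    n_freq, n_time = out.shape

    # Frequency mask: width f in [0, max_freq_width], start f0 in [0, n_freq - f].
    f = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f0 = int(rng.integers(0, n_freq - f + 1))
    out[f0:f0 + f, :] = 0.0

    # Time mask: width t in [0, max_time_width], start t0 in [0, n_time - t].
    t = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t0 = int(rng.integers(0, n_time - t + 1))
    out[:, t0:t0 + t] = 0.0
    return out


# Usage on a dummy 60-bin, 100-frame spectrogram.
spec = np.ones((60, 100))
augmented = mask_spectrogram(spec, rng=np.random.default_rng(0))
```

Because the mask positions are drawn anew on every call, applying the function inside a training loop gives each epoch a different view of the same utterance.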

Feature Extraction

Handcrafted Feature-based Forgery Detection

Paper	Audio Deepfake Detection				Results
Paper	Data Augmentation	Feature Extraction	Network Framework	Loss Function	EER (%)	t-DCF
Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge Paper Code	—	CQT, Power Spectrum	VGG, SincNet	CE	LA: 8.01 (4) PA: 1.51 (2)	LA: 0.208 (4) PA: 0.037 (1)
Long-term high frequency features for synthetic speech detection Paper	Cafe, White and Street Noise	ICQC, ICQCC, ICBC, ICLBC	DNN	CE	LA: 7.78 (3)	LA: 0.187 (3)
Voice spoofing countermeasure for logical access attacks detection Paper	—	ELTP-LFCC	DBiLSTM	—	LA: 0.74 (1)	LA: 0.008 (1)
Voice spoofing detector: A unified anti-spoofing framework Paper	—	ATP-GTCC	SVM	Hamming Distance	LA: 0.75 (2) PA: 1.00 (1)	LA: 0.050 (2) PA: 0.064 (2)

Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA or PA scenario, and bolded values are the best results for that scenario.
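
The EER column in these tables is the equal error rate: the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (bona fide audio rejected). The sketch below shows one simple way to estimate it from detector scores. It assumes that higher scores mean "more likely bona fide"; the function name `compute_eer` is hypothetical, and official challenge scoring scripts may interpolate between thresholds differently.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), with eer as a fraction in [0, 1]."""
    bonafide = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoof trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0), float(thresholds[idx])


# Usage with toy scores; multiply by 100 to compare with the EER (%) column.
eer, threshold = compute_eer([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.75, 0.4])
print(f"EER: {100 * eer:.2f}%")
```

The per-threshold loop is quadratic in the number of trials, which is fine for illustration but slow on full evaluation sets; a sort-based cumulative count is the usual optimisation.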

Hybrid Feature-based Forgery Detection

Paper	Audio Deepfake Detection				Results
Paper	Data Augmentation	Feature Extraction	Network Structure	Loss Function	EER (%)	t-DCF
Light convolutional neural network with feature genuinization for detection of synthetic speech attacks Paper	—	CQT-based LPS	LCNN	—	LA: 4.07 (11)	LA: 0.102 (10)
Siamese convolutional neural network using Gaussian probability feature for spoofing speech detection Paper	—	LFCC	Siamese CNN	CE	LA: 3.79 (10) PA: 7.98 (5)	LA: 0.093 (5) PA: 0.195 (2)
Generalization of audio deepfake detection Paper	RIR and MUSAN	LFB	ResNet18	LCML	LA: 1.81 (4)	LA: 0.052 (4)
Continual learning for fake audio detection Paper	—	LFCC	LCNN, DFWF	Similarity Loss	LA: 7.74 (15) PA: 8.85 (6)	—
Partially-connected differentiable architecture search for deepfake and spoofing detection Paper Code	Frequency Mask	LFCC	PC-DARTS	WCE	LA: 4.96 (12)	LA: 0.091 (8)
One-class learning towards synthetic voice spoofing detection Paper Code	—	LFCC	ResNet18	OC-Softmax	LA: 2.19 (7)	LA: 0.059 (5)
Replay and synthetic speech detection with Res2Net architecture Paper Code	—	CQT	SE-Res2Net50	BCE	LA: 2.50 (8) PA: 0.46 (2)	LA: 0.074 (7) PA: 0.012 (2)
An empirical study on channel effects for synthetic voice spoofing countermeasure systems Paper Code	Telephone Codecs, and Device/Room Impulse Responses (IRs).	LFCC	LCNN, ResNet-OC	OC-Softmax, CE	LA: 3.92 (10)	—
Efficient attention branch network with combined loss function for automatic speaker verification spoof detection Paper Code	SpecAug, Attention Mask	LFCC	EfficientNet-A0, SE-Res2Net50	WCE, Triplet Loss	LA: 1.89 (6) PA: 0.86 (4)	LA: 0.507 (11) PA: 0.024 (4)
Resmax: Detecting voice spoofing attacks with residual network and max feature map Paper	—	CQT	ResMax	BCE	LA: 2.19 (7) PA: 0.37 (1)	LA: 0.060 (6) PA: 0.009 (1)
Synthetic voice detection and audio splicing detection using SE-Res2Net-Conformer architecture Paper	Adding noise at a signal-to-noise ratio of 15 dB or 25 dB	CQT	SE-Res2Net34-Conformer	CE	LA: 1.85 (5)	LA: 0.060 (6)
Fastaudio: A learnable audio front-end for spoof speech detection Paper Code	—	L-VQT	L-DenseNet	NLLLoss	LA: 1.54 (3)	LA: 0.045 (3)
Learning from yourself: A self-distillation method for fake speech detection Paper	—	LPS, F0	ECANet, SENet	A-Softmax	LA: 1.00 (2) PA: 0.65 (3)	LA: 0.031 (2) PA: 0.017 (3)
How to boost anti-spoofing with x-vectors Paper	—	LFCC, MFCC	TDNN, SENet34	LCML	LA: 0.83 (1)	LA: 0.024 (1)

End-to-end Forgery Detection

Paper	Audio Deepfake Detection				Results
Paper	Data Augmentation	Feature Extraction	Network Structure	Loss Function	EER (%)	t-DCF
A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection Paper	—	LC-GRNN	PLDA	—	LA: 6.28 (13) PA: 2.23	LA: 0.152 (10) PA: 0.061
RW-Resnet: A novel speech anti-spoofing model using raw waveform Paper	—	1D Convolution Residual Block	ResNet	CE	LA: 2.98 (11)	LA: 0.082 (9)
Raw differentiable architecture search for speech deepfake and spoofing detection Paper Code	Masking Filter	Sinc Filter	PC-DARTS	P2SGrad	LA: 1.77 (10)	LA: 0.052 (7)
Towards end-to-end synthetic speech detection Paper Code	—	DNN	Res-TSSDNet, Inc-TSSDNet	WCE	LA: 1.64 (9)	LA: 0.048 (6)
End-to-end anti-spoofing with RawNet2 Paper Code	—	Sinc Filter	RawNet2	CE	LA: 1.12 (5)	LA: 0.033 (3)
Long-term variable Q transform: A novel time-frequency transform algorithm for synthetic speech detection Paper	—	FastAudio filter	X-vector, ECAPA-TDNN	—	LA: 1.54 (7)	LA: 0.045 (5)
Fully automated end-to-end fake audio detection Paper	Sinc Filter	Wav2Vec2	light-DARTS	Comparative loss	LA: 1.08 (4)	—
Audio anti-spoofing using a simple attention module and joint optimization based on additive angular margin loss and meta-learning Paper	—	Sinc Filter	RawNet2, SimAM	AAM Softmax, MSE	LA: 0.99 (3)	LA: 0.029 (2)
AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks Paper Code	—	Sinc Filter	RawNet2, MGO, HS-GAL	CE	LA: 0.83 (2)	LA: 0.028 (1)
AI-synthesized voice detection using neural vocoder artifacts Paper Code	Resampling, Noise Addition	Sinc Filter	RawNet2	CE, Softmax	LA: 4.54 (12)	—
TO-RawNet: Improving RawNet with TCN and orthogonal regularization for fake audio detection Paper	RawBoost	Sinc Filter	RawNet2, TCN	CE, Orthogonal Loss	LA: 1.58 (8)	—
Speaker-Aware Anti-spoofing Paper	—	Sinc Filter	AASIST, M2S Converter	CE	LA: 1.13 (6)	LA: 0.038 (4)
Spoofing attacker also benefits from self-supervised pretrained model Paper	—	HuBERT, WavLM	Residual block, Conv-TasNet	AAM softmax	LA: 0.44 (1)	—

Note: "—" indicates not mentioned in the paper. Values in brackets in the experimental results are the ranking of each column in the LA scenario, and bolded values are the best results for that scenario.

Feature Fusion-based Forgery Detection

Paper	Audio Deepfake Detection			Results
Paper	Feature Extraction	Network Structure	Loss Function	EER (%)
Voice spoofing countermeasure for synthetic speech detection Paper	GTCC, MFCC, Spectral Flux, Spectral Centroid	Bi-LSTM	—	LA: 3.05 (4)
Combining automatic speaker verification and prosody analysis for synthetic speech detection Paper	MFCC, Mel-Spectrogram	ECAPA-TDNN, Prosody Encoder	BCE	LA: 5.39 (5)
Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation Paper	Sinc Filter, Wav2Vec2	AASIST	Contrastive Loss, WCE	—
Overlapped frequency-distributed network: Frequency-aware voice spoofing countermeasure Paper	Mel-Spectrogram, CQT	LCNN, ResNet	—	LA: 1.35 (2) PA: 0.35
Detection of cross-dataset fake audio based on prosodic and pronunciation features Paper	Phoneme Feature, Prosody Feature, Wav2Vec2	LCNN, Bi-LSTM	CTC	LA: 1.58 (3)
Betray oneself: A novel audio deepfake detection model via mono-to-stereo conversion Paper Code	Sinc Filter	AASIST, M2S Converter	CE	LA: 1.34 (1)

Network Training

Multi-task Learning-based Forgery Detection

Paper	Audio Deepfake Detection			Results
Paper	Feature Extraction	Network Structure	Loss Function	EER (%)	t-DCF
Multi-task learning in utterance-level and segmental-level spoof detection Paper	LFCC	SELCNN, Bi-LSTM	P2SGrad	—	—
SA-SASV: An end-to-end spoof-aggregated spoofing-aware speaker verification system Paper Code	Fbanks, Sinc Filter	ECAPA-TDNN, ARawNet	BCE, AAM Softmax, CE	LA: 4.86 (4)	—
STATNet: Spectral and temporal features based multi-task network for audio spoofing detection Paper	Sinc Filter	RawNet2, TCM, SCM	CE	LA: 2.45 (3)	LA: 0.062 (2)
A probabilistic fusion framework for spoofing aware speaker verification Paper Code	Mel Filter, Sinc Filter	ECAPA-TDNN, AASIST	BCE	LA: 1.53 (2)	—
DSVAE: Interpretable disentangled representation for synthetic speech detection Paper	Spectrogram	VAE	KL Divergence Loss, BCE	LA: 6.56 (5)	—
End-to-end dual-branch network towards synthetic speech detection Paper Code	LFCC, CQT	Dual-Branch Network	Classification Loss, Fake Type Classification Loss	LA: 0.80 (1)	LA: 0.021 (1)

Reference

For more details on the above, please check the following papers:

Statement

The purpose of this project is to build a collection of resources on audio deepfake detection, solely for communication and learning. All content collected in this project is sourced from journals and the internet, and we express our sincere gratitude to the researchers and authors who have published the related work. In the event of a copyright infringement complaint, the content concerned will be removed as appropriate.

Contact

We are glad to hear from you. If you have any questions, please feel free to contact [email protected].
