This repository contains the dataset and detection code from our paper "AI-Synthesized Voice Detection Using Neural Vocoder Artifacts", accepted at the CVPR Workshop on Media Forensics 2023.
I have some questions regarding the evaluation metrics and results presented in Sections 4.4 and 4.5.
Intra-dataset Evaluation (Section 4.4)
The paper reports a very low EER of 0.19% on the WaveFake dataset using the RawNet2 model.
To confirm my understanding, was this evaluation performed with the model being trained and tested on the same WaveFake dataset?
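For context on how I am interpreting the reported EER: the Equal Error Rate is the operating point where the false acceptance rate equals the false rejection rate. Below is a minimal, illustrative threshold-sweep sketch of that computation (my own simplification, not the evaluation script used in the paper; the score convention of "higher = more likely fake" is an assumption).

```python
def compute_eer(scores, labels):
    """Approximate Equal Error Rate via a threshold sweep.

    scores: detector outputs, assumed higher = more likely fake.
    labels: 1 for fake (synthesized), 0 for real (bona fide).
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_real = labels.count(0)
    n_fake = labels.count(1)
    best = None
    for t in sorted(set(scores)):
        # False acceptance: a real sample scored at/above the threshold.
        fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        # False rejection: a fake sample scored below the threshold.
        fr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far = fa / max(1, n_real)
        frr = fr / max(1, n_fake)
        if best is None or abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2
```

With perfectly separable scores this returns 0.0, matching the intuition that an EER of 0.19% means the real and fake score distributions barely overlap.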
Cross-dataset Evaluation (Section 4.5)
In contrast, the EER increased sharply to 26.95% when the model trained on the LibriSeVoc dataset was tested on the WaveFake dataset, which suggests poor generalization to unseen vocoders and recording conditions.
Are there any ongoing efforts to improve this aspect of the model, perhaps through domain adaptation techniques or exposure to a more diverse set of vocoders during training?