https://doi.org/10.5370/KIEE.2025.74.10.1731
Oralbek Bayazov ; Anel Aidos ; 강정원(Jeong Won Kang) ; Assel Mukasheva
Voice biometrics is emerging as a secure, intuitive, and contactless method of identity verification, offering key advantages over traditional PIN- or password-based systems. However, its effectiveness is often reduced by real-world factors such as background noise, device variability, and spoofing attacks, including replay and synthetic voice input. This paper presents a comparative analysis of three neural network architectures, namely Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and the transformer-based Wav2Vec 2.0, for voice biometric authentication under both clean and adverse conditions. Experiments were conducted on two large-scale datasets, Mozilla Common Voice and VoxCeleb, with audio represented as mel spectrograms, mel-frequency cepstral coefficients (MFCCs), and raw waveforms. Data augmentation included Gaussian noise, reverberation, background speech, and spoofing via text-to-speech (TTS) synthesis. Results show that Wav2Vec 2.0 consistently outperforms CNN and LSTM in accuracy, robustness to noise, and partial resistance to spoofing, reaching up to 92% accuracy in clean scenarios. Despite these gains, none of the models proved fully resistant to high-fidelity synthetic voice attacks. To address this, we propose integrating explicit spoof detection modules and adversarial training techniques. Additionally, privacy-preserving frameworks such as federated learning and the use of multimodal biometrics are discussed as future directions for secure and ethical deployment.
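For context, a minimal sketch of the audio front end described above (mel spectrograms, MFCCs, and Gaussian-noise augmentation) is given below. It assumes the librosa library; the sampling rate, mel-band count, MFCC count, and target SNR are illustrative placeholders, not the paper's exact configuration.

```python
# Illustrative sketch of feature extraction and Gaussian-noise augmentation.
# Assumes librosa; parameter values (16 kHz, 80 mel bands, 40 MFCCs, 20 dB SNR)
# are placeholders rather than the settings used in the experiments.
import numpy as np
import librosa

def add_gaussian_noise(waveform: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def extract_features(path: str, sr: int = 16000):
    """Return a log-mel spectrogram and MFCCs for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    y = add_gaussian_noise(y)                        # noise augmentation step
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (80, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40, frames)
    return log_mel, mfcc
```

Features of this form would feed the CNN and LSTM models, while Wav2Vec 2.0 operates directly on the raw waveform.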