Voice recognition by deep transfer learning and vision transformers to secure voice authentication

Nayem Uddin Prince ¹,*, Abdullah Al Masum ², Salman Mohammad Abdullah ³ and Touhid Bhuiyan ⁴

1 Information Technology (2022), Washington University of Science and Technology, USA.
2 Information Technology (2024), Westcliff University, Irvine, USA.
3 Information Technology (2023), Washington University of Science and Technology, USA.
4 Cyber Security, School of Information Technology, Washington University of Science and Technology, Virginia, USA.

Research Article

World Journal of Advanced Research and Reviews, 2024, 23(03), 1365–1377
Article DOI: 10.30574/wjarr.2024.23.3.2781
DOI url: https://doi.org/10.30574/wjarr.2024.23.3.2781

Publication history

Received on 02 August 2024; revised on 10 September 2024; accepted on 12 September 2024

Abstract

Voice recognition is crucial for securing personal devices and financial transactions. Achieving high accuracy and robustness in voice authentication is challenging because voices and recording environments vary. Recent advances in deep learning, particularly transfer learning and vision transformers, have the potential to improve voice recognition systems. This study applies deep transfer learning techniques, namely a Vision Transformer (ViT), VGG16, and a custom Convolutional Neural Network (CNN), to improve the accuracy and security of voice authentication. The objective is to evaluate and compare the voice recognition and authentication accuracy of these approaches. The experiment used 3000 voice samples, 1500 from male participants and 1500 from female participants. The dataset was used to train the Vision Transformer, VGG16 with transfer learning, and the custom CNN, and each model was assessed on its accuracy in identifying and authenticating voice samples. The VGG16 model achieved the highest accuracy, with a reported rate of 95%. The Vision Transformer and the custom CNN performed satisfactorily but less accurately than VGG16. Of the models studied, the transfer-learning-based VGG16 is therefore the most accurate for voice authentication. The results suggest that deep learning techniques can enhance the security and reliability of voice recognition systems.
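
The abstract compares models by accuracy and also quotes a precision rate. As a minimal sketch of how these two metrics can be computed from true and predicted labels, the following uses made-up labels (`speaker_a`, `speaker_b`) and hypothetical helper names (`accuracy`, `precision_per_class`). It is not the authors' evaluation code and does not reproduce their results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_per_class(y_true, y_pred):
    """For each predicted class: correct predictions / all predictions of that class."""
    predicted = Counter(y_pred)
    true_pos = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: true_pos[label] / count for label, count in predicted.items()}


# Illustrative labels only (assumption): ten samples, one misclassified.
y_true = ["speaker_a"] * 5 + ["speaker_b"] * 5
y_pred = ["speaker_a"] * 4 + ["speaker_b"] * 6

acc = accuracy(y_true, y_pred)
prec = precision_per_class(y_true, y_pred)

print(f"accuracy = {acc:.2f}")
for label in sorted(prec):
    print(f"precision[{label}] = {prec[label]:.2f}")
```

With these toy labels the overall accuracy and the per-class precision values differ, which is why it helps to state which metric a reported percentage refers to.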

Keywords

Voice recognition; VGG16; Custom CNN; ViT; honey trap; webform; Cybercrime; Vision Transformer; MFCCs

Download Article PDF

https://wjarr.co.in/sites/default/files/fulltext_pdf/WJARR-2024-2781.pdf

How to cite this article

Nayem Uddin Prince, Abdullah Al Masum, Salman Mohammad Abdullah and Touhid Bhuiyan. Voice recognition by deep transfer learning and vision transformers to secure voice authentication. World Journal of Advanced Research and Reviews, 2024, 23(03), 1365–1377. Article DOI: https://doi.org/10.30574/wjarr.2024.23.3.2781
