Implementation of Automatic Speech Recognition for Al-Qur'an Recitation Using the Wav2Vec 2.0 and OpenAI-Whisper Methods

Danny Ferdiansyah, Christian Sri Kusuma Aditya

Abstract


The implementation of Automatic Speech Recognition (ASR) to transcribe speech is widely used in everyday life. One aim of this study is to help reduce Qur'anic reading illiteracy among Muslims by implementing ASR to predict hijaiyah letters and recited words, using the text of Al-Qur'an verses as the target transcription. The data were collected from YouTube, in the form of murottal recitations by Sheikh Mahmoud Al-Hussary. Many deep learning ASR methods can be used for transcription; this study applies two of them, Wav2vec 2.0 and OpenAI-Whisper. The Wav2vec 2.0 method yields Character Error Rate (CER) values for predicting Al-Qur'an verses ranging from 0.226 (23%) to 0.677 (68%). The OpenAI-Whisper method performs better than Wav2vec 2.0, with CER values ranging from 0.064 (6%) to 0.172 (17%). Comparing the two proposed methods shows that the one with the lower error rate is the better method, producing minimal transcription errors.
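As an illustration of the evaluation pipeline, the sketch below transcribes one recitation clip with both models and scores each hypothesis with CER, computed as (S + D + I) / N, where S, D, and I count character-level substitutions, deletions, and insertions against a reference of N characters. This is a minimal sketch, not the authors' exact setup: the audio file name, the reference verse, and the Arabic wav2vec 2.0 checkpoint are illustrative assumptions, and it relies on the transformers, openai-whisper, librosa, and jiwer packages.

    import torch
    import librosa
    import whisper                      # the openai-whisper package
    from jiwer import cer
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    AUDIO_PATH = "recitation.wav"        # hypothetical 16 kHz mono clip of one verse
    REFERENCE = "بسم الله الرحمن الرحيم"  # hypothetical ground-truth verse text

    # Wav2vec 2.0: greedy CTC decoding of character-level logits.
    # This Arabic checkpoint is an assumption; the paper does not name one.
    W2V_NAME = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
    processor = Wav2Vec2Processor.from_pretrained(W2V_NAME)
    w2v_model = Wav2Vec2ForCTC.from_pretrained(W2V_NAME)

    speech, _ = librosa.load(AUDIO_PATH, sr=16000)   # load and resample to 16 kHz
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = w2v_model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)          # most likely token per frame
    w2v_text = processor.batch_decode(pred_ids)[0]

    # Whisper: encoder-decoder transcription with the language forced to Arabic.
    whisper_model = whisper.load_model("small")
    whisper_text = whisper_model.transcribe(AUDIO_PATH, language="ar")["text"]

    # Character Error Rate of each hypothesis against the reference verse.
    print("wav2vec 2.0 CER:", cer(REFERENCE, w2v_text))
    print("whisper CER:", cer(REFERENCE, whisper_text))

Repeating this over every verse in a test set and averaging the per-verse CER would produce the kind of score ranges reported in the abstract.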

Keywords


Automatic Speech Recognition; Wav2vec 2.0; OpenAI-Whisper; Al-Qur'an





DOI: https://doi.org/10.21107/triac.v11i1.24332



Copyright (c) 2024 Jurnal Teknik Elektro dan Komputer TRIAC

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
