Comparison of Term Weighting Methods for Text Classification on a Dataset of Translated Hadith Books

Ana Tsalitsatun Ni'mah, Agus Zainal Arifin

Abstract

Hadith is the second source of reference in Islam after the Qur’an. Hadith texts are now studied computationally so that the values they contain can be captured with technological methods. Retrieving information from the Hadith books requires representing the text as vectors so that automatic classification can be optimized. Hadith classification is needed to group the contents of the Hadith into categories. Some categories in one Hadith book also appear in other Hadith books, which indicates that certain documents in one book share a topic with documents in another. A term weighting method is therefore needed that can decide which words should receive high or low weights across the space of Hadith books, in order to optimize classification results across the books. This study compares several term weighting methods: Term Frequency Inverse Document Frequency (TF-IDF), Term Frequency Inverse Document Frequency Inverse Class Frequency (TF-IDF-ICF), Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency (TF-IDF-ICSδF), and Term Frequency Inverse Document Frequency Inverse Class Space Density Frequency Inverse Hadith Space Density Frequency (TF-IDF-ICSδF-IHSδF). The weighting results are compared on a dataset of translations of 9 Hadith books, using the Naive Bayes and SVM classifiers. The nine books are Sahih Bukhari, Sahih Muslim, Abu Dawud, at-Turmudzi, an-Nasa'i, Ibnu Majah, Ahmad, Malik, and Darimi. The experimental results show that classification using the TF-IDF-ICSδF-IHSδF term weighting method outperforms the other term weighting methods, with a Precision of 90%, Recall of 93%, F1-Score of 92%, and Accuracy of 83%.
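As a rough illustration of the two simplest schemes named in the abstract, the sketch below computes TF-IDF-ICF weights for a toy corpus. It is not the authors' implementation. The function name `tf_idf_icf`, the toy documents, and the smoothed form `tf * (1 + log(D/d)) * (1 + log(C/c))` are assumptions made for this example, since the abstract does not state the exact formulas. The density-based factors ICSδF and IHSδF are not covered.

```python
import math
from collections import Counter

def tf_idf_icf(docs, labels):
    """Hypothetical helper: per-document TF-IDF-ICF weights.

    docs   -- list of token lists, one per document
    labels -- class label of each document (same order as docs)
    Assumed weight: tf * (1 + log(D / d_t)) * (1 + log(C / c_t)), where
    D = number of documents, d_t = documents containing term t,
    C = number of classes,   c_t = classes containing term t.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    doc_freq = Counter()   # d_t for each term
    term_classes = {}      # set of classes in which each term occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            term_classes.setdefault(term, set()).add(label)

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            term: count
            * (1 + math.log(n_docs / doc_freq[term]))
            * (1 + math.log(n_classes / len(term_classes[term])))
            for term, count in tf.items()
        })
    return weights

# Toy corpus (invented for illustration only).
docs = [["prayer", "water"], ["prayer", "fasting"], ["trade", "water"]]
labels = ["worship", "worship", "commerce"]
weights = tf_idf_icf(docs, labels)
```

In this toy corpus "prayer" and "water" each occur in two documents, so their IDF factors are equal. "prayer" occurs in only one class while "water" occurs in both, so the ICF factor gives "prayer" the higher weight. This is the kind of class-level discrimination the extended schemes aim for.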

Keywords

Term Weighting; Classification; Hadith; TF-IDF; TF-IDF-ICF; TF-IDF-ICSδF; TF-IDF-ICSδF-IHSδF.

DOI

https://doi.org/10.21107/rekayasa.v13i2.6412

Copyright (c) 2020 Ana Tsalitsatun Ni'mah, Agus Zainal Arifin

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.