Clustering Indonesian-Language Web Documents (News) Using the K-Means Algorithm

Husni; Yudha Dwi Putra Negara; M. Syarief

doi:10.21107/simantec.v4i3.1383

Abstract

The volume of information on the trunojoyo.ac.id web pages keeps growing, yet it is poorly organized, unstructured, not categorized according to any particular scheme, and scattered across many sub-domains. So far there has been no gateway or web portal providing access to the various websites hosted by the PTIK data center of Universitas Trunojoyo. One problem that has been addressed is the automatic grouping of this web information, or news, using the K-Means clustering algorithm. The RISE search engine, which is already in operation, collects all web pages written in Indonesian under the trunojoyo.ac.id domain by crawling. These pages are then pre-processed using standard text mining (information retrieval) techniques. The main step is the application of K-Means so that news groups are formed autonomously. Testing showed that the clustering technique works well and gives satisfactory accuracy. About 300 web pages were involved in the clustering process, yielding an average F-Measure of 0.6129192 and a Purity of 0.67294195. A factor with considerable influence on the clustering and classification of Indonesian text is the pre-processing phase, particularly the stemming approach. Improving the stemming technique is believed to increase the accuracy of document grouping.

Keywords: Clustering, K-Means, F-Measure, Purity
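
The abstract reports Purity and F-Measure without stating how they were computed. As an illustration only, the sketch below uses one common pair of definitions: purity as the share of documents that belong to the majority class of their cluster, and a class-weighted F-measure that takes, for each class, the best F1 score over all clusters. These definitions, the function names `purity` and `f_measure`, and the toy labels are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def _contingency(labels, clusters):
    """Count how many documents of each true class fall in each cluster."""
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1
    return by_cluster


def purity(labels, clusters):
    """Share of documents that belong to the majority class of their cluster."""
    by_cluster = _contingency(labels, clusters)
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def f_measure(labels, clusters):
    """Class-weighted F-measure (assumed definition).

    For each true class, take the best F1 over all clusters, then average
    those best scores weighted by class size.
    """
    by_cluster = _contingency(labels, clusters)
    class_sizes = Counter(labels)
    weighted_sum = 0.0
    for label, n_class in class_sizes.items():
        best = 0.0
        for counts in by_cluster.values():
            n_both = counts[label]  # documents of this class in this cluster
            if n_both == 0:
                continue
            precision = n_both / sum(counts.values())
            recall = n_both / n_class
            best = max(best, 2 * precision * recall / (precision + recall))
        weighted_sum += n_class * best
    return weighted_sum / len(labels)


if __name__ == "__main__":
    # Hypothetical toy data: six documents, two true categories, two clusters.
    true_labels = ["sport", "sport", "sport", "campus", "campus", "campus"]
    cluster_ids = [0, 0, 1, 1, 1, 1]
    print("purity:", purity(true_labels, cluster_ids))
    print("f-measure:", f_measure(true_labels, cluster_ids))
```

Both scores lie between 0 and 1, and higher values mean the clusters agree more closely with the reference categories. Other F-measure variants exist (for example, pairwise F-measure), so values computed this way are not necessarily comparable with the figures reported above.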
