DYNAMIC WEIGHTING BASED ON INFORMATION GAIN IN INFORMATION RETRIEVAL

Hasan Dwi Cahyono

Abstract


ABSTRAK

The growth of multimedia and text content that has accompanied the development of the internet has made Information Retrieval (IR) an attractive research topic. The high heterogeneity of information and the distortion of textual information are challenges that remain to be solved. IR based on textual search using weighted tree similarity (W-Tree) has been shown to handle the heterogeneity of textual information by splitting the information into branches. However, determining the weight of each branch of the tree is an obstacle, because not every branch necessarily contributes the right information: some branches of textual information instead introduce distortion, or even noise, to the other branches of the tree. This study therefore proposes a method with dynamic tree weighting using Information Gain (IG) and cosine similarity. In the first stage, a W-Tree is built from the database and from the user's query, and the two are matched using cosine similarity, with IG used to select and adjust the weights of the information used by the W-Tree. The system outputs a list of documents together with their similarity scores. In experiments on the ImageCLEF 2011 dataset of 9,516 documents, textual search based on cosine similarity and W-Tree with IG-based dynamic weighting increased the f-measure by 73% compared with textual search that does not take IG values into account.

Keywords: text retrieval, cosine similarity, W-Tree, information gain.
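
For reference, the quantities named in the abstract have standard textbook definitions, written out below. The way the branch weights are derived from IG (normalising the IG scores so that they sum to one) and the symbols w_i, q_i, d_i are assumptions made for illustration only; the abstract does not state the exact formula the author used.

```latex
% Entropy of the relevance classes C, and information gain of a branch feature b
\[
H(C) = -\sum_{c \in C} p(c)\,\log_2 p(c),
\qquad
IG(C, b) = H(C) - \sum_{v} p(b = v)\, H(C \mid b = v)
\]

% Cosine similarity between the term vectors of branch i in the query and the document
\[
\cos(\mathbf{q}_i, \mathbf{d}_i) =
\frac{\mathbf{q}_i \cdot \mathbf{d}_i}{\lVert \mathbf{q}_i \rVert \, \lVert \mathbf{d}_i \rVert}
\]

% Assumed combination: branch weights proportional to IG, similarity as a weighted sum
\[
w_i = \frac{IG_i}{\sum_j IG_j},
\qquad
\mathrm{sim}(Q, D) = \sum_i w_i \cos(\mathbf{q}_i, \mathbf{d}_i)
\]

% F-measure (balanced F1) from precision P and recall R
\[
F = \frac{2PR}{P + R}
\]
```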

 

ABSTRACT

The rapid growth of multimedia content that has accompanied the development of the internet has made information retrieval (IR) an attractive topic of investigation. The high heterogeneity of data and the distortion of textual information remain open problems. IR with textual search using Weighted Tree Similarity (W-Tree) has been shown to overcome the heterogeneity of textual information by breaking the information down into branches. However, determining the weight of each branch is an obstacle, because a branch does not always contribute the right information; in some cases, branches of textual information introduce distortion, or even noise, to the other branches. This study therefore proposes a dynamic W-Tree weighting method using Information Gain (IG). In the first stage, W-Trees are built from the database and from the user's query and are matched using cosine similarity; in the second stage, IG is used to select and adjust the weights used by the W-Tree. The system outputs a list of documents together with their similarity scores. In experiments on 9,516 documents from the ImageCLEF 2011 dataset, textual search based on cosine similarity and W-Tree with IG-based dynamic weighting increased the f-measure by 73% compared with textual search that does not consider IG values.

Keywords: information retrieval, cosine similarity, W-Tree, information gain
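
The idea summarised in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the branch names (`title`, `caption`), the toy training data, the binary "branch matches the query" feature used to compute IG, and the helper names are all hypothetical, and a single weight per branch is used instead of the per-tree arc weights of the original W-Tree algorithm.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum_v p(feature = v) * H(labels | feature = v)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def ig_weights(gains):
    """Assumed scheme: normalise per-branch IG scores so they sum to 1."""
    total = sum(gains.values())
    if total == 0:
        return {branch: 1 / len(gains) for branch in gains}
    return {branch: g / total for branch, g in gains.items()}


def weighted_tree_similarity(query, doc, weights):
    """Weighted sum of per-branch cosine similarities (simplified W-Tree)."""
    return sum(
        w * cosine_similarity(query.get(branch, []), doc.get(branch, []))
        for branch, w in weights.items()
    )


# Hypothetical training signal: for four past (query, document) pairs, whether
# each branch matched the query, and whether the document was relevant.
relevant = [1, 1, 0, 0]
branch_match = {"title": [1, 1, 0, 0], "caption": [1, 0, 1, 0]}

gains = {b: information_gain(f, relevant) for b, f in branch_match.items()}
weights = ig_weights(gains)

query = {"title": ["red", "car"], "caption": ["car", "street"]}
doc = {"title": ["red", "car"], "caption": ["blue", "sky"]}

uniform = {b: 1 / len(gains) for b in gains}
print("uniform weights:", weighted_tree_similarity(query, doc, uniform))
print("IG weights:     ", weighted_tree_similarity(query, doc, weights))
```

In this toy setup the `caption` branch carries no information about relevance, so its IG-derived weight drops to zero and the mismatch in that branch no longer lowers the score, which is the kind of noise suppression the abstract describes.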





DOI: https://doi.org/10.21107/simantec.v5i2.1630



Copyright (c) 1970 Hasan Dwi Cahyono
