DECISION TREE C4.5 ALGORITHM FOR TUITION AID GRANT PROGRAM CLASSIFICATION (CASE STUDY: DEPARTMENT OF INFORMATION SYSTEM, UNIVERSITAS TEKNOKRAT INDONESIA)

In the pandemic era, almost everyone is struggling to make a living, and college students are no exception: many have difficulty paying tuition fees to continue their studies. In response, Universitas Teknokrat Indonesia offers a tuition aid grant program to students with good academic performance. Because many features are used to determine grant eligibility, decisions are hard to make quickly and can take a very long time. A classification model is needed to help management decide which students should receive the grant. The purpose of this study is to classify grant recipients using the decision tree C4.5 algorithm. The C4.5 algorithm is used because it directly shows the pattern in the data as a decision tree that determines whether a candidate student can be accepted as an awardee. The result of the study is a classification model with accuracy, precision, and recall all at 87%. This means the model performs well enough to be implemented in a system.
Keywords: Data Mining, Classification, Decision Tree, C4.5 Algorithm
Jurnal Ilmiah Edutic/Vol.7, No.1, November 2020 p-ISSN 2407-448 e-ISSN 2528-7303


INTRODUCTION
For some students, the high cost of tuition is an obstacle to continuing their college studies, especially during Covid-19, and this applies even to students who are already enrolled. As a result, students frequently postpone their studies or drop out midway. Scholarship decisions rest on several criteria that are weighed when deciding whether to award a scholarship. Universitas Teknokrat Indonesia provides a tuition aid grant to students with good academic performance so that they can continue their studies and ease the burden on their parents. The number of criteria involved makes it difficult for management to reach a decision and lengthens the time needed to do so (Chermiti, 2019).
Classification is the process of finding a model or function that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown (Nowak, 2017). One of the most popular classification techniques is the Decision Tree (Alsagheer et al, 2017). Because the many criteria used to determine the tuition aid grant make award decisions difficult and slow, the purpose of this research is to classify awardees using a Decision Tree built with the C4.5 algorithm.
Classification results are evaluated and validated with a confusion matrix and ten-fold cross-validation to determine the accuracy, precision, and recall of the Decision Tree in classifying scholarship recipients (Nowak, 2017).

Decision Tree
A decision tree is a classification method that uses a tree structure in which each node represents an attribute, each branch represents a value of that attribute, and each leaf represents a class. The top node of the tree is called the root. Rismayanti (2018) states that this method is very popular because the resulting models are easy to understand. It is called a decision tree because the rules it forms resemble the shape of a tree. Trees are formed by recursive binary partitioning of the data into groups, so that the values of the response variable within each group become more homogeneous (Sharma and Kumar, 2016). The concept of the decision tree is to turn the data into a tree and a set of decision rules. The main benefit of using a decision tree is its ability to simplify complex decision-making processes so that decision makers can interpret solutions to problems (Mesarić and Šebalj, 2016).

C4.5 Algorithm
The C4.5 algorithm is one of the algorithms used to classify or group datasets. It is based on decision tree formation (Petropoulos et al, 2018). The branches of the decision tree represent classification questions, while the leaves represent classes or groups. Because the purpose of the C4.5 algorithm is classification, the result of processing a dataset is a grouping of the data into certain classes (Jabeur et al, 2020). The C4.5 algorithm is a development of the ID3 algorithm that addresses ID3's shortcomings (Bekesiene and Hoskova-Mayerova, 2018).
Some things that distinguish the C4.5 algorithm from ID3 are:
1. Robust to noisy data
2. Able to handle both discrete and continuous variables
3. Able to handle variables that have missing values
4. Can prune branches from the decision tree
In general, the steps of the C4.5 algorithm for building a decision tree model are as follows:
1. Select a variable as the root
2. Create a branch for each value
3. Divide the cases into the branches
4. Repeat the process for each branch until all cases on the branch have the same class.
The first step in forming a decision tree is to determine which attribute/variable becomes its root. This is done using entropy, gain, split info, and gain ratio (Mittal et al, 2017).
Entropy is a parameter for measuring the level of diversity (heterogeneity) of a data set. The greater the entropy, the greater the diversity of the data set (Al-Barrak and Al-Razgan, 2016). The formula for calculating entropy is as follows:
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (1)$$
Where:
S : case set
n : number of classification classes
p_i : proportion of samples in S that belong to class i
Gain is a measure of the effectiveness of a variable in classifying data. The gain of a variable is the difference between the total entropy and the entropy of that variable (Hamoud et al, 2018). Gain can be formulated by:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{m} \frac{|S_i|}{|S|} \, \mathrm{Entropy}(S_i) \qquad (2)$$
Where:
A : variable
m : number of partitions (distinct values) of variable A
|S_i| : number of samples in partition i
|S| : number of samples in the entire data set
In the C4.5 algorithm, gain values are used to determine which variables become nodes of the decision tree. The variable with the highest gain becomes a node in the decision tree (Patel and Prajapati, 2018).
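As an illustration of formulas (1) and (2), the following Python sketch computes entropy and gain for categorical variables. It also includes split info and gain ratio, which the text mentions without stating their formulas; the definitions used here are the standard C4.5 ones (split info as the entropy of the partition sizes, gain ratio as gain divided by split info) and are an assumption about what the authors used. The function names and the toy records are hypothetical and are not taken from the study's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Formula (1): -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def partition(rows, labels, attribute):
    """Group the class labels by the value each row has for `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return groups


def information_gain(rows, labels, attribute):
    """Formula (2): Entropy(S) - sum(|S_i| / |S| * Entropy(S_i))."""
    total = len(labels)
    groups = partition(rows, labels, attribute)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attribute):
    """Assumed standard C4.5 definition: Gain(S, A) / SplitInfo(S, A)."""
    # Split info is the entropy of the attribute's own value distribution.
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return information_gain(rows, labels, attribute) / split_info


# Hypothetical toy records, not the study's data.
rows = [
    {"gpa": "excellent", "organization": "yes"},
    {"gpa": "excellent", "organization": "no"},
    {"gpa": "good", "organization": "yes"},
    {"gpa": "good", "organization": "no"},
]
labels = ["accepted", "accepted", "rejected", "rejected"]

for attribute in ("gpa", "organization"):
    print(attribute,
          information_gain(rows, labels, attribute),
          gain_ratio(rows, labels, attribute))
```

In this toy set the gpa variable separates the classes completely while organization does not separate them at all, so gpa would be chosen as the root under either criterion.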

RESEARCH METHODS
The research method used to apply the C4.5 algorithm to tuition aid grant program classification follows the research design shown in Figure 1.

Data Collection
Data collection gathers the data that will be used in the C4.5 classification process.

Data Preprocessing
Data preprocessing is the process of transforming, merging, or converting data into a form suitable for the C4.5 calculations.

Feature Selection
Feature selection chooses the data to be used in the C4.5 classification process. Its purpose is to create a target data set by selecting a data set or focusing on a subset of variables or data samples on which discovery will be performed.

Model Calculation
Entropy is calculated with formula (1) and information gain with formula (2) for every attribute/variable, and the attribute with the highest information gain is used as the root node of the decision tree.

Decision Tree
The decision tree is the result of calculating entropy and information gain repeatedly until every branch of the tree ends in a class and no further calculation is possible.
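The repeated calculation described in this stage can be sketched as a short recursive procedure. The Python sketch below is a simplified illustration under stated assumptions: it selects attributes by information gain, handles categorical attributes only, and omits the C4.5 features for continuous values, missing values, and pruning. The nested-dictionary tree format, the function names, and the toy records (including the "fair" GPA category) are hypothetical and are not the study's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or {"attribute": name, "branches": {value: subtree}}."""
    if len(set(labels)) == 1:            # all cases on this branch share one class
        return labels[0]
    if not attributes:                   # nothing left to split on: use the majority class
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the highest information gain as the node.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        branches[value] = build_tree(sub_rows, sub_labels, remaining)
    return {"attribute": best, "branches": branches}


# Hypothetical toy records, not the study's data.
rows = [
    {"gpa": "excellent", "competition": "no"},
    {"gpa": "excellent", "competition": "yes"},
    {"gpa": "good", "competition": "yes"},
    {"gpa": "good", "competition": "no"},
    {"gpa": "fair", "competition": "yes"},
    {"gpa": "fair", "competition": "no"},
]
labels = ["accepted", "accepted", "accepted", "rejected", "rejected", "rejected"]

tree = build_tree(rows, labels, ["gpa", "competition"])
print(tree)
```

On these toy records gpa has the higher gain and becomes the root; only the "good" branch is mixed, so it is split again on competition.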

Rule Model
The rule model is an explanatory description that represents the decision tree.
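One way to obtain such a description is to follow every path from the root to a leaf and write it as an IF ... THEN rule. The Python sketch below illustrates the idea on a hand-written tree. The nested-dictionary format, the function names, and the tree itself are hypothetical and do not reproduce the rules found in this study.

```python
def extract_rules(tree, conditions=()):
    """Walk a nested-dict tree and return one (conditions, label) pair per leaf."""
    if not isinstance(tree, dict):       # leaf: emit the path accumulated so far
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        path = conditions + ((tree["attribute"], value),)
        rules.extend(extract_rules(subtree, path))
    return rules


def format_rule(conditions, label):
    """Render one root-to-leaf path as a readable IF ... THEN rule."""
    clause = " AND ".join(f"{attr} = {value}" for attr, value in conditions) or "TRUE"
    return f"IF {clause} THEN {label}"


# Hypothetical tree for illustration only.
tree = {
    "attribute": "gpa",
    "branches": {
        "excellent": "accepted",
        "fair": "rejected",
        "good": {
            "attribute": "competition",
            "branches": {"no": "rejected", "yes": "accepted"},
        },
    },
}

rules = [format_rule(conds, label) for conds, label in extract_rules(tree)]
for rule in rules:
    print(rule)
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaves in the tree.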

Validation and Testing
Validation and testing check whether all functions work well. Validation uses a confusion matrix and ten-fold cross-validation, which randomizes the data set and divides it into ten equally sized segments. Validation and testing determine the accuracy, precision, and recall of the classification predictions. Accuracy is the percentage of records in the test data that are correctly classified. Precision is the percentage of records classified as positive that are actually positive. Recall is the percentage of actual positive records that are correctly recognized (Tsami et al, 2018).
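The Python sketch below illustrates the two ingredients of this stage: splitting shuffled record indices into ten folds, and deriving accuracy, precision, and recall from confusion matrix counts. It is a minimal sketch, not the script used in this study; the function names, the random seed, the choice of "accepted" as the positive class, and the example labels are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, FN, TN) for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn


def metrics(actual, predicted, positive):
    """Return (accuracy, precision, recall) computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels for illustration only.
actual = ["accepted", "accepted", "accepted", "rejected",
          "rejected", "rejected", "rejected", "accepted"]
predicted = ["accepted", "accepted", "rejected", "rejected",
             "rejected", "accepted", "rejected", "accepted"]

folds = k_fold_indices(25, k=10)
print([len(fold) for fold in folds])
print(metrics(actual, predicted, positive="accepted"))
```

In a full ten-fold run, each fold serves once as the test set while the other nine are used to build the tree, and the metrics are then aggregated over the ten rounds.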

RESULT AND DISCUSSION
The results and stages of the C4.5 classification process are as follows. After all the entropy and information gain values are calculated, the results are entered into Table 3. From these calculations, the largest information gain belongs to GPA, with a value of 0.2013, and the smallest to Organization, with a value of 0.01759. The attribute with the largest information gain becomes the root node. The selected attribute is then removed, and the entropy and information gain calculations are repeated, with the attribute that has the largest information gain becoming the next internal node of the tree.
The calculation is repeated until every branch of the tree ends in a class. The table below shows the entire set of calculation results. GPA has the highest information gain, so it becomes the root of the decision tree. The calculation then continues to find the subsequent nodes, with the results shown below. The decisions and rules resulting from this research are as follows:
1. Students with good GPA will be rejected from getting the scholarship.
2. Students with good GPA are considered for the scholarship if they join competitions and organizations.
3. Students with excellent GPA will be accepted as awardees without considering competition or organization.

Validation and Testing
Testing is done with cross-validation, specifically ten-fold cross-validation. The confusion matrix and ten-fold cross-validation results obtained with Python are shown below:

Figure 2. Confusion matrix and ten-fold cross validation results
Based on testing with ten-fold cross-validation, the accuracy is 87%, the precision is 87%, and the recall is 87%. This indicates that the resulting classification model can be applied to recommending the acceptance of prospective tuition aid grant recipients.

CONCLUSION
The conclusions of the research are as follows:
1. Judging by accuracy, recall, and precision all at 87%, the C4.5 classification algorithm is suitable for implementation in tuition aid grant program decisions at Universitas Teknokrat Indonesia, and the resulting model predicts and recommends awardees well.
2. The model's 8 rules can be used as a reference in the design and creation of GUI applications.
3. This algorithm can be used to determine which students will receive the scholarship.