Improving IF-THEN Rules with a Genetic Algorithm
Muntean, Maria; Ileana, Ioan; Joldes, Remus et al.
1. INTRODUCTION
Classification, one of the primary tasks in data mining, consists
of assigning objects to one of several predefined categories [Nadal et
al., 1990]. Many approaches have been proposed for the construction of
an accurate classifier. Among them, the following approaches are well
known in the literature: decision-tree, rule-induction,
association-based, and hybrid ones. AQ [Michalski, 1969] and CN2
[Clark & Niblett, 1989] are two methods based on the rule-induction
approach: they induce one rule at a time, and all the data examples
covered by that rule are removed from the search space. A more recent
rule inducer based on a sequential covering algorithm, RIPPER [Cohen,
1995], [Huang et al., 2007], grows a rule by adding an attribute test
whenever that test results in a more accurate separation of the
training data. Since IF-THEN rules can easily be represented by bit
strings of fixed length, we applied Genetic Algorithms (GAs) to rule
induction [Fidelis et al., 2000], [Wan & Zhao, 2009] in order to
increase the classifier's performance.
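As a minimal illustration of such a fixed-length encoding (a sketch
only: the attributes, operators and thresholds below are hypothetical,
not the exact representation used in this paper), each attribute can
be given one slot holding an activity flag, a relational operator and
a value, so that a rule is the conjunction of its active slots:

    # Sketch of a fixed-length encoding for an IF-THEN rule (Python).
    # Attribute names and thresholds are hypothetical illustrations.
    rule = [
        (1, "<=", 45.0),   # attribute 0: active, e.g. age <= 45
        (0, ">", 120.0),   # attribute 1: inactive (not part of the rule)
        (1, "=", 1.0),     # attribute 2: active, e.g. sex = 1
    ]
    predicted_class = 1    # the THEN part of the rule

    def covers(rule, instance):
        """True if every active sub-condition holds for the instance."""
        ops = {"<=": lambda a, b: a <= b,
               ">":  lambda a, b: a > b,
               "=":  lambda a, b: a == b}
        return all(ops[op](x, v)
                   for (active, op, v), x in zip(rule, instance) if active)

    print(covers(rule, [40.0, 130.0, 1.0]))  # True: age <= 45 and sex = 1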
2. COST-SENSITIVE CLASSIFICATION
The accuracy of a classifier is the percentage of instances in a
given test set that it classifies correctly. A very useful tool for
analyzing how well a classifier can recognize instances of different
classes is the confusion matrix. A confusion matrix for two classes is
shown in Table 1.
The true positives (TP) and true negatives (TN) are correct
classifications. A false positive (FP) occurs when the outcome is
incorrectly predicted as 1 (or positive) when it is actually 0
(negative). A false negative (FN) occurs when the outcome is incorrectly
predicted as negative when it is actually positive.
In order to assess how well the model can classify the instances,
two additional measures are defined, sensitivity and specificity:
sensitivity = TP / (TP + FN) (1)
specificity = TN / (TN + FP) (2)
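Equations (1) and (2) translate directly into code (a minimal sketch;
the counts below are invented for illustration):

    def sensitivity_specificity(tp, fn, tn, fp):
        """Compute sensitivity (1) and specificity (2) from counts."""
        sensitivity = tp / (tp + fn)  # fraction of positives recognized
        specificity = tn / (tn + fp)  # fraction of negatives recognized
        return sensitivity, specificity

    # Hypothetical confusion-matrix counts, for illustration only.
    sens, spec = sensitivity_specificity(tp=40, fn=10, tn=35, fp=15)
    print(sens, spec)  # 0.8 0.7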
3. DESCRIPTION OF THE GA USED
Genetic algorithms have been used for classification as well as
for other optimization problems; in data mining, they may also be used
to evaluate the fitness of other algorithms. In the context of this
work, the population represents the solution search space, where a
solution is an IF-THEN rule: a chromosome is an IF-THEN rule, and a
gene represents a sub-condition in that rule. Based on the notion of
survival of the fittest, a new population is formed from the fittest
rules in the current population together with offspring of these
rules. Typically, the fitness of a rule is assessed by its
classification accuracy on a set of training samples.
In order to cover a considerable percentage of the instances in the
test set, we used a sequential covering algorithm. First, rules were
created in a genetic manner. Second, a fitness method was developed to
compute the fitness measure of the rules. In the fitness method, the
distribution of the learned rule was computed: for each instance, the
real and the predicted distributions were calculated and compared, and
the instances were counted as true or false positives and negatives
(TP, TN, FP, FN). The sensitivity and specificity were then determined
from these counts, and the fitness value was computed by simply
multiplying the sensitivity and the specificity.
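A minimal sketch of this fitness evaluation, assuming a covers()
predicate such as the one sketched in the Introduction; the binary 0/1
labels and the treatment of the rule's class as the positive class are
assumptions made for illustration:

    def rule_fitness(rule, instances, labels, positive=1):
        """Fitness of a learned rule = sensitivity * specificity.

        The rule predicts the positive class for every instance it
        covers and the negative class otherwise; comparing these
        predictions with the real labels gives the TP/TN/FP/FN counts.
        """
        tp = tn = fp = fn = 0
        for x, y in zip(instances, labels):
            predicted_positive = covers(rule, x)
            actually_positive = (y == positive)
            if predicted_positive and actually_positive:
                tp += 1
            elif not predicted_positive and not actually_positive:
                tn += 1
            elif predicted_positive:
                fp += 1
            else:
                fn += 1
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        return sens * spec  # fitness = sensitivity * specificity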
This step was preceded by a preprocessing stage, carried out before
the rule was learned and its fitness evaluated: we established a
correspondence between the genetic data types and the instances' data
types.
In the genetic component, the instances were parsed into
chromosomes, which were ranked, recombined and evaluated. The
classifier component provided the means of evaluation and testing, and
measured how well the classifier performed. Our gene and chromosome
had a special representation (Fig. 1 and Fig. 2). Each gene was
divided into three fields: weight, relational operator and value. The
weight determined whether the corresponding gene was part of the rule:
genes whose weight fell below a certain threshold were no longer
considered part of the chromosome. The relational operator connected
the gene's attribute to its value in the sub-condition, and the value
field held the value tested by the gene.
[FIGURE 1 OMITTED]
[FIGURE 2 OMITTED]
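The representation of Fig. 1 and Fig. 2 can be sketched as follows (a
hypothetical rendering: the weight threshold of 0.5 is an assumption,
as the paper does not state its value):

    from dataclasses import dataclass
    from typing import List

    WEIGHT_THRESHOLD = 0.5  # hypothetical cutoff; exact value not given

    @dataclass
    class Gene:
        """One sub-condition of a rule: weight, operator, value."""
        weight: float  # decides whether the gene is active in the rule
        op: str        # relational operator, e.g. "=", "<", ">"
        value: float   # value tested by the sub-condition

    @dataclass
    class Chromosome:
        """An IF-THEN rule: one gene per attribute plus the class."""
        genes: List[Gene]
        predicted_class: int

        def active_genes(self):
            # Genes whose weight falls below the threshold are dropped.
            return [g for g in self.genes if g.weight >= WEIGHT_THRESHOLD]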
4. EXPERIMENTAL RESULTS
For the evaluation, we used four data sets: Australian, Bupa,
Cleveland and Pima, obtained from the online UCI Machine Learning
Repository. The accuracy levels reported for most of these data sets
are not very high, so improvement is possible.
Each test included 10 runs, with an 80/20 percentage split and with
a seed from 1 to 10 for each run. For the tests, the genetic
parameters were tuned, and different behaviors were observed for
different parameter values. The tests were validated against existing
rule learners: JRip (the Weka implementation of the RIPPER algorithm)
and PART. The parameters used for the default test were: population
size (50 individuals), parent selection technique (tournament
selection), crossover (2-point, random) and mutation (0.3 mutation
rate).
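For concreteness, these default parameters and a random 2-point
crossover can be sketched as below (illustrative code, not the Weka
implementation; the tournament size is an assumption, since the text
only names the technique):

    import random

    POPULATION_SIZE = 50   # population size from the default test
    CROSSOVER_POINTS = 2   # random 2-point crossover
    MUTATION_RATE = 0.3    # mutation rate from the default test
    TOURNAMENT_SIZE = 2    # hypothetical; only the technique is named

    def two_point_crossover(parent_a, parent_b, rng=random):
        """Swap the segment between two random cut points."""
        i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
        child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
        child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
        return child_a, child_b

    a, b = two_point_crossover(list("AAAAAA"), list("BBBBBB"))
    print("".join(a), "".join(b))  # e.g. AABBAA BBAABB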
A. The first test set
The first of the test series was tested on all the four data sets
by varying the seed of the randomization process (from 1 to 10) and
keeping the other parameters constant.
Table 2 shows that the best results were attained for the Bupa
data set. Compared to the JRip algorithm, the genetic classifier had a
better median accuracy and also a better standard deviation. However,
the average accuracy of the PART algorithm on the Cleveland data set
was better than that of the new classifier.
B. The second test set
In this test, the number of randomly generated crossover points
was changed from 2 to 3, and for each of the two cases the seed was
again varied. All other parameters were kept constant. The best
average accuracies were obtained for the Australian and Cleveland data
sets (Table 3) when the number of crossover points was set to 3, and
for the Bupa and Pima data sets when the number of crossover points
was equal to 2. The average accuracy on the Bupa data set seems
insensitive to these parameter changes. Comparing the values obtained
with the values in the first test set, it can be observed that
Australian and Cleveland perform about 1% better. The standard
deviations in the table are rather high, except for the Pima data set,
which obtains 4.31% when the number of crossover points is 3.
C. The third test set
In the last test set, we varied the number of evaluation cycles,
starting from 100, then 1000, and finally 10000 cycles, to see how the
classifier behaved. Again, the seed was varied for each setting of the
number of evaluation cycles (Table 4).
5. CONCLUSIONS
The experiments showed that learning classification rules with
genetic algorithms can reach accuracy measures similar to the results
obtained with the JRip and PART algorithms (Table 5).
The Bupa data set behaved well throughout all the test sets: its
accuracy remained nearly constant across all three experiments.
The experiments thus support the claim that, by combining a
genetic component with a heuristic component, high accuracy levels can
be obtained in comparison to other classification algorithms.
6. REFERENCES
Clark, P. & Niblett, T. (1989). The CN2 induction algorithm,
Machine Learning, Vol.3(4), pp.261-283, ISSN 0885-6125
Cohen, W. W. (1995). Fast Effective Rule Induction, Machine
Learning: Proceedings of the 12th International Conference,
pp.115-123, Tahoe City, Morgan Kaufmann
Fidelis, M. V., Lopes, H. S. & Freitas, A. A. (2000).
Discovering Comprehensible Classification Rules with a Genetic
Algorithm, Proceedings of the 2000 Congress on Evolutionary
Computation, Vol.1, pp.805-810, La Jolla, USA
Huang, J., Cai, Y. & Xu, X. (2007). A hybrid genetic algorithm
for feature selection wrapper based on mutual information, Pattern
Recognition Letters, Vol.28(13), pp.1825-1844
Michalski, R. S. (1969). On the quasi-minimal solution of the
general covering problem, Proceedings of the V International Symposium
on Information Processing, Switching Circuits, Vol.A3, pp.125-128,
Bled, Yugoslavia
Nadal, C., Legault, R. & Suen, C. Y. (1990). Complementary
algorithms for the recognition of totally unconstrained handwritten
numerals, Proceedings of the 10th Int. Conf. on Pattern Recognition,
Vol.A, pp.434-449, Atlantic City
Wan, L. & Zhao, C. (2009). Rule Acquisition with an
Entropy-based Hybrid Genetic Algorithm, 2009 International Conference
on Networking and Digital Society, pp.275-278
Tab. 1. Confusion matrix for a two-class problem

                           Predicted Class
                           Class = 1   Class = 0
Actual Class   Class = 1   TP          FN
               Class = 0   FP          TN
Tab. 2. The results given by the first test set

Genetic Classifier (%)
Data set     avg acc   std dev   best acc
Australian   74.49     8.01      86.95
Bupa         61.73     5.16      68.11
Cleveland    49.67     5.01      60.65
Pima         69.35     4.41      75.97
Tab. 3. The values for the second test set

No of crossover points   2                   3
Data set             avg acc  std dev   avg acc  std dev
Australian           74.78    9.94      52.62    9.81
Bupa                 61.01    6.67      60.00    7.91
Cleveland            48.36    13.16     75.21    5.21
Pima                 70.06    6.99      68.89    4.31
Tab. 4. The values resulting from the third experiment

No of cycles    100                 1000                10000
Data set     avg acc  std dev   avg acc  std dev   avg acc  std dev
Australian   72.39    6.53      74.13    6.57      72.31    7.47
Bupa         61.73    9.49      60.28    8.17      60.43    7.03
Cleveland    51.96    7.16      50.00    7.73      49.67    12.62
Pima         69.87    8.22      72.07    5.62      69.09    5.59
Tab. 5. Comparative study between JRip, PART and the Genetic
Classifier

                     Bupa (%)   Australian (%)   Cleveland (%)   Pima (%)
JRip                 59.42      75.52            53.30           74.53
PART                 60.72      79.53            52.45           75.04
Genetic Classifier   64.05      79.03            54.59           73.96