This notebook illustrates the use of Random Forests in Shogun for classification and regression. We will look at how Random Forests work, discuss the importance of their main parameters, and see why this learning method is useful.
Random Forest is an ensemble learning method in which a collection of decision trees is grown during training and the outputs of all the individual trees are combined during testing or application. The combination strategy can vary, but generally the mode of the output classes is used for classification and the mean of the outputs is used for regression. The randomness in the method, as its name suggests, comes mainly from the random subspace sampling done while training individual trees: when choosing the best split during tree growing, only a small, randomly chosen subset of all the features is considered. The subset size is a user-controlled parameter and is usually set to the square root of the total number of available features. The purpose of random subspace sampling is to decorrelate the individual trees in the forest so that the overall model generalizes better, i.e. its variance decreases without its bias increasing (see bias-variance trade-off). In summary, the purpose of Random Forest is to reduce the generalization error of the model as much as possible.
In this section, we will see why training a Random Forest is preferable to training a single decision tree. In the process, we will also learn how to use Shogun's Random Forest class. For this purpose, we will use the letter recognition dataset. This dataset contains pixel information (16 features) for 20000 samples of the English alphabet. It is a 26-class classification problem where the task is to predict the letter given the 16 pixel features. We start by loading the training dataset.
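The two ingredients described above, combining per-tree outputs and drawing a random feature subset at each split, can be sketched in plain Python. This is only an illustration and not Shogun's implementation: the helper names (`majority_vote`, `mean_output`, `sample_feature_subset`) and the per-tree outputs are hypothetical.

```python
import math
import random
from collections import Counter

def majority_vote(tree_outputs):
    """Classification: return the most common class label among the trees."""
    return Counter(tree_outputs).most_common(1)[0][0]

def mean_output(tree_outputs):
    """Regression: return the arithmetic mean of the per-tree outputs."""
    return sum(tree_outputs) / float(len(tree_outputs))

def sample_feature_subset(num_features, subset_size, rng):
    """Pick subset_size distinct feature indices uniformly at random."""
    return sorted(rng.sample(range(num_features), subset_size))

# hypothetical outputs of five trees for a single sample
class_votes = [3, 7, 3, 3, 12]
regression_outputs = [2.0, 2.5, 3.0, 2.5, 2.0]
print(majority_vote(class_votes))
print(mean_output(regression_outputs))

# usual choice of subset size: square root of the number of features
num_features = 16
subset_size = int(round(math.sqrt(num_features)))

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
# a fresh subset would be drawn like this for every split of every tree
subset = sample_feature_subset(num_features, subset_size, rng)
print(subset)
```

Because a new subset is drawn at every split, two trees grown on similar data will tend to split on different features, which is what decorrelates them.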
from modshogun import CSVFile,RealFeatures,MulticlassLabels

def load_file(feat_file,label_file):
    # read features and labels from CSV files
    feats=RealFeatures(CSVFile(feat_file))
    labels=MulticlassLabels(CSVFile(label_file))
    return (feats, labels)

trainfeat_file='../../../../data/uci/letter/train_fm_letter.dat'
trainlab_file='../../../../data/uci/letter/train_label_letter.dat'
train_feats,train_labels=load_file(trainfeat_file,trainlab_file)
Next, we decide the parameters of our Random Forest.
from modshogun import RandomForest, MajorityVote
from numpy import array

def setup_random_forest(num_trees,rand_subset_size,combination_rule,feature_types):
    rf=RandomForest(rand_subset_size,num_trees)
    rf.set_combination_rule(combination_rule)
    rf.set_feature_types(feature_types)
    return rf

comb_rule=MajorityVote()
# False marks a feature as continuous (not nominal)
feat_types=array([False]*16)
rand_forest=setup_random_forest(10,4,comb_rule,feat_types)
In the above code snippet, we create a forest of 10 trees in which each split in an individual tree uses a randomly chosen subset of 4 features. Note that 4 is the square root of the total number of available features (16) and is hence the usual choice mentioned in the introductory paragraph. The combination strategy chosen is Majority Vote which, as the name suggests, picks the mode of all the individual tree outputs. The given features are all continuous in nature, so the feature types are all set to False (i.e. not nominal). Next, we train our Random Forest and use it to classify letters in our test dataset.
# train forest
rand_forest.set_labels(train_labels)
rand_forest.train(train_feats)

# load test dataset
testfeat_file='../../../../data/uci/letter/test_fm_letter.dat'
testlab_file='../../../../data/uci/letter/test_label_letter.dat'
test_feats,test_labels=load_file(testfeat_file,testlab_file)

# apply forest
output_rand_forest_train=rand_forest.apply_multiclass(train_feats)
output_rand_forest_test=rand_forest.apply_multiclass(test_feats)
We have with us the labels predicted by our Random Forest model. Let us also get the predictions made by a single tree. For this purpose, we train a CART-flavoured decision tree.
from modshogun import CARTree, PT_MULTICLASS

def train_cart(train_feats,train_labels,feature_types,problem_type):
    c=CARTree(feature_types,problem_type,2,False)
    c.set_labels(train_labels)
    c.train(train_feats)
    return c

# train CART
cart=train_cart(train_feats,train_labels,feat_types,PT_MULTICLASS)

# apply CART model
output_cart_train=cart.apply_multiclass(train_feats)
output_cart_test=cart.apply_multiclass(test_feats)
With both results at our disposal, let us find out which one is better.
from modshogun import MulticlassAccuracy

accuracy=MulticlassAccuracy()

# accuracies as percentages
rf_train_accuracy=accuracy.evaluate(output_rand_forest_train,train_labels)*100
rf_test_accuracy=accuracy.evaluate(output_rand_forest_test,test_labels)*100
cart_train_accuracy=accuracy.evaluate(output_cart_train,train_labels)*100
cart_test_accuracy=accuracy.evaluate(output_cart_test,test_labels)*100

print('Random Forest training accuracy : '+str(round(rf_train_accuracy,3))+'%')
print('CART training accuracy : '+str(round(cart_train_accuracy,3))+'%')
print('')
print('Random Forest test accuracy : '+str(round(rf_test_accuracy,3))+'%')
print('CART test accuracy : '+str(round(cart_test_accuracy,3))+'%')
Random Forest training accuracy : 99.887%
CART training accuracy : 100.0%

Random Forest test accuracy : 93.44%
CART test accuracy : 86.48%
As is clear from the results above, the Random Forest gives a significant improvement in test accuracy (93.44% versus 86.48%). The reason becomes apparent when one looks at the training accuracy: the single decision tree fits the training dataset perfectly (100%), i.e. it over-fits and hence does not generalize well. Random Forest, on the other hand, trades off a little training accuracy for better generalization. Impressed already? Let us now see what happens if we increase the number of trees in our forest.
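One way to quantify the over-fitting argument is the gap between training and test accuracy. The short sketch below only redoes the arithmetic on the four accuracies printed above; the `generalization_gap` helper is a hypothetical name introduced for illustration.

```python
# accuracies (in %) copied from the output printed above
results = {
    'Random Forest': {'train': 99.887, 'test': 93.44},
    'CART': {'train': 100.0, 'test': 86.48},
}

def generalization_gap(acc):
    """Difference between training and test accuracy, in percentage points."""
    return acc['train'] - acc['test']

gaps = {name: generalization_gap(acc) for name, acc in results.items()}
for name in sorted(gaps):
    print(name + ' train-test gap : ' + str(round(gaps[name], 3)) + ' points')
```

The single tree loses roughly twice as many percentage points as the forest when moving from training to test data, which is the over-fitting effect described above.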
In the last section, we trained a forest of 10 trees. What happens if we make our forest with 20 trees? Let us try to answer this question in a generic way.
def get_rf_accuracy(num_trees,rand_subset_size):
    # train a forest with the given settings and return its test accuracy (as a fraction)
    rf=setup_random_forest(num_trees,rand_subset_size,comb_rule,feat_types)
    rf.set_labels(train_labels)
    rf.train(train_feats)
    out_test=rf.apply_multiclass(test_feats)
    acc=MulticlassAccuracy()
    return acc.evaluate(out_test,test_labels)
The function above takes the number of trees and the subset size as inputs and returns the evaluated test accuracy. Let us use it to get the accuracy for different numbers of trees, keeping the subset size constant at 4.
import matplotlib.pyplot as plt
%matplotlib inline

num_trees4=[5,10,20,50,100]
rf_accuracy_4=[round(get_rf_accuracy(i,4)*100,3) for i in num_trees4]
print('Random Forest accuracies (as %) :' + str(rf_accuracy_4))

# plot results
# first point: a single tree (CART) with its test accuracy
x4=[1]
y4=[86.48]
x4.extend(num_trees4)
y4.extend(rf_accuracy_4)
plt.plot(x4,y4,'--bo')
plt.xlabel('Number of trees')
plt.ylabel('Multiclass Accuracy (as %)')
plt.xlim([0,110])
plt.ylim([85,100])
plt.show()
Random Forest accuracies (as %) :[90.32, 93.82, 95.1, 95.78, 96.4]
We see from the above plot that the accuracy of the model keeps increasing as we increase the number of trees in our Random Forest and eventually saturates at some value. Extrapolating the plot qualitatively, the saturation value will be somewhere around 96.5%. The jump in accuracy from 86.48% for a single tree to about 96.5% for a Random Forest with about 100 trees highlights the importance of the Random Forest algorithm.
The inevitable question at this point is whether a higher saturation accuracy can be achieved with a smaller (or larger) random feature subset size. Let us find out by repeating the above procedure for random subset sizes of 2 and 8.
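The saturation can also be read from the numbers directly. The sketch below redoes the arithmetic on the accuracies printed above (with the single CART tree as the first point) and shows the diminishing gain from each increase in forest size; the variable names are introduced here for illustration only.

```python
# number of trees and test accuracies (in %) from the run above;
# the first entry is the single CART tree
num_trees = [1, 5, 10, 20, 50, 100]
accuracy = [86.48, 90.32, 93.82, 95.1, 95.78, 96.4]

# test error in percentage points
errors = [100.0 - a for a in accuracy]

# accuracy gained by each successive increase in forest size
gains = [accuracy[i + 1] - accuracy[i] for i in range(len(accuracy) - 1)]

for n, gain in zip(num_trees[1:], gains):
    print('up to ' + str(n) + ' trees : +' + str(round(gain, 2)) + ' points')

# fraction of the single-tree error removed by the 100-tree forest
relative_error_reduction = 1.0 - errors[-1] / errors[0]
print('relative error reduction : ' + str(round(relative_error_reduction, 3)))
```

Each step adds less accuracy than the one before it, even though the later steps add many more trees, which is why the curve flattens.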
# subset size 2
num_trees2=[10,20,50,100]
rf_accuracy_2=[round(get_rf_accuracy(i,2)*100,3) for i in num_trees2]
print('Random Forest accuracies (as %) :' + str(rf_accuracy_2))
Random Forest accuracies (as %) :[91.98, 94.44, 95.64, 96.42]
# subset size 8
num_trees8=[5,10,50,100]
rf_accuracy_8=[round(get_rf_accuracy(i,8)*100,3) for i in num_trees8]
print('Random Forest accuracies (as %) :' + str(rf_accuracy_8))
Random Forest accuracies (as %) :[90.82, 93.76, 95.76, 96.02]
Let us plot all the results together and then interpret them.
# first point of each curve: a single tree (CART) with its test accuracy
x2=[1]
y2=[86.48]
x2.extend(num_trees2)
y2.extend(rf_accuracy_2)

x8=[1]
y8=[86.48]
x8.extend(num_trees8)
y8.extend(rf_accuracy_8)

plt.plot(x2,y2,'--bo',label='Subset Size = 2')
plt.plot(x4,y4,'--r^',label='Subset Size = 4')
plt.plot(x8,y8,'--gs',label='Subset Size = 8')
plt.xlabel('Number of trees')
plt.ylabel('Multiclass Accuracy (as %) ')
plt.legend(bbox_to_anchor=(0.92,0.4))
plt.xlim([0,110])
plt.ylim([85,100])
plt.show()
As we can see from the above plot, the subset size does not have a major impact on the saturated accuracy obtained on this particular dataset. While this is true for many datasets, it is not a general rule. In some datasets, the random feature subset size does have a measurable impact on the test accuracy. A simple strategy to find the optimal subset size is to use cross-validation. But with a Random Forest model, there is actually no need to perform cross-validation. Let us see why in the next section.
The individual trees in a Random Forest are trained on data vectors randomly chosen with replacement. As a result, each tree leaves some of the data vectors out of its training set. These vectors form the out-of-bag (OOB) set of the corresponding tree. A data vector can be part of the OOB sets of multiple trees. To calculate the OOB error, each data vector is applied only to those trees whose OOB set contains it, and the results are combined. This combined result, averaged over all data vectors, gives the OOB error. The OOB error is an estimate of the generalization error of the Random Forest model. Let us see how to compute this OOB estimate in Shogun.
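For reference, the cross-validation strategy mentioned above could look roughly like the following plain-Python sketch. It does not use Shogun's cross-validation classes; `kfold_indices`, `pick_subset_size` and `placeholder_score` are hypothetical names, and the scoring function is a stand-in that does not train any model.

```python
import random

def kfold_indices(num_samples, num_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for k in range(num_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx

def pick_subset_size(candidates, score_fn, num_samples, num_folds=5):
    """Return the candidate with the highest mean validation score."""
    mean_scores = {}
    for size in candidates:
        scores = [score_fn(size, tr, va)
                  for tr, va in kfold_indices(num_samples, num_folds)]
        mean_scores[size] = sum(scores) / float(len(scores))
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def placeholder_score(subset_size, train_idx, val_idx):
    # placeholder only: a real score_fn would train a forest on train_idx
    # with this subset size and return its accuracy on val_idx
    return 1.0 - abs(subset_size - 4) / 16.0

best_size, mean_scores = pick_subset_size([2, 4, 8], placeholder_score, num_samples=100)
print(best_size)
```

Note the cost: every candidate subset size requires training one forest per fold, which is what the OOB estimate avoids.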
rf=setup_random_forest(100,2,comb_rule,feat_types)
rf.set_labels(train_labels)
rf.train(train_feats)

# set evaluation strategy
evaluator=MulticlassAccuracy()
oobe=rf.get_oob_error(evaluator)

print('OOB accuracy : '+str(round(oobe*100,3))+'%')
OOB accuracy : 95.193%
The OOB accuracy calculated above is slightly lower than the test accuracy evaluated in the previous section (see plot for num_trees=100 and rand_subset_size=2). This is because the OOB estimate reflects the expected accuracy on any generalized set of data vectors. It is only natural that for some sets of vectors the actual accuracy is slightly higher than the OOB estimate, while for others the observed accuracy is a bit lower.
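The OOB procedure described in this section can be sketched without Shogun. The code below is a toy illustration, not what `get_oob_error` does internally: `bootstrap_sample` and `oob_accuracy` are hypothetical helpers, and the per-tree predictions in the small example are made up.

```python
import random
from collections import Counter

def bootstrap_sample(num_samples, rng):
    """Draw num_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(num_samples) for _ in range(num_samples)]
    out_of_bag = set(range(num_samples)) - set(in_bag)
    return in_bag, out_of_bag

def oob_accuracy(tree_predictions, labels, oob_sets):
    """tree_predictions[t][i] is the prediction of tree t for vector i."""
    correct, counted = 0, 0
    for i, label in enumerate(labels):
        # only trees that did not train on vector i are allowed to vote on it
        votes = [tree_predictions[t][i] for t, s in enumerate(oob_sets) if i in s]
        if not votes:
            continue  # vector i was in-bag for every tree
        counted += 1
        if Counter(votes).most_common(1)[0][0] == label:
            correct += 1
    return correct / float(counted)

# fraction of vectors left out by each tree, averaged over 25 bootstrap samples
rng = random.Random(0)
num_samples, num_trees = 1000, 25
sizes = [len(bootstrap_sample(num_samples, rng)[1]) for _ in range(num_trees)]
oob_fraction = sum(sizes) / float(num_samples * num_trees)
print('mean OOB fraction per tree : ' + str(round(oob_fraction, 3)))

# toy example with made-up predictions of 3 trees on 4 vectors
labels = [0, 1, 1, 0]
tree_predictions = [[0, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
oob_sets = [{0, 2}, {1, 2}, {2, 3}]
toy_accuracy = oob_accuracy(tree_predictions, labels, oob_sets)
print('toy OOB accuracy : ' + str(toy_accuracy))
```

Each tree leaves out a bit more than a third of the data on average, so with enough trees practically every training vector gets an OOB vote and no separate validation set is needed.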
Let us now apply the Random Forest model to the wine dataset. This dataset is different from the previous one in the sense that this dataset is small and has no separate test dataset. Hence OOB (or equivalently cross-validation) is the only viable strategy available here. Let us read the dataset first.
trainfeat_file='../../../../data/uci/wine/fm_wine.dat'
trainlab_file='../../../../data/uci/wine/label_wine.dat'
train_feats,train_labels=load_file(trainfeat_file,trainlab_file)
Next let us find out the appropriate feature subset size. For this we will make use of OOB error.
import matplotlib.pyplot as plt

def get_oob_errors_wine(num_trees,rand_subset_size):
    # the wine dataset has 13 continuous features
    feat_types=array([False]*13)
    rf=setup_random_forest(num_trees,rand_subset_size,MajorityVote(),feat_types)
    rf.set_labels(train_labels)
    rf.train(train_feats)
    evaluator=MulticlassAccuracy()
    return rf.get_oob_error(evaluator)

size=[1,2,4,6,8,10,13]
oobe=[round(get_oob_errors_wine(400,i)*100,3) for i in size]
print('Out-of-bag Accuracies (as %) : '+str(oobe))

plt.plot(size,oobe,'--bo')
plt.xlim([0,14])
plt.xlabel('Random subset size')
plt.ylabel('Multiclass accuracy')
plt.show()
Out-of-bag Accuracies (as %) : [98.876, 98.876, 98.876, 97.753, 97.753, 97.753, 97.753]
From the above plot, the OOB accuracy is highest for the small subset sizes (1, 2 and 4 in the run shown) and lower for sizes of 6 and above, so a subset size of 2 is a reasonable choice for wine classification. At this subset size, the expected classification accuracy of the model is 98.876%. Finally, as a sanity check, let us plot the accuracy versus number of trees curve to ensure that 400 is indeed a sufficient value, i.e. that the OOB accuracy saturates before 400.
size=[50,100,200,400,600]
oobe=[round(get_oob_errors_wine(i,2)*100,3) for i in size]
print('Out-of-bag Accuracies (as %) : '+str(oobe))

plt.plot(size,oobe,'--bo')
plt.xlim([40,650])
plt.ylim([95,100])
plt.xlabel('Number of trees')
plt.ylabel('Multiclass accuracy')
plt.show()
Out-of-bag Accuracies (as %) : [98.315, 98.315, 98.876, 98.876, 98.876]
We see from the above plot that the accuracy remains constant from 200 trees onwards. Hence 400 is a sufficient value. In fact, a value around 200 would have been ideal because of the lower training time associated with it.
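Reading the saturation point off the printed accuracies can also be done programmatically. The sketch below uses the numbers from the single run shown above (they will vary between runs); `smallest_saturating_size` and its `tolerance` parameter are hypothetical names introduced for illustration.

```python
# forest sizes and OOB accuracies (in %) from the run shown above
num_trees = [50, 100, 200, 400, 600]
oob_accuracies = [98.315, 98.315, 98.876, 98.876, 98.876]

def smallest_saturating_size(sizes, accuracies, tolerance=0.0):
    """Return the smallest size whose accuracy is within tolerance of the best."""
    best = max(accuracies)
    for size, acc in zip(sizes, accuracies):
        if best - acc <= tolerance:
            return size

saturation_size = smallest_saturating_size(num_trees, oob_accuracies)
print('smallest forest reaching the best OOB accuracy : ' + str(saturation_size))
```

A non-zero `tolerance` (in percentage points) can be used to accept a slightly smaller forest when training time matters more than the last fraction of a percent.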
 Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
 Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (October 2001), 5-32. DOI=10.1023/A:1010933404324 http://dx.doi.org/10.1023/A:1010933404324