Based on the model selection framework of his Google summer of code 2011 project | Saurabh Mahindre - github.com/Saurabh7 as a part of Google Summer of Code 2014 project mentored by - Heiko Strathmann
Cross validation aims to estimate an algorithm's performance on unseen data. For example, one might be interested in the average classification accuracy of a Support Vector Machine when being applied to new data, that it was not trained on. This is important in order to compare the performance different algorithms on the same target. Most crucial is the point that the data that was used for running/training the algorithm is not used for testing. Different algorithms here also can mean different parameters of the same algorithm. Thus, cross-validation can be used to tune parameters of learning algorithms, as well as comparing different families of algorithms against each other. Cross-validation estimates are related to the marginal likelihood in Bayesian statistics in the sense that using them for selecting models avoids overfitting.
Evaluating an algorithm's performance on training data should be avoided since the learner may adjust to very specific random features of the training data which are not very important to the general relation. This is called overfitting. Maximising performance on the training examples usually results in algorithms explaining the noise in data (rather than actual patterns), which leads to bad performance on unseen data. This is one of the reasons behind splitting the data and using different splits for training and testing, which can be done using cross-validation. Let us generate some toy data for binary classification to try cross validation on.
%pylab inline %matplotlib inline # include all Shogun classes import os SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data') from modshogun import * # generate some ultra easy training data gray() n=20 title('Toy data for binary classification') X=hstack((randn(2,n), randn(2,n)+1)) Y=hstack((-ones(n), ones(n))) _=scatter(X, X, c=Y , s=100) p1 = Rectangle((0, 0), 1, 1, fc="w") p2 = Rectangle((0, 0), 1, 1, fc="k") legend((p1, p2), ["Class 1", "Class 2"], loc=2) # training data in Shogun representation features=RealFeatures(X) labels=BinaryLabels(Y)
Populating the interactive namespace from numpy and matplotlib
As said earlier Cross-validation is based upon splitting the data into multiple partitions. Shogun has various strategies for this. The base class for them is CSplittingStrategy.
Formally, this is achieved via partitioning a dataset $X$ of size $|X|=n$ into $k \leq n$ disjoint partitions $X_i\subseteq X$ such that $X_1 \cup X_2 \cup \dots \cup X_n = X$ and $X_i\cap X_j=\emptyset$ for all $i\neq j$. Then, the algorithm is executed on all $k$ possibilities of merging $k-1$ partitions and subsequently tested on the remaining partition. This results in $k$ performances which are evaluated in some metric of choice (Shogun support multiple ones). The procedure can be repeated (on different splits) in order to obtain less variance in the estimate. See  for a nice review on cross-validation using different performance measures.
k=5 normal_split=CrossValidationSplitting(labels, k)
On classificaiton data, the best choice is stratified cross-validation. This divides the data in such way that the fraction of labels in each partition is roughly the same, which reduces the variance of the performance estimate quite a bit, in particular for data with more than two classes. In Shogun this is implemented by CStratifiedCrossValidationSplitting class.
Let us visualize the generated folds on the toy data.
#code to visualize splitting def get_folds(split, num): split.build_subsets() x= y= lab= for j in range(num): indices=split.generate_subset_indices(j) x_= y_= lab_= for i in range(len(indices)): x_.append(X[indices[i]]) y_.append(X[indices[i]]) lab_.append(Y[indices[i]]) x.append(x_) y.append(y_) lab.append(lab_) return x, y, lab def plot_folds(split_strategies, num): for i in range(len(split_strategies)): x, y, lab=get_folds(split_strategies[i], num) figure(figsize=(18,4)) gray() suptitle(split_strategies[i].get_name(), fontsize=12) for j in range(0, num): subplot(1, num, (j+1), title='Fold %s' %(j+1)) scatter(x[j], y[j], c=lab[j], s=100) _=plot_folds(split_strategies, 4)
Stratified splitting takes care that each fold has almost the same number of samples from each class. This is not the case with normal splitting which usually leads to imbalanced folds.
Following the example from above, we will tune the performance of a SVM on the binary classification problem. We will
The involved methods are
# define SVM with a small rbf kernel (always normalise the kernel!) C=1 kernel=GaussianKernel(2, 0.001) kernel.init(features, features) kernel.set_normalizer(SqrtDiagKernelNormalizer()) classifier=LibSVM(C, kernel, labels) # train _=classifier.train()
Ok, we now have performed classification on the training data. How good did this work? We can easily do this for many different performance measures.
# instanciate a number of Shogun performance measures metrics=[ROCEvaluation(), AccuracyMeasure(), ErrorRateMeasure(), F1Measure(), PrecisionMeasure(), RecallMeasure(), SpecificityMeasure()] for metric in metrics: print metric.get_name(), metric.evaluate(classifier.apply(features), labels)
ROCEvaluation 1.0 AccuracyMeasure 1.0 ErrorRateMeasure 0.0 F1Measure 1.0 PrecisionMeasure 1.0 RecallMeasure 1.0 SpecificityMeasure 1.0
Note how for example error rate is 1-accuracy. All of those numbers represent the training error, i.e. the ability of the classifier to explain the given data.
Now, the training error is zero. This seems good at first. But is this setting of the parameters a good idea? No! A good performance on the training data alone does not mean anything. A simple look up table is able to produce zero error on training data. What we want is that our methods generalises the input data somehow to perform well on unseen data. We will now use cross-validation to estimate the performance on such.
We will use CStratifiedCrossValidationSplitting, which accepts a reference to the labels and the number of partitions as parameters. This instance is then passed to the class CCrossValidation, which does the estimation using the desired splitting strategy. The latter class can take all algorithms that are implemented against the CMachine interface.
metric=AccuracyMeasure() cross=CrossValidation(classifier, features, labels, stratified_split, metric) # perform the cross-validation, note that this call involved a lot of computation result=cross.evaluate() # the result needs to be casted to CrossValidationResult result=CrossValidationResult.obtain_from_generic(result) # this class contains a field "mean" which contain the mean performance metric print "Testing", metric.get_name(), result.mean
Testing AccuracyMeasure 0.5
Now this is incredibly bad compared to the training error. In fact, it is very close to random performance (0.5). The lesson: Never judge your algorithms based on the performance on training data!
Note that for small data sizes, the cross-validation estimates are quite noisy. If we run it multiple times, we get different results.
print "Testing", metric.get_name(), [CrossValidationResult.obtain_from_generic(cross.evaluate()).mean for _ in range(10)]
Testing AccuracyMeasure [0.525, 0.525, 0.525, 0.55, 0.5, 0.55, 0.475, 0.525, 0.475, 0.5]
It is better to average a number of different runs of cross-validation in this case. A nice side effect of this is that the results can be used to estimate error intervals for a given confidence rate.
# 25 runs and 95% confidence intervals cross.set_num_runs(25) # perform x-validation (now even more expensive) cross.evaluate() result=cross.evaluate() result=CrossValidationResult.obtain_from_generic(result) print "Testing cross-validation mean %.2f " \ % (result.mean)
Testing cross-validation mean 0.52
Using this machinery, it is very easy to compare multiple kernel parameters against each other to find the best one. It is even possible to compare a different kernel.
widths=2**linspace(-5,25,10) results=zeros(len(widths)) for i in range(len(results)): kernel.set_width(widths[i]) result=CrossValidationResult.obtain_from_generic(cross.evaluate()) results[i]=result.mean plot(log2(widths), results, 'blue') xlabel("log2 Kernel width") ylabel(metric.get_name()) _=title("Accuracy for different kernel widths") print "Best Gaussian kernel width %.2f" % widths[results.argmax()], "gives", results.max() # compare this with a linear kernel classifier.set_kernel(LinearKernel()) lin_k=CrossValidationResult.obtain_from_generic(cross.evaluate()) plot([log2(widths), log2(widths[len(widths)-1])], [lin_k.mean,lin_k.mean], 'r') # please excuse this horrible code :) print "Linear kernel gives", lin_k.mean _=legend(["Gaussian", "Linear"], loc="lower center")
Best Gaussian kernel width 330280.74 gives 0.831 Linear kernel gives 0.782
This gives a brute-force way to select paramters of any algorithm implemented under the CMachine interface. The cool thing about this is, that it is also possible to compare different model families against each other. Below, we compare a a number of regression models in Shogun on the Boston Housing dataset.
Various regression models in Shogun are now used to predict house prices using the boston housing dataset. Cross-validation is used to find best parameters and also test the performance of the models.
feats=RealFeatures(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/fm_housing.dat'))) labels=RegressionLabels(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/housing_label.dat'))) preproc=RescaleFeatures() preproc.init(feats) feats.add_preprocessor(preproc) feats.apply_preprocessor(True) #Regression models ls=LeastSquaresRegression(feats, labels) tau=1 rr=LinearRidgeRegression(tau, feats, labels) width=1 tau=1 kernel=GaussianKernel(feats, feats, width) kernel.set_normalizer(SqrtDiagKernelNormalizer()) krr=KernelRidgeRegression(tau, kernel, labels) regression_models=[ls, rr, krr]
Let us use cross-validation to compare various values of tau paramter for ridge regression (Regression notebook). We will use MeanSquaredError as the performance metric. Note that normal splitting is used since it might be impossible to generate "good" splits using Stratified splitting in case of regression since we have continous values for labels.
n=30 taus = logspace(-4, 1, n) #5-fold cross-validation k=5 split=CrossValidationSplitting(labels, k) metric=MeanSquaredError() cross=CrossValidation(rr, feats, labels, split, metric) cross.set_num_runs(50) errors= for tau in taus: #set necessary parameter rr.set_tau(tau) result=cross.evaluate() result=CrossValidationResult.obtain_from_generic(result) #Enlist mean error for all runs errors.append(result.mean) figure(figsize=(20,6)) suptitle("Finding best (tau) parameter using cross-validation", fontsize=12) p=subplot(121) title("Ridge Regression") plot(taus, errors, linewidth=3) p.set_xscale('log') p.set_ylim([0, 80]) xlabel("Taus") ylabel("Mean Squared Error") cross=CrossValidation(krr, feats, labels, split, metric) cross.set_num_runs(50) errors= for tau in taus: krr.set_tau(tau) result=cross.evaluate() result=CrossValidationResult.obtain_from_generic(result) #print tau, "error", result.mean errors.append(result.mean) p2=subplot(122) title("Kernel Ridge regression") plot(taus, errors, linewidth=3) p2.set_xscale('log') xlabel("Taus") _=ylabel("Mean Squared Error")
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:15: RuntimeWarning: [WARN] In file /home/buildslave/release/build/src/shogun/evaluation/CrossValidation.cpp line 103: LinearRidgeRegression does not support locking. Autolocking is skipped. Set autolock flag to false to get rid of warning.
A low value of error certifies a good pick for the tau paramter which should be easy to conclude from the plots. In case of Ridge Regression the value of tau i.e. the amount of regularization doesn't seem to matter but does seem to in case of Kernel Ridge Regression. One interpretation of this could be the lack of over fitting in the feature space for ridge regression and the occurence of over fitting in the new kernel space in which Kernel Ridge Regression operates. </br> Next we will compare a range of values for the width of Gaussian Kernel used in Kernel Ridge Regression
n=50 widths=logspace(-2, 3, n) krr.set_tau(0.1) metric=MeanSquaredError() k=5 split=CrossValidationSplitting(labels, k) cross=CrossValidation(krr, feats, labels, split, metric) cross.set_num_runs(10) errors= for width in widths: kernel.set_width(width) result=cross.evaluate() result=CrossValidationResult.obtain_from_generic(result) #print width, "error", result.mean errors.append(result.mean) figure(figsize=(15,5)) p=subplot(121) title("Finding best width using cross-validation") plot(widths, errors, linewidth=3) p.set_xscale('log') xlabel("Widths") _=ylabel("Mean Squared Error")
The values for the kernel parameter and tau may not be independent of each other, so the values we have may not be optimal. A brute force way to do this would be to try all the pairs of these values but it is only feasible for a low number of parameters.
n=40 taus = logspace(-3, 0, n) widths=logspace(-1, 4, n) cross=CrossValidation(krr, feats, labels, split, metric) cross.set_num_runs(1) x, y=meshgrid(taus, widths) grid=array((ravel(x), ravel(y))) print grid.shape errors= for i in range(0, n*n): krr.set_tau(grid[:,i]) kernel.set_width(grid[:,i]) result=cross.evaluate() result=CrossValidationResult.obtain_from_generic(result) errors.append(result.mean) errors=array(errors).reshape((n, n))
from mpl_toolkits.mplot3d import Axes3D #taus = logspace(0.5, 1, n) jet() fig=figure(figsize(15,7)) ax=subplot(121) c=pcolor(x, y, errors) _=contour(x, y, errors, linewidths=1, colors='black') _=colorbar(c) xlabel('Taus') ylabel('Widths') ax.set_xscale('log') ax.set_yscale('log') ax1=fig.add_subplot(122, projection='3d') ax1.plot_wireframe(log10(y),log10(x), errors, linewidths=2, alpha=0.6) ax1.view_init(30,-40) xlabel('Taus') ylabel('Widths') _=ax1.set_zlabel('Error')
<matplotlib.figure.Figure at 0x7f997ff93b90>
Let us approximately pick the good parameters using the plots. Now that we have the best parameters, let us compare the various regression models on the data set.
#use the best parameters rr.set_tau(1) krr.set_tau(0.05) kernel.set_width(2) title_='Performance on Boston Housing dataset' print "%50s" %title_ for machine in regression_models: metric=MeanSquaredError() cross=CrossValidation(machine, feats, labels, split, metric) cross.set_num_runs(25) result=cross.evaluate() result=CrossValidationResult.obtain_from_generic(result) print "-"*80 print "|", "%30s" % machine.get_name(),"|", "%20s" %metric.get_name(),"|","%20s" %result.mean ,"|" print "-"*80
Performance on Boston Housing dataset
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:12: RuntimeWarning: [WARN] In file /home/buildslave/release/build/src/shogun/evaluation/CrossValidation.cpp line 103: LeastSquaresRegression does not support locking. Autolocking is skipped. Set autolock flag to false to get rid of warning.
-------------------------------------------------------------------------------- | LeastSquaresRegression | MeanSquaredError | 30.2201994674 |
/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:12: RuntimeWarning: [WARN] In file /home/buildslave/release/build/src/shogun/evaluation/CrossValidation.cpp line 103: LinearRidgeRegression does not support locking. Autolocking is skipped. Set autolock flag to false to get rid of warning.
-------------------------------------------------------------------------------- | LinearRidgeRegression | MeanSquaredError | 30.1648624021 | -------------------------------------------------------------------------------- | KernelRidgeRegression | MeanSquaredError | 10.9608050107 | --------------------------------------------------------------------------------
A standard way of selecting the best parameters of a learning algorithm is by Grid Search. This is done by an exhaustive search of a specified parameter space. CModelSelectionParameters is used to select various parameters and their ranges to be used for model selection. A tree like structure is used where the nodes can be CSGObject or the parameters to the object. The range of values to be searched for the parameters is set using
#Root param_tree_root=ModelSelectionParameters() #Parameter tau tau=ModelSelectionParameters("tau") param_tree_root.append_child(tau) # also R_LINEAR/R_LOG is available as type min_value=0.01 max_value=1 type_=R_LINEAR step=0.05 base=2 tau.build_values(min_value, max_value, type_, step, base)
Next we will create CModelSelectionParameters instance with a kernel object which has to be appended the root node. The kernel object itself will be append with a kernel width parameter which is the parameter we wish to search.
#kernel object param_gaussian_kernel=ModelSelectionParameters("kernel", kernel) gaussian_kernel_width=ModelSelectionParameters("log_width") gaussian_kernel_width.build_values(0.1, 6.0, R_LINEAR, 0.5, 2.0) #kernel parameter param_gaussian_kernel.append_child(gaussian_kernel_width) param_tree_root.append_child(param_gaussian_kernel)
# cross validation instance used cross_validation=CrossValidation(krr, feats, labels, split, metric) cross_validation.set_num_runs(1)
# model selection instance model_selection=GridSearchModelSelection(cross_validation, param_tree_root)
print_state=False # TODO: enable it once crossval has been fixed #best_parameters=model_selection.select_model(print_state) #best_parameters.apply_to_machine(krr) #best_parameters.print_tree() result=cross_validation.evaluate() result=CrossValidationResult.obtain_from_generic(result) print 'Error with Best parameters:', result.mean
Error with Best parameters: 10.8678393156
The error value using the parameters obtained from Grid Search is pretty close (and should be better) to the one we had seen in the last section. Grid search suffers from the curse of dimensionality though, which can lead to huge computation costs in higher dimensions.
 Forman, G. and Scholz, M. (2009). Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement. Technical report, HP Laboratories.