Nested Cross-Validation
It is often desirable to explore multiple models (classifiers, parameterizations), but doing so easily introduces an optimistic bias into the generalization estimate. The simplest, although computationally intensive, way to avoid such a bias is to carry out model selection by estimating the same (or a different) performance characteristic while operating only on the training data. If that estimate is itself obtained by cross-validation, the result is the so-called "nested cross-validation" procedure.
This example demonstrates how to implement such nested cross-validation while selecting the best performing classifier from the warehouse of classifiers available within PyMVPA.
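Schematically, the logic can be written down in a few lines of plain Python before turning to PyMVPA (a minimal sketch; outer_folds, candidate_models, inner_cv_error and test_error are hypothetical stand-ins, not PyMVPA API):
def nested_cv_sketch(outer_folds, candidate_models, inner_cv_error, test_error):
    """Sketch only: the inner loop selects a model, the outer loop assesses it."""
    outer_errors = []
    for train, test in outer_folds:
        # model selection relies on the training portion only (inner CV)
        best = min(candidate_models, key=lambda model: inner_cv_error(model, train))
        # the held-out portion is used solely to assess the selected model
        outer_errors.append(test_error(best, train, test))
    return outer_errors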
from mvpa.suite import *
# increase verbosity a bit for now
verbose.level = 3
# pre-seed RNG if you want to investigate the effects, thus
# needing reproducible results
#mvpa.seed(3)
# To minimize divergence from code for >= 0.5
np = N
For this simple example, let's generate some fresh random data with 2 relevant features and low SNR.
dataset = normalFeatureDataset(perlabel=24, nlabels=2, nchunks=3,
                               nonbogus_features=[0, 1],
                               nfeatures=100, snr=3.0)
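A quick sanity check on what was generated can be printed before moving on (an illustrative addition; nsamples and nfeatures are standard Dataset attributes):
# Report the size of the freshly generated dataset
verbose(2, "Generated dataset with %i samples and %i features"
           % (dataset.nsamples, dataset.nfeatures))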
To demonstrate the benefit of model selection, let's first compute the cross-validated error using the simple and popular kNN classifier.
clf_sample = kNN()
cv_sample = CrossValidatedTransferError(
    TransferError(clf_sample), NFoldSplitter())
verbose(1, "Estimating error using a sample classifier")
error_sample = np.mean(cv_sample(dataset))
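The resulting baseline estimate can be reported right away (an illustrative addition, not part of the original script):
# Report the baseline error of the sample classifier
verbose(2, "Mean cross-validated error of the sample kNN: %.2f" % error_sample)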
For convenience, let's define a helper function which we will use twice -- once within cross-validation, and once on the whole dataset.
def select_best_clf(dataset_, clfs):
    """Select best model according to CVTE

    Helper function which we will use twice -- once for proper nested
    cross-validation, and once to see how big an optimistic bias due
    to model selection could be if we simply provide an entire dataset.

    Parameters
    ----------
    dataset_ : Dataset
    clfs : list of Classifiers
      Which classifiers to explore

    Returns
    -------
    best_clf, best_error
    """
    best_error = None
    for clf in clfs:
        cv = CrossValidatedTransferError(TransferError(clf),
                                         NFoldSplitter())
        # unfortunately we don't have ability to reassign clf atm
        # cv.transerror.clf = clf
        try:
            error = np.mean(cv(dataset_))
        except LearnerError, e:
            # skip the classifier if data was not appropriate and it
            # failed to learn/predict at all
            continue
        if best_error is None or error < best_error:
            best_clf = clf
            best_error = error
        verbose(4, "Classifier %s cv error=%.2f" % (clf.descr, error))
    verbose(3, "Selected the best out of %i classifiers %s with error %.2f"
               % (len(clfs), best_clf.descr, best_error))
    return best_clf, best_error
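The clfswh warehouse is indexed by capability tags; below it is queried with '!gnpp' to exclude GNPP-based classifiers. To get an idea of which classifiers such a query would hand to the helper, their descriptions can be listed (an illustrative addition; it assumes the query returns the classifier instances directly, as the helper above also expects):
# List the descriptions of all classifiers the selection loop would consider
for clf in clfswh['!gnpp']:
    print clf.descr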
First, let's select a classifier within cross-validation, thus eliminating the model-selection bias.
errors = []
best_clfs = {}
confusion = ConfusionMatrix()
verbose(1, "Estimating error using nested CV for model selection")
for isplit, (dstrain, dstest) in enumerate(NFoldSplitter()(dataset)):
    verbose(2, "Processing split #%i" % isplit)
    best_clf, best_error = select_best_clf(dstrain, clfswh['!gnpp'])
    best_clfs[best_clf.descr] = best_clfs.get(best_clf.descr, 0) + 1
    # now that we have the best classifier, let's assess its transfer
    # to the testing dataset while training on the entire training set
    te = TransferError(best_clf, enable_states=['confusion'])
    errors.append(te(dstest, dstrain))
    confusion += te.states.confusion
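At this point best_clfs records how often each classifier won the inner selection, which is worth a quick look (an illustrative addition, not part of the original script):
# Report how often each classifier was picked across the outer splits
verbose(2, "Classifiers selected per outer split: %s" % str(best_clfs))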
And for comparison, let's assess what the best performance would be if we simply explored all available classifiers, providing all of the data at once.
verbose(1, "Estimating error via fishing expedition (best clf on entire dataset)")
cheating_clf, cheating_error = select_best_clf(dataset, clfswh['!gnpp'])
print "\nConfusion table for the nested cross-validation results:"
print confusion
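Finally, it is instructive to put the three estimates side by side: the error from the "fishing expedition" is typically lower than the honest nested-CV estimate, illustrating the optimistic bias (an illustrative addition, not part of the original script):
# Compare the baseline, the nested-CV, and the optimistically biased estimates
print "Sample kNN error:        %.2f" % error_sample
print "Nested CV error:         %.2f +/- %.2f" % (np.mean(errors), np.std(errors))
print "Fishing expedition error: %.2f (classifier: %s)" % (cheating_error, cheating_clf.descr)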
See also
The full source code of this example is included in the PyMVPA source distribution (doc/examples/nested_cv.py).