A Wrong Way to Do Cross-Validation

While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top-ranked journals.

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:

  1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels.
  2. Using just this subset of predictors, build a multivariate classifier.
  3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

Is this a correct application of cross-validation? Consider a scenario with $N = 50$ samples in two equal-sized classes, and $p = 5000$ quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.
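
As a quick illustration, here is a minimal sketch of the leaky recipe, assuming scikit-learn (SelectKBest with f_classif is used as a stand-in for the univariate correlation screen; a fuller, hand-rolled simulation is given in the demo at the end):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5000))   # predictors independent of the labels
y = np.repeat([0, 1], 25)             # two equal-sized classes

# Step (1), done the leaky way: screen predictors using *all* samples.
X_screened = SelectKBest(f_classif, k=100).fit_transform(X, y)

# Steps (2)-(3): fit 1-NN and cross-validate on the already-screened matrix.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X_screened, y, cv=5)
print(f'Apparent CV error rate: {1 - scores.mean():.1%}')  # typically far below the true 50%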

What has happened? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left-out samples.

Here is the correct way to carry out cross-validation in this example:

  1. Divide the samples into $K$ cross-validation folds (groups) at random.
  2. For each fold $k = 1, 2, \ldots, K$:

     (a) Find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels, using all of the samples except those in fold $k$.

     (b) Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold $k$.

     (c) Use the classifier to predict the class labels for the samples in fold $k$.
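
In scikit-learn the same discipline can be enforced without hand-rolling the folds by putting the screening step inside a Pipeline, so that cross-validation refits the selector on the training portion of each fold. A sketch, under the same assumptions as above (f_classif screening, 1-NN classifier):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5000))   # same null setting as before
y = np.repeat([0, 1], 25)

# The selector is a pipeline step, so it is refit on the training folds only;
# the held-out fold is never seen by the screening step.
pipe = Pipeline([
    ('screen', SelectKBest(f_classif, k=100)),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f'Honest CV error rate: {1 - scores.mean():.1%}')  # should come out close to 50%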

In general, with a multistep modeling procedure, cross-validation must be applied to the entire sequence of modeling steps. In particular, samples must be “left out” before any selection or filtering steps are applied. There is one qualification: initial unsupervised screening steps can be done before samples are left out. For example, we could select the 1000 predictors with highest variance across all 50 samples, before starting cross-validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.
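
A sketch of such an unsupervised pre-filter (variable names are illustrative; only the predictor matrix is used, never the labels):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5000))
y = np.repeat([0, 1], 25)

# Unsupervised screening: keep the 1000 highest-variance predictors.
# The class labels y are never consulted, so doing this on all samples
# before cross-validation does not bias the error estimate.
keep = np.argsort(X.var(axis=0))[-1000:]
X_filtered = X[:, keep]

# Any supervised selection must still happen inside the CV folds, applied to X_filtered.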

A runnable demo is included below.

Reference

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer. pp. 245–247.

Demo

import heapq

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from tqdm import tqdm

def generate_data(N=50, p=5000):
    '''
    :param N: int. Number of samples in two equal-sized classes
    :param p: int. Number of quantitative predictors (standard Gaussian) that are independent of the class labels
    :return: DataFrame with shape (N, p+1)
    '''
    data = pd.DataFrame(np.random.randn(N, p))
    labels = np.ones(N)
    labels[:N//2] = 0
    np.random.shuffle(labels)
    data['label'] = labels
    return data


def chosen_features(data, n_chosen=100):
    '''
    :param data: DataFrame
    :param n_chosen: int. Number of predictors having highest correlation with the class labels in data
    :return: list of indexes for chosen features
    '''
    # lazily build (feature_index, correlation_with_label) pairs for all predictors
    correlations = ((i, data[i].corr(data['label'])) for i in range(data.shape[1]-1))
    chosen = heapq.nlargest(n_chosen, correlations, key=lambda x: np.abs(x[1]))
    return [item[0] for item in chosen]


def simulation(clf=KNeighborsClassifier(n_neighbors=1), n_splits=5, n_sim=50, random_state=0xC7):
    '''
    :param clf: sklearn classifier
    :param n_splits: int. Number of folds for CV
    :param n_sim: int. Number of simulations
    :param random_state: np.random.seed(random_state)
    '''
    np.random.seed(random_state)
    sum1, sum2 = 0, 0
    
    for i in tqdm(range(n_sim)):
        
        data = generate_data()
        kf = KFold(n_splits=n_splits)
        idx_pre_chosen = chosen_features(data)
        
        for train_index, test_index in kf.split(data):
            y_train, y_test = data.iloc[train_index, -1], data.iloc[test_index, -1]
            
            # right way: re-select predictors using only the training fold
            idx_chosen = chosen_features(data.iloc[train_index])
            clf.fit(data.iloc[train_index, idx_chosen], y_train)
            sum1 += clf.score(data.iloc[test_index, idx_chosen], y_test)
            
            # wrong way: reuse predictors pre-selected on all samples (information leak)
            clf.fit(data.iloc[train_index, idx_pre_chosen], y_train)
            sum2 += clf.score(data.iloc[test_index, idx_pre_chosen], y_test)

    print(f'Average CV error rate for {n_sim} simulations via the right way: {1 - sum1/(n_splits*n_sim): .1%}.')
    print(f'Average CV error rate for {n_sim} simulations via the wrong way: {1 - sum2/(n_splits*n_sim): .1%}.')
    return None

simulation()

'''
Average CV error rate for 50 simulations via the right way:  52.8%.
Average CV error rate for 50 simulations via the wrong way:  1.8%.
'''

As a variant, redefine chosen_features to rank predictors by their signed correlation with the class labels instead of the absolute correlation, and rerun the simulation:

def chosen_features(data, n_chosen=100):
    correlations = ((i, data[i].corr(data['label'])) for i in range(len(data.columns)-1))
    # changed here: rank by signed correlation instead of absolute correlation
    chosen = heapq.nlargest(n_chosen, correlations, key=lambda x: x[1])
    return [item[0] for item in chosen]

simulation()

'''
Average CV error rate for 50 simulations via the right way:  53.0%.
Average CV error rate for 50 simulations via the wrong way:  3.3%.
'''

In both runs, the pre-selection (“wrong way”) estimate is wildly optimistic, while selecting predictors within each training fold (the “right way”) recovers an honest estimate close to the true 50% error rate.