While this point may seem obvious to the reader, we have seen this blunder committed many times in published papers in top rank journals.
Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:
- Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels.
- Using just this subset of predictors, build a multivariate classifier.
- Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.