Cross-validation

Cross-validation (CV) is a method used to evaluate how well a model predicts new, unseen data. It also helps to find the best complexity for the model. Breeze automatically uses cross-validation when building PLS and PLS-DA models, but you can adjust how it's applied in the model settings.

In cross-validation, parts of the training set are left out, and the model is built using the remaining data. The model then predicts the excluded data, and this process is repeated for different parts of the dataset, ensuring that each object is left out and tested once.

The accuracy of the model is measured by calculating the Predicted Residual Error Sum of Squares (PRESS), which is the sum of the squared differences between the actual and predicted values for the left-out data.
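
As a minimal sketch, the Predicted Residual Error Sum of Squares can be computed from the actual values and the cross-validated predictions for the left-out objects. The helper name below is illustrative, not part of Breeze:

```python
def press(actual, predicted):
    """Predicted Residual Error Sum of Squares: sum of squared differences
    between actual and cross-validated predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Four left-out objects, actual vs predicted:
press([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])  # ~0.1
```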

There are two types of cross-validation:

  • Partial cross-validation: This is done component by component. After each component is calculated, it is removed before testing the next round.

  • Full cross-validation: This uses the full model (with all components) in each round of cross-validation.

For PLS and PLS-DA models, the Q² score is used to measure the cross-validated explained variance in Y. It is calculated as:

Q² = 1 − PRESS / SS

where SS is the sum of squares for Y, and PRESS is the prediction error.

The total cross-validated variance, Q²(cum), can be calculated from all model components to give a measure of the overall variance explained by the model.
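
The Q² calculation above can be sketched in a few lines, assuming SS is the total sum of squares of Y taken around its mean (the function name is illustrative):

```python
def q2(actual, predicted):
    """Q2 = 1 - PRESS / SS, with SS computed around the mean of the
    actual Y values and PRESS from the cross-validated predictions."""
    mean_y = sum(actual) / len(actual)
    ss = sum((a - mean_y) ** 2 for a in actual)
    press = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - press / ss

q2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect prediction -> 1.0
```

A model that only predicts the mean of Y gives Q² = 0, and poorly predicting models can give negative values.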

Methods

Evenly Spread

This method excludes objects that are evenly distributed across the dataset. It ensures that excluded objects are spaced in a balanced manner throughout the data, preventing over-fitting by ensuring the training set remains representative of the overall dataset.
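
One common way to realize this (a sketch, not necessarily Breeze's exact grouping) is to leave out every n-th object at a different offset in each round:

```python
def evenly_spread_groups(n_objects, rounds):
    """Round r leaves out objects at indices r, r + rounds, r + 2*rounds, ...
    so the exclusions are spaced evenly across the dataset."""
    return [list(range(r, n_objects, rounds)) for r in range(rounds)]

evenly_spread_groups(7, 3)  # [[0, 3, 6], [1, 4], [2, 5]]
```

Each object is left out exactly once, and each round's exclusions span the whole dataset.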

Sequential Group

This method divides the training set into a predefined number of sequential groups. Each group consists of consecutive objects from the dataset, ensuring that similar patterns in the order of the data are captured. This is useful when the data has a natural sequence or time component that needs to be preserved.
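
A minimal sketch of this splitting, assuming groups of (nearly) equal size over consecutive indices:

```python
def sequential_groups(n_objects, rounds):
    """Split object indices into `rounds` contiguous blocks; the first
    blocks absorb any remainder so sizes differ by at most one."""
    base, extra = divmod(n_objects, rounds)
    groups, start = [], 0
    for r in range(rounds):
        size = base + (1 if r < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

sequential_groups(7, 3)  # [[0, 1, 2], [3, 4], [5, 6]]
```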

Category

This method excludes one class or sub-class at a time for the selected category. It is particularly useful when objects within the same class are highly similar, which can make cross-validation results unrealistically optimistic. By excluding an entire class in each cross-validation round, this method tests the model's robustness against class similarity. The number of cross-validation rounds is automatically determined and cannot be manually set when this method is used.
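
The grouping can be sketched as follows: one round per distinct class, with each round leaving out every member of that class (illustrative helper, not Breeze's API):

```python
def category_groups(labels):
    """One cross-validation round per distinct class label: round k
    leaves out all objects belonging to the k-th class, so the number
    of rounds equals the number of classes."""
    classes = sorted(set(labels))
    return [[i for i, lab in enumerate(labels) if lab == c] for c in classes]

category_groups(["a", "b", "a", "c"])  # [[0, 2], [1], [3]]
```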

Random

This method randomly excludes objects from the dataset in each round. Random exclusion is often used to simulate a broad variety of training/testing conditions by preventing any pattern or bias in object selection, helping to ensure the generalizability of the model.
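
A sketch of random grouping, assuming a shuffle followed by dealing indices into groups so each object is still left out exactly once:

```python
import random

def random_groups(n_objects, rounds, seed=None):
    """Shuffle all object indices, then deal them into `rounds` groups.
    Exclusion order is random, but every object appears in exactly one group."""
    rng = random.Random(seed)
    idx = list(range(n_objects))
    rng.shuffle(idx)
    return [idx[r::rounds] for r in range(rounds)]
```

Passing a seed makes the split reproducible, which is useful when comparing model settings.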

Leave-One-Out

In this method, one object is excluded at a time. The model is trained on all but one object and tested on the excluded object, repeating this for every object in the dataset. Leave-one-out cross-validation is computationally expensive for large datasets, since the number of rounds equals the number of objects, but it provides the most granular validation possible.
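
The leave-one-out loop can be illustrated with a deliberately trivial model that predicts each left-out value from the mean of the remaining ones (a sketch only; Breeze refits the PLS model in each round):

```python
def leave_one_out_press(y):
    """Leave-one-out with a trivial mean model: each object is predicted
    from the mean of all the others, and squared errors are summed (PRESS)."""
    total = 0.0
    for i, yi in enumerate(y):
        rest = y[:i] + y[i + 1:]
        pred = sum(rest) / len(rest)
        total += (yi - pred) ** 2
    return total

leave_one_out_press([1.0, 3.0])  # each predicted as the other: 4.0 + 4.0 = 8.0
```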

Stratified

Stratified cross-validation aims to maintain the class distribution across all groups. It divides the dataset into groups in a way that preserves the original distribution of categories. Objects are selected for each group based on the class they belong to, ensuring that each group is a mini-representation of the overall class distribution. This method is particularly effective for imbalanced datasets, where preserving the proportion of each class is important for reliable model evaluation. The algorithm ensures that similar classes are kept together across groups, preventing any skewed representation of the training or test sets.
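
One straightforward way to build such groups (a sketch under the assumption that objects of each class are dealt round-robin into the groups) is:

```python
from collections import defaultdict

def stratified_groups(labels, rounds):
    """Deal each class's objects round-robin into `rounds` groups so every
    group roughly preserves the overall class proportions."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    groups = [[] for _ in range(rounds)]
    for members in by_class.values():
        for k, idx in enumerate(members):
            groups[k % rounds].append(idx)
    return groups

# 4 objects of class "a", 2 of class "b", split into 2 groups:
stratified_groups(["a", "a", "a", "a", "b", "b"], 2)  # [[0, 2, 4], [1, 3, 5]]
```

Each group here contains two "a" objects and one "b" object, mirroring the 2:1 ratio of the full dataset.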

Rounds

The user can specify the number of cross-validation rounds, which defaults to 7. Increasing the number of rounds decreases the number of objects excluded in each round, allowing for more granular testing across the dataset. This parameter is flexible except when using exclusion by category, where the number of rounds is automatically determined.
