Cross-validation over replicas
What is a replica?
A replica is a repeated measurement of the same physical sample. In order to estimate the true generalization performance of our models, we should keep all replicas of a specific physical object either in the training set or in the test set, but never split them between both. The reason is that having very similar examples in both the training set and the test set makes the estimated performance of our models positively biased (over-optimistic). Our models have already seen very similar data in training, so correct results on such data in the test set do not necessarily translate into good generalization capabilities. By generalization we mean robust performance on entirely unseen examples.
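For readers who want to experiment with this principle outside perClass Mira, the same replica-aware splitting can be sketched with scikit-learn's LeaveOneGroupOut splitter. This is only an illustration using synthetic data and hypothetical group labels; the Mira workflow described below requires no code:

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    # Synthetic example: six scans, where pairs of scans are replicas
    # of the same physical vial (one vial per mixing proportion).
    X = np.random.rand(6, 10)              # dummy spectra, one row per scan
    y = np.array([0, 0, 30, 30, 60, 60])   # mixing proportion per scan
    vial = np.array([0, 0, 1, 1, 2, 2])    # vial identity (the group)

    # Each fold holds out ALL replicas of one vial, so no physical
    # object is shared between the training and test scans.
    for fold, (tr, te) in enumerate(LeaveOneGroupOut().split(X, y, groups=vial)):
        print(f"fold {fold}: train scans {tr}, test scans {te}")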
Cross-validation over replicas in perClass Mira
If the replica information is preserved in the scan names, we can easily instruct the Cross-validation tool to split the data over replicas rather than over individual images. For example, in our powder data set, the A/B/C letters indicate replicas of the same physical vial.
We keep the leave-one-out method selected in the Selection tab, return to the Samples tab, and change the selection from Image to Samples. This means that we need to define what a sample consists of for cross-validation. The default definition is invalid, which leads to all selected images being flagged in red. We can now define a regular expression that parses the image names, together with a sample definition that is used for the cross-validation.
An example solution in our case is to extract the mixing proportion from the scan name and to construct a new sample name that lists only the mixing proportion, nothing else. This ensures that one vial (i.e. one mixing proportion) ends up either in the training set or in the test set, but is never split between both.
Technically, we provide a regular expression that matches each scan name, allowing for the replica to be denoted by a single capital letter in the A-Z range. After the underscore, we capture one or more digits up to the next underscore. The capture (the part of the string that will be extracted and made available for reference) is enclosed in parentheses. The \d refers to a single digit, and the + sign after it means that the digit repeats one or more times. This is standard regular expression syntax, which is very handy when dealing with structured patterns in strings.
TIP: For reference information on regular expressions, see https://en.wikipedia.org/wiki/Regular_expression.
We also fill in the output pattern in the Sample name field above. The important point is that we may refer here to any capture using the $1, $2, etc. syntax, denoting the 1st, 2nd, or later capture (the text matched within the corresponding parentheses of the regular expression).
The table shows how the original scan names translate into our new sample definition. Samples with the same name are grouped together, which is indicated by the background color.
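As a minimal sketch of this translation, the snippet below applies a regular expression of the kind described above to a few hypothetical scan names (the exact names and pattern in your project may differ) and groups the scans by the resulting sample name. Note that Python's re module uses \1 in the replacement where perClass Mira uses $1:

    import re
    from collections import defaultdict

    # Hypothetical scan names: <prefix>_<mixing proportion>_<replica letter>
    scans = ["powder_0_A", "powder_0_B", "powder_60_A", "powder_60_B", "powder_60_C"]

    # Capture one or more digits between underscores; a single capital
    # letter in the A-Z range denotes the replica.
    pattern = r"^\w+_(\d+)_[A-Z]$"

    groups = defaultdict(list)
    for name in scans:
        # The replacement r"\1" keeps only the captured mixing proportion
        # (perClass Mira would use $1 to refer to the same capture).
        sample = re.sub(pattern, r"\1", name)
        groups[sample].append(name)

    for sample, members in groups.items():
        print(sample, "<-", members)   # e.g. 60 <- ['powder_60_A', ...]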
By clicking Start session, we initiate a new cross-validation session in which we will perform leave-one-out over samples. We also press Model search in the Regression panel in order to directly see the test samples in the Regression plot. Note that the first fold now covers all replicas of the vial with mixing proportion 0, which in our case are two images.
By selecting a different fold using the spinbox and re-running the Model search, we will exclude all replicas of another vial from training:
Note that in fold 6, we have three replicas of the vial with mixing proportion 60; all three are now in the test set.
For each fold model, we may copy the regression performance from the Statistics tab to the clipboard and paste it into an Excel sheet. In this way, we gradually build a table of per-fold results from which we can assess the statistical variability of each measure.
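Instead of Excel, the same per-fold aggregation can be sketched in a few lines of Python; the RMSE values below are placeholders for illustration, not results from the powder project:

    import numpy as np

    # Placeholder per-fold test RMSE values pasted from the Statistics tab
    rmse_per_fold = np.array([1.8, 2.1, 1.5, 2.4, 1.9, 2.0])

    # Summarize the variability of the measure across folds
    print(f"mean RMSE: {rmse_per_fold.mean():.2f}")
    print(f"std  RMSE: {rmse_per_fold.std(ddof=1):.2f}")  # sample std over folds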
In this section, we have seen how to fairly assess the performance of our regression model on unseen vials in the powder project.