Keywords: confusion matrices
Problem: Find which samples suffer from a specific type of error (defined by a cell of a confusion matrix)
Solution: Use the sdconfmatind function to find the indices of samples in a specific cell of a confusion matrix.
Let us assume a two-class fruit data set ('apple' and 'banana') split into a training and a test set:
>> load fruit; a=a(:,:,[1 2])
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
>> [tr,ts]=randsubset(a,0.5)
Banana Set, 50 by 2 dataset with 2 classes: [25 25]
Banana Set, 50 by 2 dataset with 2 classes: [25 25]
We train a Gaussian model, apply the model to the test set, and obtain the classifier decisions (dec):
>> p=sdgauss(tr)
sequential pipeline 2x1 'Gaussian model+Decision'
1 Gaussian model 2x2 full cov.mat.
2 Decision 2x1 weighting, 2 classes
>> dec=ts*p
sdlab with 100 entries, 2 groups: 'apple'(57) 'banana'(43)
The confusion matrix compares the ground-truth labels, stored in the test data set ts, to the decisions dec (passed here as ts*p):
>> sdconfmat(ts.lab,ts*p)
ans =
 True      | Decisions
 Labels    |  apple  banana | Totals
---------------------------------------
 apple     |     46       4 |     50
 banana    |     11      39 |     50
---------------------------------------
 Totals    |     57      43 |    100
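The per-class error rates follow directly from the matrix above: 4 of 50 apples (8%) and 11 of 50 bananas (22%) are misclassified. A minimal sketch in plain MATLAB, assuming the counts are copied by hand into a hypothetical variable C (this is not a perClass call):

```matlab
% Confusion matrix counts copied from the output above
% (rows: true labels, columns: decisions; class order: apple, banana).
C = [46  4;
     11 39];

% Per-class error: fraction of each true class assigned to the other class.
classErr = 1 - diag(C) ./ sum(C, 2);

% Mean error over classes (each class weighted equally).
meanErr = mean(classErr);

fprintf('apple: %.2f  banana: %.2f  mean: %.2f\n', classErr(1), classErr(2), meanErr);
```

The off-diagonal cells are the ones worth inspecting sample by sample.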
We would now like to find the 4 apple samples that are misclassified as banana by our classifier. We call the sdconfmatind function with the ground-truth labels, the decisions, and the true and estimated classes that define the confusion matrix cell (here 'apple' and 'banana'):
>> ind=sdconfmatind(ts.lab,ts*p,'apple','banana')
ind =
15
23
25
48
The four test samples are:
>> ts(ind)
'Fruit set' 4 by 2 sddata, class: 'apple'
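Conceptually, selecting a confusion matrix cell means finding the samples where the true label and the decision both match the requested classes. The following is a rough sketch of that idea in plain MATLAB, using made-up labels stored in cell arrays (trueLab and decLab are hypothetical names). It illustrates the selection logic only and is not how sdconfmatind is implemented:

```matlab
% Toy ground-truth labels and decisions (made-up values, for illustration only).
trueLab = {'apple','apple','banana','apple','banana','banana'};
decLab  = {'apple','banana','banana','banana','apple','banana'};

% Samples in the cell (true 'apple', decided 'banana'):
% both conditions must hold for the same sample.
inCell = strcmp(trueLab, 'apple') & strcmp(decLab, 'banana');

% Convert the logical mask to sample indices.
ind = find(inCell);
disp(ind)
```

With these toy labels, samples 2 and 4 are apples decided as banana, so ind contains 2 and 4.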
How do we find these samples in the original data set a? It is actually quite easy. When we display the details of data set a with the transpose operator...
>> a'
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
sample props: 'lab'->'class' 'class'(L) 'ident'(N)
feature props: 'featlab'->'featname' 'featname'(L)
data props: 'data'(N)
... we can see there is a numerical ident property. We stored the index of each sample in it using:
>> a.ident=1:length(a)
'Fruit set' 200 by 2 sddata, 2 classes: 'apple'(100) 'banana'(100)
We may retrieve the ident property of any sample:
>> a(10).ident
ans =
10
Therefore, we may quickly see the original indices of the misclassified test samples:
>> ts(ind).ident
ans =
33
52
55
97
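The reason this works is that the identifier travels with each sample through any subset operation. A minimal sketch of the same bookkeeping in plain MATLAB, assuming a hypothetical 10-sample set and a random 50/50 split (all variable names are illustrative, not perClass properties):

```matlab
% Hypothetical stand-ins: 10 original samples and a random 50/50 split.
n = 10;
ident = 1:n;                 % original index stored for each sample
perm = randperm(n);
tsIdx = perm(1:n/2);         % positions in the original set that form the test set
tsIdent = ident(tsIdx);      % identifiers are subset together with the samples

% Indices relative to the test set (e.g. produced by an error analysis step).
indInTest = [2 5];

% Original indices of those test samples.
origInd = tsIdent(indInTest);
disp(origInd)
```

Because the split is random, the printed values differ from run to run, but they always point back to positions in the original set.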