Problem: How to speed up execution of classifiers using PCA/LDA extraction
Solution: We can join affine and scaling pipeline steps with the + operator.
32.1. Introduction
A common pattern in many pattern-recognition systems is to use Principal Component Analysis (PCA) to reduce dimensionality and Linear Discriminant Analysis (LDA) to derive a low-dimensional subspace separating the classes.
Both methods are often used together because LDA alone, applied to high-dimensional data with few examples, may easily over-fit, losing generalization on unseen data.
In this article, we demonstrate a quick way to speed up classifier execution by joining pipeline steps with the + operator.
32.2. Example of hyperspectral classification
As an example, we take classification of defects in French fries. Each data sample represents a pixel of a hyperspectral image with 103 narrow spectral bands. Our data set comprises almost 38 thousand examples labeled in five images. We have four classes of interest, namely healthy potato flesh, residual potato skin ("peel"), and two types of defects: rot and greening:
>> tr
37967 by 103 sddata, 4 classes: 'rot'(10066) 'green'(3062) 'peel'(12025) 'flesh'(12814)
>> tr.image
sdlab with 37967 entries, 5 groups:
'agria001'(4958) 'agria003'(5501) 'agria004'(9452) 'agria005'(7734) 'agria006'(10322)
We will use a combination of PCA followed by LDA dimensionality reduction. The PCA reduces the dimensionality from the input 103 bands to 20, preserving all available variance:
>> p=sdpca(tr,20)
PCA pipeline 103x20 100% of variance
This is common in hyperspectral data sets because of the high level of correlation between adjacent bands.
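Outside perClass, the same effect is easy to reproduce. The following is a minimal numpy sketch, using synthetic made-up "spectra" as a stand-in for the real data, showing why strongly correlated bands let 20 components keep essentially all variance:

```python
import numpy as np

# Hypothetical stand-in for the hyperspectral data: 1000 "spectra" with 103
# strongly correlated bands (each band is a mixture of a few latent factors,
# mimicking adjacent-band correlation).
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))              # 5 underlying factors
mixing = rng.normal(size=(5, 103))
X = latent @ mixing + 0.01 * rng.normal(size=(1000, 103))

# PCA via SVD of the mean-centered data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var = s**2 / np.sum(s**2)

# With correlated bands, 20 components capture essentially all variance
print(f"variance kept by 20 components: {var[:20].sum():.4f}")
```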
Our second feature-extraction step employs class information and derives a three-dimensional subspace best separating our classes of interest. Note that the output dimensionality is determined as the number of classes minus one.
>> p2=sdlda(tr*p)
LDA pipeline 20x3
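The "classes minus one" limit follows from the rank of the between-class scatter matrix that LDA maximizes against. A small numpy check with made-up class means (not the perClass implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
c, d = 4, 20                       # 4 classes in a 20-D feature space
means = rng.normal(size=(c, d))    # hypothetical per-class means
counts = np.array([50, 60, 70, 80])
overall = np.average(means, axis=0, weights=counts)

# Between-class scatter: weighted outer products of mean differences
Sb = sum(n * np.outer(m - overall, m - overall)
         for n, m in zip(counts, means))

# The c rank-1 terms are linearly dependent (weighted differences sum to
# zero), so rank(Sb) <= c-1; hence LDA yields at most c-1 = 3 dimensions
print(np.linalg.matrix_rank(Sb))   # → 3
```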
Finally, we train a multi-layer perceptron classifier with 10 units in the resulting subspace:
>> p3=sdneural(tr*p*p2,'units',10)
sequential pipeline 3x1 'Scaling+Neural network'
1 Scaling 3x3 standardization
2 MLP neural network 3x4 10 units
3 Decision 4x1 weighting, 4 classes
32.3. Execution speed of the entire classifier
The complete chain of processing is expressed by the pipeline pall, first applying PCA, then LDA and finally the neural network:
>> pall=p*p2*p3
sequential pipeline 103x1 'PCA+LDA+Scaling+Neural network'
1 PCA 103x20 100% of variance
2 LDA 20x3
3 Scaling 3x3 standardization
4 MLP neural network 3x4 10 units
5 Decision 4x1 weighting, 4 classes
We will measure the classification speed on a realistic example, executing on a new hyperspectral cube:
>> load cube3
im3 544x318x103 142545408 double
>> im3=sdimage(im3,'sddata')
172992 by 103 sddata, class: 'unknown'
As discussed in knowledge base article 31, we measure execution speed with the sdexe command, which times the tight internal loop relevant in real-time applications:
>> [out,t]=sdexe(pall,im3);
>> [out,t]=sdexe(pall,im3);
>> t
t =
0.3550
The result is 355 ms needed to process 173 000 spectral measurements.
32.4. Joining pipeline steps
When we look closer at the pipeline pall, we can see that the first two steps are affine projections of the input data:
>> pall
sequential pipeline 103x1 'PCA+LDA+Scaling+Neural network'
1 PCA 103x20 100% of variance
2 LDA 20x3
3 Scaling 3x3 standardization
4 MLP neural network 3x4 10 units
5 Decision 4x1 weighting, 4 classes
We may join them into a single affine projection with the + operator:
>> pall2=pall(1) + pall(2:end)
sequential pipeline 103x1 'PCA+LDA+Scaling+MLP neural network+Decision'
1 PCA+LDA 103x3
2 Scaling 3x3 standardization
3 MLP neural network 3x4 10 units
4 Decision 4x1 weighting, 4 classes
This reduces the number of multiplication and addition operations needed for each new spectrum. As a result, we observe a significant speedup:
>> [out,t]=sdexe(pall2,im3);
>> [out,t]=sdexe(pall2,im3);
>> t
t =
0.1609
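Joining two affine steps is nothing more than function composition: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). A minimal numpy sketch of the idea (random matrices here are stand-ins for the trained PCA and LDA projections, not the perClass internals):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-ins for the two affine steps:
# PCA (103 -> 20) and LDA (20 -> 3)
W1, b1 = rng.normal(size=(20, 103)), rng.normal(size=20)
W2, b2 = rng.normal(size=(3, 20)),  rng.normal(size=3)

# Joining the steps is affine composition:
#   W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W, b = W2 @ W1, W2 @ b1 + b2

x = rng.normal(size=103)                 # one "spectrum"
y_chain  = W2 @ (W1 @ x + b1) + b2       # two separate steps
y_joined = W @ x + b                     # single joined step

print(np.allclose(y_chain, y_joined))    # → True

# Per-sample multiplications drop from 103*20 + 20*3 = 2120 to 103*3 = 309
```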
32.5. Supported pipeline steps
perClass supports the + operator for the following pipeline steps:
- scaling
- affine
- neural network (MLP first stage is affine)
- multiple feature selection steps
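The reason scaling can participate is that per-feature standardization is itself an affine map: (x - mu) / sigma = diag(1/sigma) x - mu/sigma. A small numpy check with made-up values:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical standardization parameters for a 3-D subspace
mu, sigma = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)

# Standardization (x - mu) / sigma rewritten as an affine map:
#   diag(1/sigma) @ x + (-mu/sigma)
S, c = np.diag(1.0 / sigma), -mu / sigma

x = rng.normal(size=3)
print(np.allclose((x - mu) / sigma, S @ x + c))   # → True
```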
In our example, we may attempt to go even further. First, we join PCA, LDA and scaling into one pipeline action:
>> pall3=pall(1)+pall(2)+pall(3:end)
sequential pipeline 103x1 'PCA+LDA+Scaling+MLP neural network+Decision'
1 PCA+LDA+Scaling 103x3
2 MLP neural network 3x4 10 units
3 Decision 4x1 weighting, 4 classes
>> [out,t]=sdexe(pall3,im3);
>> [out,t]=sdexe(pall3,im3);
>> t
t =
0.1575
Here, we still observe a small speed improvement.
Finally, we may also attempt to join PCA, LDA, scaling and the neural network model together:
>> pall4=pall(1) + pall(2) + pall(3) + pall(4:end)
sequential pipeline 103x1 'PCA+LDA+Scaling+MLP neural network+Decision'
1 PCA+LDA+Scaling+MLP neural network 103x4 10 units
2 Decision 4x1 weighting, 4 classes
>> [out,t]=sdexe(pall4,im3);
>> [out,t]=sdexe(pall4,im3);
>> t
t =
0.2990
The last option does not increase speed because we are now executing the neural network model directly on the input 103-dimensional space instead of the 3-dimensional subspace used in the fastest model, pall3.
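A back-of-the-envelope count of per-spectrum multiplications explains the timings, assuming the cost is dominated by the dense projections up to the 10-unit first layer of the MLP:

```python
# Per-spectrum multiplication counts up to the MLP first layer
# (a rough cost model, ignoring biases and nonlinearities)
d_in, d_sub, units = 103, 3, 10   # input bands, joined subspace, MLP units

# pall3: joined 103x3 affine step, then 3x10 first layer
separate = d_in * d_sub + d_sub * units   # 103*3 + 3*10 = 339
# pall4: affine step folded into the first layer -> a 103x10 weight matrix
joined   = d_in * units                   # 103*10 = 1030

print(separate, joined)   # → 339 1030
```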
32.6. Validating classifier outputs
As with any speed optimization, it is important to validate that the final classifier outputs are correct. In our example, we are most interested in classifier decisions. Let us compare the original and the fastest classifier.
We prepare the decision objects by applying both classifiers to the entire image data set:
>> dec1=im3*pall
sdlab with 172992 entries, 4 groups: 'rot'(8392) 'green'(4246) 'peel'(124343) 'flesh'(36011)
>> dec3=im3*pall3
sdlab with 172992 entries, 4 groups: 'rot'(8392) 'green'(4246) 'peel'(124343) 'flesh'(36011)
The fastest way is to simply compare the decision objects:
>> any(dec1~=dec3)
ans =
0
We can see that there is no sample in the test image where the two classifiers yield different decisions.
An alternative comparison is to display the confusion matrix:
>> sdconfmat(dec1,dec3)
{Warning: The labels passed as first parameter were created by executing a
classifier. Ground-truth labels should be the first input and decisions the
second.}
ans =
True | Decisions
Labels | rot green peel flesh | Totals
--------------------------------------------------------------------
rot | 8392 0 0 0 | 8392
green | 0 4246 0 0 | 4246
peel | 0 0 124343 0 | 124343
flesh | 0 0 0 36011 | 36011
--------------------------------------------------------------------
Totals | 8392 4246 124343 36011 | 172992
Again, we can see that all examples fall on the diagonal, i.e. there are no discrepancies.
Note the warning generated: typical use of a confusion matrix is to compare ground truth to classifier decisions. perClass performs a safety check to catch the case where the user provides the sdconfmat inputs the other way round. In our example, we may safely ignore this message.
The second form of validation is comparing soft classifier outputs:
% recall that unary minus removes the decision step, if present
>> out1=im3 * -pall;
>> out3=im3 * -pall3;
Soft outputs on the first three pixels:
>> +out1(1:3)
ans =
0.1076 0.0963 0.8893 0.0911
0.0710 0.0926 0.8827 0.0944
0.1051 0.1083 0.8845 0.0882
>> +out3(1:3)
ans =
0.1076 0.0963 0.8893 0.0911
0.0710 0.0926 0.8827 0.0944
0.1051 0.1083 0.8845 0.0882
The values look identical, but a closer look reveals small differences between the two methods:
>> +out1(1:3) - +out3(1:3)
ans =
1.0e-15 *
0.8882 -0.2220 0 0.2220
0.4441 0 0 0.2220
-0.2220 0.4441 -0.4441 0
Note that this is an inevitable consequence of using imprecise floating-point representation (for discussion, see e.g. http://floating-point-gui.de). Even changing the order of summation on identical inputs can lead to small differences.
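A minimal Python illustration of this non-associativity:

```python
# Floating-point addition is not associative: changing the summation order
# on identical inputs changes the last bits of the result
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)        # → False
print(abs(a - b))    # a difference on the order of 1e-16
```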
From the practical viewpoint, we should always validate that our classifier provides correct decisions on a representative set of test data.
32.7. Summary
We have seen how joining pipeline steps with the + operator may yield a 2.25-times faster classifier in a realistic high-throughput hyperspectral classification problem.