- 19.1. Introduction
- 19.2. Algorithm function
- 19.3. Algorithm training and execution
- 19.4. Algorithms performing ROC and setting an operating point
- 19.5. Setting algorithm current operating point
- 19.6. Converting algorithms into pipelines for out-of-Matlab execution

# 19.1. Introduction

Designing a pattern recognition system requires handling many low-level components such as feature extractors, models and decision functions. perClass provides a tool to model the entire pattern recognition system: the `sdalg` algorithm.

Major features of the `sdalg` algorithm:

- **Simple to define**: Algorithm training and execution are defined in a single Matlab function.
- **Parametrized**: The user may define named parameters so that one algorithm describes a whole class of methods. For example, the classifier model may be a parameter.
- **Self-contained**: All algorithm parameters are kept inside the algorithm object (including the operating point), so untrained or trained algorithms may be stored and reused.
- **Easy to deploy**: The user controls the conversion of a trained algorithm into a pipeline for fast execution using the perClass runtime library.

# 19.2. Algorithm function

Let us illustrate the use of the `sdalg` algorithm on a simple example. We consider a classifier trained in two steps: first reducing the dimensionality using Principal Component Analysis (PCA), and then training a classifier in the resulting subspace. This classification system is fully defined in the function `sda_pca_clf`.

```
 1: function out = sda_pca_clf(alg,data)
 3: if nargin==0
 5:    alg=sdalg(mfilename);
 7:    alg.frac=0.9;
 8:    alg.clf=sdlinear;
10:    out=alg;
12: elseif totrain(alg)
14:    alg.pca=sdpca(data,alg.frac);
15:    data=data*alg.pca;
17:    alg.trclf=data*alg.clf;
19:    out=setstate(alg,'trained');
21: elseif toexecute(alg)
23:    data=data*alg.pca;
24:    out=data*alg.trclf;
26: end
```

The algorithm function takes two input parameters, the algorithm object `alg` and the data set `data`, and returns the output `out`. The type of `out` differs depending on whether the algorithm is being trained or executed.

The function contains three sections. The first describes algorithm initialization (lines 3-10), the second training (lines 12-19) and the third execution (lines 21-24).
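The same pattern, a single function that either initializes, trains, or executes depending on the algorithm's state, can be sketched outside Matlab. The Python snippet below only illustrates the control flow; all names and the trivial "training" step are made up and are not part of perClass:

```python
def pca_clf_alg(alg=None, data=None):
    """One function covers initialization, training and execution,
    dispatching on the algorithm's state (made-up stand-in for sdalg)."""
    if alg is None:
        # initialization section: create the object with default parameters
        return {"state": "untrained", "frac": 0.9, "threshold": None}
    if alg["state"] == "untrained":
        # training section: fit parameters on data and mark as trained
        trained = dict(alg)
        trained["threshold"] = sum(data) / len(data)  # stand-in for PCA+classifier
        trained["state"] = "trained"
        return trained
    # execution section: apply the trained parameters to new data
    return ["pos" if x > alg["threshold"] else "neg" for x in data]
```

Calling it with no arguments returns an untrained object; calling it with data either trains it or executes it, mirroring the three sections above.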

In the **initialization section**, we instantiate the algorithm object and attach it to the function (in our case to `sda_pca_clf`). We may then add arbitrary parameters useful for making our algorithm more general. In our case, we want to adjust the fraction of preserved variance of the PCA dimensionality reduction and the classifier model. Algorithm parameters behave analogously to Matlab structure fields, because `sdalg` is nothing more than a structure connected to a function.

The final statement in the initialization section makes sure the algorithm object `alg` gets returned in the variable `out`.

The **training section** describes training of an initialized algorithm on the `sddata` object `data`. First we train the PCA projection and store it as `alg.pca`. On line 15, we project the input data using the trained PCA. Line 17 trains the model `alg.clf` on the projected `data` and adds a default operating point. Finally, we set the algorithm state to 'trained' and return it on line 19.

The **execution section** is invoked on a trained algorithm and an `sddata` object `data`. Our algorithm merely projects the input `data` by the trained PCA projection and then applies the `alg.trclf` pipeline, which executes the trained classifier model and performs decisions. The output `out` is therefore an `sdlab` object with decisions for each sample in `data`.
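The parameter `alg.frac=0.9` asks `sdpca` to preserve a fraction of the total variance. As a language-neutral illustration of what such a fraction means, the hypothetical Python helper below picks the smallest number of leading principal components whose eigenvalues cover the requested fraction (a sketch, not the perClass implementation):

```python
def n_components_for_fraction(eigenvalues, frac):
    """Return the smallest number of leading eigenvalues whose cumulative
    sum reaches `frac` of the total variance (hypothetical helper)."""
    total = sum(eigenvalues)
    acc = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        acc += ev
        if acc >= frac * total:
            return k
    return len(eigenvalues)
```

For eigenvalues `[5.0, 3.0, 1.5, 0.5]` and `frac=0.9`, three components suffice, since 5.0+3.0+1.5 already covers 9.5 out of 10 units of variance.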

# 19.3. Algorithm training and execution

We can construct an untrained algorithm by simply calling its definition function without parameters:

**>> alg=sda_pca_clf**
untrained sdalg 'sda_pca_clf'
frac 1x1 8 double 0.9
clf 0x0 1034 sdppl untrained sdlinear

We will use the medical problem in our example:

**>> load medical**
**>> a**
'medical all' 259783 by 11 sddata, 3 classes: 'cancer'(56652) 'non-cancer'(168467) 'shadow'(34664)
**>> b=a(:,:,1:2)** % only cancer and non-cancer classes
'medical all' 225119 by 11 sddata, 2 classes: 'cancer'(56652) 'non-cancer'(168467)

We will use the first 8 patients as a test set and the remaining ones for training:

**>> [ts,tr]=subset(b,'patient',1:8)**
'medical all' 112053 by 11 sddata, 2 classes: 'cancer'(33785) 'non-cancer'(78268)
'medical all' 113066 by 11 sddata, 2 classes: 'cancer'(22867) 'non-cancer'(90199)

To train the algorithm, we call the algorithm function directly:

**>> tralg=sda_pca_clf(alg,tr)**
trained sdalg 'sda_pca_clf'
frac 1x1 8 double 0.9
clf 0x0 1034 sdppl untrained sdlinear
pca 11x2 5272 sdppl trained sdp_affine
trclf 2x1 17440 sdppl trained sdp_decide

A simpler alternative is to use the multiplication operator:

**>> tralg=tr*alg**
trained sdalg 'sda_pca_clf'
frac 1x1 8 double 0.9
clf 0x0 1034 sdppl untrained sdlinear
pca 11x2 5272 sdppl trained sdp_affine
trclf 2x1 17440 sdppl trained sdp_decide

Training added the `pca` and `trclf` fields. We may display the trained classifier `trclf`:

**>> tralg.trclf**
sequential pipeline 2x1 'Gauss eq.cov.+Output normalization+Decision'
1 Gauss eq.cov. 2x2 2 classes, 2 components (sdp_normal)
2 Output normalization 2x2 (sdp_norm)
3 Decision 2x1 weighting, 2 classes, 1 ops at op 1 (sdp_decide)

To execute the algorithm, apply it to the test set:

**>> dec=ts*tralg**
sdlab with 112053 entries, 2 groups: 'cancer'(15134) 'non-cancer'(96919)

Algorithms may be executed only on `sddata` objects and the execution output may be either decisions (`sdlab` object) or soft output (`sddata` object).

# 19.4. Algorithms performing ROC and setting an operating point

In the example above, we trained a PCA projection and a classifier. You may ask: why bother writing an algorithm function when it may be expressed as a simple one-liner?

**>> p=sdpca([],0.9)*sdlinear**
untrained pipeline 2 steps: sdpca+sdlinear
**>> ptr=a*p**
sequential pipeline 2x1 'PCA+Gaussian model+Normalization+Decision'
1 PCA 2x2 100% of variance
2 Gaussian model 2x3 single cov.mat.
3 Normalization 3x3
4 Decision 3x1 weighting, 3 classes

Good question! In this section, we demonstrate the real-world problem that algorithms were created to address. Any real-world classifier needs more than training a few sequential steps: it requires application-specific performance optimization using ROC analysis. For example, in our fruit sorting system, we must make sure that at least 99% of apples are found with minimal false positives.

This is very simple to achieve in a custom algorithm:

```
 1: function out = sda_pca_clf_roc(alg,data)
 3: if nargin==0
 5:    alg=sdalg(mfilename);
 7:    alg.frac=0.9;
 8:    alg.tpr_apple=0.99; % fix desired sensitivity on apple
 9:    alg.clf=sdlinear;
11:    out=alg;
13: elseif totrain(alg)
15:    % split the available training data
16:    [tr,val]=randsubset(data,0.5);
18:    % train model
19:    alg.pca=sdpca(tr,alg.frac);
20:    tr=tr*alg.pca;
21:    alg.trclf=tr*alg.clf;
23:    % estimate ROC (on validation set)
24:    alg.roc=sdroc(val,alg.trclf,'measures',{'FPr','apple','TPr','apple'});
26:    % constrain TPr(apple) and minimize the FPr(apple) in one step:
27:    alg.roc=setcurop(alg.roc,'constrain','TPr(apple)',alg.tpr_apple,...
28:                     'min','FPr(apple)');
30:    out=setstate(alg,'trained');
32: elseif toexecute(alg)
34:    p=alg.pca*alg.trclf*alg.roc;
35:    out=data*p;
36: end
```

On line 8, we add a parameter `tpr_apple` specifying the desired sensitivity (true positive rate). On line 16, we split the available training data into a part used for training a model (`tr`) and a validation set (`val`) used for ROC estimation. This is needed in order to avoid a positive bias of the ROC characteristic. Line 24 estimates the ROC with TPr and FPr measures. On line 27, we set the operating point by first constraining TPr and then minimizing FPr. Finally, in the execution section, we construct the final pipeline composed of the PCA projection, the model and the ROC.
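The constrain-then-minimize selection performed by `setcurop` can be illustrated with a small sketch. Assuming each ROC operating point is reduced to a (TPr, FPr) pair, a hypothetical selector might look like this in Python (illustrative only, not the perClass implementation):

```python
def pick_operating_point(ops, tpr_target):
    """Among ROC operating points given as (tpr, fpr) pairs, keep those
    satisfying the constraint tpr >= tpr_target and return the index of
    the one with minimal fpr (hypothetical helper)."""
    feasible = [(fpr, i) for i, (tpr, fpr) in enumerate(ops) if tpr >= tpr_target]
    if not feasible:
        raise ValueError("no operating point reaches the required sensitivity")
    return min(feasible)[1]
```

With points `[(0.90, 0.01), (0.99, 0.04), (0.995, 0.10)]` and a 0.99 sensitivity target, the second point is chosen: it meets the constraint at the lowest false positive rate.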

**>> alg=sda_pca_clf_roc**
**>> alg.clf=sdparzen**
untrained sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.99
clf 0x0 1822 sdppl untrained Parzen
**>> tralg=a*alg**
....trained sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.99
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc
**>> tralg.roc**
ROC (159 w-based op.points, 3 measures), curop: 83
est: 1:FPr(apple)=0.04, 2:TPr(apple)=0.99, 3:mean-error=0.03

The trained algorithm now contains the ROC object with properly set operating point.

Note that this algorithm may now be easily applied to new data or cross-validated using the `sdcrossval` function.
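The point of cross-validating a whole algorithm is that the complete procedure, including the internal validation split and operating-point selection, is repeated in every fold. A rough language-neutral sketch of that idea (hypothetical names, not the `sdcrossval` API):

```python
def crossval_error(train, execute, xs, ys, n_folds=5):
    """k-fold cross-validation of a whole algorithm: the full training
    procedure is re-run on each training fold and evaluated only on the
    held-out samples (rough sketch)."""
    n = len(xs)
    errors = 0
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))  # every n_folds-th sample
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train(tr_x, tr_y)                # whole algorithm retrained
        errors += sum(execute(model, xs[i]) != ys[i] for i in held_out)
    return errors / n
```

Because `train` is the entire algorithm function, nothing trained on a fold ever leaks into the samples that fold is tested on.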

# 19.5. Setting algorithm current operating point

The process of setting the current operating point may be specified in a separate algorithm section. This allows us to re-run it after training. We simply add a `tosetcurop` section to the algorithm function; it fixes a new operating point in an already trained algorithm:

```
elseif tosetcurop(alg)
   % constrain TPr(apple) and minimize the FPr(apple) in one step:
   alg.roc=setcurop(alg.roc,'constrain','TPr(apple)',alg.tpr_apple,...
                    'min','FPr(apple)');
   out=alg;
end
```

Note that no extra parameters are passed from outside. The code simply uses the parameters set in the `alg` object.

Say we have a trained algorithm built in the previous section:

**>> tralg=a*alg**
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.99
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc

It fixed the operating point at 99% TPr. We now wish to set a different one, at 95% TPr. We simply update the `tpr_apple` parameter:

**>> tralg.tpr_apple=0.95**
trained sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.95
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc

And call the `setcurop` method on the algorithm:

**>> tralg=setcurop(tralg)**
tosetcurop sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.95
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc

The returned algorithm now uses the new operating point:

**>> tralg.roc**
ROC (159 w-based op.points, 3 measures), curop: 76
est: 1:FPr(apple)=0.01, 2:TPr(apple)=0.95, 3:mean-error=0.03

# 19.6. Converting algorithms into pipelines for out-of-Matlab execution

In order to execute a complex algorithm outside of Matlab, we need to convert it into an `sdppl` pipeline. We may add an extra `toconvert` section in the algorithm function to describe this conversion.

For our example algorithm, the pipeline construction is simple: we only need to return the concatenation of the PCA projection with the trained classifier.

```
function out = sda_pca_clf(alg,data)
if nargin==0
   alg=sdalg(mfilename);
   alg.frac=0.9;
   alg.clf=sdlinear;
   out=alg;
elseif totrain(alg)
   alg.pca=sdpca(data,alg.frac);
   data=data*alg.pca;
   alg.trclf=data*alg.clf;
   out=alg;
elseif toexecute(alg)
   data=data*alg.pca;
   out=data*alg.trclf;
elseif toconvert(alg)
   out=alg.pca*alg.trclf;
end
```

The trained algorithm may now be converted using the `sdconvert` function:

**>> p=sdconvert(tralg)**
sequential pipeline 11x1 'PCA+Gauss eq.cov.+Output normalization+Decision'
1 PCA 11x2 93% of variance (sdp_affine)
2 Gauss eq.cov. 2x2 2 classes, 2 components (sdp_normal)
3 Output normalization 2x2 (sdp_norm)
4 Decision 2x1 weighting, 2 classes, 1 ops at op 1 (sdp_decide)
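The `toconvert` section builds the deployable pipeline by chaining stages with the `*` operator. The idea of such chaining can be sketched in Python (purely illustrative; real perClass pipelines additionally carry metadata such as dimensions and operating points):

```python
def compose(*stages):
    """Chain callable stages into a single pipeline function,
    analogous to alg.pca * alg.trclf (illustrative only)."""
    def pipeline(x):
        for stage in stages:
            x = stage(x)  # output of each stage feeds the next
        return x
    return pipeline
```

For example, `compose(lambda x: x + 1, lambda x: x * 2)` applied to 3 first adds one and then doubles, exactly the left-to-right order in which a perClass sequential pipeline executes its steps.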