- 19.1. Introduction
- 19.2. Algorithm function
- 19.3. Algorithm training and execution
- 19.4. Algorithms performing ROC and setting an operating point
- 19.5. Setting algorithm current operating point
- 19.6. Converting algorithms into pipelines for out-of-Matlab execution

# 19.1. Introduction

Designing a pattern recognition system requires handling many low-level components such as feature extractors, models and decision functions. perClass provides a tool to model the entire pattern recognition system: the `sdalg` algorithm.

Major features of the `sdalg` algorithm:

- **Simple to define**: Algorithm training and execution are defined in a single Matlab function.
- **Parametrized**: The user may define named parameters so that one algorithm describes a whole class of methods. For example, the classifier model may be a parameter.
- **Self-contained**: All algorithm parameters are kept inside the algorithm object (including the operating point), so untrained or trained algorithms may be stored and reused.
- **Easy to deploy**: The user controls the conversion of a trained algorithm into a pipeline for fast execution using the perClass runtime library.

# 19.2. Algorithm function

Let us illustrate the use of the `sdalg` algorithm on a simple example. We consider a classifier trained in two steps: first reducing the dimensionality using Principal Component Analysis (PCA), and then training a classifier in the resulting subspace. This classification system is fully defined in the function `sda_pca_clf`.

```
 1: function out = sda_pca_clf(alg,data)
 3: if nargin==0
 5:    alg=sdalg(mfilename);
 7:    alg.frac=0.9;
 8:    alg.clf=sdlinear;
10:    out=alg;
12: elseif totrain(alg)
14:    alg.pca=sdpca(data,alg.frac);
15:    data=data*alg.pca;
17:    alg.trclf=data*alg.clf;
19:    out=setstate(alg,'trained');
21: elseif toexecute(alg)
23:    data=data*alg.pca;
24:    out=data*alg.trclf;
26: end
```

The algorithm function takes two input parameters, the algorithm object `alg` and the data set `data`, and returns the output `out`. The type of `out` differs depending on whether the algorithm is being trained or executed.

The function contains three sections. The first describes algorithm initialization (lines 3-10), the second training (lines 12-19) and the third execution (lines 21-24).
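The same pattern, a single function that either initializes, trains, or executes depending on the algorithm's state, can be sketched outside Matlab. The Python snippet below only illustrates the control flow; all names and the trivial "training" step are made up and are not part of perClass:

```python
def pca_clf_alg(alg=None, data=None):
    """One function covers initialization, training and execution,
    dispatching on the algorithm's state (made-up stand-in for sdalg)."""
    if alg is None:
        # initialization section: create the object with default parameters
        return {"state": "untrained", "frac": 0.9, "threshold": None}
    if alg["state"] == "untrained":
        # training section: fit parameters on data and mark as trained
        trained = dict(alg)
        trained["threshold"] = sum(data) / len(data)  # stand-in for PCA+classifier
        trained["state"] = "trained"
        return trained
    # execution section: apply the trained parameters to new data
    return ["pos" if x > alg["threshold"] else "neg" for x in data]
```

Calling it with no arguments returns an untrained object; calling it with data either trains it or executes it, mirroring the three sections above.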

In the **initialization section**, we instantiate the algorithm object and attach it to the function (in our case to `sda_pca_clf`). We may then add arbitrary parameters useful for making our algorithm more general. In our case, we want to adjust the fraction of preserved variance of the PCA dimensionality reduction and the classifier model. Algorithm parameters behave analogously to Matlab structure fields, because `sdalg` is nothing more than a structure connected to a function.

The final statement in the initialization section makes sure the algorithm object `alg` gets returned in the variable `out`.

The **training section** describes training of an initialized algorithm on the `sddata` object `data`. First we train the PCA projection and store it as `alg.pca`. On line 15, we project the input data using the trained PCA. Line 17 trains the model `alg.clf` on the projected `data` and adds a default operating point. Finally, we set the algorithm state to 'trained' and return it on line 19.

The **execution section** is invoked on a trained algorithm and an `sddata` object `data`. Our algorithm merely projects the input `data` by the trained PCA projection and then applies the `alg.trclf` pipeline, which executes the trained classifier model and performs decisions. The output `out` is therefore an `sdlab` object with decisions for each sample in `data`.
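The parameter `alg.frac=0.9` asks `sdpca` to preserve a fraction of the total variance. As a language-neutral illustration of what such a fraction means, the hypothetical Python helper below picks the smallest number of leading principal components whose eigenvalues cover the requested fraction (a sketch, not the perClass implementation):

```python
def n_components_for_fraction(eigenvalues, frac):
    """Return the smallest number of leading eigenvalues whose cumulative
    sum reaches `frac` of the total variance (hypothetical helper)."""
    total = sum(eigenvalues)
    acc = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        acc += ev
        if acc >= frac * total:
            return k
    return len(eigenvalues)
```

For eigenvalues `[5.0, 3.0, 1.5, 0.5]` and `frac=0.9`, three components suffice, since 5.0+3.0+1.5 already covers 9.5 out of 10 units of variance.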

# 19.3. Algorithm training and execution

We can construct an untrained algorithm by simply calling its definition function without parameters:

**>> alg=sda_pca_clf**
untrained sdalg 'sda_pca_clf'
frac 1x1 8 double 0.9
clf 0x0 1034 sdppl untrained sdlinear

We will use the medical problem in our example:

**>> load medical**
**>> a**
'medical all' 259783 by 11 sddata, 3 classes: 'cancer'(56652) 'non-cancer'(168467) 'shadow'(34664)
**>> b=a(:,:,1:2)** % only cancer and non-cancer classes
'medical all' 225119 by 11 sddata, 2 classes: 'cancer'(56652) 'non-cancer'(168467)

We will use the first 8 patients as a test set and the remaining ones for training:

**>> [ts,tr]=subset(b,'patient',1:8)**
'medical all' 112053 by 11 sddata, 2 classes: 'cancer'(33785) 'non-cancer'(78268)
'medical all' 113066 by 11 sddata, 2 classes: 'cancer'(22867) 'non-cancer'(90199)

To train the algorithm, we call the algorithm function directly:

**>> tralg=sda_pca_clf(alg,tr)**
trained sdalg 'sda_pca_clf'
frac 1x1 8 double 0.9
clf 0x0 1034 sdppl untrained sdlinear
pca 11x2 5272 sdppl trained sdp_affine
trclf 2x1 17440 sdppl trained sdp_decide

A simpler alternative is to use the multiplication operator:

**>> tralg=tr*alg**
trained sdalg 'sda_pca_clf'
frac 1x1 8 double 0.9
clf 0x0 1034 sdppl untrained sdlinear
pca 11x2 5272 sdppl trained sdp_affine
trclf 2x1 17440 sdppl trained sdp_decide

Training added the `pca` and `trclf` fields. We may display the trained classifier `trclf`:

**>> tralg.trclf**
sequential pipeline 2x1 'Gauss eq.cov.+Output normalization+Decision'
1 Gauss eq.cov. 2x2 2 classes, 2 components (sdp_normal)
2 Output normalization 2x2 (sdp_norm)
3 Decision 2x1 weighting, 2 classes, 1 ops at op 1 (sdp_decide)

To execute the algorithm, apply it to the test set:

**>> dec=ts*tralg**
sdlab with 112053 entries, 2 groups: 'cancer'(15134) 'non-cancer'(96919)

Algorithms may be executed only on `sddata` objects and the execution output may be either decisions (`sdlab` object) or soft output (`sddata` object).

# 19.4. Algorithms performing ROC and setting an operating point

In the example above, we trained a PCA projection and a classifier. You may ask: why bother writing an algorithm function when it may be expressed as a simple one-liner?

**>> p=sdpca([],0.9)*sdlinear**
untrained pipeline 2 steps: sdpca+sdlinear
**>> ptr=a*p**
sequential pipeline 2x1 'PCA+Gaussian model+Normalization+Decision'
1 PCA 2x2 100% of variance
2 Gaussian model 2x3 single cov.mat.
3 Normalization 3x3
4 Decision 3x1 weighting, 3 classes

Good question! In this section, we demonstrate the real-world problem that algorithms were created to address. Any real-world classifier needs more than training a few sequential steps: it requires application-specific performance optimization using ROC analysis. For example, in our fruit sorting system, we must make sure that at least 99% of apples are found with minimal false positives.

This is very simple to achieve in a custom algorithm:

```
 1: function out = sda_pca_clf_roc(alg,data)
 3: if nargin==0
 5:    alg=sdalg(mfilename);
 7:    alg.frac=0.9;
 8:    alg.tpr_apple=0.99; % fix desired sensitivity on apple
 9:    alg.clf=sdlinear;
11:    out=alg;
13: elseif totrain(alg)
15:    % split the available training data
16:    [tr,val]=randsubset(data,0.5);
18:    % train model
19:    alg.pca=sdpca(tr,alg.frac);
20:    tr=tr*alg.pca;
21:    alg.trclf=tr*alg.clf;
23:    % estimate ROC (on validation set)
24:    alg.roc=sdroc(val,alg.trclf,'measures',{'FPr','apple','TPr','apple'});
26:    % constrain TPr(apple) and minimize the FPr(apple) in one step:
27:    alg.roc=setcurop(alg.roc,'constrain','TPr(apple)',alg.tpr_apple,...
28:                     'min','FPr(apple)');
30:    out=setstate(alg,'trained');
32: elseif toexecute(alg)
34:    p=alg.pca*alg.trclf*alg.roc;
35:    out=data*p;
36: end
```

On line 8, we add a parameter `tpr_apple` specifying the desired sensitivity (true positive rate). On line 16, we split the available training data into a part used for training a model (`tr`) and a validation set (`val`) used for ROC estimation. This is needed in order to avoid a positive bias of the ROC characteristic. Line 24 estimates the ROC with TPr and FPr measures. On line 27, we set the operating point by first constraining TPr and then minimizing FPr. Finally, in the execution section, we construct the final pipeline composed of the PCA projection, the model and the ROC.
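The constrain-then-minimize selection performed by `setcurop` can be illustrated with a small sketch. Assuming each ROC operating point is reduced to a (TPr, FPr) pair, a hypothetical selector might look like this in Python (illustrative only, not the perClass implementation):

```python
def pick_operating_point(ops, tpr_target):
    """Among ROC operating points given as (tpr, fpr) pairs, keep those
    satisfying the constraint tpr >= tpr_target and return the index of
    the one with minimal fpr (hypothetical helper)."""
    feasible = [(fpr, i) for i, (tpr, fpr) in enumerate(ops) if tpr >= tpr_target]
    if not feasible:
        raise ValueError("no operating point reaches the required sensitivity")
    return min(feasible)[1]
```

With points `[(0.90, 0.01), (0.99, 0.04), (0.995, 0.10)]` and a 0.99 sensitivity target, the second point is chosen: it meets the constraint at the lowest false positive rate.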

**>> alg=sda_pca_clf_roc**
**>> alg.clf=sdparzen**
untrained sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.99
clf 0x0 1822 sdppl untrained Parzen
**>> tralg=a*alg**
....trained sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.99
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc
**>> tralg.roc**
ROC (159 w-based op.points, 3 measures), curop: 83
est: 1:FPr(apple)=0.04, 2:TPr(apple)=0.99, 3:mean-error=0.03

The trained algorithm now contains the ROC object with properly set operating point.

Note that this algorithm may now be easily applied to new data or cross-validated using the `sdcrossval` function.
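The point of cross-validating a whole algorithm is that the complete procedure, including the internal validation split and operating-point selection, is repeated in every fold. A rough language-neutral sketch of that idea (hypothetical names, not the `sdcrossval` API):

```python
def crossval_error(train, execute, xs, ys, n_folds=5):
    """k-fold cross-validation of a whole algorithm: the full training
    procedure is re-run on each training fold and evaluated only on the
    held-out samples (rough sketch)."""
    n = len(xs)
    errors = 0
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))  # every n_folds-th sample
        tr_x = [x for i, x in enumerate(xs) if i not in held_out]
        tr_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train(tr_x, tr_y)                # whole algorithm retrained
        errors += sum(execute(model, xs[i]) != ys[i] for i in held_out)
    return errors / n
```

Because `train` is the entire algorithm function, nothing trained on a fold ever leaks into the samples that fold is tested on.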

# 19.5. Setting algorithm current operating point

The process of setting the current operating point may be specified in a separate algorithm section. This allows us to re-run it after training. We simply add a `tosetcurop` section to the algorithm function; it fixes a new operating point in an already trained algorithm:

```
elseif tosetcurop(alg)
   % constrain TPr(apple) and minimize the FPr(apple) in one step:
   alg.roc=setcurop(alg.roc,'constrain','TPr(apple)',alg.tpr_apple,...
                    'min','FPr(apple)');
   out=alg;
end
```

Note that no extra parameters are passed from outside. The code simply uses the parameters set in the `alg` object.

Say we have a trained algorithm built in the previous section:

**>> tralg=a*alg**
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.99
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc

It fixed the operating point at 99% TPr. We now wish to set a different one, at 95% TPr. We simply update the `tpr_apple` parameter:

**>> tralg.tpr_apple=0.95**
trained sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.95
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc

And call the `setcurop` method on the algorithm:

**>> tralg=setcurop(tralg)**
tosetcurop sdalg 'sda_pca_clf_roc'
frac 1x1 8 double 0.9
tpr_apple 1x1 8 double 0.95
clf 0x0 1822 sdppl untrained Parzen
pca 2x2 9828 sdppl trained PCA
trclf 2x1 27378 sdppl trained Parzen model+Decision
roc 159x3 21670 sdroc

The returned algorithm now uses the new operating point:

**>> tralg.roc**
ROC (159 w-based op.points, 3 measures), curop: 76
est: 1:FPr(apple)=0.01, 2:TPr(apple)=0.95, 3:mean-error=0.03

# 19.6. Converting algorithms into pipelines for out-of-Matlab execution

In order to execute a complex algorithm outside of Matlab, we need to convert it into an `sdppl` pipeline. We may add an extra `toconvert` section in the algorithm function to describe this conversion.

For our example algorithm, the pipeline construction is simple: we only need to return the concatenation of the PCA projection with the trained classifier.

```
function out = sda_pca_clf(alg,data)
if nargin==0
   alg=sdalg(mfilename);
   alg.frac=0.9;
   alg.clf=sdlinear;
   out=alg;
elseif totrain(alg)
   alg.pca=sdpca(data,alg.frac);
   data=data*alg.pca;
   alg.trclf=data*alg.clf;
   out=alg;
elseif toexecute(alg)
   data=data*alg.pca;
   out=data*alg.trclf;
elseif toconvert(alg)
   out=alg.pca*alg.trclf;
end
```

The trained algorithm may now be converted using the `sdconvert` function:

**>> p=sdconvert(tralg)**
sequential pipeline 11x1 'PCA+Gauss eq.cov.+Output normalization+Decision'
1 PCA 11x2 93% of variance (sdp_affine)
2 Gauss eq.cov. 2x2 2 classes, 2 components (sdp_normal)
3 Output normalization 2x2 (sdp_norm)
4 Decision 2x1 weighting, 2 classes, 1 ops at op 1 (sdp_decide)
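The `toconvert` section builds the deployable pipeline by chaining stages with the `*` operator. The idea of such chaining can be sketched in Python (purely illustrative; real perClass pipelines additionally carry metadata such as dimensions and operating points):

```python
def compose(*stages):
    """Chain callable stages into a single pipeline function,
    analogous to alg.pca * alg.trclf (illustrative only)."""
    def pipeline(x):
        for stage in stages:
            x = stage(x)  # output of each stage feeds the next
        return x
    return pipeline
```

For example, `compose(lambda x: x + 1, lambda x: x * 2)` applied to 3 first adds one and then doubles, exactly the left-to-right order in which a perClass sequential pipeline executes its steps.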