perClass Documentation version 5.4 (7-Dec-2018)

Chapter 11: Pipelines

This chapter describes pipelines, trainable operations on data.

# 11.1. Introduction

In perClass, processing or transformation of data is described using the concept of a pipeline. Let us take, as an example, training of a linear classifier:

``````
>> load fruit
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)

>> p=sdlinear(a)
sequential pipeline       2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model          2x3  single cov.mat.
2 Normalization           3x3
3 Decision                3x1  weighting, 3 classes
``````

The object `p` is a pipeline comprising three stages. The first one is a Gaussian model computed in the input 2D feature space. The model describes three classes ('apple', 'banana' and 'stone') and, therefore, provides three corresponding outputs (probability densities). The second stage is a normalization that turns the densities into posteriors. Finally, the third stage converts the posteriors into a decision, providing a single integer output.
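The three stages described above can be mimicked in plain MATLAB to see what each one contributes. This is only a conceptual sketch and not how perClass implements `sdlinear`: the toy data, the variable names and the assumption of equal class priors are all made up for illustration.

``````matlab
% Conceptual sketch only: plain MATLAB, not perClass internals.
% Needs MATLAB R2016b or later (or GNU Octave) for implicit expansion.

% Toy 2D training data for three classes (made-up values)
X = [1 1; 2 1; 1 2; 2 2; ...
     6 1; 7 1; 6 2; 7 2; ...
     3.5 6; 4.5 6; 3.5 7; 4.5 7];
y = [1 1 1 1 2 2 2 2 3 3 3 3]';
K = 3;                                % number of classes
[n, d] = size(X);

% Stage 1: Gaussian model with a single (pooled) covariance matrix
mu = zeros(K, d);
S  = zeros(d);
for k = 1:K
    Xk = X(y == k, :);
    mu(k, :) = mean(Xk, 1);
    Xc = Xk - mu(k, :);               % center the class samples
    S  = S + Xc' * Xc;                % accumulate within-class scatter
end
S = S / (n - K);

x = [1.4 1.6; 4.2 6.0; 6.0 2.0];      % three new samples
dens = zeros(size(x, 1), K);          % one density column per class
for k = 1:K
    xc = x - mu(k, :);
    m2 = sum((xc / S) .* xc, 2);      % squared Mahalanobis distance
    dens(:, k) = exp(-0.5 * m2) / sqrt((2*pi)^d * det(S));
end

% Stage 2: normalization (densities -> posteriors, equal priors assumed)
post = dens ./ sum(dens, 2);

% Stage 3: decision (index of the largest posterior in each row)
[~, dec] = max(post, [], 2);
``````

Here `dens` plays the role of the 2x3 stage output, `post` the 3x3 normalization, and `dec` the single integer decision per sample.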

Pipelines in perClass are not limited to classification. They describe all types of data processing, including data scaling, feature extraction and selection.

## 11.1.1. Execution on new data

The pipeline `p` may be applied to any data set with two features using the multiplication operator `*`:

``````
>> data=sddata(data)
5 by 2 sddata, class: 'unknown'
>> out=data*p
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2)
``````

The output of our pipeline is an `sdlab` object with classifier decisions. In perClass 4, all classifiers produce decisions by default.

Pipeline execution is analogous to matrix multiplication. Our pipeline `p` acts as a matrix with two rows (feature inputs) and one column (the decision output).
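The size bookkeeping behind this analogy can be sketched with ordinary matrices. The matrices `A1`, `A2` and `A3` below are hypothetical stand-ins for the three stages. Real stages such as normalization and decision are not linear maps, so only the dimensions carry over, not the arithmetic.

``````matlab
% Conceptual sketch only: ordinary matrices standing in for pipeline stages.
X  = [1 2; 3 4; 5 6; 7 8; 9 10];   % 5 samples, 2 features
A1 = ones(2, 3);                   % "stage 1": 2 inputs -> 3 outputs
A2 = eye(3);                       % "stage 2": 3 inputs -> 3 outputs
A3 = [1; 0; 0];                    % "stage 3": 3 inputs -> 1 output

P = A1 * A2 * A3;                  % chained stages collapse to a 2x1 map
Y = X * P;                         % data on the left, as in data*p

disp(size(P))                      % 2 rows (inputs), 1 column (output)
disp(size(Y))                      % one output row per sample
``````

As with `data*p`, the inner dimensions must agree: the number of features in the data has to match the number of pipeline inputs.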

The multiplication operator is only syntactic sugar; the real work is done by the `sdexe` function:

``````
>> sdexe(p,data)
sdlab with 5 entries, 2 groups: 'apple'(3) 'stone'(2)
``````

If we execute the pipeline on a raw data matrix, we obtain raw numerical output:

``````
>> data=rand(5,2)*100

data =

41.4248   77.6399
36.8954    4.8470
85.0896   59.0271
79.7602   15.8238
35.0236   93.7622

>> out=data*p

out =

3
1
1
1
3
``````

The mapping between integer decisions and decision names is handled by the pipeline list:

``````
>> p.list
sdlist (3 entries)
ind name
1 apple
2 banana
3 stone
``````

We may use the list object to convert decisions into names and vice versa:

``````
>> p.list(3)

ans =

stone

>> p.list('apple')

ans =

1

>> p.list(out)

ans =

stone
apple
apple
apple
stone
``````
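The two-way mapping that the list provides can be pictured with a plain cell array of names. This is a conceptual sketch in base MATLAB, not the `sdlist` implementation, and the variable names are hypothetical.

``````matlab
% Conceptual sketch only: a cell array standing in for the pipeline list.
names = {'apple', 'banana', 'stone'};      % index k <-> names{k}
out   = [3; 1; 1; 1; 3];                   % integer decisions from the example

decided = reshape(names(out), [], 1);      % index -> name, one cell per sample
idx     = find(strcmp(names, 'apple'));    % name -> index

disp(decided)
disp(idx)
``````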

## 11.1.2. Accessing pipeline steps

Unless specified explicitly, all pipeline operations refer to the last step:

``````
>> p
sequential pipeline       2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model          2x3  single cov.mat.
2 Normalization           3x3
3 Decision                3x1  weighting, 3 classes

>> p.output

ans =

decision
``````

We may access individual pipeline steps using parentheses `()`:

``````
>> p(1).output

ans =

probability density
``````

Say we wish to extract the "soft outputs" of our classifier just before they are turned into decisions:

``````
>> p(1:2)
sequential pipeline       2x3 'Gaussian model+Normalization'
1 Gaussian model          2x3  single cov.mat.
2 Normalization           3x3

>> out=data*p(1:2)
5 by 3 sddata, class: 'unknown'
``````

The output is now a data set, because the second pipeline step returns real-valued output.

A quick shorthand for removing the decision step is the unary minus (`-`) operator:

``````
>> -p
sequential pipeline       2x3 'Gaussian model+Normalization'
1 Gaussian model          2x3  single cov.mat.
2 Normalization           3x3
``````

We may, therefore, get classifier soft outputs using:

``````
>> data*-p
5 by 3 sddata, class: 'unknown'
``````

Applying the unary minus to a pipeline that already returns soft output has no effect:

``````
>> --p
sequential pipeline       2x3 'Gaussian model+Normalization'
1 Gaussian model          2x3  single cov.mat.
2 Normalization           3x3
``````

## 11.1.3. Displaying pipeline details

As with data sets and labels, perClass provides a quick shortcut for displaying details about a pipeline: the transpose operator (`'`):

``````
>> p'
sequential pipeline     2x1 'Gaussian model+Normalization+Decision'
1 Gaussian model          2x3  single cov.mat.
inlab: 'length','color'
lab: 'apple','banana','stone'
output: probability density
2 Normalization           3x3
inlab: 'apple','banana','stone'
lab: 'apple','banana','stone'
output: posterior
3 Decision                3x1  weighting, 3 classes
inlab: 'apple','banana','stone'
output: decision ('apple','banana','stone')
``````

For each step, we can see the input/output labels and the type of output. We can see that our pipeline `p` expects two input features, namely 'length' and 'color'.

This information may be accessed using pipeline fields `inlab`, `lab` and `output`:

``````
>> p(1).inlab
sdlab with 2 entries: 'length','color'

>> p(3).output

ans =

decision
``````

## 11.1.4. Untrained pipelines

Usually, we create pipelines by training them on a data set. However, in some situations, it may be more beneficial to create a pipeline description without a concrete data set. Such a pipeline is called untrained.

An untrained pipeline is created by passing an empty matrix `[]` as the first argument.

The trained Parzen classifier:

``````
>> a
'Fruit set' 260 by 2 sddata, 3 classes: 'apple'(100) 'banana'(100) 'stone'(60)

>> p=sdparzen(a)
.....sequential pipeline       2x1 'Parzen model+Decision'
1 Parzen model            2x3  260 prototypes, h=0.8
2 Decision                3x1  weighting, 3 classes
``````

The untrained Parzen classifier:

``````
>> u=sdparzen([])
untrained pipeline 'sdparzen'
``````

By multiplying a data set with an untrained pipeline, we train it:

``````
>> p2=a*u
.....sequential pipeline       2x1 'Parzen model+Decision'
1 Parzen model            2x3  260 prototypes, h=0.8
2 Decision                3x1  weighting, 3 classes
``````

Note that the order is always `data * pipeline`.

Untrained pipelines are useful for separating the definition of a classifier from its training on data. We may provide any parameters when defining an untrained pipeline:

``````
>> u2=sdneural([],'units',20,'iters',1000)
untrained pipeline 'sdneural'
``````

Untrained pipelines are used, for example, by `sdcrossval` to perform evaluation by cross-validation:

``````
>> sdcrossval(u,a)
10 folds: [1: ....] [2: .....] [3: ....] [4: ....] [5: .....] [6: .....] [7: ....] [8: .....] [9: .....] [10: ....]

ans =

10-fold rotation

ind mean (std)  measure
1 0.09 (0.02) mean error over classes, priors [0.3,0.3,0.3]

>> sdcrossval(u2,a)
10 folds: [1: ] [2: ] [3: ] [4: ] [5: ] [6: ] [7: ] [8: ] [9: ] [10: ]

ans =

10-fold rotation

ind mean (std)  measure
1 0.08 (0.01) mean error over classes, priors [0.3,0.3,0.3]
``````
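Because an untrained pipeline is just a description, several candidate classifiers can be defined up front and evaluated in one loop. The sketch below uses only the calls shown in this chapter (`sdparzen`, `sdneural`, `sdcrossval` and `data * pipeline`). It requires perClass and the `fruit` data set. Keeping the pipelines in a cell array named `models` is an assumption based on ordinary MATLAB behaviour, not something this chapter documents.

``````matlab
% Sketch: requires perClass; uses only calls that appear in this chapter.
load fruit                                        % provides the data set a

% Define the candidates without touching any data
% (the cell array 'models' is a hypothetical helper variable)
models = {sdparzen([]), ...
          sdneural([], 'units', 20, 'iters', 1000)};

% Evaluate each definition by cross-validation on the same data
for i = 1:numel(models)
    sdcrossval(models{i}, a)
end

% Train a chosen definition on the full data set (order: data * pipeline)
p = a * models{1};
``````

The classifier definitions stay in one place, so changing a parameter such as `'units'` affects both the cross-validation and the final training.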