- 9.1. Introduction
- 9.2. Creating data sets with nominal features
- 9.3. Testing if data set contains nominal features
- 9.4. Display info about nominal data
- 9.5. Converting nominal feature to labels
- 9.6. Training a pipeline on nominal data set
- 9.7. Combining nominal data sets
- 9.8. Testing if two nominal representations are identical
- 9.9. Making two nominal representations identical
- 9.10. Applying pipelines to nominal data sets
- 9.11. Turning labels into nominal features
9.1. Introduction
Nominal, or categorical, features describe qualitative aspects of an object. For example, an object's color may be "blue", "red" or "green". The color feature is nominal because the available values do not share any ordering relationship ("blue" is not higher/lower or better/worse than "green").
perClass with the DB licensing option supports nominal features and training classifiers on this type of data. You can check whether the DB option is present using sdversion.
9.2. Creating data sets with nominal features
With the DB option, sddata sets may be created from cell arrays containing string fields. The data matrix remains numerical; perClass, however, defines and maintains the mapping between nominal values and their numerical representation.
Let us consider this cell array:
>> C={1.5 'aaa'; -5 'bbb'; 1.7 'aaa'; 4 'ccc'}
C =
[1.5000] 'aaa'
[ -5] 'bbb'
[1.7000] 'aaa'
[ 4] 'ccc'
>> a=sddata(C)
4 by 2 sddata (nominal), class: 'unknown'
The data set a contains four samples and two features.
The content of the data set a is numerical, just as in any other sddata set:
>> +a
ans =
1.5000 1.0000
-5.0000 2.0000
1.7000 1.0000
4.0000 3.0000
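To see where the numbers in the second column come from, the mapping can be mimicked in plain Matlab. This is only a sketch and does not call perClass: it assumes that codes are assigned in order of first appearance, which is consistent with the outputs shown in this chapter but is not stated by perClass itself. The variable names values, names and codes are illustrative.

```matlab
% Illustrative sketch only (plain Matlab, no perClass calls).
% Assumption: codes follow the order of first appearance of each string.
values = {'aaa'; 'bbb'; 'aaa'; 'ccc'};      % second column of C

% 'stable' keeps first-appearance order instead of sorting alphabetically
[names, ~, codes] = unique(values, 'stable');

disp(names')    % list of nominal values; position in the list = numerical code
disp(codes')    % numerical representation, one code per sample
```

Under this assumption the list of values, not the strings themselves, determines the codes, so two cell arrays with a different order of first appearance end up with different numbers for the same string.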
9.3. Testing if data set contains nominal features
We can use isnominal to test whether a data set contains nominal information.
>> isnominal(a)
ans =
1
The isnominal function returns 1 if at least one nominal feature exists in the data set and 0 otherwise:
>> isnominal(a(:,1))
ans =
0
9.4. Display info about nominal data
Detailed information on nominal data is provided by the sdnominal command:
>> sdnominal(a)
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
For each nominal feature, it displays all nominal values with the corresponding numerical representation.
9.5. Converting nominal feature to labels
Any nominal feature may be converted into an sdlab object:
>> L=sdlab(a(:,2))
sdlab with 4 entries, 3 groups: 'aaa'(2) 'bbb'(1) 'ccc'(1)
>> +L
ans =
aaa
bbb
aaa
ccc
9.6. Training a pipeline on nominal data set
A pipeline remembers that it was trained on a nominal data set. Therefore, we may test it with isnominal and even see its nominal representation with sdnominal:
>> p=sdknn(a)
sequential pipeline 2x1 '1-NN+Decision'
1 1-NN 2x1 4 prototypes
2 Decision 1x1 threshold on 'unknown'
>> isnominal(p)
ans =
1
>> sdnominal(p)
Pipeline expects on input one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
9.7. Combining nominal data sets
Imagine we have another cell array:
>> C2={5 'ccc'; -3 'bbb'}
C2 =
[ 5] 'ccc'
[-3] 'bbb'
We convert it into a data set:
>> a2=sddata(C2)
2 by 2 sddata (nominal), class: 'unknown'
>> +a2
ans =
5 1
-3 2
If we concatenate the data sets a and a2, we receive an error message:
>> [a;a2]
??? Error using ==> sddata.vertcat at 50
Data sets being concatenated do not share identical nominal representation. Use
sdnominal to either use one existing representation for all data sets ('from'
option) or create a new representation for all sets ('join' option).
The reason is that the two data sets encode the nominal values by different numbers. While the value 'ccc' is represented by 3 in data set a, it is represented by 1 in data set a2.
>> sdnominal(a)
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
>> sdnominal(a2)
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:ccc 2:bbb
We need an identical numerical representation of nominal features in all data sets and, consequently, in all classifiers we build! The fundamental rule when working with nominal data is: In your project, use a single nominal representation of each nominal feature.
9.8. Testing if two nominal representations are identical
The sdnominal function allows us to test whether two objects share the same nominal representation:
>> sdnominal(a,a2)
ISSUE: Each object represents nominal features by different numerical values.
ans =
0
A subset of the same data set has an identical nominal representation:
>> sdnominal(a, a(3:end) )
OK: Both objects share identical numerical representation of nominal data.
ans =
1
As with other perClass functions, the additional display output may be suppressed using the 'nodisplay' option:
>> sdnominal(a, a(3:end), 'nodisplay' )
ans =
1
9.9. Making two nominal representations identical
In our example above, the data sets a and a2 contain different numerical representations of nominal data. We may use sdnominal to pass the nominal representation from one object to another:
>> b2=sdnominal(a2,'from',a)
Nominal representation in the data matrix updated based on the 'from' data set.
2 by 2 sddata (nominal), class: 'unknown'
The new nominal representation in the b2 data set considers three values of 'Feature 2', namely 'aaa', 'bbb' and 'ccc':
>> sdnominal(b2)
Data set contains one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
>> [+a2 +b2]
ans =
5 1 5 3
-3 2 -3 2
The data sets a and b2 now share an identical representation:
>> sdnominal(a,b2)
OK: Both objects share identical numerical representation of nominal data.
ans =
1
Therefore, they may be concatenated:
>> [a;b2]
6 by 2 sddata (nominal), class: 'unknown'
Note, however, that we cannot pass the nominal representation from a2 to a, because some categories used in a are not present in a2:
>> b=sdnominal(a,'from',a2)
The following values are not present in the nominal list.
ans =
aaa
??? Error using ==> sddata.sddata at 129
Some categories in the label object were not found in the list of nominal
values
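What the 'from' option does to the numbers can be illustrated without perClass. The sketch below is not the sdnominal implementation; it only reproduces the remapping on plain Matlab arrays, using the value lists printed by sdnominal above. The names listA, listA2, codesA2, newCodes and missing are hypothetical.

```matlab
% Illustrative sketch only (plain Matlab); not how sdnominal is implemented.
listA   = {'aaa', 'bbb', 'ccc'};   % representation of a:  1:aaa 2:bbb 3:ccc
listA2  = {'ccc', 'bbb'};          % representation of a2: 1:ccc 2:bbb
codesA2 = [1; 2];                  % second column of +a2

% Position of every a2 value inside the list of a
[found, pos] = ismember(listA2, listA);
assert(all(found), 'Some values are missing in the target list.');

% Translate the old codes into the codes used by a
newCodes = pos(codesA2);
newCodes = newCodes(:);            % column vector, as in a data matrix
disp(newCodes')                    % compare with the second column of +b2

% The opposite direction fails: not every value of a exists in a2
missing = listA(~ismember(listA, listA2));
disp(missing)
```

The remapping works only when every value of the data set being updated exists in the list it is mapped onto, which is why passing the representation from a2 to a is rejected.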
9.10. Applying pipelines to nominal data sets
We may apply a classifier, trained above on data set a, to a itself or to any subset of a:
>> p
sequential pipeline 2x1 '1-NN+Decision'
1 1-NN 2x1 4 prototypes
2 Decision 1x1 threshold on 'unknown'
>> sdnominal(p)
Pipeline expects on input one nominal feature:
1 'Feature 1' (real)
2 'Feature 2' (nominal) 1:aaa 2:bbb 3:ccc
>> a*p
sdlab with 4 entries from 'unknown'
>> a(3)*p
sdlab with one entry: 'unknown'
However, an error is raised if we apply it to data set a2:
>> a2*p
??? Error using ==> sdexe at 102
Nominal representations in data set and pipeline do not agree! Use sdnominal to
validate and/or update nominal representation.
This operation is not allowed because it would lead to incorrect results (recall that a2 encodes the nominal value 'ccc' as 1, while the classifier expects it to be represented by 3).
We may, however, execute p on the data set b2, created from a2 in the previous section:
>> b2*p
sdlab with 2 entries from 'reject'
Important: perClass only checks whether nominal representations match when working with sddata sets, but not with numerical matrices:
>> +a2*p
ans =
2
2
Also, the C-based execution runtime does not check the correctness of the nominal representation when executing classifiers outside of Matlab.
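One way to stay on the safe side is to validate the representation against the training set before executing the pipeline. The following is a sketch only: it needs perClass with the DB option, it uses just the sdnominal calls shown in this chapter, and it assumes the training set a is still available in the workspace. The variable names x and dec are hypothetical.

```matlab
% Sketch only: requires perClass with the DB option and the variables
% a (training set), a2 (new data) and p (pipeline trained on a) from above.
x = a2;                                  % hypothetical name for incoming data

% Compare the representation with the training set before executing
if ~sdnominal(a, x, 'nodisplay')
    % Re-encode x using the representation of a; an error is raised
    % if x holds values that are unknown to a
    x = sdnominal(x, 'from', a);
end

dec = x*p                                % decisions as an sdlab object
```

This guard only helps with sddata sets. Raw numerical matrices carry no nominal information, so their codes have to be checked by other means.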
9.11. Turning labels into nominal features
Above, we have seen how to convert a nominal feature into an sdlab object. That is very useful if we want to work with categories using powerful logical operations or regular expressions. But how do we bring the label object back to nominal feature values?
Let us consider this label object:
>> L=sdlab(a(3:end,2))
sdlab with 2 entries, 2 groups: 'aaa'(1) 'ccc'(1)
>> +L
ans =
aaa
ccc
If we simply convert L into an sddata set, the categories are represented differently than in the original set:
>> c=sddata(L)
2 by 1 sddata (nominal), class: 'unknown'
>> +c
ans =
1
2
This is because label objects in perClass only represent the categories that are present.
A useful command for converting the labels back into the original nominal representation is:
>> d=sdnominal(sddata(L),'from',a(:,2))
Nominal representation in the data matrix updated based on the 'from' data set.
2 by 1 sddata (nominal), class: 'unknown'
>> +d
ans =
1
3