Problem: How to build a classifier for SQL database records that contain nominal values?
Solution: Use sdnominal to make sure your data sets share common representation of nominal data.
28.1. Introduction ↩
When we build classifiers for database applications, we often encounter nominal features. A nominal feature captures an object quality. An example is the country a person is from. As designers, we must make sure our classifiers encounter identical representation of nominal features both in training and in execution.
In this article, we walk through an database application example that explains how to deal with nominal data representation.
28.2. Creating a data set from SQL database ↩
In our example, we use Adult data set from UCI repository. The data describes education and occupation of people and can be used to predict their annual income.
The data set was imported into SQLite database. Therefore, we may pull in Matlab only a small subset we actually want to use.
Let's say we want to build a classifier only concerning people in England.
We will first open the database for use in perClass:
>> d=sddb
('adult.db')
perClass DB: 1 connections
The database contains one table called 'data':
>> tables
(d)
ans =
'data'
The 'data' table contains the following fields:
>> F=fields
(d,'data')
ans =
Columns 1 through 5
'age' 'workclass' 'fnlwgt' 'education' 'education_num'
Columns 6 through 10
'marital_status' 'occupation' 'relationship' 'race' 'sex'
Columns 11 through 14
'capital_gain' 'capital_loss' 'hours_per_week' 'native_country'
Column 15
'income'
And create an SQL query only for data from England:
>> s=sdsql
(d,'select * from data where native_country="England";')
127 by 15 set in 'adult.db':
The query returned 127 records, each with 15 fields.
We may convert the sdsql
object into a Matlab cell array:
>> C=cell(s);
and view its first record:
>> C(1,:)
ans =
Columns 1 through 6
[48] 'Private' [171095] 'Assoc-acdm' [12] 'Divorced'
Columns 7 through 12
'Exec-managerial' 'Unmarried' 'White' 'Female' [0] [0]
Columns 13 through 15
[40] 'England' '<=50K'
Each record is a mix of numerical and string (nominal) values.
We will construct an sddata
object from the SQL query:
>> a=sddata
(s)
127 by 15 sddata (nominal), class: 'unknown'
and add field names as feature labels:
>> a.featlab=F
127 by 15 sddata (nominal), class: 'unknown'
>> a.featlab'
1 age
2 workclass
3 fnlwgt
4 education
5 education_num
6 marital_status
7 occupation
8 relationship
9 race
10 sex
11 capital_gain
12 capital_loss
13 hours_per_week
14 native_country
15 income
Note, that sddata
indicates that the object contains nominal
values. With sdnominal
command, we may display details on the nominal
data set:
>> sdnominal
(a)
Data set contains 9 nominal features:
1 'age' (real)
2 'workclass' (nominal) 1:? 2:Federal-gov 3:Local-gov 4:Private 5:Self-emp-inc 6:Self-emp-not-inc 7:State-gov
3 'fnlwgt' (real)
4 'education' (nominal) 1:10th 2:11th 3:12th 4:1st-4th 5:5th-6th 6:9th 7:Assoc-acdm 8:Assoc-voc 9:Bachelors 10:Doctorate 11:HS-grad 12:Masters 13:Prof-school 14:Some-college
5 'education_num' (real)
6 'marital_status' (nominal) 1:Divorced 2:Married-civ-spouse 3:Married-spouse-absent 4:Never-married 5:Separated 6:Widowed
7 'occupation' (nominal) 1:? 2:Adm-clerical 3:Craft-repair 4:Exec-managerial 5:Farming-fishing 6:Handlers-cleaners 7:Machine-op-inspct 8:Other-service 9:Priv-house-serv 10:Prof-specialty 11:Protective-serv 12:Sales 13:Tech-support 14:Transport-moving
8 'relationship' (nominal) 1:Husband 2:Not-in-family 3:Other-relative 4:Own-child 5:Unmarried 6:Wife
9 'race' (nominal) 1:Asian-Pac-Islander 2:Black 3:Other 4:White
10 'sex' (nominal) 1:Female 2:Male
11 'capital_gain' (real)
12 'capital_loss' (real)
13 'hours_per_week' (real)
14 'native_country' (nominal) 1:England
15 'income' (nominal) 1:<=50K 2:>50K
For each nominal attribute, sdnominal
lists the values both in numerical
representation used inside sddata
and the original string value.
For our example, we will use the 15th 'income' feature as a label. We can
convert any nominal feature into sdlab
object:
>> a.lab=sdlab
(a(:,'income'))
127 by 15 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47)
and 14th feature (native_country
) as a "country" labels:
>> a.country=sdlab
(a(:,'native_country'))
127 by 15 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47)
>> a.country
sdlab with 127 entries from 'England'
Finally, we preserve only the first 13 columns as features:
>> a=a(:,1:13)
127 by 13 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47)
28.3. Training a classifier on nominal data set ↩
We can directly train any classifier on our nominal data set a
:
>> pr=sdrandforest
(a)
..........
sequential pipeline 13x1 'Random forest+Decision'
1 Random forest 13x2 20 trees
2 Decision 2x1 weighting, 2 classes
Random forest error on the training set:
>> sdconfmat
(a.lab,a*pr)
ans =
True | Decisions
Labels | <=50K >50K | Totals
---------------------------------------
<=50K | 79 1 | 80
>50K | 3 44 | 47
---------------------------------------
Totals | 82 45 | 127
Note, that perClass classifiers, trained on nominal data, preserve detailed nominal information:
>> sdnominal
(pr)
Pipeline expects on input 7 nominal features:
1 'age' (real)
2 'workclass' (nominal) 1:? 2:Federal-gov 3:Local-gov 4:Private 5:Self-emp-inc 6:Self-emp-not-inc 7:State-gov
3 'fnlwgt' (real)
4 'education' (nominal) 1:10th 2:11th 3:12th 4:1st-4th 5:5th-6th 6:9th 7:Assoc-acdm 8:Assoc-voc 9:Bachelors 10:Doctorate 11:HS-grad 12:Masters 13:Prof-school 14:Some-college
5 'education_num' (real)
6 'marital_status' (nominal) 1:Divorced 2:Married-civ-spouse 3:Married-spouse-absent 4:Never-married 5:Separated 6:Widowed
7 'occupation' (nominal) 1:? 2:Adm-clerical 3:Craft-repair 4:Exec-managerial 5:Farming-fishing 6:Handlers-cleaners 7:Machine-op-inspct 8:Other-service 9:Priv-house-serv 10:Prof-specialty 11:Protective-serv 12:Sales 13:Tech-support 14:Transport-moving
8 'relationship' (nominal) 1:Husband 2:Not-in-family 3:Other-relative 4:Own-child 5:Unmarried 6:Wife
9 'race' (nominal) 1:Asian-Pac-Islander 2:Black 3:Other 4:White
10 'sex' (nominal) 1:Female 2:Male
11 'capital_gain' (real)
12 'capital_loss' (real)
13 'hours_per_week' (real)
28.4. Creating test data set ↩
We will pull records from Germany for testing:
>> s2=sdsql
(d,'select * from data where native_country="Germany";')
206 by 15 set in 'adult.db':
SQL='select * from data where native_country="Germany";'
>> b=sddata
(s2)
206 by 15 sddata (nominal), class: 'unknown'
>> b.featlab=fields(d,'data')
206 by 15 sddata (nominal), class: 'unknown'
>> b.lab=sdlab
( b(:,'income') )
206 by 15 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58)
>> b.country=sdlab
( b(:,'native_country') )
206 by 15 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58)
>> b=b(:,1:13)
206 by 13 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58)
Let's apply our classifier pr
on the data set b
:
>> b*pr
{??? Error using ==> sdexe
Nominal representations in data set and pipeline do not agree! Use sdnominal to
validate and/or update nominal representation.
What is the cause of the reported error?
The nominal representation in data set b
and classifier pr
do not
match. In other words, different numerical coding is used for nominal
categories in b
and in pr
. Applying one to another would yield
meaningless results.
>> sdnominal
(b)
Data set contains 7 nominal features:
1 'age' (real)
2 'workclass' (nominal) 1:? 2:Federal-gov 3:Local-gov 4:Private 5:Self-emp-inc 6:Self-emp-not-inc 7:State-gov
3 'fnlwgt' (real)
4 'education' (nominal) 1:10th 2:11th 3:12th 4:7th-8th 5:Assoc-acdm 6:Assoc-voc 7:Bachelors 8:Doctorate 9:HS-grad 10:Masters 11:Prof-school 12:Some-college
5 'education_num' (real)
6 'marital_status' (nominal) 1:Divorced 2:Married-civ-spouse 3:Married-spouse-absent 4:Never-married 5:Separated 6:Widowed
7 'occupation' (nominal) 1:? 2:Adm-clerical 3:Craft-repair 4:Exec-managerial 5:Farming-fishing 6:Handlers-cleaners 7:Machine-op-inspct 8:Other-service 9:Priv-house-serv 10:Prof-specialty 11:Protective-serv 12:Sales 13:Tech-support 14:Transport-moving
8 'relationship' (nominal) 1:Husband 2:Not-in-family 3:Other-relative 4:Own-child 5:Unmarried 6:Wife
9 'race' (nominal) 1:Amer-Indian-Eskimo 2:Asian-Pac-Islander 3:Black 4:Other 5:White
10 'sex' (nominal) 1:Female 2:Male
11 'capital_gain' (real)
12 'capital_loss' (real)
13 'hours_per_week' (real)
Comparing detailed nominal information on data set b
and the classifier
pr
(see previous section) shows, for example, that our test set contains
Amer-Indian-Eskimo
race but the classifier does not know about it as this
category was not present in the training data set a
.
28.5. Making sure nominal representation are identical ↩
The message is: We must make sure the nominal representation of our problem is identical in all data sets and classifiers we use.
The sdnominal
command may create a joined representation updating both
data sets a
and b
so that they are compatible:
>> [a2,b2]=sdnominal
(a,'join',b)
OK: Both returned data sets share the same joined nominal data representation (data matrices updated).
127 by 13 sddata (nominal), 2 classes: '<=50K'(80) '>50K'(47)
206 by 13 sddata (nominal), 2 classes: '<=50K'(148) '>50K'(58)
Note, that our classifier pr
is now obsolete because its representation
does not match the one in the updated training set a2
:
>> sdnominal
(a2,pr);
ISSUE: Each object represents nominal features by different numerical values.
We need to retrain it.
>> pr2=sdrandforest
(a2)
..........
sequential pipeline 13x1 'Random forest+Decision'
1 Random forest 13x2 20 trees
2 Decision 2x1 weighting, 2 classes
Our classifier pr2
may be now applied to data set b2
:
>> sdconfmat
(b2.lab,b2*pr2)
ans =
True | Decisions
Labels | <=50K >50K | Totals
---------------------------------------
<=50K | 131 17 | 148
>50K | 20 38 | 58
---------------------------------------
Totals | 151 55 | 206
28.6. Conclusions ↩
We have seen how to pull data from SQL database and train a classifier on nominal feature represenation. perClass helps us to make sure that the nominal representation used in our data sets and classifiers is identical.
TIP: To avoid the need for re-training our classifiers, we should define our data sets with the complete set of nominal categories for each attribute and use the same setup in the entire project.