Exercises and Answers

1. Use the Vehicle data set from the mlbench package to build a KNN model for each k from 1 to 5. Which k gives the best results? What are the re-substitution error and the LOO CV error? Make a confusion matrix. After loading the package, load the Vehicle data set as follows.

data(Vehicle)

Proposed answer:

library(caret)
library(mlbench)
library(dplyr)
data(Vehicle)

data_set <- Vehicle

ctrl.loo <- trainControl(method = 'LOOCV',
                         search = 'grid') 

(model.knn <- train(Class ~ ., 
                    data = data_set, 
                    method = 'knn', 
                    tuneGrid = data.frame(k = 1:5),
                    trControl = ctrl.loo))
## k-Nearest Neighbors 
## 
## 846 samples
##  18 predictor
##   4 classes: 'bus', 'opel', 'saab', 'van' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 845, 845, 845, 845, 845, 845, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   1  0.6524823  0.5364211
##   2  0.6453901  0.5269118
##   3  0.6643026  0.5523409
##   4  0.6572104  0.5430247
##   5  0.6572104  0.5429991
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
1 - model.knn$results["Accuracy"] # LOO CV error rate for each k
##    Accuracy
## 1 0.3475177
## 2 0.3546099
## 3 0.3356974
## 4 0.3427896
## 5 0.3427896
mean(predict(model.knn) != data_set$Class) # re-substitution error
## [1] 0.1879433
confusionMatrix(predict(model.knn), data_set$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction bus opel saab van
##       bus  207    6    8   1
##       opel   4  137   54   0
##       saab   5   59  148   2
##       van    2   10    7 196
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8132         
##                  95% CI : (0.7853, 0.839)
##     No Information Rate : 0.2577         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.751          
##                                          
##  Mcnemar's Test P-Value : 0.02524        
## 
## Statistics by Class:
## 
##                      Class: bus Class: opel Class: saab Class: van
## Sensitivity              0.9495      0.6462      0.6820     0.9849
## Specificity              0.9761      0.9085      0.8951     0.9706
## Pos Pred Value           0.9324      0.7026      0.6916     0.9116
## Neg Pred Value           0.9824      0.8848      0.8908     0.9952
## Prevalence               0.2577      0.2506      0.2565     0.2352
## Detection Rate           0.2447      0.1619      0.1749     0.2317
## Detection Prevalence     0.2624      0.2305      0.2530     0.2541
## Balanced Accuracy        0.9628      0.7774      0.7885     0.9778
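
The numbers asked for in the exercise can be read off the printouts above. The following is a small base R sketch that only retypes the printed values (nothing is refitted; the names `loo`, `best` and `cm` are made up for this illustration). It picks the LOO CV error of the selected k and recomputes the re-substitution error from the confusion matrix counts.

```r
# LOO CV accuracies retyped from the train() printout above
loo <- data.frame(
  k = 1:5,
  Accuracy = c(0.6524823, 0.6453901, 0.6643026, 0.6572104, 0.6572104)
)
loo$Error <- 1 - loo$Accuracy
best <- loo[which.max(loo$Accuracy), ]  # row of the selected k

# Counts retyped from the confusionMatrix() printout above
classes <- c("bus", "opel", "saab", "van")
cm <- matrix(c(207,   6,   8,   1,
                 4, 137,  54,   0,
                 5,  59, 148,   2,
                 2,  10,   7, 196),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Misclassified share on the training data = 1 - (diagonal / total)
resub_error <- 1 - sum(diag(cm)) / sum(cm)

best$k                    # selected number of neighbours
best$Error                # LOO CV error of the selected model
resub_error               # re-substitution error implied by the matrix
best$Error - resub_error  # how optimistic the re-substitution estimate is
```

The matrix gives 158 misclassified rows out of 846, while the `mean(...)` call above reports 0.1879433, which corresponds to 159 rows. The two `predict()` calls therefore disagreed on one observation. A plausible cause, which is an assumption not checked here, is random tie-breaking among the neighbours; storing the predictions once and reusing them would keep both numbers consistent.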

2. Use the leafshape data set from the DAAG package to build a naive Bayes model. What are the LOO CV error and the re-substitution error? What is the accuracy?

Proposed answer:

library(caret)
library(DAAG)
library(dplyr)

data_set <- leafshape

ctrl.loo <- trainControl(method = 'LOOCV',
                         search = 'grid') 

(model.nb <- train(location ~ ., 
                   data = data_set, 
                   method = 'nb',
                   trControl = ctrl.loo))
## Naive Bayes 
## 
## 286 samples
##   8 predictor
##   6 classes: 'Sabah', 'Panama', 'Costa Rica', 'N Queensland', 'S Queensland', 'Tasmania' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 285, 285, 285, 285, 285, 285, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa 
##   FALSE             NA      NA
##    TRUE      0.5839161  0.4816
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.
1 - model.nb$results["Accuracy"] # LOO CV error rate (column 2 is usekernel here, so select Accuracy by name)
##    Accuracy
## 1        NA
## 2 0.4160839
mean(predict(model.nb) != data_set$location) # re-substitution error
## [1] 0.3146853
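
In caret's results table the tuning parameters come before the performance columns, so the position of Accuracy changes from method to method. The sketch below uses a hand-typed stand-in for `model.nb$results` (the values are retyped from the printout above; the column order is an assumption based on that printout) to show why selecting the column by name is safer than selecting it by position.

```r
# Stand-in for model.nb$results, retyped from the printout above.
# The column order (fL, usekernel, adjust, Accuracy, Kappa) is an assumption.
res <- data.frame(
  fL        = 0,
  usekernel = c(FALSE, TRUE),
  adjust    = 1,
  Accuracy  = c(NA, 0.5839161),
  Kappa     = c(NA, 0.4816)
)

names(res)[2]                    # position 2 is a tuning parameter, not Accuracy

loo_error <- 1 - res$Accuracy    # select the metric by name instead
best <- which.max(res$Accuracy)  # which.max() ignores the NA row

res$Accuracy[best]               # LOO CV accuracy of the selected model
loo_error[best]                  # LOO CV error of the selected model
```

With these values the selected model (usekernel = TRUE) has a LOO CV accuracy of 0.5839161 and a LOO CV error of 0.4160839, compared with a re-substitution error of 0.3146853.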

3. Use the painters data set from the MASS package to build an LDA model. What are the re-substitution error and the LOO CV error? Make a confusion matrix.

Proposed answer:

library(caret)
library(MASS)
library(dplyr)

data_set <- painters

ctrl.loo <- trainControl(method = 'LOOCV',
                         search = 'grid') 

(model.lda <- train(School ~ ., 
                    data = data_set, 
                    method = 'lda',
                    trControl = ctrl.loo))
## Linear Discriminant Analysis 
## 
## 54 samples
##  4 predictor
##  8 classes: 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.3333333  0.2192771
1 - model.lda$results["Accuracy"] # LOO CV error rate
##    Accuracy
## 1 0.6666667
mean(predict(model.lda) != data_set$School) # re-substitution error
## [1] 0.4444444
confusionMatrix(predict(model.lda), data_set$School)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction A B C D E F G H
##          A 5 4 0 0 0 1 1 0
##          B 0 1 2 0 0 0 0 0
##          C 1 1 2 0 0 0 0 1
##          D 2 0 0 9 1 0 1 0
##          E 0 0 2 0 4 0 1 0
##          F 0 0 0 0 0 2 0 0
##          G 0 0 0 1 1 1 4 0
##          H 2 0 0 0 1 0 0 3
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5556         
##                  95% CI : (0.414, 0.6908)
##     No Information Rate : 0.1852         
##     P-Value [Acc > NIR] : 1.328e-09      
##                                          
##                   Kappa : 0.4812         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity           0.50000  0.16667  0.33333   0.9000  0.57143  0.50000
## Specificity           0.86364  0.95833  0.93750   0.9091  0.93617  1.00000
## Pos Pred Value        0.45455  0.33333  0.40000   0.6923  0.57143  1.00000
## Neg Pred Value        0.88372  0.90196  0.91837   0.9756  0.93617  0.96154
## Prevalence            0.18519  0.11111  0.11111   0.1852  0.12963  0.07407
## Detection Rate        0.09259  0.01852  0.03704   0.1667  0.07407  0.03704
## Detection Prevalence  0.20370  0.05556  0.09259   0.2407  0.12963  0.03704
## Balanced Accuracy     0.68182  0.56250  0.63542   0.9045  0.75380  0.75000
##                      Class: G Class: H
## Sensitivity           0.57143  0.75000
## Specificity           0.93617  0.94000
## Pos Pred Value        0.57143  0.50000
## Neg Pred Value        0.93617  0.97917
## Prevalence            0.12963  0.07407
## Detection Rate        0.07407  0.05556
## Detection Prevalence  0.12963  0.11111
## Balanced Accuracy     0.75380  0.84500
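
As a cross-check that does not go through caret, `MASS::lda()` can produce leave-one-out predictions itself through its `CV = TRUE` argument. The following is a minimal sketch, not the method used in the answer above; the object names are made up for this illustration, and the results are expected to be close to the caret figures but are not guaranteed to match them digit for digit.

```r
library(MASS)  # provides lda() and the painters data

# Re-substitution: fit on all rows, then predict those same rows
fit <- lda(School ~ ., data = painters)
resub_error <- mean(predict(fit)$class != painters$School)

# Leave-one-out: with CV = TRUE, lda() returns for every row the class
# predicted by a model fitted without that row
fit_cv <- lda(School ~ ., data = painters, CV = TRUE)
loo_error <- mean(fit_cv$class != painters$School)

c(resubstitution = resub_error, loo = loo_error)

# LOO confusion matrix, built from the held-out predictions
table(Prediction = fit_cv$class, Reference = painters$School)
```

The re-substitution error is computed on rows the model has already seen, so it is usually the more optimistic of the two numbers, which is what the caret output above also shows.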

4. Use the Cars93 data set from the MASS package to build a classification tree in which the variable Type depends on the variables Length, Weight, EngineSize, Horsepower and RPM, with tuneLength = 5. What are the re-substitution error and the LOO CV error? Make a confusion matrix and plot the tree.

Proposed answer:

library(caret)
library(MASS)
library(dplyr)
library(rattle)

data_set <- Cars93

ctrl.loo <- trainControl(method = 'LOOCV',
                         search = 'grid') 

(model.tree <- train(Type ~ Length + Weight + EngineSize + Horsepower + RPM, 
                     data = data_set, 
                     method = 'rpart',
                     trControl = ctrl.loo,
                     tuneLength = 5))
## CART 
## 
## 93 samples
##  5 predictor
##  6 classes: 'Compact', 'Large', 'Midsize', 'Small', 'Sporty', 'Van' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa       
##   0.02816901  0.5698925  0.4736842105
##   0.07042254  0.5053763  0.3883328567
##   0.09859155  0.4086022  0.2581580856
##   0.16901408  0.4193548  0.2476404494
##   0.28169014  0.2365591  0.0007566586
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02816901.
1 - model.tree$results["Accuracy"] # LOO CV error rate for each cp
##    Accuracy
## 1 0.4301075
## 2 0.4946237
## 3 0.5913978
## 4 0.5806452
## 5 0.7634409
mean(predict(model.tree) != data_set$Type) # re-substitution error
## [1] 0.2903226
confusionMatrix(predict(model.tree), data_set$Type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Compact Large Midsize Small Sporty Van
##    Compact      14     1       2     1      7   3
##    Large         0    10       3     0      0   0
##    Midsize       2     0      16     0      2   0
##    Small         0     0       0    20      5   0
##    Sporty        0     0       0     0      0   0
##    Van           0     0       1     0      0   6
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7097          
##                  95% CI : (0.6064, 0.7992)
##     No Information Rate : 0.2366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6428          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Compact Class: Large Class: Midsize Class: Small
## Sensitivity                  0.8750       0.9091         0.7273       0.9524
## Specificity                  0.8182       0.9634         0.9437       0.9306
## Pos Pred Value               0.5000       0.7692         0.8000       0.8000
## Neg Pred Value               0.9692       0.9875         0.9178       0.9853
## Prevalence                   0.1720       0.1183         0.2366       0.2258
## Detection Rate               0.1505       0.1075         0.1720       0.2151
## Detection Prevalence         0.3011       0.1398         0.2151       0.2688
## Balanced Accuracy            0.8466       0.9363         0.8355       0.9415
##                      Class: Sporty Class: Van
## Sensitivity                 0.0000    0.66667
## Specificity                 1.0000    0.98810
## Pos Pred Value                 NaN    0.85714
## Neg Pred Value              0.8495    0.96512
## Prevalence                  0.1505    0.09677
## Detection Rate              0.0000    0.06452
## Detection Prevalence        0.0000    0.07527
## Balanced Accuracy           0.5000    0.82738
fancyRpartPlot(model.tree$finalModel)
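
To make explicit what the LOO CV estimate measures, the loop below refits the tree 93 times, each time leaving one car out and predicting only that car. This is a hand-rolled sketch with rpart, not caret's internal procedure: the cp value is copied from the printout above, all other rpart settings are left at their defaults, and the names `cp_best`, `pred` and `loo_error` are made up for this illustration, so the result may differ somewhat from caret's figure.

```r
library(MASS)   # Cars93 data
library(rpart)  # classification trees

cp_best <- 0.02816901  # complexity parameter copied from the caret output above
n <- nrow(Cars93)
pred <- character(n)

for (i in seq_len(n)) {
  # Fit on every row except i
  fit_i <- rpart(Type ~ Length + Weight + EngineSize + Horsepower + RPM,
                 data = Cars93[-i, ],
                 method = "class",
                 control = rpart.control(cp = cp_best))
  # Predict only the held-out row
  pred[i] <- as.character(predict(fit_i, newdata = Cars93[i, ], type = "class"))
}

# Share of held-out cars that were misclassified
loo_error <- mean(pred != as.character(Cars93$Type))
loo_error

# Confusion matrix of the held-out predictions
table(Prediction = factor(pred, levels = levels(Cars93$Type)),
      Reference  = Cars93$Type)
```

Every prediction in this table comes from a tree that never saw the car in question, unlike the re-substitution matrix printed above.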