1. Using the Vehicle data set from the mlbench package, construct a KNN model for k from 1 to 5. Which k gives the best results? What are the re-substitution error and the LOO CV error? Make a confusion matrix. After loading the package, load the Vehicle data set as follows:
data(Vehicle)
Proposed answer:
library(caret)
library(mlbench)
library(dplyr)
data(Vehicle)
data_set <- Vehicle
ctrl.loo <- trainControl(method = 'LOOCV',
search = 'grid')
(model.knn <- train(Class ~ .,
data = data_set,
method = 'knn',
tuneGrid = data.frame(k = 1:5),
trControl = ctrl.loo))
## k-Nearest Neighbors
##
## 846 samples
## 18 predictor
## 4 classes: 'bus', 'opel', 'saab', 'van'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 845, 845, 845, 845, 845, 845, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.6524823 0.5364211
## 2 0.6453901 0.5269118
## 3 0.6643026 0.5523409
## 4 0.6572104 0.5430247
## 5 0.6572104 0.5429991
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
1-model.knn$results[2] # LOO CV error rate for each k
## Accuracy
## 1 0.3475177
## 2 0.3546099
## 3 0.3356974
## 4 0.3427896
## 5 0.3427896
mean(predict(model.knn) != data_set$Class) # re-substitution error
## [1] 0.1879433
confusionMatrix(predict(model.knn), data_set$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bus opel saab van
## bus 207 6 8 1
## opel 4 137 54 0
## saab 5 59 148 2
## van 2 10 7 196
##
## Overall Statistics
##
## Accuracy : 0.8132
## 95% CI : (0.7853, 0.839)
## No Information Rate : 0.2577
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.751
##
## Mcnemar's Test P-Value : 0.02524
##
## Statistics by Class:
##
## Class: bus Class: opel Class: saab Class: van
## Sensitivity 0.9495 0.6462 0.6820 0.9849
## Specificity 0.9761 0.9085 0.8951 0.9706
## Pos Pred Value 0.9324 0.7026 0.6916 0.9116
## Neg Pred Value 0.9824 0.8848 0.8908 0.9952
## Prevalence 0.2577 0.2506 0.2565 0.2352
## Detection Rate 0.2447 0.1619 0.1749 0.2317
## Detection Prevalence 0.2624 0.2305 0.2530 0.2541
## Balanced Accuracy 0.9628 0.7774 0.7885 0.9778
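As a cross-check, the LOO CV error of the selected model can be read off directly from the caret object instead of scanning the whole results table. A minimal sketch, reusing the `model.knn` object fitted above:

```r
# Value of k selected by caret (k = 3 here)
best.k <- model.knn$bestTune$k

# LOO CV error for the selected k: 1 minus the corresponding accuracy
1 - model.knn$results$Accuracy[model.knn$results$k == best.k]
## [1] 0.3356974
```

This matches row k = 3 of the error table above and is noticeably larger than the re-substitution error, as expected.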
2. Using the leafshape data set from the DAAG package, construct a naive Bayes model and compute the LOO CV and re-substitution errors. What is the accuracy?
Proposed answer:
library(caret)
library(DAAG)
library(dplyr)
data_set <- leafshape
ctrl.loo <- trainControl(method = 'LOOCV',
search = 'grid')
(model.nb <- train(location ~ .,
data = data_set,
method = 'nb',
trControl = ctrl.loo))
## Naive Bayes
##
## 286 samples
## 8 predictor
## 6 classes: 'Sabah', 'Panama', 'Costa Rica', 'N Queensland', 'S Queensland', 'Tasmania'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 285, 285, 285, 285, 285, 285, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE NA NA
## TRUE 0.5839161 0.4816
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
1-model.nb$results$Accuracy # LOO CV error rate (NA where the non-kernel fit failed)
## [1]        NA 0.4160839
mean(predict(model.nb) != data_set$location) # re-substitution error
## [1] 0.3146853
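The exercise also asks for the accuracy. One way to obtain the re-substitution accuracy, the complement of the error just computed, is via `confusionMatrix()`; a sketch, reusing the `model.nb` object above:

```r
# Re-substitution accuracy of the fitted naive Bayes model
confusionMatrix(predict(model.nb), data_set$location)$overall['Accuracy']
##  Accuracy
## 0.6853147
```

The LOO CV accuracy of the selected kernel model is the 0.5839161 reported by train() itself.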
3. Using the painters data set from the MASS package, construct an LDA model. What are the re-substitution error and the LOO CV error? Make a confusion matrix.
Proposed answer:
library(caret)
library(MASS)
library(dplyr)
data_set <- painters
ctrl.loo <- trainControl(method = 'LOOCV',
search = 'grid')
(model.lda <- train(School ~ .,
data = data_set,
method = 'lda',
trControl = ctrl.loo))
## Linear Discriminant Analysis
##
## 54 samples
## 4 predictor
## 8 classes: 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 53, 53, 53, 53, 53, 53, ...
## Resampling results:
##
## Accuracy Kappa
## 0.3333333 0.2192771
1-model.lda$results[2] # LOO CV error rate
## Accuracy
## 1 0.6666667
mean(predict(model.lda) != data_set$School) # re-substitution error
## [1] 0.4444444
confusionMatrix(predict(model.lda), data_set$School)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E F G H
## A 5 4 0 0 0 1 1 0
## B 0 1 2 0 0 0 0 0
## C 1 1 2 0 0 0 0 1
## D 2 0 0 9 1 0 1 0
## E 0 0 2 0 4 0 1 0
## F 0 0 0 0 0 2 0 0
## G 0 0 0 1 1 1 4 0
## H 2 0 0 0 1 0 0 3
##
## Overall Statistics
##
## Accuracy : 0.5556
## 95% CI : (0.414, 0.6908)
## No Information Rate : 0.1852
## P-Value [Acc > NIR] : 1.328e-09
##
## Kappa : 0.4812
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity 0.50000 0.16667 0.33333 0.9000 0.57143 0.50000
## Specificity 0.86364 0.95833 0.93750 0.9091 0.93617 1.00000
## Pos Pred Value 0.45455 0.33333 0.40000 0.6923 0.57143 1.00000
## Neg Pred Value 0.88372 0.90196 0.91837 0.9756 0.93617 0.96154
## Prevalence 0.18519 0.11111 0.11111 0.1852 0.12963 0.07407
## Detection Rate 0.09259 0.01852 0.03704 0.1667 0.07407 0.03704
## Detection Prevalence 0.20370 0.05556 0.09259 0.2407 0.12963 0.03704
## Balanced Accuracy 0.68182 0.56250 0.63542 0.9045 0.75380 0.75000
## Class: G Class: H
## Sensitivity 0.57143 0.75000
## Specificity 0.93617 0.94000
## Pos Pred Value 0.57143 0.50000
## Neg Pred Value 0.93617 0.97917
## Prevalence 0.12963 0.07407
## Detection Rate 0.07407 0.05556
## Detection Prevalence 0.12963 0.11111
## Balanced Accuracy 0.75380 0.84500
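For comparison, MASS itself can produce the leave-one-out predictions: `lda()` with `CV = TRUE` returns jackknifed classifications, so the LOO CV error can be computed without caret. A sketch, assuming the same `data_set` as above; the result should agree with caret's LOOCV estimate up to implementation details:

```r
# MASS::lda with CV = TRUE performs leave-one-out cross-validation internally
lda.loo <- lda(School ~ ., data = data_set, CV = TRUE)

# LOO CV error from the jackknifed class predictions
mean(lda.loo$class != data_set$School)
```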
4. Using the Cars93 data set from the MASS package, construct a classification tree in which the variable Type depends on the variables Length, Weight, EngineSize, Horsepower and RPM, with tuneLength = 5. What are the re-substitution error and the LOO CV error? Make a confusion matrix and a plot.
Proposed answer:
library(caret)
library(MASS)
library(dplyr)
library(rattle)
data_set <- Cars93
ctrl.loo <- trainControl(method = 'LOOCV',
search = 'grid')
(model.tree <- train(Type ~ Length + Weight + EngineSize + Horsepower + RPM,
data = data_set,
method = 'rpart',
trControl = ctrl.loo,
tuneLength = 5))
## CART
##
## 93 samples
## 5 predictor
## 6 classes: 'Compact', 'Large', 'Midsize', 'Small', 'Sporty', 'Van'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 92, 92, 92, 92, 92, 92, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.02816901 0.5698925 0.4736842105
## 0.07042254 0.5053763 0.3883328567
## 0.09859155 0.4086022 0.2581580856
## 0.16901408 0.4193548 0.2476404494
## 0.28169014 0.2365591 0.0007566586
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02816901.
1-model.tree$results[2] # LOO CV error rate for each cp
## Accuracy
## 1 0.4301075
## 2 0.4946237
## 3 0.5913978
## 4 0.5806452
## 5 0.7634409
mean(predict(model.tree) != data_set$Type) # re-substitution error
## [1] 0.2903226
confusionMatrix(predict(model.tree), data_set$Type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Compact Large Midsize Small Sporty Van
## Compact 14 1 2 1 7 3
## Large 0 10 3 0 0 0
## Midsize 2 0 16 0 2 0
## Small 0 0 0 20 5 0
## Sporty 0 0 0 0 0 0
## Van 0 0 1 0 0 6
##
## Overall Statistics
##
## Accuracy : 0.7097
## 95% CI : (0.6064, 0.7992)
## No Information Rate : 0.2366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6428
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Compact Class: Large Class: Midsize Class: Small
## Sensitivity 0.8750 0.9091 0.7273 0.9524
## Specificity 0.8182 0.9634 0.9437 0.9306
## Pos Pred Value 0.5000 0.7692 0.8000 0.8000
## Neg Pred Value 0.9692 0.9875 0.9178 0.9853
## Prevalence 0.1720 0.1183 0.2366 0.2258
## Detection Rate 0.1505 0.1075 0.1720 0.2151
## Detection Prevalence 0.3011 0.1398 0.2151 0.2688
## Balanced Accuracy 0.8466 0.9363 0.8355 0.9415
## Class: Sporty Class: Van
## Sensitivity 0.0000 0.66667
## Specificity 1.0000 0.98810
## Pos Pred Value NaN 0.85714
## Neg Pred Value 0.8495 0.96512
## Prevalence 0.1505 0.09677
## Detection Rate 0.0000 0.06452
## Detection Prevalence 0.0000 0.07527
## Balanced Accuracy 0.5000 0.82738
fancyRpartPlot(model.tree$finalModel)
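Besides the plot, the trained object can be inspected directly: `bestTune` holds the complexity parameter selected by LOO CV, and printing the final model shows the splits in text form. A sketch using the `model.tree` object above:

```r
# Complexity parameter chosen by LOO CV (cp = 0.02816901 in the run above)
model.tree$bestTune

# Text representation of the final tree: splits, node sizes, predicted classes
print(model.tree$finalModel)
```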