ROC in rfe() in caret package for R
I am using the caret package in R for training a radial basis SVM for classification; in addition, a linear SVM is used for variable selection. With metric="Accuracy", this works fine, but eventually I am more interested in optimizing metric="ROC". While the ROC is calculated for all models that are fit, there seems to be some problem with aggregating the ROC values.
The following is some example code:
library(caret)
library(mlbench)
set.seed(0)
data(Sonar)
x<-scale(Sonar[,1:60])
y<-as.factor(Sonar[,61])
# Custom summary function to use both
# defaultSummary() and twoClassSummary
# Also input and output of summary function are printed
svm.summary<-function(data, lev = NULL, model = NULL){
print(head(data,n=3))
a<-defaultSummary(data, lev, model)
b<-twoClassSummary(data, lev, model)
out<-c(a,b)
print(out)
out}
fitControl <- trainControl(
method = "cv",
number = 2,
classProbs = TRUE,
summaryFunction=svm.summary,
verbose=T,
allowParallel = FALSE)
# Ranking function: Rank Variables using a linear
# SVM
rankSVM<-function(object,x,y) {
print("ranking")
obj<-ksvm(x=as.matrix(x), y=y,
kernel=vanilladot,
kpar=list(), C=10,
scaled=F)
w<-t(obj@coef[[1]]%*%obj@xmatrix[[1]])
z<-abs(w)/sqrt(sum(w^2))
ord<-order(z,decreasing=T)
data.frame(var=dimnames(z)[[1]][ord],Overall=z[ord])
}
svmFuncs<-getModelInfo("svmRadial",regex=F)
svmFit<-function(x,y,first,last,...) {
out<-train(x=x,y=as.factor(y),
method="svmRadial",
trControl=fitControl,
scaled=F,
metric="Accuracy",
maximize=T,
returnData=T)
out$finalModel}
selectionFunctions<-list(summary=svm.summary,
fit=svmFit,
pred=svmFuncs$svmRadial$predict,
prob=svmFuncs$svmRadial$prob,
rank=rankSVM,
selectSize=pickSizeBest,
selectVar=pickVars)
selectionControl<-rfeControl(functions=selectionFunctions,
rerank=F,
verbose=T,
method="cv",
number=2)
subsets<-c(1,30,60)
svmProfile<-rfe(x=x,y=y,
sizes=subsets,
metric="Accuracy",
maximize=TRUE,
rfeControl=selectionControl)
svmProfile
The final output is the following:
> svmProfile
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold)
Resampling performance over subset size:
Variables Accuracy Kappa ROC Sens Spec AccuracySD KappaSD ROCSD SensSD SpecSD Selected
1 0.8075 0.6122 NaN 0.8292 0.7825 0.02981 0.06505 NA 0.06153 0.1344 *
30 0.8028 0.6033 NaN 0.8205 0.7825 0.00948 0.02533 NA 0.09964 0.1344
60 0.8028 0.6032 NaN 0.8206 0.7823 0.00948 0.02679 NA 0.12512 0.1635
The top 1 variables (out of 1):
V49
ROC is NaN. Inspecting the output (as verbose=T and the summary function was patched to display both its output and parts of its input) reveals that while when tuning the SVMs in the inner loop, ROC seems to be calculated correctly:
+ Fold1: sigma=0.01172, C=0.25
pred obs M R
1 M R 0.6658878 0.3341122
2 M R 0.5679477 0.4320523
3 R R 0.2263576 0.7736424
Accuracy Kappa ROC Sens Spec
0.6730769 0.3480826 0.7961310 0.6428571 0.7083333
- Fold1: sigma=0.01172, C=0.25
+ Fold1: sigma=0.01172, C=0.50
pred obs M R
1 M R 0.7841249 0.2158751
2 M R 0.7231365 0.2768635
3 R R 0.3033492 0.6966508
Accuracy Kappa ROC Sens Spec
0.7692308 0.5214724 0.8407738 0.9642857 0.5416667
- Fold1: sigma=0.01172, C=0.50
[...]
there seems to be a problem in the outer iteration. "Between" two folds we get the following:
-(rfe) fit Fold1 size: 1
pred obs Variables
1 M R 1
2 M R 1
3 M R 1
Accuracy Kappa ROC Sens Spec
0.7864078 0.5662328 NA 0.8727273 0.6875000
pred obs Variables
1 R R 30
2 M R 30
3 M R 30
Accuracy Kappa ROC Sens Spec
0.7961165 0.5853939 NA 0.8909091 0.6875000
pred obs Variables
1 R R 60
2 M R 60
3 M R 60
Accuracy Kappa ROC Sens Spec
0.7961165 0.5842783 NA 0.9090909 0.6666667
+(rfe) fit Fold2 size: 60
So here it seems the input for the summary function is a matrix that does not contain the class probabilities but the number of variables instead, and so the ROCs cannot be calculated / aggregated correctly. Does anybody know how to prevent this? Did I forget to tell caret to output class probabilities in some place?
Help is greatly appreciated, as caret is really a cool package to use and would save me plenty of work if I can get this to run correctly.
Thoralf
getModelInfo
is designed to get code for train
and doesn't automatically work with rfe
(I'll make a note of that in the documentation). rfe
doesn't look for a slot called probs
and no probability predictions means not ROC summary.
You might want base your code on caretFuncs
, which is designed to work with rfe
and should automate a lot of what I think you would like to do.
For example, in caretFuncs
, the pred
module will create class and probability predictions:
function(object, x) {
tmp <- predict(object, x)
if(object$modelType == "Classification" &
!is.null(object$modelInfo$prob)) {
out <- cbind(data.frame(pred = tmp),
as.data.frame(predict(object, x, type = "prob")))
} else out <- tmp
out
}
You might be able to simply plug in your rankSVM
into caretFuncs$rank
.
Take a look at the feature selection page on the website. It has details about what code modules you will need.
链接地址: http://www.djcxy.com/p/38384.html上一篇: 从Caret软件包预测功能会给出错误