different results using train(), predict() and resamples()
I'm using the caret package to analyse various models and I'm assessing the results using train(), predict() and resamples().
Why are these results in the following example different?
I'm interested in sensitivity (true positives). Why is J48_fit assessed as having a sensitivity of .71, then .81, then .71 again?
The same happens when I run other models - the sensitivity changes depending on the assessment.
NB: I have included two models here to illustrate the resamples() function, which requires at least two models as input, but my main question is about why the results differ depending on which method one uses.
In other words, what is the difference between the outcome of train() (C5.0_fit/J48_fit), predict() and resamples()? What is going on 'behind the scenes' and which result should I trust?
EXAMPLE:
library(caret)
library(C50)  # provides the churn data (churnTrain / churnTest)
data(churn)
Seed <- 10
# Set train options
set.seed(Seed)
Train_options <- trainControl(method = "cv", number = 10,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)
# C5.0 model:
set.seed(Seed)
C5.0_fit <- train(churn~., data=churnTrain, method="C5.0", metric="ROC",
                  trControl=Train_options)
# J48 model (the "J48" method additionally requires the RWeka package):
set.seed(Seed)
J48_fit <- train(churn~., data=churnTrain, method="J48", metric="ROC",
                 trControl=Train_options)
# Get results by printing the outcome
print(J48_fit)
# ROC Sens Spec
# Best (sensitivity): 0.87 0.71 0.98
# Get results using predict()
set.seed(Seed)
J48_fit_predict <- predict(J48_fit, churnTrain)
confusionMatrix(J48_fit_predict, churnTrain$churn)
# Reference
# Prediction yes no
# yes 389 14
# no 94 2836
# Sens : 0.81
# Spec : 0.99
# Get results by comparing algorithms with resamples()
set.seed(Seed)
results <- resamples(list(C5.0_fit=C5.0_fit, J48_fit=J48_fit))
summary(results)
# ROC mean
# C5.0_fit 0.92
# J48_fit 0.87
# Sens mean
# C5.0_fit 0.76
# J48_fit 0.71
# Spec mean
# C5.0_fit 0.99
# J48_fit 0.98
By the way, here is a function for getting all three results together:
Get_results <- function(...){
  Args <- list(...)
  Model_names <- as.list(sapply(substitute({...})[-1], deparse))
  message("Model names:")
  print(Model_names)
  # Function for getting the row with the maximum sensitivity
  Max_sens <- function(df, colname = "results"){
    df <- df[[colname]]
    new_df <- df[which.max(df$Sens), ]
    x <- sapply(new_df, is.numeric)
    new_df[, x] <- round(new_df[, x], 2)
    new_df
  }
  # Find max Sens for each model
  message("Max sensitivity from model printout:")
  Max_sens_out <- lapply(Args, Max_sens)
  names(Max_sens_out) <- Model_names
  print(Max_sens_out)
  # Find predict() result for each model
  message("Results using predict():")
  set.seed(Seed)
  Predict_out <- lapply(Args, function(x) predict(x, churnTrain))
  Predict_results <- lapply(Predict_out, function(x) confusionMatrix(x, churnTrain$churn))
  names(Predict_results) <- Model_names
  print(Predict_results)
  # Find resamples() results for each model
  message("Results using resamples():")
  set.seed(Seed)
  results <- resamples(list(...), modelNames = Model_names)
  # names(results) <- Model_names
  summary(results)
}
# Test
Get_results(C5.0_fit, J48_fit)
Many thanks!
ANSWER:
The "best" sensitivity that you printed is the average of model performance across the 10 folds from your cross-validation. You can see the performance for each fold with J48_fit$resample. To confirm, take the mean of the first column, ROC, with mean(J48_fit$resample[, 1]) and you'll get 0.865799, which matches the printed ROC of 0.87.
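A minimal sketch (assuming J48_fit was trained with the 10-fold trainControl settings above, so that $resample holds one ROC/Sens/Spec row per held-out fold):

# Per-fold hold-out performance, one row per CV fold
J48_fit$resample

# Averaging the fold-level columns reproduces the values that print(J48_fit)
# and summary(resamples(...)) report
colMeans(J48_fit$resample[, c("ROC", "Sens", "Spec")])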
When you use predict() on the full dataset you end up with a different result because the data being scored is not what was used in the resamples: you're getting the final model's performance on the whole training set (data it was fitted on), instead of on the 10% held out in each fold, so that estimate is more optimistic.
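A short sketch of the comparison (assuming the J48_fit object from the example above; both values come from standard caret slots):

# Cross-validated sensitivity: mean of the 10 held-out-fold estimates
# (~0.71, matching print(J48_fit) and summary(resamples(...)))
mean(J48_fit$resample$Sens)

# Resubstitution sensitivity: the final model scored on the data it was
# trained on (~0.81, matching the confusionMatrix() output above)
cm <- confusionMatrix(predict(J48_fit, churnTrain), churnTrain$churn)
cm$byClass["Sensitivity"]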