Why is caret train taking up so much memory?

When I train just using glm, everything works, and I don't even come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory.

Is this because train is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this... any hints? I only care about the performance summary and maybe the predicted responses.

(I know it's not related to storing data from each iteration of the parameter-tuning grid search, because I believe there's no tuning grid for glm.)


The problem is twofold. i) train() doesn't just fit a model via glm(); it will bootstrap that model, so even with the defaults, train() will do 25 bootstrap samples, which, coupled with problem ii), is the (or a) source of your problem, and ii) train() simply calls glm() with its defaults. And those defaults are to store the model frame (argument model = TRUE of ?glm), which includes a copy of the data in model-frame style. The object returned by train() already stores a copy of the data in $trainingData, and the "glm" object in $finalModel also holds a copy of the actual data.

At this point, simply running glm() via train() will produce 25 copies of the fully expanded model.frame and of the original data, which will all need to be held in memory during the resampling process; whether they are held concurrently or consecutively is not immediately clear from a quick look at the code, as the resampling happens in an lapply() call. There will also be 25 copies of the raw data.

Once the resampling is finished, the returned object will contain two copies of the raw data and a full copy of the model.frame. If your training data is large relative to the available RAM or contains many factors to be expanded in the model.frame, then you could easily be using huge amounts of memory just carrying copies of the data around.

If you add model = FALSE to your train() call, that might make a difference. Here is a small example using the clotting data in ?glm:

clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)

then

> m1 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm", 
+             model = TRUE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> m2 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+             model = FALSE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> object.size(m1)
121832 bytes
> object.size(m2)
116456 bytes
> ## ordinary glm() call:
> m3 <- glm(lot1 ~ log(u), data=clotting, family = Gamma)
> object.size(m3)
47272 bytes
> m4 <- glm(lot1 ~ log(u), data=clotting, family = Gamma, model = FALSE)
> object.size(m4)
42152 bytes

So there is a size difference in the returned object, and memory use during training will be lower. How much lower depends on whether the internals of train() keep all copies of the model.frame in memory during the resampling process.
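If you want to see where those copies end up in the returned object, something along these lines should show it (using the m1 fit from above; the component names are the usual train and glm ones, so check with str() if in doubt):

object.size(m1$trainingData)       ## copy of the data kept by train()
object.size(m1$finalModel$model)   ## model frame kept because glm() ran with model = TRUE
object.size(m1$finalModel$data)    ## glm()'s own copy of the raw data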

The object returned by train() is also significantly larger than that returned by glm(), as mentioned by @DWin in the comments below.

To take this further, either study the code more closely or email Max Kuhn, the maintainer of caret, to enquire about options to reduce the memory footprint.


Gavin's answer is spot on. I built the function for ease of use rather than for speed or efficiency. [1]

First, using the formula interface can be an issue when you have a lot of predictors. This is something that R Core could fix; the formula approach requires a very large but sparse terms() matrix to be retained, and R has packages to deal effectively with that issue. For example, with n = 3,000 and p = 2,000, a 3-tree random forest model object was 1.5 times larger in size and took 23 times longer to execute when using the formula interface (282s vs 12s).
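As a rough sketch of what the non-formula interface looks like (reusing the clotting data from the answer above, and doing the log transformation up front since there is no formula to do it for you):

x <- data.frame(logu = log(clotting$u))   ## predictors as a data frame
y <- clotting$lot1                        ## outcome as a plain vector
m5 <- train(x, y, method = "glm", family = Gamma)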

Second, you don't have to keep the training data (see the returnData argument in trainControl()).
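For example, something like this (only returnData and model are non-default here; boot with 25 resamples is what train() does anyway):

tc <- trainControl(method = "boot", number = 25, returnData = FALSE)
m6 <- train(lot1 ~ log(u), data = clotting, family = Gamma,
            method = "glm", model = FALSE, trControl = tc)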

Also, since R doesn't have any real shared memory infrastructure, Gavin is correct about the number of copies of the data that are retained in memory. Basically, a list is created for every resample and lapply() is used to process the list, then return only the resampled estimates. An alternative would be to sequentially make one copy of the data (for the current resample), do the required operations, then repeat for the remaining iterations. The issue there is I/O and the inability to do any parallel processing. [2]
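Conceptually, the pattern is something like this (a sketch of the idea, not caret's actual internals):

idx_list <- createResample(clotting$lot1, times = 25)          ## one index vector per resample
perf <- lapply(idx_list, function(idx) {
  fit  <- glm(lot1 ~ log(u), data = clotting[idx, ], family = Gamma)
  hold <- clotting[-unique(idx), , drop = FALSE]               ## out-of-bag rows
  mean((predict(fit, hold, type = "response") - hold$lot1)^2)  ## a hold-out summary
})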

If you have a large data set, I suggest using the non-formula interface (even though the actual model, like glm, eventually uses a formula). Also, for large data sets, train() saves the resampling indices for use by resamples() and other functions. You could probably remove those too.
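The indices are kept in the control component of the fitted object, so (an assumption about the object layout; verify with str(m1$control, max.level = 1)) dropping them afterwards is something like:

m1$control$index    <- NULL   ## resampling indices (assumed location)
m1$control$indexOut <- NULL   ## hold-out indices (assumed location)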

Yang, it would be good to know more about the data via str(data) so we can understand the dimensions and other aspects (e.g. factors with many levels, etc.).

I hope that helps,

Max

[1] I should note that we go to great lengths to fit as few models as possible when we can. The "sub-model" trick is used for many models, such as pls, gbm, rpart, earth and many others. Also, when a model has formula and non-formula interfaces (e.g. lda() or earth()), we default to the non-formula interface.

[2] Every once in a while I get the insane urge to reboot the train() function. Using foreach might get around some of these issues.


I think the above answers are a bit outdated. The caret and caretEnsemble packages now include an additional trainControl argument, trim. It defaults to FALSE, but setting it to TRUE will significantly decrease the model size. You should use this in combination with returnData = FALSE for the smallest possible model sizes. If you're using a model ensemble, you should also specify these two arguments in the greedy/stack ensemble's trainControl.

In my case, a 1.6 GB model shrank to ~500 MB with both arguments in the ensemble control, and shrank further to ~300 MB when the arguments were also used in the greedy ensemble control.

Ensemble_control_A9 <- trainControl(trim = TRUE,
                                    method = "repeatedcv", number = 3, repeats = 2,
                                    verboseIter = TRUE,
                                    returnData = FALSE, returnResamp = "all",
                                    classProbs = TRUE, summaryFunction = twoClassSummary,
                                    savePredictions = TRUE, allowParallel = TRUE,
                                    sampling = "up")


Ensemble_greedy_A5 <- caretEnsemble(Ensemble_list_A5, metric = "ROC",
                                    trControl = trainControl(number = 2, trim = TRUE,
                                                             returnData = FALSE,
                                                             summaryFunction = twoClassSummary,
                                                             classProbs = TRUE))