Too slow or out of memory problems in Machine Learning/Data Mining
EDIT: An attempt to rephrase myself:
Tools like R and Weka are feature-rich but slow and limited in the size of data they can handle. Tools like Mahout and Vowpal Wabbit (VW, with its AllReduce extension) are targeted at clusters of ~1K nodes, and they are limited in their capabilities. VW, for example, can only "learn" by minimizing some loss function.
What I haven't seen in any popular software is the use of parallel programming (good old pthreads, MPI, etc.) for speeding things up. I suppose it is useful for the kinds of problems where a cluster would be overkill, but waiting for a program to finish while the other processor cores sit idle is a shame. [As an example, you can get a 26-core machine or an 88-core cluster at AWS, and a good parallel algorithm can give speedups of, say, 20x and 60x without resorting to heavyweight Hadoop-like systems.]
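For concreteness, here is a minimal sketch (plain Python standard library) of the kind of single-machine parallelism I mean: splitting a CPU-bound job across local cores instead of standing up a cluster. The work function and chunk boundaries are just placeholders.

```python
from multiprocessing import Pool
import math

def count_primes(bounds):
    """CPU-bound toy work: count primes in [lo, hi)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # split the range into chunks, one per core, and fan the work out
    chunks = [(i * 100_000, (i + 1) * 100_000) for i in range(8)]
    with Pool(processes=8) as pool:
        total = sum(pool.map(count_primes, chunks))
    print(total)
```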
What I hoped to learn from the community: (subjectively) your real-life problems/algorithms that are not large enough to be called big data, but still big enough that you felt faster would have been better. (Objectively) your experiences along the lines of "algorithm X on data with characteristics C and size D took T time, and I had M processor cores that I could have thrown at it if a parallel version of algorithm X had been available".
And the reason I ask, of course, is to understand the need for parallel programming in this field, and perhaps to start a community-driven effort to address it. My experiences with a few problems are detailed below.
What are some of the problems in machine learning, data mining and related fields that you find difficult to work with because they are too slow or require excessive amounts of memory?
As a hobby research project we built an out-of-core programming model that handles data larger than system memory and natively supports parallel/distributed execution. It showed good performance on some problems (see below), and we wish to expand this technology (hopefully community-driven) to real-life problems.
Some benchmarks (against Weka, Mahout and R):
a) Apriori Algorithm for frequent itemset mining [CPU-bound but average memory]
Webdocs dataset with 1.7M transactions over 5.2M unique items (1.4GB). The algorithm finds sets of items that appear frequently in transactions. At 10% support, Weka3 could not complete in 3 days. Our version completed in 4hr 24min (although, to be fair, we used tries instead of hashtables as in Weka). More importantly, on one 8-core machine it took 39min, and on 8 machines 6min 30sec (~40x).
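For readers unfamiliar with the algorithm, here is a toy, in-memory Apriori sketch in Python. It is not our implementation (which uses tries and out-of-core storage); it only illustrates the candidate-generation and support-counting loop being benchmarked, on a made-up four-transaction dataset.

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions) >= min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # frequent 1-itemsets
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]) for i, c in counts.items() if c / n >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # candidate generation: k-itemsets whose (k-1)-subsets are all frequent
        items = sorted({i for s in frequent for i in s})
        candidates = {frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # support counting
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c for c, cnt in counts.items() if cnt / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], 0.5))
```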
b) SlopeOne recommendation engine [High memory usage]
MovieLens dataset with 10M ratings from 70K users for 10K movies. SlopeOne recommends new movies based on collaborative filtering. Apache Mahout's "Taste" non-distributed recommender would fail with less than 6GB of memory. To benchmark the out-of-core performance, we restricted our version to 1/10th of that limit (600MB), and it completed with an 11% overhead in execution time (due to going out-of-core).
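Again purely for illustration, a minimal weighted Slope One predictor in Python. The tiny ratings dictionary stands in for the MovieLens data; the memory pressure in the real benchmark comes from the item-item difference and co-rating tables, which this sketch builds naively in RAM.

```python
from collections import defaultdict

def slope_one_predict(ratings, user, target_item):
    """ratings: {user: {item: rating}}. Predict `user`'s rating for
    `target_item` from average pairwise rating differences (weighted Slope One)."""
    diffs = defaultdict(float)   # (item_i, item_j) -> sum of (r_i - r_j)
    freqs = defaultdict(int)     # (item_i, item_j) -> number of co-raters
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    diffs[(i, j)] += ri - rj
                    freqs[(i, j)] += 1
    num, den = 0.0, 0
    for j, rj in ratings[user].items():
        if j != target_item and freqs[(target_item, j)]:
            f = freqs[(target_item, j)]
            num += (diffs[(target_item, j)] / f + rj) * f
            den += f
    return num / den if den else None

ratings = {"alice": {"m1": 5, "m2": 3}, "bob": {"m1": 4, "m2": 2, "m3": 4}}
print(slope_one_predict(ratings, "alice", "m3"))
```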
c) Dimensionality Reduction with Principal Component Analysis (PCA) [Both CPU and Memory bound]
The p53 mutants protein dataset of 32K samples with 5,400 attributes each (1.1GB). PCA reduces the dimensionality of a dataset by dropping the components with very small variances. Although our version can process data larger than system virtual memory, we benchmarked this dataset because R can process it. R completed the job in 86min. Our out-of-core version had no additional overhead; in fact, it completed in 67min on a single core and 14min on an 8-core machine.
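A toy version of the PCA step, using numpy's SVD on an in-memory array. The random matrix is a stand-in for the p53 data, and the 95% retained-variance threshold is an arbitrary choice for the example.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Project X (n_samples x n_features) onto the fewest principal
    components that retain `var_threshold` of the total variance."""
    Xc = X - X.mean(axis=0)                      # center each attribute
    # SVD of the centered data gives the principal axes without
    # explicitly forming the full covariance matrix.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)        # variance ratio per component
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    return Xc @ Vt[:k].T                         # reduced representation

X = np.random.rand(200, 50)                      # stand-in for 32K x 5400 data
print(pca_reduce(X).shape)
```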
The excellent software available today either works with data in the megabytes range by loading it into memory (R, Weka, numpy) or with tera/petabyte-scale data in data centers (Mahout, SPSS, SAS). There seems to be a gap in the gigabytes range: larger than virtual memory but smaller than "big data", although projects like numpy's Blaze, R's bigmemory, ScaLAPACK, etc. are starting to address this need.
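As a rough illustration of that "gigabytes range" idea with tools that already exist, here is a numpy memmap sketch that keeps the array on disk and streams over it in chunks, so the working set stays far below the file size. The file name, shape, and chunk size are made up.

```python
import numpy as np

n_rows, n_cols, chunk = 100_000, 100, 10_000

# create a disk-backed array (stand-in for a dataset larger than RAM)
data = np.memmap("features.dat", dtype=np.float32, mode="w+",
                 shape=(n_rows, n_cols))
for start in range(0, n_rows, chunk):            # fill it chunk by chunk
    data[start:start + chunk] = np.random.rand(chunk, n_cols)
data.flush()

# streaming per-column mean: only `chunk` rows are resident at a time
col_sum = np.zeros(n_cols, dtype=np.float64)
for start in range(0, n_rows, chunk):
    col_sum += data[start:start + chunk].sum(axis=0)
print((col_sum / n_rows)[:5])
```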
From your experience, can you share examples where such a faster, out-of-core tool would benefit the data mining/machine learning community?
This is really an open-ended question. From what I can tell, you are asking two things:
For the first question, one of the best tools that is used in many production environments with big, big data is Vowpal Wabbit (VW). Head over to hunch.net to take a look.
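For anyone who has not used VW: its input is a plain-text format of a label followed by `| feature:value` pairs, and training is driven from the command line. Here is a small Python snippet that writes such a file (the feature names, values, and file name are invented for the example):

```python
# write a tiny dataset in VW's "label | feature:value" text format
rows = [
    (1, {"age": 0.3, "clicks": 5.0}),
    (-1, {"age": 0.7, "clicks": 1.0}),
]
with open("clicks.vw", "w") as f:
    for label, feats in rows:
        feat_str = " ".join(f"{name}:{value}" for name, value in feats.items())
        f.write(f"{label} | {feat_str}\n")
# one would then train on clicks.vw with the vw command-line tool
```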
For your second question: if you can beat VW, then that would absolutely benefit the community. However, VW is pretty good :)