Python with Numpy/Scipy vs. Pure C++ for Big Data Analysis
Using Python on relatively small projects makes me appreciate the dynamically typed nature of this language (no need for declaration code to keep track of types), which often makes for a quicker and less painful development process. However, I feel that on much larger projects this may actually be a hindrance, as the code would run slower than, say, its equivalent in C++. But then again, using Numpy and/or Scipy with Python may get your code to run just as fast as a native C++ program (where the C++ code would sometimes take longer to develop).
I post this question after reading Justin Peel's comment on the thread "Is Python faster and lighter than C++?" where he states: "Also, people who speak of Python being slow for serious number crunching haven't used the Numpy and Scipy modules. Python is really taking off in scientific computing these days. Of course, the speed comes from using modules written in C or libraries written in Fortran, but that's the beauty of a scripting language in my opinion." Or as S. Lott writes on the same thread regarding Python: "...Since it manages memory for me, I don't have to do any memory management, saving hours of chasing down core leaks." I also inspected a Python/Numpy/C++ related performance question on "Benchmarking (python vs. c++ using BLAS) and (numpy)" where JF Sebastian writes "...There is no difference between C++ and numpy on my machine."
Both of these threads got me wondering whether there is any real advantage conferred by knowing C++ for a Python programmer who uses Numpy/Scipy to produce software for analyzing 'big data', where performance is obviously of great importance (but code readability and development speed are also a must)?
Note: I'm especially interested in handling huge text files. Text files on the order of 100K-800K lines with multiple columns, where Python could take a good five minutes to analyze a file "only" 200K lines long.
First off, if the bulk of your "work" comes from processing huge text files, that often means that your only meaningful bottleneck is disk I/O, regardless of programming language.
As to the core question, it's probably too opinion-rich to "answer", but I can at least give you my own experience. I've been writing Python to do big data processing (weather and environmental data) for years. I have never once encountered significant performance problems due to the language.
Something that developers (myself included) tend to forget is that once the process runs fast enough, it's a waste of company resources to spend time making it run any faster. Python (using mature tools like pandas/scipy) runs fast enough to meet the requirements, and it's fast to develop, so for my money, it's a perfectly acceptable language for "big data" processing.
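As an illustration of that kind of workflow, here is a minimal sketch (not from the original answer; the file name, delimiter, and column names are hypothetical) of how pandas can stream a multi-hundred-thousand-line text file in chunks rather than loading it all at once:

import pandas as pd

# Hypothetical tab-delimited file with 'station' and 'temperature' columns.
totals = {}
counts = {}

# Stream the file in 100k-row chunks so memory use stays bounded.
for chunk in pd.read_csv("observations.txt", sep="\t", chunksize=100_000):
    grouped = chunk.groupby("station")["temperature"].agg(["sum", "count"])
    for station, row in grouped.iterrows():
        totals[station] = totals.get(station, 0.0) + row["sum"]
        counts[station] = counts.get(station, 0) + row["count"]

# Per-station mean temperature, accumulated across all chunks.
means = {s: totals[s] / counts[s] for s in totals}

The chunked read is only one of several reasonable designs; for files in the 100K-800K line range a single read_csv call is usually fine as well.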
The short answer is that for simple problems there should not be much difference. If you want to do anything complicated, you will quickly run into stark performance differences.
As a simple example, try adding three vectors together:
a = b + c + d
In Python, as I understand it, this generally adds b to c, adds the result to d, and then points a at that final result. Each of those operations can be fast since they are just farmed out to a BLAS library. However, if the vectors are large, then the intermediate result cannot be stored in cache. Moving that intermediate result to main memory is slow.
You can do the same thing in C++ using valarray and it will be equivalently slow. However, you can also do something else:
for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i] + d[i];
This gets rid of the intermediate result and makes the code less sensitive to the speed of main memory.
Doing the equivalent thing in Python is possible, but Python's looping constructs are not as efficient. They do nice things like bounds checks, but sometimes it is faster to run with the safeties disengaged. Java, for example, does a fair amount of work to remove bounds checks, so if you had a sufficiently smart compiler/JIT, Python's loops could be fast. In practice, that has not worked out.
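One practical middle ground (not part of the original answer, and a sketch rather than a benchmark) is to stay inside NumPy but reuse an explicit output buffer via the out= argument of the ufuncs, which removes the extra full-length temporary without falling back to a Python-level loop:

import numpy as np

N = 10_000_000
b, c, d = (np.random.rand(N) for _ in range(3))

# Reuse one preallocated buffer for both additions; no extra
# full-length temporary is created, similar in spirit to the C++ loop above.
a = np.empty_like(b)
np.add(b, c, out=a)
np.add(a, d, out=a)

Whether this matters in practice depends on array sizes and cache behavior, so it is worth measuring before restructuring code around it.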
Python will definitely save you development time, and it gives you flexibility as well, if you are just comparing the two languages here. It still can't match the raw power and performance of C/C++, but who cares in this age of abundant memory, clusters, caching and parallel processing techniques? Another disadvantage of C++ is the possibility of crashes, and with big data, debugging and fixing those can be a nightmare.
Having said that, I have not seen a one-size-fits-all solution; no programming language contains solutions to every problem (unless you are an old native C developer who likes to build the database in C as well :). You first have to identify all the problems and requirements: the type of data, whether it is structured or unstructured, what sort of text files you need to manipulate, in what way and in what order, whether scheduling is an issue, and so on. Then you need to build a complete stack of applications with some toolsets and scripting languages. You can always put more money into hardware, or even buy an expensive tool like Ab Initio, which gives you the power to load and parse those large text files and manipulate the data. But unless you need real high-end pattern-matching capabilities on really big data files, Python would be just fine in conjunction with other tools. I don't see a single yes/no answer; in certain situations, Python may not be the best solution.