Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang and Haskell. In order to get some higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.

The result is the following:

C:

lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320

real    0m11.074s
user    0m11.070s
sys 0m0.000s

Python:

lorenzo@enzo:~/erlang$ time ./euler12.py 
842161320

real    1m16.632s
user    1m16.370s
sys 0m0.250s

Python with PyPy:

lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 
842161320

real    0m13.082s
user    0m13.050s
sys 0m0.020s

Erlang:

lorenzo@enzo:~/erlang$ erlc euler12.erl 
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys 0m0.020s

Haskell:

lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx 
842161320

real    2m37.326s
user    2m37.240s
sys 0m0.080s

Summary:

  • C: 100%
  • Python: 692% (118% with PyPy)
  • Erlang: 436% (135% thanks to RichardC)
  • Haskell: 1421%
  • I suppose that C has a big advantage as it uses long for the calculations and not arbitrary length integers as the other three. Also it doesn't need to load a runtime first (Do the others?).

    Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT ?

    Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

    Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

    EDIT:

    Question 4: Do my functional implementations permit LCO (last call optimization, aka tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?

    I really tried to implement the same algorithm as similar as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.


    Source codes used:

    #include <stdio.h>
    #include <math.h>
    
    int factorCount (long n)
    {
        double square = sqrt (n);
        int isquare = (int) square;
        int count = isquare == square ? -1 : 0;
        long candidate;
        for (candidate = 1; candidate <= isquare; candidate ++)
            if (0 == n % candidate) count += 2;
        return count;
    }
    
    int main ()
    {
        long triangle = 1;
        int index = 1;
        while (factorCount (triangle) < 1001)
        {
            index ++;
            triangle += index;
        }
        printf ("%ldn", triangle);
    }
    

    #! /usr/bin/env python3.2
    
    import math
    
    def factorCount (n):
        square = math.sqrt (n)
        isquare = int (square)
        count = -1 if isquare == square else 0
        for candidate in range (1, isquare + 1):
            if not n % candidate: count += 2
        return count
    
    triangle = 1
    index = 1
    while factorCount (triangle) < 1001:
        index += 1
        triangle += index
    
    print (triangle)
    

    -module (euler12).
    -compile (export_all).
    
    factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).
    
    factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;
    
    factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
    
    factorCount (Number, Sqrt, Candidate, Count) ->
        case Number rem Candidate of
            0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
            _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
        end.
    
    nextTriangle (Index, Triangle) ->
        Count = factorCount (Triangle),
        if
            Count > 1000 -> Triangle;
            true -> nextTriangle (Index + 1, Triangle + Index + 1)  
        end.
    
    solve () ->
        io:format ("~p~n", [nextTriangle (1, 1) ] ),
        halt (0).
    

    factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
        where square = sqrt $ fromIntegral number
              isquare = floor square
    
    factorCount' number sqrt candidate count
        | fromIntegral candidate > sqrt = count
        | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
        | otherwise = factorCount' number sqrt (candidate + 1) count
    
    nextTriangle index triangle
        | factorCount triangle > 1000 = triangle
        | otherwise = nextTriangle (index + 1) (triangle + index + 1)
    
    main = print $ nextTriangle 1 1
    

    Using GHC 7.0.3 , gcc 4.4.6 , Linux 2.6.29 on an x86_64 Core2 Duo (2.5GHz) machine, compiling using ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C.

  • Your C routine runs in 8.4 seconds (faster than your run probably because of -O3 )
  • The Haskell solution runs in 36 seconds (due to the -O2 flag)
  • Your factorCount' code isn't explicitly typed and defaulting to Integer (thanks to Daniel for correcting my misdiagnosis here!). Giving an explicit type signature (which is standard practice anyway) using Int and the time changes to 11.1 seconds
  • in factorCount' you have needlessly called fromIntegral . A fix results in no change though (the compiler is smart, lucky for you).
  • You used mod where rem is faster and sufficient. This changes the time to 8.5 seconds .
  • factorCount' is constantly applying two extra arguments that never change ( number , sqrt ). A worker/wrapper transformation gives us:
  •  $ time ./so
     842161320  
    
     real    0m7.954s  
     user    0m7.944s  
     sys     0m0.004s  
    

    That's right, 7.95 seconds . Consistently half a second faster than the C solution . Without the -fllvm flag I'm still getting 8.182 seconds , so the NCG backend is doing well in this case too.

    Conclusion: Haskell is awesome.

    Resulting Code

    factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
        where square = sqrt $ fromIntegral number
              isquare = floor square
    
    factorCount' :: Int -> Int -> Int -> Int -> Int
    factorCount' number sqrt candidate0 count0 = go candidate0 count0
      where
      go candidate count
        | candidate > sqrt = count
        | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
        | otherwise = go (candidate + 1) count
    
    nextTriangle index triangle
        | factorCount triangle > 1000 = triangle
        | otherwise = nextTriangle (index + 1) (triangle + index + 1)
    
    main = print $ nextTriangle 1 1
    

    EDIT: So now that we've explored that, lets address the questions

    Question 1: Do erlang, python and haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

    In Haskell, using Integer is slower than Int but how much slower depends on the computations performed. Luckily (for 64 bit machines) Int is sufficient. For portability sake you should probably rewrite my code to use Int64 or Word64 (C isn't the only language with a long ).

    Question 2: Why is haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as haskell is a book with seven seals to me.)

    Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

    That was what I answered above. The answer was

  • 0) Use optimization via -O2
  • 1) Use fast (notably: unbox-able) types when possible
  • 2) rem not mod (a frequently forgotten optimization) and
  • 3) worker/wrapper transformation (perhaps the most common optimization).
  • Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

    Yes, that wasn't the issue. Good work and glad you considered this.


    There are some problems with the Erlang implementation. As baseline for the following, my measured execution time for your unmodified Erlang program was 47.6 seconds, compared to 12.7 seconds for the C code.

    The first thing you should do if you want to run computationally intensive Erlang code is to use native code. Compiling with erlc +native euler12 got the time down to 41.3 seconds. This is however a much lower speedup (just 15%) than expected from native compilation on this kind of code, and the problem is your use of -compile(export_all) . This is useful for experimentation, but the fact that all functions are potentially reachable from the outside causes the native compiler to be very conservative. (The normal BEAM emulator is not that much affected.) Replacing this declaration with -export([solve/0]). gives a much better speedup: 31.5 seconds (almost 35% from the baseline).

    But the code itself has a problem: for each iteration in the factorCount loop, you perform this test:

    factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;
    

    The C code doesn't do this. In general, it can be tricky to make a fair comparison between different implementations of the same code, and in particular if the algorithm is numerical, because you need to be sure that they are actually doing the same thing. A slight rounding error in one implementation due to some typecast somewhere may cause it to do many more iterations than the other even though both eventually reach the same result.

    To eliminate this possible error source (and get rid of the extra test in each iteration), I rewrote the factorCount function as follows, closely modelled on the C code:

    factorCount (N) ->
        Sqrt = math:sqrt (N),
        ISqrt = trunc(Sqrt),
        if ISqrt == Sqrt -> factorCount (N, ISqrt, 1, -1);
           true          -> factorCount (N, ISqrt, 1, 0)
        end.
    
    factorCount (_N, ISqrt, Candidate, Count) when Candidate > ISqrt -> Count;
    factorCount ( N, ISqrt, Candidate, Count) ->
        case N rem Candidate of
            0 -> factorCount (N, ISqrt, Candidate + 1, Count + 2);
            _ -> factorCount (N, ISqrt, Candidate + 1, Count)
        end.
    

    This rewrite, no export_all , and native compilation, gave me the following run time:

    $ erlc +native euler12.erl
    $ time erl -noshell -s euler12 solve
    842161320
    
    real    0m19.468s
    user    0m19.450s
    sys 0m0.010s
    

    which is not too bad compared to the C code:

    $ time ./a.out 
    842161320
    
    real    0m12.755s
    user    0m12.730s
    sys 0m0.020s
    

    considering that Erlang is not at all geared towards writing numerical code, being only 50% slower than C on a program like this is pretty good.

    Finally, regarding your questions:

    Question 1: Do erlang, python and haskell loose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

    Yes, somewhat. In Erlang, there is no way of saying "use 32/64-bit arithmetic with wrap-around", so unless the compiler can prove some bounds on your integers (and it usually can't), it must check all computations to see if they can fit in a single tagged word or if it has to turn them into heap-allocated bignums. Even if no bignums are ever used in practice at runtime, these checks will have to be performed. On the other hand, that means you know that the algorithm will never fail because of an unexpected integer wraparound if you suddenly give it larger inputs than before.

    Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

    Yes, your Erlang code is correct with respect to last call optimization.


    In regards to Python optimization, in addition to using PyPy (for pretty impressive speed-ups with zero change to your code), you could use PyPy's translation toolchain to compile an RPython-compliant version, or Cython to build an extension module, both of which are faster than the C version in my testing, with the Cython module nearly twice as fast . For reference I include C and PyPy benchmark results as well:

    C (compiled with gcc -O3 -lm )

    % time ./euler12-c 
    842161320
    
    ./euler12-c  11.95s 
     user 0.00s 
     system 99% 
     cpu 11.959 total
    

    PyPy 1.5

    % time pypy euler12.py
    842161320
    pypy euler12.py  
    16.44s user 
    0.01s system 
    99% cpu 16.449 total
    

    RPython (using latest PyPy revision, c2f583445aee )

    % time ./euler12-rpython-c
    842161320
    ./euler12-rpy-c  
    10.54s user 0.00s 
    system 99% 
    cpu 10.540 total
    

    Cython 0.15

    % time python euler12-cython.py
    842161320
    python euler12-cython.py  
    6.27s user 0.00s 
    system 99% 
    cpu 6.274 total
    

    The RPython version has a couple of key changes. To translate into a standalone program you need to define your target , which in this case is the main function. It's expected to accept sys.argv as it's only argument, and is required to return an int. You can translate it by using translate.py, % translate.py euler12-rpython.py which translates to C and compiles it for you.

    # euler12-rpython.py
    
    import math, sys
    
    def factorCount(n):
        square = math.sqrt(n)
        isquare = int(square)
        count = -1 if isquare == square else 0
        for candidate in xrange(1, isquare + 1):
            if not n % candidate: count += 2
        return count
    
    def main(argv):
        triangle = 1
        index = 1
        while factorCount(triangle) < 1001:
            index += 1
            triangle += index
        print triangle
        return 0
    
    if __name__ == '__main__':
        main(sys.argv)
    
    def target(*args):
        return main, None
    

    The Cython version was rewritten as an extension module _euler12.pyx , which I import and call from a normal python file. The _euler12.pyx is essentially the same as your version, with some additional static type declarations. The setup.py has the normal boilerplate to build the extension, using python setup.py build_ext --inplace .

    # _euler12.pyx
    from libc.math cimport sqrt
    
    cdef int factorCount(int n):
        cdef int candidate, isquare, count
        cdef double square
        square = sqrt(n)
        isquare = int(square)
        count = -1 if isquare == square else 0
        for candidate in range(1, isquare + 1):
            if not n % candidate: count += 2
        return count
    
    cpdef main():
        cdef int triangle = 1, index = 1
        while factorCount(triangle) < 1001:
            index += 1
            triangle += index
        print triangle
    
    # euler12-cython.py
    import _euler12
    _euler12.main()
    
    # setup.py
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Distutils import build_ext
    
    ext_modules = [Extension("_euler12", ["_euler12.pyx"])]
    
    setup(
      name = 'Euler12-Cython',
      cmdclass = {'build_ext': build_ext},
      ext_modules = ext_modules
    )
    

    I honestly have very little experience with either RPython or Cython, and was pleasantly surprised at the results. If you are using CPython, writing your CPU-intensive bits of code in a Cython extension module seems like a really easy way to optimize your program.

    链接地址: http://www.djcxy.com/p/14118.html

    上一篇: 如何在Python中导致堆栈溢出和堆溢出

    下一篇: 与Project Euler进行速度比较:C vs Python vs Erlang vs Haskell