Performance problem with Euler problem and recursion on Int64 types

2018-06-08 00:14:12

I'm currently learning Haskell using the project Euler problems as my playground. I was astound by how slow my Haskell programs turned out to be compared to similar programs written in other languages. I'm wondering if I've forseen something, or if this is the kind of performance penalties one has to expect when using Haskell.

The following program in inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail recursive implementation and I make sure that the updates of the accumulation variable keeping track of the arc length is strict. Yet it takes almost one and a half minute to complete (compiled with the -O flag with ghc).

import Data.Int

arcLength :: Int64->Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' x y norm2 acc
        | x > y = acc
        | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
        | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)

Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.

public class ArcLength {
public static void main(String args[]) {
    long n = 1 << 30;
    long x = 0;
    long y = n-1;
    long acc = 0;
    long norm2 = 0;
    long time = System.currentTimeMillis();

    while(x <= y) {
        if (norm2 < 0) {
            norm2 += 2*x + 1;
            x++;
        } else if (norm2 > 2*(n-1)) {
            norm2 += 2 - 2*(x+y);
            x--;
            y--;
        } else {
            norm2 += 2*x + 1;
            x++;
            acc++;
        }
    }

    time = System.currentTimeMillis() - time;
    System.err.println(acc);
    System.err.println(time);
}

}

EDIT: After the discussions in the comments I made som modifications in the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different version: With Int64 or Int and with bangs before either norm2 or both and norm2 and acc in the declaration arcLength' xy !norm2 !acc . All are compiled with

ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs

Here are the results:

(Int !norm2 !acc)
total time  =        3.00 secs   (150 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 !acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int64 norm2 acc)
arctest.exe: out of memory

(Int64 norm2 !acc)
total time  =       48.46 secs   (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes  (excludes profiling overheads)

(Int64 !norm2 !acc)
total time  =       31.46 secs   (1573 ticks @ 20 ms)
total alloc =       3,032 bytes  (excludes profiling overheads)

I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.

Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program ( -ddump-simpl ):

Main.$warcLength' =
   (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
    (ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
    case {__pkg_ccall ghc-prim hs_gtInt64 [...]
           ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]

So GHC has realized that it can unpack your integers, which is good. But this hs_getInt64 call looks suspiciously like a C call. Looking at the assembler output ( -ddump-asm ), we see stuff like:

pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp

So this looks very much like every operation on the Int64 get turned into a full-blown C call in the backend. Which is slow, obviously.

The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.

Hmm, this is interesting. So I just compiled both of your programs, and tried them out:

% java -version                                                                                          
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java                                                                                   
% java ArcLength                                                                                         
843298604
6630

So about 6.6 seconds for the Java solution . Next is ghc with some optimization:

% ghc --version                                                                                          
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc                                                                                             
843298604
./arc  12.68s user 0.04s system 99% cpu 12.718 total

Just under 13 seconds for ghc -O

Trying with some further optimization:

% ghc --make -O3
% time ./arc                                                                                             [13:16]
843298604
./arc  5.75s user 0.00s system 99% cpu 5.754 total

With further optimization flags, the haskell solution took under 6 seconds

It would be interesting to know what version compiler you are using.

There's a couple of interesting things in your question.

You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).

Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.

Make sure it is the same as the Java

One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.

import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
    where
        go :: Int -> Int -> Int -> Int -> Int
        go x y acc norm2
            | x <= y        = case () of { _
                | norm2 < 0         -> go (x+1) y     acc     (norm2 + 2*x + 1)
                | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc     (norm2 + 2 - 2 * (x+y))
                | otherwise         -> go (x+1) y     (acc+1) (norm2 + 2*x + 1)
            }
            | otherwise     = acc

main = print $ loop (1 `shiftL` 30)

Peek at the core

We'll take a quick peek at the Core, using ghc-core , and it shows a very nice loop of unboxed type:

main_$s$wgo
  :: Int#
     -> Int#
     -> Int#
     -> Int#
     -> Int#

main_$s$wgo =
   (sc_sQa :: Int#)
    (sc1_sQb :: Int#)
    (sc2_sQc :: Int#)
    (sc3_sQd :: Int#) ->
    case <=# sc3_sQd sc2_sQc of _ {
      False -> sc1_sQb;
      True ->
        case <# sc_sQa 0 of _ {
          False ->
            case ># sc_sQa 2147483646 of _ {
              False ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  (+# sc1_sQb 1)
                  sc2_sQc
                      (+# sc3_sQd 1);
              True ->
                main_$s$wgo
                  (-#
                     (+# sc_sQa 2)
                     (*# 2 (+# sc3_sQd sc2_sQc)))
                  sc1_sQb
                  (-# sc2_sQc 1)
                  (-# sc3_sQd 1)
            };
          True ->
            main_$s$wgo
              (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
              sc1_sQb
              sc2_sQc
              (+# sc3_sQd 1)

that is, all unboxed into registers. That loop looks great!

And performs just fine (Linux/x86-64/GHC 7.03):

./A  5.95s user 0.01s system 99% cpu 5.980 total

Checking the asm

We get reasonable assembly too, as a nice loop:

Main_mainzuzdszdwgo_info:
        cmpq    %rdi, %r8
        jg      .L8
.L3:
        testq   %r14, %r14
        movq    %r14, %rdx
        js      .L4
        cmpq    $2147483646, %r14
        jle     .L9
.L5:
        leaq    (%rdi,%r8), %r10
        addq    $2, %rdx
        leaq    -1(%rdi), %rdi
        addq    %r10, %r10
        movq    %rdx, %r14
        leaq    -1(%r8), %r8
        subq    %r10, %r14
        jmp     Main_mainzuzdszdwgo_info
.L9:
        leaq    1(%r14,%r8,2), %r14
        addq    $1, %rsi
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info
.L8:
        movq    %rsi, %rbx
        jmp     *0(%rbp)
.L4:
        leaq    1(%r14,%r8,2), %r14
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info

Using the -fvia-C backend.

So this looks fine!

My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64 , file a bug report to GHC trac.

Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer , which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.

Lesson: use hardware 64 bits if at all possible.

链接地址: http://www.djcxy.com/p/24348.html

上一篇: 在C ++ 64位ntohl（）？

下一篇: Euler问题和Int64类型的递归性能问题