Performance problem with Euler problem and recursion on Int64 types
I'm currently learning Haskell using the project Euler problems as my playground. I was astound by how slow my Haskell programs turned out to be compared to similar programs written in other languages. I'm wondering if I've forseen something, or if this is the kind of performance penalties one has to expect when using Haskell.
The following program in inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail recursive implementation and I make sure that the updates of the accumulation variable keeping track of the arc length is strict. Yet it takes almost one and a half minute to complete (compiled with the -O flag with ghc).
import Data.Int
arcLength :: Int64->Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
arcLength' x y norm2 acc
| x > y = acc
| norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
| norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
| otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)
main = print $ arcLength (2^30)
Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.
public class ArcLength {
public static void main(String args[]) {
long n = 1 << 30;
long x = 0;
long y = n-1;
long acc = 0;
long norm2 = 0;
long time = System.currentTimeMillis();
while(x <= y) {
if (norm2 < 0) {
norm2 += 2*x + 1;
x++;
} else if (norm2 > 2*(n-1)) {
norm2 += 2 - 2*(x+y);
x--;
y--;
} else {
norm2 += 2*x + 1;
x++;
acc++;
}
}
time = System.currentTimeMillis() - time;
System.err.println(acc);
System.err.println(time);
}
}
EDIT: After the discussions in the comments I made som modifications in the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different version: With Int64 or Int and with bangs before either norm2 or both and norm2 and acc in the declaration arcLength' xy !norm2 !acc
. All are compiled with
ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs
Here are the results:
(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int64 norm2 acc)
arctest.exe: out of memory
(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)
(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.
Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program ( -ddump-simpl
):
Main.$warcLength' =
(ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
(ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
case {__pkg_ccall ghc-prim hs_gtInt64 [...]
ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]
So GHC has realized that it can unpack your integers, which is good. But this hs_getInt64
call looks suspiciously like a C call. Looking at the assembler output ( -ddump-asm
), we see stuff like:
pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp
So this looks very much like every operation on the Int64
get turned into a full-blown C call in the backend. Which is slow, obviously.
The source code of GHC.IntWord64
seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.
Hmm, this is interesting. So I just compiled both of your programs, and tried them out:
% java -version java version "1.6.0_18" OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1) OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode) % javac ArcLength.java % java ArcLength 843298604 6630
So about 6.6 seconds for the Java solution . Next is ghc with some optimization:
% ghc --version The Glorious Glasgow Haskell Compilation System, version 6.12.1 % ghc --make -O arc.hs % time ./arc 843298604 ./arc 12.68s user 0.04s system 99% cpu 12.718 total
Just under 13 seconds for ghc -O
Trying with some further optimization:
% ghc --make -O3 % time ./arc [13:16] 843298604 ./arc 5.75s user 0.00s system 99% cpu 5.754 total
With further optimization flags, the haskell solution took under 6 seconds
It would be interesting to know what version compiler you are using.
There's a couple of interesting things in your question.
You should be using -O2
primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O
version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int
loop :: Int -> Int
loop n = go 0 (n-1) 0 0
where
go :: Int -> Int -> Int -> Int -> Int
go x y acc norm2
| x <= y = case () of { _
| norm2 < 0 -> go (x+1) y acc (norm2 + 2*x + 1)
| norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
| otherwise -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
}
| otherwise = acc
main = print $ loop (1 `shiftL` 30)
Peek at the core
We'll take a quick peek at the Core, using ghc-core
, and it shows a very nice loop of unboxed type:
main_$s$wgo
:: Int#
-> Int#
-> Int#
-> Int#
-> Int#
main_$s$wgo =
(sc_sQa :: Int#)
(sc1_sQb :: Int#)
(sc2_sQc :: Int#)
(sc3_sQd :: Int#) ->
case <=# sc3_sQd sc2_sQc of _ {
False -> sc1_sQb;
True ->
case <# sc_sQa 0 of _ {
False ->
case ># sc_sQa 2147483646 of _ {
False ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc1_sQb 1)
sc2_sQc
(+# sc3_sQd 1);
True ->
main_$s$wgo
(-#
(+# sc_sQa 2)
(*# 2 (+# sc3_sQd sc2_sQc)))
sc1_sQb
(-# sc2_sQc 1)
(-# sc3_sQd 1)
};
True ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
sc1_sQb
sc2_sQc
(+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And performs just fine (Linux/x86-64/GHC 7.03):
./A 5.95s user 0.01s system 99% cpu 5.980 total
Checking the asm
We get reasonable assembly too, as a nice loop:
Main_mainzuzdszdwgo_info:
cmpq %rdi, %r8
jg .L8
.L3:
testq %r14, %r14
movq %r14, %rdx
js .L4
cmpq $2147483646, %r14
jle .L9
.L5:
leaq (%rdi,%r8), %r10
addq $2, %rdx
leaq -1(%rdi), %rdi
addq %r10, %r10
movq %rdx, %r14
leaq -1(%r8), %r8
subq %r10, %r14
jmp Main_mainzuzdszdwgo_info
.L9:
leaq 1(%r14,%r8,2), %r14
addq $1, %rsi
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
.L8:
movq %rsi, %rbx
jmp *0(%rbp)
.L4:
leaq 1(%r14,%r8,2), %r14
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
Using the -fvia-C
backend.
So this looks fine!
My suspicion, as mentioned in the comment above, is something to do with the version of libgmp
you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64
, file a bug report to GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64
with Integer
, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.
链接地址: http://www.djcxy.com/p/24348.html上一篇: 在C ++ 64位ntohl()?