Euler问题和Int64类型的递归性能问题

2018-06-08 00:14:11

我目前正在使用欧拉问题项目作为我的操场来学习Haskell。我惊讶于我的Haskell程序与其他语言编写的类似程序相比的速度有多慢。我想知道我是否有什么担心，或者如果这是使用Haskell时期望的性能惩罚。

下面的程序受问题331的启发，但我在发布之前更改了它，所以我不会为其他人破坏任何内容。它计算在2 ^ 30 x 2 ^ 30网格上绘制的离散圆弧的弧长。这是一个简单的尾部递归实现，我确保累积变量的更新跟踪弧长是严格的。然而完成需要花费将近一分半钟的时间（用ghc编译-O标志）。

import Data.Int

arcLength :: Int64->Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' x y norm2 acc
        | x > y = acc
        | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
        | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)

这是Java中的相应实现。大约需要4.5秒才能完成。

public class ArcLength {
public static void main(String args[]) {
    long n = 1 << 30;
    long x = 0;
    long y = n-1;
    long acc = 0;
    long norm2 = 0;
    long time = System.currentTimeMillis();

    while(x <= y) {
        if (norm2 < 0) {
            norm2 += 2*x + 1;
            x++;
        } else if (norm2 > 2*(n-1)) {
            norm2 += 2 - 2*(x+y);
            x--;
            y--;
        } else {
            norm2 += 2*x + 1;
            x++;
            acc++;
        }
    }

    time = System.currentTimeMillis() - time;
    System.err.println(acc);
    System.err.println(time);
}

}

编辑：在评论的讨论后，我在Haskell代码做了som修改，并做了一些性能测试。首先，我将n改为2 ^ 29以避免溢出。然后，我尝试了6种不同的版本：使用Int64或Int，并在norm2或两者之前使用bangs，并在声明arcLength' xy !norm2 !acc norm2和arcLength' xy !norm2 !acc 。所有的都与编译

ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs

结果如下：

(Int !norm2 !acc)
total time  =        3.00 secs   (150 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 !acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int64 norm2 acc)
arctest.exe: out of memory

(Int64 norm2 !acc)
total time  =       48.46 secs   (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes  (excludes profiling overheads)

(Int64 !norm2 !acc)
total time  =       31.46 secs   (1573 ticks @ 20 ms)
total alloc =       3,032 bytes  (excludes profiling overheads)

我在64位Windows 7（Haskell平台二进制发行版）下使用GHC 7.0.2。根据评论，在其他配置下编译时不会发生该问题。这使我认为Int64类型在Windows版本中被破坏了。

嗯，我用7.0.3安装了一个新的Haskell平台，并为你的程序大致获得了以下核心（ -ddump-simpl ）：

Main.$warcLength' =
   (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
    (ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
    case {__pkg_ccall ghc-prim hs_gtInt64 [...]
           ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]

所以GHC已经意识到它可以解开你的整数，这很好。但是这个hs_getInt64调用看起来像C调用一样可疑。看着汇编程序输出（ -ddump-asm ），我们看到如下的东西：

pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp

所以这看起来非常像Int64上的每个操作都变成了后端的一个完整的C调用。很明显，这很慢。

GHC.IntWord64的源代码似乎验证：在32位版本（如目前平台附带的版本）中，您只能通过FFI接口进行仿真。

嗯，这很有趣。所以我只编译了你的两个程序，并试用了它们：

% java -version                                                                                          
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java                                                                                   
% java ArcLength                                                                                         
843298604
6630

所以Java解决方案大约需要6.6秒 。接下来是ghc进行一些优化：

% ghc --version                                                                                          
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc                                                                                             
843298604
./arc  12.68s user 0.04s system 99% cpu 12.718 total

ghc -O只有13秒

尝试进一步优化：

% ghc --make -O3
% time ./arc                                                                                             [13:16]
843298604
./arc  5.75s user 0.00s system 99% cpu 5.754 total

通过进一步优化标志，haskell解决方案不到6秒

知道你使用的是什么版本的编译器会很有趣。

在你的问题中有一些有趣的事情。

你应该主要使用-O2 。它会做得更好（在这种情况下，确定和消除在-O版中仍然存在的懒惰）。

其次，你的Haskell和Java不太一样（它有不同的测试和分支）。和其他人一样，在我的Linux机器上运行你的代码会产生大约6秒的运行时间。看起来很好。

确保它与Java相同

一个想法：让我们做一个Java的文字转录，使用相同的控制流程，操作和类型。

import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
    where
        go :: Int -> Int -> Int -> Int -> Int
        go x y acc norm2
            | x <= y        = case () of { _
                | norm2 < 0         -> go (x+1) y     acc     (norm2 + 2*x + 1)
                | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc     (norm2 + 2 - 2 * (x+y))
                | otherwise         -> go (x+1) y     (acc+1) (norm2 + 2*x + 1)
            }
            | otherwise     = acc

main = print $ loop (1 `shiftL` 30)

偷看核心

我们将通过使用ghc-core快速浏览Core，并展示非常好的unboxed类型的循环：

main_$s$wgo
  :: Int#
     -> Int#
     -> Int#
     -> Int#
     -> Int#

main_$s$wgo =
   (sc_sQa :: Int#)
    (sc1_sQb :: Int#)
    (sc2_sQc :: Int#)
    (sc3_sQd :: Int#) ->
    case <=# sc3_sQd sc2_sQc of _ {
      False -> sc1_sQb;
      True ->
        case <# sc_sQa 0 of _ {
          False ->
            case ># sc_sQa 2147483646 of _ {
              False ->
                main_$s$wgo
                  (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
                  (+# sc1_sQb 1)
                  sc2_sQc
                      (+# sc3_sQd 1);
              True ->
                main_$s$wgo
                  (-#
                     (+# sc_sQa 2)
                     (*# 2 (+# sc3_sQd sc2_sQc)))
                  sc1_sQb
                  (-# sc2_sQc 1)
                  (-# sc3_sQd 1)
            };
          True ->
            main_$s$wgo
              (+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
              sc1_sQb
              sc2_sQc
              (+# sc3_sQd 1)

也就是说，所有的东西都会被解除封闭。 该循环看起来很棒！

并且执行得很好（Linux / x86-64 / GHC 7.03）：

./A  5.95s user 0.01s system 99% cpu 5.980 total

检查模块

我们也得到合理的组装，作为一个不错的循环：

Main_mainzuzdszdwgo_info:
        cmpq    %rdi, %r8
        jg      .L8
.L3:
        testq   %r14, %r14
        movq    %r14, %rdx
        js      .L4
        cmpq    $2147483646, %r14
        jle     .L9
.L5:
        leaq    (%rdi,%r8), %r10
        addq    $2, %rdx
        leaq    -1(%rdi), %rdi
        addq    %r10, %r10
        movq    %rdx, %r14
        leaq    -1(%r8), %r8
        subq    %r10, %r14
        jmp     Main_mainzuzdszdwgo_info
.L9:
        leaq    1(%r14,%r8,2), %r14
        addq    $1, %rsi
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info
.L8:
        movq    %rsi, %rbx
        jmp     *0(%rbp)
.L4:
        leaq    1(%r14,%r8,2), %r14
        leaq    1(%r8), %r8
        jmp     Main_mainzuzdszdwgo_info

使用-fvia-C后端。

所以这看起来很好！

正如在上面的评论中提到的，我的怀疑与你在32位Windows上为64位整数生成糟糕代码的libgmp版本有关。首先尝试升级到GHC 7.0.3，然后尝试一些其他代码生成器后端，如果仍然存在Int64问题，请向GHC trac提交错误报告。

我们可以广泛地确认，在64位整数的32位模拟中实现这些C调用的成本，我们可以用Integer代替Int64 ， Integer是在每台机器上用C调用GMP来实现的，实际上，运行时间从3s到好一分钟以上。

课程：尽可能使用硬件64位。

链接地址: http://www.djcxy.com/p/24347.html

上一篇: Performance problem with Euler problem and recursion on Int64 types

下一篇: How to improve the performance of this Haskell program?