Performance Analysis via Core

The following code runs in about 1.5ms on my computer (compiled with GHC 8.0.1 and -02):

import Criterion
import Data.Bits
import Data.Int
import Criterion.Main

main :: IO ()
main = defaultMain [bench "mybench" $ nf (mybench 3840) (0 :: Int)]

mybench :: Int -> Int -> Double
{-# INLINE mybench #-}
mybench n = go 0 n
  where go s i g | i == 0 = s
                 | otherwise =
                      let (w,_) = f 1 0 g
                          --w = f 1 0 g
                          f mag v gen | mag >= 18446744073709551616000 = (v,gen)
                          --f mag v gen | mag >= 18446744073709551616000 = v
                                      | otherwise = v' `seq` f (mag*18446744073709551616 :: Integer) v' gen where
                                            x = -8499970308474009078 :: Int
                                            v' = (v * 18446744073709551616 + (fromIntegral x + 9223372036854775808))

                          y = fromInteger ((-9223372036854775808) + w `mod` 18446744073709551616)
                          coef = (fromIntegral (9007199254740991 .&. (y::Int64)) :: Double) /  9007199254740992
                          z = 2.0 * (-0.5 + coef)
                      in go (z+s) (i-1) g

However, if I use the commented alternates of w and f , the code runs in ~31μs! This was surprising to me since I changed very little, and because f runs twice for each of 3,840 iterations (ie, the code is barely used).

I went to the core to investigate. Here's the relevant portions of -ddump-simpl from the slow version and fast version.

Unfortunately, I can't see from the core what is making such a huge difference. The primary difference that I see is that in the fast version, GHC has realized that f doesn't need the gen argument. But surely that can't make a 45x/2 orders of magnitude performance difference.

The source code is a bit contrived (several args aren't needed or used), so my main question is about the core: I don't see any differences that would indicate such a drastic performance difference. What am I missing when analyzing the core? As a followup, what could I do at the source level of the first/slow version to get it to perform like the second/fast version?


It looks like in the fast version GHC lifted the computation:

 y = fromInteger ((-9223372036854775808) + w `mod` 18446744073709551616)

out of the definition of go . Just look at where modInteger and plusInteger occur in both dumps.

It looks like in the assignment w = f 1 0 g it inlined the definition of f so that it doesn't have to compute w on each call to go . More specifically, f 1 0 g doesn't depend on any the parameters to go - ie. s , i or g , and so it's computation can be lifted out.

Even though g is passed to f in the expression f 1 0 g , it doesn't actually get used.

链接地址: http://www.djcxy.com/p/43178.html

上一篇: 关于GHC实施的好介绍性文字?

下一篇: 通过Core进行性能分析