Performance Analysis via Core
The following code runs in about 1.5ms on my computer (compiled with GHC 8.0.1 and -02):
import Criterion
import Data.Bits
import Data.Int
import Criterion.Main
main :: IO ()
main = defaultMain [bench "mybench" $ nf (mybench 3840) (0 :: Int)]
mybench :: Int -> Int -> Double
{-# INLINE mybench #-}
mybench n = go 0 n
where go s i g | i == 0 = s
| otherwise =
let (w,_) = f 1 0 g
--w = f 1 0 g
f mag v gen | mag >= 18446744073709551616000 = (v,gen)
--f mag v gen | mag >= 18446744073709551616000 = v
| otherwise = v' `seq` f (mag*18446744073709551616 :: Integer) v' gen where
x = -8499970308474009078 :: Int
v' = (v * 18446744073709551616 + (fromIntegral x + 9223372036854775808))
y = fromInteger ((-9223372036854775808) + w `mod` 18446744073709551616)
coef = (fromIntegral (9007199254740991 .&. (y::Int64)) :: Double) / 9007199254740992
z = 2.0 * (-0.5 + coef)
in go (z+s) (i-1) g
However, if I use the commented alternates of w
and f
, the code runs in ~31μs! This was surprising to me since I changed very little, and because f
runs twice for each of 3,840 iterations (ie, the code is barely used).
I went to the core to investigate. Here's the relevant portions of -ddump-simpl
from the slow version and fast version.
Unfortunately, I can't see from the core what is making such a huge difference. The primary difference that I see is that in the fast version, GHC has realized that f
doesn't need the gen
argument. But surely that can't make a 45x/2 orders of magnitude performance difference.
The source code is a bit contrived (several args aren't needed or used), so my main question is about the core: I don't see any differences that would indicate such a drastic performance difference. What am I missing when analyzing the core? As a followup, what could I do at the source level of the first/slow version to get it to perform like the second/fast version?
It looks like in the fast version GHC lifted the computation:
y = fromInteger ((-9223372036854775808) + w `mod` 18446744073709551616)
out of the definition of go
. Just look at where modInteger
and plusInteger
occur in both dumps.
It looks like in the assignment w = f 1 0 g
it inlined the definition of f
so that it doesn't have to compute w
on each call to go
. More specifically, f 1 0 g
doesn't depend on any the parameters to go
- ie. s
, i
or g
, and so it's computation can be lifted out.
Even though g
is passed to f
in the expression f 1 0 g
, it doesn't actually get used.
上一篇: 关于GHC实施的好介绍性文字?
下一篇: 通过Core进行性能分析