Haskell Performance by Example
I have these basic types in my code:
newtype Foo (m :: Factored) = Foo Int64 deriving (NFData)
foo :: forall m . (Fact m) => Foo m -> Foo m
class T t where t :: (Fact m ) => t m -> t m
instance T Foo where t = foo
newtype Bar t (m :: Factored) = Bar (t m) deriving (NFData)
bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v
(ignore Fact
and Factored
for the moment). I'm benchmarking the code at different levels, comparing the performance of foo
, t
, and bar
. In the benchmarks, t = foo
, and bar
just applies t
through a newtype
. Thus their runtime should be essentially identical, but criterion reports that foo
takes 9.2ns, t
takes about twice that at 17.45ns, and bar
takes a whopping 268.1ns.
I've experimented with adding INLINABLE
and even a SPECIALIZE
pragma, but they aren't helping. I want to believe that GHC has some magic syntax/optimization/etc that can be consistently applied to solve these types of performance issues. For example, I've seen cases where writing code in point-free style has dramatic performance implications.
The complete code can be found here. I promise it's not intimidating. The modules are:
Foo
, foo
, and T
Bar
and bar
foo
t
bar
Fact
and Factored
using singletons Most of the modules are tiny; I defined the three benchmarks in separate files so that I could examine their core. I generated the core for the three *Bench
modules and aligned them the best I could. They're only ~250 lines each, and the first ~200 lines are identical, up to renaming. The problem is that I don't know what to make of the last 50 or so lines. The (core) diff for FooBench
vs TBench
is here, the diff for TBench
vs BarBench
is here, and the diff for FooBench
vs BarBench
is here.
Just a few of the questions I have:
At a high level, what is the essential difference between the core files? I'm looking for something like "Here you can see that GHC isn't inlining the call to x
." What should I be looking for?
What can be done to make the three benchmarks all run in about 9.2ns? GHC optimizations? INLINE
/ INLINABLE
pragmas? SPECIALIZE
pragmas that I missed? (You aren't allowed to specialized for F128::Factored
; in my real library this value may be reified at runtime.) Restricting/delaying inlining to a particular phase?
Although I'm looking for an actual solution to make the benchmarks all fast, it's possible that tricks that work for this example won't scale to my real library. As a result, I'm also looking for the "high-level" explanation of why a particular technique should work.
First, looking at bar
:
bar :: (Fact m, T t) => Bar t m -> Bar t m
bar (Bar v) = Bar $ t v
we can write this without needing the argument using coerce
:
bar :: (Fact m, T t) => Bar t m -> Bar t m
bar = (coerce :: (t m -> t m) -> Bar t m -> Bar t m) t
this (as we'd hope) gets bar
performing the same as t
. (In fact the core for TBench
and BarBench
is exactly the same, excluding type signatures).
I'm not entirely sure why, but using INLINE
instead of INLINEABLE
makes t
and bar
perform the same as foo
. I'm not an expert but it's normally better to use INLINE
for small functions that you're sure you want to inline.
That said I think some of these issues are from how criterion does benchmarks to stop ghc
cheating. For instance, writing bench "Bar" $ nf (GHC.Magic.inline bar) x
on your original code has bar
performing the same as foo
. I suspect a "real" program wouldn't be so delicate.
上一篇: 专门化没有内联的相关多态函数
下一篇: Haskell性能示例