Specialization with Constraints
I'm having problems getting GHC to specialize a function with a class constraint. I have a minimal example of my problem here: Foo.hs and Main.hs. The two files compile (GHC 7.6.2, ghc -O3 Main
) and run.
NOTE: Foo.hs
is really stripped down. If you want to see why the constraint is needed, you can see a little more code here. If I put the code in a single file or make many other minor changes, GHC simply inlines the call to plusFastCyc
. This will not happen in the real code because plusFastCyc
is too large for GHC to inline, even when marked INLINE
. The point is to specialize the call to plusFastCyc
, not inline it. plusFastCyc
is called in many places in the real code, so duplicating such a large function would not be desirable even if I could force GHC to do it.
The code of interest is the plusFastCyc
in Foo.hs
, reproduced here:
{-# INLINEABLE plusFastCyc #-}
{-# SPECIALIZE plusFastCyc ::
forall m . (Factored m Int) =>
(FastCyc (VT U.Vector m) Int) ->
(FastCyc (VT U.Vector m) Int) ->
(FastCyc (VT U.Vector m) Int) #-}
-- Although the next specialization makes `fcTest` fast,
-- it isn't useful to me in my real program because the phantom type M is reified
-- {-# SPECIALIZE plusFastCyc ::
-- FastCyc (VT U.Vector M) Int ->
-- FastCyc (VT U.Vector M) Int ->
-- FastCyc (VT U.Vector M) Int #-}
plusFastCyc :: (Num (t r)) => (FastCyc t r) -> (FastCyc t r) -> (FastCyc t r)
plusFastCyc (PowBasis v1) (PowBasis v2) = PowBasis $ v1 + v2
The Main.hs
file has two drivers: vtTest
, which runs in ~3 seconds, and fcTest
, which runs in ~83 seconds when compiled with -O3 using the forall
'd specialization.
The core shows that for the vtTest
test, the addition code is being specialized to Unboxed
vectors over Int
s, etc, while generic vector code is used for fcTest
. On line 10, you can see that GHC does write a specialized version of plusFastCyc
, compared to the generic version on line 167. The rule for the specialization is on line 225. I believe this rule should fire on line 270. ( main6
calls iterate main8 y
, so main8
is where plusFastCyc
should be specialized.)
My goal is to make fcTest
as fast as vtTest
by specializing plusFastCyc
. I've found two ways to do this:
inline
from GHC.Exts
in fcTest
. Factored m Int
constraint on plusFastCyc
. Option 1 is unsatisfactory because in the actual code base plusFastCyc
is a frequently used operation and a very large function, so it should not be inlined at every use. Rather, GHC should call a specialized version of plusFastCyc
. Option 2 is not really an option because I need the constraint in the real code.
I've tried a variety of options using (and not using) INLINE
, INLINABLE
, and SPECIALIZE
, but nothing seems to work. ( EDIT : I may have stripped out too much of plusFastCyc
to make my example small, so INLINE
might cause the function to be inlined. This doesn't happen in my real code because plusFastCyc
is so large.) In this particular example, I'm not getting any match_co: needs more cases
or RULE: LHS too complicated to desugar
(and here) warnings, though I was getting many match_co
warnings before minimizing the example. Presumably, the "problem" is the Factored m Int
constraint in the rule; if I make changes to that constraint, fcTest
runs as fast as vtTest
.
Am I doing something GHC just doesn't like? Why won't GHC specialize the plusFastCyc
, and how can I make it?
UPDATE
The problem persists in GHC 7.8.2, so this question is still relevant.
GHC also gives an option to SPECIALIZE
a type-class instance declaration. I tried this with the (expanded) code of Foo.hs
, by putting the following:
instance (Num r, V.Vector v r, Factored m r) => Num (VT v m r) where
{-# SPECIALIZE instance ( Factored m Int => Num (VT U.Vector m Int)) #-}
VT x + VT y = VT $ V.zipWith (+) x y
This change, though, did not achieve the desired speedup. What did achieve that performance improvement was manually adding a specialized instance for the type VT U.Vector m Int
with the same function definitions, as follows:
instance (Factored m Int) => Num (VT U.Vector m Int) where
VT x + VT y = VT $ V.zipWith (+) x y
This requires adding OverlappingInstances
and FlexibleInstances
in LANGUAGE
.
Interestingly, in the example program, the speedup obtained with the overlapping instance remains even if you remove every SPECIALIZE
and INLINABLE
pragma.
上一篇: 跟踪约束的技巧
下一篇: 具有约束的专业化