Specialization with Constraints

I'm having problems getting GHC to specialize a function with a class constraint. I have a minimal example of my problem here: Foo.hs and Main.hs. The two files compile (GHC 7.6.2, ghc -O3 Main ) and run.

NOTE: Foo.hs is really stripped down. If you want to see why the constraint is needed, you can see a little more code here. If I put the code in a single file or make many other minor changes, GHC simply inlines the call to plusFastCyc . This will not happen in the real code because plusFastCyc is too large for GHC to inline, even when marked INLINE . The point is to specialize the call to plusFastCyc , not inline it. plusFastCyc is called in many places in the real code, so duplicating such a large function would not be desirable even if I could force GHC to do it.

The code of interest is the plusFastCyc in Foo.hs , reproduced here:

{-# INLINEABLE plusFastCyc #-}
{-# SPECIALIZE plusFastCyc :: 
         forall m . (Factored m Int) => 
              (FastCyc (VT U.Vector m) Int) -> 
                   (FastCyc (VT U.Vector m) Int) -> 
                        (FastCyc (VT U.Vector m) Int) #-}

-- Although the next specialization makes `fcTest` fast,
-- it isn't useful to me in my real program because the phantom type M is reified
-- {-# SPECIALIZE plusFastCyc :: 
--          FastCyc (VT U.Vector M) Int -> 
--               FastCyc (VT U.Vector M) Int -> 
--                    FastCyc (VT U.Vector M) Int #-}

plusFastCyc :: (Num (t r)) => (FastCyc t r) -> (FastCyc t r) -> (FastCyc t r)
plusFastCyc (PowBasis v1) (PowBasis v2) = PowBasis $ v1 + v2

The Main.hs file has two drivers: vtTest , which runs in ~3 seconds, and fcTest , which runs in ~83 seconds when compiled with -O3 using the forall 'd specialization.

The core shows that for the vtTest test, the addition code is being specialized to Unboxed vectors over Int s, etc, while generic vector code is used for fcTest . On line 10, you can see that GHC does write a specialized version of plusFastCyc , compared to the generic version on line 167. The rule for the specialization is on line 225. I believe this rule should fire on line 270. ( main6 calls iterate main8 y , so main8 is where plusFastCyc should be specialized.)

My goal is to make fcTest as fast as vtTest by specializing plusFastCyc . I've found two ways to do this:

  • Explicity call inline from GHC.Exts in fcTest .
  • Remove the Factored m Int constraint on plusFastCyc .
  • Option 1 is unsatisfactory because in the actual code base plusFastCyc is a frequently used operation and a very large function, so it should not be inlined at every use. Rather, GHC should call a specialized version of plusFastCyc . Option 2 is not really an option because I need the constraint in the real code.

    I've tried a variety of options using (and not using) INLINE , INLINABLE , and SPECIALIZE , but nothing seems to work. ( EDIT : I may have stripped out too much of plusFastCyc to make my example small, so INLINE might cause the function to be inlined. This doesn't happen in my real code because plusFastCyc is so large.) In this particular example, I'm not getting any match_co: needs more cases or RULE: LHS too complicated to desugar (and here) warnings, though I was getting many match_co warnings before minimizing the example. Presumably, the "problem" is the Factored m Int constraint in the rule; if I make changes to that constraint, fcTest runs as fast as vtTest .

    Am I doing something GHC just doesn't like? Why won't GHC specialize the plusFastCyc , and how can I make it?

    UPDATE

    The problem persists in GHC 7.8.2, so this question is still relevant.


    GHC also gives an option to SPECIALIZE a type-class instance declaration. I tried this with the (expanded) code of Foo.hs , by putting the following:

    instance (Num r, V.Vector v r, Factored m r) => Num (VT v m r) where 
        {-# SPECIALIZE instance ( Factored m Int => Num (VT U.Vector m Int)) #-}
        VT x + VT y = VT $ V.zipWith (+) x y
    

    This change, though, did not achieve the desired speedup. What did achieve that performance improvement was manually adding a specialized instance for the type VT U.Vector m Int with the same function definitions, as follows:

    instance (Factored m Int) => Num (VT U.Vector m Int) where 
        VT x + VT y = VT $ V.zipWith (+) x y
    

    This requires adding OverlappingInstances and FlexibleInstances in LANGUAGE .

    Interestingly, in the example program, the speedup obtained with the overlapping instance remains even if you remove every SPECIALIZE and INLINABLE pragma.

    链接地址: http://www.djcxy.com/p/5676.html

    上一篇: 跟踪约束的技巧

    下一篇: 具有约束的专业化