Repa 3 performance and correct usage of 'now'
There is a basic monad question in here, unrelated to Repa, plus several Repa-specific questions.
I am working on a library using Repa 3. I am having trouble getting efficient parallel code. If I make my functions return delayed arrays, I get excruciatingly slow code that scales very well up to 8 cores. This code takes over 20GB of memory (per the GHC profiler), and runs several orders of magnitude slower than basic Haskell unboxed vectors.
Alternatively, if I make all of my functions return Unboxed manifest arrays (still attempting to use fusion within the functions, for example when I do a 'map'), I get MUCH faster code (still slower than using Haskell unboxed vectors) that doesn't scale at all, and in fact tends to get slightly slower with more cores.
Based on the FFT example code in repa-algorithms, it seems the correct approach is to always return manifest arrays. Is there ever a case where I should be returning delayed arrays?
The FFT code also makes plentiful use of the 'now' function. However, I get a type error when I try to use it in my code:
type Arr t r = Array t DIM1 r

data CycRingRepa m r = CRTBasis (Arr U r)
                     | PowBasis (Arr U r)

fromArray :: forall m r t . (BaseRing m r, Unbox r, Repr t r) => Arr t r -> CycRingRepa m r
fromArray =
    let mval = reflectNum (Proxy::Proxy m)
    in \x ->
        let sh:.n = extent x
        in assert (mval == 2*n) PowBasis $ now $ computeUnboxedP $ bitrev x
The code compiles fine without the 'now'. With the 'now', I get the following error:
Couldn't match type `r' with `Array U (Z :. Int) r'
  `r' is a rigid type variable bound by
      the type signature for
        fromArray :: (BaseRing m r, Unbox r, Repr t r) =>
                     Arr t r -> CycRingRepa m r
      at C:\Users\crockeea\Documents\Code\LatticeLib\CycRingRepa.hs:50:1
Expected type: CycRingRepa m r
  Actual type: CycRingRepa m (Array U DIM1 r)
I don't think this is my problem. It would be helpful if someone could explain how the Monad works in 'now'. By my best estimation, the monad seems to be creating an 'Arr U (Arr U r)'. I'm expecting an 'Arr U r', which would then match the data constructor pattern. What is going on, and how do I fix this?
The type signatures are:
computeUnboxedP :: Fill r1 U sh e => Array r1 sh e -> Array U sh e
now :: (Shape sh, Repr r e, Monad m) => Array r sh e -> m (Array r sh e)
It would be helpful to have a better idea of when it is appropriate to use 'now'.
A couple of other Repa questions: Should I explicitly call computeUnboxedP (as in the FFT example code), or should I use the more general computeP (since the unboxed representation is inferred by my data type)? Should I store delayed or manifest arrays in the data type CycRingRepa? Eventually I would also like this code to work with Haskell Integers. Will that require me to write new code that uses something other than U arrays, or could I write polymorphic code that creates U arrays for Unbox types and some other array representation for Integers/boxed types?
I realize there are a lot of questions in here, and I appreciate any/all answers!
Repa 3.1 no longer requires the explicit use of now. The parallel computation functions are all monadic, and automatically apply deepSeqArray to their results. The repa-examples package also contains a new implementation of matrix multiply that demonstrates their use.
Here's the source code for now:
now arr = do
    arr `deepSeqArray` return ()
    return arr
So it's really just a monadic version of deepSeqArray. You can use either of these to force evaluation, rather than hanging on to a thunk. This "evaluation" is different from the "computation" forced when computeP is called.
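For concreteness, here is a minimal sketch of now used correctly inside a monad, assuming the Repa 3.0 signatures quoted in the question (the function name forceDoubled and the import location of now are my own assumptions):

import Data.Array.Repa as A
import Data.Array.Repa.Eval (now)  -- assumed location of 'now' in Repa 3.0

-- 'now' returns 'm (Array r sh e)', so it must be bound with '<-' inside
-- a do-block. Applying a pure constructor like PowBasis directly to
-- 'now ...' instead forces GHC to treat the array type itself as the
-- monad, which is the mismatch reported in the error above.
forceDoubled :: Monad m => Array U DIM1 Double -> m (Array U DIM1 Double)
forceDoubled x = do
    y <- now $ computeUnboxedP $ A.map (* 2) x
    return y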
In your code, now doesn't apply, since you're not in a monad. But in this context deepSeqArray wouldn't help either. Consider this situation:
x :: Array U DIM1 Double
x = ...

y :: Array U DIM1 Double
y = computeUnboxedP $ map f x
Since y refers to x, we'd like to be sure x is computed before starting to compute y. If not, the available work won't be distributed correctly among the gang of threads. To get this to work out, it's better to write y as
y = deepSeqArray x . computeUnboxedP $ map f x
Now, for a delayed array, we have
deepSeqArray (ADelayed sh f) y = sh `deepSeq` f `seq` y
Rather than computing all the elements, this just makes sure the shape is computed, and reduces f to weak-head normal form.
As for manifest vs. delayed arrays, there are certainly times when delayed arrays are preferable. Consider this matrix multiply:
multiplyMM arr brr
 = [arr, brr] `deepSeqArrays`
   A.sumP (A.zipWith (*) arrRepl brrRepl)
 where trr     = computeUnboxedP $ transpose2D brr
       arrRepl = trr `deepSeqArray` A.extend (Z :. All :. colsB :. All) arr
       brrRepl = trr `deepSeqArray` A.extend (Z :. rowsA :. All :. All) trr
       (Z :. _     :. rowsA) = extent arr
       (Z :. colsB :. _    ) = extent brr
Here "extend" generates a new array by copying the values across some set of new dimensions. In particular, this means that
arrRepl ! (Z :. i :. j :. k) == arrRepl ! (Z :. i :. j' :. k)
Thankfully, extend produces a delayed array, since it would be a waste to go through the trouble of all this copying.
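As a small illustration of this (the names vec and repl and the sizes are made up), extend can replicate a vector across a new outer dimension:

import Data.Array.Repa as A

vec :: Array U DIM1 Int
vec = fromListUnboxed (Z :. (3 :: Int)) [1, 2, 3]

-- 'All' keeps the existing dimension; the new outer dimension has size 4.
-- The result is delayed (representation D): no element is copied until
-- the array is consumed, e.g. by computeUnboxedP.
repl :: Array D DIM2 Int
repl = extend (Z :. (4 :: Int) :. All) vec
-- repl ! (Z :. i :. j) == vec ! (Z :. j) for every row i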
Delayed arrays also allow the possibility of fusion, which is impossible if the array is manifest.
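A sketch of what that fusion buys (function names are mine; Repa 3.0 signatures as quoted in the question):

import Data.Array.Repa as A

-- The two maps fuse into a single pass: A.map returns a delayed array,
-- so only the final computeUnboxedP allocates.
fused :: Array U DIM1 Double -> Array U DIM1 Double
fused xs = computeUnboxedP $ A.map (+ 1) $ A.map (* 2) xs

-- Forcing a manifest array in between materializes (and pays for)
-- the intermediate result.
unfused :: Array U DIM1 Double -> Array U DIM1 Double
unfused xs = computeUnboxedP $ A.map (+ 1) mid
  where mid = computeUnboxedP (A.map (* 2) xs)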
Finally, computeUnboxedP is just computeP with a specialized type. Giving computeUnboxedP explicitly might allow GHC to optimize better, and makes the code a little clearer.
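For example (an illustrative function of my own), these two definitions are equivalent under the signatures quoted in the question; the first names the target representation at the call site, the second lets the result type in the signature pin it down:

import Data.Array.Repa as A

squares :: Array U DIM1 Double -> Array U DIM1 Double
squares xs = computeUnboxedP (A.map (^ (2 :: Int)) xs)

-- Same computation: here the U representation is inferred from the
-- result type in the signature.
squares' :: Array U DIM1 Double -> Array U DIM1 Double
squares' xs = computeP (A.map (^ (2 :: Int)) xs)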