Idiomatic option pricing and risk using Repa parallel arrays

2018-06-21 08:33:33

Suppose I want to price a call option using a finite difference method and repa then the following does the job:

import Data.Array.Repa as Repa

r, sigma, k, t, xMax, deltaX, deltaT :: Double
m, n, p :: Int
r = 0.05
sigma = 0.2
k = 50.0
t = 3.0
m = 3
p = 1
xMax = 150
deltaX = xMax / (fromIntegral m)
n = 800
deltaT = t / (fromIntegral n)

singleUpdater a = traverse a id f
  where
    Z :. m = extent a
    f _get (Z :. ix) | ix == 0   = 0.0
    f _get (Z :. ix) | ix == m-1 = xMax - k
    f  get (Z :. ix)             = a * get (Z :. ix-1) +
                                   b * get (Z :. ix) +
                                   c * get (Z :. ix+1)
      where
        a = deltaT * (sigma^2 * (fromIntegral ix)^2 - r * (fromIntegral ix)) / 2
        b = 1 - deltaT * (r  + sigma^2 * (fromIntegral ix)^2)
        c = deltaT * (sigma^2 * (fromIntegral ix)^2 + r * (fromIntegral ix)) / 2

priceAtT :: Array U DIM1 Double
priceAtT = fromListUnboxed (Z :. m+1) [max 0 (deltaX * (fromIntegral j) - k) | j <- [0..m]]

testSingle :: IO (Array U DIM1 Double)
testSingle = computeP $ singleUpdater priceAtT

But now suppose I want to price options in parallel (say to do a spot ladder) then I can do this:

multiUpdater a = fromFunction (extent a) f
     where
       f :: DIM2 -> Double
       f (Z :. ix :. jx) = (singleUpdater x)!(Z :. ix)
         where
           x :: Array D DIM1 Double
           x = slice a (Any :. jx)

priceAtTMulti :: Array U DIM2 Double
priceAtTMulti = fromListUnboxed (Z :. m+1 :. p+1)
                [max 0 (deltaX * (fromIntegral j) - k) | j <- [0..m], _l <- [0..p]]

testMulti :: IO (Array U DIM2 Double)
testMulti = computeP $ multiUpdater priceAtTMulti

Questions:

Is there a more idiomatic way of doing this in repa?

Will the above method actually price in parallel?

How can I determine whether my code really is generating something that will be executed in parallel?

I tried this for 3 but sadly encountered a bug in ghc:

bash-3.2$ ghc -fext-core --make Test.hs
[1 of 1] Compiling Main             ( Test.hs, Test.o )
ghc: panic! (the 'impossible' happened)
 (GHC version 7.4.1 for x86_64-apple-darwin):
    MkExternalCore died: make_lit

Your bug is unrelated to your code -- it is your use of -fext-core to print the output of compilation in the External Core format. Just don't do that (to see the core, I use ghc-core).

Compile with -O2 and -threaded :

$ ghc -O2 -rtsopts --make A.hs -threaded 
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

Then run with +RTS -N4 , for example, to use 4 threads:

$ time ./A +RTS -N4
[0.0,0.0,8.4375e-3,8.4375e-3,50.009375,50.009375,100.0,100.0]
./A  0.00s user 0.00s system 85% cpu 0.008 total

So it completes too quickly to see a result. I'll increase your m and p parameters to 1k and 3k

$ time ./A +RTS -N2
./A +RTS -N2  3.03s user 1.33s system 159% cpu 2.735 total

So yes, it does run in parallel. 1.6x on a 2 core machine, at a first attempt. Whether or not it is efficient is another question. Use +RTS -s you can see the runtime stats:

TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

So we had 3 threads running in parallel (2 presumably for the algo, one for the IO manager).

You can reduce running time by adjusting the GC settings. Eg by using -A we can reduce GC overhead, and yield genuine parallel speedups.

$ time ./A +RTS -N1 -A100M   
./A +RTS -N1 -A100M  1.99s user 0.29s system 99% cpu 2.287 total

$ time ./A +RTS -N2 -A100M   
./A +RTS -N2 -A100M  2.30s user 0.86s system 147% cpu 2.145 total

You can improve numeric performance sometimes by using the LLVM backend. This seems to be the case here as well:

$ ghc -O2 -rtsopts --make A.hs -threaded -fforce-recomp -fllvm
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

$ time ./A +RTS -N2 -A100M                                    
./A +RTS -N2 -A100M  2.09s user 0.95s system 147% cpu 2.065 total

Nothing spectacular, but you are improving running time over your single threaded version, and I've not modified your original code in any way. To really improve things, you'll need to profile and optimize.

Revisiting the -A flags, we can bring the time down yet further using a tigher bound on the initial thread allocation area.

$ ghc -Odph -rtsopts --make A.hs -threaded -fforce-recomp -fllvm

$ time ./A +RTS -N2 -A60M -s
./A +RTS -N2 -A60M 1.99s user 0.73s system 144% cpu 1.880 total

So have brought it down to 1.8s from 2.7 (30% improvement) by using the parallel runtime, the LLVM backend, and being careful with GC flags. You can look at the GC flag surface to find the optimum:

在这里输入图像描述

With the trough around -A64 -N2 being ideal for the data set size.

I would also strongly consider using manual common subexpression elimination in your inner kernel, to avoid recomputing things excessively.

As Alp suggests, to see the runtime behavior of the program, compile threadscope (from Hackage) and run as follows:

$ ghc -O2 -fllvm -rtsopts -threaded -eventlog --make A.hs

$ ./A +RTS -ls -N2 -A60M

And you get an event trace for your two cores like so:

So what's going on here? You have an initial period (0.8s) of setup time -- allocating your big list and encoding it into a repa array -- as you can see by the single threaded interleaving of GC and execution. Then there's another 0.8s of something on a single core, before your actual parallel work occurs for the last 300ms.

So while your actual algorithm may be parallelizing nicely, all the surrounding test setup basically swamps the result. If we serialize your dataset, and then just load it back from disk, we can get better behavior:

$ time ./A +RTS -N2 -A60M
./A +RTS -N2 -A60M  1.76s user 0.25s system 186% cpu 1.073 total

and now your profile looks healthier:

在这里输入图像描述

This looks great! Very little GC (98.9% productivity), and my two cores running in parallel happily.

So, finally, we can see you get good parallelism:

With 1 core, 1.855s

$ time ./A +RTS -N1 -A25M
./A +RTS -N1 -A25M  1.75s user 0.11s system 100% cpu 1.855 total

and with 2 cores, 1.014s

$ time ./A +RTS -N2 -A25M   
./A +RTS -N2 -A25M  1.78s user 0.13s system 188% cpu 1.014 total

Now, specifically answer your questions:

Is there a more idiomatic way of doing this in repa?

In general, repa code should consist of parallel traverals, consumers and produces, and inlinable kernel functions. So as long as you are doing that, then the code is probably idiomatic. If in doubt, look at the tutorial. I'd in general mark your worker kernels (like f ) as inlined.

Will the above method actually price in parallel?

The code will execute in parallel if you use parallel combinators like computeP or the various maps and folds. So yes, it should and does run in parallel.

How can I determine whether my code really is generating something that will be executed in parallel?

Generally, you will know a priori because you use parallel operations. If in doubt, run the code and observe the speedup. You may then need to optimize the code.

链接地址: http://www.djcxy.com/p/59976.html

上一篇: 并行修复代码不会产生火花

下一篇: 使用Repa并行阵列的习惯性选项定价和风险