100x faster than HMatrix for small matrices?

We spend the majority of our CPU cycles on operations involving small matrices, so I wondered if it was possible to optimize for that case. Consider the following code:

module Main where

import Numeric.LinearAlgebra.HMatrix
import Criterion.Main


data Matrix2x2 = Matrix2x2 {-# UNPACK #-} !Double {-# UNPACK #-} !Double
                           {-# UNPACK #-} !Double {-# UNPACK #-} !Double

mul2x2p :: Matrix2x2 -> Matrix2x2 -> Matrix2x2
mul2x2p (Matrix2x2 a1 b1 c1 d1) (Matrix2x2 a2 b2 c2 d2) =
  Matrix2x2 (a1*a2 + b1*c2) (a1*b2 + b1*d2) (c1*a2 + d1*c2) (c1*b2 + d1*d2)

inv2x2 :: Matrix2x2 -> Matrix2x2
inv2x2 (Matrix2x2 a b c d) =
  let det = a * d - b * c
  in Matrix2x2 (d / det) (-b / det) (-c / det) (a / det)

add2x2 :: Matrix2x2 -> Matrix2x2 -> Matrix2x2
add2x2 (Matrix2x2 a1 b1 c1 d1) (Matrix2x2 a2 b2 c2 d2) =
  Matrix2x2 (a1+a2) (b1+b2) (c1+c2) (d1+d2)

hm1 = matrix 2 [1, 2, 3, 4]
hm2 = matrix 2 [5, 6, 7, 8]

pm1 = Matrix2x2 1 2 3 4
pm2 = Matrix2x2 5 6 7 8

main = defaultMain [
  bgroup "matrix tests" [ bench "pure mult"     $ whnf (mul2x2p pm1) pm2
                        , bench "hmatrix mult"  $ whnf (hm1 <>) hm2
                        , bench "pure add"      $ whnf (add2x2 pm1) pm2
                        , bench "hmatrix add"   $ whnf (hm1 +) hm2
                        , bench "pure inv"      $ whnf inv2x2 pm1
                        , bench "hmatrix inv"   $ whnf inv hm1
                        ]]

The results:

benchmarking matrix tests/pure mult
time                 6.461 ns   (6.368 ns .. 6.553 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 6.482 ns   (6.394 ns .. 6.594 ns)
std dev              345.1 ps   (271.4 ps .. 477.3 ps)
variance introduced by outliers: 77% (severely inflated)

benchmarking matrix tests/hmatrix mult
time                 180.6 ns   (178.2 ns .. 183.1 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 183.0 ns   (180.6 ns .. 186.3 ns)
std dev              9.363 ns   (7.405 ns .. 12.73 ns)
variance introduced by outliers: 71% (severely inflated)

benchmarking matrix tests/pure add
time                 6.262 ns   (6.223 ns .. 6.297 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 6.281 ns   (6.220 ns .. 6.355 ns)
std dev              235.0 ps   (183.3 ps .. 321.0 ps)
variance introduced by outliers: 62% (severely inflated)

benchmarking matrix tests/hmatrix add
time                 116.4 ns   (115.0 ns .. 117.9 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 116.3 ns   (115.2 ns .. 117.7 ns)
std dev              4.176 ns   (3.447 ns .. 5.150 ns)
variance introduced by outliers: 55% (severely inflated)

benchmarking matrix tests/pure inv
time                 7.811 ns   (7.718 ns .. 7.931 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 7.895 ns   (7.808 ns .. 7.988 ns)
std dev              296.4 ps   (247.2 ps .. 358.3 ps)
variance introduced by outliers: 62% (severely inflated)

benchmarking matrix tests/hmatrix inv
time                 908.5 ns   (901.3 ns .. 916.6 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 934.0 ns   (917.6 ns .. 961.3 ns)
std dev              73.92 ns   (50.53 ns .. 108.6 ns)
variance introduced by outliers: 84% (severely inflated)

My questions are:

1) Is the speedup real, or an artifact of the benchmarking process?

2) If the speedup is real, is there an existing library that handles, say, 1x1, 2x2, 3x3, and 4x4 matrices as special cases?

3) If not, what is the best way to wrap HMatrix so that we can take the fast path when the matrix is small? GHC can only unpack fields of single-constructor data types. Is there a way to automatically generate different versions of our code for each small size?
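One possible shape for question 3 (a sketch, not a tested design; the names `M`, `mulM`, and `toH` are made up for illustration) is a sum type with an unboxed small-matrix constructor and an HMatrix fallback. The fields of the 2x2 constructor can still be unpacked, because {-# UNPACK #-} applies per field; what the extra constructor costs is only the tag check on each operation and the inability to unpack an `M` into an enclosing record:

```haskell
import qualified Numeric.LinearAlgebra.HMatrix as H

-- A 2x2 fast path plus a general fallback.
data M = M2 {-# UNPACK #-} !Double {-# UNPACK #-} !Double
            {-# UNPACK #-} !Double {-# UNPACK #-} !Double
       | MBig !(H.Matrix Double)

mulM :: M -> M -> M
mulM (M2 a1 b1 c1 d1) (M2 a2 b2 c2 d2) =
  M2 (a1*a2 + b1*c2) (a1*b2 + b1*d2) (c1*a2 + d1*c2) (c1*b2 + d1*d2)
mulM x y = MBig (toH x H.<> toH y)   -- mixed or big: fall back to BLAS

-- Convert the small representation to an HMatrix matrix when needed.
toH :: M -> H.Matrix Double
toH (M2 a b c d) = H.matrix 2 [a, b, c, d]
toH (MBig m)     = m
```

Each further operation (addition, inversion, ...) would need the same two-case treatment, which is tedious but mechanical; Template Haskell could generate the small-size cases.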

example-test.cabal:

name:                example-test
version:             0.1.0.0
build-type:          Simple
cabal-version:       >=1.10
executable example-test
  main-is:
    Main.hs
  build-depends:
    base >=4.7 && <4.8,
    criterion,
    hmatrix
  default-language:
    Haskell2010
  ghc-options:
    -H12G -O3 -optc-O3 -fllvm -rtsopts -threaded -fexcess-precision -j6 +RTS -N6 -RTS  -fno-ignore-asserts -fcontext-stack=150
    -- -fforce-recomp

HMatrix provides a functional interface to standard dense linear algebra (svd, eigenvalues, linear systems, etc.) based on the excellent and highly optimized BLAS/LAPACK libraries.

There is an overhead caused by the FFI calls, memory allocation, error checks, code to avoid some matrix transposes, etc. This is negligible in comparison to the time required by typical matrix computations of moderate and big dimensions, but for very small arrays and simple operations the overhead can be large.

There are no immediate plans to reimplement computations for small arrays. It is not easy to do things better than BLAS/LAPACK, and I am not sure that some speed gain for a particular application is more important than code simplicity and generality.

If your application requires a large number of simple operations on 2x2 or 3x3 matrices, then other libraries are probably more appropriate for this task.

But I recommend that you run your program in profiling mode to find out the true bottlenecks. A benchmark using constant data may not be fully representative of the times spent in a "normal" program.

Even if you find that most time is spent in those computations with small matrices, perhaps you can rewrite your algorithm using an equivalent matrix computation of bigger dimension.
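One common instance of this idea (a sketch under the assumption that the same small matrix is applied to many vectors) is to pack the vectors as columns of a single matrix, so one BLAS call replaces many tiny ones:

```haskell
import Numeric.LinearAlgebra.HMatrix

a :: Matrix Double
a = matrix 2 [1, 2, 3, 4]

-- Each column is one 2-vector; N columns means one call instead of N.
vs :: Matrix Double
vs = matrix 3 [5, 6, 7,
               8, 9, 10]

batched :: Matrix Double
batched = a <> vs
```

The per-call FFI overhead is then amortized over all N columns instead of being paid N times.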


For small vectors and matrices, people often use the linear package: https://hackage.haskell.org/package/linear
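A minimal sketch with linear (assuming the Linear.Matrix API, where an M22 is a V2 of row V2s):

```haskell
import Linear.V2 (V2(..))
import Linear.Matrix (M22, (!*!), inv22)

m1, m2 :: M22 Double
m1 = V2 (V2 1 2) (V2 3 4)
m2 = V2 (V2 5 6) (V2 7 8)

prod, total, invd :: M22 Double
prod  = m1 !*! m2   -- matrix product
total = m1 + m2     -- elementwise, via the Num instance on V2
invd  = inv22 m1    -- closed-form 2x2 inverse
```

linear provides closed-form inverses up to 4x4 (inv22, inv33, inv44), which matches the special cases asked about in question 2.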
