Cython either marginally faster or slower than pure Python

2018-06-22 10:09:54

I am using several techniques ( NumPy , Weave and Cython ) to perform a Python performance benchmark. What the code basically does mathematically is C = AB , where A, B and C are N x N matrices ( NOTE: this is a matrix product and not an element-wise multiplication).

I have written 5 distinct implementations of the code:

Pure python (Loop over 2D Python lists)

NumPy (Dot product of 2D NumPy arrays)

Weave inline (C++ loop over 2D arrays)

Cython (Loop over 2D Python lists + static typing)

Cython-Numpy (Loop over 2D NumPy arrays + static typing)

My expectation is that implementations 2 through 5 will be substantially faster than implementation 1. My results however indicate otherwise. These are my normalised speed-up results relative to the pure Python implementation:

python_list: 1.00

numpy_array: 330.09

weave_inline: 30.72

cython_list: 2.80

cython_array: 0.14

I am quite happy with the performance of NumPy, however I am less enthusiastic about Weave's performance and Cython's performance makes me cry. My entire code is separated over two files. Everything is automated and you simply need to run the first file to see all results. Could someone please aid me by indicating what I could do to obtain better results?

matmul.py:

import time

import numpy as np
from scipy import weave
from scipy.weave import converters

import pyximport
pyximport.install()
import cython_matmul as cml


def python_list_matmul(A, B):
    C = np.zeros(A.shape, dtype=float).tolist()
    A = A.tolist()
    B = B.tolist()
    for k in xrange(len(A)):
        for i in xrange(len(A)):
            for j in xrange(len(A)):
                C[i][k] += A[i][j] * B[j][k]
    return C


def numpy_array_matmul(A, B):
    return np.dot(A, B)


def weave_inline_matmul(A, B):
    code = """
       int i, j, k;
       for (k = 0; k < N; ++k)
       {
           for (i = 0; i < N; ++i)
           {
               for (j = 0; j < N; ++j)
               {
                   C(i, k) += A(i, j) * B(j, k);
               }
           }
       }
       """

    C = np.zeros(A.shape, dtype=float)
    weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc')
    return C


N = 100
A = np.random.rand(N, N)
B = np.random.rand(N, N)

function = []
function.append([python_list_matmul, 'python_list'])
function.append([numpy_array_matmul, 'numpy_array'])
function.append([weave_inline_matmul, 'weave_inline'])
function.append([cml.cython_list_matmul, 'cython_list'])
function.append([cml.cython_array_matmul, 'cython_array'])

t = []
for i in xrange(len(function)):
    t1 = time.time()
    C = function[i][0](A, B)
    t2 = time.time()
    t.append(t2 - t1)
    print function[i][1] + ' t: ' + '{:10.6f}'.format(t[0] / t[-1])

cython_matmul.pyx:

import numpy as np
cimport numpy as np

import cython
cimport cython

DTYPE = np.float
ctypedef np.float_t DTYPE_t


@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cpdef cython_list_matmul(A, B):

    cdef int i, j, k
    cdef int N = len(A)

    A = A.tolist()
    B = B.tolist()
    C = np.zeros([N, N]).tolist()

    for k in xrange(N):
        for i in xrange(N):
            for j in xrange(N):
                C[i][k] += A[i][j] * B[j][k]
    return C


@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cpdef cython_array_matmul(np.ndarray[DTYPE_t, ndim=2] A, np.ndarray[DTYPE_t, ndim=2] B):

    cdef int i, j, k, N = A.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] C = np.zeros([N, N], dtype=DTYPE)

    for k in xrange(N):
        for i in xrange(N):
            for j in xrange(N):
                C[i][k] += A[i][j] * B[j][k]
    return C

Python lists and high performance math are incompatible, forget about cython_list_matmul .

The only problem with your cython_array_matmul is incorrect usage of indexing. It should be

C[i,k] += A[i,j] * B[j,k]

That's how numpy arrays are indexed in Python and that's the syntax Cython optimizes. With this change you should get decent performance.

Cython's annotation feature is really helpful in spotting optimization problems like this one. You could notice that A[i][j] produces a ton of Python API calls, while A[i,j] produces none.

Also, if you initialize all entries by hand, np.empty is more appropriate than np.zeros .

链接地址: http://www.djcxy.com/p/62926.html

上一篇: 迭代cython中的数组，列表比np.array更快吗？

下一篇: Cython比纯Python要稍快或慢一些