Cython for NumPy users

This tutorial is aimed at NumPy users who have no experience with Cython at all. If you have some knowledge of Cython you may want to skip to the ‘’Efficient indexing’’ section.

The main scenario considered is NumPy end-use rather than NumPy/SciPy development. The reason is that Cython is not (yet) able to support functions that are generic with respect to the number of dimensions in a high-level fashion. This restriction is much more severe for SciPy development than more specific, “end-user” functions. See the last section for more information on this.

The style of this tutorial will not fit everybody, so you can also consider:

Cython at a glance

Cython is a compiler which compiles Python-like code files to C code. Still, ‘’Cython is not a Python to C translator’‘. That is, it doesn’t take your full program and “turns it into C” – rather, the result makes full use of the Python runtime environment. A way of looking at it may be that your code is still Python in that it runs within the Python runtime environment, but rather than compiling to interpreted Python bytecode one compiles to native machine code (but with the addition of extra syntax for easy embedding of faster C-like code).

This has two important consequences:

  • Speed. How much depends very much on the program involved though. Typical Python numerical programs would tend to gain very little as most time is spent in lower-level C that is used in a high-level fashion. However for-loop-style programs can gain many orders of magnitude, when typing information is added (and is so made possible as a realistic alternative).
  • Easy calling into C code. One of Cython’s purposes is to allow easy wrapping of C libraries. When writing code in Cython you can call into C code as easily as into Python code.

Very few Python constructs are not yet supported, though making Cython compile all Python code is a stated goal, you can see the differences with Python in limitations.

Your Cython environment

Using Cython consists of these steps:

  1. Write a .pyx source file
  2. Run the Cython compiler to generate a C file
  3. Run a C compiler to generate a compiled library
  4. Run the Python interpreter and ask it to import the module

However there are several options to automate these steps:

  1. The SAGE mathematics software system provides excellent support for using Cython and NumPy from an interactive command line or through a notebook interface (like Maple/Mathematica). See this documentation.
  2. Cython can be used as an extension within a Jupyter notebook, making it easy to compile and use Cython code with just a %%cython at the top of a cell. For more information see Using the Jupyter Notebook.
  3. A version of pyximport is shipped with Cython, so that you can import pyx-files dynamically into Python and have them compiled automatically (See Compiling with pyximport).
  4. Cython supports distutils so that you can very easily create build scripts which automate the process, this is the preferred method for Cython implemented libraries and packages. See Compiling with distutils.
  5. Manual compilation (see below)

Note

If using another interactive command line environment than SAGE, like IPython or Python itself, it is important that you restart the process when you recompile the module. It is not enough to issue an “import” statement again.

Installation

If you already have a C compiler, just do:

pip install Cython

otherwise, see the installation page.

As of this writing SAGE comes with an older release of Cython than required for this tutorial. So if using SAGE you should download the newest Cython and then execute

$ cd path/to/cython-distro
$ path-to-sage/sage -python setup.py install

This will install the newest Cython into SAGE.

Manual compilation

As it is always important to know what is going on, I’ll describe the manual method here. First Cython is run:

$ cython yourmod.pyx

This creates yourmod.c which is the C source for a Python extension module. A useful additional switch is -a which will generate a document yourmod.html) that shows which Cython code translates to which C code line by line.

Then we compile the C file. This may vary according to your system, but the C file should be built like Python was built. Python documentation for writing extensions should have some details. On Linux this often means something like:

$ gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o yourmod.so yourmod.c

gcc should have access to the NumPy C header files so if they are not installed at /usr/include/numpy or similar you may need to pass another option for those. You only need to provide the NumPy headers if you write:

cimport numpy

in your Cython code.

This creates yourmod.so in the same directory, which is importable by Python by using a normal import yourmod statement.

The first Cython program

The code below does 2D discrete convolution of an image with a filter (and I’m sure you can do better!, let it serve for demonstration purposes). It is both valid Python and valid Cython code. I’ll refer to it as both convolve_py.py for the Python version and convolve_cy.pyx for the Cython version – Cython uses “.pyx” as its file suffix.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
from __future__ import division
import numpy as np
def naive_convolve(f, g):
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid
    # Allocate result image.
    h = np.zeros([xmax, ymax], dtype=f.dtype)
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h

This should be compiled to produce convolve_cy.so (for Linux systems, on Windows systems, this will be a .pyd file). We run a Python session to test both the Python version (imported from .py-file) and the compiled Cython module.

In [1]: import numpy as np
In [2]: import convolve_py
In [3]: convolve_py.naive_convolve(np.array([[1, 1, 1]], dtype=np.int),
...     np.array([[1],[2],[1]], dtype=np.int))
Out [3]:
array([[1, 1, 1],
    [2, 2, 2],
    [1, 1, 1]])
In [4]: import convolve_cy
In [4]: convolve_cy.naive_convolve(np.array([[1, 1, 1]], dtype=np.int),
...     np.array([[1],[2],[1]], dtype=np.int))
Out [4]:
array([[1, 1, 1],
    [2, 2, 2],
    [1, 1, 1]])
In [11]: N = 600
In [12]: f = np.arange(N*N, dtype=np.int).reshape((N,N))
In [13]: g = np.arange(81, dtype=np.int).reshape((9, 9))
In [19]: %timeit convolve_py.naive_convolve(f, g)
16 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [20]: %timeit convolve_cy.naive_convolve(f, g)
13.5 s ± 99.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There’s not such a huge difference yet; because the C code still does exactly what the Python interpreter does (meaning, for instance, that a new object is allocated for each number used).

You can look at the Python interaction and the generated C code by using -a when calling Cython from the command line, %%cython -a when using a Jupyter Notebook, or by using cythonize('convolve_cy.pyx', annotate=True) when using a setup.py. Look at the generated html file and see what is needed for even the simplest statements. You get the point quickly. We need to give Cython more information; we need to add types.

Adding types

To add types we use custom Cython syntax, so we are now breaking Python source compatibility. Here’s convolve_typed.pyx. Read the comments!

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import numpy as np

# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.intc

def naive_convolve(f, g):
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    assert f.dtype == DTYPE and g.dtype == DTYPE
    # The "cdef" keyword is also used within functions to type variables. It
    # can only be used at the top indentation level (there are non-trivial
    # problems with allowing them in other places, though we'd love to see
    # good and thought out proposals for it).

    # Py_ssize_t is the proper C type for Python array indices.
    cdef Py_ssize_t x, y, s, t, v, w, s_from, s_to, t_from, t_to

    cdef Py_ssize_t vmax = f.shape[0]
    cdef Py_ssize_t wmax = f.shape[1]
    cdef Py_ssize_t smax = g.shape[0]
    cdef Py_ssize_t tmax = g.shape[1]
    cdef Py_ssize_t smid = smax // 2
    cdef Py_ssize_t tmid = tmax // 2
    cdef Py_ssize_t xmax = vmax + 2*smid
    cdef Py_ssize_t ymax = wmax + 2*tmid
    h = np.zeros([xmax, ymax], dtype=DTYPE)
    # It is very important to type ALL your variables. You do not get any
    # warnings if not, only much slower code (they are implicitly typed as
    # Python objects).
    # For the value variable, we want to use the same data type as is
    # stored in the array, so we use int because it correspond to np.intc.
    # NB! An important side-effect of this is that if "value" overflows its
    # datatype size, it will simply wrap around like in C, rather than raise
    # an error like in Python.
    cdef int value
    for x in range(xmax):
        for y in range(ymax):
            # Cython has built-in C functions for min and max.
            # This makes the following lines very fast.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h
../../_images/convolve_types_html.png

At this point, have a look at the generated C code for convolve_cy.pyx and convolve_typed.pyx. Click on the lines to expand them and see corresponding C.

Especially have a look at the for-loops: In convolve_cy.c, these are ~20 lines of C code to set up while in convolve_typed.c a normal C for loop is used.

After building this and continuing my (very informal) benchmarks, I get:

In [22]: %timeit convolve_typed.naive_convolve(f, g)
55.8 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So in the end, adding types make the Cython code slower?

What happened is that most of the time spend in this code is spent on line 54.

value += g[smid - s, tmid - t] * f[v, w]

So what made this line so much slower than in the pure Python version?

g and f are still NumPy arrays, so Python objects, and expect Python integers as indexes. Here we pass C int values. So every time Cython reaches this line, it has to convert all the C integers to Python int objects. Since this line is called very often, it outweighs the speed benefits of the pure C loops that were created from the range() earlier.

Furthermore, g[smid - s, tmid - t] * f[v, w] returns a Python integer and value is a C integer, so Cython has to do type conversions again. In the end those types conversions add up. And made our convolution really slow. But this problem can be solved easily by using memoryviews.

Efficient indexing with memoryviews

There are still two bottlenecks that degrade the performance, and that is the array lookups and assignments, as well as C/Python types conversion. The []-operator still uses full Python operations – what we would like to do instead is to access the data buffer directly at C speed.

What we need to do then is to type the contents of the ndarray objects. We do this with a memoryview. There is a page in the Cython documentation dedicated to it.

In short, memoryviews are C structures that can hold a pointer to the data of a NumPy array and all the necessary buffer metadata to provide efficient and safe access: dimensions, strides, item size, item type information, etc… They also support slices, so they work even if the NumPy array isn’t contiguous in memory. They can be indexed by C integers, thus allowing fast access to the NumPy array data.

Here is how to declare a memoryview of integers:

cdef int [:] foo         # 1D memoryview
cdef int [:, :] foo      # 2D memoryview
cdef int [:, :, :] foo   # 3D memoryview
...                      # You get the idea.

No data is copied from the NumPy array to the memoryview in our example. As the name implies, it is only a “view” of the memory. So we can use the view h for efficient indexing and at the end return the real NumPy array h_np that holds the data that we operated on.

Here is how to use them in our code:

convolve_memview.pyx

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
import numpy as np

DTYPE = np.intc

# It is possible to declare types in the function declaration.
def naive_convolve(int [:,:] f, int [:,:] g):
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")

    # We don't need to check for the type of NumPy array here because
    # a check is already performed when calling the function.
    cdef Py_ssize_t x, y, s, t, v, w, s_from, s_to, t_from, t_to
    cdef Py_ssize_t vmax = f.shape[0]
    cdef Py_ssize_t wmax = f.shape[1]
    cdef Py_ssize_t smax = g.shape[0]
    cdef Py_ssize_t tmax = g.shape[1]
    cdef Py_ssize_t smid = smax // 2
    cdef Py_ssize_t tmid = tmax // 2
    cdef Py_ssize_t xmax = vmax + 2*smid
    cdef Py_ssize_t ymax = wmax + 2*tmid

    h_np =  np.zeros([xmax, ymax], dtype=DTYPE)
    cdef int [:,:] h = h_np

    cdef int value
    for x in range(xmax):
        for y in range(ymax):
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h_np

Let’s see how much faster accessing is now.

In [22]: %timeit convolve_memview.naive_convolve(f, g)
57.1 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note the importance of this change. We’re now 280 times faster than an interpreted version of Python.

Memoryviews can be used with slices too, or even with Python arrays. Check out the memoryview page to see what they can do for you.

Tuning indexing further

The array lookups are still slowed down by two factors:

  1. Bounds checking is performed.

  2. Negative indices are checked for and handled correctly. The code above is explicitly coded so that it doesn’t use negative indices, and it (hopefully) always access within bounds.

    With decorators, we can deactivate those checks:

    ...
    cimport cython
    @cython.boundscheck(False)  # Deactivate bounds checking
    @cython.wraparound(False)   # Deactivate negative indexing.
    def naive_convolve(int [:, :] f, int [:, :] g):
    ...
    

Now bounds checking is not performed (and, as a side-effect, if you ‘’do’’ happen to access out of bounds you will in the best case crash your program and in the worst case corrupt data). It is possible to switch bounds-checking mode in many ways, see Compiler directives for more information.

In [23]: %timeit convolve_index.naive_convolve(f, g)
19.8 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We’re now 800 times faster than the interpreted Python version.

Warning

Speed comes with some cost. Especially it can be dangerous to set typed objects (like f, g and h in our sample code) to None. Setting such objects to None is entirely legal, but all you can do with them is check whether they are None. All other use (attribute lookup or indexing) can potentially segfault or corrupt data (rather than raising exceptions as they would in Python).

The actual rules are a bit more complicated but the main message is clear: Do not use typed objects without knowing that they are not set to None.

Declaring the NumPy arrays as contiguous

For extra speed gains, if you know that the NumPy arrays you are providing are contiguous in memory, you can declare the memoryview as contiguous.

We give an example on an array that has 3 dimensions. If you want to give Cython the information that the data is C-contiguous you have to declare the memoryview like this:

cdef int [:,:,::1] a

If you want to give Cython the information that the data is Fortran-contiguous you have to declare the memoryview like this:

cdef int [::1, :, :] a

If all this makes no sense to you, you can skip this part, the performance gains are not that important. If you still want to understand what contiguous arrays are all about, you can see this answer on StackOverflow.

For the sake of giving numbers, here are the speed gains that you should get by declaring the memoryviews as contiguous:

In [23]: %timeit convolve_contiguous.naive_convolve(f, g)
21.3 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We’re still around 800 times faster than the interpreted Python version.

Making the function cleaner

Declaring types can make your code quite verbose. If you don’t mind Cython inferring the C types of your variables, you can use the infer_types=True compiler directive at the top of the file. It will save you quite a bit of typing.

Note that since type declarations must happen at the top indentation level, Cython won’t infer the type of variables declared for the first time in other indentation levels. It would change too much the meaning of our code. This is why, we must still declare manually the type of the value variable.

And actually, manually giving the type of the value variable will be useful when using fused types.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# cython: infer_types=True
import numpy as np
cimport cython

DTYPE = np.intc

@cython.boundscheck(False)
@cython.wraparound(False)
def naive_convolve(int [:,::1] f, int [:,::1] g):
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")

    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid

    h_np =  np.zeros([xmax, ymax], dtype=DTYPE)
    cdef int [:,::1] h = h_np

    cdef int value
    for x in range(xmax):
        for y in range(ymax):
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h_np

We now do a speed test:

In [24]: %timeit convolve_infer_types.naive_convolve(f, g)
21.3 ms ± 344 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We’re still around 800 times faster than the interpreted Python version.

More generic code

All those speed gains are nice, but adding types constrains our code. At the moment, it would mean that our function can only work with NumPy arrays with the np.intc type. Is it possible to make our code work for multiple NumPy data types?

Yes, with the help of a new feature called fused types. You can learn more about it at this section of the documentation. It is similar to C++ ‘s templates. It generates multiple function declarations at compile time, and then chooses the right one at run-time based on the types of the arguments provided. By comparing types in if-conditions, it is also possible to execute entirely different code paths depending on the specific data type.

In our example, since we don’t have access anymore to the NumPy’s dtype of our input arrays, we use those if-else statements to know what NumPy data type we should use for our output array.

In this case, our function now works for ints, doubles and floats.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
# cython: infer_types=True
import numpy as np
cimport cython

ctypedef fused my_type:
    int
    double
    long

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef naive_convolve(my_type [:,:] f, my_type [:,:] g):
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")

    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid

    if my_type is int:
        dtype = np.intc
    elif my_type is double:
        dtype = np.double
    else:
        dtype = np.long

    h_np =  np.zeros([xmax, ymax], dtype=dtype)
    cdef my_type [:,:] h = h_np

    cdef my_type value
    for x in range(xmax):
        for y in range(ymax):
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h_np

We can check that the output type is the right one:

>>>naive_convolve_fused_types(f, g).dtype
dtype('int32')
>>>naive_convolve_fused_types(f.astype(np.double), g.astype(np.double)).dtype
dtype('float64')

We now do a speed test:

In [25]: %timeit convolve_fused_types.naive_convolve(f, g)
20 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We’re still around 800 times faster than the interpreted Python version.

Where to go from here?