Software development for researchers

Parallelization and foreign function interfaces

This handout is tied to the following lecture

12. Parallelization and foreign function interfaces

Code examples

MPI

Dependencies

  • Arch: pacman -S base-devel openmpi
  • Ubuntu: apt install build-essential openmpi-bin openmpi-common libopenmpi-dev

Then install mpi4py with pip install mpi4py. If the installation fails, you are probably missing a basic build dependency; have a look at the docs to see what is missing.

The typical hello world example - but in parallel!
#!/usr/bin/env python
from mpi4py import MPI

comm = MPI.COMM_WORLD
name = MPI.Get_processor_name()
print(f"hello! name: {name}, my rank is {comm.rank}")
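To see the parallelism, launch the script with the MPI launcher rather than plain python. A minimal sketch (the filename hello.py is an assumption; mpiexec is the standard-mandated spelling of the launcher and works the same way):

```shell
# Start 4 processes; each one runs the script as a separate MPI rank.
mpirun -n 4 python hello.py
```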
A more involved example with actual communication and processing logic
#!/usr/bin/env python
import time
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

t0_all = time.time()
print(f"rank-{comm.rank}/{comm.size} starting")

if comm.rank == 0:
    items = [np.random.random(1000) + x for x in range(10000)]
else:
    items = None

items = comm.bcast(items, root=0)
t0 = time.time()

for ind in range(comm.rank, len(items), comm.size):
    items[ind] = items[ind] - np.mean(items[ind])

dt = time.time() - t0

if comm.rank == 0:
    for rank in range(1, comm.size):
        for ind in range(rank, len(items), comm.size):
            items[ind] = comm.recv(source=rank, tag=ind)
else:
    for ind in range(comm.rank, len(items), comm.size):
        comm.send(items[ind], dest=0, tag=ind)

if comm.rank == 0:
    su = 0
    for dat in items:
        su += np.mean(dat)
    print(f"rank-0 checksum: {su}")

dt_all = time.time() - t0_all
print(f"rank-{comm.rank}/{comm.size}: work time {dt} s (total {dt_all} s)")
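The example above distributes work round-robin: rank r processes indices r, r + size, r + 2*size, and so on. The indexing pattern can be checked on its own, without MPI (a standalone sketch, not part of the example; the helper name is made up):

```python
def my_indices(rank, size, n_items):
    """Indices a given rank processes under round-robin distribution."""
    return list(range(rank, n_items, size))

# With 3 workers and 8 items, every item is handled exactly once:
size, n = 3, 8
chunks = [my_indices(r, size, n) for r in range(size)]
assert sorted(i for c in chunks for i in c) == list(range(n))
```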

Threading

A threading/multiprocessing example highlighting the Global Interpreter Lock (GIL)
import time
from threading import Thread
from multiprocessing import Pool

iter_num = 60000000


def counter(num):
    while num > 0:
        num -= 1


t1 = Thread(target=counter, args=(iter_num // 2,))
t2 = Thread(target=counter, args=(iter_num // 2,))

start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
end = time.time()

print("threading exec time -", end - start)


start = time.time()
counter(iter_num)
end = time.time()

print("linear exec time -", end - start)

pool = Pool(2)

start = time.time()

r1 = pool.apply_async(counter, [iter_num // 2])
r2 = pool.apply_async(counter, [iter_num // 2])
pool.close()
pool.join()

end = time.time()

print("multiprocessing exec time -", end - start)
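The counter above is CPU-bound, so the two threads contend for the GIL and run no faster than the linear version. For I/O-bound work the picture flips: blocking calls release the GIL, so threads genuinely overlap. A minimal sketch, using time.sleep as a stand-in for a blocking read:

```python
import time
from threading import Thread

def fake_io(seconds):
    # time.sleep releases the GIL, just like a blocking I/O call would.
    time.sleep(seconds)

start = time.time()
threads = [Thread(target=fake_io, args=(0.2,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# The two 0.2 s "I/O waits" overlap, so the total is ~0.2 s, not 0.4 s.
print("threaded I/O time -", elapsed)
```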

Ctypes

A minimalist example of using ctypes to load compiled C code and pass numpy arrays
main.py
import ctypes

import numpy as np
import numpy.typing as npt
import numpy.ctypeslib as npct

lib = ctypes.cdll.LoadLibrary("./main.so")
# Define the data types we need
ro_f8_vec = npct.ndpointer(
    np.float64, ndim=1, flags="aligned, contiguous",
)
rw_f8_vec = npct.ndpointer(
    np.float64, ndim=1, flags="aligned, contiguous, writeable",
)

# Define the C-interface
lib.twice.restype = None
lib.twice.argtypes = [
    ro_f8_vec,
    rw_f8_vec,
    ctypes.c_int,
]


def twice(values: npt.NDArray[np.float64]):
    result = np.empty_like(values)
    length = ctypes.c_int(len(values))

    lib.twice(values, result, length)

    return result


if __name__ == "__main__":
    x = np.arange(10, dtype=np.float64)

    print("running c-twice")
    y = twice(x)

    print(f"INPUT : {x}")
    print(f"OUTPUT: {y}")
main.c
void twice(double* in_array, double* out_array, int length) {
    int i;
    for(i = 0; i<length; i++) {
        out_array[i] = in_array[i]*2;
    }
}
Makefile
all:
	gcc main.c -fPIC -shared -o main.so
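If you want to experiment with ctypes before compiling anything yourself, you can load a library that is already on the system, e.g. the C math library. A small sketch (assumes a Unix-like system where find_library can locate libm):

```python
import ctypes
import ctypes.util

# Locate and load the C math library (libm on Linux).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double cos(double)
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```

Without the argtypes/restype declarations, ctypes would default to treating arguments and return values as C ints, silently producing garbage for doubles.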

Python.h

A minimalist example of building a Python module in C/C++
bar.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <stdlib.h>
#include <string.h>

static PyObject* say_hello(PyObject *self, PyObject *args)
{
    const char *name;
    if (!PyArg_ParseTuple(args, "s", &name)) 
    {
        return NULL;
    }
    
    size_t n = strlen(name);
    size_t response_len = 7 + n + 1;
    char *response = (char *)malloc(response_len);
    if (!response) {
        return PyErr_NoMemory();
    }

    strcpy(response, "Hello, ");
    strcat(response, name);

    PyObject *ret = Py_BuildValue("s", response);
    free(response);
    return ret;
}

/* Table of what is in this module */
static PyMethodDef BarMethods[] = 
{
    {"hello", say_hello, METH_VARARGS,
     "Greets you!"},
    {NULL, NULL, 0, NULL}        /* Sentinel */
};

/* Definition of the module */
static struct PyModuleDef barmodule = {
    PyModuleDef_HEAD_INIT,
    "bar",    /* name of module */
    NULL,     /* module documentation, may be NULL */
    -1,       /* size of per-interpreter state of the module,
                 or -1 if the module keeps state in global variables. */
    BarMethods
};

/* This actually creates the module using the module definition */
PyMODINIT_FUNC PyInit_bar(void)
{
    return PyModule_Create(&barmodule);
}
main.py
import bar

print("bar content: ", dir(bar))
print(bar.hello("Daniel"))
Makefile
CFLAGS=$(shell python3-config --cflags) -fPIC -shared
INCLUDES=$(shell python3-config --includes)

build:
	gcc $(INCLUDES) $(CFLAGS) bar.c -o bar.so

Cython

An example of using Cython to transpile Python to compiled C
py_calc.py
def calc(vec, mat, repeat):
    n = len(vec)
    m = len(mat)
    ret = [0] * m
    tmp = [0] * m

    for row in range(m):
        ret[row] = vec[row]
    for ind in range(repeat):
        for row in range(m):
            val = 0
            for col in range(n):
                val += ret[col] * mat[row][col]
            tmp[row] = val
        for row in range(m):
            ret[row] = tmp[row]

    return ret
compiled_py_calc.py
def calc(vec, mat, repeat):
    n = len(vec)
    m = len(mat)
    ret = [0] * m
    tmp = [0] * m

    for row in range(m):
        ret[row] = vec[row]
    for ind in range(repeat):
        for row in range(m):
            val = 0
            for col in range(n):
                val += ret[col] * mat[row][col]
            tmp[row] = val
        for row in range(m):
            ret[row] = tmp[row]

    return ret
cy_calc.pyx
def calc(vec, mat, repeat):
    cdef int n, m
    cdef double val

    n = len(vec)
    m = len(mat)
    ret = [0] * m
    tmp = [0] * m

    for row in range(m):
        ret[row] = vec[row]
    for ind in range(repeat):
        for row in range(m):
            val = 0
            for col in range(n):
                val += ret[col] * mat[row][col]
            tmp[row] = val
        for row in range(m):
            ret[row] = tmp[row]

    return ret
main.py
import py_calc
import compiled_py_calc
import cy_calc
import time
import random


def test_func(func):
    random.seed(3245)
    vec = [float(x) for x in range(500)]
    mat = [[random.random() for row in range(500)] for col in range(500)]
    repeat = 50

    t0 = time.process_time()
    func(vec, mat, repeat)
    dt = time.process_time() - t0

    return dt


py_dt = test_func(py_calc.calc)
cpy_dt = test_func(compiled_py_calc.calc)
cy_dt = test_func(cy_calc.calc)

print(f"  python: {py_dt:.2e} s (speedup: {py_dt/py_dt:.2f})")
print(f"compiled: {cpy_dt:.2e} s (speedup: {py_dt/cpy_dt:.2f})")
print(f"  cython: {cy_dt:.2e} s (speedup: {py_dt/cy_dt:.2f})")
setup.py
from setuptools import setup
from Cython.Build import cythonize

sources = [
    "compiled_py_calc.py",
    "cy_calc.pyx",
]

setup(ext_modules=cythonize(sources))
Makefile
all:
	python setup.py build_ext --inplace

Extensions in other languages

The most common way to build extensions for your Python package in other languages is through a setuptools Extension. This system can, for example, compile C code as part of the build and installation process of your package. However, there are other methods, such as using meson.
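As a sketch, a setup.py that compiles the bar.c file from the Python.h example above through a setuptools Extension might look like this (assuming bar.c sits next to it; this is a config fragment, not a tested build script):

```python
from setuptools import setup, Extension

setup(
    ext_modules=[
        # Compiles bar.c into an importable extension module named "bar".
        Extension("bar", sources=["bar.c"]),
    ],
)
```

It is built the same way as the Cython example: python setup.py build_ext --inplace.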

There are also systems for automatically creating interfaces, such as pybind11 or numpy's f2py.

Increasing performance

The Computer Language Benchmarks Game is a quite fun website that micro-benchmarks languages. Remember, this is of course nowhere near a real metric for whether one language is more performant than another (for that you need a real domain-specific example with actual throughput), but it does reveal some things about a language's basic functionality and its compiler/interpreter.

High-performance computing system tools

Languages

So far I have only found one programming language that is specifically designed to be used for parallel or high-performance computing. That language is Chapel. So far it seems very interesting and I’m keeping my eye on it!

Hewlett Packard Enterprise - Chapel: Making parallel computing as easy as Py(thon), from laptops to supercomputers

Also, Chapel supports interoperability with other languages through C, which means you could probably interface to a Chapel program through ctypes from Python, and similarly parallelize your existing C code by calling the functions from a shared library in Chapel.

One example of using Chapel to implement HPC calculations is arkouda, a Python package for executing data processing on a cluster where the cluster server is written in Chapel.

Schedulers

Data systems

Map-reduce systems

More stuff

There are more links to HPC tooling at awesome-high-performance-computing.

Application binary interface

An application binary interface is an interface for accessing machine code in-process, e.g. calling a function in a compiled binary from a running program.

The blog post mentioned in the lecture can be found here. Again, if you are not much for reading, there is a reaction video for the post.

ThePrimeTime - C Is Not A Language Anymore

IRF and HPC2N

Homework

Parallelization

Parallelize some piece of software you have written, either during the course or elsewhere. The important part is that this software should theoretically benefit from parallelization (you don't have to fit parallel-portion coefficients if you already know that most of the program can be trivially parallelized). You will then show the code before and after so that we can discuss it, as well as show benchmarks of before and after.
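For the before/after benchmark numbers, a minimal timing harness is enough. A sketch (the function and parameter names here are made up), using time.perf_counter and reporting the best of several runs to reduce noise:

```python
import time

def bench(func, *args, repeats=5):
    """Run func(*args) several times; return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Example: time summing a million integers.
print(f"best of 5: {bench(sum, range(1_000_000)):.3e} s")
```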

Python as glue

Choose some small function from software you have written, either during the course or elsewhere. The function should be doing mostly "Python things", not just be one line of numpy. Then implement this function in C or another low-level language of your choice and create an extension for your code. You should be able to call the function via your Python code without using something like subprocess; rather, it should use ctypes or something like the Python.h or f2py approach. Benchmark the performance before and after. We will review the compiled low-level code and the Python code before and after.
