Parallel programming in Python: multiprocessing (part 1)

Parallel programming solves big numerical problems by dividing them into smaller sub-tasks, and hence reduces the overall computational time on multi-processor and/or multi-core machines. Parallel programming is well supported in traditional programming languages like C and FORTRAN, which are suitable for “heavy-duty” computational tasks. Traditionally, Python is considered to not support parallel programming very well, partly because of the global interpreter lock (GIL). However, things have changed over time. Thanks to the development of a rich variety of libraries and packages, the support for parallel programming in Python is now much better.

This post (and the following part) will briefly introduce the multiprocessing module in Python, which effectively side-steps the GIL by using subprocesses instead of threads. The multiprocessing module provides many useful features and is very suitable for symmetric multiprocessing (SMP) and shared memory systems. In this post we focus on the Pool class of the multiprocessing module, which controls a pool of worker processes and supports both synchronous and asynchronous parallel execution.

The Pool class

Creating a Pool object

To use the multiprocessing module, you need to import it first.

import multiprocessing as mp

Documentation for the module can be displayed with the help method.

help(mp)

The module can detect the number of available CPU cores via the cpu_count method. (Note that we use the Python3 syntax for printing the resulting number.)

nprocs = mp.cpu_count()
print(f"Number of CPU cores: {nprocs}")

In practice it is desirable to have one process per CPU core, so it is a good idea to set nprocs to be the number of available CPU cores. A Pool object can be created by passing the desired number of processes to the constructor.

pool = mp.Pool(processes=nprocs)

The map method

To demonstrate the usage of the Pool class, let’s define a simple function that calculates the square of a number.

def square(x):
    return x * x

Suppose we want to use the square method to calculate the squares of a list of integers. In serial programming we can use the following code to compute and print the result, via list comprehension.

result = [square(x) for x in range(20)]
print(result)

To execute the computation in parallel, we can use the map method of the Pool class, which is similar to the built-in map function in Python.

result = pool.map(square, range(20))
print(result)

The above parallel code will print exactly the same result as the serial code, but the computations are actually distributed and executed in parallel on the worker processes. The map method will guarantee that the order of the output is correct.

The starmap method

You may have noticed that the map method is only applicable to computational routines that accept a single argument (e.g. the previously defined square function). For routines that accept multiple arguments, the Pool class also provides the starmap method. For example, we can define a more general routine that computes a power of arbitrary order

def power_n(x, n):
    return x ** n

and pass this power_n routine and a list of input arguments to the starmap method.

result = pool.starmap(power_n, [(x, 2) for x in range(20)])
print(result)

Note that both map and starmap are synchronous methods. In other words, if a worker process finishes its sub-task very early, it will wait for the other worker processes to finish. This may lead to performance degradation if the workload is not well balanced among the worker processes.

The apply_async method

The Pool class also provides the apply_async method that makes asynchronous execution of the worker processes possible. Unlike the map method, which executes a computational routine over a list of inputs, the apply_async method executes the routine only once. Therefore, in the previous example, we would need to define another routine, power_n_list, that computes the values of a list of numbers raised to a particular power.

def power_n_list(x_list, n):
    return [x ** n for x in x_list]

To use the apply_async method, we also need to divide the whole input list range(20) into sub-lists (which are known as slices) and distribute them to the worker processes. The slices can be prepared by the following slice_data method.

def slice_data(data, nprocs):
    aver, res = divmod(len(data), nprocs)
    nums = []
    for proc in range(nprocs):
        if proc < res:
            nums.append(aver + 1)
        else:
            nums.append(aver)
    count = 0
    slices = []
    for proc in range(nprocs):
        slices.append(data[count: count+nums[proc]])
        count += nums[proc]
    return slices

Then we can pass the power_n_list routine and the sliced input lists to the apply_async method.

inp_lists = slice_data(range(20), nprocs)
multi_result = [pool.apply_async(power_n_list, (inp, 2)) for inp in inp_lists]

The actual result can be obtained using the get method and nested list comprehension.

result = [x for p in multi_result for x in p.get()]
print(result)

Note that the apply_async method itself does not guarantee the correct order of the output. In the above example, apply_async was used with list comprehension so that the result remained ordered (see also the examples).

Example: computing π

After that brief introduction, we can use the Pool class to do useful things. Here we use the calculation of π as a simple example to demonstrate the parallelization of Python code. The formula for computing π is given below.

Serial code

With the above formula we can compute the value of π via numerical integration over a large number of points. For example we can choose to use 10 million points. The serial code is shown below.

nsteps = 10000000
dx = 1.0 / nsteps
pi = 0.0
for i in range(nsteps):
    x = (i + 0.5) * dx
    pi += 4.0 / (1.0 + x * x)
pi *= dx

Parallel code

To parallelize the serial code for computing π, we need to divide the for loop into sub-tasks and distribute them to the worker processes. In other words, we need to evenly distribute the task of evaluating the integrand at 10 million points. This can be conveniently done by providing the start, stop and step arguments to the built-in range function. The first integer in the range function is the start of the sequence, and should be set as the index or rank of the process. The second integer is the number of integration points, namely the end of the sequence. The third integer is the step between the adjacent elements in the sequence, and is set as the number of processes to avoid double counting. For example, the following calc_partial_pi function uses the range function for the sub-task on a worker process.

def calc_partial_pi(rank, nprocs, nsteps, dx):
    partial_pi = 0.0
    for i in range(rank, nsteps, nprocs):
        x = (i + 0.5) * dx
        partial_pi += 4.0 / (1.0 + x * x)
    partial_pi *= dx
    return partial_pi

With the calc_partial_pi function we can prepare the input arguments for the sub-tasks and compute the value of π using the starmap method of the Pool class, as shown below.

nprocs = mp.cpu_count()
inputs = [(rank, nprocs, nsteps, dx) for rank in range(nprocs)]

pool = mp.Pool(processes=nprocs)
result = pool.starmap(calc_partial_pi, inputs)
pi = sum(result)

Asynchronous parallel calculation can be carried out with the apply_async method of the Pool class. We can make use of the calc_partial_pi function and the inputs list since both starmap and apply_async support multiple arguments. The difference is that starmap returns the results from all processes, while apply_async returns the result from a single process. The code using the apply_async method is shown below.

multi_result = [pool.apply_async(calc_partial_pi, inp) for inp in inputs]
result = [p.get() for p in multi_result]
pi = sum(result)

In a previous post we have discussed the scaling of parallel programs. We can also run a scaling test for the parallel Python code based on the starmap and apply_async methods of the Pool class. From the figure below we can see that the two methods provide very similar scaling for computing the value of π.

Summary

We have briefly shown the basics of the map, starmap and apply_async methods from the Pool class.

map and starmap are synchronous methods.
map and starmap guarantee the correct order of output.
starmap and apply_async support multiple arguments.

You may read the Python documentation page for details about other methods in the Pool class.