38 Lines of Parallel Load Test

We live in an interesting time for coding: there's a project on GitHub for nearly everything you want to do. What's really interesting, though, is how quickly you can knock something reasonable out in Python and skip adding yet another project, dependency, and learning curve to your daily coding.

I recently wanted to sanity-check load on an Elasticsearch cluster. I didn't care about much except confirming that load balancing was working and getting an upper bound for estimated throughput.

This is what I came up with in a few minutes instead of attempting to install some existing load testing solution, configure it, and run it. I've adapted my code to run Google search queries, but the idea is the same: start several processes that all make requests at the same time, and track the processing rate.

import multiprocessing
import random
from functools import reduce
from time import time

import requests

queries = ['some', 'sample', 'data']
url = 'http://www.google.com/search'


def apply_load(iterations):
    # Fire `iterations` requests and return the achieved requests-per-second rate.
    s = time()
    success = 0
    error = 0
    for i in range(iterations):
        try:
            r = requests.get(url, params={'as_q': random.choice(queries)})
            r.raise_for_status()
            success += 1
        except requests.RequestException:
            error += 1

    rate = iterations / (time() - s)

    print('Iterations %s/s' % rate)
    return rate


if __name__ == '__main__':
    jobs = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(jobs)

    # Split the total number of requests evenly across the worker processes.
    size = 50 // jobs
    inputs = [size for i in range(jobs)]

    rates = pool.map(apply_load, inputs)
    total = reduce(lambda a, b: a + b, rates)

    print('Total: %s/s' % total)

The only thing that might seem strange to someone new to Python is pool.map. Ignoring pool for a second, map is a concept most widely known for its part in map/reduce, and it's built into Python.

The idea behind map is simple: you apply a function to every item in a list. map is exactly this code, condensed:

def times_two(number):
    return number * 2

items = [1,2,3,4]
results = []

for i in items:
    results.append(times_two(i))

That can be simplified with map to be:

results = list(map(times_two, items))  # list() because map returns a lazy iterator in Python 3

Now times_two has to be a function, but items doesn't have to be a list. It just needs to be an iterable. It could be a queue from another process constantly adding more work, a generator reading a file, or a cursor returning results from a database connection.
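
For example, here's a minimal sketch of mapping over a generator instead of a list (the file name is just a placeholder):

def read_lines(path):
    # A generator that yields one line at a time instead of loading the whole file.
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

# map consumes the generator lazily, applying len to each line as it's read.
line_lengths = map(len, read_lines('some_file.txt'))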

The idea of map is powerful, which is how it wound up in the map/reduce data processing paradigm, and it's why you find map implementations all over the place.

Multiprocessing is just one of the many places you can find a map implementation. pool.map works exactly like map, except each call of the function on an item from the iterable runs in a separate process in parallel instead of serially in your current process. Get comfortable with a small concept now and you reap big parallel computing wins later.
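
To make that concrete, here's the earlier times_two example pushed through a pool instead of the built-in map (a minimal sketch; the pool size of 2 is arbitrary):

import multiprocessing

def times_two(number):
    return number * 2

if __name__ == '__main__':
    items = [1, 2, 3, 4]
    with multiprocessing.Pool(2) as pool:
        # Same call shape as map(times_two, items), but each item is handed
        # to a worker process and computed in parallel.
        results = pool.map(times_two, items)
    print(results)  # [2, 4, 6, 8]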

So for the load testing, parallelism is handled with the included multiprocessing package. We make a pool of processes sized to the number of CPUs on the machine. Then we take the number of requests we want to make, divide it by the pool size, and leverage the map implementation on the pool to let the subprocesses run the function and do all the work in parallel.
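
For example, on a hypothetical 4-core machine the 50 requests split up like this:

jobs = 4                              # what multiprocessing.cpu_count() would return on a 4-core box
size = 50 // jobs                     # 12 requests per worker (integer division drops the remainder)
inputs = [size for i in range(jobs)]
print(inputs)                         # [12, 12, 12, 12]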

Since the jobs return their individual rates we can combine them with reduce into a single total.
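
reduce just folds the list down pairwise with the given function, so for adding up rates it's equivalent to sum (a minimal sketch with made-up rates):

from functools import reduce

rates = [10.2, 9.8, 11.1, 10.4]            # hypothetical per-process rates
total = reduce(lambda a, b: a + b, rates)  # ((10.2 + 9.8) + 11.1) + 10.4
assert total == sum(rates)                 # for addition, reduce and sum agree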

I was shocked that this came out to 38 lines of Python and took only minutes to write. It's by no means comprehensive, but it stands out as a quick example of how powerful Python can be.

Brad Willard