As you may or may not know, I am a huge fan of Python and love hacking with it. I was recently at PyconFr, where I attended several talks on Python stuff:
The conference was eventified here if you feel like checking it out. Careful though: the talks are in French!
Anyway, one of the lightning talks on optimization got me thinking: “I should try to solve some optimization problems with Python!”.
Although the Wikipedia article goes into detail on the subject, I thought it was a good idea to try applying these optimizations to this very inefficient method of computation.
Note that I used Python 3 in each of the examples, and ran them on my machine (dual-core i3 @ 2.4 GHz / 4 GB RAM).
No need for another explanation, the concept is rather simple:
For example, consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:

1. Draw a square on the ground, then inscribe a circle within it.
2. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
3. Count the number of objects inside the circle and the total number of objects.
4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.
Pulled right off Wikipedia. I read the great book “High Performance Python: Practical Performant Programming for Humans” by Micha Gorelick and Ian Ozsvald (link), in which they also use this problem to illustrate the benefits of parallelism.
Keeping it simple as always, here’s what we’ll do:
We’ll be focusing on optimizing throw_darts_in_unit_square, which holds most of the problem’s complexity: throwing a large number of darts.
A pure python version of the dart throw would look like this:
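Something along these lines, matching the function profiled further down (the final estimate line is just a usage example):

```python
import random

def throw_darts_pure_python(amount):
    # Count darts that land inside the quarter of the unit circle.
    hits = 0
    for _ in range(int(amount)):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        # Booleans are ints in Python, so this adds 0 or 1.
        hits += x**2 + y**2 <= 1
    return hits

# The hit ratio approximates pi/4, so multiply by 4 to estimate pi.
n = 1_000_000
print(4 * throw_darts_pure_python(n) / n)
```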
As usual in Python, the code is pretty self-explanatory: pick a random point in the unit square, then test whether it falls inside the unit circle (i.e. x² + y² <= 1).
Again, I’m using Python 3, so the use of range() isn’t a problem. In previous versions you’d probably use xrange() to make sure you don’t build a humongous list in memory.
As you can also imagine, this is a naive version without any optimization whatsoever: every variable is a PyObject, and there are no vectorized computations (no need to remind you why that makes Python slow).
Below are the benchmark results for this version. Note the log scales for both the number of iterations and the running time. You’ll also be glad to notice (well, at least I was) that we get a nice linear complexity, which corresponds to the nature of the problem.
To better understand what’s happening under the hood, I profiled the code with Robert Kern’s line_profiler. The following results are for 1,000,000 iterations.
```
   Hits        Time  Per Hit  % Time  Line Contents
=====================================================
                                      def throw_darts_pure_python(amount):
      1           2      2.0     0.0      hits = 0
1000001      693509      0.7    11.7      for _ in range(int(amount)):
1000000     1905288      1.9    32.1          x = random.uniform(0, 1)
1000000     1699462      1.7    28.6          y = random.uniform(0, 1)
1000000     1638124      1.6    27.6          hits += x**2 + y**2 <= 1
      1           1      1.0     0.0      return hits
```
You can see that, obviously, most of the time is spent inside that million-iteration loop (no kidding, huh?). I’m a bit surprised by the results, mainly that random.uniform() isn’t as slow as I expected it to be. It’s actually really fast, because the implementation uses a low-level C module. The difference between the two calls puzzles me, though… The first person who figures out why the first call to random.uniform() is faster than the second gets a pint.
Now, you’ve probably already heard of Numpy, famous for n-dimensional data manipulation (e.g. arrays and matrices) in Python. There are a lot of things Numpy can do, but in my case I’m just going to use it to generate uniform random arrays. Since Numpy is written in C, most of its calls are indeed faster than Python code, though we shouldn’t forget that there is an overhead to calling C code from Python.
Below is a new version, making use of Numpy’s own implementation of random module:
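Again matching the function profiled just below: the random numbers now come from Numpy in one batch, but the loop stays in pure Python.

```python
import numpy

def throw_darts_numpy_random_sample(amount):
    hits = 0
    # Batch-generate all coordinates at the C level...
    xs = numpy.random.random(amount)
    ys = numpy.random.random(amount)
    # ...but keep looping in Python, which will dominate the runtime.
    for i in range(int(amount)):
        hits += xs[i]**2 + ys[i]**2 <= 1
    return hits
```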
And now for the benchmark results:
Yes, this version is in fact slower than the pure Python version, even though we’re generating the 2 million random numbers as a batch. A line profile is needed to explain why this happened:
```
   Hits        Time  Per Hit  % Time  Line Contents
=====================================================
                                      def throw_darts_numpy_random_sample(amount):
      1           2      2.0     0.0      hits = 0
      1       19874  19874.0     0.3      xs = numpy.random.random(amount)
      1       19697  19697.0     0.2      ys = numpy.random.random(amount)
1000001      914805      0.9    11.6      for i in range(int(amount)):
1000000     6956696      7.0    87.9          hits += xs[i]**2 + ys[i]**2 <= 1
      1           1      1.0     0.0      return hits
```
OK, you should quickly notice that most of the time is spent in that huge loop (99.5% actually). That’s because of all that simple yet sub-optimal Python looping code. That simple for i in range(int(amount)) has to increment a PyObject a million times, and that adds up to a lot. Oh, and don’t even get me started on that single line inside the loop: two Numpy array index operations, each converting back to a PyObject in order to apply the exponent operator; that’s far from efficient.
Besides the sluggishness of the loop, things have sped up. Numpy allocates two million random elements a lot faster than the cumulated time of two million random.uniform() calls. About a hundred times faster, actually (1905288 + 1699462 = 3604750 vs 19874 + 19697 = 39571).
In order to make this piece of code faster than the pure Python version, and as you may have guessed: we’re going to have to push that final line of the loop down to Numpy via vectorization.
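A vectorized version, consistent with the profile below, looks like this:

```python
import numpy

def throw_darts_numpy_random_sample_vectorized(amount):
    xs = numpy.random.random(amount)
    ys = numpy.random.random(amount)
    # Element-wise test over the whole array: no explicit Python loop.
    throw = numpy.vectorize(lambda s: s <= 1, otypes=[numpy.uint8])
    hits = throw(xs * xs + ys * ys)
    return numpy.sum(hits)
```

Worth noting: numpy.vectorize is essentially a convenience wrapper that still loops at the Python level; a plain boolean expression like `numpy.count_nonzero(xs * xs + ys * ys <= 1)` would stay entirely in C and be faster still.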
I think the results speak for themselves from now on, so I’ll let you interpret them yourself (hint: they’re good).
Wait, what’s that non-linear curve at the beginning? Oh sorry, no comment.
```
   Hits        Time  Per Hit  % Time  Line Contents
==================================================
                                      def throw_darts_numpy_random_sample_vectorized(amount):
      1       19780  19780.0     3.0      xs = numpy.random.random(amount)
      1       19662  19662.0     2.9      ys = numpy.random.random(amount)
      1          64     64.0     0.0      throw = numpy.vectorize(lambda s: s <= 1, otypes=[numpy.uint8])
      1      628962 628962.0    93.9      hits = throw(xs * xs + ys * ys)
      1        1172   1172.0     0.2      return numpy.sum(hits)
```
At this point I believe we’re near the best trade-off between optimization gained and time spent optimizing. The next step for this kind of problem is parallel computing, which is fairly easy in this case.
AKA how to make old, single-core code multi-core in less than 10 minutes.
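A minimal sketch using the standard multiprocessing module (the throw_darts_parallel name and the even split across workers are illustrative choices, not the original code):

```python
import multiprocessing
import random

def throw_darts(amount):
    # Same pure-Python worker as before: count darts landing in the circle.
    hits = 0
    for _ in range(int(amount)):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        hits += x**2 + y**2 <= 1
    return hits

def throw_darts_parallel(amount, workers=2):
    # Split the throws evenly across processes and sum the partial hit counts.
    chunk = int(amount) // workers
    with multiprocessing.Pool(workers) as pool:
        return sum(pool.map(throw_darts, [chunk] * workers))
```

Since each dart throw is independent, the problem splits trivially: each worker process gets its own chunk of throws and its own random state, and the partial counts just add up.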
That’s it. I’m not the fastest typist, but the coding took me about 5 minutes.
I’m not going to repeat for the n-th time that Python is awesome, because that’s what import antigravity is for. What I will say is that you can achieve a lot with it, and if speed is a major factor, you can easily move the allocation/vectorization problems down to the C level with a little more work.
Moreover, if your problem is easily splittable, you can parallelize the
computation in a very short amount of time, with little downsides.
All the code used for this article is open-source and available on GitHub, should you want to re-use it.