justin appears online

"act like a man of thought; think like a man of action." -sallust 

why is it so hard to avoid premature optimizations?

I am amused each time I learn about a completely unnecessary "optimization" that has quietly snuck into one of my programs. These changes happen automatically while programs are being composed, often before they have been run, and almost always independent of profiling. They make programs longer, more difficult to read and sometimes even slower. So why do I write such programs? It is because I, like most programmers, am shockingly bad at intuiting the location of real bottlenecks.

To illustrate this I am going to use as an example a program written by someone else. I am using this program not because I feel it contains particularly egregious unnecessary optimizations 1. Rather, I think it is an interesting example because of just how far off I was in understanding what makes it fast.

I have been spending a fair amount of time with Clojure lately. When approaching a new language, I will often start by solving some problems I am already familiar with so that I can focus on learning the language's idioms 2. I had solved some Project Euler problems and also wrote a solution to the number combo problem that does not make use of local mutable state. Then I remembered the neat solution jcl had posted in a Hacker News comment thread and started thinking about how I could implement this algorithm 3 in Clojure.

I had initially assumed the memoization and float math were important so I started out trying to decide a) which of Clojure's reference types I was going to use for the tuple cache and b) how I was going to validate that the floating point arithmetic was correct 4. As it turned out, neither are actually necessary for a fast solution 5.

First off, if one eliminates memoization and just repeats the recursive calls as needed, the solution is just as fast 6! While the main function is still called considerably more times, most of these hit the base case (i.e. 1-tuples containing single values) and return quickly. Not memoizing also eliminates a few lines for the cache check and lets you return the calculated solution set directly instead of having to first write it to the cache.

Switching to exact arithmetic from float math was even more illuminating. Not surprisingly, simply using bignums and arbitrary-precision fractions for the division steps works out to be around 25x slower than the original. However, the vast majority of the division operations do not produce intermediate results that eventually yield integer solutions 7. In fact, we can actually ignore every division operation that does not have an integer result and still generate all of the answers! This eliminates the need for the (rather expensive) arbitrary-precision fractions and makes the solution 28x *faster* than the original that used floats and memoization 8.

The real reason the float version is so much faster is not that using exact arithmetic is too slow for calculating the intermediate values we actually need. Rather, it is because the penalty for calculating all of the dead-end intermediate values is so much smaller. If one can avoid that penalty some other way the answer will be just as fast.

The point of this discussion 9 is that I often give up on trying to write clear code too quickly. The final implementation, aside from being the fastest, is also the shortest and the easiest to read. While those three do not always coincide, I seldom find myself in need of profiling a program because it runs unacceptably slow, but often experience difficulty reading code that does not closely enough reflect the problem it is solving. Rather than automatically asking myself "is this fast enough?", I need to cultivate the instinct to ask "is this clear enough?".


footnotes:

  1. It was certainly simple enough, fast enough, and far easier to read than my solution to the same problem.
  2. If you're not going to learn to write idiomatic code, why bother learning a new language?
  3. Of course, the primary reason jcl's program is fast is because it uses a good algorithm. By orienting around the set of values that can be computed rather than the set of equations that can be evaluated it performs orders of magnitude fewer calculations.
  4. That is, it's possible--though probably unlikely--that one of the solutions generated only looks correct because of loss of information in the floating point operations. One could, for instance, go through the generated solutions and evaluate equivalent integer expressions with exact arithmetic to verify that they give the desired results.
  5. Although I had initially worked through this series of solutions in Clojure, I am going to show them in Python because jcl's original solution is Python and using a different language does not add anything to the discussion that I would like to focus on.
  6. The average over 5 runs of original.py and no-memo.py were within 1% of each other on my machine.
  7. This is of course also true for intermediate values resulting from float divisions in the original.
  8. You might think that I am no longer comparing apples to apples, but bear with me a moment.
  9. Not, as I may have lead you to believe, that one ought to spend obscene amounts of time polishing simple programs.

generating all numbers from 1900 through 2100

Some friends at work were discussing the problem of generating all of the numbers from 1900 through 2100 with the following constraints: each calculation must consist of the numbers 1 through 9 in order, combined using the binary arithmetic operators +, -, x, ÷, and arbitrary bracketing. for example,

((1 + ((2 + 3) + (((4 + (5 * 6)) * 7) * 8))) + 9) = 1919

Zach approached this by generating random expressions and evaluating them until he'd obtained the entire set. Ben made the observation that if obvious duplicates (i.e. expressions that are identical modulo bracketing) are removed there are not all that many solutions to check. In particular, the solution space is the set of all binary trees with all possible combinations of the aforementioned operators as nodes and the numbers 1 to 9 in order (depth first) as leaves. However, he wasn't generating all of these trees, only a random subset.

While this strategy often obtains the solution set, I thought a nicer approach would be to just build all of the possible tree shapes and then iterate through each using all of the possible operator combinations.

I think the solution is pleasantly simple. The code is about 30 lines (provided you aren't using gratuitous line breaks in an attempt to avoid the horizontal scrollbar on the embedded Gist) and reads rather like the description of my approach.

I don't (usually) enjoy optimizing code for speed and in this case I've deliberately avoided it. I'm sure there are a number of minor changes that could make this program much, much faster. However, I've instead tried to optimize for readability.

Having said that I'm quite pleased with how fast it *does* run. With CCL 1.4 on a 2.5GHz Core2Duo laptop in energy-saving mode this simple ~30 line program produces a correct answer in sixty seconds or so. Think about that for a moment--this performance is being acheived using generic dispatch and the full numeric tower everywhere.

notes:

  • the results of pretty-print-tree are actually valid s-expressions that can be evaluated at a lisp prompt or via eval. out of curiosity I tried replacing evaluate with (eval (pretty-print-tree ...)) in the final block. eval is actually more than twice as slow as my unoptimized evaluate that's built on tree-reduce. keep this in mind the next time someone is whining about how one should never use eval
  • running this program will produce four compiler warnings for unused lexical variables. I could have added a (declare (ignore variable)) to silence each of these but wanted to keep the code free of annotations
  • there are many ways to generate all of the target numbers (317.29 solutions for each on average, min 4, max 6406). the ordering of the trees in the main loop isn't optimal, but I haven't thought of a simple way to reorder them that would work better. in the ordering produced by btrees-of-size, 340 out of 1430 possible trees need to be examined to generate all of the target numbers