As a developer who is not responsible for many infrastructure modules, I shouldn’t really waste so much time micro-benchmarking. Profiling my code after the fact should be sufficient, if even that is necessary.
But it seems I’ve got into a bad habit.
I’m not the only person who thinks that MooseX loading all those prerequisites is what takes so much time; Dave Rolsky says as much in the latest Moose blog post.
That gets me thinking about the speed of the perl compilation step. Although using eval doesn’t have the same disk IO as use, it still exercises the compiler.
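To make that distinction concrete, here is a minimal sketch (mine, not from the Moose post): a string eval never touches the disk, but perl still has to run the full parse-and-compile step on its argument every time the eval executes.

use strict;
use warnings;

# 'use Some::Module;' would be resolved at compile time and has to
# locate and read a file from disk before compiling it.

# A string eval pays no disk IO, but the string is compiled at
# runtime, every time this line runs:
my $sum = 0;
eval '$sum += $_ for 1 .. 10;';
die $@ if $@;
print "$sum\n";    # prints 55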
Loop Unrolling with Eval
Does anyone remember back in the day when we used to unroll loops? For example, if we had an instruction sequence do_something, add, jne, and do_something was relatively short, you could save some time by unrolling it to do_something, add, do_something, add, jne: you would only have to execute roughly half as many jne instructions over the loop’s lifetime.
That is something I never worry about in perl, unless I have to come up with contrived examples for my blog.
sub eval_loop {
    # build the unrolled additions as a string, then compile and
    # run it with a string eval on every call
    my $count = 0;
    my $s = '';
    for (my $i = 0; $i < 100_000; ++$i) {
        $s .= '$count += ' . $i . ";\n";
    }
    eval $s;
    $count;
}

sub vanilla_loop {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
    }
    $count;
}

# totally_unrolled is built from the same string, but compiled
# once, before the benchmark runs
my $code = '';
for (my $i = 0; $i < 100_000; ++$i) {
    $code .= '$count += ' . $i . ";\n";
}
$code = ' sub totally_unrolled { my $count = 0;' . "\n" . $code . ' $count; };';
eval $code;
eval_loop compiles the unrolled loop every time the subroutine is called, whereas totally_unrolled is compiled prior to benchmarking.
I also wanted some partially unrolled loops for comparison’s sake. Obviously these examples are not suitable for production use; consider what would happen if unrolled_loop_4 needed to execute a number of times that was not a multiple of 4. A remainder-safe variant is sketched after the code below.
sub unrolled_loop_2 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
    }
    $count;
}

sub unrolled_loop_4 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
        $count += ++$i;
        $count += ++$i;
    }
    $count;
}
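For what it’s worth, a production-safe unroll would need a cleanup loop for the leftover iterations. A minimal sketch of what that might look like (unrolled_loop_4_safe is my name for it; it isn’t part of the benchmark):

sub unrolled_loop_4_safe {
    my ($n) = @_;    # upper bound, not necessarily a multiple of 4
    my $count = 0;
    my $i = 0;

    # main unrolled body: four additions per pass
    for (; $i + 3 < $n; $i += 4) {
        $count += $i;
        $count += $i + 1;
        $count += $i + 2;
        $count += $i + 3;
    }

    # cleanup loop handles the remaining 0-3 iterations
    for (; $i < $n; ++$i) {
        $count += $i;
    }

    $count;
}

print unrolled_loop_4_safe(100_000), "\n";    # 4999950000, same as the others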
Benchmarking
The full benchmarking code is as follows. I test the result of each function to ensure I’m getting the same answer.
use Benchmark qw(:hireswallclock);
use strict;
use warnings;

sub vanilla_loop {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
    }
    $count;
}

sub unrolled_loop_2 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
    }
    $count;
}

sub unrolled_loop_4 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
        $count += ++$i;
        $count += ++$i;
    }
    $count;
}

sub eval_loop {
    my $count = 0;
    my $s = '';
    for (my $i = 0; $i < 100_000; ++$i) {
        $s .= '$count += ' . $i . ";\n";
    }
    eval $s;
    $count;
}

my $code = '';
for (my $i = 0; $i < 100_000; ++$i) {
    $code .= '$count += ' . $i . ";\n";
}
$code = ' sub totally_unrolled { my $count = 0;' . "\n" . $code . ' $count; };';
eval $code;

print vanilla_loop(), "\n";
print unrolled_loop_2(), "\n";
print unrolled_loop_4(), "\n";
print eval_loop(), "\n";
print totally_unrolled(), "\n";

Benchmark::cmpthese(-1, {
    'vanilla'      => \&vanilla_loop,
    'unrolled-2'   => \&unrolled_loop_2,
    'unrolled-4'   => \&unrolled_loop_4,
    'eval'         => \&eval_loop,
    'unrolled-max' => \&totally_unrolled,
});
And the results are:
$ perl unrolling.pl
4999950000
4999950000
4999950000
4999950000
4999950000
                Rate         eval  unrolled-2  unrolled-4     vanilla unrolled-max
eval          3.25/s           --        -93%        -95%        -95%         -96%
unrolled-2    48.1/s        1381%          --        -25%        -26%         -45%
unrolled-4    63.9/s        1865%         33%          --         -2%         -27%
vanilla       65.1/s        1902%         35%          2%          --         -26%
unrolled-max  87.7/s        2598%         82%         37%         35%          --
Conclusions
I am actually surprised that eval is as fast as it is. It runs only about 20 times slower than the vanilla loop, yet it has to construct a string by appending to it 100,000 times and then run the compiler on the result. To put it another way, it compiles and runs 100,000 lines of code in under a third of a second. Good job, perl compiler guys! Or did you put something in there for pathologically ridiculous examples like this?
And I was right never to worry about unrolling my loops. unrolled-2 is significantly slower than vanilla, and as I can’t think of any good reason for that, it makes me worry that I’ve done something wrong.
Even fully unrolling the loop, which is quite ridiculous, gives me a relatively tiny 35% speed increase.
Hi Jared,
I enjoyed reading your post. I wanted to point out that for dynamic languages the *combination* of unrolling and inlining creates opportunities for code optimizations. It would be interesting if you could measure function inlining.
In my research (I am a PhD candidate), I benchmarked automatic compiler optimizations (unrolling, inlining) in Python. It would be interesting to compare this to perl.
http://cs.haifa.ac.il/~rotemn/papers/systor09-python.html
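To clarify what I mean by inlining, here is a small Perl sketch (the sub names are just for illustration, and I have not benchmarked this): the second loop is the first with the body of square pasted in by hand, which removes the sub-dispatch overhead on every iteration.

# calling a sub inside the loop: every iteration pays call overhead
sub square { my ($x) = @_; $x * $x }

sub sum_with_call {
    my $total = 0;
    $total += square($_) for 1 .. 100_000;
    $total;
}

# manually inlined version: the body of square() pasted into the loop
sub sum_inlined {
    my $total = 0;
    $total += $_ * $_ for 1 .. 100_000;
    $total;
}

When the loop is also unrolled, the inlined bodies can then be optimized across iterations, which is where the interesting gains show up.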
In unrolled_loop_2/4 you’re doing the same number of iterations but more additions per iteration. That’s why it’s slower; you have to reduce the number of iterations as well.
I’d be interested to see how that affects the results.
Hi Nadav,
Thanks for the link. I’m sure I successfully viewed the PDF of your paper before, but now it takes me to a login screen.
Hi Matt,
I think the iterations are correctly reduced in line with the number of additions per iteration. If you look carefully, you can see that I increment the loop counter within the loop body as well.
Also, I was concerned that I might be comparing different functions, so I output the result of each function before starting the benchmark.
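To make that concrete, here is a quick check (not part of the benchmark above) that counts how many times the loop body of unrolled_loop_2 actually runs:

my $iterations = 0;
my $count = 0;
for (my $i = 0; $i < 100_000; ++$i) {
    ++$iterations;        # count loop body executions
    $count += $i;
    $count += ++$i;       # the counter advances here too
}
print "$iterations iterations, sum $count\n";    # 50000 iterations, sum 4999950000

Each pass covers two values of $i, so the number of passes is halved, as intended.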
Looks good to me.