
Posts Tagged ‘benchmarking’

Why Micro-Benchmark?

I have had some feedback asking why I would write a micro-benchmark such as my Tokyo Cabinet vs Berkeley DB comparison. Everyone knows that:

"premature optimization is the root of all evil"

and

“The First Rule of Program Optimization: Don’t do it. The Second Rule of Program Optimization (for experts only!): Don’t do it yet.”

That kinda misses the point. Micro-benchmarking has nothing to do with optimization. Its purpose is to avoid premature pessimization due to choosing the wrong technology.

I have a project to replace an existing system where the backing store is by far the biggest bottleneck. The new system needs to be significantly faster than the existing one, and I have a lot of flexibility in choosing the backing store.

The two alternatives that immediately come to mind are:

  1. create a nice abstraction layer that can easily switch between Perl DBI, Redis, Tokyo Cabinet and a bunch of other alternatives… or
  2. micro-benchmark them with something similar to my read/write data profile.

Time pressure unfortunately compels me to choose #2. Maybe it’s time to take a second look at KiokuDB, for inspiration at least, even if I can’t use it directly.
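For what it’s worth, the abstraction layer in #1 needn’t be elaborate. Here is a minimal sketch of the kind of interface I have in mind – the package name and callback scheme are hypothetical, not tied to any real backend:

package SimpleStore;
use strict;
use warnings;

# Each backend supplies put/get/sync callbacks, so swapping
# Berkeley DB for Tokyo Cabinet (or Redis) is one constructor call.
sub new {
    my ($class, %callbacks) = @_;
    return bless { %callbacks }, $class;
}

sub put  { my $self = shift; $self->{put}->(@_) }
sub get  { my $self = shift; $self->{get}->(@_) }
sub sync { my $self = shift; $self->{sync}->(@_) if $self->{sync} }

1;

A Berkeley DB backend would then be SimpleStore->new(put => sub { $db->db_put(@_) }, sync => sub { $db->db_sync() }), and the calling code wouldn’t change when I swapped in Tokyo Cabinet.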

Read Full Post »

In response to my Berkeley DB benchmarking post, Pedro Melo points out that Tokyo Cabinet is faster and that JSON::XS is faster than Storable.

I couldn’t find an up-to-date Ubuntu package that included the TC Perl libraries, so I had to build everything from source. It was pretty straightforward though.

First we need to get the database handle.

use TokyoCabinet;
use JSON::XS;

my $tc_file = "$ENV{HOME}/test.tc";
unlink $tc_file;
my $hdb = TokyoCabinet::HDB->new();

if (!$hdb->open($tc_file, $hdb->OWRITER | $hdb->OCREAT)) {
    my $ecode = $hdb->ecode();
    printf STDERR ("open error: %s\n", $hdb->errmsg($ecode));
}

Presumably putasync is the fastest database put method.

my $ORDER_ID = 0;

sub store_record_tc
{
    my ($db, $ref_record, $no_sync, $json) = @_;
    $json //= 0;
    $no_sync //= 0;    # accepted for interface parity; putasync never syncs
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->putasync($key, $json ? encode_json($ref_record)
                              : Storable::freeze($ref_record));
}

I needed to amend store_record to compare JSON and Storable too.

sub store_record
{
    my ($db, $ref_record, $no_sync, $json) = @_;
    $json //= 0;
    $no_sync //= 0;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->db_put($key, $json ? encode_json($ref_record)
                            : Storable::freeze($ref_record));
    $db->db_sync() unless $no_sync;
}

The benchmarking code looks like this.

Benchmark::cmpthese(-1, {
    'json-only-50/50' => sub { json_only($db, $rec_50_50) },
    'freeze-only-50/50' => sub { freeze_only($db, $rec_50_50) },

    'freeze-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1) },
    'freeze-no-sync-50/50-tc' => sub { store_record_tc($hdb, $rec_50_50, 1) },

    'json-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1, 1) },
    'json-no-sync-50/50-tc' => sub { store_record_tc($hdb, $rec_50_50, 1, 1) },
});
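One definition is missing above: json_only. Presumably it mirrors freeze_only from my earlier Berkeley DB post (below), swapping Storable::freeze for encode_json – a hypothetical reconstruction:

# json_only: serialise with JSON::XS but never touch the database,
# mirroring freeze_only (hypothetical reconstruction).
sub json_only
{
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    my $key = "/order/$ref_record->{'order_id'}";
    encode_json($ref_record);
}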

And the results are as follows:

        Rate freeze-no-sync-50/50 json-no-sync-50/50 freeze-no-sync-50/50-tc json-no-sync-50/50-tc freeze-only-50/50 json-only-50/50
freeze-no-sync-50/50     7791/s                   --                -9%                    -39%                  -47%              -59%            -81%
json-no-sync-50/50       8605/s                  10%                 --                    -33%                  -41%              -55%            -79%
freeze-no-sync-50/50-tc 12800/s                  64%                49%                      --                  -13%              -33%            -69%
json-no-sync-50/50-tc   14698/s                  89%                71%                     15%                    --              -23%            -64%
freeze-only-50/50       19166/s                 146%               123%                     50%                   30%                --            -54%
json-only-50/50         41353/s                 431%               381%                    223%                  181%              116%              --

Pedro was right. Tokyo Cabinet is significantly faster than Berkeley DB, at least in this simple benchmark.

Edit: the swapped json and no_sync parameters have been fixed.

Read Full Post »

I’m a big fan of Antirez. I’ve been subscribed to his blog for a while.

I’m also a big fan of shoddy benchmarks. Maybe that makes me biased.

So, the sys/toilet guy writes a post benchmarking Redis vs Memcached and concludes that Memcached is faster and Redis doesn’t do so well at setup and teardown.

Antirez responds: (emphasis mine)

Unfortunately I also think that our dude is part of the problem, since running crappy benchmarks is not going to help our pop culture.

Okay, fine. Sys/Toilet guy used some robust language in his post. Antirez has every right to respond in kind.

But the thing is…

Still this benchmark captured my interest. Even with all this problems, why on the earth Redis was showing so low numbers in multi-get (MGET) operations?

I checked the implementation better and found a problem in the query parsing code that resulted in a quadratic algorithm due to a wrong positioned strchr() call.
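To illustrate the class of bug – this is a hypothetical Perl sketch of the accidentally-quadratic pattern, not the actual Redis C code – if each parsing step rescans or shifts the whole remaining buffer, the total work grows quadratically with input size:

use strict;
use warnings;

# Quadratic: the 4-arg substr shifts the entire remaining buffer left
# on every line, so n lines cost O(n^2) bytes moved in total.
sub parse_quadratic {
    my ($buf) = @_;
    my @lines;
    while ((my $nl = index($buf, "\n")) >= 0) {
        push @lines, substr($buf, 0, $nl);
        substr($buf, 0, $nl + 1, '');    # expensive shift, every iteration
    }
    return \@lines;
}

# Linear: remember where the last match ended and resume from there.
sub parse_linear {
    my ($buf) = @_;
    my @lines;
    my $pos = 0;
    while ((my $nl = index($buf, "\n", $pos)) >= 0) {
        push @lines, substr($buf, $pos, $nl - $pos);
        $pos = $nl + 1;
    }
    return \@lines;
}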

Hmmm… sounds like the “crappy” benchmark helped. And what’s more, there is already a (presumably decent) Redis benchmark that didn’t pick up the issue.

In Redis land we already have redis-benchmark that assisted us in the course of the latest one and half years, and worked as a great advertising tool: we were able to show that our system was performing well.

So, sys/toilet guy, thanks for making Redis better. I appreciate it.

Read Full Post »

Recently I’ve been thinking about storing records on disk quickly. For my general use case, an RDBMS isn’t quite fast enough. My first thought, probably like many a NoSQL person before me, is how fast can I go if I give up ACID?

Even if I’m intending to do the final implementation in C++, I’ll often experiment with Perl first. It’s often possible to use the C libraries anyway.

First up, how about serialising records using Storable to an on-disk hash table like Berkeley DB.

(Aside: I’m probably going to appear obsessed with benchmarking now, but really I’m just sticking a finger in the air to get an idea about how various approaches perform. I can estimate 90cm given a metre stick. I don’t need a more precise way to do a rough estimate.)

use Storable;
use Benchmark qw(:hireswallclock);

use BerkeleyDB;

I need to make a random record to store in the DB.

sub make_record
{
    my ($order_id, $fields, $key_len, $val_len) = @_;

    my %record;
    $record{'order_id'} = $order_id;

    # order_id plus $fields-1 random fields gives $fields in total
    for my $field_no (1..$fields-1) {
        my $key = 'key';
        my $val = 'val';
        $key .= chr(65 + rand(26)) for (1..$key_len - length($key));
        $val .= chr(65 + rand(26)) for (1..$val_len - length($val));
        $record{$key} = $val;
        # print "$key -> $val\n";    # debug output, disabled
    }
    return \%record;
}

And a wrapper handles the general case I’ll be testing – key and value equal length and order_id starting at 1.

sub rec
{
    my ($fields, $len) = @_;
    return make_record(1, $fields, $len, $len);
}

I’ll compare serialisation alone against actually storing the data to disk, to get an idea of the upper limit I could achieve if, for example, I were using an SSD.

sub freeze_only
{
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    # Build the key too, so the only work missing versus store_record
    # is the db_put and db_sync.
    my $key = "/order/$ref_record->{'order_id'}";
    Storable::freeze($ref_record);
}

And I’m curious to know how much overhead syncing to disk adds.

my $ORDER_ID = 0;

sub store_record
{
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->db_put($key, Storable::freeze($ref_record));
    $db->db_sync() unless $no_sync;
}

The Test Program

A record with 50 fields, each of size 50, seems reasonable.

my $filename = "$ENV{HOME}/test.db";
unlink $filename;

my $db = BerkeleyDB::Hash->new(
    -Filename => $filename,
    -Flags    => DB_CREATE,
) or die "Cannot open file $filename: $! $BerkeleyDB::Error\n";

my $rec_50_50 = rec(50, 50);

Benchmark::cmpthese(-1, {
    'freeze-only-50/50' => sub { freeze_only($db, $rec_50_50) },
    'freeze-sync-50/50' => sub { store_record($db, $rec_50_50) },
    'freeze-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1) },
});

The Results

                        Rate freeze-sync-50/50 freeze-no-sync-50/50 freeze-only-50/50
freeze-sync-50/50     1543/s                --                 -80%              -93%
freeze-no-sync-50/50  7696/s              399%                   --              -63%
freeze-only-50/50    21081/s             1267%                 174%                --

Conclusion

Unsurprisingly, syncing is expensive – it adds 400% overhead. However, even with the sync, we’re still able to store around 5.5 million records an hour (1,543/s × 3,600 seconds). Is that fast enough for me (I need some level of reliability)? It might well be.

Berkeley DB is fast. It adds only around 170% overhead to the serialisation itself (21,081/s down to 7,696/s). I’m impressed.

In case anyone is interested, I ran a more comprehensive set of benchmarks:

freeze-sync-50/50       1846/s
freeze-sync-50/05       2262/s
freeze-sync-05/50       2546/s
freeze-sync-05/05       2799/s
freeze-no-sync-50/50    7313/s
freeze-no-sync-50/05    9514/s
freeze-no-sync-05/50   11395/s
freeze-no-sync-05/05   12589/s
freeze-only-50/50      20031/s
freeze-only-50/05      21920/s
freeze-only-05/05      26547/s
freeze-only-05/50      26547/s
fcall-only           2975364/s
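fcall-only isn’t defined in the post; presumably it is a no-op baseline that measures raw subroutine call overhead, along these (hypothetical) lines:

# No-op baseline: measures the cost of the subroutine call itself.
sub fcall_only
{
    my ($db, $ref_record) = @_;
    return;
}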

Read Full Post »

Perl Compiler Speed – It’s Fast!

As a developer who is not responsible for many infrastructure modules, I shouldn’t really waste so much of my time micro-benchmarking. Profiling my code after the fact should be sufficient, if even that is necessary.

But it seems I’ve got into a bad habit.


I’m not the only person who thinks that MooseX loading all those prerequisites is what is taking a lot of time. Dave Rolsky says in the latest Moose blog post:

What’s a bit sad, however, is that this only appears to save 6-10% of the compilation time in real usage. I’m not sure why that is, but I think the issue is that the perl core’s compilation time may be a big factor. As I said, loading that one Markdent modules loads a lot of modules (203 individual .pm files), and the raw overhead of simply reading all those files and compiling them may be a significant chunk of the compilation time.

That gets me thinking about the speed of the perl compilation step. Although using eval doesn’t have the same disk IO as use, it still exercises the compiler.

Loop Unrolling with Eval

Does anyone remember back in the day when we used to unroll loops? For example, if we had an instruction sequence that was do_something, add, jne and do_something was relatively short, you could save some time unrolling that to do_something, add, do_something, add, jne – you would only have to execute approximately half as many jne instructions across a loop lifetime.

That is something I never worry about in perl. Unless I have to come up with contrived examples for my blog.

sub eval_loop
{
    my $count = 0;
    my $s = '';
    for (my $i = 0; $i < 100_000; ++$i) {
        $s .= '$count += ' . $i . ";\n";
    }
    eval $s;
    $count;
}

sub vanilla_loop
{
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
    }
    $count;
}

my $code = '';
for (my $i = 0; $i < 100_000; ++$i) {
    $code .= '$count += ' . $i . ";\n";
}

$code = '
sub totally_unrolled {
    my $count = 0;' . "\n" . $code . '
    $count;
};';

eval $code;

eval_loop compiles the unrolled loop every time the subroutine is called, whereas totally_unrolled is compiled once, before the benchmark runs.

I also wanted some partially unrolled loops for comparison’s sake. Obviously these examples are not suitable for production use. For example, consider what would happen if unrolled_loop_4 needed to execute a number of iterations that was not a multiple of 4.

sub unrolled_loop_2
{
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
    }
    $count;
}

sub unrolled_loop_4
{
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
        $count += ++$i;
        $count += ++$i;
    }
    $count;
}

Benchmarking

The full benchmarking code is as follows. I test the result of each function to ensure I’m getting the same answer.

use Benchmark qw(:hireswallclock);

use strict;
use warnings;

sub vanilla_loop
{
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
    }
    $count;
}

sub unrolled_loop_2
{
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
    }
    $count;
}

sub unrolled_loop_4
{
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
        $count += ++$i;
        $count += ++$i;
    }
    $count;
}


sub eval_loop
{
    my $count = 0;
    my $s = '';
    for (my $i = 0; $i < 100_000; ++$i) {
        $s .= '$count += ' . $i . ";\n";
    }
    eval $s;
    $count;
}

my $code = '';
for (my $i = 0; $i < 100_000; ++$i) {
    $code .= '$count += ' . $i . ";\n";
}

$code = '
sub totally_unrolled {
    my $count = 0;' . "\n" . $code . '
    $count;
};';

eval $code;

print vanilla_loop(), "\n";
print unrolled_loop_2(), "\n";
print unrolled_loop_4(), "\n";
print eval_loop(), "\n";
print totally_unrolled(), "\n";

Benchmark::cmpthese(-1, {
    'vanilla' => \&vanilla_loop,
    'unrolled-2' => \&unrolled_loop_2,
    'unrolled-4' => \&unrolled_loop_4,
    'eval' => \&eval_loop,
    'unrolled-max' => \&totally_unrolled,
});

And the results are:

$ perl unrolling.pl
4999950000
4999950000
4999950000
4999950000
4999950000
               Rate        eval  unrolled-2  unrolled-4     vanilla unrolled-max
eval         3.25/s          --        -93%        -95%        -95%         -96%
unrolled-2   48.1/s       1381%          --        -25%        -26%         -45%
unrolled-4   63.9/s       1865%         33%          --         -2%         -27%
vanilla      65.1/s       1902%         35%          2%          --         -26%
unrolled-max 87.7/s       2598%         82%         37%         35%           --

Conclusions

I am actually surprised that eval is as fast as it is. It runs only about 20 times slower than the vanilla loop, yet it has to construct a string by appending to it 100,000 times and then run the compiler on the result. Or to put it another way, it compiles and runs 100,000 lines of code in under a third of a second. Good job perl compiler guys! Or did you put something in there for pathologically ridiculous examples like this?

And I was right never to worry about unrolling my loops. unrolled-2 is quite significantly slower than vanilla. As I can’t think of any good reason for that, it makes me worry that I’ve done something wrong.

Even fully unrolling the loop, which is quite ridiculous, gives a relatively modest 35% speed increase.


Read Full Post »

Spurious Benchmark.pm Warnings

My most recent post has a few comments, but because I’ve been on holiday I didn’t reply to them in a timely manner. Sorry about that folks, I did read all of them.

And time has moved on so if I reply to those comments in the comments section, I’m guessing the original commenters won’t read the response. Having said that, they might not read this post either, but I think that is less likely.

Therefore this is a post to respond to the commenters, and to keep up with the IronMan schedule even though I’m not sure I’m being measured.

First of all, Roman, thanks for the comment. The figures you provide are interesting and yours is a second complaint about opaque error messages. That would bother me as well – I get enough of them from emacs macros and I have at least got tools (well, a tool – macroexpand) to debug those.

@Max – Hi. I’m not sure how you got the impression that I was interested in object creation time. We have primarily been talking about Moose start-up overhead here. I think I’m measuring the right thing.

@Anonymous – when you say “Run each benchmark item for 5 seconds or more”, I think you missed this bit of the post:

Benchmark: timing 200 iterations of 1just-moose, 2all-includes, 3nocreate, 4create...
  1just-moose: 100.196 wallclock secs ( 0.05 usr +  0.25 sys =  0.30 CPU) @ 673.40/s (n=200)
2all-includes: 279.252 wallclock secs ( 0.08 usr +  0.34 sys =  0.42 CPU) @ 475.06/s (n=200)
    3nocreate: 318.256 wallclock secs ( 0.06 usr +  0.28 sys =  0.34 CPU) @ 581.40/s (n=200)
      4create: 320.979 wallclock secs ( 0.06 usr +  0.27 sys =  0.33 CPU) @ 611.62/s (n=200)

Not one of my benchmarks runs for less than a minute and a half. Most of them are around the five minute mark. And I have deadlines!

So where is this message coming from? I had a look in Benchmark.pm.

    print "            (warning: too few iterations for a reliable count)\n"
        if     $n < $Min_Count # 4
            || ($t->real < 1 && $n < 1000)
            || $t->cpu_a < $Min_CPU; # 0.4

My iteration count is 200, so $Min_Count is not the problem. $t->real is at least 100, so that leaves $Min_CPU. Hmmm… yeah, I would have had to double the number of iterations to push the CPU time over the 0.4s threshold. Look at 4create – 321 wallclock seconds, of which only 0.33 is CPU. No worries, I’m good – the warnings were spurious, as I originally thought.

Read Full Post »

The debate over whether or not Moose is too slow rumbles on. On one side are the folks using it in production (and presumably happy with it). On the other are folks saying it adds multiple seconds to start-up time.

Following on from my benchmarking Moose startup overhead post, Sue D. Nymme provided code that actually created an object. Thanks Sue!

I took a look and Sue’s code uses MooseX::Declare, which I hadn’t installed. One magic CPAN invocation and, oh my gosh, more than 20 minutes later, MooseX::Declare is finally installed. Now I’m thinking that the benchmark is actually measuring the time for perl to load and parse MooseX::Declare and tens of thousands of lines of prerequisites, rather than how slow Moose itself is.

I’ll compare just loading the modules Moose vs MooseX::Declare.

Then I’ll add the rest of Sue’s code.

And finally I’ll actually create a Moose object.

just-moose.pl

use Modern::Perl;

use Moose;
use Moose::Util::TypeConstraints;
use DateTime;

1;

all-includes.pl

use Modern::Perl;

use MooseX::Declare;
use Moose::Util::TypeConstraints;
use DateTime;

1;

And moose.pl (taken from Sue D. Nymme’s comment here).
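I won’t reproduce moose.pl, but judging from bench.pl below it declares a MooseObject class with at least a name attribute – a hypothetical stand-in:

use MooseX::Declare;

# Hypothetical stand-in for Sue's moose.pl: a single MooseX::Declare
# class with the 'name' attribute that the 4create test passes in.
class MooseObject {
    has 'name' => (is => 'ro', isa => 'Str', required => 1);
}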

bench.pl

use Benchmark qw(:hireswallclock);

use strict;
use warnings;

my $perl = 'c:/strawberry/perl/bin/perl';

Benchmark::timethese(200, {
    '1just-moose' =>
        sub { system(qq{$perl -e "require 'just-moose.pl';"}) },
    '2all-includes' =>
        sub { system(qq{$perl -e "require 'all-includes.pl';"}) },
    '3nocreate' => sub { system(qq{$perl -e "require 'moose.pl';"}) },
    '4create' =>
        sub { system(qq/$perl -e "require 'moose.pl'; MooseObject->new({ name => 'Bob' })"/) },
});

The Results

I’ve elided the “too few iterations” warnings.

Benchmark: timing 200 iterations of 1just-moose, 2all-includes, 3nocreate, 4create...
  1just-moose: 100.196 wallclock secs ( 0.05 usr +  0.25 sys =  0.30 CPU) @ 673.40/s (n=200)
2all-includes: 279.252 wallclock secs ( 0.08 usr +  0.34 sys =  0.42 CPU) @ 475.06/s (n=200)
    3nocreate: 318.256 wallclock secs ( 0.06 usr +  0.28 sys =  0.34 CPU) @ 581.40/s (n=200)
      4create: 320.979 wallclock secs ( 0.06 usr +  0.27 sys =  0.33 CPU) @ 611.62/s (n=200)

Conclusions

The bulk of the time is in loading the MooseX::Declare library. But on the other hand, declaring a single class did take a significant amount of time – approximately 40 seconds over 200 runs, the difference between 2all-includes and 3nocreate. I can now believe that declaring a lot of classes would be prohibitively expensive.

But that is using MooseX::Declare. I’m sure it would be nowhere near as bad if we used Plain Ole Moose.

Read Full Post »
