
In response to my Berkeley DB benchmarking post, Pedro Melo points out that Tokyo Cabinet is faster and that JSON::XS is faster than Storable.

I couldn’t find an up-to-date Ubuntu package that included the Tokyo Cabinet Perl libraries, so I had to build everything from source. It was straightforward, though.

First, we need to open a database handle.

use Storable;
use JSON::XS;
use TokyoCabinet;

my $tc_file = "$ENV{HOME}/test.tc";
unlink $tc_file;

# Open the hash database, creating it if necessary
my $hdb = TokyoCabinet::HDB->new();
if (!$hdb->open($tc_file, $hdb->OWRITER | $hdb->OCREAT)) {
    my $ecode = $hdb->ecode();
    printf STDERR ("open error: %s\n", $hdb->errmsg($ecode));
}

Presumably putasync, which stores records asynchronously rather than writing them out immediately, is the fastest of the database put methods.

my $ORDER_ID = 0;

sub store_record_tc
{
    # $no_sync is accepted for symmetry with store_record but is
    # effectively always on: putasync never syncs to disk.
    my ($db, $ref_record, $no_sync, $json) = @_;
    $json //= 0;
    $no_sync //= 0;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->putasync($key, $json ? encode_json($ref_record)
                              : Storable::freeze($ref_record));
}

I also needed to amend store_record to compare JSON and Storable.

sub store_record
{
    my ($db, $ref_record, $no_sync, $json) = @_;
    $json //= 0;
    $no_sync //= 0;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->db_put($key, $json ? encode_json($ref_record)
                            : Storable::freeze($ref_record));
    $db->db_sync() unless $no_sync;
}
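
The benchmark below also calls a json_only helper that wasn’t shown in the original post. Presumably it mirrors freeze_only from the earlier post, serialising with JSON::XS but skipping the database write; a minimal sketch under that assumption:

sub json_only
{
    # Hypothetical reconstruction: serialise only, no database write,
    # mirroring freeze_only. The key is built to include its cost.
    my ($db, $ref_record) = @_;
    my $key = "/order/$ref_record->{'order_id'}";
    encode_json($ref_record);
}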

The benchmarking code looks like this.

Benchmark::cmpthese(-1, {
    'json-only-50/50' => sub { json_only($db, $rec_50_50) },
    'freeze-only-50/50' => sub { freeze_only($db, $rec_50_50) },

    'freeze-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1) },
    'freeze-no-sync-50/50-tc' => sub { store_record_tc($hdb, $rec_50_50, 1) },

    'json-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1, 1) },
    'json-no-sync-50/50-tc' => sub { store_record_tc($hdb, $rec_50_50, 1, 1) },
});

And the results are as follows:

        Rate freeze-no-sync-50/50 json-no-sync-50/50 freeze-no-sync-50/50-tc json-no-sync-50/50-tc freeze-only-50/50 json-only-50/50
freeze-no-sync-50/50     7791/s                   --                -9%                    -39%                  -47%              -59%            -81%
json-no-sync-50/50       8605/s                  10%                 --                    -33%                  -41%              -55%            -79%
freeze-no-sync-50/50-tc 12800/s                  64%                49%                      --                  -13%              -33%            -69%
json-no-sync-50/50-tc   14698/s                  89%                71%                     15%                    --              -23%            -64%
freeze-only-50/50       19166/s                 146%               123%                     50%                   30%                --            -54%
json-only-50/50         41353/s                 431%               381%                    223%                  181%              116%              --

Pedro was right. Tokyo Cabinet is significantly faster than Berkeley DB, at least in this simple benchmark.
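
One caveat: putasync caches records in memory, so nothing is guaranteed to be on disk until the handle is closed or synced. Closing follows the same error-checking pattern as the open call above:

# Flush buffered records and release the handle
if (!$hdb->close()) {
    my $ecode = $hdb->ecode();
    printf STDERR ("close error: %s\n", $hdb->errmsg($ecode));
}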

Edit: the json and no_sync parameters were switched in the original version of this post; this has been fixed.


Recently I’ve been thinking about storing records on disk quickly. For my general use case, an RDBMS isn’t quite fast enough. My first thought, probably like many a NoSQL person before me, is how fast can I go if I give up ACID?

Even if I’m intending to do the final implementation in C++, I’ll often experiment with Perl first. It’s often possible to use the C libraries anyway.

First up, how about serialising records using Storable to an on-disk hash table like Berkeley DB?

(Aside: I’m probably going to appear obsessed with benchmarking now, but really I’m just sticking a finger in the air to get an idea about how various approaches perform. I can estimate 90cm given a metre stick. I don’t need a more precise way to do a rough estimate.)

use Storable;
use Benchmark qw(:hireswallclock);

use BerkeleyDB;

I need to make a random record to store in the DB.

sub make_record
{
    my ($order_id, $fields, $key_len, $val_len) = @_;

    my %record;
    $record{'order_id'} = $order_id;

    # order_id counts as one field, so generate $fields - 1 random ones
    for my $field_no (1..$fields-1) {
        my $key = 'key';
        my $val = 'val';
        # pad with random upper-case letters up to the requested lengths
        $key .= chr(65 + rand(26)) for (1..$key_len - length($key));
        $val .= chr(65 + rand(26)) for (1..$val_len - length($val));
        $record{$key} = $val;
        # print "$key -> $val\n";    # debug output, disabled for benchmarking
    }
    return \%record;
}

And a wrapper handles the general case I’ll be testing – key and value equal length and order_id starting at 1.

sub rec
{
    my ($fields, $len) = @_;
    return make_record(1, $fields, $len, $len);
}

I’ll compare serialisation alone against actually storing the data to disk, to see what upper limit I could achieve if, for example, I were using an SSD.

sub freeze_only
{
    # $db and $no_sync are accepted but unused, keeping the signature
    # consistent with store_record; the key is built to include its cost.
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    my $key = "/order/$ref_record->{'order_id'}";
    Storable::freeze($ref_record);
}

And I’m curious to know how much overhead syncing to disk adds.

my $ORDER_ID = 0;

sub store_record
{
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->db_put($key, Storable::freeze($ref_record));
    $db->db_sync() unless $no_sync;
}

The Test Program

A record with 50 fields, each of size 50 seems reasonable.

my $filename = "$ENV{HOME}/test.db";
unlink $filename;

my $db = BerkeleyDB::Hash->new(
    -Filename => $filename,
    -Flags    => DB_CREATE
) or die "Cannot open file $filename: $! $BerkeleyDB::Error\n";

my $rec_50_50 = rec(50, 50);

Benchmark::cmpthese(-1, {
    'freeze-only-50/50' => sub { freeze_only($db, $rec_50_50) },
    'freeze-sync-50/50' => sub { store_record($db, $rec_50_50) },
    'freeze-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1) },
});

The Results

                        Rate freeze-sync-50/50 freeze-no-sync-50/50 freeze-only-50/50
freeze-sync-50/50     1543/s                --                 -80%              -93%
freeze-no-sync-50/50  7696/s              399%                   --              -63%
freeze-only-50/50    21081/s             1267%                 174%                --

Conclusion

Unsurprisingly, syncing is expensive: it adds 400% overhead. However, even with the sync, we’re still able to store about 5.5 million records an hour (1543/s × 3600s). Is that fast enough for me? (I need some level of reliability.) It might well be.

Berkeley DB is fast. It only adds 170% overhead to the serialisation itself. I’m impressed.

In case anyone is interested, I ran a more comprehensive set of benchmarks across field counts and sizes (labels are fields/size).
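
The code for the fuller matrix wasn’t included in the post; a minimal sketch of how it might be generated, assuming the rec, freeze_only, and store_record definitions above (the fcall-only entry is assumed to measure bare subroutine-call overhead):

# Hypothetical reconstruction of the fuller benchmark matrix
my %tests = (
    'fcall-only' => sub { },    # empty sub: bare call overhead
);
for my $fields (5, 50) {
    for my $len (5, 50) {
        my $rec = rec($fields, $len);
        my $tag = sprintf("%02d/%02d", $fields, $len);
        $tests{"freeze-only-$tag"}    = sub { freeze_only($db, $rec) };
        $tests{"freeze-sync-$tag"}    = sub { store_record($db, $rec) };
        $tests{"freeze-no-sync-$tag"} = sub { store_record($db, $rec, 1) };
    }
}
Benchmark::cmpthese(-1, \%tests);

The results: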

freeze-sync-50/50       1846/s
freeze-sync-50/05       2262/s
freeze-sync-05/50       2546/s
freeze-sync-05/05       2799/s
freeze-no-sync-50/50    7313/s
freeze-no-sync-50/05    9514/s
freeze-no-sync-05/50   11395/s
freeze-no-sync-05/05   12589/s
freeze-only-50/50      20031/s
freeze-only-50/05      21920/s
freeze-only-05/05      26547/s
freeze-only-05/50      26547/s
fcall-only           2975364/s
