Recently I’ve been thinking about how to store records on disk quickly. For my general use case, an RDBMS isn’t quite fast enough. My first thought, probably like that of many a NoSQL person before me, is: how fast can I go if I give up ACID?
Even when I intend to do the final implementation in C++, I’ll often experiment in Perl first; it’s usually possible to drive the same C libraries from there anyway.
First up, how about serialising records using Storable into an on-disk hash table like Berkeley DB?
(Aside: I’m probably going to appear obsessed with benchmarking now, but really I’m just sticking a finger in the air to get an idea about how various approaches perform. I can estimate 90cm given a metre stick. I don’t need a more precise way to do a rough estimate.)
use Storable;
use Benchmark qw(:hireswallclock);
use BerkeleyDB;
I need to make a random record to store in the DB.
sub make_record {
    my ($order_id, $fields, $key_len, $val_len) = @_;

    my %record;
    $record{'order_id'} = $order_id;

    # order_id counts as one field, so generate $fields - 1 more,
    # padding each key and value with random uppercase letters out
    # to the requested lengths.
    for my $field_no (1 .. $fields - 1) {
        my $key = 'key';
        my $val = 'val';
        $key .= chr(65 + rand(26)) for (1 .. $key_len - length($key));
        $val .= chr(65 + rand(26)) for (1 .. $val_len - length($val));
        $record{$key} = $val;
        print "$key -> $val\n";    # debug: show the generated field
    }

    return \%record;
}
And a wrapper handles the general case I’ll be testing – key and value of equal length, and order_id starting at 1.
sub rec {
    my ($fields, $len) = @_;
    return make_record(1, $fields, $len, $len);
}
I’ll compare serialisation alone against actually storing the data to disk, to see what upper limit I could achieve if, for example, I were using an SSD.
sub freeze_only {
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;

    # Same signature and key construction as store_record below, so
    # the per-call overhead is comparable; $key is built but unused,
    # and nothing touches the DB.
    my $key = "/order/$ref_record->{'order_id'}";
    Storable::freeze($ref_record);
}
And I’m curious to know how much overhead syncing to disk adds.
my $ORDER_ID = 0;

sub store_record {
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;

    # A fresh order_id per call gives each put a unique key.
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";

    $db->db_put($key, Storable::freeze($ref_record));
    $db->db_sync() unless $no_sync;
}
The Test Program
A record with 50 fields, each of size 50 seems reasonable.
my $filename = "$ENV{HOME}/test.db";
unlink $filename;

my $db = BerkeleyDB::Hash->new(
    -Filename => $filename,
    -Flags    => DB_CREATE,
) or die "Cannot open file $filename: $! $BerkeleyDB::Error\n";

my $rec_50_50 = rec(50, 50);

Benchmark::cmpthese(-1, {
    'freeze-only-50/50'    => sub { freeze_only($db, $rec_50_50) },
    'freeze-sync-50/50'    => sub { store_record($db, $rec_50_50) },
    'freeze-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1) },
});
The Results
                         Rate  freeze-sync-50/50  freeze-no-sync-50/50  freeze-only-50/50
freeze-sync-50/50      1543/s                 --                  -80%               -93%
freeze-no-sync-50/50   7696/s               399%                    --               -63%
freeze-only-50/50     21081/s              1267%                  174%                 --
Conclusion
Unsurprisingly, syncing is expensive – it adds around 400% overhead. However, even with the sync, we’re still able to store 5.5 million records an hour. Is that fast enough for me? (I need some level of reliability.) It might well be.
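If per-record durability isn’t essential, one obvious middle ground (not benchmarked above) is to sync every N puts rather than every put. A minimal sketch; BATCH_SIZE and store_record_batched are my own illustrative names, and a crash would lose at most the unsynced batch:

# Sketch: amortise the db_sync() cost over a batch of puts.
use constant BATCH_SIZE => 100;    # illustrative knob

my $unsynced = 0;

sub store_record_batched {
    my ($db, $ref_record) = @_;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->db_put($key, Storable::freeze($ref_record));
    if (++$unsynced >= BATCH_SIZE) {
        $db->db_sync();
        $unsynced = 0;
    }
}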
Berkeley DB is fast. It adds only around 170% overhead to the serialisation itself. I’m impressed.
In case anyone is interested, I ran a more comprehensive set of benchmarks.
freeze-sync-50/50        1846/s
freeze-sync-50/05        2262/s
freeze-sync-05/50        2546/s
freeze-sync-05/05        2799/s
freeze-no-sync-50/50     7313/s
freeze-no-sync-50/05     9514/s
freeze-no-sync-05/50    11395/s
freeze-no-sync-05/05    12589/s
freeze-only-50/50       20031/s
freeze-only-50/05       21920/s
freeze-only-05/05       26547/s
freeze-only-05/50       26547/s
fcall-only            2975364/s
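The harness for that run isn’t shown above, but it would be something along these lines, reusing rec(), freeze_only(), store_record() and the $db handle from the test program (the fcall-only baseline is, I assume, timing a bare function call):

# Sketch of a harness for the field-count/length matrix above.
sub empty {}

my %tests = ('fcall-only' => sub { empty() });
for my $fields (50, 5) {
    for my $len (50, 5) {
        my $label  = sprintf '%02d/%02d', $fields, $len;
        my $record = rec($fields, $len);
        $tests{"freeze-only-$label"}    = sub { freeze_only($db, $record) };
        $tests{"freeze-sync-$label"}    = sub { store_record($db, $record) };
        $tests{"freeze-no-sync-$label"} = sub { store_record($db, $record, 1) };
    }
}
Benchmark::cmpthese(-1, \%tests);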
Several comments:
* if you don’t need the full power of Storable, try JSON::XS: in my benchmarks it is faster than Storable, and the end result is human-readable and reusable from other languages (see the sketch after this list);
* BerkeleyDB is not the fastest disk DB in town. I would also experiment with Tokyo Cabinet (http://fallabs.com/tokyocabinet/);
* Redis would also be interesting to try. Given that it runs in a separate process, you should be able to get decent performance with pipelined commands.
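On the first suggestion: a minimal sketch of swapping JSON::XS in for Storable in store_record. The store_record_json name is mine, and this assumes the records stay plain hashes of strings – JSON won’t round-trip arbitrary Perl structures the way Storable does:

use JSON::XS;    # exports encode_json

sub store_record_json {
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    # Same put/sync pattern as store_record, different serialiser.
    $db->db_put($key, encode_json($ref_record));
    $db->db_sync() unless $no_sync;
}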
Hi Pedro,
This is exactly the kind of comment that keeps me blogging. Great information! I hadn’t come across Tokyo Cabinet, and it looks like exactly what I have been looking for, for more than 5 years!
I had been intending to benchmark Redis next (see the pipelining sketch after this comment).
Cheers
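For what it’s worth, a rough sketch of the pipelined Redis approach Pedro describes, using the CPAN Redis client. That client pipelines a command when you pass a coderef as the final argument, then wait_all_responses collects the replies; everything else here (the loop size, store_record_redis) is illustrative:

use Redis;

my $redis = Redis->new;    # assumes a redis-server on localhost

# Reuses $ORDER_ID, rec() and Storable from the post above.
sub store_record_redis {
    my ($ref_record) = @_;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    # The trailing coderef queues the command instead of blocking
    # on each round trip.
    $redis->set($key, Storable::freeze($ref_record), sub {});
}

store_record_redis(rec(50, 50)) for 1 .. 1000;
$redis->wait_all_responses;    # block until the pipeline drains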
By the way, did you measure any RDBMS solutions? I am curious how SQLite would fare here – it can be quite fast for simple operations.
Hi Zbigniew,
I haven’t done a like-for-like RDBMS comparison. However, I do have an RDBMS-based system I’m looking to replace that handles a peak of about 120 updates/sec on significantly superior hardware to my laptop. It’s also doing a lot more work than this benchmark demonstrates. If I added the rest of the updates and the required indices, I suspect I’d be able to get around 500 updates/sec with Berkeley DB while syncing after every request. That probably isn’t quite fast enough for my needs.
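For the curious, a like-for-like SQLite run might look something like the sketch below, using DBI with DBD::SQLite. The table layout and store_record_sqlite are my own; with AutoCommit on, each INSERT is its own committed transaction, which is roughly comparable to the sync-per-put case above:

use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=$ENV{HOME}/test.sqlite",
                       '', '', { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS orders (k TEXT PRIMARY KEY, v BLOB)');

my $sth = $dbh->prepare('INSERT OR REPLACE INTO orders (k, v) VALUES (?, ?)');

# Reuses $ORDER_ID and Storable from the post above.
sub store_record_sqlite {
    my ($ref_record) = @_;
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $sth->execute($key, Storable::freeze($ref_record));
}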