Archive for September, 2010

Hashes And Trees Are Not Fungible

Nathan makes an interesting comment on my blog.

"I think most languages don’t include trees because hash tables are close enough for most purposes. You don’t usually need to traverse key/value pairs in the map in sorted order, which is about the only thing a tree buys you over a hash. And if you know that you care about sortedness, you probably know enough to go implement your own."

With my Perl background, I’ve got a lot of sympathy for that position. It’s almost, but not quite, completely wrong though.

Extensible Containers are all the same and yet different

Vectors, trees, hashes and the rest do have something in common: they are all capable of storing a bunch of things, and they all offer operations such as store and retrieve. The difference lies in how long each operation takes.

  • You want to find a particular item in a singly linked list? That’ll take about N/2 steps on average. Insertion at the head is constant time.
  • Insertion into a balanced binary tree takes about log2 N steps. Finding an item also takes log2 N.
  • Iterating over a sorted range of keys in a tree? log2 N + length of range. In a hash? N log2 N, because all the keys have to be sorted first.

When I want to implement a database, or whatever, this sort of thing matters. I could always implement it on top of a singly linked list, or a hash, or a vector, but it would be terribly inefficient.
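To make the range-iteration difference concrete, here is a minimal sketch in Python. Python has no built-in tree, so a sorted list searched with `bisect` stands in for the ordered structure; that substitution, and the function names `range_from_sorted` and `range_from_hash`, are assumptions for illustration only.

```python
import bisect

def range_from_sorted(sorted_keys, lo, hi):
    """Ordered structure: two binary searches, then walk the range.
    Cost is roughly log2 N + length of range."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]

def range_from_hash(table, lo, hi):
    """Hash table: keys come back in no useful order, so every key
    is sorted before the range can be picked out (about N log2 N)."""
    return [k for k in sorted(table) if lo <= k <= hi]

keys = list(range(0, 1000, 7))           # already sorted
table = {k: str(k) for k in keys}        # the same keys in a hash

print(range_from_sorted(keys, 100, 130))
print(range_from_hash(table, 100, 130))
```

Both calls return the same keys; the difference is only in how much work is done to find them.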

The English Effect

Paul Graham once mentioned The Blub Paradox, where a user of a particular programming language doesn’t understand the value of unfamiliar (more powerful) constructs in another language [1].

Actually, the effect is not restricted to languages; it affects all kinds of useful things. And they don’t have to be more or less powerful than other things, just different. So someone who is constrained to a scalar/array/hash view of the world won’t necessarily see the value in binary trees [2].

I think of this as The English Effect. English isn’t more or less powerful than other languages. But a native English speaker will tend to look at other languages and not see any value in learning them. They will be unaware that there are ideas that can be expressed much more elegantly in other languages.

The same ideas can be described in English, albeit in a clumsy way, just as binary trees can be implemented in assembler.

Take, for example, the German doch. I have lost count of the times I have wished English had a simple way to contradict a negative assertion, and of the working efficiency it would impart.

"Oh, that idea you have, it is never going to work, the boss will never okay it and it will take too long and there are instances it won’t cover and, and, and."

You simply respond with Doch! And your negative colleague has no possible comeback. You are all able to proceed with your work and do so gladly 😉

1. I tend to find that the way it is phrased is quite condescending. And how is it a paradox anyway?

2. Looking at Nathan’s blog though, he does Lisp. And in Lisp, the basic datatype is a tree. So he’s not unfamiliar with that stuff, which makes his comment even more strange to me.



As Emacs is open source, we don’t need to speculate. We can just read the code. I’m lazy though, so I prefer to speculate 😉 If you were making a text editor, what would you use for your buffer type?

Array of Chars

The obvious, naive choice is an array of chars. But that means that, while append may be fast (until you need to resize your array at least), insertion is very slow. To insert a character in the middle of an array of chars, you need to move all of the characters following the insertion point up by one character. If we assume that insertions are evenly distributed, inserting a character will take, on average, time proportional to N/2.

Queue up Inserts

We can be a bit smarter than that. In normal text editor usage, any given insert is likely to be followed by a bunch of further characters in the same place. We could append those characters to a preallocated array and, after a certain amount of time or number of characters, insert them into the array of chars in one go. This brings the amortized cost per character down to (N/2)/M, where M is the average number of characters inserted in one go.
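A rough sketch of the queued-insert idea, in Python. The class name `BufferedText` and its methods are hypothetical, and this is not how Emacs or any particular editor does it. Consecutive inserts at adjacent positions are collected in a small pending list and spliced into the main array in one move.

```python
class BufferedText:
    """Array of chars plus a queue of pending, contiguous inserts."""

    def __init__(self, text=""):
        self.chars = list(text)
        self.pending = []       # characters typed but not yet spliced in
        self.pending_pos = 0    # where the pending run starts
        self.flushes = 0        # number of expensive splices, for illustration

    def insert(self, pos, ch):
        # A non-adjacent insert forces the queued run into the array first.
        if self.pending and pos != self.pending_pos + len(self.pending):
            self.flush()
        if not self.pending:
            self.pending_pos = pos
        self.pending.append(ch)  # cheap append to the queue

    def flush(self):
        if self.pending:
            # One splice moves the tail of the array once for the whole run.
            self.chars[self.pending_pos:self.pending_pos] = self.pending
            self.pending = []
            self.flushes += 1

    def text(self):
        self.flush()
        return "".join(self.chars)


buf = BufferedText("hello world")
for offset, ch in enumerate(", big"):
    buf.insert(5 + offset, ch)
print(buf.text(), buf.flushes)
```

Typing five characters in a row costs five cheap appends and a single splice, rather than five separate shifts of everything after the insertion point.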

A Tree of Characters

So much for arrays. How about a tree? Assuming the tree is balanced, insertion will take time proportional to log2 N. The downside is that traversal is a lot slower (still order N, but a decent constant factor slower). And we are also likely to lose cache locality.

Side note: I am often surprised that the main ‘scripting’ languages don’t provide much more than the equivalent of Perl scalars, arrays and hashes. Sure, Python has tuples and Ruby has symbols (okay, I do miss symbols in Perl), but where are the built-in trees? Even Lisp has them. I had a quick look for trees in the CPAN and found Tree::Binary and Tree::RedBlack. I couldn’t tell at a quick glance whether Tree::Binary self-balances or not. Hmmm…

A Rope

The final simple option is a rope (PDF). This is a binary tree where the leaves, instead of being characters, are arrays of characters. This improves cache locality (depending on the average size of the array), and traversal speed, although it isn’t as fast as a simple array of characters.
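A minimal rope sketch in Python, to show the shape of the structure. The names (`Leaf`, `Node`, `split`, `insert`, `char_at`, `to_str`) are made up for illustration, and rebalancing is left out entirely, so this is a toy under those assumptions and not a production rope.

```python
class Leaf:
    """A leaf holds a small array (here a str) of characters."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)

class Node:
    """An inner node just joins two ropes and caches the total length."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length

def char_at(rope, i):
    # Walk down, using the left subtree's length to pick a side.
    while isinstance(rope, Node):
        if i < rope.left.length:
            rope = rope.left
        else:
            i -= rope.left.length
            rope = rope.right
    return rope.text[i]

def split(rope, i):
    """Return two ropes: the first i characters, and the rest."""
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:i]), Leaf(rope.text[i:])
    if i <= rope.left.length:
        a, b = split(rope.left, i)
        return a, Node(b, rope.right)
    a, b = split(rope.right, i - rope.left.length)
    return Node(rope.left, a), b

def insert(rope, i, text):
    # Only the leaf containing position i is copied; no big shift.
    left, right = split(rope, i)
    return Node(Node(left, Leaf(text)), right)

def to_str(rope):
    if isinstance(rope, Leaf):
        return rope.text
    return to_str(rope.left) + to_str(rope.right)

rope = Node(Leaf("hello"), Leaf(" world"))
rope = insert(rope, 5, ", big")
print(to_str(rope), rope.length)
```

An insert touches only the nodes on the path to the insertion point, while traversal still reads each leaf as a contiguous run of characters.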

What would you use?


I’m a big fan of Antirez. I’ve been subscribed to his blog for a while.

I’m also a big fan of shoddy benchmarks. Maybe that makes me biased.

So, the sys/toilet guy writes a post benchmarking Redis vs Memcached and concludes that Memcached is faster and Redis doesn’t do so well at setup and teardown.

Antirez responds: (emphasis mine)

"Unfortunately I also think that our dude is part of the problem, since running crappy benchmarks is not going to help our pop culture."

Okay, fine. Sys/Toilet guy used some robust language in his post. Antirez has every right to respond in kind.

But the thing is…

"Still this benchmark captured my interest. Even with all this problems, why on the earth Redis was showing so low numbers in multi-get (MGET) operations?"

"I checked the implementation better and found a problem in the query parsing code that resulted in a quadratic algorithm due to a wrong positioned strchr() call."

Hmmm… sounds like the “crappy” benchmark helped. And what’s more, there is already a (presumably decent) Redis benchmark that didn’t pick up the issue.

"In Redis land we already have redis-benchmark that assisted us in the course of the latest one and half years, and worked as a great advertising tool: we were able to show that our system was performing well."

So, sys/toilet guy, thanks for making Redis better. I appreciate it.


Recently I’ve been thinking about storing records on disk quickly. For my general use case, an RDBMS isn’t quite fast enough. My first thought, probably like many a NoSQL person before me, is how fast can I go if I give up ACID?

Even if I’m intending to do the final implementation in C++, I’ll often experiment with Perl first. It’s often possible to use the C libraries anyway.

First up, how about serialising records using Storable to an on-disk hash table like Berkeley DB?

(Aside: I’m probably going to appear obsessed with benchmarking now, but really I’m just sticking a finger in the air to get an idea about how various approaches perform. I can estimate 90cm given a metre stick. I don’t need a more precise way to do a rough estimate.)

use Storable;
use Benchmark qw(:hireswallclock);

use BerkeleyDB;

I need to make a random record to store in the DB.

sub make_record {
    my ($order_id, $fields, $key_len, $val_len) = @_;

    my %record;
    $record{'order_id'} = $order_id;

    # order_id takes one slot, so generate the remaining $fields - 1 fields
    for my $field_no (1..$fields-1) {
        my $key = 'key';
        my $val = 'val';
        # pad key and value with random upper-case letters up to the requested length
        $key .= chr(65 + rand(26)) for (1..$key_len - length($key));
        $val .= chr(65 + rand(26)) for (1..$val_len - length($val));
        $record{$key} = $val;
        print "$key -> $val\n";
    }
    return \%record;
}

And a wrapper handles the general case I’ll be testing – key and value equal length and order_id starting at 1.

sub rec {
    my ($fields, $len) = @_;
    return make_record(1, $fields, $len, $len);
}

I’ll compare serialisation only against actually storing the data to disk to see what the upper limit I could achieve is if, for example, I was using an SSD.

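sub freeze_only {
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    my $key = "/order/$ref_record->{'order_id'}";
    # Assumed body (missing from the listing): serialise only, never touch the DB
    Storable::freeze($ref_record);
}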

And I’m curious to know how much overhead syncing to disk adds.

my $ORDER_ID = 0;

sub store_record {
    my ($db, $ref_record, $no_sync) = @_;
    $no_sync //= 0;
    # give every stored record a fresh key
    $ref_record->{'order_id'} = ++$ORDER_ID;
    my $key = "/order/$ref_record->{'order_id'}";
    $db->db_put($key, Storable::freeze($ref_record));
    $db->db_sync() unless $no_sync;
}

The Test Program

A record with 50 fields, each of size 50 seems reasonable.

my $filename = "$ENV{HOME}/test.db";
unlink $filename;

my $db = new BerkeleyDB::Hash
    -Filename => $filename,
    -Flags    => DB_CREATE
    or die "Cannot open file $filename: $! $BerkeleyDB::Error\n" ;

my $rec_50_50 = rec(50, 50);

Benchmark::cmpthese(-1, {
    'freeze-only-50/50' => sub { freeze_only($db, $rec_50_50) },
    'freeze-sync-50/50' => sub { store_record($db, $rec_50_50) },
    'freeze-no-sync-50/50' => sub { store_record($db, $rec_50_50, 1) },
});

The Results

                        Rate freeze-sync-50/50 freeze-no-sync-50/50 freeze-only-50/50
freeze-sync-50/50     1543/s                --                 -80%              -93%
freeze-no-sync-50/50  7696/s              399%                   --              -63%
freeze-only-50/50    21081/s             1267%                 174%                --


Unsurprisingly, syncing is expensive – it adds 400% overhead. However, even with the sync, we’re still able to store 5.5 million records an hour. Is that fast enough for me, given that I need some level of reliability? It might well be.

Berkeley DB is fast. It only adds 170% overhead to the serialisation itself. I’m impressed.
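The percentages and the records-per-hour figure can be checked from the rates in the results table. The short Python sketch below does that arithmetic; the helper name `overhead_percent` is made up here, and "overhead" is taken to mean extra time per operation relative to the faster variant.

```python
# Rates (operations per second) copied from the results table above.
freeze_sync = 1543
freeze_no_sync = 7696
freeze_only = 21081

def overhead_percent(fast_rate, slow_rate):
    """Extra time per operation of the slower variant, as a
    percentage of the time taken by the faster one."""
    return (fast_rate / slow_rate - 1) * 100

sync_overhead = overhead_percent(freeze_no_sync, freeze_sync)
bdb_overhead = overhead_percent(freeze_only, freeze_no_sync)
records_per_hour = freeze_sync * 3600

print(f"sync overhead: {sync_overhead:.0f}%")
print(f"Berkeley DB overhead: {bdb_overhead:.0f}%")
print(f"records per hour with sync: {records_per_hour:,}")
```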

In case anyone is interested, I ran a more comprehensive set of benchmarks.

freeze-sync-50/50       1846/s
freeze-sync-50/05       2262/s
freeze-sync-05/50       2546/s
freeze-sync-05/05       2799/s
freeze-no-sync-50/50    7313/s
freeze-no-sync-50/05    9514/s
freeze-no-sync-05/50   11395/s
freeze-no-sync-05/05   12589/s
freeze-only-50/50      20031/s
freeze-only-50/05      21920/s
freeze-only-05/05      26547/s
freeze-only-05/50      26547/s
fcall-only           2975364/s


When I set up an Emacs process filter, I never know how much input the process will deliver to me at one time. Therefore I generally buffer the input and process (say) a line at a time.

(defvar buffer "")

(defun process-input (input)
  ;; append the new input after any leftover partial line
  (setq buffer (concat buffer input))
  (let ((count 0)
        (pos 0))
    (while (string-match "^\\([^\n]+\\)\n" buffer pos)
      (message "Before: [%s] [%d]" (match-string 1 buffer) (match-end 1))
      (process-line (match-string 1 buffer))
      (message "After: [%s] [%d]" (match-string 1 buffer) (match-end 1))

      ;; error handling in case of runaway loops
      (setq count (1+ count))
      (when (> count 10)
        (error "Looped too many times!"))

      (setq pos (1+ (match-end 1))))
    (when pos (setq buffer (substring buffer pos))))
  (message "REMAINDER: [%s]" buffer))

Given a simple (process-line ...) we can see the results from a simple test.

(defun process-line (line)
  (message "[%s]" line))

(process-input "XXX (line 1)\n13.15 (line 2)\n")
Before: [XXX (line 1)] [12]
[XXX (line 1)]
After: [XXX (line 1)] [12]
Before: [13.15 (line 2)] [27]
[13.15 (line 2)]
After: [13.15 (line 2)] [27]

An Infinite Loop

Now say I want to add some special handling for lines with a number in, what do you think happens?

(defun process-line (line)
  (when (string-match "\\([0-9]+\\.[0-9]+\\)" line)
    (message "[%s]" line)))
Debugger entered--Lisp error: (error "Looped too many times!")
Before: [XXX (line 1)] [12]
After: [XXX (line 1)] [12]
Before: [13.15 (line 2)] [27]
[13.15 (line 2)]
After: [XXX (] [5]
Before: [13.15 (line 2)] [27]
[13.15 (line 2)]
After: [XXX (] [5]

The string match against the number has overwritten the global match data used by (match-end ...), which has sent us into an infinite loop. Surprising action at a distance indeed! Wrapping the body of process-line in (save-match-data ...) would avoid it.

Perl Regex Globals

Perl also uses global variables such as $1, $2, @- and @+ for its regular expressions, so could this cause a similar problem?

No. Fortunately, these globals are dynamically scoped so the appropriate values are maintained up the call stack.


Disappearing CPAN Modules

Somewhere in my Super Programming Languages post, I asserted that in some of my scripts, 95% of the work has already been done by third party libraries. Such is the power of the CPAN! And other folks obviously use Perl for similar reasons – there are libraries available that solve at least part of their problem.

So, I’m surprised when people complain that libraries have been provided that help them out.

Steve obviously had second thoughts over publishing this post (google cache). Is blogpost necrophilia reasonable here? I’m afraid I’ve seen other people make similar arguments, and I can’t remember the location of their posts. Sorry Steve.

"Ok, so from that, I can gather this: I’m screwed because I used perl for a script, and now the module author has decided to completely abandon the module, because it is based upon another module that doesn’t install worth anything. Cool!"

He used an extremely useful and mature scripting language and a module that solves part of his problem. Because the module doesn’t work with a later version of that excellent scripting language, he’s doomed, doomed, DOOMED I tell you. And of course, it’s Perl’s fault.

The first thing is that he’s not doomed. He is simply going to have to find an alternative module or implement the functionality provided by the module himself. He’d have had to do that if the module wasn’t provided anyway. So in no way was he disadvantaged by having the module (and Perl) available.

The other thing is that this happens in every changing environment. Say your C++ compiler upgrades and you were inadvertently depending on a particular bug. Oops! (Don’t laugh – it happens). If your firm upgrades Excel to version 2000007, maybe your macros aren’t going to run any more. We’re moving to a 64-bit OS and I’m glad there’s a bunch of Perl that I can be confident will more or less work. I’m not looking forward to porting the C++ though.

To try and add a fair and balanced viewpoint, I should point out that part of the argument does make sense. It is disappointing that Daemon::Simpler has gone from Backpan. Having said that, sometimes it is important to take some responsibility. Any module that I bring into our production environment is archived, as is the current version of Perl we’re using.
