Posts Tagged ‘datatypes’

Hashes And Trees Are Not Fungible

Nathan makes an interesting comment on my blog.

"I think most languages don’t include trees because hash tables are close enough for most purposes. You don’t usually need to traverse key/value pairs in the map in sorted order, which is about the only thing a tree buys you over a hash. And if you know that you care about sortedness, you probably know enough to go implement your own."

With my Perl background, I’ve got a lot of sympathy for that position. It’s almost, but not quite, completely wrong though.

Extensible Containers are all the same and yet different

Vectors or trees or hashes or whatever, do have something in common. They are all capable of storing a bunch of things. You can have operations such as store and retrieve. That’s true of course. The difference lies in how long it takes to do an operation.

  • You want to find a particular item in a singly linked list? That’ll take N/2. Insertion is constant.
  • Insertion into a balanced binary tree takes log2 N. Finding an item also takes log2 N.
  • Iterating over a sorted range of keys in a tree? log2 N + length of range. In a hash? N log2 N.
  • When I want to implement a database, or whatever, this sort of thing matters. I could always implement it on top of singly linked list, or a hash, or a vector but it would be terribly inefficient.

    The English Effect

    Paul Graham once mentioned The Blub Paradox where a user of a particular programming language doesn’t understand the value in unfamiliar (more powerful) constructs in another language1.

    Actually the effect is not just restricted to languages, it affects all kinds of useful things. And they don’t have to be more or less powerful than other things, just different. So someone who is constrained into a scalar/array/hash view of the world, won’t necessarily see the value in binary trees2.

    I think of this as The English Effect. English isn’t more or less powerful than other languages. But a native english speaker will tend to look at other languages and not see any value in learning them. They will be unaware that there are ideas that can be expressed much more elegantly in other languages.

    The same ideas can be described, albeit in a clumsy way, just the same as binary trees can be implemented in assembler.

    Take for example, the German doch. The number of times I have wished there was a simple disagreement with a negative assertion in English and the working efficiency it would impart is uncountable.

    Oh, that idea you have, it is never going to work, the boss will never okay it and it will take to long and there are instances it won’t cover and, and, and.

    You simply respond with Doch! And your negative colleague has no possible comeback. You are all able to proceed with your work and do so gladly ;)

    1. I tend to find that the way it is phrased is quite condescending. And how is it a paradox anyway?

    2. Looking at Nathan’s blog though, he does Lisp. And in Lisp, the basic datatype is a tree. So he’s not unfamiliar with that stuff which makes his comment even more strange to me.

    Read Full Post »

As Emacs is open source, we don’t need to speculate. We can just read the code. I’m lazy though, so I prefer to speculate ;) If you were making a text editor, what would you use for your buffer type?

Array of Chars

The obvious, naive choice is an array of chars. But that means that, while append may be fast (until you need to resize your array at least), insertion is very slow. To insert a character in the middle of an array of chars, you need to move all of the characters following the insertion point up by one character. If we assume that insertions are evenly distributed, inserting a character will take, on average, time proportional to N/2.

Queue up Inserts

We can be a bit smarter than that. With normal text editor usage, for any given insert, we will probably insert a bunch of characters in the same place. We could append those characters into a preallocated array, and after a certain amount of time or number of characters, insert those characters into the array of chars. This brings the amortized speed cost down to N/M/2 where M is the average number of characters inserted in one go.

A Tree of Characters

So much for arrays. How about a tree? Assuming the tree, is balanced, insertion will take log2 N. The downside is that traversal is a lot slower (still order N, but a decent constant factor slower). And we are also likely to lose cache locality.

Side note: I often surprised that the main ‘scripting’ languages don’t provide much more than the equivalent of perl scalars, arrays and hashes. Sure, Python has tuples and Ruby has symbols (okay, I do miss symbols in Perl), but where are the built-in trees? Even Lisp has them. I had a quick look for trees in the CPAN and found Tree::Binary and Tree::RedBlack. I couldn’t find at a quick glance whether Tree::Binary self-balances or not. Hmmm…

A Rope

The final simple option is a rope (PDF). This is a binary tree where the leaves, instead of being characters, are arrays of characters. This improves cache locality (depending on the average size of the array), and traversal speed, although it isn’t as fast as a simple array of characters.

What would you use?

Read Full Post »


Get every new post delivered to your Inbox.