Vacation 2014

Mon Jul 14, 2014

So I'm wrapping up my vacation. I took the full week off, and we're into weekend time as I write this. Monday, I go back to work. You know what I've been doing on vacation? Dick all. Oh, sorry, that's what most people do on vacation. Reading up on data structures.

Seriously.

About six years ago, SteveY wrote up a post entitled "Get That Job at Google", which incidentally happened to contain a giant pile of stuff to read up on that sounded interesting. I'd read it a couple years back and put a lot of that stuff on a "ToDo" list in the back of my head and didn't think too much about it. Having a kid tends to eat into your personal time, it turns out. It basically got whittled down to just occasionally hacking on GitHub projects, going to various programming-related social events and writing the occasional hundred thousand words.

This week, I finally got enough time together that I decided to sit down and plow through the first chunk of that reading list. This may brand me as an irredeemable nerd, but that was easily the most fun I've had in about five years. Not that I'm done yet, mind you. So far I've

brushed up on my Big O, tree traversal/rotation and various sorts
implemented minimal hash tables
done a bit of reading on NP-complete problems and various graph storage techniques
done a lot of reading and thinking about, but not yet an implementation of, a binomial heap
implemented a minimal trie

Those last two weren't even in the list. I was supposed to be reading up on Red/Black trees and AVL trees, but kept getting distracted by more interesting data structures. I've still got a lot to read up on. Specifically, I still need to:

implement a quicksort and mergesort
actually get through Red+Black/AVL/Splay tree implementations (Finger trees(PDF warning) also look interesting, but aren't strictly on the reading list)
do more graph-related reading and prototyping
Do some serious reading related to NP-complete problems
brush up on various basic math things (the ones specified in the article are combinatorics and probability)
lightly brush up on OS basics (threads vs processes and performance implications, concurrency constructs)

By the way, so you know in advance, this article is going to feel pretty disjointed; every five minutes or so I get interested enough in something I'm link-hunting for that I go read about it for fifteen minutes or so before resuming prose. Hopefully it's still comprehensible.

Before I move on to the rest of my reading list, lets dissect the two things I've implemented so far. In an effort to potentially help someone else understand them, and with the hope that if I've gotten anything seriously wrong, I'll be yelled at by the internet shortly-ish.

Hash Tables

A hash table is a key/value storage structure that's meant to give you as close to constant time access/insertion as possible |1|. The way you do that is using a vector of linked lists and a hash function. The idea is that when you want to store a value, you hash it and take a modulus to figure out which bucket it goes in, remove that key from the bucket if it already exists, and insert the new key/value pair. You need to keep a list of such pairs |2|, and you need to keep the un-hashed key around, because there might be collisions under your hash function in the set of values you want to be able to store as keys. If you knew enough about your key space in advance that you could guarantee no collisions, you can probably do something more efficient, if less flexible, than hashes..

Here are some implementations, minus a couple of key bits that I'll talk about in a minute.

;; Common Lisp
(defclass my-hash-table ()
  ((bucket-count :reader bucket-count :initarg :bucket-count)
   (hash-fn :reader hash-fn :initarg :hash-fn)
   (comparison-fn :reader comparison-fn :initarg :comparison-fn)
   (buckets :reader buckets :initarg :buckets)))

(defun make-my-hash-table (&key (bucket-count 16) (hash-fn #'my-hash) (comparison-fn #'equal))
  (make-instance
   'my-hash-table
   :bucket-count bucket-count
   :hash-fn hash-fn
   :comparison-fn comparison-fn
   :buckets (make-array bucket-count :initial-element nil)))

(defmethod hash-mod ((table my-hash-table) key)
  (mod (funcall (hash-fn table) key) (bucket-count table)))

(defun make-my-hash-table (&key (bucket-count 16) (hash-fn #'my-hash) (comparison-fn #'equal))
  (make-instance
   'my-hash-table
   :bucket-count bucket-count
   :hash-fn hash-fn
   :comparison-fn comparison-fn
   :buckets (make-array bucket-count :initial-element nil)))

(defmethod get-hash ((table my-hash-table) key)
  (let ((cmp (comparison-fn table)))
    (loop for (k . v) in (aref (buckets table) (hash-mod table key))
       when (funcall cmp k key) return (values v t)
       finally (return (values nil nil)))))

(defmethod set-hash! ((table my-hash-table) key value)
  (let ((h (hash-mod table key))
        (cmp (comparison-fn table)))
    (setf (aref (buckets table) h)
          (cons (cons key value)
                (remove-if (lambda (k/v) (funcall cmp (car k/v) key))
                           (aref (buckets table) h))))
    value))

## Python
class my_hash_table:
    def __init__(self, buckets=16, hash_fn=my_hash):
        self.bucket_count = buckets
        self.hash_fn = hash_fn
        self.hashmod = lambda thing: self.hash_fn(thing) % self.bucket_count
        self.buckets = [[]] * self.bucket_count

    def get(self, key):
        for (k, v) in self.buckets[self.hashmod(key)]:
            if k == key:
                return v
        raise KeyError("Key '" + str(k) + "' not found...")

    def set(self, key, value):
        h = self.hashmod(key)
        tup = (key, value)
        self.buckets[h] = filter(lambda pair: pair[0] != key, self.buckets[h])
        self.buckets[h].append(tup)

-- Haskell
module Main where

import Data.Char
import Data.Array

data HashTable a b = Hash { tblBuckets :: Array Integer [(a, b)]
                          , tblBucketCount :: Integer
                          , tblHashFn :: a -> Integer
                          }

instance (Show a, Show b) => Show (HashTable a b) where
    show h = show $ tblBuckets h

empty :: Integer -> (a -> Integer) -> HashTable a b
empty bucketCount hashFn = Hash { tblBuckets = array (0, bucketCount-1) $ zip [0..(bucketCount-1)] $ repeat []
                              , tblBucketCount = bucketCount
                              , tblHashFn = hashFn
                              }

hashMod :: HashTable a b -> a -> Integer
hashMod hash = (`mod` tblBucketCount hash) . tblHashFn hash

get :: Eq a => HashTable a b -> a -> Maybe b
get hash key = lookup key bucket
    where bucket = buckets ! hashMod hash key
          buckets = tblBuckets hash

set :: Eq a => HashTable a b -> (a, b) -> HashTable a b
set hash (k, v) = hash { tblBuckets = buckets // [(ix, newBucket)] }
    where newBucket = insertOrReplace bucket (k, v)
          ix = hashMod hash k
          insertOrReplace b (k, v) = (k, v):filter ((/=k) . fst) bucket
          bucket = buckets ! ix
          buckets = tblBuckets hash

Firstly, you won't see the hash function implementation above. It's sort of incidental to the structure of the rest of the machinery, you probably want different ones for different situations, and the ones that are easy enough to implement off the top of your head tend not to be sufficient except for toy tasks |3|. If you're really interested in hash functions, rather than hash tables, you can find a good introduction and some examples here.

Also, none of the above tables grow. In a real situation, they'd have an additional parameter of some threshold at which they re-hash. That is, at some point when the difference between the number of buckets you've got and the number of entries you're tracking gets large enough, you'll want to increase the bucket-count of your table |4|. Typically, this means doubling the number of buckets, and going back through your existing ones to re-hash entries into their new positions. You'd want to do that on insertion, which means that the Haskell version would probably have the easiest time with the change; conceptually, you're returning a new hash table each time you "insert" anything.

With that out of the way, lets step through these and compare.

Constructors

(defclass my-hash-table ()
  ((bucket-count :reader bucket-count :initarg :bucket-count)
   (hash-fn :reader hash-fn :initarg :hash-fn)
   (comparison-fn :reader comparison-fn :initarg :comparison-fn)
   (buckets :reader buckets :initarg :buckets)))

(defun make-my-hash-table (&key (bucket-count 16) (hash-fn #'my-hash) (comparison-fn #'equal))
  (make-instance
   'my-hash-table
   :bucket-count bucket-count
   :hash-fn hash-fn
   :comparison-fn comparison-fn
   :buckets (make-array bucket-count :initial-element nil)))

class my_hash_table:
    def __init__(self, buckets=16, hash_fn=my_hash):
        self.bucket_count = buckets
        self.hash_fn = hash_fn
        self.hashmod = lambda thing: self.hash_fn(thing) % self.bucket_count
        self.buckets = [[]] * self.bucket_count
...

data HashTable a b = Hash { tblBuckets :: Array Integer [(a, b)]
                          , tblBucketCount :: Integer
                          , tblHashFn :: a -> Integer
                          }

empty :: Integer -> (a -> Integer) -> HashTable a b
empty bucketCount hashFn = Hash { tblBuckets = array (0, bucketCount-1) $ zip [0..(bucketCount-1)] $ repeat []
                              , tblBucketCount = bucketCount
                              , tblHashFn = hashFn
                              }

Those are the declarations and constructors for each language. Python gets to cheat a bit because it has a Smalltalk-inspired object system in which a class owns methods, so its full declaration would also contain the getter and setter we'll go over in a moment. The Common Lisp is obviously the wordiest, mainly because defclass makes you explicitly declare readers, but also partially because it has an additional slot. I'm choosing to follow the tradition of the built-in CL hash table and let the user specify a comparison function. It's relevant, because you can compare things on a value or identity basis, and you might want to do either |5|.

In the Python variant, I decided to do the hash/mod composition once in the constructor. It's not entirely necessary; if you take this approach, you'll need to replace your hashmod slot when re-sizing the table. If you don't, you'll need to type it out twice; once in the insertion function/method and again in the lookup. I did the same thing in the Common Lisp and Haskell versions, but those don't suffer the same drawbacks thanks to CLOS and a pure functional approach respectively. And again, you could do something similar by defining a hashMod function in Python, that just doesn't seem in-character for the language.

The Haskell version is pretty short and sweet thanks to the record syntax. It's also the only one that clearly states what the hash function is expected to be; a function from your key type to Integer|6|, as well as being the only one to enforce homogeneous key and value types. It sounds mildly annoying, but I can't really think of a situation where I'd really want heterogeneous ones, so it doesn't sound like a big deal.

Lookup

(defmethod get-hash ((table my-hash-table) key)
  (let ((cmp (comparison-fn table)))
    (loop for (k . v) in (aref (buckets table) (hash-mod table key))
       when (funcall cmp k key) return (values v t)
       finally (return (values nil nil)))))

...
    def get(self, key):
        for (k, v) in self.buckets[self.hashmod(key)]:
            if k == key:
                return v
        raise KeyError("Key '" + str(k) + "' not found...")
...

get :: Eq a => HashTable a b -> a -> Maybe b
get hash key = lookup key bucket
    where bucket = buckets ! hashMod hash key
          buckets = tblBuckets hash

This is where we'll see possibly the biggest difference between the language approaches; if you look semi-closely, you'll notice that they have completely different return tendencies. I tried to follow the in-language style for each, so this is really more a reflection of the attitudes of the communities |7|. The Python version either finds the key it's looking for or throws a run-time exception. The Haskell variant returns a Maybe, which means its actual result will either be Just <some value> or Nothing. Finally, the CL version returns a value or NIL, and returns a second value that specifies whether the first value represents a success. This is to disambiguate the situations where you might want to store NIL as a value in your table; if we didn't specify, you wouldn't know whether that represented "Found the value NIL at the specified key" or "Failed to find the specified key".

I'm not sure how to feel about that difference.

The Python approach sounds the least appetizing, but the other two sound fairly similar. In the Haskell variant, you're being very explicit that this computation might fail, and you're forcing the caller to deal with that. In the Common Lisp variant, you're setting up the situation so that if a user of your data structure doesn't care about storing NILs in their table, they can easily ignore the part that deals with that ambiguity. Some implicit knowledge is still required here though; you need to know that the second value exists and you need to know that it'll be T on a lookup success and NIL on a lookup failure, and you need to know that Lisp functions typically return NIL in the case where they fail rather than using the condition system.

When you say it like that, it sounds like the Haskell approach is outright better, but I've seen enough aspiring Haskellers get tripped up over the use of Maybe that I'm not so sure.

Setting

(defmethod set-hash! ((table my-hash-table) key value)
  (let ((h (hash-mod table key))
        (cmp (comparison-fn table)))
    (setf (aref (buckets table) h)
          (cons (cons key value)
                (remove-if (lambda (k/v) (funcall cmp (car k/v) key))
                           (aref (buckets table) h))))
    value))

...
    def set(self, key, value):
        h = self.hashmod(key)
        tup = (key, value)
        self.buckets[h] = filter(lambda pair: pair[0] != key, self.buckets[h])
        self.buckets[h].append(tup)

set :: Eq a => HashTable a b -> (a, b) -> HashTable a b
set hash (k, v) = hash { tblBuckets = buckets // [(ix, newBucket)] }
    where newBucket = insertOrReplace bucket (k, v)
          ix = hashMod hash k
          insertOrReplace b (k, v) = (k, v):filter ((/=k) . fst) bucket
          bucket = buckets ! ix
          buckets = tblBuckets hash

Setting approaches are fairly equivalent, with the obvious giant disclaimer that the Haskell version is trying to be functional, so it returns a new HashTable with the appropriate values changed/inserted rather than mutating anything. This means that the return value is, in fact, HashTable a b rather than the newly inserted value as in the Common Lisp variant, or None as in the Python variant.

That's really the only difference here though. Each of the setters takes a table|8|, a key and a value. It hashes and mods the key to figure out which bucket we're dealing with, then it adds the k/v pair to the result of filtering that bucket on key|9|. The Common Lisp and Haskell versions both cons onto the front of said list, where the Python version uses the append method which pushes to the back of a list. I guess I could have used insert(0, tup) instead, but as I understand, inserting to the beginning of the list tends to perform significantly worse than appending to the end. My intuition says it's not worth bothering with, since the only thing it might affect is get/set performance on recently inserted keys in large buckets, and you kind of want to work to avoid large buckets in general.

Ok, on to the next thing.

Trie

This is just an interesting data structure I came across while reading the rest of the tree stuff. It's fairly interesting because it can search for things by prefixes, making it pretty handy in situations where you want string completion. The only real concrete use case I'm thinking of for them at the moment is storage for titles/names at my blog |10|, so that you'd be able to navigate to one by knowing only the prefix of the appropriate title.

It's not as commonly known as a hash table, so here's a demo interaction first:

TRIE> (defparameter *trie* (empty))
*TRIE*
TRIE> (insert! "test" *trie*)
(NIL NIL (#\t NIL (#\e NIL (#\s NIL (#\t "test")))))
TRIE> (insert! "testing" *trie*)
(NIL NIL
 (#\t NIL (#\e NIL (#\s NIL (#\t "test" (#\i NIL (#\n NIL (#\g "testing"))))))))
TRIE> (insert! "trie" *trie*)
(NIL NIL
 (#\t NIL (#\r NIL (#\i NIL (#\e "trie")))
  (#\e NIL (#\s NIL (#\t "test" (#\i NIL (#\n NIL (#\g "testing"))))))))
TRIE> (insert! "banana" *trie*)
(NIL NIL (#\b NIL (#\a NIL (#\n NIL (#\a NIL (#\n NIL (#\a "banana"))))))
 (#\t NIL (#\r NIL (#\i NIL (#\e "trie")))
  (#\e NIL (#\s NIL (#\t "test" (#\i NIL (#\n NIL (#\g "testing"))))))))
TRIE> (insert! "batman" *trie*)
(NIL NIL
 (#\b NIL
  (#\a NIL (#\t NIL (#\m NIL (#\a NIL (#\n "batman"))))
   (#\n NIL (#\a NIL (#\n NIL (#\a "banana"))))))
 (#\t NIL (#\r NIL (#\i NIL (#\e "trie")))
  (#\e NIL (#\s NIL (#\t "test" (#\i NIL (#\n NIL (#\g "testing"))))))))
TRIE> (member? "batman" *trie*)
"batman"
TRIE> (member? "Gabarone" *trie*)
NIL
TRIE> (completions-of "t" *trie*)
("trie" "test" "testing")
TRIE> (completions-of "te" *trie*)
("test" "testing")
TRIE> (completions-of "test" *trie*)
("test" "testing")
TRIE> (completions-of "testi" *trie*)
("testing")
TRIE> (completions-of "G" *trie*)
NIL
TRIE> (completions-of "" *trie*)
("trie" "test" "batman" "banana" "testing")
TRIE>

I'm still not sure if the right thing to return on insertion is the entire trie. I guess the newly inserted value would be more in character for Common Lisp. Also, because of that use case I mentioned, I didn't bother implementing a delete! either |11|.

(defun empty () (list nil nil))

(defun insert! (str trie)
  (labels ((recur (lst trie)
             (if lst
                 (let ((res (assoc (car lst) (cddr trie))))
                   (if res
                       (recur (cdr lst) res)
                       (let ((new (list (car lst) nil)))
                         (push (recur (cdr lst) new)
                               (cddr trie)))))
                 (setf (cadr trie) str))
             trie))
    (recur (coerce str 'list) trie)))

(defun lookup-to (lst trie)
  (cond ((null trie) nil)
        ((null lst) trie)
        (t (lookup-to (cdr lst) (assoc (car lst) (cddr trie))))))

(defun member? (str trie)
  (cadr (lookup-to (coerce str 'list) trie)))

(defun completions-of (str trie)
  (let ((level (list (lookup-to (coerce str 'list) trie)))
        (res nil))
    (loop while level
       do (setf level
                (loop for lst in level
                   for val = (cadr lst)
                   when val do (push val res)
                   append (cddr lst)))
       finally (return (nreverse res)))))

This is the absolute minimal version. I didn't even bother writing up a definition for a trie node, whether explicitly or implicitly through some function definitions. Basically, one looks like this

(letter . value . children)

The root node has NIL as the "letter", otherwise that's the letter you'd have to look up in the previous level to get this far. All terminal nodes have values in the second slot, while non-terminals have NIL|12|. Child nodes are further tries that contain the next possible letters of a string at this point. While the toy example above just keeps a full word at each terminal node, in general there's more than that. It might be the word, a definition and a likelihood of its appearance in a block of text. If I were concerned only with keeping words around, a DAWG would do better. Remember, my use case is navigating articles by title prefixes, so my real implementation is going to have a title and a pointer to the article body at minimum. But I'm getting ahead of myself.

The empty trie is the list of two NILs; one representing the "letter" and the other representing the value. The "letter" needs to be accounted for in the situation where you're asking for the completions-of the empty string. In this case, NIL makes sense because coercing the empty string to a list gives you the empty list (NIL).

(defun completions-of (str trie)
  (let ((level (list (lookup-to (coerce str 'list) trie)))
        (res nil))
    (loop while level
       do (setf level
                (loop for lst in level
                   for val = (cadr lst)
                   when val do (push val res)
                   append (cddr lst)))
       finally (return (nreverse res)))))

Which means that the (list (lookup-to (coerce str 'list) trie)) up there will return the whole trie when you pass "" as str. And actually, you need to know how lookup-to works to be assured of that

(defun lookup-to (lst trie)
  (cond ((null trie) nil)
        ((null lst) trie)
        (t (lookup-to (cdr lst) (assoc (car lst) (cddr trie))))))

If we ever run out of trie to search, the given list isn't in it. If we ever run out of lst, we return the rest of the trie, since that's the place we stopped. We don't just look up the word here because our initial input might have been a partial word, which means we really want to manipulate the remaining tree of nodes, not just a single terminal one. Finally, if we're out of neither lst nor trie, we call lookup-to again with the rest of lst and the appropriate child branch of the current trie.

Back up to completions-of

(defun completions-of (str trie)
  (let ((level (list (lookup-to (coerce str 'list) trie)))
        (res nil))
    (loop while level
       do (setf level
                (loop for lst in level
                   for val = (cadr lst)
                   when val do (push val res)
                   append (cddr lst)))
       finally (return (nreverse res)))))

Once we get back the result of lookup-to, we establish an initially empty result list, res, then doing a breadth-first traversal of our lookup result. That traversal pulls out every non-NIL value it finds into res, and finally returns the in-place reversal of res|13|.

insert!ing a fresh string isn't too much more interesting.

(defun insert! (str trie)
  (labels ((recur (lst trie)
             (if lst
                 (let ((res (assoc (car lst) (cddr trie))))
                   (if res
                       (recur (cdr lst) res)
                       (let ((new (list (car lst) nil)))
                         (push (recur (cdr lst) new)
                               (cddr trie)))))
                 (setf (cadr trie) str))
             trie))
    (recur (coerce str 'list) trie)))

You start at the root of the trie and the beginning of your string coerced to a list. Each time through, if you're out of string, you're done and need to set the value of the current trie node. Otherwise, you need to look the next letter up in the current trie level. If you find an entry, you recur with the rest of your string and that entry. If there is no such entry, you insert the remainder of the list.

This could probably use some tweaking. Again, I'm not sure the right thing to return here is the modified trie. It would also be fairly simple to optimize the case where you know the rest of your string is fresh, and not too much harder to write a version of this that neither coerces its string to a list nor uses recursion.

The last function we need to look at is member?.

(defun member? (str trie)
  (cadr (lookup-to (coerce str 'list) trie)))

Which is fairly straight-forward. It just uses lookup-to to traverse to the end-point of the word we're interested in, then checks if the value at that node is non-NIL. If it is NIL, then the given string is not a member of this trie.

So that's what I get down to during a fun |14| vacation. I sit down at a park or balcony or something and read up on data structures. I'm honestly not sure if this'll end up making me a better programmer in the long run, but it's fun regardless. As always, I'll let you know how it goes.

Footnotes

1 - |back| - It doesn't always work, incidentally. Take a look at this example of an exploit that happens when you let clients choose arbitrary keys for your tables. Hashing also doesn't give you much of a performance boost if you force a low number of buckets, or use a hash function that doesn't sufficiently distribute your keys.

2 - |back| - Ok, no, you don't need to; the linked-list approach is called "separate chaining" and you can avoid it by using an open addressing strategy or one of the others listed. You do, however, need to have one of "a perfect hash function" or "collision resolution", otherwise your hash table won't quite do what you think it should.

3 - |back| - And vice versa. The best recommended pair seems to be one of the functions due to Bob Jenkins, but I can't cram either into my head at the moment. 4 - |back| - And hence re-hash all the entries.

5 - |back| - Ok, yes, you can do the same thing in Python using == and is, but that would put me in the territory of expecting something out of operators as input. No idea how Pythonic that is, so I'm leaving it out, but I wanted to note that you could do the same here. Haskell, being a pure functional language, doesn't have to worry about pointer comparison at all.

6 - |back| - Which is what we're using as indices into our Array of buckets.

7 - |back| - Or at least, the attitudes of the core library developers/language maintainers.

8 - |back| - Granted, the Python version does so implicitly, but still.

9 - |back| - To remove any old values of key that might be present.

10 - |back| - Though I'm thinking that URI routing in House might eventually benefit from this storage mechanism too. I'd just need to figure out where the rough patches are and figure out how to mitigate them. I'll let you know how that goes once I get to it.

11 - |back| - If you do it yourself, just keep in mind that you can't do the really naive thing of deleting every element of the given string. In particular, if I did (delete "test" *trie*) given the demo code above, that should delete the word "test", but should absolutely not delete "testing". Left as an exercise for the reader, for now.

12 - |back| - In the case of a trie, "terminal" doesn't mean "there are no more nodes after this", it means "this node marks the end of a word". As an example, in the case of a trie that contains only the words "test" and "testing"; the second #\t is going to be a terminal node with the value "test" that does have further children.

13 - |back| - To compensate for the fact that we've been pushing onto a singly linked list.

14 - |back| - For my definition of "fun".