With optimization of the “smaller” 95,000 person GEDCOM resulting in a load time of less than one second, I didn’t think that simply counting the number of tags by type of tag should add 2 additional seconds to the time.
There are over 1.4 million tags to count into 100 different tag types. I also count separately whether they are record tags, link tags, or data tags. I’ve been using the b-tree structure that is one of the oldest things in Behold, around since the beginning about 10 years ago. It has withstood the test of time and I thought it was pretty efficient … until now.
So it was time to finally replace it with a hash table implementation. No matter how good my b-tree is, it still takes on the order of log(n) for each operation, where n is the number of items in the tree. But a hash table takes on the order of (1), i.e. it doesn’t matter how big the tree is, it always takes about the same amount of time.
What a hash function does is it takes a string, e.g.: INDI, and moves bits around (as efficiently as possible since you’ll do this zillions of times) to create a fairly random number from it. Then you whittle it down to a number from 1 to the size of your hash table and that is the location in the table where the information relating to the INDI tag should be. If two of your tags “collide” on the same location, then there are various techniques of handling it. The bottom line is that as long as your hash table has enough room to easily store all your entries with few collisions, then the hash table implementation will be very efficient.
There were several hash table implementations available for Delphi, but I decided on Primoz Gabrijelcic’s GPStringHash. I like his work, and he’s obviously put some thought into making it efficient as the hash function is coded in inline assembler. He’s also the author of GpProfile that I’ve used for years and he’s been quite active on Stackoverflow.
The bottom line is his hash function reduced the 1.4 million lookups from 1.4 seconds down to .35 seconds. That makes the hash 4 times faster than the b-tree, and this is for a small table of only few hundred entries! I was surprised that the improvement was that much.
I also found a way to eliminate all but .05 out of the .60 seconds of overhead in the increment procedure. Bottom line is an 80% reduction from 2.0 seconds down to .40 seconds. I still think .40 seconds is a too much time to just count 1.4 million items, but it will have to do for now.
I will be using this hash function instead of my old b-tree for most of my other internal data structures as well, which should result in additional speed improvements.