
Louis Kessler's Behold Blog

Getting the Timing Right - Sun, 26 Jul 2015

One of my concerns about switching from in-memory data structures to an on-disk database has been with regards to speed. Memory is faster than disk. Writing to disk will slow things down.

But it better not slow things down too much. For normal-sized genealogy files (say up to 10,000 people), everything should be fast and smooth. Larger files of 100,000 or more people should degrade gracefully. But they should not take forever.

As I have learned through my implementation of Behold’s database during the past few weeks, writing databases to disk one record at a time is sloooooow.

To get timings, I used to use AQtime, a third-party profiler that counts and times each line of your executable without adding much overhead at all. It worked like magic for me. But after moving to Delphi XE8, I would need to upgrade to their new version, and the cost plus a horrendous interface with the Delphi development environment just wasn’t worth it to me.

Instead I found a wonderfully simple type in Delphi called TStopwatch (in the System.Diagnostics unit) that gives accurate timings. You can create as many as you want and just wrap them around your code, e.g.:

var
  Stopwatch1, Stopwatch2: TStopwatch;  // from System.Diagnostics
begin
  Stopwatch1 := TStopwatch.StartNew;   // times the whole block
  Stopwatch2 := TStopwatch.StartNew;   // times only selected sections
  { code }
  Stopwatch2.Stop;
  { more code }
  Stopwatch2.Start;
  { even more code }
  Stopwatch2.Stop;
  { and more code }
  Stopwatch1.Stop;
  // then read Stopwatch1.ElapsedMilliseconds and Stopwatch2.ElapsedMilliseconds
end;

Testing inside a routine that is called 10,000 times, I get a timing of 229 milliseconds (ms), and rerunning the routine doesn’t vary from that by more than about 5 ms. That is more than accurate enough to compare the speed of different implementations for optimization purposes.
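For anyone who wants to try the same nested-stopwatch pattern outside Delphi, here is a rough Python translation. The Stopwatch class and the dummy workloads are my own illustration (not Behold code), using time.perf_counter:

```python
import time

class Stopwatch:
    """Minimal stopwatch that can be stopped and restarted, accumulating time."""
    def __init__(self):
        self.elapsed = 0.0
        self._start = None
    def start(self):
        self._start = time.perf_counter()
    def stop(self):
        self.elapsed += time.perf_counter() - self._start
        self._start = None

sw_total = Stopwatch()   # times the whole routine
sw_inner = Stopwatch()   # times only selected sections

sw_total.start()
sw_inner.start()
sum(range(100_000))      # "code"
sw_inner.stop()
sum(range(100_000))      # "more code" (not counted by sw_inner)
sw_inner.start()
sum(range(100_000))      # "even more code"
sw_inner.stop()
sw_total.stop()

print(f"total: {sw_total.elapsed*1000:.2f} ms, inner: {sw_inner.elapsed*1000:.2f} ms")
```

Since the inner sections are subintervals of the outer one, the inner total can never exceed the outer total, which is a handy sanity check on the instrumentation.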

So I used Stopwatches to see how quick I could get SQLite to be and determine how to best implement it. I used two files. In the first, I loaded 33,790 INDI (people) records from a GEDCOM file. In the second I loaded 198,522 INDI records.

What I was comparing against, I knew, was an unattainable goal: the B* tree I had created and super-optimized to index and store the INDI records in memory in Behold. A B* tree is a balanced version of a B-tree, which is a generalization of a binary search tree that allows many children per node rather than just 2. For the two files, my B* tree took 53 ms and 459 ms.

Next I did a straight insert of the same information into a disk-based SQLite database. It took 1530 ms and 10985 ms. Well, that’s quite a bit longer, and 10 seconds is too long for someone to wait when it only took half a second before.

The default way a database writes information is one record at a time. What was needed was to batch the transactions using a technique called Array DML. Once I did that, I was able to get times down to 176 ms and 1012 ms.
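The record-at-a-time versus batched difference is easy to demonstrate outside Delphi as well. Here is a sketch using Python’s sqlite3 module, where executemany plus a single commit plays roughly the role that FireDAC’s Array DML played for me; the table and data are made up for illustration:

```python
import sqlite3

rows = [(i, f"Person {i}") for i in range(10_000)]  # stand-in for INDI records

conn = sqlite3.connect(":memory:")  # a file path here would exercise the disk
conn.execute("CREATE TABLE indi (id INTEGER PRIMARY KEY, name TEXT)")

# Slow pattern: one INSERT per record, each followed by its own commit.
# for rid, name in rows:
#     conn.execute("INSERT INTO indi VALUES (?, ?)", (rid, name))
#     conn.commit()

# Fast pattern: hand the driver all the parameter rows at once,
# inside a single transaction (the "with" block commits on exit).
with conn:
    conn.executemany("INSERT INTO indi VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM indi").fetchone()[0]
print(count)  # 10000
```

The win comes from two things: the statement is prepared once and re-executed, and SQLite flushes to disk once per transaction rather than once per record.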

That was better. Still not as good as my beautiful B* tree, but acceptable when disk writes are involved.

Still, something bothered me. Loading the DML array took half of the total time, yet it was a simple assignment of all the data, one by one, into the array. No database work was going on. In the two tests, it was loading a 20 MB array and a 100 MB array, entirely in memory. That should be much faster than it is, and shouldn’t take as long as writing the data to disk.

I wasn’t sure what was going on. The array was a Delphi object known as a TCollection; technically, a collection of collections of variants. So it was two-dimensional: one dimension for the INDIs and one for the 16 fields of each INDI.

I played around a bit but was not able to find a way to load that TCollection any faster. I even posted a question on Stack Overflow (check it out if you want to get into the technical details with more timings).

By the way, all the above timings were compiled to 64-bit. Remember when I blogged a few days ago that 64-bit was slower than 32-bit? Well, that’s not true for the database. My B* tree is 25% slower in 64-bit, but the database writes are 15% faster despite using 35% more memory.

I’m going to do a bit more research into this, but I’m not going to let it side-track me. Tomorrow I’ll start the work to get Behold writing to and using the database.

Blog and Forum Was Down For A Bit - Thu, 23 Jul 2015

Last night I noticed a problem in the Behold Forum where a post was not always showing up. I attempted to fix it, with somewhat disastrous results.

Unbeknownst to me, I crashed both the Forum and the Blog at 11:45 p.m., and it took me a few hours today to bring them back up.

I still don’t know exactly what happened, but I’m going to do some investigation and will attempt to ensure everything is working.

So there may be a few short periods of disruption with the blog and forum over the next couple of days, though hopefully not.

----

Update 8 hours later: I’ve cleaned everything up and the Blog and Forum should both be running nicely now.

Bonus: I found and fixed a problem so that searches for exact strings in double quotes now paginate properly. This had been bugging me for years, and I finally found what it was and fixed it. PHP evaluates strings within double quotes, and when the search term is a double-quoted string, PHP eats it up. The fix was to pass the query string through the htmlspecialchars function to prevent the evaluation. GenSoftReviews searches had the same problem, and I’ve fixed it there as well.
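For illustration, here is the shape of that quoting bug sketched in Python rather than PHP. Python’s html.escape is the counterpart of htmlspecialchars, and the pagination-link markup below is my own made-up example, not the actual blog code:

```python
from html import escape

query = '"john smith"'   # an exact-phrase search entered by a user

# Embedding the raw term in a link: the inner double quotes
# terminate the href attribute early and break the link.
broken = f'<a href="?q={query}&page=2">Next</a>'

# Escaping first turns " into &quot;, so it survives inside the attribute.
fixed = f'<a href="?q={escape(query)}&page=2">Next</a>'

print(broken)
print(fixed)
```

The principle is the same in any language: user input must be escaped before being placed inside an HTML attribute.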

PHP “evaluates” double-quoted strings but does not evaluate single-quoted strings. Another unexpected, inconsistent feature. This is yet another example of why PHP is the world’s worst programming language.

[Image: PHP is a tool like this one, from Jeff Atwood’s article]

Always Learning New Programming Concepts and Methods - Mon, 20 Jul 2015

Most of the new concepts and tools for programming didn’t exist when I graduated from university with a Master’s in Computer Science in 1980. The programming languages I worked with were FORTRAN, APL, PL/I and even Assembler. I picked up Pascal on the job, and that led me to Turbo Pascal for personal use and then to Delphi with its object-oriented version of Pascal.

Delphi’s object orientation, introduced in 1995, was to me a totally new concept I had to pick up. I never really needed to learn it in detail, since I never wrote low-level objects other programmers would consume, but I had to learn enough about it to consume other programmers’ objects from the packages they wrote.

Also during the 1990s came the development of Unicode, which finally made every language available in one character set. Windows support became widespread around 2000 with the Windows NT line and XP. I have several books on it in my library (e.g. Unicode: A Primer) attesting to the time I spent learning about it. Unicode was a necessity for genealogy software, and I had to buy third-party packages to support it in Behold until Delphi 2009 came out, which finally included it.

Now with Delphi XE8, Generics have really taken hold. One of my specialties has always been data structures and algorithms. Sometimes I just can’t help myself, as I really enjoy getting down and dirty with a good algorithm, just as I did a couple of years ago for Ancestral Loops. But with Generics, many of these methods have become much easier, and with them I will be able to replace my custom-developed B-tree and the third-party hash tables that I had been using in Behold.
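To give a feel for what generics buy you, here is a rough Python analogue; the Index class and its contents are my own illustration, not Behold code. The idea is one container definition, parameterized over its key and value types, standing in for what a hand-rolled hash table once did:

```python
from typing import Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class Index(Generic[K, V]):
    """A tiny keyed store - the role a custom hash table
    might previously have been written by hand to fill."""
    def __init__(self) -> None:
        self._items: dict[K, V] = {}
    def put(self, key: K, value: V) -> None:
        self._items[key] = value
    def get(self, key: K) -> V:
        return self._items[key]

# The same generic class works for any key/value types:
by_id: Index[int, str] = Index()
by_id.put(1, "INDI @I1@")
print(by_id.get(1))
```

Delphi’s generic containers (such as its generic list and dictionary types) work on the same principle: write the structure once, instantiate it for each concrete type you need.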

As far as databases go, they were in their infancy when I was at university. I do recall taking a course on normalizing relational databases, which means analyzing data and organizing it into its primary records based on the common keys that determine specific fields. Years later, I refreshed that knowledge when I had to do that normalization for work-related projects.

I had the opportunity to work with and gain expertise in a hierarchical database system called FOCUS by Information Builders. FOCUS was around before Oracle and Microsoft Access became the big fish. I worked a bit with Access, and had samplings of Oracle and even a bit of the SAP Data Warehouse.

My personal website requirements led me to WordPress, the PHP language, its MySQL database, and the six months of my life I spent customizing my blog and forum. If you lived within a few blocks of me during that time, you may have often heard me spouting expletive-deleteds due to my continued frustrations with PHP (which I agree is the second worst programming language in the world; the worst one is obvious if you remember Job Control Language (JCL). Do watch Mark Rendle’s great talk, The Worst Programming Language Ever).

But now I’ve got two technologies to take on that are new for me:

1: The SQLite database was first designed in 2000. Since then, it has become the most widely deployed database engine in the world; billions of copies exist because it is in just about everything you use. One major genealogy software package, RootsMagic, uses it. And it is my choice for Behold’s database, primarily due to its speed, size and universality.

2: Delphi’s new FireDAC framework. This allows high-level and universal access to SQLite or any database. It is like the middleware that gets Delphi to play nice with SQLite.

I’m learning these as we speak. A couple of months of intensive implementation work and I should gain fairly extensive knowledge of both.

After that, I’ll have more learning coming, including JSON, APIs, Windows 10, …

So don’t ever think any programmer knows everything about programming. Knowledge grows old fast. And new technologies rise faster than spring flowers.

This continual learning is what makes programming so much of a challenge … and so much fun.