
Louis Kessler’s Behold Blog

Database Design for Genealogy Data - Tue, 14 Jul 2015

There are several considerations as I structure the disk-based database for Behold. I want it to be capable of loading genealogy data from any source, I want it to be flexible, but most importantly I need it to handle the in-memory structures already in Behold. Let’s go through these considerations:

Data from any Source

Behold already reads GEDCOM. GEDCOM is the long-time standard used for transmission of genealogy data. Almost all genealogy programs can read from and write to GEDCOM files to at least some degree. Most programs don’t make full use of GEDCOM’s capabilities and some make many mistakes writing to it.

GEDCOM’s file structure includes records for a Person, Family, Note, Source, Repository, Multimedia, Submitter and Submission. Notably absent are records for a Place and an Event, which I’ll need for Behold. GEDCOM does allow custom tags, and many vendors have included custom level zero records in their GEDCOM output. That means just about anything goes, but no other program will read one program’s custom tags, except in the rare cases where a program goes through the work to support them.

Family Historian is one of the few genealogy programs that uses GEDCOM as its file format. Internally, it likely loads everything into a more efficient in-memory data structure like Behold does, but it only writes its data to a file as GEDCOM. And if I’m not mistaken, Family Historian still uses GEDCOM 5.5, which was superseded by GEDCOM 5.5.1, the version that PAF used and that is the de facto standard.

There was an Event-Oriented GEDCOM developed by COMMSOFT and used in Ultimate Family Tree until Ancestry purchased it and discontinued it. They put the Event record front and centre in that structure.

There are some extensions to GEDCOM, most notably GEDCOM 5.5EL developed and supported by a group of genealogy program authors that has added a Place record and some other goodies.

The FHISO organization is working to produce an updated standard. That is probably a long time off, but when they do, that will be a new standard I’ll want to read.

Other programs: I’m not saying I’m going to get Behold to read from every other program out there, but it is very good to look at any published data file documentation that other programs have. I at least have to make sure that Behold will have the structures in place so that it can store and display any relevant genealogy data that any program may have. Looking at other programs’ file structures may give me other good ideas on ways to design Behold’s file.

Some of the programs with the most interesting data files I’ve looked at include:

  • RootsMagic – the database is SQLite, which is what I’ll be using. There are many SQLite viewers so their database can be easily inspected. In fact, there’s a SQLite Tools for RootsMagic group at wikispaces that publishes the RootsMagic data structure, provides tools, and helps people write their own custom queries from the RootsMagic data file.

    RootsMagic’s data structure includes standard records for: Person, Family, Place, Event, Source, Citation and Multimedia. There are some tables for subrecords (e.g. Name, Address). The most interesting records in RootsMagic which Behold should be capable of supporting include Groups, Research/To-Do, Source Templates, Exclusions (problem list items that are not a problem) and Witnesses.

  • Gramps – provides a full API (Application Programming Interface) to the Gramps database. The main records are Person, Family, Place, Event, Source, Citation, Repository, Media Object and Note.

    I think the Gramps people did a superb job of setting up a well designed layout with just the right fields in each record. I particularly like the idea of including on each record, fields for:  private, change, and a list of links to all the connections (rather than making them separate tables as most programs do). If I was starting anew and wanted to just “borrow” a database structure, this might be the one I would pick.

  • The Master Genealogist (TMG) – Bob Velke discontinued TMG last year. TMG was considered to be one of the more advanced genealogy programs when it was first created, and it had to have data structures to support its advanced features. In August last year, the file structures of TMG 9 were published to allow other vendors to build import utilities so that TMG’s users could migrate to a supported program. Lee Hoffman put up a nice summary of the TMG File Structures up to TMG 7.

    Some of TMG’s more unique records include: Custom Flags, DNA information, Focus Groups, Participants/Witnesses and Timeline Locks.

Then there are the online Family Tree Services that you subscribe to and maintain your family tree on their websites. Most of these sites provide APIs to allow programmers to read from and in some cases write to their service. The APIs will give an indication as to the structure of their data. Some of the most notable include:

  • MyHeritage – They have complete documentation online for their API, which they call the Family Graph API (surprisingly still with Beta designation on it). I heard Uri Gonen talk about it at RootsTech 2014. I tweeted then that I was very impressed and that MyHeritage is doing a lot of things right. Of all the APIs I’ve seen, theirs is the most straightforward and looks like it might be the simplest to implement.

    Some of the unique objects in their API include information related to their SmartMatch system, including a Matching Request, MatchesCount and ConfirmationStatus.

  • FamilySearch – They also have their documentation available online for developers. They also have their GEDCOM X Data Objects defined which describe the actual data structure used in their Family Tree. 

    Of most interest here is their inclusion of Discussions and their concept of Memories.

There are other online services with APIs, including Geni, Genealogie Online, WikiTree, GenealogyCloud and the DNA testing company 23andMe to name a few.

I should add that I really like Tim Forsythe’s online Gigatrees service. He doesn’t give a complete description but does at least map out the data structure he calls GREnDL. He attempts to do the correct thing and be source-based, so at the center of his model, he has the “Claim” record.

There are others who have developed data structures with different ideas that I keep aware of, like Tony Proctor (STEMMA), and Tom Wetmore (DeadEnds).


Flexibility

Behold is a flexible GEDCOM reader. It knows the rules of GEDCOM but reads in and can display anything that has the GEDCOM level/tag/value structure, even if the input doesn’t follow GEDCOM’s rules.

So yes, there is valid data and there is invalid data. But whatever is input must be preserved in some way. It cannot be thrown away.

The programs that convert data they don’t recognize into notes are sometimes criticized because they lose understanding of the data. But that is much better than not inputting the data at all. In the former case, the data can at least be exported again, but in the latter case the data is lost forever.

What must be done is to try to recognize all GEDCOM constructs and record the data as intended. If the construct as input is illegal, then maybe it can be translated to a legal construct. If not, maybe a custom tag can be used. If not, then turn it into a note.
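The fallback chain described above (recognize, translate, custom tag, note) can be sketched roughly as follows. This is only an illustration and not Behold's actual code: the `KNOWN_TAGS` and `TRANSLATABLE` tables and the `classify` function are hypothetical, and the sample mappings are made up. The only GEDCOM convention relied on is that custom tags begin with an underscore.

```python
# Hypothetical tables for illustration only; not Behold's actual rules.
KNOWN_TAGS = {"NAME", "BIRT", "DEAT", "DATE", "PLAC", "NOTE"}
TRANSLATABLE = {"BIRTH": "BIRT", "DEATH": "DEAT"}  # assumed illegal-to-legal mappings


def classify(tag, value):
    """Return (action, tag, value) so that no input is ever thrown away."""
    if tag in KNOWN_TAGS:
        # A recognized construct: record the data as intended.
        return ("store", tag, value)
    if tag in TRANSLATABLE:
        # An illegal construct that maps onto a legal one.
        return ("translate", TRANSLATABLE[tag], value)
    if tag.startswith("_") and tag.lstrip("_").isalnum():
        # Already a custom tag: keep it unchanged.
        return ("custom", tag, value)
    if tag.isalnum():
        # Unknown but well-formed: turn it into a custom tag.
        return ("custom", "_" + tag, value)
    # Last resort: preserve the original text inside a note.
    return ("note", "NOTE", f"{tag} {value}".strip())
```

Whatever branch is taken, the original value survives, which is the point of the approach: the data can always be exported again.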

Whatever is done, the data structures must be able to handle this flexibility. A very rigid structure requiring only certain types of records and fields and not allowing unknown records or unknown fields will have no place to put this data.

GEDCOM’s level/tag/value structure is actually very flexible and can handle just about anything. I plan to make use of it to contain the detail information about each record in the database.
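To illustrate why the level/tag/value structure is so accommodating, here is a minimal sketch of a line splitter that accepts any tag at all, known or not. The function name `parse_line` is an assumption for this example, and the sketch ignores details such as continuation lines and character encoding.

```python
def parse_line(line):
    """Split one GEDCOM-style line into (level, xref, tag, value)."""
    parts = line.strip().split(" ", 2)
    level = int(parts[0])
    xref = None
    if len(parts) > 1 and parts[1].startswith("@"):
        # Record lines carry a cross-reference ID before the tag.
        xref = parts[1]
        rest = parts[2].split(" ", 1) if len(parts) > 2 else [""]
        tag = rest[0]
        value = rest[1] if len(rest) > 1 else ""
    else:
        tag = parts[1] if len(parts) > 1 else ""
        value = parts[2] if len(parts) > 2 else ""
    return level, xref, tag, value


sample = [
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 _CUSTOM anything goes",
]
parsed = [parse_line(text) for text in sample]
```

Because the splitter never checks the tag against a list, an unknown or custom tag such as `_CUSTOM` passes through with its level and value intact.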

Behold’s In-Memory Structures

Behold is already quite well set up to be converted to a file-based database. Its current structures include all the GEDCOM records and that already includes a place record and allows for custom records.

When I added Life Events into version 1.1, events suddenly took on new meaning. They were now not only linked to the individual or family, their place and their sources, but now they were also linked to all the people closely related to the individual. So I’ve come to realize that because the events are referenced so much, it will be better to make Event a top level record.

To generalize, I’m thinking I can collapse most record types into a single table because they have so much in common. My idea may change, but what follows is what I’m going to try to implement.

The main table will likely be something like this:

  1. ID – sequential index number
  2. Type – type of record, from:  INDI, FAM, PLAC, EVEN [TYPE], NOTE, SOUR, SREC, OBJ, RELA (relationship), GRP, TODO, DNA, …
  3. Name (of person, place, source, etc.)
  4. Filenum (the data file it is from)
  5. FileID (the ID of this record in its original file)
  6. Date
  7. Place ID
  8. Detail (in GEDCOM format)
  9. User Reference Number (for future use)
  10. Private (for future use)
  11. Changed date
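As a rough sketch only, the main table above might translate into SQLite along these lines. The table name `record`, the column names, the column types and the sample rows are all assumptions made for this example; the post lists only the fields. Python's standard `sqlite3` module is used here for brevity, whereas Behold itself is written in Delphi.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; a file path would make it disk-based.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE record (
        id        INTEGER PRIMARY KEY,            -- sequential index number
        type      TEXT NOT NULL,                  -- INDI, FAM, PLAC, EVEN, NOTE, SOUR, ...
        name      TEXT,                           -- of person, place, source, etc.
        filenum   INTEGER,                        -- the data file it is from
        file_id   TEXT,                           -- ID of the record in its original file
        date      TEXT,
        place_id  INTEGER REFERENCES record(id),  -- places live in the same table
        detail    TEXT,                           -- detail kept as GEDCOM level/tag/value text
        user_ref  TEXT,                           -- for future use
        private   INTEGER DEFAULT 0,              -- for future use
        changed   TEXT                            -- changed date
    )
""")

# Made-up sample data: one place and one person who points at it.
place_id = conn.execute(
    "INSERT INTO record (type, name) VALUES ('PLAC', 'Example Town')"
).lastrowid
conn.execute(
    "INSERT INTO record (type, name, filenum, file_id, date, place_id, detail) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("INDI", "John /Smith/", 1, "@I1@", "1 JAN 1900", place_id, "1 SEX M"),
)

# Joining the table to itself resolves the person's place name.
row = conn.execute(
    "SELECT p.name, pl.name FROM record p "
    "JOIN record pl ON pl.id = p.place_id WHERE p.type = 'INDI'"
).fetchone()
```

Collapsing the record types into one table means a place reference is just another row ID, at the cost of a self-join when the name is needed.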

There will also need to be some sort of a Partner/Parent/Children table linking people with their partners and people with their children and the partner start relationship event and the child birth/adoption/fostering events. I still have some thinking to do for this structure. But it will be based on the internal structure now in Behold and may look something like this:

  1. LinkType – type of link (to partner and/or family, to child and/or parents)
  2. Person ID
  3. Family ID
  4. Person Link Detail (in GEDCOM format)
  5. Family Link Detail (in GEDCOM format)
  6. Start Event ID
  7. Next Person ID (next partner or next child)
  8. Next Family ID (next family or next parents)
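One possible reading of the link table above is sketched below. The table name `link`, the `'CHIL'` label, and the interpretation of Next Person ID as a chain that is followed from one child to the next are all assumptions for illustration; the post says this structure still needs more thought.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE link (
        link_type      TEXT NOT NULL,     -- e.g. 'CHIL' (label is an assumption)
        person_id      INTEGER NOT NULL,
        family_id      INTEGER NOT NULL,
        person_detail  TEXT,              -- in GEDCOM format
        family_detail  TEXT,              -- in GEDCOM format
        start_event_id INTEGER,
        next_person_id INTEGER,           -- next partner or next child
        next_family_id INTEGER            -- next family or next parents
    )
""")

# Made-up data: family 1 has children 10, 11 and 12, chained in that order.
conn.executemany(
    "INSERT INTO link (link_type, person_id, family_id, next_person_id) "
    "VALUES (?, ?, ?, ?)",
    [("CHIL", 10, 1, 11), ("CHIL", 11, 1, 12), ("CHIL", 12, 1, None)],
)


def children_in_order(conn, family_id, first_child_id):
    """Follow next_person_id from the first child until the chain ends."""
    order, seen, current = [], set(), first_child_id
    while current is not None and current not in seen:  # guard against bad chains
        seen.add(current)
        order.append(current)
        row = conn.execute(
            "SELECT next_person_id FROM link "
            "WHERE link_type = 'CHIL' AND family_id = ? AND person_id = ?",
            (family_id, current),
        ).fetchone()
        current = row[0] if row else None
    return order
```

Calling `children_in_order(conn, 1, 10)` walks the chain for family 1 starting at child 10.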

The rest of the tables will simply be a combination of indexes to make everything work fast together, e.g.

  • Place index is:  ID (place), ID (non-place)
  • Source index is:  ID (source/source record), ID (anything) 
  • and a bunch of others

When I implement To-Do lists, I can connect them to the repository as I plan to:

  • To-Do index is: ID (repository), ID (to-do)

Life events should be much easier to compute. I’ll have a table of relationships:

  • Person ID, Related Person ID, Relationship

and then the index of life events:

  • Person ID, Related Person ID, Event ID
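A minimal sketch of how the life events index could be derived from the relationships table with a single join follows. The `event_owner` table, which says whose event each one is, is a hypothetical helper that is not in the lists above, and the sample rows are made up. A real version would presumably also restrict events to the person's lifetime.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE relationship (person_id INTEGER, related_person_id INTEGER, relationship TEXT);
    -- Hypothetical helper: which person each event belongs to.
    CREATE TABLE event_owner (event_id INTEGER, person_id INTEGER);
    CREATE TABLE life_event (person_id INTEGER, related_person_id INTEGER, event_id INTEGER);

    -- Made-up data: person 1 has a father (2) and a sister (3); person 4 is unrelated.
    INSERT INTO relationship VALUES (1, 2, 'father');
    INSERT INTO relationship VALUES (1, 3, 'sister');
    INSERT INTO event_owner VALUES (100, 2);
    INSERT INTO event_owner VALUES (101, 3);
    INSERT INTO event_owner VALUES (102, 4);

    -- Every event of a related person becomes a life event of the person.
    INSERT INTO life_event
    SELECT r.person_id, r.related_person_id, e.event_id
    FROM relationship r
    JOIN event_owner e ON e.person_id = r.related_person_id;
""")

life_events = conn.execute(
    "SELECT person_id, related_person_id, event_id FROM life_event ORDER BY event_id"
).fetchall()
```

Event 102 belongs to person 4, who is not related to person 1, so it does not appear in person 1's life events.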

You know. This could actually work out quite nicely. But we’ll see what happens. It is a process that once underway, evolves into what’s necessary.

What About 64 Bit? - Sun, 12 Jul 2015

Yes, I do now have a 64-bit executable of Behold that works exactly the same as the 32-bit version. But I’m not going to release it yet.

What I found out in my testing was that it was slower than the 32-bit version and used more memory. So for most people, there’s no real reason to make it available because it will not help you.

64-bit does allow larger files to be loaded. Tamura Jones provides a wonderful capacity test with a program called GedFan that generates successive GEDCOM files each larger than the previous with double the number of individuals. Behold 32-bit has a fan value of 19 meaning it can load a file of about a half a million people, but doubling that it runs out of memory.
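The doubling can be checked with simple arithmetic. The sketch below assumes that a fan value of n corresponds to roughly 2 to the power n individuals, which is consistent with the figures quoted here but is not taken from the GedFan documentation.

```python
def approx_individuals(fan_value):
    """Approximate file size in people, assuming each fan value doubles the last."""
    return 2 ** fan_value


sizes = {n: approx_individuals(n) for n in (19, 20, 22)}
# sizes[19] is 524288 (about half a million); sizes[22] is 4194304 (about four million)
```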

Without modification, Behold 64-bit failed at fan value 22 (four million people) ironically because the memory reporting function call I was making placed the result into a 32-bit integer. I changed the variable from an Integer to an Int64 and ran it again.
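The limit of a signed 32-bit integer is easy to demonstrate. The sketch below uses Python's `ctypes` to mimic a 32-bit variable; the 3 GiB figure is a made-up example, and this is not a reproduction of the Delphi call or of how the failure actually showed up in Behold.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, just under 2 GiB when counting bytes


def fits_in_int32(n):
    """True if n can be held in a signed 32-bit integer."""
    return -(2**31) <= n <= INT32_MAX


memory_bytes = 3 * 1024**3                       # hypothetical 3 GiB in use
wrapped = ctypes.c_int32(memory_bytes).value     # what a 32-bit variable would hold
# fits_in_int32(memory_bytes) is False; wrapped is -1073741824
```

A 64-bit integer holds the same value without trouble, which is why widening the variable is enough.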

With that change, Behold now does load fan value 22. But this extra capacity is not going to be of benefit to too many people yet. It took 99 seconds to load (a good amount of that time was checking ancestral loops, which I’ll have to put into its own thread) and the internal data structures Behold needed caused it to red-line both my computer’s RAM and swap file and that’s with 12 GB RAM and a 40 GB swap file!
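For readers wondering what checking ancestral loops involves: it amounts to cycle detection in the child-to-parents graph. The following is a generic sketch using an iterative depth-first search, not Behold's actual algorithm. The function name and the dictionary input format are assumptions for this example.

```python
def find_ancestral_loop(parents):
    """Return a list of IDs forming a loop, or None if nobody is their own ancestor.

    parents maps each person ID to a list of that person's parent IDs.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the current path, finished
    colour = {}
    for start in parents:
        if colour.get(start, WHITE) != WHITE:
            continue
        colour[start] = GREY
        path = [start]
        stack = [(start, iter(parents.get(start, ())))]
        while stack:
            person, remaining = stack[-1]
            for parent in remaining:
                state = colour.get(parent, WHITE)
                if state == GREY:
                    # The parent is already on the current path: a loop.
                    return path[path.index(parent):] + [parent]
                if state == WHITE:
                    colour[parent] = GREY
                    path.append(parent)
                    stack.append((parent, iter(parents.get(parent, ()))))
                    break
            else:
                # All parents explored without finding a loop.
                colour[person] = BLACK
                stack.pop()
                path.pop()
    return None


ok_tree = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
bad_tree = {"A": ["B"], "B": ["C"], "C": ["A"]}
```

The search is iterative rather than recursive so that very deep trees, like the ones a doubling test produces, do not hit a recursion limit. Each person is visited once, so the cost grows linearly with the size of the file.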

Red-lining memory and swap file

The trouble is that this version of Behold loads everything into memory and builds all its data structures and indexes and links in memory. A doubling test like this will make it run out of memory before anything else shuts it down.

What will make 64-bit useful for Behold and will increase both its capacity and speed will be writing the data to a real disk-based database, rather than keeping it in memory. I’ll be using SQLite as the database. This is the database that RootsMagic uses and Tamura Jones found RootsMagic to be one of the faster GEDCOM readers with one of the highest fan values. SQLite is a very fast database, and even though it is disk-based, it doesn’t lose much to the overhead required for page swapping when you keep everything in memory.

For now, I’m not releasing the 64-bit version.

When the database work is done, it will come with both 32-bit and 64-bit versions of Behold. I’m starting the database work tomorrow.


Seven Years of Improvements - Sun, 12 Jul 2015

I know my last two posts were not written in English for the normal person. But I felt I had to document a bit of the work involved in just upgrading a development environment. It is a complex process full of pits and valleys. Getting through it needs a machete and a torch. We programmers struggle with our programming tools as much as our users struggle with the programs we create.

Now it is time to reap the rewards. Delphi and TRichView and ElPack and EurekaLog each have gone through seven years of improvements. I can run through the change logs of each version, and see what changes are notable and will provide benefits to Behold users.

 

Delphi improvements

Lots has been added since Delphi 2009.

The most important improvement for the future of Behold is FireDAC with SQLite support. I will be using this for Behold’s database.

Another improvement is 64-bit capability. Now I will be able to make Behold available both as a 32-bit program and as a 64-bit program.

There are new online data access protocols now available including REST web services, and libraries I may be needing. The newer versions of Delphi include improved JSON low level processing and an improved XML engine.

There were supposedly style improvements that take on the theme of the O/S (Vista, Windows 7, Windows 8, etc.), but other than a blue background added to indicate that a toolbar item is selected, I haven’t noticed much difference. It’s also touch-enabled. XE8 added support for Windows 10 and Windows 10 styles, so it will be ready when that comes.

Delphi has really enhanced its Generics library, which is a suite of data structures I can use in Behold. They are well-written, generalized and fast. They should speed my coding time, and over time I’ll replace my older data structures with these. Delphi 2009 included an early version of Generics, but there were some bugs so I only made limited use of them.

A Parallel Computing library has been added that will allow me to optimize Behold with background processes and to keep it running as fast as possible.

They also made lots of fixes and improvements. One item that really pleased me was a fix that helped Behold’s Find box. When you bring it up, it stays on your screen, but it is called a non-modal box because it does not take control of your screen. So you can find something and then go to the Everything Report and work on it, and then go back to the Find box. The default action in the Find box is of course to find the text. In Behold 1.1 and earlier, going back and forth between the Find box and the main program would somehow lose the default action. I would press Ctrl-F to go back to the Find box and then press Enter to get the next search, but it wouldn’t work. I tried in the past to find an easy fix for this, but it was a Delphi/Operating System thing. What pleased me is that whatever was causing this loss of the default was apparently fixed sometime between Delphi 2009 and XE8. Hopefully more subtle bugs like this were also fixed by going to XE8, without many new ones creeping in.

 

RichView improvements

The Everything Report is built on RichView, so any new features and improvements to RichView can benefit Behold.

I was using RichView 10.1.4 from 2008 and have now installed version 15.7.

Now it can export as a word document (.docx files). I’ve added DocX as an export option in Behold and it will be available in the next release.

Some fixes/improvements include improved rendering of bidirectional text, optimizations in calculating text width (which is probably the main reason why Behold loads 30% faster when compiled with Delphi XE8 and RichView 15.7), and touch screen support, e.g. handles when selecting with a touchscreen.

There were also improvements to the line breaking algorithm, 64-bit compilation support, and style templates (which I may or may not use). Hidden items were added, but I already had my own customized implementation of hidden items which I liked better, so I kept my own.

 

ElPack improvements

I use ElPack in Behold for the TreeView, the grids in the Organize pages, and a few other components. I had the 2009 Version and I upgraded to the 2015 Version to get a version that would work with Delphi XE8.

ElPack is not LMD Innovative’s major product. They acquired it quite a while ago from ELDOS who I initially purchased the package from because I needed a treeview and grid with Unicode support. So LMD is maintaining it, and is slowly merging it with their core LMD products, but they are not doing much to enhance it.

They’ve added 64-bit and touch-based features. They fixed a number of bugs. And they’ve included a few new packages with ElPack like a DialogPack that I could make use of in the future.

 

EurekaLog improvements

EurekaLog traps errors that would otherwise crash Behold. It lets me display a message to inform the user there is a problem, and provides some information that might allow me to find and fix the problem. I also use it when I’m developing since it can help me find and fix memory leaks.

I first purchased EurekaLog Version 6 back in 2009. I initially used EurekaLog’s default dialogs and this allowed the users to email the problem to me. It caused me problems. Users complained. So I changed it:

I came up with my own form after careful thinking. I did not want a generic box. I wanted something that would let the user know it was my program self-detecting the problem, and that it was not Windows telling them that my program had a problem. EurekaLog’s generic dialogs look too much like Windows. I attempted to fix the email problems with a "mailto" email. But even with that, the emailing was problematic, so I changed it to my current form, where I instead ask the user to drop off the bug information on the Behold feedback page.

EurekaLog’s improvements going from Version 6.x to 7.x involved a complete rewrite of their product. My old code wouldn’t work. I placed a support ticket with EurekaLog and Kevin of their support team worked with me and provided me a framework to continue to trap errors and provide my own form the way I had been doing.

Upgrading to Version 7 will provide 64-bit error trapping and better information for me for debugging purposes.

 

Overall

Upgrading to my new development environment that includes the latest versions of Delphi, TRichView, ElPack and EurekaLog took 14 days and involved about 400 recompilations of Behold.

Now I’ve got the tools available I’ll need to go forward and create a native Behold database to store your data. Then, with a place to store your edits, I’ll be able to add editing to Behold.