
Louis Kessler's Behold Blog

I Know Where I’m Going, But … - Fri, 17 Jul 2015

… But I don’t know what it’s going to be like when I get there.

My adventure towards editing in Version 2 of Behold is a bit like the New Horizons mission to Pluto. They set off 9 years ago and knew where they were going. They knew how they’d get there. They even knew what they were going to do when they got there. But they didn’t know what they’d find when they got there.

[Image: the best picture we had of Pluto until last week]

So Behold is well on its way to getting its own database so that it will have a place to save the edits you add to it. I know that this editing is going to be something special. I just don’t know exactly what it will look like yet. My picture of it is about the same as the image of Pluto above, which was the best picture we had of it until last week.

As I get closer to the goal, ideas will expand and the picture will get clearer.

[Image: a clearer picture of Pluto]

Even my last blog post, just 62 hours ago, talked about setting up a main database table that would generalize all the tables. Well, two days of working on the tables led to the realization that a generalized table will not work out, and that the standard method of normalizing tables for INDI, FAM, SOUR, OBJ, etc., will do after all.

Also essential is the realization that I must change just one thing at a time, ensure it can work, and then get it working before going on to the next. Behold must always be a working system. At this point, wholesale changes will not work. The mission is set. Pluto is in our sights. Let’s get there.

NASA's Three-Billion-Mile Journey to Pluto Reaches Historic Encounter

Pluto nearly fills the frame in this image from the Long Range Reconnaissance Imager (LORRI) aboard NASA’s New Horizons spacecraft, taken on July 13, 2015, when the spacecraft was 476,000 miles (768,000 kilometers) from the surface. (Photo: NASA/APL/SwRI)

Database Design for Genealogy Data - Tue, 14 Jul 2015

There are several considerations as I structure the disk-based database for Behold. I want it to be capable of loading genealogy data from any source, and I want it to be flexible, but most importantly I need it to handle the in-memory structures already in Behold. Let’s go through these considerations:

Data from any Source

Behold already reads GEDCOM. GEDCOM is the long-time standard used for transmission of genealogy data. Almost all genealogy programs can read and write GEDCOM files to at least some degree. Most programs don’t make full use of GEDCOM’s capabilities, and some make many mistakes writing to it.

GEDCOM’s file structure includes records for a Person, Family, Note, Source, Repository, Multimedia, Submitter and Submission. Notably absent are records for a Place and an Event, which I’ll need for Behold. GEDCOM does allow custom tags, and many vendors have included custom level zero records in their files. That means just about anything goes, except that no program will read another program’s custom tags, other than in the rare cases where a vendor goes through the work of supporting them.

Family Historian is one of the few genealogy programs that uses GEDCOM as its file format. Internally, it likely loads everything into a more efficient in-memory data structure like Behold does, but it writes its data to file only as GEDCOM. And if I’m not mistaken, Family Historian still uses GEDCOM 5.5, which was superseded by GEDCOM 5.5.1, the version PAF used and the de facto standard.

There was an Event-Oriented GEDCOM developed by COMMSOFT and used in Ultimate Family Tree, until Ancestry purchased the program and discontinued it. That structure put the Event record front and centre.

There are some extensions to GEDCOM, most notably GEDCOM 5.5EL, developed and supported by a group of genealogy program authors, which adds a Place record and some other goodies.

The FHISO organization is working to produce an updated standard. That is probably a long time off, but when they do, that will be a new standard I’ll want to read.

Other programs: I’m not saying I’m going to get Behold to read from every other program out there, but it is very good to look at whatever published data file documentation other programs have. I at least have to make sure that Behold will have the structures in place to store and display any relevant genealogy data that any program may have. Looking at other programs’ file structures may also give me good ideas on ways to design Behold’s file.

Some of the programs with the most interesting data files I’ve looked at include:

  • RootsMagic – the database is SQLite, which is what I’ll be using. There are many SQLite viewers, so their database can be easily inspected (see the sketch after this list). In fact, there’s a SQLite Tools for RootsMagic group at wikispaces that publishes the RootsMagic data structure, provides tools, and helps people write their own custom queries against the RootsMagic data file.

    RootsMagic’s data structure includes standard records for: Person, Family, Place, Event, Source, Citation and Multimedia. There are also some tables for subrecords (e.g. Name, Address). The most interesting records in RootsMagic that Behold should be capable of supporting include Groups, Research/To-Do, Source Templates, Exclusions (problem list items that are not a problem) and Witnesses.

  • Gramps – provides a full API (Application Programming Interface) to the Gramps database. The main records are Person, Family, Place, Event, Source, Citation, Repository, Media Object and Note.

    I think the Gramps people did a superb job of setting up a well-designed layout with just the right fields in each record. I particularly like the idea of including, on each record, fields for: private, change, and a list of links to all the connections (rather than making them separate tables as most programs do). If I were starting anew and wanted to just “borrow” a database structure, this might be the one I would pick.

  • The Master Genealogist (TMG) – Bob Velke discontinued TMG last year. TMG was considered one of the more advanced genealogy programs when it was first created, and it had the data structures to support its advanced features. In August last year, the file structures of TMG 9 were published to allow other vendors to build import utilities so that TMG’s users could migrate to a supported program. Lee Hoffman put up a nice summary of the TMG file structures up to TMG 7.

    Some of TMG’s more unique records include: Custom Flags, DNA information, Focus Groups, Participants/Witnesses and Timeline Locks.
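
Since RootsMagic’s file (like Behold’s upcoming one) is plain SQLite, inspecting a data file doesn’t even require a dedicated viewer. Here’s a minimal sketch using Python’s built-in sqlite3 module; the file name is just an example, not a real path:

  import sqlite3

  # Open a SQLite-based data file; the path here is just an example.
  conn = sqlite3.connect("sample.rmgc")

  # sqlite_master lists every table in the file, which is enough
  # to start exploring a program's published data structure.
  for (name,) in conn.execute(
          "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"):
      print(name)

  conn.close()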

Then there are the online family tree services that you subscribe to and maintain your family tree on their websites. Most of these sites provide APIs to allow programmers to read from, and in some cases write to, their service. The APIs give an indication of the structure of their data. Some of the most notable include:

  • MyHeritage – They have complete documentation online for their API, which they call the Family Graph API (surprisingly, still with a Beta designation on it). I heard Uri Gonen talk about it at RootsTech 2014. I tweeted then that I was very impressed and that MyHeritage is doing a lot of things right. Of all the APIs I’ve seen, theirs is the most straightforward and looks like it might be the simplest to implement.

    Some of the unique objects in their API include information related to their SmartMatch system, including a Matching Request, MatchesCount and ConfirmationStatus.

  • FamilySearch – They also have their documentation available online for developers, and their GEDCOM X Data Objects define the actual data structure used in their Family Tree.

    Of most interest here is their inclusion of Discussions and their concept of Memories.

There are other online services with APIs, including Geni, Genealogie Online, WikiTree, GenealogyCloud and the DNA testing company 23andMe to name a few.

I should add that I really like Tim Forsythe’s online Gigatrees service. He doesn’t give a complete description, but he does at least map out the data structure he calls GREnDL. He attempts to do the correct thing and be source-based, so at the center of his model he has the “Claim” record.

There are others who have developed data structures with different ideas that I keep aware of, like Tony Proctor (STEMMA), and Tom Wetmore (DeadEnds).


Flexibility

Behold is a flexible GEDCOM reader. It knows the rules of GEDCOM but reads in and can display anything that has the GEDCOM level/tag/value structure, even if the input doesn’t follow GEDCOM’s rules.

So yes, there is valid data and there is invalid data. But whatever is input must be preserved in some way. It cannot be thrown away.

The programs that convert data they don’t recognize into notes are sometimes criticized because they lose understanding of the data. But that is much better than not inputting the data at all. In the former case, the data can at least be exported again, but in the latter case the data is lost forever.

What must be done is to try to recognize all GEDCOM constructs and record the data as intended. If the construct as input is illegal, then maybe it can be translated to a legal construct. If not, maybe a custom tag can be used. If not, then turn it into a note.
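
As a rough sketch of that fallback chain (the tag sets and helpers here are hypothetical illustrations, not Behold’s actual code):

  # Hypothetical sketch of the fallback chain; the tag sets are
  # illustrative only, not Behold's actual rules.
  KNOWN_TAGS = {"INDI", "FAM", "NOTE", "SOUR", "REPO", "OBJE", "SUBM"}
  TAG_FIXES = {"ADDRESS": "ADDR"}   # an illegal spelled-out tag -> its legal form

  def store_construct(level, tag, value):
      if tag in KNOWN_TAGS or tag.startswith("_"):
          return (level, tag, value)                   # legal, record as intended
      if tag in TAG_FIXES:
          return (level, TAG_FIXES[tag], value)        # translated to a legal construct
      if tag.isalpha():
          return (level, "_" + tag, value)             # kept under a custom tag
      return (level, "NOTE", (tag + " " + value).strip())  # last resort: a note

  print(store_construct(1, "ADDRESS", "12 Main St"))   # -> (1, 'ADDR', '12 Main St')
  print(store_construct(1, "NICKNAME", "Lou"))         # -> (1, '_NICKNAME', 'Lou')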

Whatever is done, the data structures must be able to handle this flexibility. A very rigid structure that requires only certain types of records and fields, and does not allow unknown records or unknown fields, will have no place to put this data.

GEDCOM’s level/tag/value structure is actually very flexible and can handle just about anything. I plan to make use of it to contain the detail information about each record in the database.
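
For example, a reader that assumes only the level/tag/value shape (plus the optional @xref@ label) can take in any line at all. This is a minimal sketch, not Behold’s parser:

  # Minimal sketch of a level/tag/value reader: it accepts any tag,
  # known or not, as long as the line has the basic GEDCOM shape.
  def parse_gedcom_line(line):
      parts = line.rstrip("\n").split(" ", 3)
      level = int(parts[0])
      if len(parts) > 1 and parts[1].startswith("@"):   # optional @xref@ label
          xref, rest = parts[1], parts[2:]
      else:
          xref, rest = None, parts[1:]
      tag = rest[0]
      value = " ".join(rest[1:])
      return level, xref, tag, value

  print(parse_gedcom_line("0 @I1@ INDI"))            # (0, '@I1@', 'INDI', '')
  print(parse_gedcom_line("2 DATE 17 JUL 2015"))     # (2, None, 'DATE', '17 JUL 2015')
  print(parse_gedcom_line("1 _ANYTHING goes here"))  # custom tags are kept too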

Behold’s In-Memory Structures

Behold is already quite well set up to be converted to a file-based database. Its current structures include all the GEDCOM records, already include a place record, and allow for custom records.

When I added Life Events into version 1.1, events suddenly took on new meaning. They were not only linked to the individual or family, their place and their sources, but now also to all the people closely related to the individual. So I’ve come to realize that because events are referenced so much, it will be better to make Event a top-level record.

To generalize, I’m thinking I can collapse most record types into a single table because they have so much in common. My idea may change, but what follows is what I’m going to try to implement.

The main table will likely be something like this:

  1. ID – sequential index number
  2. Type – type of record, from:  INDI, FAM, PLAC, EVEN [TYPE], NOTE, SOUR, SREC, OBJ, RELA (relationship), GRP, TODO, DNA, …
  3. Name (of person, place, source, etc.)
  4. Filenum (the data file it is from)
  5. FileID (the ID of this record in its original file)
  6. Date
  7. Place ID
  8. Detail (in GEDCOM format)
  9. User Reference Number (for future use)
  10. Private (for future use)
  11. Changed date
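
In SQLite terms (which is where this is headed), that table might be declared something like the sketch below. The column names are my shorthand for the list above, not a final schema:

  import sqlite3

  conn = sqlite3.connect(":memory:")   # a throwaway database for the sketch

  conn.execute("""
      CREATE TABLE Record (
          ID       INTEGER PRIMARY KEY,   -- sequential index number
          Type     TEXT,      -- INDI, FAM, PLAC, EVEN, NOTE, SOUR, ...
          Name     TEXT,      -- of person, place, source, etc.
          Filenum  INTEGER,   -- the data file it is from
          FileID   TEXT,      -- the ID of this record in its original file
          Date     TEXT,
          PlaceID  INTEGER REFERENCES Record(ID),
          Detail   TEXT,      -- subrecord detail kept in GEDCOM format
          RefNum   TEXT,      -- user reference number (future use)
          Private  INTEGER,   -- privacy flag (future use)
          Changed  TEXT       -- changed date
      )""")

  conn.execute("INSERT INTO Record (Type, Name, Filenum, FileID) "
               "VALUES ('INDI', 'John Smith', 1, '@I1@')")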

There will also need to be some sort of Partner/Parent/Children table, linking people with their partners and people with their children, as well as to the partner start-of-relationship event and the child birth/adoption/fostering events. I still have some thinking to do on this structure. But it will be based on the internal structure now in Behold and may look something like this:

  1. LinkType – type of link (to partner and/or family, to child and/or parents)
  2. Person ID
  3. Family ID
  4. Person Link Detail (in GEDCOM format)
  5. Family Link Detail (in GEDCOM format)
  6. Start Event ID
  7. Next Person ID (next partner or next child)
  8. Next Family ID (next family or next parents)
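
Rendered the same way, that link table might look like this (again just a sketch; every name is provisional):

  import sqlite3

  conn = sqlite3.connect(":memory:")

  conn.execute("""
      CREATE TABLE Link (
          LinkType      TEXT,     -- to partner/family or to child/parents
          PersonID      INTEGER,  -- Record.ID of the person
          FamilyID      INTEGER,  -- Record.ID of the family
          PersonDetail  TEXT,     -- person link detail in GEDCOM format
          FamilyDetail  TEXT,     -- family link detail in GEDCOM format
          StartEventID  INTEGER,  -- Record.ID of the start event
          NextPersonID  INTEGER,  -- next partner or next child
          NextFamilyID  INTEGER   -- next family or next parents
      )""")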

The rest of the tables will simply be combinations of indexes to make everything work fast together, e.g.

  • Place index is:  ID (place), ID (non-place)
  • Source index is:  ID (source/source record), ID (anything) 
  • and a bunch of others

When I implement To-Do lists, I plan to connect them to the repository:

  • To-Do index is: ID (repository), ID (to-do)

Life events should be much easier to compute. I’ll have a table of relationships:

  • Person ID, Related Person ID, Relationship

and then the index of life events:

  • Person ID, Related Person ID, Event ID
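
Put together, the relationship table and the life event index make the life event lookup a simple join. A sketch, with provisional table and column names:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("""CREATE TABLE Relationship (
      PersonID INTEGER, RelatedPersonID INTEGER, Relationship TEXT)""")
  conn.execute("""CREATE TABLE LifeEvent (
      PersonID INTEGER, RelatedPersonID INTEGER, EventID INTEGER)""")

  # All life events for person 1, labelled with the relationship
  # of the person each event belongs to.
  rows = conn.execute("""
      SELECT r.Relationship, le.EventID
        FROM LifeEvent AS le
        JOIN Relationship AS r
          ON r.PersonID = le.PersonID
         AND r.RelatedPersonID = le.RelatedPersonID
       WHERE le.PersonID = ?""", (1,)).fetchall()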

You know, this could actually work out quite nicely. But we’ll see what happens. It is a process that, once underway, evolves into what’s necessary.

What About 64 Bit? - Sun, 12 Jul 2015

Yes, I do now have a 64-bit executable of Behold that works exactly the same as the 32-bit version. But I’m not going to release it yet.

What I found in my testing was that the 64-bit version was slower than the 32-bit version and used more memory. So for most people, there’s no real reason to make it available, because it will not help them.

64-bit does allow larger files to be loaded. Tamura Jones provides a wonderful capacity test with a program called GedFan that generates successive GEDCOM files, each with double the number of individuals of the previous one. Behold 32-bit has a fan value of 19, meaning it can load a file of about half a million people, but at double that it runs out of memory.
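
The fan values translate to people counts by powers of two: assuming GedFan’s files are complete ancestor fans, a fan value of n means 2^n − 1 individuals, which matches the figures here:

  # Assuming a complete fan: fan value n -> 2**n - 1 individuals.
  for fan in (19, 20, 21, 22):
      print(fan, 2**fan - 1)
  # 19 -> 524287   (about half a million: the 32-bit limit)
  # 22 -> 4194303  (about four million)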

Without modification, Behold 64-bit failed at fan value 22 (four million people), ironically because the memory-reporting function call I was making placed the result into a 32-bit integer. I changed the variable from an Integer to an Int64 and ran it again.
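
The failure makes sense: a signed 32-bit integer tops out at 2,147,483,647, so any memory figure over about 2 GB wraps around. A quick arithmetic illustration of what each variable size would store:

  six_gb = 6 * 1024**3                          # a memory figure a big load might report
  wrapped = (six_gb + 2**31) % 2**32 - 2**31    # what a signed 32-bit Integer holds
  print(wrapped)                                # -2147483648: garbage
  print(six_gb)                                 # 6442450944 fits fine in an Int64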

With that change, Behold now loads fan value 22. But this extra capacity is not going to benefit many people yet. It took 99 seconds to load (a good amount of that time was spent checking ancestral loops, which I’ll have to put into its own thread), and the internal data structures Behold needed red-lined both my computer’s RAM and swap file, and that’s with 12 GB of RAM and a 40 GB swap file!

[Image: red-lining memory and swap file]

The trouble is that this version of Behold loads everything into memory and builds all its data structures and indexes and links in memory. A doubling test like this will make it run out of memory before anything else shuts it down.

What will make 64-bit useful for Behold, and will increase both its capacity and speed, will be writing the data to a real disk-based database rather than keeping it in memory. I’ll be using SQLite as the database. This is the database that RootsMagic uses, and Tamura Jones found RootsMagic to be one of the faster GEDCOM readers with one of the highest fan values. SQLite is a very fast database, and even though it is disk-based, it doesn’t lose much to an all-in-memory approach, which pays its own page-swapping overhead once the data no longer fits in RAM.

For now, I’m not releasing the 64-bit version.

When the database work is done, Behold will come in both 32-bit and 64-bit versions. I’m starting the database work tomorrow.
