Database Design for Genealogy Data - Tue, 14 Jul 2015
There are several considerations as I structure the disk-based database for Behold. I want it to be capable of loading genealogy data from any source, I want it to be flexible, but most importantly I need it to handle the in-memory structures already in Behold. Let’s go through these considerations:
Data from any Source
Behold already reads GEDCOM. GEDCOM is the long-time standard used for transmission of genealogy data. Almost all genealogy programs can read and write from and to GEDCOM files to at least some degree. Most programs don’t make use of GEDCOM’s capabilities and some many mistakes writing to it.
GEDCOM’s file structure includes records for a Person, Family, Note, Source, Repository, Multimedia, Submitter and Submission. Notably absent are records for a Place and an Event, which I’ll need for Behold. GEDCOM does allow custom tags, and many vendors have included custom level zero records in Behold. That means just about anything goes – except that no other programs will read one program’s custom tags except in the rare cases that a program will go through the work to support another program’s custom tags.
Family Historian is one of the few genealogy programs that uses GEDCOM as its file format. Internally, it likely loads everything into a more efficient in-memory data structure like Behold does, but it only writes its data to a file as GEDCOM. And if I’m not mistaken, Family Historian still uses GEDCOM 5.5, which was superseded by GEDCOM 5.5.1 which PAF used and is the de facto standard.
There was an Event-Oriented GEDCOM developed by COMMSOFT and used in Ultimate Family Tree until Ancestry purchased it and discontinued it. They put the Event record front and centre in that structure.
There are some extensions to GEDCOM, most notably GEDCOM 5.5EL developed and supported by a group of genealogy program authors that has added a Place record and some other goodies.
The FHISO organization is working to produce an updated standard. That is probably a long time off, but when they do, that will be a new standard I’ll want to read.
Other programs: I’m not saying I’m going to get Behold to read from every other program out there, but it is very good to look at any published data file documentation that other programs have. I at least have to make sure that Behold will have the structures in place that it can store and display any relevant genealogy data that any program may have. Looking at other program’s file structures may give me other good ideas on ways to design Behold’s file.
Some of the programs with the most interesting data files I’ve looked at include:
- RootsMagic – the database is SQLite, which is what I’ll be using. There are many SQLite viewers so their database can be easily inspected. In fact, there’s a SQLite Tools for RootsMagic group at wikispaces that publishes the RootsMagic data structure, provides tools, and helps people write their own custom queries from the RootsMagic data file.
RootsMagic’s data structure includes standard records for: Person, Family, Place, Event, Source, Citation and Multimedia. There are some tables for subrecords (e.g. Name, Address). The most interesting records in RootsMagic which Behold should be capable of supporting include Groups, Research/To-Do, Source Templates, Exclusions (problem list items that are not a problem) and Witnesses.
- Gramps – provides a full API (Application Programming Interface) to the Gramps database. The main records are Person, Family, Place, Event, Source, Citation, Repository, Media Object and Note.
I think the Gramps people did a superb job of setting up a well designed layout with just the right fields in each record. I particularly like the idea of including on each record, fields for: private, change, and a list of links to all the connections (rather than making them separate tables as most programs do). If I was starting anew and wanted to just “borrow” a database structure, this might be the one I would pick.
- The Master Genealogist (TMG) – Bob Velke discontinued TMG last year. TMG was considered to be one of the more advanced genealogy programs when it was first created, and had to have data structures to support them. In August last year, the file structures of TMG 9 were published to allow other vendors to build import utilities that so that TMG’s users could migrate to a supported program. Lee Hoffman put up a nice summary of the TMG File Structures up to TMG 7.
Some of TMG’s more unique records include: Custom Flags, DNA information, Focus Groups, Participants/Witnesses and Timeline Locks.
Then there are the online Family Tree Services who you subscribe to and maintain your family tree on their websites. Most of these sites provide APIs to allow programmers to read and in some cases write from and to their service.The APIs will give an indication as to the structure of their data. Some of the most notable include:
- MyHeritage – They have complete documentation online for their API, which they call the Family Graph API (surprisingly still with Beta designation on it). I heard Uri Gonen talk about it at RootsTech 2014. I tweeted then that I was very impressed and that MyHeritage is doing a lot of things right. Of all the APIs I’ve seen, theirs is the most straightforward and looks like it might be the simplest to implement.
Some of the unique objects in their API include information related to their SmartMatch system, including a Matching Request, MatchesCount and ConfirmationStatus.
- FamilySearch – They also have their documentation available online for developers. They also have their GEDCOM X Data Objects defined which describe the actual data structure used in their Family Tree.
I should add that I really like Tim Forsythe’s online Gigatrees service. He doesn’t give a complete description but does at least map out his data structure he calls GREnDL. He attempts to do the correct thing and be source-based, so at the center of his model, he has the “Claim” record.
Behold is a flexible GEDCOM reader. It knows the rules of GEDCOM but reads in and can display anything that has the GEDCOM level/tag/value structure, even if the input doesn’t follow GEDCOM’s rules.
So yes, there is valid data and there is invalid data. But whatever is input must be preserved in some way. It cannot be thrown away.
The programs that convert data they don’t recognize into notes are sometimes criticized because they lose understanding of the data. But that is much better than not inputting the data at all. In the former case, the data can at least be exported again, but in the latter case the data is lost forever.
What must be done is to try to recognize all GEDCOM constructs and record the data as intended. If the construct as input is illegal, then maybe it can be translated to a legal construct. If not, maybe a custom tag can be used. If not, then turn it into a note.
Whatever is done, the data structures must be able to handle this flexibility. A very rigid structure requiring only certain types of records and fields and not allowing unknown records or unknown fields will have no place to put this data.
GEDCOM’s level/tag/value structure is actually very flexible and can handle just about anything. I plan to make use of it to contain the detail information about each record in the database.
Behold’s In-Memory Structures
Behold is already quite well set up to be converted to a file-based database. Its current structures include all the GEDCOM records and that already includes a place record and allows for custom records.
When I added Life Events into version 1.1, events suddenly took on new meaning. They were now not only linked to the individual or family, their place and their sources, but now they were also linked to all the people closely related to the individual. So I’ve come to realize that because the events are referenced so much, it will be better to make Event a top level record.
To generalize, I’m thinking I can collapse most record types into a single table because they have so much in common. My idea may change, but what follows is what I’m going to try to implement.
The main table will likely be something like this:
- ID – sequential index number
- Type – type of record, from: INDI, FAM, PLAC, EVEN [TYPE], NOTE, SOUR, SREC, OBJ, RELA (relationship), GRP, TODO, DNA, …
- Name (of person, place, source, etc.)
- Filenum (the data file it is from)
- FileID (the ID of this record in its original file)
- Place ID
- Detail (in GEDCOM format)
- User Reference Number (for future use)
- Private (for future use)
- Changed date
There will also need to be some sort of a Partner/Parent/Children table linking people with their partners and people with their children and the partner start relationship event and the child birth/adoption/fostering events. I still have some thinking to do for this structure. But it will be based on the internal structure now in Behold and may look something like this:
- LinkType – type of link (to partner and/or family, to child and/or parents)
- Person ID
- Family ID
- Person Link Detail (in GEDCOM format)
- Family Link Detail (in GEDCOM format)
- Start Event ID
- Next Person ID (next partner or next child)
- Next Family ID (next family or next parents)
The rest of the tables will simply be combination of indexes to make everything work fast together, e.g.
- Place index is: ID (place), ID (non-place)
- Source index is: ID (source/source record), ID (anything)
- and a bunch of others
When I implement To-Do lists, I can connect it like I plan to the repository:
- To-Do index is: ID (repository), ID (to-do)
Life events should be much easier to compute. I’ll have a table of relationships:
- Person ID, Related Person ID, Relationship
and then the index of life events:
- Person ID, Related Person ID, Event ID
You know. This could actually work out quite nicely. But we’ll see what happens. It is a process that once underway, evolves into what’s necessary.