Louis Kessler’s Behold Blog

Reading GEDCOM – A Guide for Programmers - Thu, 12 Oct 2023

I’m back hard at work on Behold, trying to finish off the many features I’ll be including in Version 1.3 when it’s ready.

Some of the changes for Version 1.3 made me take a look again at Behold’s GEDCOM import. Behold currently is a flexible GEDCOM reader. That is, it will accept almost anything that looks like GEDCOM and present it the best it can. In so doing, Behold displays as much of your data as possible so you can see what the GEDCOM file contains.

This is important because many programs that export GEDCOM do not follow all the rules. They include deprecated constructs, illegal keywords, user-defined tags that no program but their own understands, and many other irregularities. Most developers no longer care whether other programs can import their data, but instead just want to ensure that their own program can read its own data back in.

This article is probably about 25 years overdue. I don’t believe anyone has written a simple programmer’s guide to reading GEDCOM before. In the past 5 to 10 years, very few new genealogy programs have been written, and there’s very little reason why a new GEDCOM standard is needed. That’s because almost all genealogy software that currently exists reads and writes the GEDCOM 5.5.1 Standard, a draft that was released in 1999 and was deemed final, unchanged, in 2019.

I have stated in the past that I think that GEDCOM 5.5.1 is an excellent standard and the reason why genealogy data doesn’t transfer between two programs is rarely the Standard’s fault. Most often it’s the programmer’s fault. See my 2011 article: Build a BetterGEDCOM or learn GEDCOMBetter?


Reading GEDCOM 101

So let’s go through the basics. What does a programmer have to do to read a GEDCOM file? Since almost every program can export its data to GEDCOM 5.5.1 or to something resembling GEDCOM 5.5.1, concentrating on reading that particular version of GEDCOM should do the trick.

It doesn’t matter what programming language is used. A GEDCOM file is a simple text file, and just about every language can read a text file. It doesn’t matter what data structures or database the information is loaded into. That can be the programmer’s preference. For Behold, I happen to use Delphi (object oriented Pascal) and in-memory data structures to store the data.

To read GEDCOM, the programmer will have to have available a copy of The GEDCOM 5.5.1 standard. It is 101 pages and is well written with very few errors. It uses a Backus-Naur form (BNF) type of notation which most programmers would either be familiar with, or find easy to understand.

Chapter 1 of the Standard gives the Data Representation Grammar. Basically, all you have to know is that:

gedcom_line := level + delim + [optional_xref_ID] + tag + [optional_line_value] + terminator

So step 1 for the programmer is to make a subroutine that will process the next line and return what’s in it. A call to the routine might look like:

ParseLine(Level, OptXRefID, Tag, OptLineValue)

which gets the values of Level, OptXRefID, Tag and OptLineValue from the line.

Then you simply process the line using the values.

Easy peasy.

There is of course a bit more to it, and the Standard gives you the exact breakdown of each of the parts of the line. It tells you, for example, that the level is a number from 0 to 99 and that delim is a space. OptXRefID is an identifier between two at signs, e.g. @I43@, Tag is a string, and OptLineValue is either a Value (which is a string) or a Pointer, which holds the XRefID of the record it refers to, e.g. @I43@.
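As an illustration, here is a minimal Python sketch of what such a ParseLine routine could look like. It is not Behold's code (Behold is written in Delphi). The names `parse_line`, `GedcomLine` and `is_pointer` are made up for this example, and the regular expression assumes single-space delimiters and cross-reference IDs that contain no spaces, which is stricter than what a flexible reader would accept.

```python
import re
from typing import NamedTuple, Optional

# level, optional @XREF@, tag, optional line value, separated by single spaces.
# Assumption: xref IDs contain no spaces or at signs between the two @ marks.
_LINE_RE = re.compile(
    r"^\s*(\d{1,2}) (?:(@[^@ ]+@) )?([A-Za-z0-9_]+)(?: (.*))?$"
)
_POINTER_RE = re.compile(r"@[^@ ]+@")


class GedcomLine(NamedTuple):
    level: int
    xref_id: Optional[str]  # e.g. "@I43@" on the first line of a record
    tag: str
    value: Optional[str]    # plain text, or a pointer such as "@F7@"


def parse_line(raw: str) -> GedcomLine:
    """Split one GEDCOM line into its parts; raise ValueError if malformed."""
    match = _LINE_RE.match(raw.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {raw!r}")
    level, xref_id, tag, value = match.groups()
    return GedcomLine(int(level), xref_id, tag, value)


def is_pointer(value: Optional[str]) -> bool:
    """True when a line value is a pointer like @I43@ rather than text."""
    return bool(value) and _POINTER_RE.fullmatch(value) is not None
```

Calling `parse_line("0 @I43@ INDI")` gives level 0, the ID `@I43@`, the tag `INDI` and no value, while `parse_line("1 FAMS @F7@")` gives a value for which `is_pointer` is true. Raising an error on a malformed line is just one choice; a forgiving reader might record the problem and carry on.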

The amazing thing about this simple scheme is that it can be used to define any data representation language. In that way, it is similar to XML and JSON. Why didn’t the developers of GEDCOM use XML or JSON for their standard? Well, they weren’t invented yet. So the developers of GEDCOM developed their own data representation language. Almost anything written with XML or JSON could also be written with the GEDCOM Grammar, and vice versa.


Reading GEDCOM 201

Well that’s all fine and good, but we have no genealogical data defined yet.

Thus comes Chapter 2 of the Standard: The GEDCOM Structure. This chapter defines what the valid tags are at each level, what type of value or pointer each tag might have, and what substructures each tag at each level might have.

This is all explained quite clearly in the Standard.

Using the ParseLine routine above, there are only a few different types of lines to process:

  1. A Level 0 line starts a new record. The Tag defines the record type. The Tag could be INDI (Individual), FAM (Family), SOUR (Source), REPO (Repository), OBJE (Multimedia) and a few others. When a Level 0 record is reached, the programmer will start a new entry for it in a data structure or table of a database.
  2. A Level 1 line starts a fact or piece of information for the Level 0 record before it. Each Level 1 fact can be associated with the Level 0 record it belongs to.
  3. Level 2 and higher-numbered lines hold additional information belonging to the Level 1 fact before them.
  4. The Pointers are the connectors. There are two types of connectors:
    • Individual to Family connectors, to connect parents to children and spouses to each other using the FAM record as the intermediate linkage. It’s a tricky concept to grasp but programmers will understand it as it’s a relatively simple structure in graph theory.
    • Connectors to other records, e.g. a fact is connected to a source record using a SOUR tag with a pointer (which is how GEDCOM implements citations), or a source is connected to a repository with a REPO tag with a pointer.
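The four line types above can be sketched in a few lines of Python. This is only one possible arrangement, assuming the lines have already been split into (level, xref_id, tag, value) tuples. `Node`, `build_records` and `family_links` are names invented for this example, while HUSB, WIFE and CHIL are the tags the Standard uses inside a FAM record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

ParsedLine = Tuple[int, Optional[str], str, Optional[str]]


@dataclass
class Node:
    tag: str
    value: Optional[str] = None
    xref_id: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def build_records(lines: List[ParsedLine]) -> List[Node]:
    """Group parsed lines into Level 0 records with nested substructures."""
    records: List[Node] = []
    stack: List[Node] = []  # stack[n] is the latest node seen at depth n
    for level, xref_id, tag, value in lines:
        node = Node(tag, value, xref_id)
        del stack[level:]  # drop nodes at this depth or deeper
        if level == 0:
            records.append(node)  # a new record starts
        elif stack:
            stack[-1].children.append(node)  # attach to the nearest parent
        else:
            continue  # line before any record: nothing to attach it to
        stack.append(node)
    return records


def family_links(records: List[Node]) -> Dict[str, Dict[str, List[str]]]:
    """Map each FAM record ID to the individuals it points to."""
    families: Dict[str, Dict[str, List[str]]] = {}
    for rec in records:
        if rec.tag != "FAM" or rec.xref_id is None:
            continue
        fam: Dict[str, List[str]] = {"spouses": [], "children": []}
        for sub in rec.children:
            if sub.tag in ("HUSB", "WIFE") and sub.value:
                fam["spouses"].append(sub.value)
            elif sub.tag == "CHIL" and sub.value:
                fam["children"].append(sub.value)
        families[rec.xref_id] = fam
    return families
```

The stack is what ties each Level 1 fact to its record and each deeper line to the fact above it. Attaching a line to the nearest available parent, even when a level has been skipped, is a deliberately forgiving choice made for this sketch.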

Seriously, that’s basically it.

You can add some data checking and error messages if the GEDCOM isn’t valid, but the amount you do is up to you if all you are doing is reading GEDCOM.
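For example, two inexpensive checks are a level that is more than one deeper than the line before it, and a pointer to a record that is never defined. Here is a hedged sketch, again assuming lines already split into (level, xref_id, tag, value) tuples. The function name `check_lines` and the message wording are made up for illustration.

```python
import re
from typing import List, Optional, Tuple

ParsedLine = Tuple[int, Optional[str], str, Optional[str]]

# Assumption: a pointer is a value that is entirely one @...@ token.
POINTER_RE = re.compile(r"@[^@ ]+@")


def check_lines(lines: List[ParsedLine]) -> List[str]:
    """Return human-readable problems found in a list of parsed lines."""
    problems: List[str] = []
    # Every ID that some line defines, collected before checking pointers,
    # because a pointer may refer to a record that appears later in the file.
    defined = {xref_id for _, xref_id, _, _ in lines if xref_id}
    prev_level = -1
    for number, (level, _xref_id, tag, value) in enumerate(lines, start=1):
        if level > prev_level + 1:
            problems.append(
                f"line {number}: level {level} is more than one deeper "
                f"than the previous line"
            )
        if value and POINTER_RE.fullmatch(value) and value not in defined:
            problems.append(
                f"line {number}: {tag} points to undefined record {value}"
            )
        prev_level = level
    return problems
```

Whether a problem like this is reported, silently repaired, or ignored is a design decision for each program.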


What About the Program’s Data Structure?

Most genealogy software developers already have their own program and data structures or database to store their data. To read GEDCOM, all they have to do is pigeon-hole each line of data into their own data.

Err, um, and that’s where the data loss due to transferring GEDCOM occurs. If a GEDCOM file includes types of data, e.g. a phonetic name variation, which the developer’s program doesn’t support, well what’s the developer to do? Unless he sees a need for that data in his program, he’ll just throw it away, as he will with any user-defined program specific data that often comes in GEDCOM.

The most common reason GEDCOM data doesn’t transfer isn’t that the GEDCOM standard can’t represent the data. Rather, it’s that the receiving program doesn’t have a place for the data.

Now if the programmer didn’t already have a structure for his data, he could define one based on the GEDCOM Structure. Doing so, he could have a place for everything that’s possible to be transferred. Then it’s up to the programmer to develop reports and forms to allow the user to see and edit the data. Whether or not they make reports and forms that can display and edit all possible data from a GEDCOM file is a different matter.
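To make the idea concrete, here is a small Python sketch of a record type that keeps every line it was given, whether or not the program has a form or report for it, and writes them all back out on export. It is an illustration under simplifying assumptions, not how any particular program stores its data. The names `Record`, `first_value` and `to_gedcom` are invented, FONE is the Standard's phonetic name variation tag, and `_UID` stands in for any user-defined tag.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

SubLine = Tuple[int, str, Optional[str]]  # (level, tag, value)


@dataclass
class Record:
    xref_id: str
    tag: str
    # Every line below the Level 0 line, in file order, known or not.
    lines: List[SubLine] = field(default_factory=list)

    def first_value(self, tag: str) -> Optional[str]:
        """Value of the first Level 1 line with this tag, if any."""
        for level, line_tag, value in self.lines:
            if level == 1 and line_tag == tag:
                return value
        return None

    def to_gedcom(self) -> str:
        """Write the record back out, including lines the program ignores."""
        out = [f"0 {self.xref_id} {self.tag}"]
        for level, line_tag, value in self.lines:
            out.append(f"{level} {line_tag}" + (f" {value}" if value else ""))
        return "\n".join(out) + "\n"


person = Record("@I1@", "INDI", [
    (1, "NAME", "Louis /Kessler/"),
    (2, "FONE", "Loo-ee /Kess-ler/"),   # phonetic variation, often dropped
    (1, "_UID", "example-user-defined-value"),
])
```

Here `person.first_value("NAME")` is all the program needs for its own display, yet `person.to_gedcom()` still contains the FONE and `_UID` lines, so nothing is lost between import and export.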

There are some programs that can almost do this. Ancestral Quest and the no longer supported Personal Ancestral File (PAF) program have internal structures that helped define the GEDCOM standard when it was developed. The program Family Historian uses GEDCOM as its database. Even so, there are differences in what data these programs will display for you, allow you to edit, import from GEDCOM and export to GEDCOM.

No program is perfect. That’s why there are so many of them, because everyone (and every programmer) wants something a bit different.

Getting your data to transfer well from one program to another isn’t a matter of making the GEDCOM standard better. It’s a matter of getting the developers to all include in their data structures everything that GEDCOM has to offer, and that isn’t going to happen anytime soon.

3 Comments

1. coret (coret)
Netherlands flag
Joined: Thu, 15 Dec 2011
5 blog comments, 0 forum posts
Posted: Thu, 12 Oct 2023  Permalink

Why are you ignoring the latest GEDCOM version 7.0 and stick with the 24 year old one?

2. Steve Little (digitalarchivist)
United States flag
Joined: Wed, 10 Nov 2021
5 blog comments, 0 forum posts
Posted: Wed, 18 Oct 2023  Permalink

Thank you, Louis. This is helpful and timely.

3. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sun, 22 Oct 2023  Permalink

Bob (Coret):
Main reason is that 5.5.1 is supported by 98% of genealogy vendors, whereas the number who have implemented some form of 7.0 can be counted on one hand. There are many reasons why I don’t think 7.0 will ever take hold. See my newest blog post Is Updating the GEDCOM Standard Necessary?
