
Louis Kessler’s Behold Blog

Is Updating the GEDCOM Standard Necessary? - Sun, 22 Oct 2023

The GEDCOM Standard was first developed almost 40 years ago as a way to store genealogical data and transfer it between programs. It was developed about the same time the first genealogical software programs were developed.

The early programs developed a basic structure for genealogy, and the standard reflected that. The standard was updated many times mostly to ease its implementation and to transfer additional types of data, but the basic record structure has never really changed.

The standard that is common use now is GEDCOM 5.5.1. It was drafted in 1999 and finalized without changes in 2019. So in 24 years, the standard hasn’t changed. Similarly about 24 years ago, genealogy software had matured to the point where their data structures were set and rarely needed to change. Having the GEDCOM standard to base their data structures on had a lot to do with that.


What is a Standard and What is a Good Standard?

A Standard is an agreed-upon document that provides rules or instructions for its intended audience to follow in order to meet the document’s specific purpose.

GEDCOM’s intended audience is mostly genealogy software developers.

GEDCOM’s specific purpose is to facilitate transfer of genealogical data between software.

Most people would agree that a standard is a good standard if and only if:

  1. It has been adopted and is used by most of its intended audience.
  2. It is understandable and contains most of what is needed to serve its purpose.
  3. It is relatively stable from version to version without requiring major changes.

So is GEDCOM 5.5.1 a good standard?

  1. Almost all genealogy software developers today know about the GEDCOM standard and the vast majority use it as a way to share their software’s genealogical data with others or to get its data from other software. - 1 / 1
  2. All the rules are there. Genealogy software has been successfully sharing data with other software for 24 years using GEDCOM 5.5.1. - 2 / 2
  3. GEDCOM 5.5.1 hasn’t changed at all in 24 years. - 3 / 3

Giving it 3 out of 3, I can’t see why GEDCOM 5.5.1 would not be considered a “good” standard.


Data Doesn’t Transfer

GEDCOM is supposed to facilitate data transfer between programs.

If you are using genealogy program XXXX and decide you want to switch to genealogy program YYYY, then you need to transfer your data. So you export your data from XXXX to a GEDCOM 5.5.1 file and you import it from that file into program YYYY. You will likely find that a lot of your data did not transfer.

For the past 15 years, we’ve seen initiatives such as BetterGEDCOM, FHISO, and GEDCOM 7.0 try to improve GEDCOM 5.5.1 to enable much more of the data to transfer. The idea here was that there was something about GEDCOM 5.5.1 that was preventing the data transfer.

I believe this thinking is wrong.

The work I have done has led me to conclude that:

  • 5% of data doesn’t transfer because GEDCOM 5.5.1 cannot handle the specific type of data.
  • 35% of data doesn’t transfer because the receiving system did not implement the functionality that needs or uses that data, and thus did not have a data structure or table in its database to store it.
  • 60% of data doesn’t transfer because the developer did not use the correct GEDCOM 5.5.1 method, or used his own custom tags to do the transfer.

If only 5% of the data not transferring is due to GEDCOM, then the standard is not the problem.

If 35% is due to the receiving system not needing or accepting the data, then no improvements to the standard could fix that.

If 60% is due to developers not making the effort to correctly implement GEDCOM, then more education about the standard is needed.


What Is Not Needed

There is nothing inherently wrong with GEDCOM 5.5.1. What is not needed is a significant revision to it. What I am referring to, of course, is the release of GEDCOM 7.0 two years ago by FamilySearch.

GEDCOM 7.0 is written differently from GEDCOM 5.5.1. It no longer uses the GEDCOM form but a hierarchical container format. The standard Backus-Naur Form (BNF) used for defining the syntax was changed to “A Metasyntax for Structure Organization”. Changing the representation of the standard is akin to writing it in a different language. It makes the adoption of the standard by 5.5.1 users unnecessarily more difficult. Programmers do not want something to change just for the sake of change. They want a standard where every change is simple and understandable and meets a need. If it ain’t broke, don’t fix it.

The selling point of a new standard is for better data transfer. It seems like slim pickings if they are trying to reduce the 5% of the data that does not transfer. Adding new data structures is admirable if they are needed by the majority. But will enabling negative assertions, rich-text notes and “better” multimedia handling be useful if 35% of the systems will not need or accept that data and 60% of them will not follow the rules in using it?

After more than two years, very few genealogy developers have implemented GEDCOM 7.0. Fewer still have implemented the new features that 7.0 added.

There can be many different reasons for this, from technical to practical to the simple idea that they’d rather wait for everyone else to implement it before they spend their time and resources in doing it themselves.


What Is Needed

If you want more of your data to transfer between programs, you won’t get it by creating a new standard for that 5%, and you won’t be able to improve on the 35% that your destination program has not implemented.

The best you can do is to reduce the 60% of the data that is written incorrectly or read incorrectly or written as custom tags which the receiving system cannot understand. For that we need better resources that will help the developer implement the GEDCOM 5.5.1 standard as correctly as possible.

And there are a couple of resources available for that right now.

  1. The GEDCOM 5.5.1 Annotated Edition
  2. The GEDCOM 5.5.5 Specification

Both are available at: https://www.gedcom.org/gedcom.html

These specs were created in 2019 by Tamura Jones with the input of 9 genealogy software developers, myself included.

The GEDCOM 5.5.1 Annotated Edition takes all the knowledge and experience of these experts and adds them as notes into the original 5.5.1 standard. They explain whatever is not clear and give suggestions as to how to correctly implement GEDCOM.

The GEDCOM 5.5.5 Specification effectively updates the 5.5.1 standard with the notes from the 5.5.1 Annotated Edition and marks items that are no longer of practical use and should be deprecated from the 5.5.1 standard. For this reason, the 5.5.5 Specification should be used when writing a GEDCOM file, as it is 100% backward compatible with 5.5.1, except for some necessary correction of mistakes in 5.5.1 and relaxation of some length restrictions.




Conclusions

Is Updating the GEDCOM Standard Necessary?  I would say no. If anything, a few minor additions to 5.5.1 would be useful, but nothing major.

Moving to GEDCOM 7.0 could be dangerous as it might make data less likely to transfer correctly. Developers do not want to spend time changing their programs to implement features not needed by their own programs.

Available resources such as the 5.5.1 Annotated Edition and the 5.5.5 Specification that better explain how to implement GEDCOM can help developers make their GEDCOM more compatible with others.

Any future work on the GEDCOM standard should strongly discourage the use of user-defined (i.e. custom) tags, or even better, make them illegal.

Is Perfect GEDCOM Reading and Writing Necessary? - Mon, 16 Oct 2023

My answers might surprise you.


Reading Perfect GEDCOM

In my previous article Reading GEDCOM – A Guide for Programmers, I outlined a general technique that a programmer of a genealogy program could use to read GEDCOM. Basically, the programmer should read each line in using the generalized definition of a “gedcom_line”. Then from those lines, they should pick out the data that their program uses and load it into their program’s data structure or database.

But what if the programmer wanted to read the GEDCOM perfectly, i.e. exactly as the standard defines it?

That is entirely possible, though it is a bit more difficult. The programmer will have to write a parser for GEDCOM. There are a few open source GEDCOM parsers available in various programming languages, but there is no guarantee that any of them can read GEDCOM perfectly. So for perfection, it really will be up to the programmer to write their own parser in their language of choice.

Writing a parser was an assignment often given to Computer Science students when I was taking the subject in University back in the ‘70s. Somehow I managed to avoid any courses with that assignment.

Back in 2016, I wrote an article Unexpected BNF where I was planning to implement complete GEDCOM checking in Behold, and in that article, I came up with a scheme to do it. I never did follow through with that implementation.

But I have just revisited the idea and have now implemented a close to perfect GEDCOM parser in my developmental version of Behold, which will be included when I release this version.

What makes it nearly perfect is that I have entered the GEDCOM notation into my code in a manner that ensures I am following all the rules. The way I did this is similar to but slightly different from what is in my Unexpected BNF article.

For example, the GEDCOM INDI record in the Standard starts with this:

[image: the start of the INDI record definition in the GEDCOM 5.5.1 standard]

My code to parse this looks like this:

[image: the corresponding GC procedure calls in Behold’s code]

It’s very easy to check that I’ve translated each line of the Standard into a call to my GC procedure. My GC procedure does all the magic of building up the GEDCOM structure as an internal data structure in Behold. I won’t give the code here, but that routine is only 43 lines long (including blank lines and comments).

The internal data structure connects lines at one level to possible lines at the next lower level, gives the Tag (preceded by an @ if it has an XREF_ID) and the optional Line Value, which could be a value (i.e. a string in quotes that GEDCOM calls an element), a pointer (which starts with an @), or a structure (enclosed in << >>). It also gives the minimum and maximum number of times that line may occur under its higher level tag.

I ended up with about 400 of these GC calls. The constructed data structure ensures that the order and the number of occurrences of each line follow the standard.
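To illustrate the idea of these GC calls (my actual code is Delphi, so this is a hypothetical Python analogue, with made-up names like GC and SCHEMA), each call registers one line of the standard in a schema table: the parent tag, the tag, the kind of value it takes, and its minimum and maximum occurrences:

```python
# Hypothetical Python analogue of the GC calls described above.
SCHEMA = {}   # parent tag -> list of allowed child lines

def GC(parent, tag, value_kind=None, min_occurs=0, max_occurs=1):
    # Register one line of the standard; max_occurs=None means unlimited.
    SCHEMA.setdefault(parent, []).append(
        {"tag": tag, "value": value_kind, "min": min_occurs, "max": max_occurs}
    )

# A fragment of the INDI record, transcribed from the standard.
# The leading @ marks a tag that takes an XREF_ID, as described above.
GC(None, "@INDI", min_occurs=0, max_occurs=None)    # 0 @XREF:INDI@ INDI
GC("INDI", "RESN", "RESTRICTION_NOTICE", 0, 1)      # 1 RESN <RESTRICTION_NOTICE>
GC("INDI", "NAME", "NAME_PERSONAL", 0, None)        # 1 NAME <NAME_PERSONAL>
GC("INDI", "SEX", "SEX_VALUE", 0, 1)                # 1 SEX <SEX_VALUE>
```

A validator walking a GEDCOM file can then look up each line in SCHEMA to check that its tag is allowed under its parent and that its occurrence count is in range.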

I still needed to define the data values (i.e. elements) that are shown in single quotes. The GEDCOM standard defines the elements like this:

[image: an element definition from the GEDCOM 5.5.1 standard]

so I made an ELEM procedure for this:

[image: the corresponding ELEM procedure call in Behold’s code]

which contains the element name, minimum and maximum length, and the processing Behold will do on that element.

If the processing value is zero (0), then Behold will treat this element as any string value and will only check that the string’s length is in the allowed range.

If the processing is Elem_Custom, then Behold will do some custom processing for this element, e.g. for Sex, ensuring it is one of the three allowable values.

I did not use Elem_Custom for ROLE_IN_EVENT or ROMANIZED_TYPE because, even though they have legal values specified, they also allow any user-defined value (<ROLE_DESCRIPTOR> or <user defined>), meaning any string is allowed.

And one of the reasons why you can’t write a perfect parser for GEDCOM is that there are some minor mistakes in the standard, e.g. SEX_VALUE’s size is given as 1 to 7, when clearly it should be 1 to 1. This is a remnant of earlier GEDCOM versions where, in addition to single letters, the full words MALE, FEMALE and UNKNOWN were allowed. Changing the length to 1:1 was overlooked when the words were removed in later versions of the standard.

Similarly, ROLE_IN_EVENT is given a length of 1 to 15. But one of its values is <ROLE_DESCRIPTOR>, which can be any string of length 1 to 25. So the 1:15 value will cause truncation of a long role descriptor and should be 1:25.

Nonetheless, all the elements can be defined to check that the values in the GEDCOM file are valid and neither too long nor too short.
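To make the ELEM idea concrete, here is a hypothetical Python sketch (again, not my actual Delphi code; the names ELEM, check_element and the processing constants are made up). It uses the corrected lengths discussed above, i.e. 1:1 for SEX_VALUE and 1:25 for ROLE_IN_EVENT:

```python
# Hypothetical sketch of the ELEM table: element name, min/max length,
# and a processing kind (plain string check, or custom validation).
ELEM_PLAIN, ELEM_CUSTOM = 0, 1
ELEMENTS = {}

def ELEM(name, min_len, max_len, processing=ELEM_PLAIN):
    ELEMENTS[name] = (min_len, max_len, processing)

ELEM("RESTRICTION_NOTICE", 6, 7)
ELEM("SEX_VALUE", 1, 1, ELEM_CUSTOM)   # standard says 1:7, but 1:1 is correct
ELEM("ROLE_IN_EVENT", 1, 25)           # standard says 1:15; 1:25 avoids truncation

def check_element(name, value):
    """Return True if the value is a legal instance of the named element."""
    lo, hi, proc = ELEMENTS[name]
    if not (lo <= len(value) <= hi):
        return False
    if proc == ELEM_CUSTOM and name == "SEX_VALUE":
        return value in ("M", "F", "U")   # the three allowable values
    return True
```

A processing value of ELEM_PLAIN corresponds to the zero case described above: only the length range is checked.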


So We Can Read GEDCOM Nearly Perfectly. Should We?

The real purpose of reading GEDCOM is to load the genealogy data from the GEDCOM file into your program. If you only read “perfect” GEDCOM, you will miss out on loading valid data contained in slightly imperfect GEDCOM. As I mentioned earlier, a general GEDCOM reader should be used to read a GEDCOM file.

However, a “perfect” GEDCOM checker could be used to check that the GEDCOM follows standards. Your reader can check the data and provide messages about what is wrong in the file. Wouldn’t that be important to know when saving the data back to a GEDCOM file?  (See the next section for more about this.)

But you should use a general GEDCOM reader to load the data into your program. That means, no, it isn’t necessary to read GEDCOM perfectly.


Writing Perfect GEDCOM

Okay then. Surely it is important to write “perfect” GEDCOM to a GEDCOM file. After all, if you don’t, then all those programs that read GEDCOM won’t be able to load data from your GEDCOM, will they?

And isn’t it programmers’ non-conformance to GEDCOM that is causing much of the data loss, as I talked about in my previous article?

Well, technically, a developer can have his program export all its data using his own custom tags by starting them with an underscore. Instead of using the GEDCOM tag BIRT for birth and DEAT for death, he can use _BORN and _DIED. His GEDCOM could still be 100% valid because although GEDCOM discourages such user-defined tags, they are allowed anywhere.

I’m sure the developer will read the data from his own _BORN and _DIED tags correctly back into his own program. The problem is that very few, if any, other programs will load the birth and death data from those tags into their data structure, and the data will not transfer.

The solution is (to start a movement) to try to get programmers to avoid using their own custom tags and to use a valid GEDCOM construct whenever possible. It would take a bit more effort to see if GEDCOM already has a place for each of their data elements. Here are a few examples.

  • I’ve seen one program export these facts and events:  _DEGREE, _SEPARATED, _MILITARY_SERVI, and _NAMESAKE
    Most programs would throw these away.
    Custom facts and events in INDI or FAM records should be defined using an EVEN or FACT tag followed by a TYPE tag, e.g.
       1 EVEN
       2 TYPE Degree
    and
       1 FACT
       2 TYPE Separated
    and
       1 EVEN
       2 TYPE Military Service
    and extra name information can be done similarly
       1 NAME
       2 TYPE Namesake
    Most programs would understand the above and keep them.
  • I’ve seen programs do all sorts of weird things when exporting citations. They could instead use the very powerful but underused PAGE tag in GEDCOM with its WHERE_WITHIN_SOURCE value, e.g.:
       2 SOUR @S22@
       3 PAGE Film: 1234567, Frame: 344, Line: 28
    The citation elements and values can simply be listed in pairs divided by colons and separated by commas. Most programs would load this correctly and display the citation beautifully the way the program wants to.
  • Invalid dates, e.g.:
    1950-57 is not valid. You must say FROM 1950 TO 1957.
    If you really want to include an invalid date, GEDCOM allows just about anything if you enclose it in parenthesis, e.g.: (1950-57)
  • There are a number of data values that allow any textual value. These should be used instead of creating new tags. e.g.
    -  The ASSO tag allows any text for the relationship between 2 people.
    -  The ROLE tag allows any text for the role a person plays in the citation of an event.
  • And what’s wrong with throwing anything else that has no representation in GEDCOM into a NOTE?  At least it will be read and retained by the program reading it.

Most of this is just basic GEDCOM. Almost all programs will read the above and place your data into their data structures or database. There would be minimal data loss.
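The custom-tag-to-standard-construct mapping suggested above can be sketched in a few lines. This is a hypothetical Python illustration (the export_fact name and the mapping table are mine, invented for this example; the custom tags are the ones mentioned earlier):

```python
# Map program-specific tags to standard EVEN/FACT + TYPE constructs,
# as recommended above. The custom tag names are illustrative examples.
CUSTOM_TO_STANDARD = {
    "_DEGREE":         ("EVEN", "Degree"),
    "_SEPARATED":      ("FACT", "Separated"),
    "_MILITARY_SERVI": ("EVEN", "Military Service"),
}

def export_fact(tag, level=1):
    """Return GEDCOM lines for a fact, preferring standard constructs."""
    if tag in CUSTOM_TO_STANDARD:
        std_tag, type_value = CUSTOM_TO_STANDARD[tag]
        return [f"{level} {std_tag}", f"{level + 1} TYPE {type_value}"]
    return [f"{level} {tag}"]   # already a standard tag

print(export_fact("_DEGREE"))
# ['1 EVEN', '2 TYPE Degree']
```

A receiving program that understands EVEN, FACT and TYPE (which almost all do) will keep this data, whereas it would most likely have thrown the custom tags away.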


So We Can Write GEDCOM Nearly Perfectly. Should We?

A programmer does not need to have his program write ALL of GEDCOM perfectly. Only the data being exported needs to be written correctly, which is often a small subset of the full GEDCOM standard.

What’s important is that GEDCOM tags and constructs are used rather than the lazy-man’s user-defined tag which often causes data loss.

And with regards to the maximum and minimum lengths of values, you needn’t worry about them. The GEDCOM maximums and minimums are often rather arbitrary and were added at a time (over 20 years ago) when minimizing program size was important. Many genealogy programs today use text strings that can be 255 characters or more in their data structures. They then export strings that are longer than GEDCOM’s maximum length for that value. You do not want a program to truncate the data value to adhere to GEDCOM’s maximum length. That would be a loss of data. Most programs don’t check for the maximum length and should read it fine unless it is excessively long. So here, with regards to value lengths, don’t write GEDCOM perfectly.

Reading GEDCOM – A Guide for Programmers - Thu, 12 Oct 2023

I’m back hard at work on Behold, trying to finish off the many features I’ll be including in Version 1.3 when it’s ready.

Some of the changes for Version 1.3 made me take a look again at Behold’s GEDCOM import. Behold currently is a flexible GEDCOM reader. That is, it will accept almost anything that looks like GEDCOM and present it the best it can. In so doing, Behold displays as much of your data as possible so you can see what the GEDCOM file contains.

This is important because many programs that export GEDCOM do not follow all the rules. They include deprecated constructs, illegal keywords, user-defined tags that no program but their own understands, and many other irregularities. Most developers no longer care whether other programs can import their data; they just want to ensure that their own program can read its own data back in.

This article is probably about 25 years overdue. I don’t believe anyone has given a simple programmer’s guide to GEDCOM reading before. In the past 5 to 10 years, very few new genealogy programs have been written, and there’s very little reason why a new GEDCOM standard is needed. That’s because almost all genealogy software that currently exists reads and writes the GEDCOM 5.5.1 Standard draft that was released in 1999 and was deemed final, unchanged, in 2019.

I have stated in the past that I think that GEDCOM 5.5.1 is an excellent standard and the reason why genealogy data doesn’t transfer between two programs is rarely the Standard’s fault. Most often it’s the programmer’s fault. See my 2011 article: Build a BetterGEDCOM or learn GEDCOMBetter?

[image]


Reading GEDCOM 101

So let’s go through the basics. What does a programmer have to do to read a GEDCOM file? Since almost every program can export its data to GEDCOM 5.5.1 or to something resembling GEDCOM 5.5.1, concentrating on reading that particular version of GEDCOM should do the trick.

It doesn’t matter what programming language is used. A GEDCOM file is a simple text file, and just about every language can read a text file. It doesn’t matter what data structures or database the information is loaded into. That can be the programmer’s preference. For Behold, I happen to use Delphi (object oriented Pascal) and in-memory data structures to store the data.

To read GEDCOM, the programmer will need a copy of the GEDCOM 5.5.1 standard. It is 101 pages and is well written with very few errors. It uses a Backus-Naur form (BNF) type of notation, which most programmers will either be familiar with or find easy to understand.

Chapter 1 of the Standard gives the Data Representation Grammar. Basically, all you have to know is that:

[image: the gedcom_line grammar from Chapter 1 of the standard]

So step 1 for the programmer is to make a subroutine that will process the next line and return what’s in it. A call to the routine might look like:

ParseLine(Level, OptXRefID, Tag, OptLineValue)

which gets the values of Level, OptXRefID, Tag and OptLineValue from the line.

Then you simply process the line using the values.

Easy peasy.

There of course is a bit more to it, and the Standard gives you the exact breakdown of each of the parts of the line. It tells you, for example, that the level is a number from 0 to 99 and that delim is a space. OptXRefID is an identifier between two at signs, e.g. @I43@, Tag is a string, and OptLineValue is either a Value (which is a string) or a Pointer containing an identifier’s ID, i.e. @I43@.
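As a sketch of what such a ParseLine routine might look like, here is a hypothetical Python version (Behold itself is written in Delphi, and the regular expression here is a simplification of the full Chapter 1 grammar):

```python
import re

# Matches one gedcom_line: level, optional @XREF_ID@, tag, optional value.
# A simplified pattern; the grammar in Chapter 1 of the standard is stricter.
GEDCOM_LINE = re.compile(
    r"^(?P<level>\d{1,2})"        # level: a number from 0 to 99
    r"(?: (?P<xref>@[^@]+@))?"    # optional cross-reference ID, e.g. @I43@
    r" (?P<tag>[A-Za-z0-9_]+)"    # tag, e.g. INDI, NAME, _CUSTOM
    r"(?: (?P<value>.*))?$"       # optional line value or pointer
)

def parse_line(line):
    """Return (level, xref_id, tag, line_value) from one GEDCOM line."""
    m = GEDCOM_LINE.match(line.rstrip("\r\n"))
    if m is None:
        raise ValueError(f"not a gedcom_line: {line!r}")
    return int(m.group("level")), m.group("xref"), m.group("tag"), m.group("value")

print(parse_line("0 @I43@ INDI"))       # (0, '@I43@', 'INDI', None)
print(parse_line("1 NAME John /Doe/"))  # (1, None, 'NAME', 'John /Doe/')
```

Whether a line value is a Value or a Pointer can then be decided by checking if it starts and ends with an at sign.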

The amazing thing about this simple scheme is that it can be used to define any data representation language. In that way, it is similar to XML and JSON. Why didn’t the developers of GEDCOM use XML or JSON for their standard? Well, they weren’t invented yet. So the developers of GEDCOM developed their own data representation language. Almost anything written with XML or JSON could also be written with the GEDCOM Grammar, and vice versa.


Reading GEDCOM 201

Well that’s all fine and good, but we have no genealogical data defined yet.

So thus comes Chapter 2 of the Standard:  The GEDCOM Structure. This section defines what the valid tags are at each level, and what type of value or pointer each tag might have, and what substructures each tag at each level might have.

This is all explained quite clearly in the Standard.

Using the ParseLine routine above, there are only a few different types of lines to process:

  1. A Level 0 line starts a new record. The Tag defines the record type. The Tag could be INDI (Individual), FAM (Family), SOUR (Source), REPO (Repository), OBJE (Multimedia) and a few others. When a Level 0 record is reached, the programmer will start a new entry for it in a data structure or table of a database.
  2. A Level 1 line starts a fact or other information for the Level 0 record before it. Each Level 1 fact is associated with the Level 0 record it belongs to.
  3. Level 2 and below is additional information belonging to the Level 1 fact before it.
  4. The Pointers are the connectors. There are two types of connectors:
    • Individual to Family connectors, to connect parents to children and spouses to each other using the FAM record as the intermediate linkage. It’s a tricky concept to grasp but programmers will understand it as it’s a relatively simple structure in graph theory.
    • Connectors to other records, e.g. a fact is connected to a source record using a SOUR tag with a pointer (which is how GEDCOM implements citations), or a source is connected to a repository with a REPO tag with a pointer.

Seriously, that’s basically it.
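The line types above can be sketched as a small loader. This is a hypothetical Python sketch (not Behold’s Delphi code), using nested dicts for records and a stack to track the structure open at each level:

```python
def parse_line(line):
    # Minimal split of one gedcom_line: level, optional @XREF@, tag, value.
    level_str, rest = line.strip().split(" ", 1)
    xref = None
    if rest.startswith("@"):                  # a new record's cross-reference ID
        xref, rest = rest.split(" ", 1)
    tag, _, value = rest.partition(" ")
    return int(level_str), xref, tag, value or None

def load_gedcom(lines):
    records = []    # all Level 0 records, in file order
    stack = []      # the line open at each level
    for line in lines:
        level, xref, tag, value = parse_line(line)
        node = {"tag": tag, "xref": xref, "value": value, "subs": []}
        if level == 0:
            records.append(node)              # new INDI/FAM/SOUR/... record
            stack = [node]
        else:
            stack = stack[:level]             # close any deeper structures
            stack[-1]["subs"].append(node)    # attach to the line above it
            stack.append(node)
    return records

recs = load_gedcom(["0 @I1@ INDI", "1 NAME John /Doe/", "2 GIVN John", "1 FAMS @F1@"])
```

Pointers such as @F1@ are simply kept as values here; resolving them into the individual-to-family connections described in point 4 is a second pass over the loaded records.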

You can add some data checking and error messages if the GEDCOM isn’t valid, but the amount you do is up to you if all you are doing is reading GEDCOM.


What About the Program’s Data Structure?

Most genealogy software developers already have their own program and data structures or database to store their data. To read GEDCOM, all they have to do is pigeon-hole each line of data into their own data.

Err, um, and that’s where the data loss in transferring GEDCOM occurs. If a GEDCOM file includes types of data, e.g. a phonetic name variation, that the developer’s program doesn’t support, well, what’s the developer to do? Unless he sees a need for that data in his program, he’ll just throw it away, as he will with any user-defined, program-specific data that often comes in GEDCOM.

The most common reason GEDCOM data doesn’t transfer isn’t that the GEDCOM standard can’t represent the data. Rather, it’s that the receiving program doesn’t have a place for the data.

Now if the programmer didn’t already have a structure for his data, he could define one based on the GEDCOM Structure. Doing so, he could have a place for everything that’s possible to be transferred. Then it’s up to the programmer to develop reports and forms to allow the user to see and edit the data. Whether or not they make reports and forms that can display and edit all possible data from a GEDCOM file is a different matter.

There are some programs that can almost do this. Ancestral Quest and the no longer supported Personal Ancestral File (PAF) program have internal structures that helped define the GEDCOM standard when it was developed. The program Family Historian uses GEDCOM as its database. Even so, there are differences in what data these programs will display for you, allow you to edit, import from GEDCOM and export from GEDCOM.

No program is perfect. That’s why there are so many of them, because everyone (and every programmer) wants something a bit different.

Getting your data to transfer well from one program to another isn’t a matter of making the GEDCOM standard better. It’s a matter of getting the developers to all include in their data structures everything that GEDCOM has to offer, and that isn’t going to happen anytime soon.