Louis Kessler’s Behold Blog

Is Perfect GEDCOM Reading and Writing Necessary? - Mon, 16 Oct 2023

My answers might surprise you.


Reading Perfect GEDCOM

In my previous article Reading GEDCOM – A Guide for Programmers, I outlined a general technique that a programmer of a genealogy program could use to read GEDCOM. Basically, the programmer should read each line in using the generalized definition of a “gedcom_line”. Then from those lines, they should pick out the data that their program uses and load it into their program’s data structure or database.
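To make the general technique concrete, here is a minimal Python sketch. It is not code from Behold or from the earlier article; the names `LINE_RE`, `GedcomLine`, `parse_line` and `load_births` are made up for illustration, and the pattern is a deliberately loose reading of a “gedcom_line” (level, optional cross-reference ID, tag, optional value). The second function shows the “pick out the data your program uses” step for a hypothetical program that only cares about birth dates.

```python
import re
from typing import NamedTuple, Optional

# Loose pattern for one line: level, optional @XREF@, tag, optional value.
LINE_RE = re.compile(
    r"^\s*(?P<level>\d+)"          # level number
    r"(?: +(?P<xref>@[^@]+@))?"    # optional cross-reference id
    r" +(?P<tag>[A-Za-z0-9_]+)"    # tag, including _CUSTOM tags
    r"(?: (?P<value>.*))?$"        # optional line value
)

class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]
    tag: str
    value: str

def parse_line(text: str) -> Optional[GedcomLine]:
    """Return the parts of one line, or None if it does not look like a gedcom_line."""
    m = LINE_RE.match(text.rstrip("\r\n"))
    if m is None:
        return None
    return GedcomLine(int(m["level"]), m["xref"], m["tag"], m["value"] or "")

def load_births(lines):
    """Keep only what this hypothetical program uses: a birth date per individual."""
    births, current, in_birt = {}, None, False
    for raw in lines:
        line = parse_line(raw)
        if line is None:
            continue  # skip unreadable lines rather than abort the load
        if line.level == 0:
            current = line.xref if line.tag == "INDI" else None
            in_birt = False
        elif line.level == 1:
            in_birt = current is not None and line.tag == "BIRT"
        elif line.level == 2 and in_birt and line.tag == "DATE":
            births[current] = line.value
    return births
```

Everything the program does not recognize is simply ignored, which is what lets this style of reader cope with slightly imperfect files.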

But what if the programmer wanted to read the GEDCOM perfectly, i.e. exactly as the standard defines it?

That is entirely possible, though a bit more difficult. The programmer will have to write a parser for GEDCOM. There are a few Open Source GEDCOM Parsers available in various programming languages, but there is no guarantee that any of them can read GEDCOM perfectly. So for perfection, it really will be up to the programmer to write their own parser in their language of choice.

Writing a parser was an assignment often given to Computer Science students when I was taking the subject in University back in the ‘70s. Somehow I managed to avoid any courses with that assignment.

Back in 2016, when I was planning to implement complete GEDCOM checking in Behold, I wrote an article, Unexpected BNF, in which I came up with a scheme to do it. I never did follow through with that implementation.

But I have just revisited the idea and have now implemented a close to perfect GEDCOM parser in my developmental version of Behold, which will be included when I release this version.

What makes it nearly perfect is that I have entered the GEDCOM notation into my code in a manner that ensures I am following all the rules. The way I did this is similar to but slightly different from what is in my Unexpected BNF article.

For example, the GEDCOM INDI record in the Standard starts with this:

image

My code to parse this looks like this:

image

It’s very easy to check that I’ve translated each line of the Standard to my calls to my GC procedure. My GC procedure does all the magic of building up the GEDCOM structure as an internal data structure in Behold. I won’t give the code here, but that routine is only 43 lines long (including blank lines and comments).

The internal data structure connects lines at one level to possible lines at the next lower level. For each line it gives the Tag (preceded by an @ if it has an XREF_ID) and the optional Line Value, which could be a value (i.e. a string in quotes that GEDCOM calls an element), a pointer (starts with a @), or a structure (enclosed in << >>). It also gives the minimum and maximum number of times that line may occur under its higher level tag.

I ended up with about 400 of these GC calls. The constructed data structure ensures that the order and the number of occurrences of each line follow the standard.
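For readers who want to picture such a structure, here is a small Python sketch of the idea. It is not the GC procedure from Behold, whose code is not shown here; `Rule`, `gc`, `MANY` and `check` are hypothetical names, the grammar fragment is a hand-picked piece of INDI rather than the full record, and the check covers only unknown tags and occurrence counts, not line order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

MANY = float("inf")  # stands in for the Standard's "M" (no upper limit)

@dataclass
class Rule:
    """One allowed line: its tag, how often it may occur, and its allowed children."""
    tag: str
    min_count: int
    max_count: float
    children: Dict[str, "Rule"] = field(default_factory=dict)

def gc(parent: Rule, tag: str, min_count: int, max_count: float) -> Rule:
    """Hypothetical helper: register a child line under parent and return it."""
    rule = Rule(tag, min_count, max_count)
    parent.children[tag] = rule
    return rule

# A tiny fragment only: SEX {0:1}, NAME {0:M}, and TYPE {0:1} under NAME.
indi = Rule("INDI", 1, 1)
gc(indi, "SEX", 0, 1)
name = gc(indi, "NAME", 0, MANY)
gc(name, "TYPE", 0, 1)

Node = Tuple[str, list]  # (tag, list of child nodes)

def check(node: Node, rule: Rule, path: str = "") -> List[str]:
    """Return messages for unknown tags and for occurrence counts out of range."""
    tag, kids = node
    here = f"{path}/{tag}"
    messages, counts = [], {}
    for kid in kids:
        kid_tag = kid[0]
        counts[kid_tag] = counts.get(kid_tag, 0) + 1
        if kid_tag.startswith("_"):
            continue  # user-defined tags are allowed anywhere
        if kid_tag not in rule.children:
            messages.append(f"{here}: unexpected tag {kid_tag}")
        else:
            messages.extend(check(kid, rule.children[kid_tag], here))
    for child in rule.children.values():
        n = counts.get(child.tag, 0)
        if not child.min_count <= n <= child.max_count:
            messages.append(f"{here}: {child.tag} occurs {n} times")
    return messages
```

Calling `check(("INDI", [("SEX", []), ("SEX", [])]), indi)` would report that SEX occurs twice, because the fragment allows it at most once.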

I still needed to define the data values (i.e. elements) that are in single quotes. The GEDCOM standard defines the elements like this:

image

so I made an ELEM procedure for this:

image

which contains the element name, minimum and maximum length, and the processing Behold will do on that element.

If the processing value is zero (0), then Behold will treat this element as any string value and will only check that the string’s length is in the allowed range.

If the processing is Elem_Custom, then Behold will do some custom processing for this element, e.g. for Sex, ensuring it is one of the three allowable values.

I did not use Elem_Custom for ROLE_IN_EVENT or ROMANIZED_TYPE, because even though they have legal values specified, they also allow any user-defined value (<ROLE_DESCRIPTOR> or <user defined>), meaning any string is allowed.

One of the reasons you can’t write a perfect parser for GEDCOM is that there are some minor mistakes in the standard. For example, the SEX_VALUE size is given as 1 to 7, when clearly it should be 1 to 1. This is a remnant of earlier GEDCOM versions where, in addition to single letters, the full words MALE, FEMALE and UNKNOWN were allowed. Changing the length to 1:1 was overlooked when the words were removed in later versions of the standard.

Similarly, ROLE_IN_EVENT is given a length of 1 to 15. But one of its values is <ROLE_DESCRIPTOR> which can be any string from length 1 to 25. So the 1:15 value will cause truncation of a long role descriptor and should be 1:25.

Nonetheless, all the elements can be defined to check that the values in the GEDCOM file are all valid and not too long or short.
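The element definitions can be pictured with a short Python sketch. Again, this is not Behold’s ELEM procedure; `Elem`, `elem`, `ELEMENTS` and `check_value` are hypothetical names, and using `None` versus a callable is just one assumed way to represent “any string” versus custom processing. The sizes are entered as the Standard prints them, including the 1:7 and 1:15 oddities discussed above.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

class Elem(NamedTuple):
    """An element definition: allowed length range plus an optional custom check."""
    min_len: int
    max_len: int
    custom: Optional[Callable[[str], bool]] = None  # None means any string is fine

ELEMENTS: Dict[str, Elem] = {}

def elem(name: str, min_len: int, max_len: int, custom=None) -> None:
    """Hypothetical registration helper for one element."""
    ELEMENTS[name] = Elem(min_len, max_len, custom)

elem("SEX_VALUE", 1, 7, lambda v: v in {"M", "F", "U"})  # custom: three legal values
elem("ROLE_IN_EVENT", 1, 15)    # no custom check: <ROLE_DESCRIPTOR> allows any string
elem("ROLE_DESCRIPTOR", 1, 25)

def check_value(name: str, value: str) -> List[str]:
    """Return messages about a value; the value itself is never changed."""
    spec = ELEMENTS[name]
    messages = []
    if not spec.min_len <= len(value) <= spec.max_len:
        messages.append(
            f"{name}: length {len(value)} is outside {spec.min_len}:{spec.max_len}"
        )
    if spec.custom is not None and not spec.custom(value):
        messages.append(f"{name}: {value!r} is not an allowed value")
    return messages
```

With these definitions, `check_value("SEX_VALUE", "MALE")` passes the length test (4 is within 1:7) but fails the custom check, which is exactly the mismatch in the Standard described above.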


So We Can Read GEDCOM Nearly Perfectly. Should We?

The real purpose of reading GEDCOM is to load the genealogy data from the GEDCOM file into your program. If you only read “perfect” GEDCOM, you will miss out on loading valid data contained in slightly imperfect GEDCOM. As I mentioned earlier, a general GEDCOM reader should be used to read a GEDCOM file.

However, a “perfect” GEDCOM checker could be used to check that the GEDCOM follows standards. Your reader can check the data and provide messages about what is wrong in the file. Wouldn’t that be important to know when saving the data back to a GEDCOM file?  (See the next section for more about this.)

But you should use a general GEDCOM reader to load the data into your program. That means, no, it isn’t necessary to read GEDCOM perfectly.


Writing Perfect GEDCOM

Okay then. Surely it is important to write “perfect” GEDCOM to a GEDCOM file. After all, if you don’t, then all those programs that read GEDCOM won’t be able to load data from your GEDCOM, will they?

And isn’t it the non-conformance to GEDCOM by programmers that is causing much of the data loss, as I talked about in my previous article?

Well, technically, a developer can have his program export all its data using his own custom tags by starting them with an underscore. Instead of using the GEDCOM tag BIRT for birth and DEAT for death, he can use _BORN and _DIED. His GEDCOM could still be 100% valid because although GEDCOM discourages such user-defined tags, they are allowed anywhere.

I’m sure the developer will read the data from his own _BORN and _DIED tags correctly back into his own program. The problem is that few, if any, other programs are likely to load the birth and death data from those tags into their data structures, so the data will not transfer.

The solution is (to start a movement) to try to get programmers to avoid using their own custom tags and to use a valid GEDCOM construct whenever possible. It would take a bit more effort to see if GEDCOM already has a place for each of their data elements. Here are a few examples.

  • I’ve seen one program export these facts and events:  _DEGREE, _SEPARATED, _MILITARY_SERVI, and _NAMESAKE
    Most programs would throw these away.
    Custom facts and events in INDI or FAM records should be defined using an EVEN or FACT tag followed by a TYPE tag, e.g.
       1 EVEN
       2 TYPE Degree
    and
       1 FACT
       2 TYPE Separated
    and
       1 EVEN
       2 TYPE Military Service
    and extra name information can be done similarly
       1 NAME
       2 TYPE Namesake
    Most programs would understand the above and keep them.
  • I’ve seen programs do all sorts of weird things when exporting citations, when they could instead use the very powerful but underused PAGE tag in GEDCOM with its WHERE_WITHIN_SOURCE value, e.g.:
       2 SOUR @S22@
       3 PAGE Film: 1234567, Frame: 344, Line: 28
    The citation elements and values can simply be listed in pairs divided by colons and separated by commas. Most programs would load this correctly and display the citation beautifully the way the program wants to.
  • Invalid dates, e.g.:
    1950-57 is not valid. You must say FROM 1950 TO 1957.
    If you really want to include an invalid date, GEDCOM allows just about anything if you enclose it in parentheses, e.g.: (1950-57)
  • There are a number of data values that allow any textual value. These should be used instead of creating new tags. e.g.
    -  The ASSO tag allows any text for the relationship between 2 people.
    -  The ROLE tag allows any text for the role a person plays in the citation of an event.
  • And what’s wrong with throwing anything else that has no representation in GEDCOM into a NOTE?  At least it will be read and retained by the program reading it.
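A few of the suggestions above can be sketched as a small export helper in Python. This is only an illustration under stated assumptions: `CUSTOM_TO_STANDARD`, `export_fact`, `format_page` and `fix_year_range` are hypothetical names, the mapping covers only the custom tags mentioned above, and the year-range rewrite assumes a two-digit end year falls in the same century as the start year.

```python
import re

# Hypothetical mapping from one exporter's custom tags to a standard tag plus TYPE.
CUSTOM_TO_STANDARD = {
    "_DEGREE": ("EVEN", "Degree"),
    "_SEPARATED": ("FACT", "Separated"),
    "_MILITARY_SERVI": ("EVEN", "Military Service"),
}

def export_fact(tag: str, level: int = 1) -> list:
    """Return the GEDCOM lines to write for a fact or event tag."""
    if tag in CUSTOM_TO_STANDARD:
        standard_tag, type_text = CUSTOM_TO_STANDARD[tag]
        return [f"{level} {standard_tag}", f"{level + 1} TYPE {type_text}"]
    return [f"{level} {tag}"]  # already a standard tag: write it as-is

def format_page(parts: dict) -> str:
    """Join citation elements as 'Label: value' pairs separated by commas."""
    return ", ".join(f"{label}: {value}" for label, value in parts.items())

YEAR_RANGE = re.compile(r"^(\d{4})-(\d{2}|\d{4})$")

def fix_year_range(text: str) -> str:
    """Rewrite a range like 1950-57 as FROM 1950 TO 1957; leave other text alone."""
    m = YEAR_RANGE.match(text.strip())
    if m is None:
        return text
    start, end = m.group(1), m.group(2)
    if len(end) == 2:
        end = start[:2] + end  # assumption: same century as the start year
    return f"FROM {start} TO {end}"
```

For example, `export_fact("_DEGREE")` gives the two lines `1 EVEN` and `2 TYPE Degree`, and `format_page({"Film": "1234567", "Frame": "344", "Line": "28"})` gives the PAGE value shown in the citation example.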

Most of this is just basic GEDCOM. Almost all programs will read the above and place your data into their data structures or database. There would be minimal data loss.


So We Can Write GEDCOM Nearly Perfectly. Should We?

A programmer does not need to have his program write ALL of GEDCOM perfectly. Only the data that is being exported needs to be written correctly, which often is a small subset of the full GEDCOM standard.

What’s important is that GEDCOM tags and constructs are used rather than the lazy-man’s user-defined tag which often causes data loss.

And with regard to the maximum and minimum lengths of values, you needn’t worry about them. The GEDCOM maximums and minimums are often rather arbitrary and were added at a time (over 20 years ago) when minimizing program size was important. Many genealogy programs today use text strings that can be 255 characters or more in their data structures. They then export strings that are longer than the GEDCOM maximum length for that value. You do not want a program to truncate the data value to adhere to GEDCOM’s maximum length. That would be a loss of data. Most programs don’t check for the maximum length and should read it fine unless it is excessively long. So here, with regard to value lengths, don’t write GEDCOM perfectly.
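In code, that advice amounts to reporting an over-length value but writing it out in full. A minimal Python sketch, assuming a hypothetical `write_value` helper that is handed the Standard’s maximum for the value being written:

```python
from typing import Optional, Tuple

def write_value(level: int, tag: str, value: str, max_len: int) -> Tuple[str, Optional[str]]:
    """Return (line, warning). The value is written whole, never truncated."""
    warning = None
    if len(value) > max_len:
        warning = (
            f"{tag}: {len(value)} characters exceeds the standard maximum of {max_len}"
        )
    return f"{level} {tag} {value}", warning
```

Using the ROLE_IN_EVENT maximum of 15 mentioned earlier, `write_value(4, "ROLE", "Witness to the signing", 15)` still returns the full line `4 ROLE Witness to the signing`, along with a warning that the program can log or show to the user.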

1 Comment

1. Steve Little (digitalarchivist)
Posted: Wed, 18 Oct 2023

Louis, these two posts are gems. I found the GEDCOM 101 and 201 sections especially insightful. Thank you for these, and all your work, and for sharing the accumulated wisdom of your years of experience. You explain things in a way I find accessible and valuable.

 

The Following 1 Site Has Linked Here

  1. Friday's Family History Finds Oct 20, 2023 - Empty Branches on the Family Tree - Linda Stufflebean : Sun, 22 Oct 2023
    "Is Perfect GEDCOM Reading and Writing Necessary? by Louis Kessler on Behold Genealogy"
