
Louis Kessler’s Behold Blog

Complete Genealogy Data Transfer - Mon, 8 Jun 2015

Isn’t this what every genealogist wants?

I thought the problem was that when you export your data from one program, the second one doesn’t read it all in. The sources may not transfer properly. Some data may come in as notes rather than as the event information they should be. Some data may just be ignored completely. Forget about the formatting and don’t even think that you’ll ever get back those hundreds of hours you spent carefully filling out those source citation templates.

We’ve been complaining for years that GEDCOM doesn’t transfer all the data. I’ve said before that it’s 90% the programmers and only 10% GEDCOM itself, but the reason doesn’t really matter. What matters is that the data doesn’t all get through.

So what’s the solution?

I thought it was very clear.

When a new genealogy data communication standard is created, it must require all compliant programs to:

  1. Input the data using the standard,
  2. Output the data using the standard, and
  3. Pass all input data through to the output, including data it may not use or may not process, so that EVERYTHING that was input (and unchanged by the user during the run) will be output.

That number 3 is the key. The *only* way to get complete data transfer is to transfer the data completely, isn’t it?
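
Here’s a minimal sketch of what requirement 3 could look like in code. This is purely hypothetical Python, not Behold’s code and not anything FHISO has specified: it parses GEDCOM-style lines into a tree, and on export it re-emits every substructure, whether or not the program recognized its tag.

# Hypothetical pass-through sketch -- not Behold's code.
KNOWN_TAGS = {"NAME", "BIRT", "DEAT", "DATE", "PLAC", "SOUR"}  # tags this program processes

def parse(lines):
    # Build a tree of [level, tag, value, children] nodes.
    # (Simplified: a level-0 xref like @I1@ is treated as the tag.)
    root = [-1, "ROOT", "", []]
    stack = [root]
    for line in lines:
        level, _, rest = line.strip().partition(" ")
        tag, _, value = rest.partition(" ")
        node = [int(level), tag, value, []]
        while stack[-1][0] >= node[0]:   # climb back up to this node's parent
            stack.pop()
        stack[-1][3].append(node)
        stack.append(node)
    return root

def emit(node, out):
    # Re-emit every substructure. Note that emit never consults KNOWN_TAGS:
    # data the program didn't process still goes out, and that's the whole point.
    for child in node[3]:
        out.append(f"{child[0]} {child[1]} {child[2]}".rstrip())
        emit(child, out)

lines = ["0 @I1@ INDI",
         "1 NAME John /Smith/",
         "1 _CUSTOM a concept this program doesn't know",
         "2 _SUB with its own substructure"]
out = []
emit(parse(lines), out)
assert out == lines   # round trip: the unknown _CUSTOM structure survives intact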

For a moment, let me first reassure you that I am really working hard on Behold whenever I’m at my computer. But when I’m away and don’t have access to my development tools, I catch up on my other sites, including all the stuff going on at FHISO. I am interested in seeing a new genealogy data transfer standard to allow for the complete data transfer which is GEDCOM’s biggest problem. I’d like to see the effort move forward. And every so often, I just have to put my two cents in when I read an important post on the FHISO forums.

A week ago the FHISO Technical Standing Committee Coordinator, Luther Tychonievich, asked an excellent question: what is the best way, in a new genealogy data standard, for a program to handle a data structure that it does not support? He gave 3 options that lose some data, and a 4th option requiring that the program be able to input, edit, and re-export the file, keeping the data structure intact.

I immediately replied that the requirement was similar to option 4, but the program need not be able to edit the data structure. It only needs to input and re-export the file. In other words, the program must “pass through” all the data that it doesn’t use.

Wow! What a reaction. There are a lot of intelligent people, excellent programmers and deep thinkers on the FHISO mail list, and a thread started with the subject “Pass Through Requirement??”. I am not sure what it was that wasn’t clear, but there was almost complete rejection of the necessity of data pass-through.

I think what I said is important enough that I’d like to repost it here and get some opinions from the general genealogical community.

What do you think? Am I right or am I wrong?

This is what I said:

Sorry people. You can disagree, but I’m sticking by my guns. Data not processed must pass-through.

Let me reiterate my example:

Program A sends data to Program B. Program B doesn’t understand Concept 1, which Program A uses, so it throws away the Concept 1 data.

Program B sends the data it got to Program C. Program C doesn’t understand Concept 2, which both Program A and Program B use, so it throws away the Concept 2 data.

Program A now gets its original data back from Program C. All its Concept 1 and Concept 2 data is missing.

In other words, data gets lost whenever one program will not pass through data that it does not handle.

This is why I see a requirement of data pass-through as a necessity.

The non-transferability of data through GEDCOM is the number one complaint about GEDCOM and is really the primary reason why FHISO needs a new standard.

FHISO must write the new standard so that the concepts that not all programs will support (e.g. the information/evidence layer, the GPS, citation templates, capabilities for one-name or one-place researchers, evidence analysis, etc.) are sufficiently independent of each other that a program that does not handle a concept can simply pass the data through. It will take some thinking to do this properly, but it can be done.

But once you allow any data to be lost, all is lost.

If data loss is allowed then, to use an extreme example, a programmer might decide not to handle sources at all. They’ll do stuff with just the conclusion data and export just the conclusion data with none of the sources that were originally attached to the input data.

Yes, this program is compliant. It follows the standard for defining the data. FHISO will have to endorse it as a compliant program if data loss is allowed.

If FHISO is just creating a data definition standard, that is fine.

But FHISO is creating much more than that. FHISO is creating a data COMMUNICATION standard. See that key word. The data must be communicated between programs. Data loss does not communicate the data and is unacceptable.

Don’t take an example of HTML being lost by a text processor. That’s quite different. Take an example of sending your data up to Ancestry, editing it up on Ancestry, and then downloading it and not getting everything back, be it notes, sources, pictures, or maybe certain tags or data items that you don’t notice until it’s too late. Imagine wanting to move from Ancestry to FamilySearch and then later from FamilySearch to MyHeritage.

Yes, I know that there are all sorts of tricky little examples that seem to make this difficult, e.g. a person is edited while carrying unhandled data. But these are all solvable once the core idea of data pass-through is accepted and designed in.

Louis

Do you care if all your data transfers, or don’t you?

Sometimes It Works … But Not Always - Sat, 6 Jun 2015

Didn’t quite make my self-imposed May 31 deadline. It was a busy week.

I also got caught up in the attempt to add one last improvement into this version. I just can’t help myself. Yes, I know it’s better to get the version out first and then add the improvements later. But as I came to a screenshot I was going to include in the documentation, I felt a slight need to standardize and improve the data presentation.

So I thought about the structure I had assembled. If you take a look at the information for each event for a person being presented, I’ve set it up to look like this (with the indentation as shown):

Person
      Event: Date Place
            Event-details  
            Source, Analysis

An example of this is:

[screenshot: event information in the Person Details section]

Here you have several event details: the wife’s name and age, a photo, and the source of the marriage information along with its assessed quality and data from the record, which could just as well have included an analysis of the source record.

Good! Now for the Place Details section, why not flip around the Person and the Place, and include the same information, like this:

Place 
      Event: Date 
            Person  
                  Event-details
                  Source, Analysis                  

And an example would be:

[screenshot: the same information in the Place Details section]

I’m sure that One-Place studiers are especially going to love the presentation of the data this way. Everything is there, organized by place and then by date.

So I thought I’d do the same thing with the sources: in a similar manner, flip the Source with the Person in the first structure and get something like this:

Source 
      Event: Date Place 
            Person, Analysis 
                    Event-details

It seemed correct. And it contained everything I hoped it would. But something just wasn’t quite right.

After I played with it for a while, I realized what the problem was.

In the first example ordered by person, all the event information was associated with the person. In the second example ordered by place, all the event information was associated with the place.

But in this case, ordered by Source, not all the information is associated with the source. The other event information (in this case the picture) is associated with the marriage event for the person, but it is not associated with the source. That additional information could include other notes or even other sources.

The picture would need its own source information if it is to be assigned to the source.

So I relented and realized that in the Source Details, I could only have this structure:

Source 
      Event: Date Place 
            Person
                 Analysis

and it looks like this:

[screenshot: the Source Details section]

So the vision I had of presenting complete event information under the Person, under the Place, and under the Source just didn’t quite pan out. But this is the best you can do.
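
To see why in code terms, here’s a rough sketch using a hypothetical record shape with made-up field values (not Behold’s internal structures). The same event records can be regrouped by person or by place with everything intact, but when grouping by source, the event details have to stay behind; only the analysis, which describes the source-to-conclusion linkage, can follow the source.

from collections import defaultdict

# Hypothetical record shape with made-up values, for illustration only.
records = [
    {"person": "John J. McCarthy", "event": "Marriage", "date": "1898",
     "place": "Boston, Suffolk, MA",
     "details": ["wife's name and age", "photo"],   # attached to the event
     "source": "1900 US Census",
     "analysis": "Quality 3; enumerator listed the year incorrectly"},  # attached to the source reference
]

def group_by(records, key):
    grouped = defaultdict(list)
    for r in records:
        grouped[r[key]].append(r)
    return grouped

people = group_by(records, "person")  # Person > Event > details, source, analysis
places = group_by(records, "place")   # Place > Event > Person > details, source, analysis

# Source view: only the per-reference analysis comes along; the details
# (like the photo) would need their own source information to be shown here.
sources = {src: [(r["event"], r["date"], r["place"], r["person"], r["analysis"])
                 for r in rs]
           for src, rs in group_by(records, "source").items()}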

At the end of this, I did change the Place Details to the expanded structure, so it was a worthwhile diversion. This was something I had in my future plans for version 1.5, so that’s now done.

Still finishing off the Version 1.1 documentation. Doing this has been a really good audit that everything’s working the way I want it. But now I’m hoping this will have been the last diversion before releasing 1.1.

Recording Your Reasoning (Proof Argument) - Thu, 28 May 2015

So I was about to update the documentation section for Source Details, and I looked at how I had implemented this in version 1.0:

[screenshot: the Source Details section in version 1.0]

Version 1.1 was unchanged from this, so I thought the update of this section would be simple. But then it struck me. I was not displaying this information correctly.

In the couple of years since the last version of Behold, I had been increasing my knowledge about dealing with sources, and last August I wrote a paper on Standardizing Sources and Citation Templates, which I submitted to FHISO.

Point number 8 in that paper was something I realized was very important: to separate out the sources from the conclusions. What I said was:

All assumptions and conclusions and assessments of the source must be placed with the source reference, not with the source

Well, take a look at S5-3 in the screenshot above. Listed under the source record is some “data” that says:

The Enumerator for this census improperly listed John J. McCarthy as being born Sep 1800, he then changed the date to 1890, which is still incorrect. John is listed as being 8-months old on the date of the census, 9 Jun 1900, if you reverse the date for 8-months John would have been born in Sep 1899, which is the proper date.

Well, this information is not part of the source. This is the researcher’s assessment of the source. It should not be displayed as part of the source, but should be displayed as the analysis of the source used to arrive at the conclusion. In this case, the conclusion is the birth event of John J. McCarthy.

Similarly, there are those “Quality” values. Those also are not part of the source. They are the researcher’s assessment of the quality of the source with respect to arriving at the conclusion. One source can be assessed differently for different conclusions, e.g. a death record may be very good for the death date, but if it only includes an age at death, then it’s not as good for the birth date.

Let’s see where this information comes up in GEDCOM. There’s conclusion information:

1 BIRT
2 DATE SEP 1899
2 PLAC Boston, Suffolk, MA

and under that will be the source reference:

2 SOUR @S90@
3 PAGE Genealogy.com, Series: T623, …
3 QUAY 3
3 DATA The Enumerator for this census …

The SOUR line is the pointer to the source. The PAGE line describes the specific record used within the source.

Now those two lines, the QUAY and the DATA: they are not part of the source record and shouldn’t be displayed as part of it. They are part of the reference to the source. They describe the linkage, i.e. the analysis and reasoning used to come up with the conclusion.

As a result, the QUAY and DATA and the other information allowed with them (e.g. Notes, Objects, Date recorded, Event cited from) are all part of the linkage between the source record and the conclusion. What this means is that this information needs to be displayed in two places.

One place is with the conclusion, to describe the reasoning the source record brought to the conclusion:

[screenshot: the reasoning displayed with the conclusion]

The other is with the source record, to show the reasoning that was used with this source for each conclusion:

[screenshot: the reasoning displayed with the source record]

Note the difference between this S5-3 listing and the 1.0 version of it shown earlier. The Quality and Data are now shown attached to the conclusion event that is supported by the source record.

Previously they were included as part of the source record. When that was done, S5-4 did not have exactly the same Quality and Data values, so S5-4 was shown as a different record.

Now the source record can be treated as identical, and the former S5-3 and S5-4 can be put together as the new S5-3 with a combined total of 3 supported events. The reasoning based on that source record can now be attached individually to each event.
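
As a sketch of what that merge looks like (hypothetical Python with made-up values, not Behold’s code): once Quality and Data belong to the source reference rather than the source record, two records that differ only in those fields reduce to the same key and can be combined, with the reasoning kept per supported event.

# Hypothetical sketch with made-up values.
old_records = [
    {"id": "S5-3", "source": "1900 US Census", "page": "Series T623",
     "quay": 3, "data": "Enumerator listed the birth year incorrectly", "event": "Birth"},
    {"id": "S5-4", "source": "1900 US Census", "page": "Series T623",
     "quay": 2, "data": "", "event": "Residence"},
]

merged = {}
for r in old_records:
    key = (r["source"], r["page"])   # quay/data are no longer part of the source record
    merged.setdefault(key, []).append((r["event"], r["quay"], r["data"]))

# merged now holds one source record supporting both events,
# each event carrying its own quality assessment and analysis.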

This is really major. It has opened my eyes up to the fact that GEDCOM actually has a place for a researcher’s reasoning statements. The user’s analysis/reasoning can go into the NOTE statements that are placed in source references.

Doing this can allow a step-by-step proof argument to be documented and passed on through GEDCOM to another program. You would do it like this:

1 BIRT
2 DATE dd MMM yyyy
2 PLAC xyz
2 SOUR @S1@
3 NOTE The date and place were from the birth certificate
2 SOUR @S2@
3 NOTE Immigration record contained her age and country o
4 CONC f origin, agreeing with what I had.
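
And here’s a rough sketch (hypothetical Python, not Behold’s code) of how a receiving program could reassemble those reasoning steps. Note that CONC concatenates without adding a space, which is why “country o” and “f origin” join into “country of origin”; a fuller version would also handle CONT, which starts a new line.

def proof_argument(gedcom_lines):
    # Collect the NOTE under each SOUR reference, joining CONC continuations.
    steps, current = [], None
    for line in gedcom_lines:
        _level, _, rest = line.strip().partition(" ")
        tag, _, value = rest.partition(" ")
        if tag == "SOUR":
            current = {"source": value, "note": ""}
            steps.append(current)
        elif tag == "NOTE" and current is not None:
            current["note"] = value
        elif tag == "CONC" and current is not None:
            current["note"] += value   # no space inserted between the pieces
    return steps

# For the BIRT record above, this returns, in order:
#   @S1@: "The date and place were from the birth certificate"
#   @S2@: "Immigration record contained her age and country of origin, agreeing with what I had."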

Once I add editing to Behold, I’ll also add the ability to record your step-by-step proof argument, and you’ll be able to document and display all your reasoning.

I needed three days to make these changes to Behold. Now back to finishing the documentation and getting Version 1.1 out.