Louis Kessler's Behold Blog

Complete Genealogy Data Transfer - Mon, 8 Jun 2015

Isn’t this what every genealogist wants?

I thought the problem was that when you export your data from one program, the second one doesn’t read it all in. The sources may not transfer properly. Some data may come in as notes rather than as the event information they should be. Some data may just be ignored completely. Forget about the formatting and don’t even think that you’ll ever get back those hundreds of hours you spent carefully filling out those source citation templates.

We’ve been complaining for years that GEDCOM doesn’t transfer all the data. I’ve said before that it’s 90% the programmers and only 10% due to GEDCOM, but the reason doesn’t really matter. What matters is that it doesn’t do it.

So what’s the solution?

I thought it was very clear.

When a new genealogy data communication standard is created, it must require all compliant programs to:

  1. Input the data using the standard,
  2. Output the data using the standard, and
  3. Pass all input data through to the output including what it may not use or may not process so that EVERYTHING that was input (that is unchanged by the user during the run) will be output.

That number 3 is the key. The *only* way to get complete data transfer is to transfer the data completely, isn’t it?

For a moment, let me first reassure you that I am really working hard on Behold whenever I’m at my computer. But when I’m away and don’t have access to my development tools, I catch up on my other sites, including all the stuff going on at FHISO. I am interested in seeing a new genealogy data transfer standard that allows for complete data transfer, the lack of which is GEDCOM’s biggest problem. I’d like to see the effort move forward. And every so often, I just have to put my two cents in when I read an important post on the FHISO forums.

A week ago the FHISO Technical Standing Committee Coordinator, Luther Tychonievich, asked an excellent question: what is the best way, in a new genealogy data standard, for a program to handle a data structure that it does not support? He gave 3 options that lose some data, and a 4th option requiring that the program be able to input, edit, and re-export the file, keeping the data structure intact.

I immediately replied that the requirement was similar to option 4, but that the program need not be able to edit the data structure. It only needs to input and re-export the file. In other words, the program must “pass-through” all the data that it doesn’t use.

Wow! What a reaction. There are a lot of intelligent people, excellent programmers and deep thinkers on the FHISO mail list, and a thread started with the subject “Pass Through Requirement??”. I am not sure what it was that wasn’t clear, but there was almost complete rejection of the necessity of data pass-through.

I think what I said is important enough that I’d like to repost it here and get some opinions from the general genealogical community.

What do you think? Am I right or am I wrong?

This is what I said:

Sorry people. You can disagree, but I’m sticking by my guns. Data not processed must pass-through.

Let me restate my example:

Program A sends data to Program B. Program B doesn’t understand Concept 1 that Program A uses, so it throws away Concept 1 data.

Program B sends the data it got to Program C. Program C doesn’t understand Concept 2 that both Program A and Program B use, so it throws away Concept 2 data.

Program A now gets its original data back from Program C. All its Concept 1 and Concept 2 data is missing.

In other words, data gets lost when one program will not pass-through data that it will not handle.
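That lossy chain can be simulated in a few lines. This is my own toy sketch, not part of any standard or any of the programs discussed here; the "concept" keys are invented for illustration:

```python
# Simulate the A -> B -> C chain. Each program "supports" only some
# concepts; without pass-through it drops everything else on export,
# and the data never comes back.

def transfer(record, supported, pass_through):
    """Simulate one program importing and re-exporting a record."""
    if pass_through:
        return dict(record)  # unknown concepts are carried along untouched
    return {k: v for k, v in record.items() if k in supported}

original = {"NAME": "John Smith", "CONCEPT1": "used by A", "CONCEPT2": "used by A and B"}

# Without pass-through: B drops Concept 1, then C drops Concept 2.
via_b = transfer(original, supported={"NAME", "CONCEPT2"}, pass_through=False)
via_c = transfer(via_b, supported={"NAME"}, pass_through=False)
print(via_c)  # {'NAME': 'John Smith'} -- Concepts 1 and 2 are gone

# With pass-through: everything survives the round trip back to A.
via_b = transfer(original, supported={"NAME", "CONCEPT2"}, pass_through=True)
via_c = transfer(via_b, supported={"NAME"}, pass_through=True)
print(via_c == original)  # True
```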

This is why I see a requirement of data pass-through as a necessity.

The non-transferability of data through GEDCOM is the number one complaint about it, and is really the primary reason why FHISO needs a new standard.

FHISO must write the new standard so that the concepts not all programs will support (e.g. the information/evidence layer, GPS, citation templates, capabilities for one-name or one-place researchers, evidence analysis, etc.) are sufficiently independent of each other that a program which does not handle a concept can simply pass its data through. It will take some thinking to do this properly, but it can be done.

But once you allow any data to be lost, all is lost.

If data loss is allowed, then using an extreme example, a programmer might decide not to handle sources at all. They’ll do stuff with just the conclusion data and export just the conclusion data with none of the sources that were originally attached to the input data.

Yes, this program is compliant. It follows the standard for defining the data. FHISO will have to endorse it as a compliant program if data loss is allowed.

If FHISO is just creating a data definition standard, that is fine.

But FHISO is creating much more than that. FHISO is creating a data COMMUNICATION standard. See that key word. The data must be communicated between programs. Data loss does not communicate the data and is unacceptable.

Don’t take an example of HTML being lost by a text processor. That’s quite different. Take an example of sending your data up to Ancestry, editing it up on Ancestry, and then downloading it and not getting everything back, be it notes, sources, pictures, or maybe certain tags or data items that you don’t notice until it’s too late. Imagine wanting to move from Ancestry to FamilySearch and then later from FamilySearch to MyHeritage.

Yes, I know that there are all sorts of tricky little examples that seem to make this difficult: e.g. person is edited with unhandled data. But these are all solvable once the core idea of data pass through is accepted and designed.

Louis

Do you care if all your data transfers, or don’t you?

16 Comments

1. arb (arb)
Australia flag
Joined: Wed, 25 Feb 2015
7 blog comments, 0 forum posts
Posted: Mon, 8 Jun 2015  Permalink

(Apologies in advance - I’m on my iPad without a decent keyboard and this is mostly stream-of-consciousness stuff…)

This is a very tricky (and potentially messy) area and I don’t think there is one “right” way of doing this. To start with, the restrictions you want to place on software implementing this new standard will severely restrict the take-up of the standard. I think you are trying to put too much responsibility on a data interchange standard. The key should be “data interchange” - keep the standard’s focus on that and this becomes less of an issue.

What do I mean by that statement? The standard should _only_ specify _how_ data exported from an app should be formatted, and how that data should be interpreted upon import. The standard should make recommendations on how to handle unsupported data elements/constructs, but these should be optional recommendations only. IMHO, a genealogy data format standard should not be concerned with pass-through - yes, complete pass-through might be nice to have, but it is not essential.

Some apps may only ever consume data files produced by the standard (i.e., tree printing/charting apps) and may never need to export data, while other apps might only ever produce data files and never consume them (my apps will fall into this category), and others will be both producers and consumers.

One beef I have with some of the genealogy software I have used is that the developers tried to stick too close to GEDCOM when designing their apps’ data structures. This severely constrains the software’s ability to innovate - if it’s not in GEDCOM, the devs won’t put the features in their app. Your approach of enforcing all elements of the standard will similarly constrain developers (whether as a real constraint or just in the developers’ minds) and this is not good, IMHO.

Developers need to be able to implement the features they want in their apps, and that includes leaving out features if it makes sense for their app to do so.

2. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Tue, 9 Jun 2015  Permalink

Arb:

I think data exchange is everything. If you can’t export the data your program creates, then it’s an island unto itself. So yes, developers need to implement the features they want, and they only have to make use of the data they need. But if they can’t add their contribution to the data stream without deleting some of the stream in the process, they’ll be doing more harm than good.

With regards to whether this will restrict take-up of the new standard: That was one of the concerns of Luther on the FHISO list, and I had responded with this:

Luther said:
> Option 4 suggests to me a significant barrier to market share (at least in
> regard to export)

Maybe at first. But if the new standard is presented as one that will NOT
lose data, I think it will be one that is wanted and will eventually be
accepted by all once the early adopters show that it works.

To enable this, the standard will have to be of a form that will allow
easy implementation of pass-through of data and have very specific
instructions to tell the developer how to do this.

The exact scheme, of course, will depend on how the final standard is
structured. I don’t pretend to have the answer now, but it could be as
simple as saying something like this (in GEDCOMish):

——————————-
On input, any data not supported by the software should be stored in a
table internally, e.g.:

@I3@ INDI.INDI @I2@
@I3@ INDI.BIRT.SOUR.NOTE.SOUR @S51@

and on output merged back into any INDIs, SOURs, NOTEs etc. that have
not been deleted by the user during the run.
——————————-

If this entire write-up in the standard ends up taking more than a page or
two, then it is probably too complicated and should be reworked.
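To make the GEDCOMish side-table idea above concrete, here is one possible shape for it in Python. This is my own illustration only; the supported tags, the one-level parsing, and the custom `_MILSVC` tag are all invented simplifications, not anything from a standard:

```python
# Unsupported lines are kept in a side table keyed by the record they
# belong to, and merged back into that record on export -- unless the
# user deleted the whole record during the run.

SUPPORTED_TAGS = {"NAME", "BIRT", "DEAT"}  # whatever this program handles

def import_record(record_id, lines):
    """Split a record's lines into handled data and a pass-through table."""
    handled, passthrough = [], []
    for line in lines:
        parts = line.split()
        tag = parts[1] if len(parts) > 1 else line
        (handled if tag in SUPPORTED_TAGS else passthrough).append(line)
    return handled, {record_id: passthrough}

def export_record(record_id, handled, passthrough_table, deleted_ids):
    """Re-emit handled data plus any pass-through lines, unless deleted."""
    if record_id in deleted_ids:
        return []
    return handled + passthrough_table.get(record_id, [])

handled, table = import_record("@I3@", [
    "1 NAME John /Smith/",
    "1 BIRT",
    "1 _MILSVC Royal Navy",   # custom tag this program doesn't understand
])
out = export_record("@I3@", handled, table, deleted_ids=set())
print("1 _MILSVC Royal Navy" in out)  # True: unknown data survived the trip
```

A real implementation would also have to keep subordinate lines attached to their unsupported parent, but the principle is the same.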

3. arb (arb)
Australia flag
Joined: Wed, 25 Feb 2015
7 blog comments, 0 forum posts
Posted: Tue, 9 Jun 2015  Permalink

Louis, some programs _are_ islands. A program that imports a data file and prints a report or fancy poster-sized tree does not need to export any data. These programs don’t necessarily generate any data. Other programs might import data and do something more intensive, such as providing data mining tools, where any output will necessarily be a subset of the original data. Other programs might only ever create/originate data - a research planner & log tool for instance, might never import data. There is a wide spectrum of different types of (genealogy) software with widely varying data import and export needs. Requiring that a program retain and export data when that data, or even export functionality, is not required by the program is just putting unnecessary work on the devs.

Any standard should recommend that unknown/unexpected data be retained and passed through if there is an import/export cycle, but IMHO it should not be a hard requirement.

Should the standard require that an import followed by an export should not lose data? Probably yes if there is no further manipulation between the import and the export, but what if there is some data massaging or restructuring? Some data must be sacrosanct - source citations for example should never lose detail - but should _all_ data be treated as such?

Specifying explicit instructions on how data should be stored is a recipe for disaster. A data interchange standard should let the developer choose how information is stored internally and should only specify how data is to be expressed when being exported. One thing that turned me off FHISO (and the GenTech data model effort) early on was some of the discussions I saw concerning database schemas - a data interchange standard should have no opinion on the internal data representation. So long as the developer is capturing the relevant information, it should not matter whether I use ints, floats, strings or some weird binary representation. Similarly, if I chose to use a network database, or XML files, or some NoSQL bigtable variant, that should be my decision and the standard(s) should have nothing to say on the matter. Some of my data is not being stored in a traditional database, let alone in a tabular format - why are you trying to tell me how to store my data, when all we really care about is how I am going to present the data when called upon to export it?

We are all looking for a robust data INTERCHANGE standard. There is no need for a requirement that all apps implementing the standard should be part of a data pipeline, but if they are, then they should state up front what, if any, data manipulation will happen during an import/export cycle.

One final note: the standard should not guarantee that data will not be lost, apps that implement the standard should be making those guarantees. If an app deliberately strips certain data (say a hypothetical genea-processor that imports a file and exports a simplified version containing only birth, marriage, and death events, stripping out all other event types) then are you saying this app cannot implement your standard?

4. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Tue, 9 Jun 2015  Permalink

Well, yes. A read-only program obviously need not export anything. But if it does, and if it expects other programs following the standard to read its export, then its export must not lose any data.

Sorry, but I’m very adamant on this point. This new standard must be rigid and must require data retention between programs. Anything less and we’re no better off than GEDCOM.

5. arb (arb)
Australia flag
Joined: Wed, 25 Feb 2015
7 blog comments, 0 forum posts
Posted: Tue, 9 Jun 2015  Permalink

Okay, let me try another tack… What you propose is potentially dangerous, and here’s why:

* You create a data file using program A, A includes some non-standard attributes, which are exported as per the standard.
* The user then imports this file into program B, which does not support, nor even understand, A’s non-standard attributes. The user works with program B to do some editing. Because B does not understand the non-standard attributes, they are not displayed to the user, so the edits are done with this data being hidden - submarine data if you will.
* The user now exports the data from program B and imports it into program C. Program C does understand A’s additions and displays the data, which to the user has magically (re-)appeared and might now contradict some of the changes made in program B. This is not a good situation.

So instead you may mandate that all programs _must_ support all of the additions made by every other program - hardly a workable solution.
Another solution you might propose is that all programs _must_ display non-standard data in some pre-specified format. Sounds good, but Program B is not going to be able to edit this data, because it does not know the rules for such edits. The data will be kept and shunted from one app to another until it lands back with a program that understands these additions. Not really workable, or user-friendly, IMHO.

Another scenario to consider: Let’s say you maintain your family tree in program A, which includes non-supported additions. You receive a new document which you break down into several claims, and after some analysis you decide that the document refers to person X, so you add attributes and events to X citing this new document. Next you export the data and import it into program B, which imports your non-standard attributes and events. Further analysis of the source document shows that it doesn’t really relate to person X, but person Y instead. In program B you delete the attributes and events that B understands and attach them to the correct person, but because B does not understand the added attributes and/or events that A included, those additions are not correctly transferred. You then export the data for import back into program A - you now have an inconsistent data file, with some attributes/events relating to person Y attributed to person X.

So long as I tell you up front that unsupported attributes will not be imported, and so long as I _do_ correctly handle and export the core data from the standard, there should be no problem. Again, this is for a data INTERCHANGE format, not a “one true genealogical data model”. My apps may very well be storing more data than is ever exported in this new interchange format. Just like today with GEDCOM, it is perfectly legal and valid for me to store extra data in my app which is not supported by, nor exported to, GEDCOM. We use many different tools for various reasons, and not all tools are going to handle, understand, or import/export all of the data from other programs.

Any future standard should get the basics right. Agree on the key containers. Agree on the base elements of source citations. Agree on a thesaurus for different terminology so different terms can be correctly translated. But allow developers to make their own decisions on what data they import and/or export.

Now, having said all that, I would support a movement towards a certification process for software which implements the standard. Software could be certified as meeting various levels of support for the standard - Imports core data; exports core data; imports extensions with read-only viewing; imports extensions and exports the data unmodified; imports extensions, allows some edits, exports in a compatible format; etc.

6. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Wed, 10 Jun 2015  Permalink

Arb:

Excellent! Now you’re really starting to think about this.

The examples you give of data pass-through when an intermediate program doesn’t handle something are solvable. As I said in my post, the concepts “must be sufficiently independent of each other” to help prevent such problems. What to do about it should also be defined in the standard.

With regards to extensions to the standard, I say a new standard should not allow them. If one program has an extension, a second won’t understand the meaning of it. All important concepts must have constructs in the new standard, and less important substructures can be customized with something like a TYPE attribute in GEDCOM, with any value allowed as the type. See point 7 in my Nine Necessities in a GEDCOM Replacement paper.

I agree with the certification process. See point 8 in the same paper. But I think only a single read/write certification is required. Read-only programs need not be certified.

Louis

7. Will Chapman (qbuster)
United Kingdom flag
Joined: Tue, 19 Jan 2010
1 blog comment, 6 forum posts
Posted: Sat, 13 Jun 2015  Permalink

I agree with Louis. I do accept what arb is concerned about but, if I understand Louis correctly, I think you are missing the main point: if GEDCOM reader X doesn’t recognise a chunk of data, it should ignore it - pass it on by and leave it intact.

For me this is a no-brainer. If I can’t trust a genealogical application to give me back the data I submitted in a format that my original software can read, I won’t use it again and I will name-and-shame that app to my friends and colleagues. The size of the industry is substantial - millions of users each typically spending hundreds of dollars and thousands of hours each year to be part of it. It is time the industry joined forces to solve the problem once and for all.

I am a long term user of Legacy, Geni/Myheritage and FamilySearch and I have been impressed with how they have found a way to share data and edits. I suspect that FamilySearch are behind this initiative but whoever it is has to be congratulated. (From a base tree of around 7,000 records, the enhanced features made over the last 12 months or so have enabled me to discover over 100,000 blood relatives - going back 40+ generations.) Having said that, whilst I can export the ‘bloodline’ or what they call the ‘forest’ (all connected records), the resulting gedcoms aren’t what I would call flawless - yes, I accept that it might not be high on their development agenda - and this is why I am keen to see Behold progress and realise its potential; I desperately need a way to edit my growing portfolio of gedcoms in such a way that any amendments can be saved to a gedcom format that can safely be used in my toolkit of gedcom apps.

Regards

Will

8. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Sat, 13 Jun 2015  Permalink

Will,

I presume you don’t currently do extensive documentation of your sources, because I don’t believe Legacy, Geni/Myheritage and FamilySearch do a very good job yet of transferring sources between each other or into GEDCOM. GEDCOM is technically okay for sources, but most programs don’t read them in well enough to make it worthwhile to have GEDCOM as the storage mechanism for them.

So the alternative is either a new standard, if FHISO ever gets off the starting block, or for developers (maybe me) to develop that AncestorSync-like program that will directly read and write with other programs’ databases and online sites.

Louis

9. Enno Borgsteede (ennoborg)
Netherlands flag
Joined: Wed, 9 May 2012
15 blog comments, 0 forum posts
Posted: Tue, 16 Jun 2015  Permalink

Louis, I’m with arb, because I don’t think the issues are solvable, except when unprocessed objects are completely isolated from the ones that are edited.

You mention the extreme example of a program that doesn’t process sources, and I think that’s a perfect illustration. When you have a tree where, for example, an estimated birth date is taken from a census, and someone edits that in a program that doesn’t do sources - setting an exact date from a birth certificate, or entering a birth place where the census only mentioned a state - the new birth data will not reflect the data found in the census source. And in that case, letting the original source pass through is plain wrong.

10. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Tue, 16 Jun 2015  Permalink

Enno,

I agree with you. When a data item is edited, the attached non-handled data may not be relevant or may even be wrong.

But that doesn’t mean the issue is unsolvable. One possible solution might be to require that the former pre-edited data be transmitted with the non-handled data as a separate event and marked somehow as “superseded”. Then a program that does handle the feature can display the info and the user can choose whether to reincorporate the data or to delete it.
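As a rough illustration of that idea (purely my own sketch; the field names and record shapes are invented, not from any standard), a program that doesn't handle sources could keep the pre-edit event and flag it:

```python
# When an event with unhandled attachments (here, a source) is edited by
# a program that doesn't do sources, the pre-edit event is kept and
# flagged so a source-aware program downstream can let the user
# reconcile or discard it.

def edit_event(event, new_value, handles_sources):
    """Return the list of events to export after an edit."""
    edited = {**event, "value": new_value}
    if handles_sources or "source" not in event:
        return [edited]           # source updated too (or none attached)
    edited.pop("source", None)    # edit is no longer backed by the old source
    superseded = {**event, "superseded": True}
    return [edited, superseded]   # old data travels on, clearly marked

events = edit_event(
    {"tag": "BIRT", "value": "ABT 1850", "source": "@S51@ (1851 census)"},
    new_value="12 MAR 1849",      # exact date from a birth certificate
    handles_sources=False,
)
print(events[1]["superseded"])  # True: the census-based date is preserved
```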

I don’t see any of the issues of transferring all the data as being unsolvable. But it will take some thinking, and depend on how the final standard is set up in order to make it a simple process for developers.

Louis

11. Enno Borgsteede (ennoborg)
Netherlands flag
Joined: Wed, 9 May 2012
15 blog comments, 0 forum posts
Posted: Thu, 18 Jun 2015  Permalink

Louis,

Interesting idea, which suggests that you think about a sort of versioning in the data layer. Is that what you mean? Preserving the data as it was before a change looks like versioning to me, because there is a chance that the process needs to be repeated when another application not supporting the data makes another change.

Taken to the extreme, it might result in a stack of unprocessed and unverified changes, which must then be handled by a super standard application. And that would look a bit like the current mess on FamilySearch. :-)

Enno

12. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Sat, 20 Jun 2015  Permalink

Enno,

Well, it’s not really like versioning. The previous version of the data needs to be passed ONLY when (1) the program does not support the feature in question, and (2) the data is subordinate to (i.e. dependent on) other data that the user has changed.

So there would never be two versions of any data. There would only be the latest version of any data, whether a program supported it or not.

Louis

13. Darren Price (cleaverkin)
United States flag
Joined: Tue, 14 Jul 2015
2 blog comments, 0 forum posts
Posted: Tue, 14 Jul 2015  Permalink

I propose a conformance requirement possibly more strict than what Louis describes. I think that any software tool that performs both import and export should pass the following test: (1) Import any standard-conformant data file; (2) make one edit to the file; (3) undo the previous edit; (4) export the data. The resulting exported file should EXACTLY match the input file. As a software engineer I can think of a lot of reasons why this would be a difficult test to pass, either with current GEDCOM or a newer standard. But it seems to me that it shouldn’t be.
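That test could be expressed as a small harness like this (my sketch only; the tool interface and class names are invented for illustration, and a real harness would run against actual software):

```python
# Darren's round-trip test: a conformant tool, given any standard file,
# should survive import -> add one item -> delete it -> export with
# output identical to the input.

def round_trip_test(tool, input_file_lines):
    data = tool.import_(input_file_lines)
    item = tool.add_item(data)      # (2) add one data item
    tool.delete_item(data, item)    # (3) delete that same item
    return tool.export(data) == input_file_lines  # (4) exact match?

class PassThroughTool:
    """Toy tool that keeps everything it reads, so it passes the test."""
    def import_(self, lines): return list(lines)
    def add_item(self, data): data.append("1 NOTE temp"); return "1 NOTE temp"
    def delete_item(self, data, item): data.remove(item)
    def export(self, data): return list(data)

sample = ["0 @I1@ INDI", "1 NAME Jane /Doe/", "1 _CUSTOM kept intact"]
print(round_trip_test(PassThroughTool(), sample))  # True
```

A tool that silently dropped the `_CUSTOM` line on import would fail step (4), which is exactly what the test is meant to catch.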

14. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Wed, 15 Jul 2015  Permalink

Darren:

I suppose few programs would have any trouble adding one event and then deleting it, or deleting one event and then re-adding it. Even adding or deleting an individual, and then deleting or re-adding him, should be straightforward as long as all the parent/child/spouse links are re-added as well. In the latter case, the program should be allowed to use a different ID number for the individual, because it should not be expected to remember the ID number that was deleted.

This is a good test to see if a program is compliant to a standard, but I don’t know if it does much to ensure data pass-through.

So I’m not sure what you’re trying to test. You’ll have to be more specific.

Louis

15. Darren Price (cleaverkin)
United States flag
Joined: Tue, 14 Jul 2015
2 blog comments, 0 forum posts
Posted: Wed, 15 Jul 2015  Permalink

Rather than “make an edit” and “undo”, I probably should have said “add one data item” then “delete that item”.

It does at least ensure that data not understood by the program is retained, and exported with exactly the same structure as was imported.

Darren Price

16. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
226 blog comments, 226 forum posts
Posted: Wed, 15 Jul 2015  Permalink

Darren,

Oh, I see now. You just want to make sure that the program actually loads the data into its database. The “make an edit” and “undo” is to ensure that.

I don’t know if that’s necessary. I can’t really see how a program could load in an input file without loading it into its internal database. To export, it will always be exporting from the internal database, never directly from the input file.

Louis

 

The Following 1 Site Has Linked Here

  1. Best of the Genea-Blogs 7 to 13 June 2015 | Genea-Musings | Randy Seaver : Sun, 14 Jun 2015
    "Louis wants a pass-through requirement for genealogy data communication. sounds right to me."
