
Louis Kessler’s Behold Blog

Standardizing Sources and Citation Templates - 4 days, 17 hrs ago

Note: This article has been submitted to FHISO’s Call For Papers as CFPS 114.

Thank you to the people who provided comments and feedback that helped me finalize this paper: Tamura Jones, Enno Borgsteede, Tony Proctor, Randy Seaver, Richard Smith, and Tom Wetmore.

 

Contents

Abstract

Introduction

Definition of Citation

1 GEDCOM’s Current Source Definition

2 The Underutilization of GEDCOM’s Sourcing

3 Zoteroing In On A Solution

4 Don’t Mix Up Data with Formatting

5 Suggesting a Solution

6 What about Repositories?

7 The Case for Citation Templates

8 Separating Sources from Conclusions

Summary and Recommendation

Works Cited

This paper was originally published August 27, 2014 on Louis Kessler’s Behold Blog.[1]

 

Abstract

A new standard is being explored to replace GEDCOM to provide an improved means of data communication between genealogy software, online family trees, and repositories of research information. One of the main concerns about the current GEDCOM standard is that source documentation and citations do not currently transfer well between systems. Addressing this issue is a requirement for the new standard. This paper will describe a simple method to define sources for the new standard and explain why citation templates should not be part of the standard but be defined separately.

 

Introduction

One of the most talked about issues regarding the GEDCOM transfer of data between genealogy software is how poorly source information moves between programs. (Thorud, Composer, & Hatchett, 2012)

During the past 15 years of developing my genealogy software Behold,[2] I have had to travel deep into the guts of the GEDCOM standard and discover how it works and interpret its workings.

Part of this task was to logically present to the user all the source information contained in GEDCOM files. I determined the logical structure of the source data, and I found both the capabilities in the GEDCOM definition that enable complete transfer of this data between programs and the limitations that restrict it.

I discovered the ways some genealogy software exported their sources incorrectly to GEDCOM and how other programs attempted to export their formatting templates with their source data to a GEDCOM file. I read many articles about how few programs could properly read the sources that another program had exported.[3]

I also have thought about the ways genealogists want genealogy software to improve. Most genealogy programs available today are conclusion-based. You enter your facts and attach the sources to them. This is inefficient and de-emphasizes the importance of documenting your sources. An alternative is source-based genealogy (Kessler, Inventing Source-based Data Entry, 2013), where the source data can first be entered, and then the conclusions and facts that arise from the sources can be assembled. A future GEDCOM standard needs to allow both a conclusion view and a source-based view of the data.

A key aspect of this, missing from GEDCOM, is that conclusions and sources must be separated. Source information must be just the facts, and contain no subjective information. (Kessler, Nine Necessities in a GEDCOM Replacement, 2013) Conclusion data can contain the subjective information and should point to the sources that provide the conclusions.

This separation would do two things. It would delineate the information that should and should not be stored with the source information in GEDCOM. It would also create a source information standard that could potentially be used by all repositories to produce searchable files that index all their source material. Genealogists would be able to search and extract source material relevant to their research and import it directly into their family information.

All these considerations have gone into this article, to suggest how sources should be handled in a new genealogy data communication standard.

 

Definition of Citation

There are two interpretations possible for what a citation is. A citation can be thought of as the source data itself, or it can be thought of as the formatted representation of some source data. The former is the data. The latter is just formatting rules for the data.

Throughout this article, my interpretation is the latter, and when I’m referring to “citation”, I am meaning the formalized representation of the source data (or source details) according to some methodology that has rules on how to format different types of sources.

Thanks go to Tom Wetmore and Richard Smith for documenting the confusion with regards to this. (Smith, 2014; Wetmore, 2014)

 

1 GEDCOM’s Current Source Definition

The existing GEDCOM standard has provided for documentation of sources. As the standard was revised over the years, FamilySearch experimented with various forms and finally came up with what was used in the GEDCOM 5.5 standard and also in the GEDCOM 5.5.1 draft. (Jones, FamilySearch GEDCOM Specifications, 2014)

What is not universally realized is that GEDCOM already contains the structures necessary to record the information about almost any type of source.

GEDCOM divides the source information into four structures. The hierarchy is this:

A. Some conclusion has its source described with a SOURCE_CITATION.

B. The SOURCE_CITATION refers to a SOURCE_RECORD.

C. The SOURCE_RECORD includes a SOURCE_REPOSITORY_CITATION.

D. The SOURCE_REPOSITORY_CITATION refers to a REPOSITORY_RECORD.

Let’s take a look at them one by one.

A. The SOURCE_CITATION is described in GEDCOM as:

“The <<SOURCE_CITATION>> structure is placed subordinate to the fact being cited. It is generally best if the source citation contains only information specific to the fact being cited and then points to the more general description of the source, defined in a SOURCE_RECORD. This reduces redundancy, provides a way of controlling the GEDCOM record size, and more closely represents the normalized data model.”

The SOURCE_CITATION is somewhat misnamed. It simply provides the specific location within the source where the reference can be found along with details about the information. It does not attempt to create a citation (i.e. some formalized bibliographic description of the source cited) but just provides the data that is necessary to create the citation. It would more aptly be named source_detail, or source_reference or even “evidence” since it details the reference to the source of some evidence.

Its GEDCOM definition is:

SOURCE_CITATION:=
    n SOUR @<XREF:SOUR>@ {1:1} /* pointer to source record */
        +1 PAGE <WHERE_WITHIN_SOURCE> {0:1}
        +1 EVEN <EVENT_TYPE_CITED_FROM> {0:1}
            +2 ROLE <ROLE_IN_EVENT> {0:1}
        +1 DATA {0:1}
            +2 DATE <ENTRY_RECORDING_DATE> {0:1}
            +2 TEXT <TEXT_FROM_SOURCE> {0:M}
                +3 [CONC|CONT] <TEXT_FROM_SOURCE> {0:M}
        +1 <<MULTIMEDIA_LINK>> {0:M}
        +1 <<NOTE_STRUCTURE>> {0:M}
        +1 QUAY <CERTAINTY_ASSESSMENT> {0:1}

B. The SOURCE_RECORD is described in GEDCOM as:

“The SOURCE_RECORD structure was simplified into five basic sections: data or classification, author, title, publication facts, and repository. The data or classification section contains facts about the data represented by this source and is used to analyze the collection of sources that the researcher used. The author, title, publication facts, and repository sections provide free-form text blocks that inform subsequent researchers how to access the source data that the original researcher used.”

Its GEDCOM definition is:

SOURCE_RECORD:=
    n @<XREF:SOUR>@ SOUR {1:1}
        +1 DATA {0:1}
            +2 EVEN <EVENT_RECORDED> {0:M}
                +3 DATE <DATE_PERIOD> {0:1}
                +3 PLAC <SOURCE_JURISDICTION_PLACE> {0:1}
            +2 AGNC <RESPONSIBLE_AGENCY> {0:1}
            +2 <<NOTE_STRUCTURE>> {0:M}
        +1 AUTH <SOURCE_ORIGINATOR> {0:1}
            +2 [CONC|CONT] <SOURCE_ORIGINATOR> {0:M}
        +1 TITL <SOURCE_DESCRIPTIVE_TITLE> {0:1}
            +2 [CONC|CONT] <SOURCE_DESCRIPTIVE_TITLE> {0:M}
        +1 ABBR <SOURCE_FILED_BY_ENTRY> {0:1}
        +1 PUBL <SOURCE_PUBLICATION_FACTS> {0:1}
            +2 [CONC|CONT] <SOURCE_PUBLICATION_FACTS> {0:M}
        +1 TEXT <TEXT_FROM_SOURCE> {0:1}
            +2 [CONC|CONT] <TEXT_FROM_SOURCE> {0:M}
        +1 <<SOURCE_REPOSITORY_CITATION>> {0:1} /* substructure */ 
        +1 <<MULTIMEDIA_LINK>> {0:M}
        +1 <<NOTE_STRUCTURE>> {0:M}
        +1 REFN <USER_REFERENCE_NUMBER> {0:M}
            +2 TYPE <USER_REFERENCE_TYPE> {0:1}
        +1 RIN <AUTOMATED_RECORD_ID> {0:1}
        +1 <<CHANGE_DATE>> {0:1}

This source record looks quite comprehensive unto itself. You can clearly see where the Author, Title, Publication and Agency are intended to go. There’s plenty more included with that.

Note the SOURCE_REPOSITORY_CITATION is included as a substructure.

C. The SOURCE_REPOSITORY_CITATION is described in GEDCOM as:

“This structure is used within a source record to point to the name and address record of the holder of the source document.”

Its GEDCOM definition is:

SOURCE_REPOSITORY_CITATION:=
    n REPO @<XREF:REPO>@ {1:1} /* pointer to repository record */
        +1 <<NOTE_STRUCTURE>> {0:M}
        +1 CALN <SOURCE_CALL_NUMBER> {0:M}
            +2 MEDI <SOURCE_MEDIA_TYPE> {0:1}

D. The REPOSITORY_RECORD is described in GEDCOM as:

“Formal and informal repository name and addresses are stored in the REPOSITORY_RECORD.”

Its GEDCOM definition is:

REPOSITORY_RECORD:=
    n @<XREF:REPO>@ REPO {1:1}
        +1 NAME <NAME_OF_REPOSITORY> {0:1}
        +1 <<ADDRESS_STRUCTURE>> {0:1}
        +1 <<NOTE_STRUCTURE>> {0:M}
        +1 REFN <USER_REFERENCE_NUMBER> {0:M}
            +2 TYPE <USER_REFERENCE_TYPE> {0:1}
        +1 RIN <AUTOMATED_RECORD_ID> {0:1}
        +1 <<CHANGE_DATE>> {0:1}

Together, these four structures provide places for all the data needed to document one’s sources.
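To make the A to D chain concrete, here is a minimal Python sketch that parses a few GEDCOM lines and follows the pointers from a SOURCE_CITATION to its REPOSITORY_RECORD. The sample record, the helper names (parse, child, val, trace) and the simplified line pattern are my own assumptions for illustration. A real reader would also need to handle CONC/CONT, character sets and malformed lines.

```python
import re

# Assumed line shape: LEVEL [@XREF@] TAG [VALUE]. Real files have more edge cases.
LINE_RE = re.compile(r"^(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")

# Made-up sample data, not taken from any real file.
SAMPLE = """\
0 @I1@ INDI
1 BIRT
2 DATE 1 JAN 1900
2 SOUR @S1@
3 PAGE p. 42
3 QUAY 2
0 @S1@ SOUR
1 TITL Parish register of Example Town
1 REPO @R1@
2 CALN 929.3 EX
0 @R1@ REPO
1 NAME Example Town Archive
"""

def parse(text):
    """Return {xref: record}; each node is a dict with tag, value, children."""
    records, stack = {}, []
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip lines this simplified pattern cannot read
        level, xref, tag = int(m.group(1)), m.group(2), m.group(3)
        node = {"tag": tag, "value": m.group(4) or "", "children": []}
        del stack[level:]          # drop nodes that are not ancestors of this line
        if level == 0:
            if xref:
                records[xref] = node
        elif stack:
            stack[-1]["children"].append(node)
        stack.append(node)
    return records

def child(node, tag):
    """First child with the given tag, or None."""
    return next((c for c in node["children"] if c["tag"] == tag), None)

def val(node, tag):
    """Value of the first child with the given tag, or None."""
    c = child(node, tag)
    return c["value"] if c else None

def trace(records, citation):
    """Follow A -> B -> C -> D starting from one SOURCE_CITATION node."""
    source = records[citation["value"]]                       # B: SOURCE_RECORD
    repo_cit = child(source, "REPO")                          # C: SOURCE_REPOSITORY_CITATION
    repo = records[repo_cit["value"]] if repo_cit else None   # D: REPOSITORY_RECORD
    return {
        "page": val(citation, "PAGE"),
        "title": val(source, "TITL"),
        "call_number": val(repo_cit, "CALN") if repo_cit else None,
        "repository": val(repo, "NAME") if repo else None,
    }

records = parse(SAMPLE)
birth = child(records["@I1@"], "BIRT")
citation = child(birth, "SOUR")       # A: SOURCE_CITATION under the fact
print(trace(records, citation))
```

Even in this small form, the reader has to chase two pointers and know which tag lives at which level before it can show one complete source.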

 

2 The Underutilization of GEDCOM’s Sourcing

GEDCOM’s sourcing certainly is comprehensive. But it is also complicated. There are numerous tags and structures and substructures and linkages. A developer who takes the time to study the structures can record almost any type of source description with it.

However, it is not obvious as to where a specific piece of source data should go. The documentation is less than clear, and there are only trivial examples that don’t help the developer properly understand. It is possible to store any source, but there is no unambiguous, unique way.

Some developers took advantage of GEDCOM’s sourcing, but many developers decided not to use it, or use only parts of it.

Then along came Elizabeth Shown Mills and her book Evidence Explained (Mills, 2007). This popularized the formalization of citation writing for genealogy and emphasized the use of templates to develop the sentence structure and formatting for a large number of different types of sources. Many developers adapted her templates and included them in their software to make it easier for genealogists to create formal citations for their sources.

The last version of GEDCOM was created before the concept of citation templates took hold, and there was no obvious way to export the templates or the citations into GEDCOM. As a result, many developers who included citation templates in their product didn’t try to export them. Some, most notably RootsMagic, decided it was important to export template information, and created their own non-standard GEDCOM tags so they could export this information. They could then reimport their own exported templates, but no other program could.

The overall result is that few programs export their sources to GEDCOM in a manner that another program can properly read. This is the problem that genealogists dearly want fixed.

So GEDCOM is capable of storing almost any source, but it is complicated and unclear with nothing but trivial examples to help the user. There’s got to be a better, simpler way.

 

3 Zoteroing In On a Solution

As part of the BetterGEDCOM initiative, GeneJ Composer, the then leader of the BG initiative, indicated she used and was very impressed with the software called Zotero (www.zotero.org). Zotero is a free tool that helps to collect, organize, cite and share one’s research sources. It is available for Mac, Windows and Linux.

During the discussion on how a Better GEDCOM would be able to record sources, GeneJ developed a list of about 100 elements that were used for all source types in Zotero. These included items such as: abstractNote, accessDate, applicationNumber, archive, archiveLocation, artworkMedium, artworkSize, assignee, audioFileType, audioRecordingFormat, author, billNumber, blogTitle, bookTitle, callNumber, etc. (testuser42, 2011)

GeneJ provided an example of a Zotero source for a Blog Post, as shown below (Composer, 2011). This particular type of source (blogPost) has 15 elements, and 10 of those elements have values.

[Image: Zotero item of type blogPost, listing its elements and their values]

What is very important to note here is that this is the data needed to describe a source of the type: blogPost in Zotero. This data is not formatted. It is the raw data and using this data, any method of formatting using any style can be used to present this data as a citation.

Zotero lists dozens of different types of sources (which Zotero calls item types), e.g. artwork, audioRecording, bill, blogPost, book, etc. Each source type has its own set of relevant elements that are needed. These come from the master list of elements. The specific elements needed depend on the type of source.

The Zotero software thus provides an example of a simple and workable source definition structure that would work for genealogy source data.
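The idea of a master list of elements, with each source type drawing its own subset from that list, can be sketched in a few lines of Python. The element and type names below are a heavily abbreviated, hypothetical selection that I chose for illustration. They are not Zotero’s actual lists, and check_source is my own helper name.

```python
# Hypothetical, abbreviated master list of elements.
MASTER_ELEMENTS = {
    "title", "author", "blogTitle", "date", "url",
    "accessDate", "publisher", "place", "callNumber",
}

# Each source type names the elements that are relevant to it.
SOURCE_TYPES = {
    "blogPost": {"title", "author", "blogTitle", "date", "url", "accessDate"},
    "book": {"title", "author", "publisher", "place", "date", "callNumber"},
}

def check_source(source_type, elements):
    """Return a list of problems; an empty list means the source fits its type."""
    allowed = SOURCE_TYPES.get(source_type)
    if allowed is None:
        return [f"unknown source type: {source_type}"]
    problems = []
    for name in elements:
        if name not in MASTER_ELEMENTS:
            problems.append(f"unknown element: {name}")
        elif name not in allowed:
            problems.append(f"element {name} is not used by type {source_type}")
    return problems

print(check_source("blogPost", {"title": "Have no fear", "url": "http://example.org/post"}))
print(check_source("blogPost", {"callNumber": "929.3 EX"}))
```

The source itself stays a flat set of name and value pairs. All the structure lives in the two lookup tables, which is what would make such a definition easy for a standard to publish and for a developer to implement.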

I am not suggesting that Zotero’s source types and elements be the master list for the new standard. I am simply using Zotero as an illustrative example of how these source types and elements can be set up. When the standard is developed, the list should attempt to contain every item that will be needed to document sources. Don’t worry. We aren’t talking thousands. We’re talking a few hundred.

As another more concrete example, instead of Zotero’s definitions, we can use the actual Evidence Explained definitions. Tamura Jones pointed out (Jones, Genealogy Citation Standard, 2011) that John H. Yates released free open source EE-style templates. (Yates, 2010) If you look at those templates, you will see 170 categories (what I call source types) and 592 Fields (what I call elements). Many are multiple versions of the same field, maybe in short and long form, first, middle and last name, or parts of an address.

The number of elements can also be reduced by changing “blogTitle”, “bookTitle” and “articleTitle” all to “Title” and using the source type to put them in context. It will be up to the standards committee to sort those out and come up with the best set of values.

 

4 Don’t Mix Up Data with Formatting

A new GEDCOM standard should transfer only the genealogy data. That genealogy data includes all the source information necessary to accurately describe a source.

The structured formalized notation for representing the data is not data. This is just a set of instructions telling you one specific way of displaying the data. These are nothing more than formatting rules.

A new GEDCOM standard should not transfer formatting information. Formatting should be left up to the receiving program. The receiving program may have its own preferred way of formatting sources. If it uses Evidence Explained, then so be it. It may interpret EE differently than another program, and it should be allowed to do so and to display the result its own way.

A program may give you many alternative methods of formatting, e.g. Richard Lackey[4] or even bibliographic methods such as APA[5] or Chicago[6]. Again, it should be up to the receiving program, and not up to the sending program to force its formatting upon another.

Even within one method, there may be many different ways to format a single source. Some examples include formatting for a bibliography, for a footnote, for an endnote or for an ibid.

This will be a controversial opinion, but a line must be drawn. Information should be the only thing transferred. One program should not tell another program how it should format and display that information. Structuring and formatting information should not be transferred.

The beauty in the variety of genealogy software is that they display your data in different ways. Some people like it one way. Some people like it another way. Forcing display of data in certain ways only restricts the choice.

 

5 Suggesting a Solution

The goal of a new Standard is that data transfers seamlessly between programs. For that to be done, all developers must adhere to the standard. The way to maximize the likelihood that developers can and will adhere is by making the standard simple and unambiguous.

Source data lends itself to a simple system. All that needs to be done is:

1. Identify the most common source types that genealogists will encounter, and make them part of the standard.

2. Identify all the possible source elements and make them keys in the new standard.

3. Discourage, but allow, programmers to define their own source types and source element types.

Using a GEDCOM-like definition, this structure may be as simple as:

SOURCE_RECORD:=
    n @<XREF:SOUR>@ SOUR {1:1}
        +1 TYPE <SOURCE_TYPE> {0:1}
        +1 ELEM <SOURCE_ELEMENT_AND_VALUE> {0:M}

SOURCE_TYPE:=
    [artwork | audioRecording | bill | blogPost | book | … | _<user defined>]

SOURCE_ELEMENT_AND_VALUE:=
    <SOURCE_ELEMENT_TYPE>: <TEXT>

SOURCE_ELEMENT_TYPE:=
    [abstractNote | accessDate | applicationNumber | archive | archiveLocation
     | artworkMedium | artworkSize | assignee | audioFileType
     | audioRecordingFormat | author | billNumber | blogTitle | bookTitle
     | callNumber | … | _<user defined>]

Using the blog post example, data transfer in a GEDCOM-like format would look like this:

0 @S123@ SOUR
1 TYPE BlogPost
1 ELEM Title: They Came Before: Technophoo. Have no fear …
1 ELEM Author: Genej, (first)
1 ELEM BlogTitle: They Came Before
1 ELEM Date: 7 Sep 2011
1 ELEM URL: http://theycamebefore.blogspot.com/2011/09/technophoo-have-no-fear.html
1 ELEM Accessed: 22 Dec 2011 16:51:57
1 ELEM ShortTitle: They Came Before
1 ELEM DateAdded: 22 Dec 2011 16:51:57
1 ELEM Modified: 22 Dec 2011 16:51:57

And we’re done. All data will export and import easily and will transfer properly.

Now obviously there may be some minor refinements to this, such as requiring that certain elements be certain data types. Most will be text, but a few might be dates or numbers.

Also, there will be a desire to allow a user defined source type or source element type. A developer may use source types or elements that are not in the standard. These will need to be identified, possibly with a leading underscore as suggested above, to emphasise to the developer that these are fields that other programs will not understand. For example:

0 @S124@ SOUR
1 TYPE _PostItNote
1 ELEM _Handwriting: No one will be able to interpret this.

Use of user defined identifiers should be discouraged. If a program needs to use one, there should be a venue through which the developer could apply to get a new identifier added to the next version of the standard.
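Reading the proposed structure takes very little code. The following Python sketch is only an illustration under my own assumptions: the function name parse_sources and the returned dictionary layout are hypothetical, ELEM lines are accepted at any level below the record, and each element is split at the first colon followed by a space, because a value such as a blog title may itself contain a colon.

```python
SAMPLE = """\
0 @S123@ SOUR
1 TYPE BlogPost
1 ELEM Title: They Came Before: Technophoo. Have no fear …
1 ELEM Author: Genej, (first)
1 ELEM _Handwriting: No one will be able to interpret this.
"""

def parse_sources(text):
    """Return {xref: {"type", "elements", "user_defined"}} for each SOUR record."""
    sources, current = {}, None
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, second = parts[0], parts[1]
        rest = parts[2] if len(parts) > 2 else ""
        if level == "0":
            current = None                     # a new record ends the previous one
            if rest == "SOUR":
                current = {"type": None, "elements": [], "user_defined": []}
                sources[second] = current
        elif current is None:
            continue                           # line belongs to some other record
        elif second == "TYPE":
            current["type"] = rest
            if rest.startswith("_"):
                current["user_defined"].append(rest)
        elif second == "ELEM":
            key, sep, value = rest.partition(": ")   # split at the FIRST ": " only
            if not sep:
                continue                       # not in "Name: value" form
            current["elements"].append((key, value))  # ELEM may repeat ({0:M})
            if key.startswith("_"):
                current["user_defined"].append(key)
    return sources

source = parse_sources(SAMPLE)["@S123@"]
for key, value in source["elements"]:
    print(f"{key} = {value}")
print("Not in the standard:", source["user_defined"])
```

Keeping the elements as a list of pairs rather than a dictionary lets an element repeat, for example when a source has several authors. The leading underscore makes the non-standard identifiers trivial to flag on import.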

 

6 What about Repositories?

Repository information can be stored as source elements, as suggested above. Or it can be given its own record structure, and the source can link to the repository as GEDCOM does today.

The advantage in keeping repository information separate is less repetition of information between sources, less chance of conflicting information about one repository being included in different sources, and better backwards compatibility with GEDCOM today.

Whether to keep sources and repositories together or separate is up to FHISO to decide. Should they be kept separate, the repository still can be set up in a similar manner to sources, with repository types and repository elements. Doing so would allow easier citation template development, as will now be described.

 

7 The Case for Citation Templates

There is still a place for citation templates. And yes, it would be nice if these are standardized. This would help programmers so that they can implement the various citation styles in a consistent manner and can display your sources according to your favourite style.

There have been previous attempts to standardize citations. In 2011, Real-Time Collaboration (the creator of AncestorSync) started an initiative called SourceTemplates (Jones, The SourceTemplates Initiative, 2011). They had the cooperation of BetterGEDCOM. However, their citation model was essentially the same as GEDCOM’s source structures with a DataField record for defining the source elements. So it had the same complexity as GEDCOM’s sourcing and the initiative never got off the ground.

There’s a much better way to do this. By using the source types in combination with the source elements, it would be possible to develop templates for each source type, for every bibliographic style.

Here are two template examples for a source type of BlogPost:

Using the MLA style[7], a template for a BlogPost might be:
    $Author. “$Title.” $BlogTitle. $Publisher, $Modified. Web. $Accessed.
Inserting our sample data (which has no Publisher value, so that element is dropped), this would display as:
    Genej, (first). “They Came Before: Technophoo. Have no fear …” They Came
             Before. 22 December 2011. Web. 22 December 2011.

Whereas using Evidence Explained, a template for a BlogPost in a footnote might be:
    $Author, “$Title,” $BlogTitle, $Modified ($URL : accessed $Accessed)
And with our sample data, this would display as:
    Genej, (first), “They Came Before: Technophoo. Have no fear …,” They Came
            Before, 22 December 2011 (http://theycamebefore.blogspot.com/2011/09
            /technophoo-have-no-fear.html : accessed 22 December 2011)

So you can see that the development and use of citation templates is not a difficult task once all the source types and source elements are defined. If one standard set of citation templates was developed for every combination of source type, citation methodology and entry type, with translations into different languages, then genealogy software developers would have a great resource they could use. Their programs could use the templates to format citations in a standardized manner. They can add their own unique preferred formats. And they can allow users to add their own templates.
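As a rough illustration of how little machinery a receiving program needs, here is a sketch that uses Python’s standard string.Template class, whose $Name placeholders happen to match the notation used above. The table keyed by source type and style, the function name cite and the simplified templates (no Publisher element) are my own assumptions. The element names follow the sample blog post record, and the dates are assumed to have already been formatted by the program.

```python
from string import Template

# Hypothetical template table: one entry per (source type, citation style).
TEMPLATES = {
    ("BlogPost", "MLA"): Template(
        "$Author. “$Title.” $BlogTitle. $Modified. Web. $Accessed."
    ),
    ("BlogPost", "EE-footnote"): Template(
        "$Author, “$Title,” $BlogTitle, $Modified ($URL : accessed $Accessed)"
    ),
}

# Source data only: no formatting is stored with it.
BLOG_POST = {
    "Author": "Genej, (first)",
    "Title": "They Came Before: Technophoo. Have no fear …",
    "BlogTitle": "They Came Before",
    "Modified": "22 December 2011",
    "URL": "http://theycamebefore.blogspot.com/2011/09/technophoo-have-no-fear.html",
    "Accessed": "22 December 2011",
}

def cite(source_type, style, elements):
    """Format one source with the template for its type and the chosen style."""
    # substitute() raises KeyError if the template needs an element that is missing.
    return TEMPLATES[(source_type, style)].substitute(elements)

print(cite("BlogPost", "MLA", BLOG_POST))
print(cite("BlogPost", "EE-footnote", BLOG_POST))
```

The same dictionary of source elements feeds both styles. Adding a style means adding an entry to the table, and nothing about the transferred data has to change.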

Thus, the development of an extensive set of citation templates would add consistency in how a particular style is displayed in different programs. It would save programmers the hassle of figuring out the details of each style for themselves.

What we have done is completely separated the definition of the source data from the definition of the formatting of that data provided by the citation templates. They are now two separate tasks.

Attempting to include these templates initially into the new GEDCOM standard would be a mistake. The GEDCOM standard is designed to transfer genealogical data correctly. This should be FHISO’s main goal as they embark on their endeavour to create this new standard. They should not be distracted by a desire to standardize the formatting of the data.

So FHISO should concentrate on defining the source information and leave the citations/templates for later or for someone else to do.

 

8 Separating Sources from Conclusions

Source information must be “just the facts”. There must be no assumptions or conclusions or assessments of the source in the source information. (Kessler, Separation of Sources from Conclusions, 2011)

All assumptions and conclusions and assessments of the source must be placed with the source reference, not with the source. So the reference would be:

SOURCE_REFERENCE:=
    n SOUR @<XREF:SOUR>@ {1:1} /* pointer to source record */
        +1 <<NOTE_STRUCTURE>> {0:M}   /* assumptions and conclusions */
        +1 QUAY <CERTAINTY_ASSESSMENT> {0:1}

This is very important. The source structure must be a complete independent entity that can be used simply to identify the material where a conclusion came from.
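One way to picture this separation in a program’s own data model is sketched below in Python. The class names Source and SourceReference and their fields are hypothetical choices of mine. The point is only that the source holds facts and is read-only, while every note and quality assessment lives on the reference that points to it.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from typing import Optional

@dataclass(frozen=True)
class Source:
    """Just the facts: a type plus element name/value pairs. Read-only."""
    xref: str
    source_type: str
    elements: tuple  # ((name, value), ...)

@dataclass
class SourceReference:
    """Where a conclusion points at a source; all judgement goes here."""
    source_xref: str
    notes: list = field(default_factory=list)   # assumptions and conclusions
    quay: Optional[int] = None                  # certainty assessment, 0 to 3 in GEDCOM 5.5

blog = Source("@S123@", "BlogPost", (("BlogTitle", "They Came Before"),))

# Two conclusions can assess the same source differently without touching it.
first = SourceReference(blog.xref, notes=["Author writes from personal knowledge."], quay=3)
second = SourceReference(blog.xref, notes=["Date given here is second-hand."], quay=1)

try:
    blog.source_type = "book"
except FrozenInstanceError:
    print("source records are read-only")
```

Because the source carries nothing subjective, the same record could come straight from a repository’s catalogue and be shared by many researchers.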

Doing so will allow repositories to use this part of the new standard as the format for cataloguing their source information in a standardized manner, compatible with the new genealogical data transfer standards. Genealogy software would be able to read these files and search and download the sources relevant to the user.

Genealogists would easily be able to keep their own libraries of interesting sources and share them with others. They could volunteer to catalogue the source information for repositories and even contribute their own source libraries to the world’s knowledgebase.

This would open up new possibilities for genealogy data sharing and data exchange. (Kessler, Vision, 2011)

 

Summary and Recommendation

The current version of GEDCOM has extensive sourcing capabilities. However, they are complicated to interpret and use. A simpler method is needed.

FHISO should develop a set of standard source types and source element types.

FHISO should use a simple mechanism to transfer the source element values in the standard they will develop.

FHISO can allow, but should discourage user defined identifiers. FHISO should accept requests for new identifiers to be added to a future version of the standard.

FHISO should decide if sources and repositories should be defined just by a source record, or if there should be an additional repository record as well.

Citation templates are not data. They are formatting.

Citation templates must not be transferred with the source data. Instead, the programs should allow the user to format their citations their way.

Developing citation templates for all the various methodologies is a desirable task, but less important than providing a standard for the transfer of genealogy data.

FHISO should first develop a standard to transfer genealogy data. They should not initially distract themselves from their main goal by attempting to also standardize citation templates. That can be left for later.

Conclusions must be left out of the source details.

The source standard created should work for transferring sources between genealogy software and also for recording source information by repositories.

Works Cited

Composer, G. (2011, December 27). Zotero blogPost graphic-example. Retrieved from BetterGEDCOM Wiki: http://bettergedcom.wikispaces.com/file/detail/Zotero_blogPost_graphic-example.png

Jones, T. (2011, June 27). Genealogy Citation Standard. Retrieved from Modern Software Experience: http://www.tamurajones.net/GenealogyCitationStandard.xhtml

Jones, T. (2011, October 5). The SourceTemplates Initiative. Retrieved from Modern Software Experience: http://www.tamurajones.net/TheSourceTemplatesInitiative.xhtml

Jones, T. (2014, August 21). FamilySearch GEDCOM Specifications. Retrieved from Modern Software Experience: http://www.tamurajones.net/FamilySearchGEDCOMSpecifications.xhtml

Kessler, L. (2011, December 16). Separation of Sources from Conclusions. Retrieved from BetterGEDCOM Wiki: https://bettergedcom.wikispaces.com/share/view/48324558

Kessler, L. (2011, August 13). Vision. Retrieved from BetterGEDCOM Wiki: http://bettergedcom.wikispaces.com/Vision

Kessler, L. (2013, July 29). Inventing Source-based Data Entry. Retrieved from Louis Kessler’s Behold Blog: http://www.beholdgenealogy.com/blog/?p=1321

Kessler, L. (2013, June 5). Nine Necessities in a GEDCOM Replacement. Retrieved from Paper 78 submitted to FHISO’s call for papers, Necessity #1: Separation of Sources from Conclusions: http://fhiso.org/files/cfp/cfps78.pdf

Mills, E. S. (2007). Evidence Explained: Citing History Sources from Artifacts to Cyberspace. Baltimore: Genealogical Publishing Company, Inc.

Smith, R. (2014, August 29). The role of email, attachments, slack, github, etc, in FHISO’s work. Retrieved from TSC-public mailing list archives: http://fhiso.org/pipermail/tsc-public_fhiso.org/2014/000117.html

testuser42. (2011, December 19). List of main Citation Elements. Retrieved from BetterGEDCOM Wiki: http://bettergedcom.wikispaces.com/List+of+main+Citation+Elements

Thorud, G., Composer, G., & Hatchett, A. (2012, March 5). Sources and Citations. Retrieved from BetterGEDCOM Wiki: http://bettergedcom.wikispaces.com/page/history/Sources+and+Citations

Wetmore, T. (2014, August 29). The role of email, attachments, slack, github, etc, in FHISO’s work. Retrieved from TSC-public mailing list archives: http://fhiso.org/pipermail/tsc-public_fhiso.org/2014/000102.html

Yates, J. H. (2010, February). Two Computer Ready Parametrizations of "Evidence Style" Historical Sources. Retrieved from http://jytangledweb.org/genealogy/evidencestyle/

 


[1] http://www.beholdgenealogy.com/blog/?p=1395

[2] Behold is a program that reads GEDCOM data files and displays all the information from them.

[3] For example, Randy Seaver, Genea-Musings: Software Programs, GEDCOM Files and Source Citations – Some Recommendations, February 17, 2011, http://www.geneamusings.com/2011/02/software-programs-gedcom-files-and.html

[4] Cite Your Sources, paperback, June 1, 1978, Amazon.com http://www.amazon.com/Cite-Your-Sources-Richard-Lackey/dp/9995236478

[5] http://www.apastyle.org/ - Note that APA is being used for this paper.

[6] http://www.chicagomanualofstyle.org/

[7] Citesource: MLA Style – Blog Post. http://citesource.trincoll.edu/mla/mlablogpost_002.pdf

From Ancient GEDCOM to Prehistoric GEDCOM - Sun, 17 Aug 2014

A few hours ago, I posted an article wondering if I may have found the world’s oldest GEDCOM file. Tamura Jones in response emailed me one that he thought may be older.


Instead of beginning with a 0 HEAD record as all GEDCOMs do, this file begins with a 0HH record (no space between the 0 and the HH). It is followed by lines with a single number, a two character tag and a value. The tag for INDI records is II and the tag for FAM records is FI. The file ends with a 0ND record. It definitely looks like it might be the creature that preceded the GEDCOM file I had earlier found.

Then when I looked at the data in this file, I was even more surprised. It proves to be a file created by Phillip Brown’s Family History System which is especially apparent because Phillip’s own data is in the sample file.

It just so happens that Family History System was the first genealogy program I ever purchased back in 1993. I loved its Relative Report which was the inspiration for my Everything Report in Behold. But I never did use it for my own genealogy because its data input was not to my liking. A few years later I got Reunion for Windows for my data entry.

I no longer have the Family History System program on my current computer (it was a DOS-based program) … BUT … I still happened to have the Family History System user manual in hardcopy form! To my surprise, the manual not only talks about the Export/Import Utility, but it provides a full six page description of the Transfer Dataset Format. At the beginning of that, it says:

“The TRANSFER datasets used in the Export/Import processes of the Family History System Extension were designed using the original guidelines developed by the LDS Genealogical Department for representing genealogical information in standard character format. The name that is being used to describe this format is GEDCOM (for GEnealogical Data COMmunication format). The format that was used in the GEDCOM implementation of the LDS Personal Ancestor File (PAF) 2.0 software differed significantly from this original description.”

Since PAF 2.0 used GEDCOM 2.0, which we believe is the format of the file I found earlier today, FHS must have been exporting using GEDCOM 1.0.

There is an Import/Export utility that was also supplied with the FHS. The documentation there stated:

“The format of the transfer dataset follows closely the original GEDCOM format proposed by the LDS Genealogical Dept. and advocated by the quarterly journal, “Genealogical Computing”. … The formats of the transfer datasets implemented by releases 2.0 and 2.1 of the Personal Ancestor File (PAF) software distributed by the LDS Family History Dept. differed from the original guidelines and so are not compatible with the format used by this program. A separate FHS export/import program, compatible with the PAF software is now a part of the basic set of programs.”

In other words, FHS exported/imported GEDCOM 1.0 via its built-in transfer program. It used a separate export/import program for GEDCOM 2.0 (PAF 2.0) and GEDCOM 3.0 (PAF 2.1). So the file that Tamura showed me was indeed GEDCOM 1.0 as produced by Family History System.

Will I support GEDCOM 1.0 in Behold? Well I could. But I doubt if anyone has any files of that format lying around that they really need to extract the data from. Let me know if you do.

p.s. I started subscribing to Genealogical Computing in 1992 – Volume 12. By then, GEDCOM was already up to version 5.0. I have all the issues from 1992 until they stopped publishing in 2005. They are a fantastic historical record of early genealogy software. Does anyone out there have copies of volumes 1 to 11?

The World’s Oldest GEDCOM File? - Sun, 17 Aug 2014

While preparing my presentation of Reading Wrong GEDCOM Right for the Gaenovium Conference, I wanted to see if I had in my collection of over 600 test GEDCOM files some early GEDCOMs from the pre-GEDCOM 5.0 era.

I searched my files for some of the pre-GEDCOM 5.0 tags outlined by Tamura Jones in his GEDCOM Tags article. I didn’t have any such files. So I searched the web. I was surprised to find just one single file. It was among the collection of GEDCOMs at the now abandoned Genealogy Forum site.

The file I found was gedr6127.ged and the start of it looks like this:

image

The file information for this file at Genealogy Forum states that it was uploaded by Jean Hudson Masco on April 21, 1997. This is well past the introduction of the GEDCOM 5.0 draft in December 1991. The file header states that it was created with PAF, so the file must have been created by what in 1997 was a very old version of PAF. The VERS tag is not given in the HEAD section of this GEDCOM, so the version of PAF cannot be identified, nor can the version of GEDCOM that this file represents.

This was a very exciting find for me, sort of like an archaeological dig unearthing an ancient unknown language. I don’t know anyone who has a specification of GEDCOM prior to version 5.3 (if anyone does, please let me know), so it now became a matter of interpreting the text and seeing if I could translate it.

As it is, the current version of Behold cannot display the individuals and families in this file correctly. The first problem is that on the 0 INDI record lines, there is no space between the end of the identifier, i.e. @242@, and the tag, i.e. INDI.

This also is a problem on the 0 FAM tags, except in this file they are not FAM tags but FAMI tags with an “I” on the end.
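Both quirks can be repaired line by line before normal parsing. Here is a minimal sketch in Python (not Behold’s actual code) that assumes a record line looks like 0 @242@INDI or 0 @89@FAMI; the function name and the regular expression are my own illustration.

```python
import re

# Level 0, an @...@ identifier, optional whitespace, then the tag.
_RECORD = re.compile(r"^0\s+(@[^@]+@)\s*(\w+)(.*)$")


def normalize_record_line(line):
    """Insert the missing space after the identifier and map FAMI to FAM.

    Lines that are not level-0 records with an identifier are returned
    unchanged (apart from a stripped line ending).
    """
    line = line.rstrip("\r\n")
    match = _RECORD.match(line)
    if not match:
        return line
    xref, tag, rest = match.groups()
    if tag == "FAMI":  # old spelling of the family record tag
        tag = "FAM"
    return f"0 {xref} {tag}{rest}"
```

A line that already has the space, such as 0 @1@ INDI, passes through unchanged, so the same routine can be run on newer files without harm.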

The other interesting difference is the linkages. Look in the above example and you’ll see two lines containing: 1 PARE 2 RFN @89@. This is a link to the parents of the person, and in versions since GEDCOM 5.3, this has become a single line: 1 FAMC @89@.

All the other linkages were different as well. The list is:
- FAMC was PARE + RFN
- FAMS was FAMF + RFN
- CHIL was CHIL + YOUN
- HUSB was HUSB + RFN
- WIFE was WIFE + RFN
and I’m still working on the extra one they had which now has no equivalent:
- SIBL + OLD
which seems to be a linkage to a sibling which should be redundant information, but I’ll check that.
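The mapping in the list above can be sketched as a small lookup table. This Python example is an illustration only: it assumes every old link is a tag line followed directly by a pointer line one level deeper (the layout seen for 1 PARE / 2 RFN @89@), which I have not confirmed for CHIL + YOUN. The names OLD_LINKS and merge_old_links are hypothetical, and SIBL + OLD is left untouched since it has no modern equivalent.

```python
# (old link tag, old pointer sub-tag) -> tag used since GEDCOM 5.3
OLD_LINKS = {
    ("PARE", "RFN"): "FAMC",
    ("FAMF", "RFN"): "FAMS",
    ("CHIL", "YOUN"): "CHIL",  # assumed to follow the same two-line layout
    ("HUSB", "RFN"): "HUSB",
    ("WIFE", "RFN"): "WIFE",
}


def merge_old_links(lines):
    """Collapse old two-line links into the single-line modern form."""
    out = []
    i = 0
    while i < len(lines):
        head = lines[i].split()
        nxt = lines[i + 1].split() if i + 1 < len(lines) else []
        is_pair = (
            len(head) == 2 and len(nxt) == 3
            and head[0].isdigit() and nxt[0].isdigit()
            and int(nxt[0]) == int(head[0]) + 1
            and (head[1], nxt[1]) in OLD_LINKS
        )
        if is_pair:
            new_tag = OLD_LINKS[(head[1], nxt[1])]
            out.append(f"{head[0]} {new_tag} {nxt[2]}")
            i += 2  # both old lines are consumed
        else:
            out.append(lines[i])
            i += 1
    return out
```

With this, the two lines 1 PARE and 2 RFN @89@ become the single line 1 FAMC @89@, while any line that does not match a known pair is copied through as it was.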

The dates are also in yyyymmdd format which has been changed in newer GEDCOMs to dd MMM yyyy. In a way, the old version was better, because it is the basic ISO standard for date representation. Within a GEDCOM file, it doesn’t matter how a date is stored. The GEDCOM file is not meant to be viewed by the genealogist. It is your genealogy software that simply must load the information and display it understandably for you. And using English month names for the 3-letter abbreviation does more harm than good. So I’m not sure why later versions made this change.
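Converting between the two date forms is simple for complete dates. The sketch below, in Python, assumes the old value is exactly eight digits (yyyymmdd); I do not know how partial or approximate dates were written in the old files, so anything else is returned unchanged. The function name is hypothetical.

```python
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def old_date_to_gedcom5(value):
    """Convert a yyyymmdd value to the dd MMM yyyy form of newer GEDCOMs.

    Assumption: only complete eight-digit dates are converted; any other
    value is returned as-is.
    """
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        return value
    year, month, day = value[:4], int(value[4:6]), int(value[6:])
    if not 1 <= month <= 12 or not 1 <= day <= 31:
        return value
    return f"{day} {MONTHS[month - 1]} {year}"
```

For example, 18500312 becomes 12 MAR 1850. The check on the day is only a range check, not a calendar check.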

So I have now changed my development version of Behold so that these situations will be handled. This will be included in the next release of Behold in case anyone else happens to have some ancient GEDCOMs lying around. Once I did that, Behold was able to properly present the information in the file.

Do you have any of these ancient GEDCOMs lying around in this format? The sure way to tell is if the file ends with a line containing: 0 EOF. Newer GEDCOM versions end with 0 TRLR.  I wouldn’t mind having a few more for testing, so if you have an oldie, please contact me.
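If you have a folder full of GEDCOMs, that last-line test is easy to automate. This is a minimal Python sketch using only the two trailer lines mentioned above; the function name and the returned labels are my own.

```python
def guess_gedcom_era(lines):
    """Classify a file by its last non-blank line.

    Returns "old" for 0 EOF, "newer" for 0 TRLR, and "unknown" otherwise.
    """
    for line in reversed(lines):
        text = line.strip()
        if not text:
            continue  # skip trailing blank lines
        if text == "0 EOF":
            return "old"
        if text == "0 TRLR":
            return "newer"
        return "unknown"
    return "unknown"
```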


Followup: Aug 21, 2014

This file has now been confirmed to be a GEDCOM 2.0 file.

Discussion the next day with Tamura Jones led to the conclusion that there is an older format, GEDCOM 1.0, and the program Family History System by Phillip Brown seems to be the only program built to export to (and import from) that earliest format. Even PAF only started with GEDCOM 2.0.

See the GEDCOM 1.0 article by Tamura Jones for interesting information on this.