
Louis Kessler's Behold Blog

My 1000’th Blog Post - Sun, 17 Jan 2016

Well, it took a little longer for me than the other prolific genealogical bloggers who I follow, but I’ve finally reached this milestone with this post.

image

So a time to look back at the 1000 posts.

My first post was on November 7, 2002. It was a short one just to get started. That was over 13 years ago and genealogy blogs were quite rare. Dick Eastman started his newsletter 20 years ago, in 1996. I opened my websites in January 1997 (so my websites’ 20th is coming up next year).

Back then, I didn’t realize how important it was to use a proper title for a blog post, so I used the date. It wasn’t until Oct 15, 2007 that I started using titles.

Prior to 2008, my Behold site was a subdirectory of my www.lkessler.com site. My blog was done manually with manually edited pages. In 2008, I opened up the www.beholdgenealogy.com website and built and customized the Wordpress blog that you see now. I transferred all my earlier blog posts over to this blog. I integrated it with my site and with the bbPress-based forum for Behold with single login and combined search tools. Both the blog and forum have RSS feeds for easy following.

Of my 1000 Blog posts:

  • 750 were about Behold
  • 54 were about GenSoftReviews
  • 59 mentioned Winnipeg
  • 16 mentioned chess
  • 132 mentioned Delphi, the programming language I use for Behold

The posts that I am most proud of are the ones where I’ve reported on the various genealogical conferences that I have attended. I tried to put together interesting summaries with pictures from a genealogy software developer’s perspective. These included:

and I will be blogging about the upcoming 10th Unlock the Past Genealogy Cruise that will run from Feb 14 to March 3.

I have had Google Analytics on my Behold website since I created it in 2008. Over that time, the entire site has had 370,566 page views. My blog accounts for over half of those, at 213,712 page views, with my blog home page getting 72,820 of them. Over the past year, those stats are 74,057 (203 per day) for my site, 45,194 (124 per day) for my blog and 9,142 (25 per day) for my blog home page.

My number one blog post by a wide margin according to Google Analytics had nothing to do with genealogy: It was My Torn Achilles Tendon After Nine Weeks from Oct 18, 2011 with 31,180 page views. (Just to let you know, I was playing squash again all out after 6 months, and 2 years later my leg was fully 100%.)

Numbers 2 through 10 were:

2. Family Group Sheets – Why and Wherefore? - Mar 23, 2012, 5,241 views.

3. Setting up a Solid State Drive with Windows 8.1 - May 17, 2014, 4,124 views.

4. Can Genealogy Software be Rated Fairly? - Jan 6, 2013, 3,283 views.

5. Back to Forum Five - Feb 12, 2008, 2,622 views.

6. How Good are GenSoftReview Ratings? – Jan 4, 2014, 2,508 views.

7. What Ancestry’s “Retirement” of FTM Really Means – Dec 11, 2015, 2,418 views.

8. Standardizing Sources and Citation Templates – Aug 27, 2014, 2,020 views.

9. Source Based Document Organization – Jun 9, 2013, 1,641 views.

10. The Future of Genealogy – 6 Predictions – Apr 7, 2015, 1,627 views.

This has been fun, and I intend to continue to blog about what I’m doing, advancements in genealogy, and where I’m going with Behold. Please stay with me and follow along as I journey towards my second thousand posts.

SEX is Mostly Known - Sat, 16 Jan 2016

My last post took a look at the SEX tag in GEDCOM and came up with a way to guess at it when it is not specified.

Tamura Jones challenged me on this, saying that records that do not state the sex are rare. He asked how useful it is to try to figure it out in the other cases, and suggested that I should stop after my rule number 1.

That made sense to me. If the SEX tag is mostly defined as M or F, then there really is no need to go to great lengths to figure out all the missing cases.

So I took a number of GEDCOM files from my collection that I use for testing and picked files as different from each other as possible created by different programs. I totalled up the number of SEX tags with the values: M for Male, F for Female, U for Unknown and the number of individuals who did not have a SEX tag. Out of over 450,000 individuals in these files, only 95 used the U value, and 1966 did not have a SEX tag.

image

Well that’s not that many. If this is representative of the data out there, then 99.6% of individuals are coded as male or female.
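
A tally like the one described above could be sketched in Python as follows. This is only an illustration, not the Delphi code Behold uses: the function names `count_sex_values` and `percent_coded` are hypothetical, and the sketch assumes a plain GEDCOM file with one "level tag value" line per row and no continuation lines inside the SEX value.

```python
from collections import Counter

def count_sex_values(gedcom_lines):
    """Tally level 1 SEX values per INDI record: M, F, U, other, or missing."""
    counts = Counter()
    in_indi = False   # are we inside an INDI record?
    has_sex = False   # has this INDI record already shown a SEX tag?
    for raw in gedcom_lines:
        parts = raw.strip().split(None, 2)
        if len(parts) < 2:
            continue
        level = parts[0]
        if level == "0":
            # A new level 0 record closes the previous one.
            if in_indi and not has_sex:
                counts["missing"] += 1
            in_indi = len(parts) == 3 and parts[2] == "INDI"
            has_sex = False
        elif in_indi and level == "1" and parts[1] == "SEX" and not has_sex:
            value = parts[2].strip() if len(parts) == 3 else ""
            counts[value if value in ("M", "F", "U") else "other"] += 1
            has_sex = True
    # Close the final record if the file ends without a trailer.
    if in_indi and not has_sex:
        counts["missing"] += 1
    return counts

def percent_coded(counts):
    """Share of individuals coded as M or F, as a percentage."""
    total = sum(counts.values())
    return 100.0 * (counts["M"] + counts["F"]) / total if total else 0.0
```

Running `count_sex_values` over each test file and summing the resulting counters would give totals of the kind quoted above; `percent_coded` then turns them into the share coded as male or female.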

Just for fun, I thought I’d see how well my Rule 2 would work on the 2061 missing and unknowns. So here’s the count of the SEX tags of the people that are pointed to by the HUSB tag:

image

And here’s the count of the SEX tags of the people pointed to by the WIFE tag:

image

There are 111 females pointed to by the HUSB tag, and 105 males pointed to by the WIFE tag. These could be same sex couples, or could simply be a mistake on the sex designation.

There are 262 unknowns or missings pointed to by HUSB tags, and 285 pointed to by WIFE tags. These 547 people out of the 2061 could with little effort be designated with reasonable certainty to be male and female.
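
The HUSB and WIFE counts above amount to a cross-tabulation of each pointer against the SEX value of the person it points to. A rough Python sketch of that idea follows; `spouse_sex_table` is a hypothetical name, and the sketch counts pointers rather than distinct people, so someone who is the HUSB in two families is counted twice. That is an assumption of this example, not necessarily how the figures above were produced.

```python
from collections import Counter

def spouse_sex_table(gedcom_lines):
    """Count (pointer tag, SEX label) pairs for every HUSB and WIFE pointer."""
    sex_of = {}      # INDI xref -> first SEX value seen
    pointers = []    # (tag, target xref) for each HUSB/WIFE line
    xref, rec = None, None
    for raw in gedcom_lines:
        parts = raw.strip().split(None, 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2].strip() if len(parts) == 3 else ""
        if level == "0":
            # Only INDI and FAM records matter here.
            xref, rec = (tag, value) if value in ("INDI", "FAM") else (None, None)
        elif level == "1" and rec == "INDI" and tag == "SEX":
            sex_of.setdefault(xref, value)
        elif level == "1" and rec == "FAM" and tag in ("HUSB", "WIFE"):
            pointers.append((tag, value))
    table = Counter()
    for tag, target in pointers:
        sex = sex_of.get(target)
        if sex is None:
            label = "missing"
        elif sex in ("M", "F", "U"):
            label = sex
        else:
            label = "other"
        table[(tag, label)] += 1
    return table
```

In this sketch, `table[("HUSB", "F")]` and `table[("WIFE", "M")]` correspond to the mismatches mentioned above, and the entries labelled `U` or `missing` are the candidates for Rule 2.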

Tamura mentioned to me a couple of other methods that could be used to reasonably guess at the sex if it is unknown:

1. The first and middle names of the person (if known) may give a clue. But this would require building up a database of male names, female names and male-or-female names, and would have to span languages and cultures. That’s quite a bit more work than I would want to do for this task. Besides, it’s quite likely that the people whose sex is unknown either are so because they have a male-or-female name, or they are a child of unknown first name and unknown sex.

2. The surname of the children (if known) often will match the surname of the male parent and not match the maiden name of the female parent. This passing down of names is not true for all cultures, but it would be a relatively easy test to perform. Once again, it is likely that if the children’s surnames and the parents’ surnames were all known, then the sex of the parents would have been known.

Overall, there is not much to be gained by any attempt to guess at the unknown or missing sexes. That one FTW file (above) would have been the only one of the 12 that would have benefited from this work, and still, it would have only coded an extra 519 of the 33,790 people in the file. That’s hardly worth the effort.

Instead, what I’ve done is add a warning to the log file for any person whose sex is unknown or missing. Looking at the 850 warnings from the FTW file, it becomes apparent that 90% of them are obvious just from the name of the person, so really, nothing else need be done other than giving the information through these warnings. Rather than guessing, let the user fix the data.
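
As an illustration of what such warnings might look like, here is a small Python sketch. It is not Behold's log format: the function name `sex_warnings` and the message wording are made up for this example, and it treats any SEX value other than M or F as "unknown".

```python
def sex_warnings(gedcom_lines):
    """Return one warning per INDI record whose SEX is missing or not M/F."""
    warnings = []
    record = None  # details of the INDI record being read, if any
    # A trailing level 0 line makes sure the last record is checked too.
    for raw in list(gedcom_lines) + ["0 TRLR"]:
        parts = raw.strip().split(None, 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) == 3 else ""
        if level == "0":
            if record is not None and record["sex"] not in ("M", "F"):
                state = "missing" if record["sex"] is None else "unknown"
                warnings.append("Sex is %s for %s %s"
                                % (state, record["xref"], record["name"]))
            if value == "INDI":
                record = {"xref": tag, "name": "(no name)", "sex": None}
            else:
                record = None
        elif record is not None and level == "1":
            if tag == "NAME" and record["name"] == "(no name)":
                # Drop the slashes GEDCOM puts around the surname.
                record["name"] = value.replace("/", "").strip()
            elif tag == "SEX" and record["sex"] is None:
                record["sex"] = value.strip()
    return warnings
```

Because each message includes the person's name, a user reading the log can usually tell at a glance what the sex should be and correct the data at its source.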

Theoretically, the techniques of using HUSB, WIFE, first name and children’s surname could be used to check the assigned sex, and provide a warning if it looks like the sex may be incorrect. But this too may be a bit of overkill, so I’ll leave this up to some other programmer to implement in some utility program.

Sorry about this diversion. I agree with Tamura. It looks like there’s no need to go any further than simply using the SEX value as defined.

Now back to our regularly scheduled programming. I am working on this now

Sex in GEDCOM - Thu, 14 Jan 2016

I have come across a need to check out the SEX tag in GEDCOM. Some of the new DNA features I’m finishing up for the next version of Behold make important use of the sex of the individual. Determining autosomal, X, Y and mitochondrial DNA shares between two individuals is much less accurate when the sex of anyone in the relationship line is not known.

GEDCOM includes sex quite succinctly as a level 1 tag of an individual defined like this:

+1 SEX <SEX_VALUE>   {0:1}

where

SEX_VALUE :=    { Size=1:7 }
A code that indicates the sex of the individual:
      M = Male
      F = Female
      U = Undetermined from available records and quite sure that it can’t be

A few oddities already. It appears that only “M”, “F” and “U” are allowed for the SEX_VALUE, and I’ve never noticed a program that doesn’t adhere to this. But if you read carefully, the definition does not require that the value be restricted to these three. It leaves the door open to other possibilities (what, I can’t guess at). I find it very strange to see Size=1:7 if only one-character codes are allowed. Why not Size=1:1?

Also, it is possible for the SEX tag to be missing, since the {0:1} cardinality allows zero or one occurrence.

My interest from the DNA perspective is in trying to determine, if possible, whether the individual is male or female.

So let’s use rule number 1:

1. If the SEX_VALUE is “M”, the individual is assumed to be male.
    If the SEX_VALUE is “F”, the individual is assumed to be female.
    If the SEX_VALUE is anything else, or missing, then the sex is unknown.
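
Rule 1 is simple enough to write down directly. The Python sketch below is one way to express it; the function name `sex_from_tag` and the use of `None` for a missing SEX tag are choices made for this example. It compares the value exactly, so a lowercase "m" or a spelled-out "Male" falls into "anything else".

```python
def sex_from_tag(sex_value):
    """Rule 1: map a SEX_VALUE (or None when the tag is missing) to a sex."""
    if sex_value is None:
        return "unknown"      # no SEX tag at all
    value = sex_value.strip()
    if value == "M":
        return "male"
    if value == "F":
        return "female"
    return "unknown"          # "U" or any other value
```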

If that was all of it, we’d be done. But there’s more.

Children have parents. Genetically, they always have a father and a mother, although that isn’t necessarily so for adoptive parents, foster parents, etc. Again, I’m going to restrict myself to DNA interest and assume that there is one male and one female parent, whether or not the parents are known or unknown.

In a GEDCOM file, each individual points with a FAMC tag to the FAM record that contains the person’s parents. An individual could have more than one FAMC tag and point to multiple FAM records. Only one of those FAM records can be the birth parents. All the other FAM records must each contain at least one non-birth parent.

If a person has multiple sets of parents, then it is important to know which parents are the birth parents. GEDCOM does not give any specific rules for ordering FAMC tags. It does give a rule for ordering CHIL (child) tags and states: “The preferred order of the CHILdren pointers within a FAMily structure is chronological by birth”. You would think then, that a logical extension would be that FAMC tags should also be ordered chronologically, with the birth parents always listed first. Behold already checks the “MARR” date and reorders the FAMCs when the dates are out of order. I don’t believe very many programs enforce FAMC order for their GEDCOM output as I’ve seen incorrectly ordered FAMCs in a good number of the test files I use.

The FAMC tag could have a level 2 PEDI tag under it which contains a PEDIGREE_LINKAGE_TYPE value, which is one of: “adopted”, “birth”, “foster” or “sealing”. If this tag is listed and “birth” is specified, then that FAMC tag should be listed first. Now we have more complications. We have to ensure that at most one FAMC tag for an individual has a PEDI tag with a “birth” value. In practice, I have not seen the PEDI tag used very often.

GEDCOM also allows (just to make a genealogy programmer’s job more difficult) a FAMC tag to be subordinate to an individual’s BIRT (birth), CHR (christening), or ADOP (adoption) tag. Here if a FAMC tag is subordinate to a BIRT tag, then the family should be the first FAMC. I have seen this used occasionally.
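
Pulling these clues together, choosing the birth family might be sketched in Python as below. Several things here are assumptions of the sketch rather than anything GEDCOM or Behold prescribes: the name `pick_birth_family`, the input shape (a list of FAMC pointers with their PEDI values, plus an optional FAMC found under BIRT), the precedence given to the BIRT-subordinate FAMC over a PEDI “birth” value, and the fallback to the first FAMC that has no PEDI tag. Reordering by MARR date is left out.

```python
def pick_birth_family(famc_links, birt_famc=None):
    """Pick the FAM record most likely to hold the birth parents.

    famc_links: list of (fam_xref, pedi) in file order, pedi is None
                when the FAMC tag has no PEDI tag under it.
    birt_famc:  xref of a FAMC tag found under the BIRT tag, if any.
    Returns (fam_xref or None, list of warning strings).
    """
    warnings = []
    birth_marked = [fam for fam, pedi in famc_links
                    if pedi is not None and pedi.strip().lower() == "birth"]
    if len(birth_marked) > 1:
        warnings.append("More than one FAMC has PEDI birth: "
                        + ", ".join(birth_marked))
    if birt_famc is not None:
        if birth_marked and birt_famc not in birth_marked:
            warnings.append("FAMC under BIRT disagrees with PEDI birth")
        return birt_famc, warnings
    if birth_marked:
        return birth_marked[0], warnings
    # No explicit clue: take the first FAMC not marked as non-birth.
    for fam, pedi in famc_links:
        if pedi is None:
            return fam, warnings
    return None, warnings
```

For example, `pick_birth_family([("@F1@", "adopted"), ("@F2@", None)])` skips the adoptive family and settles on `@F2@`.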

Okay. Now we’ve established to the best of our ability, the FAM record of the birth parents. Now we have to determine who the parents are.

I was going to describe the FAM record and HUSB and WIFE tags in much more detail, but I don’t have to because I’ll just point you to an excellent article that Tamura Jones just happened to publish earlier today: Marriage in GEDCOM

Tamura correctly states that the FAM record need not contain the HUSB or the WIFE tags. If not, well, then we just don’t know who that parent is.

My interest for my DNA purpose, however, is to determine each parent’s sex. The HUSB and WIFE tags will point to the parents’ INDI records, and each INDI record may have a SEX tag, so we can use rule 1 (above).

But what if rule 1 results in “unknown”? Then should we be able to infer the parent’s sex by which one was associated with the HUSB tag and which one was associated with the WIFE tag? I’m not 100% sure yet. I believe so, but I don’t know whether many programs enforce this association when exporting to GEDCOM. My next step will be to add a check into Behold that will see if the HUSB tag is pointing to a female individual, or if the WIFE tag is pointing to a male individual.

I would think the SEX tag of the individual (rule 1) normally should overrule the HUSB/WIFE tag pointing to the individual. So I would add rule 2:

2. if the sex is unknown from rule 1, then
    if only HUSB pointers point to this individual, assume he is male.
    if only WIFE pointers point to this individual, assume she is female.
    if both HUSB and WIFE pointers point to this individual, issue an error.
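
Rules 1 and 2 together could be sketched in Python like this. The function name `apply_rules`, the pointer-count parameters and the error wording are all hypothetical; the counts are assumed to have been gathered beforehand by scanning every FAM record for HUSB and WIFE tags that point to the individual.

```python
def apply_rules(sex_value, husb_refs, wife_refs):
    """Apply rule 1, then rule 2 if the sex is still unknown.

    sex_value: the SEX_VALUE, or None when the SEX tag is missing.
    husb_refs: number of HUSB pointers that point to this individual.
    wife_refs: number of WIFE pointers that point to this individual.
    Returns (sex, error message or None).
    """
    # Rule 1: an explicit M or F always wins.
    if sex_value == "M":
        return "male", None
    if sex_value == "F":
        return "female", None
    # Rule 2: fall back on how FAM records point to the individual.
    if husb_refs and wife_refs:
        return "unknown", "both HUSB and WIFE pointers point to this individual"
    if husb_refs:
        return "male", None
    if wife_refs:
        return "female", None
    return "unknown", None
```

Someone with no SEX tag who appears only as a HUSB comes back as male, while someone with SEX F stays female no matter which tags point to them.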

But if these are the birth parents, they cannot be the same sex. If these two rules result in both birth parents being assigned the same sex, then Behold will provide a message pointing out the conflict and indicate that for this case, it will assume the HUSB/WIFE tags to be correct.
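
The birth-parent check described here might look like the following Python sketch. The name `resolve_birth_parent_sexes` and the message text are invented for the example; the inputs are assumed to be the "male", "female" or "unknown" results already worked out for the individuals pointed to by the HUSB and WIFE tags of the birth family.

```python
def resolve_birth_parent_sexes(husb_sex, wife_sex):
    """Resolve a same-sex result for birth parents in favour of HUSB/WIFE.

    Returns (sex of the HUSB parent, sex of the WIFE parent, message or None).
    """
    message = None
    if husb_sex == wife_sex and husb_sex in ("male", "female"):
        # Birth parents cannot share a sex, so trust the HUSB/WIFE tags.
        message = ("Birth parents are both recorded as %s; "
                   "assuming the HUSB is male and the WIFE is female."
                   % husb_sex)
        husb_sex, wife_sex = "male", "female"
    return husb_sex, wife_sex, message
```

This override only makes sense for the family chosen as the birth family; for any other FAM record the recorded sexes would be left alone.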

If the individual’s SEX is specified, then rule 2 is not needed and the HUSB and WIFE pointers do not have to be looked at. But what if the HUSB or WIFE tag conflicts with the SEX tag? This is possible in the case of same-sex marriages, and assigning both individuals the same sex likely is a reasonable way of adding same sex marriages to a GEDCOM standard that many have said does not allow it. Two individuals can both be males or both be females. But two HUSB tags or two WIFE tags are not allowed. Therefore for a same-sex marriage the HUSB tag would point to one individual, and the WIFE tag would point to the other.

GEDCOM states:  “The family record structure assumes that the HUSB/father is male and WIFE/mother is female.” Note that it says “assumes”, and does not state “requires”.

So in GEDCOM, a same-sex couple could be represented as:

The family record is no different than normal:

0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@

The INDI records for two males:

0 @I1@ INDI
1 SEX M
1 FAMS @F1@

0 @I2@ INDI
1 SEX M
1 FAMS @F1@

or for two females:

0 @I1@ INDI
1 SEX F
1 FAMS @F1@

0 @I2@ INDI
1 SEX F
1 FAMS @F1@

For more information about same-sex couples in GEDCOM, read Tamura Jones’ article: Same-Sex Marriage in GEDCOM

With regards to GEDCOM, I daresay that SEX is neither clean nor easy.