Louis Kessler’s Behold Blog » Blog Entry

SEX is Mostly Known - Sat, 16 Jan 2016

My last post took a look at the SEX tag in GEDCOM and came up with a way to guess at it when it is not specified.

Tamura Jones challenged me on this. He thought that records that do not state the sex are rare, asked how useful it is to try to figure it out in the other cases, and suggested that I stop after my rule number 1.

That made sense to me. If the SEX tag is mostly defined as M or F, then there really is no need to go to great lengths to figure out all the missing cases.

So I took a number of GEDCOM files from my collection that I use for testing, picking files that were created by different programs and were as different from each other as possible. I totalled up the number of SEX tags with each value (M for Male, F for Female, U for Unknown) and the number of individuals who did not have a SEX tag. Out of over 450,000 individuals in these files, only 95 used the U value, and 1966 did not have a SEX tag.
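For anyone who wants to run the same kind of count on their own files, here is a minimal sketch in Python. It is not the code I used; the function name `tally_sex` is made up for illustration, and it assumes a plain GEDCOM line layout (level, optional @xref@, tag, optional value) with the SEX tag at level 1 under each INDI record. Treating an empty SEX value as missing is also an assumption.

```python
import re
from collections import Counter

# Assumed GEDCOM line layout: level, optional @xref@, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s(.*))?$")

def tally_sex(lines):
    """Count SEX values per individual; 'missing' means no usable SEX tag."""
    counts = Counter()
    in_indi = False   # True while we are inside an INDI record
    sex = None        # SEX value seen so far for the current individual
    for raw in lines:
        m = LINE_RE.match(raw.rstrip("\r\n"))
        if not m:
            continue  # skip lines that do not look like GEDCOM
        level = int(m.group(1))
        tag = m.group(3)
        value = (m.group(4) or "").strip()
        if level == 0:
            # A new level-0 record closes the previous individual, if any.
            if in_indi:
                counts[sex or "missing"] += 1
            in_indi = (tag == "INDI")
            sex = None
        elif in_indi and level == 1 and tag == "SEX":
            # Assumption: an empty SEX value is counted as missing.
            sex = value.upper()
    if in_indi:
        counts[sex or "missing"] += 1
    return counts
```

Calling `tally_sex(open("myfile.ged", encoding="utf-8"))` on a file (the file name here is just a placeholder) returns a `Counter` keyed by M, F, U, any other value found, and `missing`.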

image

Well, that’s not that many. If this is representative of the data out there, then 99.6% of individuals are coded as male or female.

Just for fun, I thought I’d see how well my Rule 2 would work on the 2061 missing and unknown individuals. So here’s the count of the SEX tags of the people that are pointed to by the HUSB tag:

image

And here’s the count of the SEX tags of the people pointed to by the WIFE tag:

image

There are 111 females pointed to by the HUSB tag, and 105 males pointed to by the WIFE tag. These could be same sex couples, or could simply be a mistake on the sex designation.

There are 262 unknown or missing individuals pointed to by HUSB tags, and 285 pointed to by WIFE tags. These 547 people out of the 2061 could, with little effort, be designated with reasonable certainty as male and female respectively.
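The HUSB/WIFE guess described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `guess_sex_from_roles` and the input shapes (a dict of known SEX values by individual id, and a list of (HUSB id, WIFE id) pairs) are hypothetical, and skipping anyone who shows up in both roles is my own cautious choice rather than part of the rule.

```python
def guess_sex_from_roles(sex_by_id, families):
    """Guess M/F for people without a usable SEX value from HUSB/WIFE pointers.

    sex_by_id: dict mapping an individual id to "M", "F", "U" or None.
    families:  iterable of (husb_id, wife_id); either id may be None.
    """
    # Collect the sex implied by every role each person appears in.
    roles = {}
    for husb, wife in families:
        for person, implied in ((husb, "M"), (wife, "F")):
            if person is not None:
                roles.setdefault(person, set()).add(implied)

    guesses = {}
    for person, implied in roles.items():
        if sex_by_id.get(person) in ("M", "F"):
            continue  # already coded, nothing to guess
        if len(implied) == 1:
            guesses[person] = next(iter(implied))
        # Seen as both HUSB and WIFE: too ambiguous, leave it alone.
    return guesses
```

A person coded F but pointed to by HUSB (or M by WIFE) is deliberately left untouched here, since that may be a same-sex couple rather than an error.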

Tamura mentioned to me a couple of other methods that could be used to reasonably guess at the sex if it is unknown:

1. The first and middle names of the person (if known) may give a clue. But this would require building up a database of male names, female names and male-or-female names, and would have to span languages and cultures. That’s quite a bit more work than I would want to do for this task. Besides, it’s quite likely that the people whose sex is unknown are that way either because they have a male-or-female name, or because they are a child of unknown first name and unknown sex.

2. The surname of the children (if known) often will match the surname of the male parent and not match the maiden name of the female parent. This passing down of names is not true for all cultures, but it would be a relatively easy test to perform. Once again, it is likely that if the children’s surnames and the parents’ surnames were all known, then the sex of the parents would have been known.
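To show how simple the surname test in the second method would be, here is a rough Python sketch. The function name `likely_father` and the input shapes are hypothetical, and the test only makes sense where children take the father's surname, which, as noted, is not true for all cultures. It returns a guess only when exactly one parent's surname matches every known child surname.

```python
def likely_father(parent_surnames, child_surnames):
    """Return the id of the one parent whose surname matches the children's.

    parent_surnames: dict mapping a parent id to a surname (or None).
    child_surnames:  list of the children's surnames; blanks are ignored.
    Returns None when there is no clear, single match.
    """
    # Compare case-insensitively and ignore children with no surname.
    known = [s.strip().casefold() for s in child_surnames if s and s.strip()]
    if not known:
        return None
    matches = [
        pid
        for pid, surname in parent_surnames.items()
        if surname and all(surname.strip().casefold() == c for c in known)
    ]
    # Both parents matching (or neither) tells us nothing.
    return matches[0] if len(matches) == 1 else None
```

A real test would also have to cope with spelling variants and surname prefixes, which this sketch ignores.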

Overall, there is not much to be gained by any attempt to guess at the unknown or missing sexes. That one FTW file (above) would have been the only one of the 12 that would have benefited from this work, and still, it would have only coded an extra 519 of the 33,790 people in the file. That’s hardly worth the effort.

Instead, what I’ve done is add a warning to the log file for any person whose sex is unknown or missing. Looking at the 850 warnings from the FTW file, it becomes apparent that 90% of them are obvious just from the name of the person, so really, nothing else need be done other than giving the information through these warnings. Rather than guessing, let the user fix the data.
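As a rough idea of what such a warning pass might look like, here is a short Python sketch. It is not Behold's code: the function name `sex_warnings`, the input tuples and the message wording are all made up for illustration.

```python
def sex_warnings(people):
    """Build one warning line per person whose SEX is not M or F.

    people: iterable of (xref, name, sex) tuples; sex may be None or "".
    The message format is hypothetical.
    """
    warnings = []
    for xref, name, sex in people:
        value = (sex or "").strip().upper()
        if value in ("M", "F"):
            continue  # coded normally, no warning needed
        reason = value if value else "missing"
        shown_name = name if name else "(no name)"
        warnings.append(f"Warning: {xref} {shown_name}: SEX is {reason}")
    return warnings
```

Including the person's name in each line is the important part, since most of these cases are obvious to the user from the name alone.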

Theoretically, the techniques of using HUSB, WIFE, first name and children’s surname could be used to check the assigned sex, and provide a warning if it looks like the sex may be incorrect. But this too may be a bit of overkill, so I’ll leave this up to some other programmer to implement in some utility program.

Sorry about this diversion. I agree with Tamura. It looks like there’s no need to go any further than simply use the SEX value as defined.

Now back to our regularly scheduled programming. I am working on this now
