Louis Kessler’s Behold Blog

How to Program Dates for Genealogy - Mon, 1 Feb 2016

Dates in genealogy are messy. Four years ago, I wrote a series of posts on some of the aspects of dates as defined by GEDCOM and some of the bad dates you see in GEDCOM files in the wild:

Sort of a Date
How About a Date?
Out on a Bad Date

In a recent post on the RootsDev Google Group, Gary Stanley asked about Incomplete Dates: “Does anyone have any recommendations for working with incomplete dates and storing them within a database such as MySQL?”

For the record, I thought I’d lay out the method I decided on. It is based on GEDCOM’s date definition which, as I’ve said in the above articles, I think is quite a good specification.

The good thing about the GEDCOM spec is that it is readable. The bad thing is that it does not sort, so an alternative internal format is needed to sort. So here is what I came up with and use internally in Behold. It is also how I’ll export dates in the database that Behold will produce.

The structure is a 12 or 24 character string (24 if it’s a date range) that can be sorted just by letting the computer naturally sort the strings.

Behold’s internal date format is:
CBYYYYMDD*AA [CBYYYYMDD*AA]

where:
C (Calendar): ‘/’ = Gregorian, ‘A’ = Julian, ‘F’ = ‘French’, ‘H’ = ‘Hebrew’
B (B.C.): ‘1’ = B.C., ‘2’ = A.D.
         If B.C., then the YYYY is set to 9999 - YYYY so it sorts correctly
YYYY (Year)
M (Month): ‘1’ .. ‘C’ for JAN to DEC
                 ‘D’ .. ‘O’ for VEND to COMP
                 ‘P’ .. ‘Z’ for TSH to ELL
DD (Day)
* (Date Modifier): 1 = BEF, 2 = TO, 3 = (none), 4 = ABT, 5 = CAL, 6 = EST,
                         7 = INT, 8 = AND, 9 = FROM, A = BET, B = AFT
AA (Alternate year): e.g. the "93" in 1592/93
CBYYYYMDD*AA = the second date in a date range (only if needed)

e.g. 4 Nov 1900 = ‘/21900B043  ’
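
As a rough illustration, the key for a single date could be assembled like this. The sketch is in Python rather than the Delphi that Behold is written in, and it is not Behold's actual code. The names encode_date, CALENDARS, MONTH_CODES and MODIFIERS are made up for the example. It reads the month table literally ('1' to '9', then 'A' = OCT, 'B' = NOV, 'C' = DEC), assumes EST takes the otherwise unused modifier code 6, and only handles the JAN to DEC month names.

```python
# Hypothetical sketch of the 12-character key CBYYYYMDD*AA described above.
CALENDARS = {"GREGORIAN": "/", "JULIAN": "A", "FRENCH": "F", "HEBREW": "H"}
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
# '1'..'9' then 'A'..'C'; French and Hebrew month codes are left out of this sketch.
MONTH_CODES = {name: "123456789ABC"[i] for i, name in enumerate(MONTHS)}
# EST = '6' is an assumption that fills the gap between CAL (5) and INT (7).
MODIFIERS = {"BEF": "1", "TO": "2", "": "3", "ABT": "4", "CAL": "5", "EST": "6",
             "INT": "7", "AND": "8", "FROM": "9", "BET": "A", "AFT": "B"}


def encode_date(year, month="", day=0, modifier="", calendar="GREGORIAN",
                bc=False, alt_year=""):
    """Return the 12-character sortable key for one date."""
    yyyy = 9999 - year if bc else year  # B.C. years are complemented so they sort
    return "{}{}{:04d}{}{:02d}{}{:<2}".format(
        CALENDARS[calendar],
        "1" if bc else "2",
        yyyy,
        MONTH_CODES[month] if month else "0",  # '0' when the month is not known
        day,                                   # 00 when the day is not known
        MODIFIERS[modifier],
        alt_year,                              # padded with spaces when absent
    )


print(repr(encode_date(1900, "NOV", 4)))  # '/21900B043  '
```

Because every field has a fixed width, plain string comparison orders the keys, and the 9999 - YYYY complement makes 100 B.C. sort before 44 B.C.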

If column 1 is a ‘(’, then this is a GEDCOM date phrase stored as text between parentheses, e.g.: (4 days old). Anything that does not fit the standard format automatically becomes a date phrase.

With regard to “incomplete dates”, GEDCOM allows year only, or month and year only. It does not allow month without year or day without month and year. My representation follows this idea and allows ‘/21900A003’ (no day) and ‘/219000003’ (no month and day). Other types of incomplete dates will become a date phrase, e.g. ‘(14 Nov)’.
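
The fallback to a date phrase can be sketched as follows. This is a simplified Python illustration, not Behold's parser: the function name to_key_or_phrase and the regular expression are assumptions, and it only covers Gregorian A.D. dates with no modifier and no range checking of the day.

```python
import re

MONTH_CODES = dict(zip(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], "123456789ABC"))

# [[day ]month ]year: a day is only accepted together with a month,
# and a month is only accepted together with a year.
PATTERN = re.compile(r"^(?:(?:(\d{1,2}) )?([A-Z]{3}) )?(\d{1,4})$")


def to_key_or_phrase(text):
    """Return a 12-character key, or the text as a '(...)' date phrase."""
    text = text.strip()
    match = PATTERN.match(text.upper())
    if not match or (match.group(2) and match.group(2) not in MONTH_CODES):
        return "(" + text + ")"  # anything that does not fit becomes a phrase
    day, month, year = match.groups()
    return "/2{:04d}{}{:02d}3  ".format(
        int(year), MONTH_CODES[month] if month else "0", int(day or 0))


for sample in ["4 Nov 1900", "Nov 1900", "1900", "14 Nov", "4 days old"]:
    print(repr(to_key_or_phrase(sample)))
```

Year only and month plus year produce keys with zeros in the missing positions, while "14 Nov" and "4 days old" come back wrapped in parentheses.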

I developed this structure before I discovered that RootsMagic uses something similar. RM uses several different formats of dates, but I think it’s worthwhile comparing the RootsMagic text representation of a date as described at http://sqlitetoolsforrootsmagic.wikispaces.com/Date+Formats

Their structure is always a 24-character string, as follows:

RootsMagic Text Dates:
C*BYYYYMMDDA%BYYYYMMDDA%

where:
C (Calendar): D = Standard date, Q = Quaker date, T = Text date 
* (Date Modifier): - = NONE, A = After, B=Bef, F=From, I=Since,
                         O = Or, R = Bet/And, S= From/To, T=To,
                         U = Until, Y = By
B (B.C.): ‘-’ = B.C., ‘+’ = A.D.
YYYY (year)  - can be 0000 if for partial date with no year e.g. Jan 1
MM (month) – can be 00 for partial date
DD (day) – can be 00 for partial date
A (Alternate year): ‘/’ = Double Date, ‘.’ = Otherwise
% (Surety): ? = Maybe, 1 = Prhps, 2 = Appar, 3 = Lkly, 4 = Poss,
        5 = Prob, 6 = Cert, A = Abt, C = Ca, E = Est, L = Calc, S = Say, . = other
BYYYYMMDDA% = the second date in a date range (always included)

e.g. 4 Nov 1900 = ‘D.+19001104..+00000000..’
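
To make the fixed positions concrete, here is a small Python sketch that splits such a string into its fields. It relies only on the layout listed above. The names RMDatePart, parse_part and parse_rm_text_date are invented for the example, and only standard dates (first character 'D') are handled.

```python
from collections import namedtuple

RMDatePart = namedtuple("RMDatePart", "era year month day double_date surety")


def parse_part(part):
    """Split one 11-character BYYYYMMDDA% block into its fields."""
    return RMDatePart(
        era=part[0],                    # '+' or '-'
        year=int(part[1:5]),            # 0 when the year is not known
        month=int(part[5:7]),           # 0 when the month is not known
        day=int(part[7:9]),             # 0 when the day is not known
        double_date=(part[9] == "/"),
        surety=part[10],
    )


def parse_rm_text_date(text):
    """Decode a 24-character standard date laid out as C*BYYYYMMDDA%BYYYYMMDDA%."""
    if len(text) != 24 or text[0] != "D":
        raise ValueError("this sketch only handles 24-character standard dates")
    return {
        "calendar": text[0],
        "modifier": text[1],
        "first": parse_part(text[2:13]),
        "second": parse_part(text[13:24]),
    }


parsed = parse_rm_text_date("D.+19001104..+00000000..")
print(parsed["first"].year, parsed["first"].month, parsed["first"].day)  # 1900 11 4
```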

It is interesting how similar, yet different, my system is from RootsMagic’s. I’ve never seen a Quaker date. RM doesn’t seem to support the Julian, French and Hebrew dates that are in GEDCOM. Their modifier at the beginning of the string will prevent their dates from sorting properly. They include a surety which isn’t part of the GEDCOM date so I don’t include it. And they are always 24 characters, where my format is usually 12 but 24 if it is a date range.
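
The sorting point can be seen with two strings built by hand from the layout above. This is a Python sketch under that layout only; the two sample strings are constructed for the illustration, not taken from a RootsMagic database.

```python
# "Bef 1800" and a plain "1900", written out in the 24-character layout.
bef_1800 = "DB+18000000..+00000000.."
plain_1900 = "D.+19000000..+00000000.."

# A plain string sort compares the modifier in column 2 before it reaches the year.
print(sorted([bef_1800, plain_1900]))
# ['D.+19000000..+00000000..', 'DB+18000000..+00000000..']

# Skipping the first two characters lets the year decide the order.
print(sorted([bef_1800, plain_1900], key=lambda s: s[2:]))
# ['DB+18000000..+00000000..', 'D.+19000000..+00000000..']
```

In the first result "before 1800" lands after 1900, because '.' sorts ahead of 'B'. In the Behold layout the modifier comes after the year, month and day, so it only breaks ties.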

My one takeaway from this is that maybe I could save a character for my double dates by using a single character code instead of the two digits.

Regarding the “Sort Dates” that many programs (RootsMagic included) make you enter for events where you don’t know the date but still want them sorted in a particular order: I think they are superfluous.

My feeling is that if you know enough about the date to be able to order it, then put in what you know using the date modifiers, e.g. if Mary was born in 1832 and John was born after Mary, then I’d say you should put the date for John’s birth in as “AFT 1832” and John will (at least in Behold) be sorted properly after Mary. And make sure you add an appropriate comment and source onto the birth event to state how you know this.
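
Here is how that works out with the 12-character keys, as a small Python check. The three keys are written by hand from the format described earlier (year only, so the month and day positions are zeros); the variable names are just for the example.

```python
# CBYYYYMDD*AA with only the modifier column differing:
bef_1832 = "/218320001  "    # BEF 1832  (modifier 1)
mary_1832 = "/218320003  "   # 1832      (modifier 3 = none)
john_aft = "/21832000B  "    # AFT 1832  (modifier B)

print(sorted([john_aft, mary_1832, bef_1832]))
# ['/218320001  ', '/218320003  ', '/21832000B  ']
```

The modifier sits after the year, month and day, so BEF, the plain date and AFT for the same year fall in that order without any separate sort date.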

My 1000’th Blog Post - Sun, 17 Jan 2016

Well, it took a little longer for me than the other prolific genealogical bloggers who I follow, but I’ve finally reached this milestone with this post.

So a time to look back at the 1000 posts.

My first post was on November 7, 2002. It was a short one just to get started. That was over 13 years ago and genealogy blogs were quite rare. Dick Eastman started his newsletter 20 years ago, in 1996. I opened my websites in January 1997 (so my websites’ 20th is coming up next year).

Back then, I didn’t realize how important it was to use a proper title for a blog post, so I used the date. It wasn’t until Oct 15, 2007 that I started using titles.

Prior to 2008, my Behold site was a subdirectory of my www.lkessler.com site. My blog was done manually with manually edited pages. In 2008, I opened up the www.beholdgenealogy.com website and built and customized the WordPress blog that you see now. I transferred all my earlier blog posts over to this blog. I integrated it with my site and with the bbPress-based forum for Behold with single login and combined search tools. Both the blog and forum have RSS feeds for easy following.

Of my 1000 Blog posts:

  • 750 were about Behold
  • 54 were about GenSoftReviews
  • 59 mentioned Winnipeg
  • 16 mentioned chess
  • 132 mentioned Delphi, the programming language I use for Behold

The posts that I am most proud of are the ones where I’ve reported on the various genealogical conferences that I have attended. I tried to put together interesting summaries with pictures from a genealogy software developer’s perspective. These included:

and I will be blogging about the upcoming 10th Unlock the Past Genealogy Cruise that will run from Feb 14 to March 3.

I have had Google Analytics on my Behold website since I created it in 2008. Over that time, the entire site has had 370,566 page views. My blog accounts for over half of those at 213,712 page views, with my blog home page getting 72,820 page views. Over the past year, those stats are 74,057 (203 per day) for my site, 45,194 (124 per day) for my blog and 9,142 (25 per day) for my blog home page.

My number one blog post by a wide margin according to Google Analytics had nothing to do with genealogy: It was My Torn Achilles Tendon After Nine Weeks from Oct 18, 2011 with 31,180 page views. (Just to let you know, I was playing squash again all out after 6 months, and 2 years later my leg was fully 100%.)

Numbers 2 through 10 were:

2. Family Group Sheets – Why and Wherefore? - Mar 23, 2012, 5,241 views.

3. Setting up a Solid State Drive with Windows 8.1 - May 17, 2014, 4,124 views.

4. Can Genealogy Software be Rated Fairly? - Jan 6, 2013, 3,283 views.

5. Back to Forum Five - Feb 12, 2008, 2,622 views.

6. How Good are GenSoftReview Ratings? – Jan 4, 2014, 2,508 views.

7. What Ancestry’s “Retirement” of FTM Really Means – Dec 11, 2015, 2,418 views.

8. Standardizing Sources and Citation Templates – Aug 27, 2014, 2,020 views.

9. Source Based Document Organization – Jun 9, 2013, 1,641 views.

10. The Future of Genealogy – 6 Predictions – Apr 7, 2015, 1,627 views.

This has been fun, and I intend to continue to blog about what I’m doing, advancements in genealogy, and where I’m going with Behold. Please stay with me and follow along as I journey towards my second thousand posts.

SEX is Mostly Known - Sat, 16 Jan 2016

My last post took a look at the SEX tag in GEDCOM and came up with a way to guess at it when it is not specified.

Tamura Jones challenged me and thought that records that do not state the sex are rare. He asked how useful it is to try to figure it out for other cases, and suggested that I should stop after my rule number 1.

That made sense to me. If the SEX tag is mostly defined as M or F, then there really is no need to go to great lengths to figure out all the missing cases.

So I took a number of GEDCOM files from my collection that I use for testing and picked files as different from each other as possible, created by different programs. I totalled up the number of SEX tags with the values M for Male, F for Female and U for Unknown, and the number of individuals who did not have a SEX tag. Out of over 450,000 individuals in these files, only 95 used the U value, and 1966 did not have a SEX tag.

[Image: table of SEX tag counts (M, F, U, missing) for the test files]

Well that’s not that many. If this is representative of the data out there, then 99.6% of individuals are coded as male or female.
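
A tally like this takes only a few lines. The following is a minimal Python sketch, not the code used for the counts above: the function name count_sex_values and the sample lines are made up, and it assumes well-formed level numbers with one GEDCOM line per list item.

```python
from collections import Counter


def count_sex_values(lines):
    """Tally the SEX value of each INDI record; no SEX line counts as 'missing'."""
    counts = Counter()
    in_indi = False
    sex = None
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        if parts[0] == "0":                      # a new top-level record starts
            if in_indi:
                counts[sex or "missing"] += 1    # close off the previous person
            in_indi = len(parts) == 3 and parts[2].strip() == "INDI"
            sex = None
        elif in_indi and parts[0] == "1" and parts[1] == "SEX" and sex is None:
            sex = parts[2].strip().upper() if len(parts) == 3 else None
    if in_indi:
        counts[sex or "missing"] += 1
    return counts


sample = [
    "0 HEAD",
    "0 @I1@ INDI", "1 NAME John /Smith/", "1 SEX M",
    "0 @I2@ INDI", "1 NAME Mary /Jones/", "1 SEX F",
    "0 @I3@ INDI", "1 NAME Pat /Smith/", "1 SEX U",
    "0 @I4@ INDI", "1 NAME Baby /Smith/",
    "0 @F1@ FAM", "1 HUSB @I1@", "1 WIFE @I2@",
    "0 TRLR",
]
counts = count_sex_values(sample)
coded = counts["M"] + counts["F"]
print(dict(counts), "{:.1%} coded as M or F".format(coded / sum(counts.values())))
```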

Just for fun, I thought I’d see how well my Rule 2 would work on the 2061 missing and unknowns. So here’s the count of the SEX tags of the people that are pointed to by the HUSB tag:

[Image: SEX tag counts for the people pointed to by the HUSB tag]

And here’s the count of the SEX tags of the people pointed to by the WIFE tag:

[Image: SEX tag counts for the people pointed to by the WIFE tag]

There are 111 females pointed to by the HUSB tag, and 105 males pointed to by the WIFE tag. These could be same sex couples, or could simply be a mistake on the sex designation.

There are 262 unknowns or missings pointed to by HUSB tags, and 285 pointed to by WIFE tags. These 547 people out of the 2061 could with little effort be designated with reasonable certainty as male and female respectively.
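
The HUSB/WIFE inference could look something like this. It is a Python sketch under simplifying assumptions, not Behold's code: the function name guess_sex_from_family_roles is invented, only level 0 and level 1 lines are examined, and anyone who appears as both HUSB and WIFE is left alone.

```python
def guess_sex_from_family_roles(lines):
    """Return {xref: 'M' or 'F'} for people without an M/F value who are a HUSB or WIFE."""
    recorded = {}   # xref -> SEX value as recorded ('' if none)
    roles = {}      # xref -> set of roles seen in FAM records
    current = None  # (record type, xref) of the level-0 record being read
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, second = parts[0], parts[1]
        rest = parts[2].strip() if len(parts) == 3 else ""
        if level == "0":
            current = (rest, second) if rest in ("INDI", "FAM") else None
            if rest == "INDI":
                recorded.setdefault(second, "")
        elif level == "1" and current:
            if current[0] == "INDI" and second == "SEX":
                recorded[current[1]] = rest.upper()
            elif current[0] == "FAM" and second in ("HUSB", "WIFE"):
                roles.setdefault(rest, set()).add(second)
    guesses = {}
    for xref, seen in roles.items():
        if recorded.get(xref, "") not in ("M", "F") and len(seen) == 1:
            guesses[xref] = "M" if "HUSB" in seen else "F"
    return guesses


sample = [
    "0 @I1@ INDI", "1 NAME Chris /Lee/",
    "0 @I2@ INDI", "1 NAME Pat /Lee/", "1 SEX U",
    "0 @I3@ INDI", "1 NAME Anne /Roy/", "1 SEX F",
    "0 @F1@ FAM", "1 HUSB @I1@", "1 WIFE @I2@",
    "0 @F2@ FAM", "1 HUSB @I3@",
]
print(guess_sex_from_family_roles(sample))  # {'@I1@': 'M', '@I2@': 'F'}
```

Note that @I3@ is recorded as F but pointed to by HUSB; the sketch leaves recorded values untouched, since that case may be a same-sex couple or a data entry mistake.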

Tamura mentioned to me a couple of other methods that could be used to reasonably guess at the sex if it is unknown:

1. The first and middle names of the person (if known) may give a clue. But this would require building up a database of male names, female names and male-or-female names, and would have to span languages and cultures. That’s quite a bit more work than I would want to do for this task. Besides, it’s quite likely that the people whose sex is unknown are so either because they have a male-or-female name, or because they are a child of unknown first name and unknown sex.

2. The surname of the children (if known) often will match the surname of the male parent and not match the maiden name of the female parent. This passing down of names is not true for all cultures, but it would be a relatively easy test to perform. Once again, it is likely that if the children’s surnames and the parents’ surnames were all known, then the sex of the parents would have been known.

Overall, there is not much to be gained by any attempt to guess at the unknown or missing sexes. That one FTW file (above) would have been the only one of the 12 that would have benefited from this work, and still, it would have only coded an extra 519 of the 33,790 people in the file. That’s hardly worth the effort.

Instead what I’ve done is add a warning to the log file for any person whose sex is unknown or missing. Looking at the 850 warnings from the FTW file, it becomes apparent that 90% of them are obvious just from the name of the person, so really, nothing else need be done other than giving the information through these warnings. Rather than guessing, let the user fix the data.
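
A warning pass of this kind might be sketched as below. This is Python for illustration only: the function name sex_warnings and the wording of the messages are assumptions, not what Behold writes to its log.

```python
def sex_warnings(lines):
    """Return one warning string per INDI record whose SEX is not M or F."""
    people = []     # one dict per INDI record, in file order
    current = None
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        value = parts[2].strip() if len(parts) == 3 else ""
        if parts[0] == "0":
            current = None
            if value == "INDI":
                current = {"xref": parts[1], "name": "", "sex": ""}
                people.append(current)
        elif parts[0] == "1" and current is not None:
            if parts[1] == "NAME" and not current["name"]:
                current["name"] = value.replace("/", "")  # drop surname slashes
            elif parts[1] == "SEX":
                current["sex"] = value.upper()
    return [
        "Warning: sex is {} for {} ({})".format(
            "unknown" if p["sex"] else "missing", p["xref"], p["name"] or "unnamed")
        for p in people if p["sex"] not in ("M", "F")
    ]


sample = [
    "0 @I1@ INDI", "1 NAME John /Smith/", "1 SEX M",
    "0 @I2@ INDI", "1 NAME Pat /Smith/", "1 SEX U",
    "0 @I3@ INDI", "1 NAME Lee /Wong/",
]
for line in sex_warnings(sample):
    print(line)
```

Putting the name in each message is what lets the user settle most cases at a glance.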

Theoretically, the techniques of using HUSB, WIFE, first name and children’s surname could be used to check the assigned sex, and provide a warning if it looks like the sex may be incorrect. But this too may be a bit of overkill, so I’ll leave this up to some other programmer to implement in some utility program.

Sorry about this diversion. I agree with Tamura. It looks like there’s no need to go any further than simply use the SEX value as defined.

Now back to our regularly scheduled programming. I am working on this now