Date validation - case sensitive - Categorized in: Report a Problem

13 posts. Started 13 Apr 2012 by brett. Latest reply 29 Nov 2014 by lkessler.
1. Brett (brett)
Australia flag
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Fri, 13 Apr 2012 Permalink

Louis
What do you make of Tamura Jones's comment (http://www.tamurajones.net/SiblingTortureTest.xhtml):

None of these produced any errors or warnings, except Behold 1.04.
Behold warned that the date 13 Apr 2012 is non-standard, and should be 13 APR 2012; the warning is that the abbreviation should be in ALL-CAPITALS.
That is what the specification seems to say, but it does not;
Chapter 2 of the GEDCOM 5.5.1 specification clearly states that "All controlled line_value choices should be considered as case insensitive",
and that values should be converted to all uppercase or all lowercase prior to comparing.
That means that Apr is fine, and that means that you may even write aPR or aPr

It seems to me that all lower case or all upper case is correct, but not mixed case as Tamura suggests.

Brett

2. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Fri, 13 Apr 2012 Permalink

Brett,

See my blog post: How To Get A Developer To Fix A Bug.

No, Tamura's correct. The statement: "values should be converted to all uppercase or all lowercase prior to comparing" means that aPR and aPr should both be changed to APR (if uppercase is used for comparison) or to apr (if lowercase is used for comparison). Either way, aPR and aPr are equivalent to apr, APR and Apr.

Louis

3. Brett (brett)
Australia flag
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Fri, 13 Apr 2012 Permalink

So what does:

values should be converted to all uppercase or all lowercase prior to comparing.

actually mean?

Is this when:

1. comparing two dates within a program, such as to work out age or

2. two supposedly identical GEDCOMs are compared for differences?

If 1 above, how does a user know it is being done correctly by the program?

If 2 above, how do we change a GEDCOM to the same case in both files without a large (and possibly manual) conversion?

Brett

4. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Fri, 13 Apr 2012 Permalink

I think it simply means comparing for the purpose of interpreting its value.

For a DATE value, I don't just compare the month-part to JAN, FEB, MAR,..., but I compare the uppercased value of the month-part to JAN, FEB, MAR,...

For a TYPE value, I don't just compare the value to STILLBORN, but I compare the uppercased value of the value to STILLBORN.
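
A minimal sketch of the comparison described above, in Python. The function names are hypothetical and this is not Behold's actual code; it only illustrates upper-casing a value before comparing it to the values the specification lists.

```python
from typing import Optional

# GEDCOM month abbreviations for the Gregorian calendar, in the
# upper-case form in which the specification lists them.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def month_number(month_part: str) -> Optional[int]:
    """Return 1-12 for a month abbreviation in any case, else None."""
    key = month_part.upper()  # normalise case before comparing
    if key in MONTHS:
        return MONTHS.index(key) + 1
    return None


def is_stillborn(type_value: str) -> bool:
    """Compare the upper-cased value, not the raw value, to STILLBORN."""
    return type_value.upper() == "STILLBORN"
```

With this, Apr, aPr and APR all map to month 4, while a value such as April is not recognised at all.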

Louis

5. Brett (brett)
Australia flag
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Fri, 13 Apr 2012 Permalink

I assume this also applies to BET, ABT, etc., in that they can be written as Bet, bet, etc., but are compared in upper or lower case.

Brett

6. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sat, 14 Apr 2012 Permalink

Yes. All parts of the date. And that actually simplifies the work that Behold is doing.

Personally, I think it is a great idea that the GEDCOM designers had. I should have discovered it earlier, but now that I have, I'll make use of it.

Louis

7. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Mon, 30 Jun 2014 Permalink

Subsequent thinking on this makes me now believe that GEDCOM intended that only LINE_VALUEs that are an enumerated list of choices were to be allowed to be mixed upper and lower case.

A DATE_VALUE is a line value. But it is not made up of an enumerated list of choices. It is made up of a substructure, with some components of the substructure (such as month) being enumerated. I now don't believe that GEDCOM intended these complex structures to be allowed as mixed case; rather, they should be precisely as defined (upper case).

Whether or not this is true, at least a warning should be given, because there may be programs that will not interpret all of "JAN", "Jan", "jan" and "jAn" to be the month of January.
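
One way to sketch this "accept, but warn" behaviour in Python. The function name and the warning wording are assumptions made for illustration, not what Behold actually emits.

```python
from typing import Optional, Tuple

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def check_month(month_part: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (month number or None, warning text or None)."""
    key = month_part.upper()
    if key not in MONTHS:
        # Not a month in any case: nothing to interpret.
        return None, f"'{month_part}' is not a recognised month abbreviation"
    number = MONTHS.index(key) + 1
    if month_part != key:
        # Interpretable, but not written as the specification lists it.
        return number, f"'{month_part}' accepted; other programs may expect '{key}'"
    return number, None
```

Here JAN passes silently, while Jan, jan and jAn are all read as January but each carries a warning.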

See also: http://www.beholdgenealogy.com/blog/?p=1087

Louis

8. Brett (brett)
Australia flag
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Mon, 30 Jun 2014 Permalink

By enumerated list of choices, do you mean 'controlled' as referred to in the specification:

All controlled line_value choices should be considered as case insensitive.
This means that the values should be converted to all uppercase or all lowercase prior to comparing.
The terms UPPERCASE and UpperCase are considered equal. TAGS are always UPPERCASE.

9. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Tue, 1 Jul 2014 Permalink

Yes. However, GEDCOM does not define the difference between "controlled" and "uncontrolled" line values.

My interpretation is that controlled line values are line values that are restricted to a specified set of allowed optional values. Anything more complicated than that is likely deemed not to be controlled, since that is the logical meaning of the word "controlled".

Louis

10. arnold (arnold)
Canada flag
Joined: Mon, 24 Nov 2014
10 blog comments, 13 forum posts
Posted: Mon, 24 Nov 2014 Permalink

Just signed up but I have been mulling over this issue for a bit.

To me, the operative words in interpreting the standard (5.5.1) are "prior to comparing":

Until now, my interpretation has been - and I will still need more convincing to alter it - that the case of the actual value in the original does not matter.
IMO, the standard addresses the issue of whether data should be rejected due to differences in case. By specifying that the value from the original should be converted to either upper or lower case 'prior to comparing', it makes clear that any and all combinations are acceptable as long as the complete string matches the string specified in the standard - in a case-insensitive way. :-)
If the 'original' string were expected to have a specific case - all upper case, all lower case, or even a leading capital - that is how the standard should have expressed it.

11. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sat, 29 Nov 2014 Permalink

Arnold,

I think I've flip flopped again back to Tamura's view and your view. Despite my doubts about this, I think I'm going to have to allow any case in the line value without giving a warning message.

This is extra work for the programmer implementing GEDCOM, because the case of the line value must be retained as user data, but individual values within it that are to be compared to a set of enumerated values must be converted to one case prior to comparison.
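
A rough sketch of that extra step in Python: the original line value is stored untouched, and only words belonging to an enumerated set are converted to one case for comparison. The class and function names are hypothetical, and the sketch assumes a simple space-separated date with no date phrase in parentheses.

```python
from dataclasses import dataclass
from typing import List

MONTHS = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"}
# Date keywords listed in the GEDCOM 5.5.1 date grammar.
DATE_KEYWORDS = {"ABT", "CAL", "EST", "BEF", "AFT",
                 "BET", "AND", "FROM", "TO", "INT"}


@dataclass
class DateValue:
    original: str          # line value exactly as read, case untouched
    normalized: List[str]  # tokens, with enumerated words upper-cased


def read_date(line_value: str) -> DateValue:
    tokens = []
    for token in line_value.split():
        upper = token.upper()
        # Only words from the enumerated sets are converted to one case.
        if upper in MONTHS or upper in DATE_KEYWORDS:
            tokens.append(upper)
        else:
            tokens.append(token)
    return DateValue(original=line_value, normalized=tokens)
```

For example, read_date("Bet 13 Apr 2012 and 15 may 2012") keeps the text as typed in original, while normalized holds BET, APR, AND and MAY in upper case, ready for comparison.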

Maybe the GEDCOM developers thought this would be easier for the programmer, but it is an extra step versus requiring the case to be as specified in the documentation. And it is somewhat open to interpretation, as this thread shows.

Louis

12. arnold (arnold)
Canada flag
Joined: Mon, 24 Nov 2014
10 blog comments, 13 forum posts
Posted: Sat, 29 Nov 2014 Permalink

Yes, isn't it nice to have a 'standard' ;-)
The main reason I commented was because, before I found Behold, I had tried different validation apps to check my data and when I found nothing I thought was thorough enough - and also because the ones I did find disagreed with each other and the apps that produced the GEDCOM - I started building my very own validator.

If nothing else, it showed up the 'standard' for its many ambiguities and missed specs :-(
and it is nowhere near as thorough as I would want it to be and may never get there.

As some wiseacre said: the advantage of having standards is that we now have so many to choose from :-)

13. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sat, 29 Nov 2014 Permalink

Arnold,

Not too long after Version 1.1 is out I'll be adding Consistency checking, which should really give your data a shakedown. Following that, when I add saving to GEDCOM, I'm going to be implementing even better GEDCOM checking on input.

Despite its ambiguities, GEDCOM has been remarkably successful. Here, 15 years after the last GEDCOM standard was released, almost all genealogy software has GEDCOM input and output. It may not be perfect, but just the idea that everyone uses it says something for it.

Louis
