Today I implemented conversion of the ANSEL set to ASCII. Finally some real programming work! What a pleasure!
After figuring out how the ANSEL characters are defined, I had to map them to Extended ASCII. Many of the most common characters can be represented properly, but some cannot, and for those I picked the best representation possible. The GEDCOM file Paavo supplied me, which is in Finnish and Swedish, has 196,964 characters. Of those, 4,466 needed translation from ANSEL, and all but 30 translated properly.
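To illustrate the kind of mapping involved, here is a minimal Python sketch (not the actual implementation). It assumes "Extended ASCII" means Latin-1, and its two tables are small, illustrative subsets of ANSEL; the function name `ansel_to_latin1` is made up for this example. ANSEL puts a diacritic byte before the letter it modifies, so the sketch holds the diacritic until the base letter arrives, composes the pair, and falls back to the bare letter when the result does not fit.

```python
import unicodedata
from typing import Tuple

# Illustrative subset: ANSEL combining byte -> Unicode combining mark.
ANSEL_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # umlaut / diaeresis
    0xE9: "\u030C",  # caron (hacek)
    0xEA: "\u030A",  # ring above
}

# Illustrative subset of standalone ANSEL characters.
ANSEL_SPECIAL = {0xA2: "Ø", 0xB2: "ø", 0xA5: "Æ", 0xB5: "æ"}


def ansel_to_latin1(data: bytes) -> Tuple[str, int]:
    """Return (text, failures); every character of text fits in Latin-1."""
    out = []
    pending = ""   # diacritics seen but not yet attached to a base letter
    failures = 0
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending += ANSEL_COMBINING[byte]
            continue
        if byte < 0x80:
            base = chr(byte)
        elif byte in ANSEL_SPECIAL:
            base = ANSEL_SPECIAL[byte]
        else:
            # Byte not covered by the sketch's tables.
            out.append("?")
            failures += 1
            pending = ""
            continue
        composed = unicodedata.normalize("NFC", base + pending)
        if len(composed) == 1 and ord(composed) <= 0xFF:
            out.append(composed)
        else:
            out.append(base)   # best available representation
            failures += 1
        pending = ""
    if pending:
        failures += 1          # diacritic with no base letter after it
    return "".join(out), failures
```

With this sketch, `ansel_to_latin1(b"P\xe8a\xe8aj\xe8arvi")` gives `("Pääjärvi", 0)`, while a caron over an "s" has no Latin-1 equivalent and is counted as a failure.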
However, among the 30 I noticed some that were not translated because they were split over two lines: a CONC tag had split the character in half. (ANSEL writes an accented letter as a diacritic followed by the base letter, so a line break can fall between the two.) I realized that to fix this, I’ll have to change the way I implemented the CONC tag, and I might as well do the CONT tag at the same time. CONT continues a line with a line break. CONC (concatenate) continues one without a line break.
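One way to avoid the split-character problem is to fold CONC and CONT lines into their parent while the data is still raw bytes, and only translate from ANSEL afterwards. The Python sketch below shows that idea under simplifying assumptions: the helper names are invented for this example, each line is "level tag value" separated by single spaces, and a continuation is attached to the most recent non-continuation line without checking levels.

```python
def parse_line(raw: bytes):
    """Split a raw GEDCOM line into (level, tag, value)."""
    parts = raw.split(b" ", 2)
    level = int(parts[0])
    tag = parts[1].decode("ascii")
    value = parts[2] if len(parts) > 2 else b""
    return level, tag, value


def fold_continuations(raw_lines):
    """Merge CONC/CONT lines into the preceding record, at the byte level."""
    records = []
    for raw in raw_lines:
        level, tag, value = parse_line(raw.rstrip(b"\r\n"))
        if tag == "CONC" and records:
            records[-1][2] += value            # no line break
        elif tag == "CONT" and records:
            records[-1][2] += b"\n" + value    # with a line break
        else:
            records.append([level, tag, value])
    return records
```

Because the bytes are rejoined first, a diacritic left at the end of one line ends up next to its base letter again before any character translation happens.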
Implementing CONC and CONT won’t be difficult, except that the way CONC is defined in GEDCOM is a bit weird: a line should be divided in the middle of a word, or if at the end of a word, it should have an extra space on the next line, e.g.:
    1 NOTE This is a no
    2 CONC te divided correctly.

    1 NOTE This one is
    2 CONC  divided correctly as well.

    1 NOTE But this
    2 CONC one is wrong!
The problem is that some genealogy programs export their CONC tags in one of the first two ways, the way GEDCOM wants you to. Others use the more logical but non-standard third way. I’ll have to detect which program produced the GEDCOM file (it’s usually in the GEDCOM header) and use the appropriate method. Later, I’ll have to add a per-file option to change the way the CONC tags are handled if the default is wrong.
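As a rough Python sketch of that plan (nothing here is final): read the producing program from the `1 SOUR` line under `0 HEAD`, look it up in a list of programs known to drop the space, and let a per-file override win over the detected default. The program name `HYPOTHETICAL_APP` and all function names are placeholders I made up for the example; the real list would have to be built from actual exported files.

```python
# Placeholder: programs assumed to break CONC lines at a space and drop it.
NONSTANDARD_CONC_SOURCES = {"HYPOTHETICAL_APP"}


def detect_source(lines):
    """Return the value of '1 SOUR' inside the '0 HEAD' record, or None."""
    in_head = False
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            if in_head:
                break                  # header ended without a SOUR line
            in_head = (tag == "HEAD")
        elif in_head and level == "1" and tag == "SOUR":
            return parts[2] if len(parts) > 2 else None
    return None


def conc_adds_space(lines, override=None):
    """Decide whether CONC joins need a space; a per-file override wins."""
    if override is not None:
        return override
    return detect_source(lines) in NONSTANDARD_CONC_SOURCES


def join_conc(previous, continuation, add_space):
    """Join a CONC value onto the previous value."""
    return previous + (" " if add_space else "") + continuation
```

For a file from a program on the list, `join_conc("But this", "one is wrong!", True)` yields "But this one is wrong!", while files that follow the standard are joined with `add_space=False` and keep whatever spacing they carry.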