Louis Kessler's Behold Blog » Blog Entry

Sunday, November 20, 2005

My adventures with UTF-8 have taken a very winding path.

First I thought I’d be able to simply write out a little translator like I did for ANSEL, and I was using Czech characters as the basis. What I found was that Czech has too many characters that can't be translated for that approach to be of much use for the language.

So then I researched the old-style Code Pages, the mechanism that Windows 98 and earlier used to change character sets. (Newer operating systems now use Unicode, which most genealogy programs do not handle. The ones that do usually export GEDCOMs in UTF-8.) There were Code Pages for Greek, for Arabic script, for Turkish, Russian, Vietnamese and dozens of other languages and special uses.

What I found was that instead of translating only to standard ANSI (Code Page Windows-1252, which contains the Western script and the accents for many Western European languages like French, Spanish, Danish, German and Italian), I could provide a second translation to Windows-1250, which has all the accented characters for Eastern European languages. That could even be extended to all the other languages as well. I even found there are simple functions to first convert UTF-8 into Unicode, and then convert Unicode into the desired Code Page.
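The two-step conversion described above can be sketched in a few lines. This is only an illustration in Python, not the code used in Behold (which is a Delphi program); the function name `utf8_to_code_page` and the sample names are made up for the example. It decodes UTF-8 into Unicode, then encodes the Unicode text into whichever Code Page is requested, substituting `?` for any character that Code Page lacks.

```python
def utf8_to_code_page(data: bytes, code_page: str) -> bytes:
    """Convert UTF-8 bytes to a single-byte Code Page (hypothetical helper)."""
    # Step 1: UTF-8 bytes -> Unicode text.
    text = data.decode("utf-8")
    # Step 2: Unicode text -> target Code Page.
    # Characters missing from the Code Page become "?".
    return text.encode(code_page, errors="replace")


# Sample Czech surnames, stored as UTF-8 the way a GEDCOM export might hold them.
czech_utf8 = "Dvořák, Čapek, Němcová".encode("utf-8")

as_1250 = utf8_to_code_page(czech_utf8, "cp1250")  # Eastern European
as_1252 = utf8_to_code_page(czech_utf8, "cp1252")  # Western European

print(as_1250.decode("cp1250"))  # every character survives
print(as_1252.decode("cp1252"))  # ř, Č and ě are lost
```

With this sample, Windows-1250 keeps every character, while Windows-1252 loses three of them, which is the same problem the first translator ran into with Czech.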

All looked good until I realized that you need the correct font for a particular Code Page to display correctly. Searching around for Code Page 1250 fonts in Arial, I could only find a few free ones, and their quality was horrible. It seems that fonts are even more regulated and commercialized on the Internet than music is. If you live in Eastern Europe, you’ll get all the Code Page 1250 fonts with Windows, but people in the Western world are not licensed to use those. You have to buy your fonts from a registered vendor, also known as a font foundry. For example, the ones who designed many of the Windows fonts are the Monotype Foundation in the U.K. If you want their version of the Arial font for Code Page 1250, you will have to pay $22 for it. If you want it in bold, italic, and bold-italic as well, you will have to pay $79.50. So that wasn’t the solution.

But somehow as I researched, or just through sheer luck, I found out that there was a font characteristic called “Charset” that was settable for most fonts used in a Delphi program. Wondering how that worked, I discovered that a standard Windows font such as Arial has a large number of other characters in it, grouped into character sets, including the Eastern European character set. It seems that the Charset setting tells Windows which of these sets to use for the 129th to 256th characters. I tried this in Behold, and it worked!
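The effect of that setting can be imitated outside Windows. The sketch below is a Python analogy, not Delphi code and not what Behold does internally: it takes the same four byte values from the upper half of the table and interprets them first as Windows-1252 and then as Windows-1250, much as switching a font's Charset would. The byte values were picked for illustration.

```python
# Four byte values from the upper half (128-255) of a single-byte Code Page.
raw = bytes([0xC8, 0xE8, 0xF8, 0xE1])

# The same bytes mean different characters depending on the Code Page chosen.
western = raw.decode("cp1252")
eastern = raw.decode("cp1250")

print("Windows-1252:", western)
print("Windows-1250:", eastern)

# The lower half (plain ASCII) is identical in both Code Pages.
ascii_bytes = bytes(range(0x20, 0x7F))
same_lower_half = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp1250")
print("Lower half identical:", same_lower_half)
```

Only the interpretation of the upper 128 positions changes; the bytes stored in the file stay exactly the same, which is why the right Charset has to be chosen before the text is displayed.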

So I vastly increased my personal knowledge about character sets doing this exercise. This knowledge will help when the time comes to implement Unicode.

One last glitch, though. The characters displayed correctly in Behold, but didn’t seem to export correctly to HTML or to RTF (Rich Text Format, which word processors can read). I searched the TRichView forum and found the fix for the HTML. The RTF problem, though, looked like something I should report, in the hope of getting an answer without having to spend time learning about and experimenting with the RTF format. That means isolating the problem to its simplest form and reporting it. Hopefully it will be a simple fix, but possibly not.

So that’s about three days’ work. But progress has been made.
