Louis Kessler’s Behold Blog » Blog Entry

Markup in GEDCOM - Sat, 5 Mar 2011

In correcting one problem with Behold, I noticed that some of the NOTEs and DATA tags were not being displayed cleanly. Words were jammed together and lines did not skip to new lines where they looked like they should. Overall, Behold’s display of this data looked sloppy and unacceptable.

Looking into the GEDCOMs, I found two control characters inserted into the values: Hex 0B (a vertical tab, apparently used as a line break) and Hex 09 (a tab). Going through my 527 test GEDCOM files, I found 23 files containing these characters, generated by Legacy, RootsMagic, and PAF. It’s not much work to make this look better, and I’ll implement this improvement for the next release.
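As a rough illustration of that cleanup (Behold itself is not written in Python, and this is not its actual code), here is a minimal sketch that finds GEDCOM lines whose value contains Hex 0B or Hex 09 and makes them displayable. The function names are hypothetical, and treating Hex 0B as a line break and a tab as four spaces are assumptions.

```python
import re

# Assumption: 0x0B marks a line break inside a value and 0x09 is a plain tab.
VT = "\x0b"
TAB = "\t"

# A GEDCOM line is: level, optional @xref@, tag, optional value.
GEDCOM_LINE = re.compile(r"^(\d+) (?:(@[^@ ]+@) )?([A-Za-z0-9_]+)(?: (.*))?$")


def find_control_chars(lines):
    """Yield (line_number, tag) for each line whose value holds 0x0B or 0x09.

    `lines` must already be split on real line endings only. Do not use
    str.splitlines() to produce it, because that also splits on 0x0B.
    """
    for number, line in enumerate(lines, start=1):
        match = GEDCOM_LINE.match(line.rstrip("\r\n"))
        if not match:
            continue  # not a well-formed GEDCOM line, skip it
        value = match.group(4) or ""
        if VT in value or TAB in value:
            yield number, match.group(3)


def clean_value(value, tab_width=4):
    """Turn 0x0B into a newline and each tab into `tab_width` spaces."""
    return value.replace(VT, "\n").replace(TAB, " " * tab_width)
```

Running `find_control_chars` over a folder of test files would give the same kind of tally as the one above: which files and which tags carry these characters.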

But that made me realize that there may be other “markup” as well. Markup is a set of commands embedded in a file that describe how parts of the file should appear. Web pages work through markup called HTML (HyperText Markup Language). So I thought I should see what HTML-like markup is in the GEDCOM files I have and see if I can handle that as well.

Doing so, I’ve found HTML markup in 40 of my files, generated by 14 different programs. Most included only a few HTML tags for styling text, such as <b> for bold, <i> for italic, <u> for underline, <a href> for a hyperlink, <br> for a new line and <p> for a new paragraph. But a few included complete web pages with all the HTML from the page under a GEDCOM NOTE tag. One file I had that was created from Ancestry.com Family Trees was full of these web pages.
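A survey like this can be approximated with a regular expression that counts opening HTML tags by name. This is only a sketch under stated assumptions: the function name is hypothetical, the values are assumed to have already been extracted from the GEDCOM lines, and a regex will miss or miscount malformed HTML.

```python
import re
from collections import Counter

# Matches an opening tag such as <b>, <br> or <a href="...">.
# A "<" followed by a space (as in "a < b") is not treated as a tag.
OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")


def count_html_tags(values):
    """Count opening HTML tags, by lowercased tag name, across all values."""
    counts = Counter()
    for value in values:
        for match in OPEN_TAG.finditer(value):
            counts[match.group(1).lower()] += 1
    return counts
```

A file whose counts contain only b, i, u, a, br and p has the simple styling described above; one that also shows html, body, table or script is probably carrying whole web pages.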

A few simple tags I could probably handle relatively easily. But reproducing entire web pages needs a heavy-duty HTML viewer embedded within my TRichView component. That is possible, but it is not a quick and simple thing. I’ve decided to leave the handling of embedded HTML until I add editing. I’m not sure yet how much formatting control should be allowed. It is a tradeoff between simplicity and ability. My current thinking is that entire web pages should be links to HTML files, which will be handled similarly to how pictures will be handled: i.e. as files on your disk that Behold’s Everything Report will link to and open on a click. I may include a thumbnail preview in the Everything Report as well, but the picture handling will come later, after editing is implemented.

There is a third type of markup as well. Some programs have custom GEDCOM tags to indicate markup. I’ve only noticed two so far: _ITALIC and _PAREN. Legacy and a few other programs include them under sources to indicate how to format the source’s title. Theoretically, this would be relatively easy to implement, but I shall delay this as well, since it should be done consistently with the way Behold will ultimately handle HTML.

Markup is normally used to make a document look nice, or to highlight certain information. Genealogy programs in general (and Behold in particular) are designed to display your information and enable you to edit it easily. I feel that within your genealogy program, you need to see your data easily and clearly, and markup can get in the way of this. Maybe you want to bold or highlight your notes and sources in certain ways. This is about as far as markup should be taken in genealogy software. In fact, maybe it should be restricted to notes, because the program should know how to format your sources correctly. And even in notes, getting carried away (colors and font sizes and embedded images all the way to full HTML and JavaScript) does you no real good. Let’s get back to giving you the tools you need to document your genealogy and leave the markup for other programs (web browsers) to interpret.

14 Comments

1. Brett (brett)
Australia flag
Joined: Mon, 12 Jan 2009
36 blog comments, 59 forum posts
Posted: Sat, 5 Mar 2011  Permalink

“And even in notes, getting carried away (colors and font sizes and embedded images all the way to full HTML and Javascript) does you no real good. Let’s get back to giving you the tools you need to document your genealogy and leave the markup for other programs (web browsers) to interpret.”

While I think the use of ‘strong’ and possibly a couple of others can make Notes look better, the problem as it appears to me is the fault of the genealogy programs for allowing markup within these fields. It would be better to have sections within Notes that have a predefined style, thus leaving it to the program to display reports, fields etc. as already defined.

I would, at this stage, prefer that any program that exports to GEDCOM does NOT include any of its own markup, as the program you import it to may have no idea it is markup and display it as pure text.

2. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sat, 5 Mar 2011  Permalink

Unfortunately, Brett, many of the top programs do export markup, including RootsMagic, Legacy and PAF. And that’s what currently happens in Behold: the markup is displayed as pure text. So I will have to do something about it eventually.

3. uwe (uwe)
United States flag
Joined: Tue, 14 Oct 2008
20 blog comments, 0 forum posts
Posted: Sun, 6 Mar 2011  Permalink

Why don’t you use a fast HTML parser the first time a document is loaded, and eliminate each and every tag right from the start? Shouldn’t be too difficult to implement, and shouldn’t be slow, either. Check out SynEdit; they have very fast parsers.
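For illustration only, stripping every tag could look like the sketch below. It uses Python’s standard html.parser rather than SynEdit or anything Behold uses, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags and discards the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(value):
    """Return `value` with every HTML tag removed."""
    parser = TagStripper()
    parser.feed(value)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.parts)
```

Note that this keeps the visible text of a hyperlink but throws away its address, since the address lives in an attribute of the tag.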

4. dearmyrtle (dearmyrtle)
United States flag
Joined: Sun, 6 Mar 2011
2 blog comments, 0 forum posts
Posted: Sun, 6 Mar 2011  Permalink

Thanks for noticing this, Louis. No wonder end users like me have trouble importing a cousin’s data. Until I got into this with BetterGEDCOM, I hadn’t realized how much individual software vendors’ coding was affecting the export/import. In many ways, if they just agreed to the existing GEDCOM 5.5.1 standard, there would be a lot fewer headaches. We’d have 14-year-old GEDCOM blues, but not the nagging headaches you’ve described in this blog post.

5. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sun, 6 Mar 2011  Permalink

Uwe: I’ve looked at a few possibilities to parse the HTML. But it’s not just a matter of eliminating the markup. The 6 markup items I’ve specified above probably could and should be displayed, especially the hyperlinks, which actually contain data about the link embedded in the markup.
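To make the difference from plain stripping concrete, here is a minimal sketch that keeps the six common markups as data instead of discarding them. It is an assumption about how one might model this, not Behold’s design: the names are hypothetical, each run is a (text, styles, href) tuple, <br> becomes one newline and <p> a blank line, and any other tag is dropped while its text is kept.

```python
from html.parser import HTMLParser

STYLE_TAGS = {"b": "bold", "i": "italic", "u": "underline"}


class NoteMarkupParser(HTMLParser):
    """Turns a note into runs of (text, styles, href) for the six common tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.runs = []      # collected (text, frozenset of styles, href or None)
        self.styles = []    # styles currently switched on
        self.href = None    # address of the hyperlink currently open, if any

    def _add(self, text):
        self.runs.append((text, frozenset(self.styles), self.href))

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_TAGS:
            self.styles.append(STYLE_TAGS[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href")
        elif tag == "br":
            self._add("\n")
        elif tag == "p" and self.runs:
            self._add("\n\n")
        # Any other tag is ignored; its enclosed text still reaches handle_data.

    def handle_endtag(self, tag):
        if tag in STYLE_TAGS and STYLE_TAGS[tag] in self.styles:
            self.styles.remove(STYLE_TAGS[tag])
        elif tag == "a":
            self.href = None

    def handle_data(self, data):
        self._add(data)


def parse_note(value):
    """Return the list of runs found in `value`."""
    parser = NoteMarkupParser()
    parser.feed(value)
    parser.close()
    return parser.runs
```

A display component could then render each run with its styles, and show a run that has an href as a clickable link with the address available as text too.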

6. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sun, 6 Mar 2011  Permalink

DearMyrtle: Thanks for your first comment on my blog. We did discuss markup in BetterGEDCOM and I had concluded that it should be allowed in NOTE and TEXT fields and maybe a few others.

Unfortunately, markup was not even considered in GEDCOM. Including markup in GEDCOM NOTEs does not do anything to break the standard, so many vendors decided to include it because with it they can conveniently store any formatting they allow in their notes. So in this case, they are all still following the GEDCOM standard, but are creating GEDCOMs that some programs (which don’t read and/or interpret the markup the same way) are unable to display similarly.

Markup problems can explode. There are hundreds of tags in HTML. It shouldn’t be the genealogy programmer’s responsibility to have to include a full web browser in their application and interpret JavaScript and CSS and the whole bit. But some markup tags would be useful in a new BetterGEDCOM standard to set limits on how far programmers should go with this. A limited set would serve the purpose of highlighting what needs to be highlighted. Any more and the user should be advised to create a digital image of their evidence or a copy of the web page that they refer to and link to that from the note, rather than try to reproduce it in the note.

7. uwe (uwe)
United States flag
Joined: Tue, 14 Oct 2008
20 blog comments, 0 forum posts
Posted: Sun, 6 Mar 2011  Permalink

What about TRichView’s ability to generate its own hyperlinks? You do not need HTML or CSS to format your everything report.

Just my 2c, of course… ;)

Uwe

8. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sun, 6 Mar 2011  Permalink

Uwe: Thanks. I already use TRichView’s hyperlinks in Behold. Implementing them for this isn’t the issue. It’s parsing the information that’s the bigger problem. I want to display the link as text with hypertext, but also the description as text. What about other attributes of the <a> tag? There are always issues. After all this discussion, maybe I should just take the time to try to implement those 6 common markups and be done with it.

I made your edit to your post for you. As far as that goes, when developing this blog and the forum, I went back and forth between allowing editing in the comments and not. The real problem I had was with the editor itself and I wanted consistency between the blog (WordPress) and the forum (bbPress) which had different methods. So I took editing out completely and then forgot about it. I, of course, can edit the posts. But I understand you. I hate forums where you can’t edit at least for 5 minutes to get rid of your typos. That’s one of my biggest complaints about WikiSpaces where BetterGEDCOM is. I’ll take a quick look and see if there is a new plugin or something I might be able to install to allow some editing. … Found Ajax Edit Comments. Looks promising.

9. uwe (uwe)
United States flag
Joined: Tue, 14 Oct 2008
20 blog comments, 0 forum posts
Posted: Mon, 7 Mar 2011  Permalink

Hi Louis

The anchor tag has only one function: to mark the beginning and end of a hyperlink, and to point to a defined web address. The rest, for example inline CSS (if used at all, because it defeats the purpose of Cascading Style Sheets), is only used to format the text. What I meant is to simply read out the address of the link and convert it into a RichView link. I thought the user decides what the Everything Report looks like, and not some third-party software which writes formatting information into the GEDCOM file. Or am I missing something? HTML is a markup language to describe a web document; nothing else.

Thanks for thinking about implementing some basic editing feature here. That would really help.

Uwe

10. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Mon, 7 Mar 2011  Permalink

Uwe: The tricky part is this: Some programs allow basic formatting of the notes to make them clearer. If that is not reproduced, then you lose the clarity that was intended. So I do think this should be reproduced in Behold.

But then there’s the other extreme: Programs that allow copying and pasting from a webpage into a note. Those have all the baggage that HTML/CSS/Javascript has. To me, that needs to be kept as a file or image or something external to the note, but linked from the note possibly with a thumbnail image.

And really, the user only decides how to present what is in the Everything Report. You don’t get to decide how to format it. Eventually Behold may do more with formatting, e.g. highlighting relationships or sources or showing differences between merged data with different fonts or color, etc. But anything you want custom formatted is better done in a word processor or html document or spreadsheet, and then linked to from the notes in your data.

That 3rd party software is only manipulating the NOTEs and DATA fields. I think that is reasonable. But formatting sources is going a bit too far. And it’s regarding the formatting of sources where I disagree with Randy Seaver.

11. uwe (uwe)
United States flag
Joined: Tue, 14 Oct 2008
20 blog comments, 0 forum posts
Posted: Mon, 7 Mar 2011  Permalink

Louis, I wrongly assumed that Behold would store the Everything Report in RVF format, too. If that were the case, text formatting done from within the report/Behold would make sense. Maybe that’s an idea for later… ;)

12. tjforsythe (tjforsythe)
United States flag
Joined: Sat, 25 Feb 2012
5 blog comments, 0 forum posts
Posted: Wed, 29 Feb 2012  Permalink

Louis, It’s a little clunky, but I webify all notes and text fields by replacing all “<” symbols with “<” and then go back and replace the “<” with “<” for just the limited set of markup fields I want to support.

13. tjforsythe (tjforsythe)
United States flag
Joined: Sat, 25 Feb 2012
5 blog comments, 0 forum posts
Posted: Wed, 29 Feb 2012  Permalink

I see the previous post out-thought me. I replace all “<” with “& l t ;” - remove the space of course.
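The escape-then-restore approach described in these two comments can be sketched roughly as follows. This is not Tim’s actual code: the function name and the allowed tag list are hypothetical, and the sketch escapes “&” and “>” as well as “<”, which is an assumption beyond what the comments say. It only restores bare tags, so a hyperlink with attributes would stay escaped.

```python
import html
import re

# Hypothetical limited set of tags to allow back in after escaping.
ALLOWED = ("b", "i", "u", "br", "p")

# Matches an escaped bare tag such as &lt;b&gt; or &lt;/b&gt;.
ESCAPED_ALLOWED_TAG = re.compile(
    r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED), re.IGNORECASE
)


def webify(note):
    """Escape all markup in `note`, then restore only the allowed tags."""
    escaped = html.escape(note, quote=False)  # "<" becomes "&lt;", etc.
    return ESCAPED_ALLOWED_TAG.sub(r"<\1\2>", escaped)
```

Everything outside the allowed set, including a stray “<” in ordinary text, stays escaped and is shown literally rather than interpreted.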

14. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Wed, 29 Feb 2012  Permalink

Tim: (Sorry about the primitive markup allowed in my comments.)

Then you are not translating that back into HTML tags and then the person will see the HTML and tags and not what was intended to be rendered.

I think that is correct. Notes were never intended in GEDCOM to include markup, and I agree and don’t believe they should. Formatting is not data. Formatting embedded in data wrecks the look of displayed text within its own display if it was not intended for your framework. You are in charge of your program’s display and should be allowed to apply your own formatting.

Louis
