
Louis Kessler’s Behold Blog

Your DNA Raw Data May Have Changed - Sat, 20 Nov 2021

To my surprise, I downloaded my raw data from my 23andMe DNA test and it was different from my earlier downloads.

I would have thought your raw data from a company wouldn’t change. I took one DNA test there, so my results should be determined once, and that’s what should be represented in my raw data. I don’t care if the format of the file changed, but I do care if the data represented in that file changed.

So let’s see what might have happened here.


Three 23andMe Raw Data Downloads

I took my 23andMe test in Nov 2017.  I’ve downloaded my raw data 3 times since then.

I talked about my original download in my August 2018 article Comparing Raw Data from 5 DNA Testing Companies. My April 2020 article Determining the Accuracy of DNA Tests used my 2nd download. I noticed then that the counts from my 23andMe file differed from those in the earlier article. I said at the time “I’m not sure why there’s a difference” and assumed I must have made some mistake with the numbers. But what I see now is that the files were slightly different.

Let’s compare the counts in the 3 files I now have:

image

Above are the counts of the SNP values in groups: the autosomal homozygous (same valued) SNPs, the autosomal heterozygous (different valued) SNPs, the autosomal Insertions and Deletions, the Y and mt chromosomes which have just one value, and the no-calls (values which could not be determined with sufficient accuracy that are shown as a double dash “- -“).

Comparing the counts between the first two downloads, we see they are fairly close with the total number increasing by 10 and the largest difference only being 14 in the G group.

But you can see my most recent download had a significant change, with the total number of SNPs decreasing by almost 6500. About 4500 of those were from a reduction in the number of no-calls, but the other 2000 were because of fewer actual values. Surprisingly, the number of autosomal heterozygous SNPs went up by 19.
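
For readers who want to produce similar counts from their own download, here is a minimal Python sketch. It assumes the usual 23andMe text layout (tab-separated rsid, chromosome, position and genotype, with comment lines starting with "#"). The group names and the sample rows are invented for illustration, and this is not necessarily how the table above was produced.

```python
from collections import Counter

# Chromosomes 1 to 22 are the autosomes.
AUTOSOMES = {str(n) for n in range(1, 23)}


def classify(chromosome, genotype):
    """Return the group a single SNP value falls into."""
    if "-" in genotype:
        return "no-call"
    if "I" in genotype or "D" in genotype:
        return "indel"
    if chromosome in AUTOSOMES and len(genotype) == 2:
        return "homozygous" if genotype[0] == genotype[1] else "heterozygous"
    return "other"  # X, Y and MT values


def count_groups(lines):
    """Count SNPs per group, skipping blank and '#' comment lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _rsid, chromosome, _position, genotype = line.split("\t")
        counts[classify(chromosome, genotype)] += 1
    return counts


# Invented sample rows, not taken from any real file.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t1000\tAA",
    "rs2\t1\t2000\tAG",
    "rs3\t2\t3000\t--",
    "rs4\t2\t4000\tDD",
    "i5\tMT\t50\tC",
]
print(count_groups(sample))
```

To run it on a real file, pass an open file object instead of the sample list, e.g. `count_groups(open("my_raw_data.txt"))`, where the file name is a placeholder.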


So What’s Changed?

Since I created my “All 6” Combined Raw Data File in my May 2019 article Creating a Raw Data File from a WGS BAM file, and that was based on my Sep 2018 download from 23andMe, I’ll compare that download to the one I just did.

image

So there were 6673 SNPs in my old file that are not in my new file. 4610 of those were no-calls in my old file, so those don’t matter. And 52 were deletions or insertions that other companies don’t even report. But that still leaves 2010 SNPs that had values previously and no longer do. Of those, 1706 are autosomal SNPs that are important for DNA matching.

There were 191 new SNPs in my new file that were not in my old file. And 136 of those are autosomal SNPs that are important for DNA matching.

And there were 16 SNPs that changed values in my new file. Fortunately none of the changes were important. There were 6 no-calls changed to values and 9 values that were changed to no-calls, insertions or deletions.
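
A comparison like this one (SNPs removed, added and changed between two downloads) can be sketched in a few lines of Python. This is only an illustration under the same assumed tab-separated layout as a 23andMe file. The function names and the sample rows are hypothetical.

```python
def load_genotypes(lines):
    """Map each SNP id to its genotype, skipping blank and '#' comment lines."""
    genotypes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, _chromosome, _position, genotype = line.split("\t")
        genotypes[rsid] = genotype
    return genotypes


def compare_downloads(old, new):
    """Return the SNP ids that were removed, added, or changed in value."""
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = sorted(r for r in old.keys() & new.keys() if old[r] != new[r])
    return removed, added, changed


# Invented sample rows standing in for an old and a new download.
old_lines = ["rs1\t1\t100\tAA", "rs2\t1\t200\tAG", "rs3\t2\t300\t--", "rs4\t3\t400\tCC"]
new_lines = ["rs1\t1\t100\tAA", "rs2\t1\t200\tGG", "rs4\t3\t400\tCC", "rs5\t4\t500\tTT"]

removed, added, changed = compare_downloads(
    load_genotypes(old_lines), load_genotypes(new_lines)
)
print(removed, added, changed)
```

Keying on the SNP id alone is a simplification. A stricter comparison might also check that the chromosome and position agree.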


Did We Lose Useful Information?

The question is whether we lost useful information in those 2010 deleted SNPs that had values previously, or if we gained useful information from those 182 new SNPs.

To find out, I have to go to my April 2020 article Determining the Accuracy of DNA Tests. See the section about the Accuracy of Standard Microarray DNA Tests. If I do the same procedure and compare the values from the 4 BAM files that agree with each other to these deleted and added SNPs, I can get an approximate accuracy estimate for them.

Of the 1706 autosomal non-indel SNPs that were in the old 23andMe file, 1518 had identical values in the 4 BAM files. Of those, only 1164 match the deleted value. That’s an error rate of 23.3% which isn’t good at all. So what we lost were SNPs with a high error rate.

Of the 136 new autosomal SNPs, 129 had identical values in the 4 BAM files. Of those, 128 matched the new 23andMe value. Just one, which was AG in the 23andMe file and was AA in the BAM files didn’t match. That’s an error rate of 1 / 129 = 0.8% which is okay. So what was added was useful.

The values deleted had a high error rate. The values added had a low error rate. I don’t know the reason why 23andMe made these changes or what they did to make the changes, but the net result was a slight overall improvement to the accuracy of their raw data file.
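
The procedure described above (keep only the SNPs where all the BAM files agree, then count how many of those match the chip value) could look something like the following Python sketch. The data structures, the function names and the toy data are assumptions made for illustration. Treating AG and GA as equal is also an assumption.

```python
def normalize(genotype):
    """Treat AG and GA as the same value by sorting the two alleles."""
    return "".join(sorted(genotype))


def estimate_errors(chip, bam_calls):
    """Compare chip values to BAM values wherever all BAM files agree.

    chip is {rsid: genotype}; bam_calls is a list of {rsid: genotype} dicts.
    Returns (number of comparable SNPs, number of mismatches).
    """
    comparable = 0
    mismatches = 0
    for rsid, value in chip.items():
        if not all(rsid in bam for bam in bam_calls):
            continue  # at least one BAM file lacks this SNP
        calls = {normalize(bam[rsid]) for bam in bam_calls}
        if len(calls) != 1:
            continue  # the BAM files disagree, so there is no reference value
        comparable += 1
        if normalize(value) != calls.pop():
            mismatches += 1
    return comparable, mismatches


# Hypothetical toy data: one chip file and four BAM-derived call sets.
chip = {"rs1": "AG", "rs2": "CC", "rs3": "TT"}
bams = [
    {"rs1": "AA", "rs2": "CC", "rs3": "TT"},
    {"rs1": "AA", "rs2": "CC", "rs3": "TT"},
    {"rs1": "AA", "rs2": "CC", "rs3": "CT"},
    {"rs1": "AA", "rs2": "CC"},
]
comparable, mismatches = estimate_errors(chip, bams)
print(comparable, mismatches)

# The same ratio using the counts quoted above.
print(f"{(1518 - 1164) / 1518:.1%}")  # deleted SNPs
print(f"{1 / 129:.1%}")               # added SNPs
```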


Did the Raw Data File of Other Companies Change as Well?

Let’s see.

Family Tree DNA:  My first build 37 raw data file download was from Aug 2018. My download today is identical to it.

Ancestry DNA: My first raw data download was Mar 2018. Compared to my download today, 56 SNPs from the early file have been deleted. All the deleted SNPs had values and none were no-calls.  No SNPs were added and no SNPs were changed. So this is a change, but a minor change.

LivingDNA: My first raw data download was Aug 2018.  My download today is identical to it.

MyHeritage DNA: My first raw data download was March 2017. But I was surprised to find my download today is very different, much more different than the 23andMe data, and needs its own analysis, which follows.


MyHeritage DNA Raw Data Changes

Here is my MyHeritage DNA comparison table:

image

There are over 110,000 fewer SNPs in the new dataset from MyHeritage. Most of the reduction appears to be among the heterozygous SNPs, which halved in number.

I hadn’t heard that any change was made to MyHeritage DNA’s raw data files, so I didn’t expect this. I also see they added a few indels to their file, similar to the way 23andMe does, and have cut down the number of no-calls.

But just take a look at these changes:

image

Of the 720,816 SNPs they had in the original file, they only retained 214,353 of them, changed 3436 of them, and added 391,635 new SNPs that weren’t there before.

This is a major change! The original MyHeritage raw data I got in 2018 is nothing like the new one I now get. The SNPs they are using now are 65% different.

Hopefully they improved their accuracy. Let’s see.

When I wrote my Determining the Accuracy of DNA Tests article in Apr 2020, I  produced the following table:

image

My numbers showed that MyHeritage had the best accuracy of the 5 companies, just 1 error out of every 603 SNPs.

When I do the same comparison of my new MyHeritage DNA raw data, I get this:

My new data has 574,057 autosomal values on Chr 1 to 22 that are not no-calls. Of those,  534,179 have agreeing values in my 4 BAM results that I can compare them to.  529,707 of those match the MyHeritage value.

That means 4472 are incorrect out of 534,179 or 0.8%.  So 1 out of 119 values in my new MyHeritage raw data file is incorrect. Its error rate has increased by a factor of 5.
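
The arithmetic behind those figures can be checked with a few lines of Python, using only the counts quoted above and the earlier 1 in 603 figure. The variable names are just for illustration.

```python
comparable = 534_179   # values with agreeing BAM results
matching = 529_707     # of those, values that match MyHeritage
old_one_in = 603       # earlier result: 1 error in every 603 SNPs

errors = comparable - matching
rate = errors / comparable
one_in = comparable / errors
factor = old_one_in / one_in

print(errors)                 # number of incorrect values
print(f"{rate:.2%}")          # error rate as a percentage
print(round(one_in))          # 1 error in this many values
print(round(factor, 1))       # how many times worse than before
```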

With this change, MyHeritage went from being the most accurate of the 5 companies, to being the least accurate.

I don’t know exactly when or why MyHeritage made changes to what it puts in your raw data download file, but whatever they did (maybe their imputation and splicing) decreased the quality of its raw data considerably.

I cannot say for sure what that does to MyHeritage’s matching accuracy. That will depend on their matching algorithm and whether they are allowing for the possibility that 1 out of 100 SNPs may have an incorrect value, rather than the 1 in 600 that they had before. If they did compensate for this and lessened their requirement to, say, 1 mismatch every 50 SNPs, then you will have more false segments than you did before.

Conflicting Information in GEDCOM - Tue, 9 Nov 2021

An issue about GEDCOM has once again come to my attention.

In the GEDCOM 5.5.1 standard they write:

Conflicting event dates and places should be represented by placing them in separate event structures with appropriate source citations rather than by placing them under the same enclosing event.

I addressed this over 8 years ago in my article: Multiple Events and Unions in GEDCOM where I said this:

What this means is that if you have two conflicting sets of information for an event, such as a birth event, then there should be separate event structures for them, e.g.:

1 BIRT
2 DATE 1880
1 BIRT
2 DATE 1870

Presumably you’d have more information with each including the full dates, the places, your sources and notes about each bit of evidence. Because of the GEDCOM rule, the first of the two would be considered the preferred, i.e. most credible date.

This is all fine and good for events like Birth and Death that, other than in extremely unusual circumstances (e.g. brought back from a coma, or science fiction), normally occur only once in any person’s life.

The trouble is that almost any other event can occur multiple times in a person’s life: adoption, naturalization, census, education, retirement. There have been people who have had multiple baptisms and even multiple burials.

This results in a problem. For events other than Birth and Death, if the events are represented like the 4-line GEDCOM example above, how do you tell if they are two different events of the same type, or if they are two sets of conflicting information about the same event?

The answer is, you can’t. GEDCOM does not explain how to distinguish the difference.
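
To make the ambiguity concrete, here is a minimal Python sketch of what a reading program sees. It is a deliberately simplified line parser (level, tag, optional value), not a complete GEDCOM reader, and the function names and sample lines are hypothetical. Repeated BIRT or DEAT structures can be read as conflicting information, but for any other repeated tag the parsed data carries nothing that says which case applies.

```python
from collections import Counter

# Events that normally happen only once in a person's life.
ONCE_ONLY = {"BIRT", "DEAT"}


def parse_events(lines):
    """Collect level-1 structures together with their level-2 substructures."""
    events = []
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 1:
            events.append((tag, {}))
        elif level == 2 and events:
            events[-1][1][tag] = value
    return events


def repeated_tags(events):
    """Say how each repeated event tag can be interpreted."""
    counts = Counter(tag for tag, _details in events)
    return {
        tag: "conflicting information" if tag in ONCE_ONLY else "ambiguous"
        for tag, n in counts.items()
        if n > 1
    }


# Hypothetical sample: two BIRT structures and two BAPM structures.
sample = [
    "1 BIRT", "2 DATE 1880",
    "1 BIRT", "2 DATE 1870",
    "1 BAPM", "2 DATE 1880",
    "1 BAPM", "2 DATE 1895",
]
print(repeated_tags(parse_events(sample)))
```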


A Standard Needs to be Standardized

You would want a standard like GEDCOM to be followed by all developers. You would hope that the standard is internally consistent in how similar objects are represented.

Here we have an inconsistency, where more than one occurrence of the same event type can represent either:

  1. Two different events, or
  2. Conflicting information for the same event.

This type of inconsistency should never happen in a standard. So what is the possible solution?

Conflicting events are not multiple events. They are different versions of the same event based on different sources that give different information for the event. If we are talking about the same event, then all the information available for the event, conflicting or not, should be included in the event, for example, one idea to change GEDCOM might be like this:

1 BIRT
2 DATE 1880
3 SOUR @S1@
2 PLAC Wilmington, Delaware, USA
3 SOUR @S1@  
2 ALTDATE 1870
3 SOUR @S2@ 
2 ALTPLAC New York City, New York, USA
3 SOUR @S2@

So this adds new tags ALTDATE and ALTPLAC to indicate alternative information for the same event. Note that the source for each piece of information is indicated.

Personally, I don’t like the idea of GEDCOM adding a million new tags, and adding each item of information individually unnecessarily causes repetition of source information. So this would not be an easy thing for developers to manage.

Maybe what would be better then, would be to include groups of alternative information with each group denoted by a single new tag, maybe ALT, e.g.:

1 BIRT
2 DATE 1880
2 PLAC Wilmington, Delaware, USA
2 SOUR @S1@ 
2 ALT
3 DATE 1870
3 PLAC New York City, New York, USA
3 SOUR @S2@ 
2 ALT
3 DATE 1875
3 SOUR @S3@

And then conflicting information for non-unique events, which don’t currently have a mechanism, can be done the same way.
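
As a rough illustration of how a program might read such a structure, here is a short Python sketch. Keep in mind that ALT is only the tag proposed in this post and is not part of GEDCOM 5.5.1 or FamilySearch GEDCOM 7.0. The function name and the returned dictionary layout are assumptions as well.

```python
def parse_event_with_alts(lines):
    """Split one event into its primary details and its ALT groups."""
    event = {"tag": None, "primary": {}, "alternatives": []}
    alt = None  # the ALT group currently being filled, if any
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == 1:
            event["tag"] = tag
            alt = None
        elif level == 2 and tag == "ALT":
            alt = {}
            event["alternatives"].append(alt)
        elif level == 2:
            alt = None
            event["primary"][tag] = value
        elif level == 3 and alt is not None:
            alt[tag] = value
    return event


# The example from the post, using the hypothetical ALT tag.
lines = [
    "1 BIRT",
    "2 DATE 1880",
    "2 PLAC Wilmington, Delaware, USA",
    "2 SOUR @S1@",
    "2 ALT",
    "3 DATE 1870",
    "3 PLAC New York City, New York, USA",
    "3 SOUR @S2@",
    "2 ALT",
    "3 DATE 1875",
    "3 SOUR @S3@",
]
event = parse_event_with_alts(lines)
print(event["primary"])
print(event["alternatives"])
```

Because each ALT group is self-contained, the same reading logic would work unchanged for events that can occur more than once, such as CENS or RESI.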

I should mention that FamilySearch GEDCOM 7.0 does not address the issue of conflicting information and has the same problem as GEDCOM 5.5.1.


NO Is Wrong

While I’m at it, I should mention one of the few changes FamilySearch GEDCOM 7.0 introduced is sort of related to the conflicting information issue.

FSG 7.0 introduced a NON_EVENT_STRUCTURE indicated by the tag: “NO”, which they say:

Indicates that a specific type of event … did not happen within a given date period (or never happened if there is no DATE substructure).

with this example:

1 NO MARR
2 DATE TO 24 MAR 1880

Well gee thanks! That will break just about every genealogy developer’s code, will need to be handled for every possible event tag, and may require changes to the program’s database as well.

In this particular example, I don’t know why the date can’t be specified as:

1 MARR
2 DATE AFT 24 MAR 1880

And if you wanted to indicate that the couple didn’t marry, why not follow the model that GEDCOM 5.5.1 already had and allow it to be specified as:

1 MARR N

After all, GEDCOM already allows the following to indicate that a marriage happened but without additional information:

1 MARR Y


Two People Married More Than Once

The issue of conflicting information was brought back to my attention a few days ago by a discussion in a GEDCOMGeneral Google group. The question raised was if two people married and then separated and then married a second time, should they be included as one FAM (Family) record, or two? 

e.g. as one FAM

0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 MARR
2 DATE 1950
1 DIV
2 DATE 1960
1 MARR
2 DATE 1970

or as two FAMs:

0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 MARR
2 DATE 1950
1 DIV
2 DATE 1960

0 @F2@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 MARR
2 DATE 1970

In the first case, children from both marriages will be together. In the second one, they would be split into the two families, even though they are full siblings.

Well the issue gets more complicated. What do you do then with a child born between the divorce and the 2nd marriage?

This really should not be an issue at all. It is clear that there should be only one FAM representing all the relationships of two people and all the children they have. Multiple MARR tags should be allowed under a single FAM tag.

But, when they are, are they considered to be two marriage events, or conflicting information for one marriage event?  GEDCOM is ambiguous.


Remembering Past Articles

It’s not good enough for the people writing the standards to just think about an issue and imagine what might be best. Each issue should be studied in detail, and when it comes to GEDCOM, there has already been a lot of study and discussion of most issues. Years of BetterGEDCOM, FHISO, and independent thinking by many genealogy developers should not just be re-thought without referring back to the work that has already been done.

So let me refer you back to:

  1. My article from 2013:  Multiple Events and Unions in GEDCOM
  2. Tamura Jones’ article from 2019:  Married, Divorced, Married Again

In case you don’t want to bother reading the two articles, both conclude that because of the way GEDCOM was written, and because of the way developers implemented GEDCOM, a FAM needs to represent just one union. The MARR and DIV (or ANUL) events therefore represent the start and the end of the union, just like the BIRT and DEAT events represent the start and end of an INDI which represents one life. Multiple MARR and DIV events within one union represent conflicting information, just as multiple BIRT and DEAT events represent conflicting information within an INDI.

All other tags, e.g. CENS, RESI, OCCU, EDUC, etc., represent events that can occur multiple times, so there’s no way to represent conflicting information for those events.  This is an inconsistency in GEDCOM that should be fixed.

Until this change to conflicting information is made, the FAM must remain as one union from MARR to end of MARR.  But once multiple events no longer are used to represent conflicting information, then the FAM concept can be changed to represent the more logical concept of all the relationships between two people and the children they have together.

So the GEDCOM-standards writers really need to change how conflicting information is handled so that the FAM concept can be repaired.

My Computer History - Sun, 7 Nov 2021

Prompted by this week’s Saturday Night Genealogy Fun post by Randy Seaver, I thought I’d like to document my computer history in a blog post.

1971: As I entered high school (grade 10), my super-smart neighbor and friend who was two grades ahead of me recommended I follow his lead and get into programming at school. The high schools in Winnipeg had a Control Data Corporation (CDC) mainframe and our school had a card reader and printer that connected to it.  We learned FORTRAN and I had fun with my best friend Carl writing various programs. See: 25 Years of Delphi

1974: My friend Carl and I both wrote computer programs to play chess. In Grade 12, we had our programs play each other. This was covered in both of our city’s newspapers. Carl called it a contest between brute force and finesse. See: The Beginnings of a Chess-playing Program and BRUTE FORCE vs FINESSE.

1974: I took Statistics at the University of Manitoba and mixed a few Computer Science courses in as well. Hundreds of students would stand in line to use the keypunch machines (the older KP-26 and the newer KP-29 models) and then stand in line at the card reader and hand their deck of cards to the person whose job was to feed the cards into the card reader. We’d then walk past another person who was separating the fan fold paper coming out of the printer and then placing each of our outputs on the pickup table. If our coding had an error, it required standing in line at the keypunches, retyping the cards that needed fixing and repeating the process.

1975-1977: Fortunately, dumb terminals were becoming available at the University. These were Cathode Ray Tubes (CRTs) that simply acted as an  interface to the University’s mainframe. What that accomplished was to store the programs on the Mainframe, so no more computer cards!

My first genealogy program was a Script Document Processor utility on my University’s mainframe. It used markup similar to HTML to specify how to make everything look, and included features to create a table of contents and an index of names and an index of places.

In what remaining spare time I had at University, I also continued to work on my chess program.

I worked as a summer student for 3 years at Manitoba Hydro, our electrical utility in the province. They liked my FORTRAN knowledge and my math/stats skills and I got to work on cleaning up the code of some of their mainframe programs to help design Hydro Towers and place the Towers optimally along their route.

1977-1978: My Chess program Brute Force was accepted into the 8th and 9th North American Computer Chess Championships. The 8th took place in Seattle, Washington, and the 9th took place in Washington, D.C. We would use modems to relay the opponent’s move to our home computer and wait for our program’s response, which we would then physically make on the board for it. See: Computer Chess - A Memorial to Brute Force

1978-1980:  I completed my Masters Degree in Computer Science at the University of Manitoba.

1980-1988: I was hired full-time at Manitoba Hydro after I graduated and worked my first 8 years as a programmer and systems analyst working on various engineering projects and models. Our company had its own mainframe, and we developed engineering systems in FORTRAN, one in PL/I and one in Pascal on Apollo Computers which were UNIX-based minicomputers that were awesome!

1988:  At Manitoba Hydro, I accepted a position in the Load Forecasting Department. This was my real introduction to PCs. The company had been using 286 computers up to that time. One of my first tasks was to justify to our Division Manager the purchase of what would be the most powerful computer in the company: A Compaq 386 20 Mhz computer for $10,000, a 300 MB hard drive for it for $10,000 more, and the Operating System and Software for $5,000 more. We got the computer and I started developing our Department’s Customer Information Database on it. We used a database called PC-FOCUS developed by Information Builders which was a fantastic program.

1990: My use of PCs at work for the past few years made me want one at home. It wasn’t until about 1990 that prices came down to something reasonable and I purchased an IBM PC 286 no-name clone for about $2,500. I think it was a 12 MHz computer with 8 MB of RAM and a 20 MB hard drive.

1992-1993: Hard drive capacity was growing fast. I upgraded in 1992 to a 60 MB hard drive and in 1993 to a 260 MB hard drive.

1992-1995: I tried various genealogy programs. The one I liked best was Reunion for Windows. I used it until 1997 when Leister sold it to Sierra who were developing it to be released under the name of Generations. I became a Beta tester for Generations. Sadly, Generations was purchased by Genealogy.com, who simply dropped it, supporting their own Family Tree Maker program instead. My last entry of my genealogy data into Generations was in 1999. I never updated my genealogy data again until 2018 when I started using MyHeritage and Family Tree Builder. See: So How’s My Genealogy Going

1995: Upgraded my system board finally to a 386 and 8 MB of RAM.

1997: I had to upgrade my computer by buying 16 MB more RAM for $99  to get to 24 MB RAM and replace my 260 MB hard drive with a 2 GB hard drive for $360 so that I could upgrade from Windows 3.1 to Windows 95.  See: Computers 23 years ago

1999: Purchased a new computer with an Intel Pentium III at 600 MHz running Windows 98.

2006: Purchased an HP Media Center PC, 3 GHz, 1 GB RAM. See: Wednesday, January 11, 2006. Two days later, my old Windows 98 computer died: See: Saturday, February 4, 2006

2007: Upgraded my computer to Windows Vista: See: Sunday, June 3, 2007. Surprisingly, I never had the troubles others had with Vista. Worked fine for me.

2009: Purchased a PC with an AMD Phenom 9650 Quad-Core CPU and 7 GB RAM running 64-bit Windows Vista.

2010: This was the tech I had at the time: What I Do

2014:  My computer at the time was now five years old. See: When Is It Time To Get A New Desktop Computer. So I purchased an HP Envy 700-209 with an Intel i7-4770 Quad-Core with 12 GB RAM and a 2 TB hard drive running 64-bit Windows 8.1. It was 3 times faster. I bought and installed a 240 GB SSD (Solid State Drive). See: Setting up a Solid State Drive with Windows 8.1. I also bought two identical HP Pavilion 23tm (23 inch) monitors which I love and have been using ever since and I hope they never die.

2019:  Never tried Python before, so I had a bit of fun with this:  50 Years, Travelling Salesman, Python, 6 Hours

2020:  My HP Envy died. See: When Everything Fails At Once. I replaced it with a HP Z420 Xeon Workstation with 32 GB RAM, 512 GB SSD and a 2 TB hard drive for $990 with 64-bit Windows 10 installed on the SSD drive.

Today: I’m very happy with my current Xeon computer. However, I’m very disappointed that it does not meet the minimum system requirements for Windows 11. The CPU is not supported and it only has TPM 1.2 and not 2.0. I’ll likely wait to see if Microsoft loosens the requirements a bit to allow my machine to upgrade. If not, I’ll probably wait until the end of life of Windows 10 in 2025 and buy a new computer that already has Windows 11 installed.

Also: Today, Nov 7, 2021 is my 19th blogiversary. My first blog post was 19 years ago on Nov 7, 2002. And this post is my 1200th post!

Those of you who see me on Zoom will see this background behind me. When I’m on Zoom, I’m actually sitting at this desk with a blank wall behind me. My HP Z420 desktop is at the back left of the desk and you can see my two HP Pavilion 23 inch monitors. In front of my desktop is my Epson DS-860 scanner. Behind it is my Epson WF-4740 printer. Above my desktop on the wall is my Boomer and her Friends calendar.