Louis Kessler's Behold Blog

Raw Data Comparison: FamilyTreeDNA vs MyHeritage DNA - Tue, 28 Mar 2017

Before I leave DNA and get back to Behold for a few weeks, I had one more set of results I wanted to report on.

A couple of weeks ago, I compared my MyHeritage DNA ethnicity results to my FamilyTreeDNA results, and also compared my match results.

There was one other comparison I had wanted to do. It’s to compare the Raw Data files of the two companies. My questions are:

  1. How similar are the raw data downloads?
  2. Do the differences significantly affect match results?
  3. Do the crossover points of segment matches significantly change?

 

Downloading Your Raw DNA Data

To download your raw data from FamilyTreeDNA, go to your Dashboard and click on “Download Raw Data”

image

On the next screen, select “Build 37 Raw Data Concatenated”

At MyHeritage DNA, it is not quite as obvious. Originally, I couldn’t find it and assumed you couldn’t download your data there, until I was shown how. What you do is go to your Manage DNA kits page, click on those 3 dots on the right, and select Download.

image

 

Comparing the Raw Data Files

The two companies, FamilyTreeDNA and MyHeritage, both use the same DNA testing company: Gene by Gene, Ltd. in Houston, Texas. In fact, Gene by Gene is the parent company of FamilyTreeDNA. MyHeritage chose Gene by Gene to be their lab, and Gene by Gene accepted the offer even though you could imagine MyHeritage DNA to be a competitor to Gene by Gene’s FamilyTreeDNA. I’m sure Gene by Gene must have thought it better to get MyHeritage’s lab business than to let them go off to some other lab. Even if this was a financially-based arrangement, it’s still nice to see a little bit of cooperation here between genealogy companies, just as it is to see FamilySearch’s partnership with MyHeritage, Ancestry and FindMyPast to share resources.

Given that it is the same lab doing the test, one would naturally expect the lab results to be quite similar. I downloaded my two datasets and put them in one spreadsheet to compare them. They had exactly the same format. Here are the first few lines of the two files side by side:

image

Think of RSID as the name of a particular position on a chromosome. The Position is in base-pair (bp) units from the beginning of the chromosome and is the information that Double Match Triangulator shows in its output. The Result gives the two alleles (each A, C, G or T), one from each parent, at that location.

The data from the two companies both had 702,442 lines for chromosomes 1 through 22 with identical RSID, Chromosome and Position, and the entries of those were in the same order in each file, ordered not by RSID, but by Position. Having the first three fields matching exactly is a very good thing. They indicate that these download files of MyHeritage and FamilyTreeDNA are both using the same RSID definitions which are defined in what’s called a “Build”.  FamilyTreeDNA allows you to download Build 36 or Build 37. MyHeritage only allows the download of Build 37, so I’m comparing Build 37 here.
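That check on the first three fields can be sketched in a few lines of Python. This is only a minimal sketch: it assumes each download is comma-separated text with quoted fields in the order RSID, chromosome, position, result, and that header and comment lines can be skipped. The function names and the sample rows are made up for illustration and are not taken from either company's documentation.

```python
import csv
import io

def read_raw(text):
    """Parse raw-data text into (rsid, chromosome, position, result) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        # Skip blank lines, comment lines, and a header row if present.
        if len(fields) != 4 or fields[0].startswith("#") or fields[0].upper() == "RSID":
            continue
        rsid, chrom, pos, result = (f.strip() for f in fields)
        rows.append((rsid, chrom, int(pos), result))
    return rows

def same_layout(rows_a, rows_b):
    """True when both files list the same RSID, chromosome and position in the same order."""
    return [r[:3] for r in rows_a] == [r[:3] for r in rows_b]

# Made-up sample rows, for illustration only.
ftdna_text = (
    'RSID,CHROMOSOME,POSITION,RESULT\n'
    '"rs0000001","1","1000","AA"\n'
    '"rs0000002","1","2000","CT"\n'
)
myheritage_text = (
    '# sample comment line\n'
    'RSID,CHROMOSOME,POSITION,RESULT\n'
    '"rs0000001","1","1000","AA"\n'
    '"rs0000002","1","2000","TC"\n'
)

ftdna = read_raw(ftdna_text)
myheritage = read_raw(myheritage_text)
print(same_layout(ftdna, myheritage))
```

Comparing only the first three fields keeps this check separate from any differences in the Result column, which are looked at on their own.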

FamilyTreeDNA provides a FAQ page: How do I read my Family Finder raw data file? In that FAQ they give the following useful table for interpreting the results:

image

I’m not sure why the table only lists two of the heterozygous values. There are 4 more:  AC or CA, AT or TA, CG or GC, and GT or TG as you’ll see in the tables I created below. There were no insertion or deletion values in either of the downloads.

 

Comparing Autosomal Chromosomes 1 to 22

Comparing the Results field for those 702,442 values on chromosomes 1 to 22 gives for me the following counts:

image

578,890 (82.41%) of the entries (light green) match exactly.

FamilyTreeDNA does a nice thing and in their download shows the allele values of each pair in order alphabetically. So it only lists CT and not TC, only AG and not GA.

MyHeritage is not so nice. They show some of the pairs in the other order, with the higher alphabetical allele listed first. They do this for GC, TA, TC and TG (counts shown in dark green). And they show GC both ways, also as CG, and TA both ways, also as AT. Doing this makes me worry that there may be some third party tools that assume the order of alleles is one way or the other. If they do, they could present erroneous results from MyHeritage’s raw data. 100,898 (14.36%) of MyHeritage’s allele pairs match FamilyTreeDNA’s but are shown in the opposite order.
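A tool can protect itself from the ordering issue by putting every pair into alphabetical order before comparing. Here is a minimal Python sketch of that idea. The function name is my own, and it assumes a result is either a two-letter pair or a no-call such as "--".

```python
def normalize_pair(result):
    """Return the two alleles in alphabetical order, e.g. 'TC' -> 'CT'.

    No-calls such as '--' and anything that is not a two-letter pair are
    returned unchanged.
    """
    if len(result) == 2 and result.isalpha():
        return "".join(sorted(result.upper()))
    return result

for pair in ["TC", "CT", "GC", "TA", "TG", "AA", "--"]:
    print(pair, "->", normalize_pair(pair))
```

With both files normalized this way, an order-only difference no longer shows up as a mismatch.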

The FamilyTreeDNA table from their FAQ says that the double dash “--” represents results that were not clear. They say this happens for a small percentage of the microchips. Well, 17,661 (2.5%) of the MyHeritage results are “unclear”, and 19,850 (2.8%) of the FamilyTreeDNA results are “unclear”. Of these, both companies agree that 14,899 (2.12%) of the pairs are “unclear”. At least they agree on most of them.

So up to now, we have 82.41% + 14.36% + 2.12% = 98.89% of the allele pairs matching between the two sets of raw data. That means we have a little over 1% that do not match. What we are seeing is the error rate between two different samples from the same person analyzed by the same lab. I don’t know the technical details of how the companies determine the raw data from the samples, so I can’t speculate as to the reasons for the differences.
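The arithmetic can be re-checked directly from the counts given above. The snippet below uses only numbers stated in this post; the variable names are mine.

```python
total = 702_442          # lines for chromosomes 1 to 22
exact = 578_890          # identical result in both files
swapped = 100_898        # same alleles, opposite order
both_unclear = 14_899    # "--" in both files

agree = exact + swapped + both_unclear
disagree = total - agree

def pct(n):
    """Percentage of all lines, rounded to two decimals."""
    return round(100 * n / total, 2)

print(pct(exact), pct(swapped), pct(both_unclear))
print(agree, pct(agree))
print(disagree, pct(disagree))
```

Working from the counts rather than the rounded percentages, the agreeing share comes out at 98.90% instead of 98.89%, which is only a rounding difference. The 7,755 lines left over are the "little over 1%".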

Breaking down the differences:
For 2,762 (0.39%), FamilyTreeDNA found a pair, but MyHeritage was unclear.
For 4,951 (0.70%), MyHeritage found a pair, but FamilyTreeDNA was unclear. 
For 42 (0.01%), both companies found a pair, but the pair differed.
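The categories used in this comparison can be expressed as a small classification function. This is a sketch of how I'd describe the rules in Python, not the spreadsheet formulas actually used. The labels and the sample pairs are made up for illustration.

```python
from collections import Counter

NO_CALL = "--"

def classify(ftdna, myheritage):
    """Label one pair of results from the two files."""
    if ftdna == NO_CALL and myheritage == NO_CALL:
        return "both unclear"
    if ftdna == NO_CALL:
        return "FamilyTreeDNA unclear"
    if myheritage == NO_CALL:
        return "MyHeritage unclear"
    if ftdna == myheritage:
        return "exact match"
    if sorted(ftdna) == sorted(myheritage):
        return "match, opposite order"
    return "different pair"

# Made-up sample pairs: (FamilyTreeDNA result, MyHeritage result).
sample = [("AA", "AA"), ("CT", "TC"), ("--", "--"),
          ("AG", "--"), ("--", "GG"), ("AG", "AA")]

counts = Counter(classify(f, m) for f, m in sample)
for label, n in counts.items():
    print(label, n)
```

Run over all the lines of the two downloads, a tally like this would give the six counts reported above.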

 

Build 36 versus Build 37

FamilyTreeDNA currently uses Build 36, not Build 37, when matching segments between people. As Gerrit van der Ende wrote: “A Build is a Genome assembly. As more is learned about the human genome, new Genome assemblies are released.”

The Chromosome Browser at FamilyTreeDNA and the Chromosome Browser Results file you download from FamilyTreeDNA have positions based on Build 36. Build 36 had a few more RSIDs (702,457 for chromosomes 1 to 22 versus 702,442 for Build 37). There were 15 RSIDs deleted. Here is the beginning of my Build 36 download from FamilyTreeDNA:

image

Compare this to the Build 37 at the beginning of this article. The RSIDs are the same and the Results are the same, but all the Positions are different. The positions are not important for matching. Only the order of the RSIDs and the Results are important for matching. There were only 100 or so RSIDs that had a slight order difference, so different builds can be relatively easily translated into each other and matched against each other. What will be different between Builds are the Positions of the matching segments and the size of the segments.

GEDmatch, like FamilyTreeDNA, uses Build 36 for its comparisons. But 23andMe uses Build 37. So you can’t compare exact positions in Double Match Triangulator that were computed for FamilyTreeDNA or GEDmatch files with those computed at 23andMe.

MyHeritage’s positions in its raw data are all matching FamilyTreeDNA’s positions from the latter’s Build 37 download, so MyHeritage’s raw data is Build 37. I will not be able to tell whether their matches are Build 37 until MyHeritage provides a segment match download or a utility like a chromosome browser that shows segment match results. However I would guess, since they are a new company, they would use Build 37 matches, making their Positions compatible with 23andMe.

FamilyTreeDNA and GEDmatch are sort of stuck. They put together a matching system based on Build 36 and they’d have to remap all the results if they went to Build 37 for their matching. It would change the positions, but likely not change the match results significantly. That’s a lot of work for little gain, so I can see their reluctance to make the change.

Comparing Build 36 to Build 37 gives almost all the mapping that is needed. If it becomes important in the future for Double Match Triangulator, I see that I’d be able to do the mapping and present FamilyTreeDNA, GEDmatch, MyHeritage and 23andMe results all with comparable Positions, either Build 36 or Build 37.
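Because the same RSIDs appear in both builds, the mapping amounts to a lookup keyed on RSID. Here is a minimal Python sketch, assuming each build has been read into (rsid, chromosome, position) rows. The function name and the sample rows are hypothetical, and this is not how Double Match Triangulator currently works.

```python
def build_position_map(rows_36, rows_37):
    """Map (chromosome, Build 36 position) -> Build 37 position via shared RSIDs.

    Each row is (rsid, chromosome, position). RSIDs present in only one
    build, or listed on a different chromosome, are skipped.
    """
    pos_37 = {rsid: (chrom, pos) for rsid, chrom, pos in rows_37}
    mapping = {}
    for rsid, chrom, pos in rows_36:
        if rsid in pos_37 and pos_37[rsid][0] == chrom:
            mapping[(chrom, pos)] = pos_37[rsid][1]
    return mapping

# Made-up rows for illustration; real positions come from the two downloads.
b36 = [("rs0000001", "1", 1000), ("rs0000002", "1", 2500), ("rs0000003", "1", 4000)]
b37 = [("rs0000001", "1", 1100), ("rs0000002", "1", 2650)]

to_37 = build_position_map(b36, b37)
print(to_37.get(("1", 2500)))
print(to_37.get(("1", 4000)))
```

A segment start or end that falls exactly on a tested RSID translates directly. A position that is not in the table, such as one of the 15 deleted RSIDs, has no entry and would need separate handling.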

 

Comparing the X Chromosomes

Doing the same comparison for the X chromosomes shows more differences between FamilyTreeDNA and MyHeritage DNA than chromosomes 1 to 22 did:

image

First of all, MyHeritage is missing 16 of the RSIDs that FamilyTreeDNA has. This wasn’t a problem for chromosomes 1 to 22 which matched exactly.

Then, if you look again at the FAQ above, you’ll see it says that for men who only have a single X chromosome, the one allele will be doubled, allowing only AA, CC, GG and TT. This is my raw data file, and I’m male. But the results show 46 combinations that include AC, AG, CT/TC and GT/TG. Those all have to be incorrect and I’ve marked them such.
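The rule from the FAQ is easy to apply mechanically. The Python sketch below flags every called result on a single-copy chromosome that is not a doubled allele. The row layout, function name and sample rows are assumptions for illustration. It applies the rule uniformly and makes no allowance for the regions where X and Y pair up (the pseudoautosomal regions).

```python
def invalid_haploid_calls(rows, chromosome="X"):
    """Return rows on `chromosome` whose result is not a doubled allele.

    rows are (rsid, chromosome, position, result) tuples. No-calls ('--')
    are not counted as invalid; they are simply unknown.
    """
    bad = []
    for rsid, chrom, pos, result in rows:
        if chrom != chromosome or result == "--":
            continue
        if len(result) != 2 or result[0] != result[1]:
            bad.append((rsid, chrom, pos, result))
    return bad

# Made-up rows for illustration.
rows = [
    ("rs0000010", "X", 500, "AA"),
    ("rs0000011", "X", 900, "CT"),
    ("rs0000012", "X", 1300, "--"),
    ("rs0000013", "1", 700, "AG"),
]
print(invalid_haploid_calls(rows))
```

The same function can be pointed at the Y chromosome by passing chromosome="Y".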

And instead of only about 1% of the results where one company found a pair and the other was unclear, we are now up to over 5% of the X results being “unclear” for one of the companies, and another 641 or 4% being “unclear for both”. That means that about 9% of the X chromosome results are unknown or not agreed upon in the test results that Gene by Gene produces from two DNA samples of the same person.

If 9% of the X chromosome results are missing or wrong, then for two people, as many as 18% of the locations may be wrong between them. What effect might this have on X chromosome matching?
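As a rough worked check on that figure: 18% is the simple sum of the two error rates. If the errors in the two people's files were independent of each other (an assumption; they may well cluster at the same difficult RSIDs), the share of locations affected in at least one of the two would be slightly lower.

```python
p = 0.09                         # share of X results missing or wrong per person (from above)
upper_bound = 2 * p              # simple sum of the two error rates
independent = 1 - (1 - p) ** 2   # at least one of the two affected, if errors are independent

print(f"{upper_bound:.1%} {independent:.1%}")
```

Either way, the figure is in the 17% to 18% range, so the question stands.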

 

The Y chromosome

I was very surprised to see that the MyHeritage DNA raw data includes the Y chromosome. FamilyTreeDNA does not. So I can’t compare the two. All I can do is report on the Y results of MyHeritage DNA:

image

Again, there is only one Y chromosome, so according to the convention, the allele should be doubled. We see that only 60% of the 481 RSIDs have valid values of AA, CC, GG or TT.

Even without the FamilyTreeDNA raw data for the Y to compare with, the MyHeritage DNA raw data does not give much confidence regarding the accuracy of the Y chromosome interpretation as far as single allele processing goes. MyHeritage does not yet report any results based on the Y chromosome, but they should double check this before they do.

 

Comparing Match Results at GEDmatch

The question now is whether these differences affect match results. One way to check this is to upload both files to GEDmatch.

Doing a One-to-one compare of the two files shows just 22 matches – one match for the length of each pair of chromosomes. GEDmatch uses 3587.0 cM as the size of the 22 pairs, and that’s exactly what the One-to-one compare gives. GEDmatch must somehow filter out the 1% mismatches in its comparisons, which is good.

Comparing the 2 me’s to my uncle gives very close results. Out of 61 matching segments, one start location and one end location are a bit different. The total match using the FamilyTreeDNA raw data is 2,006.4 cM and using the MyHeritage DNA data is 2,005.9 cM. Both give a largest segment of 88.3 cM.

For a more distant relationship, such as my 3rd cousin, the results are almost the same with only a few small differences:

image

So even though there appears to be a significant number of differences in the Raw Data files, they do not have a significant effect on the matches. They only affect a few of the starting and ending locations, and not by much.

Checking out the X Chromosome and spot checking a few of my closest X matches, the results are similarly close, and X matching is not significantly affected.

 

Comparing Match Results at FamilyTreeDNA

As a double check, I uploaded the MyHeritage DNA raw data into an account at FamilyTreeDNA. My original FamilyTreeDNA test gives me 9860 matches. The MyHeritage raw data gives me 9724 matches.

Of those, the cM total matches changed for 3717 of them, but the largest change was only 7.9 cM with the FamilyTreeDNA raw data giving a match of 107.1 cM and the MyHeritage DNA raw data giving a match of 99.2 cM. For this extreme case person, here is the comparison:

image

FamilyTreeDNA includes 2 segments of 2.37 cM and 3.21 cM that MyHeritage doesn’t, and one segment has a different start location. So even in this extreme case, the differences are not major.

Only 114 of the longest segments of the matches were different, with the largest difference being 3.6 cM that reduced a 16.4 cM longest segment down to 12.8 cM.

Again, this confirms that the differences in the Raw Data files do not have much of an effect on the match results.

 

Conclusions

  1. Comparing the raw data from FamilyTreeDNA and MyHeritage shows that for Chromosomes 1 to 22, there is disagreement or the result is unclear for 1.5% of the RSIDs. On the X chromosome, that percentage rises to 9%. On the Y chromosome, the percentage rises to 40%.
  2. These differences do not seem to have a significant effect on match results.
  3. A small number of start and end locations of segment matches may be different. This is worth noting when I start getting Double Match Triangulator to analyze crossover points, but likely won’t cause problems.

The raw data is more different than I expected it to be, but I’m very happy that it will make little difference to the match results.

 

-----

Update:  Sept 28, 2017:  

Raw Data Transfer

I was curious what raw data each company changes when you transfer the raw data from one to the other.

First I compared the MyHeritage transfer to FamilyTreeDNA with the original MyHeritage raw data.

The result is good. The only thing FamilyTreeDNA changed in the raw data transferred from MyHeritage was the order of the allele values GC, TA, TC and TG, which became CG, AT, CT and GT so that they would be in alphabetical order.

Unfortunately, I could not test the FamilyTreeDNA transfer to MyHeritage, because MyHeritage does not let you download the raw data of a transfer.

Done and Onward - Sun, 26 Mar 2017

What a few months! Finished up RootsTech and left with 3rd place in the Innovator Showdown. Took a much needed one-week vacation with my wife. Finally got out of my boot and started driving again. And I finished DMT’s new website at www.doublematchtriangulator.com and last week released version 1.5 of DMT. With that version, DMT is no longer free – a lifetime license now costs $40 US. There’s just too much work and subsidiary costs involved in supporting a product for others to be able to keep it as freeware.

Version 1.5 of DMT included a couple of fixes if you use the By Chromosome option. In that run’s People file, the numbers in column AH and after were not correct as they included the a-b match, which they shouldn’t have. Also, not matching to anyone will no longer crash the program.

In the meantime, I’ve arranged to do a number of talks.

  1. I’ll be giving a demo of Double Match Triangulator to the Association of Professional Genealogists (APG) Virtual Chapter on April 1, only open to APG members.
    image
  2. I’ll be attending IAJGS 2017 in Orlando in July and giving a Workshop on using DMT
    image
  3. I’ll be attending the Great Canadian Genealogy Summit in Halifax in October and giving 3 talks on DNA testing and using the results.
    image

 

Onward with Behold

Now that DMT’s good for a bit, it’s time to shift back to Behold, whose Version 1.3 has been patiently sitting waiting for me to get back to it and finish it off. I’m excited about getting this version done and releasing it. It will have the last set of changes to the Everything Report that I feel are needed prior to starting to work on changing Behold from being just a GEDCOM reader into the fully capable genealogy editor that I want and need it to be.

The thrust of this last set of changes is adding some important DNA information that I know I’ll be using. I’m sure any of you who have already got into DNA testing or are planning to, will want these features as well. Relationships, chance of matching, expected amount of match, as well as DNA candidates will be available. I’ll write up a full blog post on them as I approach completion.

 

Then more for DMT

Flipping back to DMT, I know I want to remove the dependency that DMT has with the Excel libraries that are available only if you have Microsoft Office on your computer. I found a package called TMS FlexCel which I’ll purchase which will allow the creation of Excel files without having Excel.

The other thing this package will allow me to try, is to see if I can convert DMT to be compilable not only for Windows, but also for Mac. I currently use the VCL (Visual Component Library) which is the framework for developing Windows applications. With TMS FlexCel, I can try to convert from VCL to the FMX (FireMonkey) framework which would allow me to produce versions of DMT that would run on Windows, MacOS, and Linux.  I may also want to see if I can make DMT a Universal Windows Platform (UWP) app and add it to the Windows Store. And then, if I could figure a way to display a huge spreadsheet nicely on a phone, and figure out why I might want to do it, FMX would allow me to make a version of DMT for your Android or iOS phone.

Then there’s a few important improvements needed to DMT. It needs to be able to read in 23andMe match files directly, as well as read in GEDmatch matches directly. Then you’ll no longer have to convert those formats to FamilyTreeDNA format by hand, which will make sure the conversion is done correctly and save you time.

 

And Back to Behold

I’ll have to design the Behold database, build in GEDCOM export, and add editing.

I was at one time considering SQLite as the database that I’d use, but since then I’ve been convincing myself that the proper solution is a NoSQL database. SQL databases are relational, and have a fixed predefined structure. This is limiting in genealogical software and requires a rebuild whenever a field is added or changed. NoSQL databases are unstructured and extendable. They allow flexibility and can handle very large data sets (Google, Twitter, Ancestry and FamilySearch all use NoSQL) and if you need more web processing power, you just add another server. Using a NoSQL structure will allow me to keep data from other sources in their near-native form rather than force-converting them into something else.

I will have time to decide on the final database structure, SQL or not, before I start this work.

 

And Back to DMT Again and Behold some more

Double Match Triangulator doesn’t do everything I want it to do yet. It’s got to take that final leap and be able to do that mapping of your ancestral segments to your DNA for you. Currently DMT only lays out your matches for you to analyze. But if I can attain that next step, then DMT will become something amazing.

And Behold, once editing is added, needs to interface with the online systems, FamilySearch, MyHeritage, Ancestry (if they’ll let me) and whoever else is an important system that I’ll want to obtain genealogy information from or sync with.

DNA interfaces would be nice as well. Why not load your DNA matches into your genealogy program? It’s just a matter of figuring out what ties between your genealogy research and your DNA test results are important and useful. Debbie Parker Wayne recently wrote: Wanted: Genetic Genealogy Analysis Tools Incorporating Family Tree Charts. Debbie gives some good ideas. I commented on her post and there’s some other good comments there as well.

Lots to do. Back to work.

My MyHeritage DNA Results Have Come In - Sun, 12 Mar 2017

First off, that didn’t take very long! At RootsTech, just over one month ago, I took a DNA test at the MyHeritage booth. I didn’t have to mail it back. Instead the MyHeritage people delivered all the samples they collected at RootsTech directly to the lab in Texas.

A couple of weeks before RootsTech, I was selected to receive a free MyHeritage DNA test kit at RootsTech. I wasn’t going to turn down that opportunity. Previously, I had tested my uncle and myself at FamilyTreeDNA. I was interested in comparing the results and seeing what MyHeritage, the new DNA kid on the block, was going to provide.

The first email I got was yesterday morning.

image

Clicking on the View DNA results link led me to their site

image

So this was going to give me my ethnicity make-up.

I know the ethnicity results from the various DNA companies vary. Each is based on a base of several thousand people they use, and the assumption is that the people know accurately their ancestral origins. This is tricky for them to do, because if one of the people is 50% Swedish and 50% Mexican, then anyone who matches this person will get a little of both when they really are only related on one side. Obviously, the companies will not include this extreme case in their base, but the illustration is accurate, because the same thing can happen 2 or 3 or 4 generations back, and therefore allocate a significant percentage of inaccurate ethnicity to a person’s whole.

The second inaccuracy in ethnicity percentage is because a person does not get the same amount of DNA at each ancestral level from each ancestor. For example, the normal case is having 32 great-great-great-grandparents. Therefore, each should average just over 3% of your ethnicity makeup. So if two of your g3-grandparents were from Sweden, you’d expect that 6% of your ethnicity report would be from Sweden. But DNA does not pass down evenly. The amount of DNA passed down from each g3-grandparent can vary greatly. You might not get any from some and could get as much as 6% or even 8% from others.
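The "just over 3%" figure is simply one divided by the number of ancestors at that level. A short Python check follows. The function name is mine, and the result is an average expectation, not what any one person actually inherits.

```python
def expected_share(generations_back):
    """Average fraction of your DNA from one ancestor that many generations back."""
    return 1 / 2 ** generations_back

# Parents are 1 generation back; g3-grandparents are 5 generations back.
for generations_back in range(1, 6):
    ancestors = 2 ** generations_back
    print(generations_back, ancestors, f"{expected_share(generations_back):.3%}")
```

Five generations back there are 32 ancestors averaging 3.125% each, so two Swedish g3-grandparents average 6.25% between them.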

In my case, none of that should be a problem. I am a good test case for how good the base is, since as far as I know, anything less than 100% Ashkenazi for me is likely incorrect. All my lines as far back as I can go don’t indicate anything otherwise. Even my Ancestral Birthplace Chart is boring, with my father’s side all Romania, and my mother’s side the country right next door: Ukraine.

So let’s see the results at MyHeritage:

image

Hmm. They got 83.8% right. The East Europe 3.8% could be argued that they got the locale right but the group wrong. I really have to laugh at the 1.0% Eskimo/Inuit though. Maybe that’s a result of my living through the frigid winters in Winnipeg all my life, that my genes have evolved into Eskimo.

Let’s compare to my FamilyTreeDNA ethnicity estimates.

image

They only got 79% right with an 11% locale correct for Eastern Europe. Wonder where they got the 2% British Isles from. And my total is just 99%.

My uncle was only tested at FamilyTree DNA and he is 100% from Romania. He came out to 89% Ashkenazi Diaspora, 2% Eastern Europe and 8% Eastern Middle East. Well, like my total, that also totals only 99%. And how does my uncle get Eastern Middle East but I get Asia Minor?

To me, the ethnicity results provide me with no information (although maybe I’ll flaunt my being part-Eskimo). What’s really important to me are the matches.

At FamilyTreeDNA, I currently match to 9,637 people. The high number is likely because I match to most of the people of Ashkenazi heritage who have tested there due to the great amount of endogamy in this population. If the Ashkenazi could map everyone just like the Icelandic people did, we’d be able to use an app like they’ve got to determine how we’re related. Unfortunately, our records don’t go back to 1000 A.D. and to make things more difficult, our people were one of the last to adopt surnames and that happened in the early 1800’s, only about 5 generations ago. So of those 9,637 people, I have confirmed relationships with only two: my uncle and a 3rd cousin.

MyHeritage DNA is a new DNA testing company. They only started up last year but with the enormous reach and large worldwide membership of their MyHeritage site, they are growing quickly. I was interested to see how many matches I had. That number initially turned out to be 260. They are shown for me on 26 pages, 10 per page, in order of decreasing shared DNA. My first three entries look like this:

image

They provide the name of the person, sometimes a picture of the person, their approximate age (I like that), where they are from, the possible relationship range, a percentage of shared DNA (I find that useless if the cM is given), the shared cM, the number of shared segments, the largest segment in cM, and the size of their tree at MyHeritage along with a link to their tree.

That is all very nice. MyHeritage is of course trying to use the DNA testing to get more people to use their services. This is a great initial step and they seem to be doing all the right things so far.

One odd thing in their relationships: I wonder why they state “1st cousin twice removed”. I would rather they state “2nd cousin”, which is the same genetic distance. It is more likely that your match is at the same generational level as you than 2 generations before or after you.
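The "same genetic distance" claim can be checked with the standard expectation for full cousins, where each extra generation on either side halves the expected sharing. This is a sketch under simple assumptions: two shared ancestors (a couple), no other relationship between the two people, and no endogamy, which would raise the real numbers. The function name is mine.

```python
def expected_shared_fraction(degree, removed=0):
    """Average fraction of autosomal DNA shared by full cousins.

    degree=1 means 1st cousins, degree=2 means 2nd cousins; `removed` is
    the number of generations removed.
    """
    return 0.5 ** (2 * degree + 1 + removed)

print(expected_shared_fraction(1))             # 1st cousin
print(expected_shared_fraction(1, removed=2))  # 1st cousin twice removed
print(expected_shared_fraction(2))             # 2nd cousin
```

The last two lines print the same value, so the shared DNA alone cannot tell the two relationships apart.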

The big question is whether the MyHeritageDNA match information is compatible with the match data from other services. MyHeritage and FamilyTreeDNA use the same company in Texas to analyze their DNA tests. You would think the test results should be similar.

I did find a few of my matches who tested at both companies. Here’s the comparison:

image

MyHeritage is too optimistic about the Possible Relation. With endogamy, the relationships should be lessened at least to what FamilyTreeDNA has.

MyHeritage’s Total cM is less than FamilyTreeDNA’s. That is okay. All that means is that FamilyTreeDNA is including smaller segments than MyHeritage. FamilyTreeDNA includes segments as small as 1 cM in their total. MyHeritage likely only goes down to, say, 3 cM or 5 cM.
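To illustrate how the minimum segment size changes the total, here is a small Python sketch. The segment lengths are made up, and the 3 cM and 5 cM cut-offs are only my guesses as stated above, not documented MyHeritage settings.

```python
def total_cm(segments, min_cm):
    """Sum the segment lengths (in cM) that are at least `min_cm` long."""
    return round(sum(cm for cm in segments if cm >= min_cm), 2)

# Made-up segment lengths in cM for one match.
segments = [22.5, 9.1, 4.2, 2.8, 1.6, 1.1]

for threshold in (1, 3, 5):
    print(threshold, total_cm(segments, threshold), max(segments))
```

The total drops as the threshold rises, but the largest segment is the same at every threshold, so a cut-off difference cannot explain a difference in the largest segment.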

But it’s the largest cM that bothers me. For this the two companies should have the same values, but don’t. And they’re not out by a small amount either. MyHeritage’s largest segment is larger than FamilyTreeDNA’s in all cases. I have no explanation for this, but it indicates that something significant is different between the two sets of analysis.

What MyHeritage DNA hasn’t done yet is give you the ability to download your match data. It remains to be seen whether they will, or whether they will hold out like AncestryDNA. Currently, if you want a list of the people you match with, you’ll have to go through your pages and record the info yourself, one by one. Nor is the segment match data supplied. As a result, I cannot check the individual matches to see why they differ from FamilyTreeDNA.

MyHeritage does allow you to download your raw data, and you can import that into GEDmatch. So currently, the only way you can use your MyHeritage data with Double Match Triangulator is through GEDmatch.

Nonetheless, it’s a good start for MyHeritage. They’ll grow quickly and likely join the big-3: AncestryDNA, 23andMe and FamilyTreeDNA as the 4th major player in the DNA-testing circuit. I hope they make the decision to implement some DNA analysis tools and allow you to download your own data. And let’s also hope that they don’t become one of those companies that sells your data to others, and hides that in their terms of agreement.

Now, what can I do to find my Eskimo relatives?

 


Followup: March 13, 2017: Ann Turner and Annemieke van der Vegt pointed out on the ISOGG Facebook group that the raw data can be downloaded from MyHeritage. There are 3 little dots that you can click on and the download option will appear. The raw data then can be uploaded to GEDmatch. I’ve updated my post to reflect this info.

Update: April 4, 2017:  FamilyTreeDNA did a major update to their Ethnic Makeup algorithm. Many people have said it is much improved. It is as well for me, with more Ashkenazi accounted for.

Here’s my new results:

image

Ashkenazi up from 79% to 92%. British Isles and Asia Minor are gone. I don’t believe the 7% West and Central Europe. It should be East Europe. The trace of West Africa is new and perplexing.

My uncle improved as well. He went from 89% Ashkenazi to 96%, with trace amounts from Southeast Europe (okay), West Middle East (maybe), South Central Africa (huh?) and Central Asia (nope). You think maybe my uncle’s Central Asia is where my 1% Eskimo at MyHeritage came from?


Update: May 31, 2017:  I transferred my FamilyTreeDNA raw data over to MyHeritageDNA a couple of months ago to see how it would compare. Both my tests were done on the same person, i.e. me, and analyzed at the same lab, since MyHeritageDNA and FamilyTreeDNA both use the same lab in Texas.

I would have expected my ethnicity as analyzed by MyHeritageDNA, whether from the MyHeritageDNA kit or from the FamilyTreeDNA Kit to be very similar. Here’s how they look:

image

I would only consider the Ashkenazi, East European and Balkan to be correct. The discrepancies in each line between the two tests are surprising. In one test, MyHeritageDNA says I have Iberian, North African and Eskimo. In the other it says I have NorthWest Europe and Irish/Scottish/Welsh.

Identical twins should not give different ethnicity results, and neither should raw data from the same person analyzed for ethnicity by one company.

Some people comment that “ethnicity estimates are not an exact science”.

Well I would say it’s more like “an approximate art”.