
Louis Kessler's Behold Blog

Comparing Single Matching to Double Matching - Wed, 25 Jan 2017

Important Note:  After this article was written, I found many people had trouble understanding the concepts, as the diagrams were confusing them more than helping them.

This article has been completely rewritten (just two days later) and uses a different style of diagram that is akin to looking at the matches in a Chromosome Browser. It should not only be much easier to understand, but it also adds comparisons with the ADSA and GEDMatch Triangulation tools.

I’m leaving this article here as another method of explaining the same thing, but if you haven’t read the other article yet, I’d recommend you read it first:
Triangulation, Single Matching and Double Matching

 

Let’s see if we can define everything in an understandable way.

 

Single Matching

For Person a, find all the Persons c, d, e, … who match or overlap on the same DNA segment.

This is what FamilyTreeDNA and 23andMe give you today. MyHeritage is promising a Chromosome Browser but no word yet on whether you’ll be able to download segment matches. AncestryDNA does not provide you with your segment match data.

image

The goal here is to find the people who get that DNA segment from the same common ancestor, making it Identical by Descent (IBD), as that will prove a relationship. But this must be checked thoroughly, because one of each pair of chromosomes comes from the mother and one from the father, and the DNA company’s matching process cannot distinguish one from the other. So any match will count, even one that criss-crosses between the mother’s and father’s chromosomes, as will random matches by chance. With small segments below 15 cM in size, there is a significant likelihood of a false match that is not IBD. Even above 15 cM, two people who overlap on the same segment may still match you through different parents. The main technique to help you identify if the segment is IBD is Triangulation (see below).
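
Here is a minimal sketch of Single Matching in Python, assuming each match is stored as a chromosome with a start and end position. The `Segment` type, the `single_matches_on` helper and the positions are illustrative inventions, not any company's download format.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    match_name: str  # the person that Person a matches (c, d, e, ...)
    chromosome: str
    start: int       # start position in base pairs
    end: int         # end position in base pairs

def overlaps(s: Segment, t: Segment) -> bool:
    """True when two segments share a position on the same chromosome."""
    return s.chromosome == t.chromosome and s.start <= t.end and t.start <= s.end

def single_matches_on(region: Segment, segments: list[Segment]) -> list[str]:
    """Names of everyone whose segment overlaps the region of interest."""
    return sorted({s.match_name for s in segments if overlaps(s, region)})

# Hypothetical matches for Person a (made-up positions).
a_matches = [
    Segment("c", "1", 1_000_000, 9_000_000),
    Segment("d", "1", 5_000_000, 12_000_000),
    Segment("e", "2", 5_000_000, 12_000_000),
]
region = Segment("region", "1", 4_000_000, 8_000_000)
print(single_matches_on(region, a_matches))
```

Note that nothing here says whether c and d match each other, or which parent's chromosome each match is on. That is exactly the limitation described above.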

 

Double Matching

For Person a and Person b, find all the Persons c, d, e, … who match both Person a and Person b on the same segment.

This is what my Double Match Triangulator Program will give you today. And this is what you want FamilyTreeDNA, 23andMe, AncestryDNA and MyHeritage to be giving you.

image

Double matching does a lot for you. It uses a second person to help confirm that Person c, Person d and Person e all match each other and are not just matches by chance. It eliminates the extra bits of random match that Person a and Person b may have with the third person. If Person a and Person b are not direct-line related (i.e. parent-child, grandparent-grandchild), then it will reduce the threshold of where false matches will occur, down to maybe even 5 cM as Jim Bartlett has concluded. I plan to do a study of this soon and will put my results in an upcoming blog post.
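
As a rough illustration (not the actual DMT code), Double Matching can be sketched as intersecting Person a's segments with Person b's segments for the same Person c. The tuple layout `(name, chromosome, start, end)` and the sample positions are assumptions made for this example.

```python
def double_matches(a_segments, b_segments):
    """Return (name, chromosome, start, end) for each stretch where the same
    person matches both Person a and Person b on the same chromosome."""
    found = []
    for name_a, chrom_a, start_a, end_a in a_segments:
        for name_b, chrom_b, start_b, end_b in b_segments:
            if name_a != name_b or chrom_a != chrom_b:
                continue
            # Keep only the part of the segment that both matches share.
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                found.append((name_a, chrom_a, start, end))
    return found

# Hypothetical segments: Person a's matches and Person b's matches.
a_segments = [("c", "1", 100, 500), ("d", "1", 100, 500), ("e", "3", 10, 90)]
b_segments = [("c", "1", 300, 800), ("d", "2", 100, 500), ("e", "3", 95, 200)]
print(double_matches(a_segments, b_segments))
```

Only Person c survives here: d matches a and b on different chromosomes, and e's two segments do not overlap. Trimming to the shared stretch is what removes the extra bits of random match at either end.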

 

Triangulation

Triangulation is a technique to help conclude (I won’t say “prove”) that three people share a segment that comes from a common ancestor and that the segment is Identical by Descent.

It requires that Person a matches Person c on a segment, that Person a matches Person d on the same segment, and also that Person c matches Person d on the same segment.

image

This statistically reduces to almost zero the possibility of the criss-crossing matches between the two parental chromosomes. It is still possible that one of the 3 people matches by chance to the other two people. But should that chance match be disproved, maybe by multiple Triangulations with other people, then it can be concluded that these people obtained that common segment from a common ancestor.
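
The three-legged requirement can be sketched as follows. This is a simplified illustration that treats "the same segment" as any stretch shared by all three pairwise matches; the function names and the `(chromosome, start, end)` layout are assumptions for the example.

```python
def common_overlap(*segments):
    """Each segment is (chromosome, start, end).
    Return the region shared by all of them, or None if there is none."""
    if len({chrom for chrom, _, _ in segments}) != 1:
        return None
    start = max(s for _, s, _ in segments)
    end = min(e for _, _, e in segments)
    return (segments[0][0], start, end) if start < end else None

def triangulates(a_c, a_d, c_d):
    """True when the a-c, a-d and c-d matches all share a common stretch."""
    return common_overlap(a_c, a_d, c_d) is not None

# Hypothetical positions on chromosome 7.
print(triangulates(("7", 100, 500), ("7", 200, 600), ("7", 150, 450)))
print(triangulates(("7", 100, 500), ("7", 200, 600), ("7", 700, 900)))
```

In the second call the third leg (c with d) lies elsewhere on the chromosome, so the three people do not Triangulate even though a matches both c and d on an overlapping segment.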

 

Single Match Triangulation

This is the technique commonly in use today because you are only supplied with Single Match information by FamilyTreeDNA and 23andMe.

The matches between Persons c, d and e are not included in the Single Match data you get. What you need to do is contact Person c, d or e and ask them to look in their Chromosome Browser and see whether they match the other people on that particular segment. If they do, then you Triangulate on that segment. If they don’t match some of the others, then you’ll have to contact the others to get them to check.

image

You could have tens of thousands of segments that Single Match with others. You may have dozens of people who overlap on a segment. So to be practical, most people just concentrate on their largest segments, or on a segment connected to people they are trying to figure out their relation to. This is manual labour as far as I’m concerned. And it only verifies one segment for a few people. You still have all your other segments to do, and they have so much info to give you.

So what people often do is they get lazy. Maybe they verify with one or two people and then incorrectly conclude that all the other matches on the segment are valid. Then maybe they just look to see if the other people are “In Common With” meaning they match somewhere, but not necessarily on the desired segment, and then conclude the segment Triangulates, which is not a conclusion you can make.

Single Match Triangulation is what Jim Bartlett has done over the past five years. He has done it correctly and meticulously. By mapping his segments to his matches, he has manually Match Filtered (I’ll explain what that is in a future blog post) to his parents and has been able to map most of his segments to his ancestors. But it took him 5 years! It’s not easy.

There is one tool that does true Triangulation for you. It is the GEDMatch Tier 1 Triangulation Tool. It is the only online tool that will properly check the third leg of the triangulation for you and guarantee that it is a true Triangulation. All the other tools out there use “In Common With” or less. However, with GEDMatch, you are limited to the kits that have been uploaded there, only your closest 500 matches are used, the minimum cM match is 7 cM and 500 SNPs and it gets cut off at 10,000 Triangulations.

 

Double Match Triangulation

This is the technique I implemented in my Double Match Triangulator (DMT) program that uses Double Matching.

The basis is simple. Once you’ve Double Matched Person a and Person b with other people on a segment, you have all the matches you need except one: the Person a with Person b match. And you’ve got that right in your own Single Match file.

image

The matches between Person a and Person b could then be compared to all the Double Matches, and those that overlap all Triangulate, and those that don’t are Missing a-b Segments (another word I invented).

With the Chromosome Browser Results (CBR) files of Person a and any Person b that is Person a’s match, you can find every segment that Triangulates and all the people that Triangulate with them on every segment in one fell swoop.
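
A minimal sketch of that last step, assuming the Double Matches and the a-b matches are already available as simple tuples. The `classify` name and the data layout are hypothetical, and this is not how DMT itself is implemented.

```python
def classify(double_matches, ab_segments):
    """double_matches: (name, chromosome, start, end) for each Person c.
    ab_segments: (chromosome, start, end) where Person a matches Person b.
    Returns (triangulated, missing_ab)."""
    triangulated, missing_ab = [], []
    for name, chrom, start, end in double_matches:
        # A Double Match Triangulates if it overlaps any a-b segment.
        hit = any(
            chrom == c and max(start, s) < min(end, e)
            for c, s, e in ab_segments
        )
        (triangulated if hit else missing_ab).append((name, chrom, start, end))
    return triangulated, missing_ab

# Hypothetical data: two Double Matches and one a-b segment.
dm = [("c", "1", 300, 500), ("d", "1", 900, 1200)]
ab = [("1", 250, 600)]
tri, missing = classify(dm, ab)
print(tri)      # Double Matches that overlap an a-b segment
print(missing)  # Missing a-b Segments
```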

If you can get CBR files from more of your DNA matches and put them all together, you will be doing what I call EAST (Extreme Autosomal Segment Triangulation).

 

Hopefully this post makes the concepts all a bit clearer for you.

Double Match Filtering for an Endogamous Population - Sun, 22 Jan 2017

A few days ago, Roberta Estes posted: Concepts – Segment Size, Legitimate and False Matches where she compared a child’s matches against those of her parents. She downloaded the Chromosome Browser Results (CBR) file from FamilyTreeDNA for a set of parents and a child, and then explained how she did the matching in a spreadsheet.

Roberta’s key result was a Parent Child Phased Segment Match Chart, which shows that she passed the 50% mark for false matches at 7 to 7.99 cM segments, rising to 87% false matches once segments are as small as 3 to 3.99 cM.

Roberta refers to this technique as “double parent phasing” (no caps) whereas I’d like to call it “Double Match Filtering” (with caps). My reason for naming it this is because it is exactly the same technique I use for what I call Double Match Analysis.

What we are doing is taking a child as Person a and one parent as Person b, and finding all the Person c people that match both. Then we do it a second time, with the same child again as Person a and the other parent as Person b, and find the Person c people that match both of them. Using these two sets of Double Matches, we go back to all the child’s Single Matches and see which do not Double Match with either parent. Those non-matches cannot be Identical by Descent (IBD), since one parent would have had to match in order to pass the segment down from the ancestor, through them, to the child.
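
The procedure above can be sketched in Python. This is only an illustration of the idea, not the spreadsheet formulas used for this analysis. It assumes segments are `(match_name, chromosome, start, end)` tuples and treats any overlap with a parent's segment for the same person as a Double Match.

```python
def overlaps(seg, other):
    """Same person, same chromosome, and the two ranges share a stretch."""
    return (
        seg[0] == other[0]
        and seg[1] == other[1]
        and max(seg[2], other[2]) < min(seg[3], other[3])
    )

def filter_child_matches(child, father, mother):
    """Split the child's Single Matches into those that Double Match with
    at least one parent and those that match neither (so cannot be IBD)."""
    confirmed, unconfirmed = [], []
    for seg in child:
        via_father = any(overlaps(seg, p) for p in father)
        via_mother = any(overlaps(seg, p) for p in mother)
        (confirmed if via_father or via_mother else unconfirmed).append(seg)
    return confirmed, unconfirmed

# Hypothetical segments for a child and both parents.
child = [("c", "1", 100, 500), ("d", "2", 100, 500), ("e", "1", 100, 500)]
father = [("c", "1", 50, 450)]
mother = [("d", "2", 300, 900), ("e", "3", 100, 500)]
confirmed, unconfirmed = filter_child_matches(child, father, mother)
print(confirmed)
print(unconfirmed)
```

Person e matches the mother only on a different chromosome, so the child's segment with e is flagged as a non-match.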

The high percentage of false matches for small segments under 8 cM in Roberta’s results is what scares genealogists from using small segments. And this is the downfall of Single Match Triangulation. A large number of small single matches are likely false and are not IBD.

Towards the end of her article, Roberta said:

“I hope that other people in non-endogamous populations will do the same type of double parent phasing and report on their results in the same type of format.  This experiment took about 2 days.

Furthermore, I would love to see this same type of experiment for endogamous families as well.”

An Endogamous Family

I’ve had plans to do this anyway. I need to analyze how the matches pass down as part of my investigation into methods to use Double Match Triangulation to map segments onto ancestors.

So I’m taking a number of Chromosome Browser Results files that were sent to me by Arnold, a DNA-cousin of mine, to help me develop my Double Match Triangulator program and see if I can use it to figure out how we’re related.

(By the way, I define a “DNA-cousin” or “DNA-relative” as someone who is a DNA match, but neither of us have the foggiest idea of how we’re actually related.)

Arnold has been doing DNA analysis with FamilyTreeDNA for a long time, and he had about 20 CBR files that he let me use. He, like me, comes from an endogamous Ashkenazi population.

His files include a father, mother, son and daughter, as well as other relatives of those four. An endogamous population gives those involved many more matches than you’d expect. That’s because everybody is related to everybody else, often in multiple ways. Here are the statistics for the four people I’ll use:

The father has 163,249 single match segments with 7,654 people.
The mother has 149,083 single match segments with 7,139 people.
The daughter has 146,767 single match segments with 7,271 people.
The son has 142,066 single match segments with 7,014 people.

To add an interesting complication, the father and mother are related. They have 25 segments that match each other, totalling 98.0 cM, with the longest being 18.9 cM. This would normally make them something like 3rd cousins. But because of endogamy, they are more likely 5th and 6th cousins in several different ways.

The Spreadsheet Analysis

I basically did what Roberta said to do. I did it twice, once for the son with his parents, and once for the daughter and parents. Each file has about 450,000 lines in it. These are big Excel files that ended up (with analysis equations) being about 80 MB in size each.

I didn’t delete the segments under 3 cM like Roberta did. She was visually inspecting each match herself, so wanted a manageable number of matches to work with. Her non-endogamous CBR files had about 25,000 segment matches in each one, and removing the under 3 cM ones left her with about 6,000 matches in each, for a total of 18,000 lines to work with, and that was plenty to provide reasonable results.

I was able to develop Excel formulas to do the match comparisons that Roberta did by hand. Since I was letting the computer do the work, I didn’t need to cut down the size of the analysis and I could work with the whole dataset.

Roberta didn’t mention it, but you do have to remove the father, mother and child wherever they appear as the “MATCHNAME”. They all match each other on many segments, including the father and mother as I mentioned above. You don’t want to count those in these statistics.

Also, it’s really important to check the dates of your downloads of the two parents’ files and the child’s file. If they were not downloaded at the same time, a later downloaded file will contain matches to people that an earlier download did not. This will make it look like one person matches and the other does not, when what is really true is that you just don’t have the matches for the other person.

These one-sided matches had to be eliminated. I found the best way was to see if the child had matches to a Person c that neither their mother nor their father had. For this Person c to show up in the child’s match list, they had to have at least a half dozen matches totalling at minimum around 20 cM. For that to happen and for none of those segments to match either parent is practically impossible, meaning the matches for the parents are missing. So I deleted these from the analysis. They amounted to about 5% of the matches and did not really change the results, other than reducing the number of large segments that did not match.

And because the parents were related, I knew there would be some matches that would be on both parents’ sides, so I made sure I was able to count those so I’d have them for future analysis.
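
A small sketch of that clean-up step, assuming segments are `(match_name, chromosome, start, end)` tuples. The function names and sample data are made up for illustration, and the real work here was done with Excel formulas rather than code.

```python
def one_sided_people(child, father, mother):
    """People in the child's match list who appear in neither parent's file,
    most likely because the parents' files were downloaded earlier."""
    parent_names = {seg[0] for seg in father} | {seg[0] for seg in mother}
    return {seg[0] for seg in child} - parent_names

def drop_one_sided(child, father, mother):
    """Remove the child's segments that belong to one-sided people."""
    missing = one_sided_people(child, father, mother)
    return [seg for seg in child if seg[0] not in missing]

# Hypothetical data: Person x is in the child's file only.
child = [("c", "1", 100, 500), ("x", "2", 100, 500), ("x", "4", 10, 90)]
father = [("c", "1", 50, 450)]
mother = [("d", "2", 300, 900)]
print(one_sided_people(child, father, mother))
print(drop_one_sided(child, father, mother))
```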

The Double Match Phasing Results

These results include only matches on the 22 autosomal chromosome pairs. The X chromosome is a bit different so I removed them and will analyze them separately in a later post.

Here’s the results of the daughter versus her father and mother:

image

And the results of her brother (the son) versus the same father and mother were very similar:

image

The results showed that there was much less chance of a non-match in small segments for these endogamous people than what Roberta was showing in her results. The 50% point, highlighted in yellow, comes in at the 2 – 3 cM range, as compared to Roberta’s 50% point, which comes in at the 7 – 8 cM range. This surprised me so much that I went back and double and triple checked my equations to make sure they were identifying segments correctly and totalling everything correctly. They were.

Here is a plot of % Non-matches by segment size from several different analyses. In addition to my results and Roberta’s results, I’m including John Walden’s False Positive both sides phased results that are on the ISOGG Wiki, which Blaine Bettinger talks about in his “Small Matching Segments – Friend or Foe” article of 2014. Also I’m including Ann Raymont’s findings in her “When is a match a false positive?” post from 2016.
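
For anyone who wants to tabulate their own data the same way, here is a minimal sketch of computing the percentage of non-matches per 1 cM bin. The input layout `(cm, matched_a_parent)` and the sample values are invented for the example and are not taken from the results above.

```python
import math
from collections import defaultdict

def non_match_percent_by_bin(rows):
    """rows: (cm, matched_a_parent) for each of the child's segments.
    Returns {bin_floor_in_cM: percent of segments matching neither parent}."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for cm, matched in rows:
        size_bin = math.floor(cm)  # e.g. 2.5 cM falls in the 2 to 2.99 bin
        totals[size_bin] += 1
        if not matched:
            misses[size_bin] += 1
    return {b: 100.0 * misses[b] / totals[b] for b in sorted(totals)}

# Made-up rows purely to show the calculation.
rows = [
    (1.2, False), (1.8, True),
    (2.5, False), (2.6, True), (2.9, True), (2.1, True),
    (7.4, True),
]
print(non_match_percent_by_bin(rows))
```

The 50% point is then simply the largest bin whose percentage is at or above 50.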

image

It seems that every other study, all of non-endogamous populations, gives similar results, but mine is different. I currently do not know why this is. I can’t think of a reason why endogamy might give fewer non-matches for a given segment size. Unless my analysis is being done differently (or incorrectly), and I don’t believe it is, and given that my number of observations is certainly large enough, I think I may be showing something quite significant and relevant.

Among my 68 Chromosome Browser Results files that I have and that my DNA-relatives have given me, this father/mother/son/daughter was the only set of both-parents with child that I have. I would like to test some more, both endogamous and not.

I made my analysis spreadsheet quite general so that I could easily do this analysis for any father/mother/child triplet. If you’re interested in seeing what your non-match percentage looks like and would like to help me with this research that I’ll use to give my Double Match Triangulator program some smarts, please send me your set of CBR files. In return, I’ll be happy to send you the spreadsheet with your data in it and the results.

So if you have any set of CBR files from FamilyTreeDNA that include both parents and 1 or more children, would you be willing to send them to me so that I can analyze them the same way?  Thanks.

Double Match Triangulator - Version 1.4 - Fri, 20 Jan 2017

DMT is a semi-finalist in the #InnovatorShowdown at #RootsTech 2017. This is a new version of the program with several improvements.

You can get the new version on my DMT page. It is freeware to help you do  autosomal DNA segment analysis.

Now Works with Older CBR files

My own FamilyTreeDNA results came in 11 days ago. When I downloaded my Chromosome Browser Results (CBR) file and ran it through DMT, it didn’t find any triangulations with anybody. That’s because my results were brand new. The other CBR files I had did not know about my results because when they were downloaded, my results weren’t in the system yet.

DMT used to check that Person a’s file had matches with Person b and that Person b’s file had the same matches with Person a. If not, DMT wouldn’t use the a-b matches. So there would be people who Double Matched, but nobody would Triangulate.

To handle this situation, Version 1.4 now only needs the a-b matches in either Person a or Person b’s file. Now you won’t need to update all your older CBR files whenever you get a new tester in your family. Of course, you’ll only Double Match with Person c people who got their results after the older of your Person a and Person b files. Eventually you may want to update your older CBR files with newer ones, especially if there’s a particular Person c missing from the analysis. But updating your files is no longer necessary.

Prevents the Same Person from being used Twice

This was annoying. If you had several CBR files for a person, downloaded on different dates, and you ran By Chromosome to combine everything, then the person would be included as Person b multiple times.

Now DMT checks the names of the Person b people. If the same name shows up, it will only use the last file when ordered by filename alphabetically, which should be the one with the latest date.

This way, you can download new CBR files and leave them with their older ones for comparison, and DMT will only use the newest in its By Chromosome runs.
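
The selection rule can be sketched like this. It is an illustration of the rule as described, not DMT's source code, and the filenames and person names are hypothetical.

```python
def newest_file_per_person(files):
    """files: (filename, person_name) pairs.
    Keep, for each person, the filename that sorts last alphabetically."""
    chosen = {}
    for filename, person in sorted(files):
        chosen[person] = filename  # later filenames overwrite earlier ones
    return chosen

# Hypothetical downloads. Date-prefixed names sort in date order.
files = [
    ("2017-01-10_Pat.csv", "Pat"),
    ("2016-11-02_Pat.csv", "Pat"),
    ("2016-12-01_Sam.csv", "Sam"),
]
print(newest_file_per_person(files))
```

The rule only picks the latest download if your filenames sort in date order, for example when they carry a year-month-day date in a consistent position.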

Excludes non-matches from the By Chromosome Analysis

Originally, I thought it was okay to include all the Chromosome Browser Results files in the By Chromosome analysis. I thought that even if Person b does not match Person a, the Double Matches should still be meaningful.

Yes that is true, but …

This will lead to a false interpretation if Person b actually does match Person a on some segments, but they are below FamilyTreeDNA’s threshold for considering them a match. The segments that were a-b matches would then incorrectly show up in DMT as Missing a-b Segments rather than as Triangulations. This is very bad because Double Match Theorem 1 would get you to conclude that this segment is on the other half of the Chromosome pair than it really is. That would make you conclude that this is a paternal match when it is really maternal, or vice versa.

So that had to change. Non-matches are excluded in the By Chromosome Analysis.

Better Handling of Duplicate Segments in CBR files

FamilyTreeDNA unfortunately downloads matches in its CBR files by match name rather than kit number. If two people have the same name, e.g. John Smith, or if one person tested twice under the same name, all those matches will be in the CBR file mixed together, looking like one person.  DMT puts a  “##” before the name of people with this problem, so that you will be aware when you use those segment matches.

Duplicate segments occur when a person has tested twice. In most cases, all the segments are duplicated (or even triplicated if someone did 3 tests). This case is easy to detect, and the extra entries are easy to remove. Then this Person c can be used without worry. DMT now fixes this for you and there is no “##” before these people’s names.
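
One possible way to detect and clean the "tested twice" case is sketched below. It assumes a person's segments are `(chromosome, start, end)` tuples and that a duplicated test produces exact copies. The helper names are made up, and DMT's actual logic may differ.

```python
from collections import Counter

def duplication_factor(segments):
    """Return n if every distinct segment appears exactly n times (n > 1),
    which suggests the person tested n times. Otherwise return None."""
    counts = set(Counter(segments).values())
    if len(counts) == 1:
        n = counts.pop()
        return n if n > 1 else None
    return None

def remove_duplicate_segments(segments):
    """Keep the first copy of each segment, preserving the original order."""
    seen = set()
    unique = []
    for seg in segments:
        if seg not in seen:
            seen.add(seg)
            unique.append(seg)
    return unique

# Hypothetical segments for one match name that appears to have tested twice.
tested_twice = [("1", 100, 500), ("3", 10, 90), ("1", 100, 500), ("3", 10, 90)]
print(duplication_factor(tested_twice))
print(remove_duplicate_segments(tested_twice))
```

When `duplication_factor` returns None but the name still has overlapping segments, you are more likely looking at two different people with the same name, which is the case that still gets the “##” marker.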

For the overlapping people, if you really need to fix one or two because they are critical in your analysis, you can go to FamilyTreeDNA’s Chromosome Browser and look that name up. You’ll see more than one person. You can download their individual matches and manually doctor up your CBR files, but you’ll have to make up a different name for the other, e.g. John Smith and John Smith2. This is messy because your CBR file for Person b will also have its John Smiths together, and your John Smith2 won’t match anyone in Person b’s file unless you fix that file as well. Ugh!  Better to wait for FamilyTreeDNA to fix this problem, if anyone knows how to let them know about it.

Improvements to the People Page

This is likely the most visible improvement. It is on the People page for individual Double Match runs, and for the By Chromosome run. The two have been made more consistent.

image

And now all segment matches use consistent notation for the largest Single Matches between Person a and Person c on each Chromosome, 1 to 22 and X (sometimes referred to as 23).

If a-c Triangulate on that Chromosome, then the largest length in cM of any a-c segment that Triangulates is prefixed by the letter "T" and is shown in green so it can be easily picked out, e.g. image

If a-c does not Triangulate on that Chromosome, but does Double Match, then the largest length in cM of any a-c segment that Double Matches is prefixed by the letter "D", e.g. image

X matches will be shown in column ACX with red text and the prefix after the letter "T" or "D" will be "X", e.g.  image or image

Also all Triangulating people are shown first, ordered highest to lowest in their total a-c cM, so the closer relatives will be listed earlier on.
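
The ordering can be expressed as a two-part sort key. This sketch assumes each person is a small dictionary with hypothetical field names, and it assumes the non-Triangulating people are also ordered by total cM, which the description above does not state.

```python
def order_people(people):
    """Triangulating people first, then by total a-c cM, highest to lowest."""
    return sorted(people, key=lambda p: (not p["triangulates"], -p["total_cm"]))

# Hypothetical people with made-up totals.
people = [
    {"name": "c", "triangulates": False, "total_cm": 120.0},
    {"name": "d", "triangulates": True, "total_cm": 45.5},
    {"name": "e", "triangulates": True, "total_cm": 210.0},
]
print([p["name"] for p in order_people(people)])
```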

————————–

I found that I needed the above changes once I downloaded my own data. I’m sure they’ll be useful to you as well if you use DMT.

It took me 6 days to make these changes. I know I worked hard to get this working over that time. So I was curious and I counted up the number of DMT runs that I had to do to implement, test and debug all this. I was able to total up the number of DMT log files that were created each day. The counts were:

Sunday, Jan 15 - 48
Monday, Jan 16 - 33
Tuesday, Jan 17 - 30
Wednesday, Jan 18 - 80
Thursday, Jan 19 - 73
Friday, Jan 20 - 33

Wow! I thought I worked hard on this, but I never expected that it would have taken me 297 runs of Double Match Triangulator to get the changes in this version working.

In total I’ve got log files for 1,668 Double Match Triangulator runs dating back to my first prototype run on June 26, 2016 when I first added the log file.