Double Match Filtering for an Endogamous Population - Sun, 22 Jan 2017
A few days ago, Roberta Estes posted: Concepts – Segment Size, Legitimate and False Matches where she compared a child’s matches against those of her parents. She downloaded the Chromosome Browser Results (CBR) file from FamilyTreeDNA for a set of parents and a child, and then explained how she did the matching in a spreadsheet.
Roberta’s key result was a Parent Child Phased Segment Match Chart, which shows that false matches pass the 50% mark for 7 to 7.99 cM segments and rise to 87% once segments are as small as 3 to 3.99 cM.
Roberta refers to this technique as “double parent phasing” (no caps) whereas I’d like to call it “Double Match Filtering” (with caps). I’m naming it this because it is exactly the same technique I use for what I call Double Match Analysis.
Here is what is being done. We take a child as Person a and one parent as Person b, and find all the Person c people who match both. Then we do it a second time, with the same child as Person a and the other parent as Person b, and again find the Person c people who match both. Using these two sets of Double Matches, we go back to all the child’s Single Matches and see which do not double match with either parent. Those non-matches cannot be Identical by Descent (IBD), since one parent would have had to carry the matching segment in order to pass it down from the ancestor to the child.
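As a rough illustration, the filtering step can be sketched in Python. This is only a sketch, not the spreadsheet formulas used for this post. The `Segment` fields, the function names, and above all the rule that "any overlap on the same chromosome with the same Person c counts as a double match" are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical record for one row of a match file.
    match_name: str   # Person c
    chromosome: str
    start: int        # start position of the segment
    end: int          # end position of the segment
    cm: float         # segment size in centimorgans


def overlaps(a, b):
    """True if two segments share any position on the same chromosome.
    Assumption: any overlap at all counts as a double match."""
    return a.chromosome == b.chromosome and a.start <= b.end and b.start <= a.end


def index_by_match(segments):
    """Group a person's segments by (Person c, chromosome) for fast lookup."""
    index = defaultdict(list)
    for seg in segments:
        index[(seg.match_name, seg.chromosome)].append(seg)
    return index


def classify(child, father, mother):
    """Label each of the child's single matches as 'father', 'mother',
    'both' or 'neither', depending on which parent double matches it."""
    f_idx = index_by_match(father)
    m_idx = index_by_match(mother)
    results = []
    for seg in child:
        key = (seg.match_name, seg.chromosome)
        on_father = any(overlaps(seg, p) for p in f_idx.get(key, []))
        on_mother = any(overlaps(seg, p) for p in m_idx.get(key, []))
        if on_father and on_mother:
            label = "both"
        elif on_father:
            label = "father"
        elif on_mother:
            label = "mother"
        else:
            label = "neither"   # cannot be IBD under the reasoning above
        results.append((seg, label))
    return results
```

Segments labelled "neither" are the non-matches. A stricter rule, such as requiring a minimum amount of overlap, would change the counts, so the overlap test is the part to adjust first.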
The high percentage of false matches for small segments under 8 cM in Roberta’s results is what scares genealogists away from using small segments. And this is the downfall of Single Match Triangulation: a large number of small single matches are likely false and are not IBD.
Towards the end of her article, Roberta said:
“I hope that other people in non-endogamous populations will do the same type of double parent phasing and report on their results in the same type of format. This experiment took about 2 days.
Furthermore, I would love to see this same type of experiment for endogamous families as well.”
An Endogamous Family
I’ve had plans to do this anyway. I need to analyze how the matches pass down as part of my investigation into methods to use Double Match Triangulation to map segments onto ancestors.
So I’m taking a number of Chromosome Browser Results files that were sent to me by Arnold, a DNA-cousin of mine, to help me develop my Double Match Triangulator program and see if I can use it to figure out how we’re related.
(By the way, I define a “DNA-cousin” or “DNA-relative” as someone who is a DNA match, but where neither of us has the foggiest idea of how we’re actually related.)
Arnold has been doing DNA analysis with FamilyTreeDNA for a long time, and he had about 20 CBR files that he let me use. He, like me, comes from an endogamous Ashkenazi population.
His files include a father, mother, son and daughter, as well as other relatives of those four. An endogamous population gives those involved many more matches than you’d expect. That’s because everybody is related to everybody else, often in multiple ways. Here are the statistics for the four people I’ll use:
The father has 163,249 single match segments with 7,654 people.
The mother has 149,083 single match segments with 7,139 people.
The daughter has 146,767 single match segments with 7,271 people.
The son has 142,066 single match segments with 7,014 people.
To add an interesting complication, the father and mother are related. They match each other on 25 segments totalling 98.0 cM, with the longest being 18.9 cM. This would normally make them something like 3rd cousins. But because of endogamy, they are more likely 5th and 6th cousins in several different ways.
The Spreadsheet Analysis
I basically did what Roberta said to do. I did it twice, once for the son with his parents, and once for the daughter and parents. Each file has about 450,000 lines in it. These are big Excel files that ended up (with analysis equations) being about 80 MB in size each.
I didn’t delete the segments under 3 cM like Roberta did. She was visually inspecting each match herself, so wanted a manageable number of matches to work with. Her non-endogamous CBR files had about 25,000 segment matches in each one, and removing the under 3 cM ones left her with about 6,000 matches in each, for a total of 18,000 lines to work with, and that was plenty to provide reasonable results.
I was able to develop Excel formulas to do the match comparisons that Roberta did by hand. Since I was letting the computer do the work, I didn’t need to cut down the size of the analysis and I could work with the whole dataset.
Roberta didn’t mention it, but you do have to remove the father, mother and child wherever they appear as the “MATCHNAME”. They all match each other on many segments, including the father and mother as I mentioned above. You don’t want to count those in these statistics.
Also, it’s really important to check the download dates of the two parents’ files and the child’s file. If they were not downloaded at the same time, a later download will contain matches to people that an earlier download did not. This will make it look like one person matches and the other does not, when what is really true is that you just don’t have the matches for the other person.
These one-sided matches had to be eliminated. I found the best way was to see if the child had matches to a Person c that neither their mother nor their father had. For this Person c to show up in the child’s match list, they had to have at least a half dozen matches totalling at minimum around 20 cM. For that to happen and for none of those segments to match either parent is practically impossible, meaning the matches for the parents are missing. So I deleted these from the analysis. They amounted to about 5% of the matches and did not really change the results, other than reducing the number of large segments that did not match.
And because the parents were related, I knew there would be some matches that would be on both parents’ sides, so I made sure I was able to count those so I’d have them for future analysis.
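The two clean-up steps described above (removing the trio from the “MATCHNAME” column, and dropping any Person c who appears in the child’s file but in neither parent’s file) can be sketched in Python as follows. This is an illustration under assumptions, not the Excel approach used for this post: rows are assumed to be plain dictionaries with a "MATCHNAME" key, and the function names are made up for the example.

```python
def remove_trio(segments, trio_names):
    """Drop rows where the matched person is the father, mother or child."""
    trio = set(trio_names)
    return [row for row in segments if row["MATCHNAME"] not in trio]


def drop_one_sided(child, father, mother):
    """Drop the child's rows for any Person c who is absent from BOTH
    parents' files, on the assumption that the parents' downloads are
    simply missing that person. Returns (kept_rows, number_dropped)."""
    parent_people = {row["MATCHNAME"] for row in father}
    parent_people |= {row["MATCHNAME"] for row in mother}
    kept = [row for row in child if row["MATCHNAME"] in parent_people]
    return kept, len(child) - len(kept)
```

Returning the number of dropped rows makes it easy to check what share of the child’s matches was removed by this step.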
The Double Match Phasing Results
These results include only matches on the 22 autosomal chromosome pairs. The X chromosome is a bit different, so I removed the X matches and will analyze them separately in a later post.
Here are the results of the daughter versus her father and mother:
And the results of her brother (the son) versus the same father and mother were very similar:
The results showed that there was much less chance of a non-match in small segments for these endogamous people than what Roberta was showing in her results. The 50% point, which I highlighted in yellow, comes in at the 2 – 3 cM range, as compared to Roberta’s 50% point, which comes in at the 7 – 8 cM range. This surprised me so much that I went back and double and triple checked my equations to make sure they were identifying segments correctly and totalling everything correctly. They were.
Here is a plot of % Non-matches by segment size from several different analyses. In addition to my results and Roberta’s results, I’m including John Walden’s False Positive both sides phased results that are on the ISOGG Wiki, which Blaine Bettinger talks about in his “Small Matching Segments – Friend or Foe” article of 2014. I’m also including Ann Raymont’s findings in her “When is a match a false positive?” post from 2016.
It seems that every other study, all on non-endogamous populations, gives similar results, but mine is different. I currently do not know why this is. I can’t think of a reason why endogamy might give fewer non-matches for a given segment size. My number of observations is certainly large enough. So unless my analysis is being done differently (or incorrectly), and I don’t believe it is, I think I may be showing something quite significant and relevant.
Among the 68 Chromosome Browser Results files that I have, including those my DNA-relatives have given me, this father/mother/son/daughter set was the only one with both parents and a child. I would like to test some more, both endogamous and not.
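For readers who want to reproduce this kind of summary, here is a minimal Python sketch of tallying non-matches by segment size and locating the 50% point. It assumes each of the child’s segments has already been labelled as matching the father, the mother, both, or neither. The 1 cM bin width, the `max_bin` cap and the function names are choices made for this example, not a description of the spreadsheet.

```python
import math
from collections import defaultdict


def non_match_by_size(rows, max_bin=10):
    """rows: iterable of (cm, side) pairs, where side is one of
    'father', 'mother', 'both' or 'neither'.
    Segments are grouped into 1 cM bins (1 to 1.99, 2 to 2.99, ...);
    everything at or above max_bin cM goes into one final bin.
    Returns {bin_start: (total, non_matches, percent_non_match)}."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for cm, side in rows:
        bin_start = min(math.floor(cm), max_bin)
        totals[bin_start] += 1
        if side == "neither":
            misses[bin_start] += 1
    return {
        b: (totals[b], misses[b], 100.0 * misses[b] / totals[b])
        for b in sorted(totals)
    }


def largest_majority_false_bin(table):
    """Largest bin whose non-match rate is at least 50%, or None."""
    over = [b for b, (_, _, pct) in table.items() if pct >= 50.0]
    return max(over) if over else None
```

The value returned by `largest_majority_false_bin` is the bin where non-matches last reach half of all segments, which is the figure compared between studies here.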
I made my analysis spreadsheet quite general so that I could easily do this analysis for any father/mother/child triplet. If you’re interested in seeing what your non-match percentage looks like and would like to help me with this research that I’ll use to give my Double Match Triangulator program some smarts, please send me your set of CBR files. In return, I’ll be happy to send you the spreadsheet with your data in it and the results.
So if you have any set of CBR files from FamilyTreeDNA that includes both parents and one or more children, would you be willing to send them to me so that I can analyze them the same way? Thanks.