Louis Kessler’s Behold Blog

Comparing Raw Data from 5 DNA Testing Companies - Fri, 31 Aug 2018

In March 2017, I compared my DNA raw data from Family Tree DNA against my DNA raw data from MyHeritage DNA.  I had tested with FTDNA at home on Nov 25, 2016 and with MyHeritage DNA at RootsTech on Feb 10, 2017.

Since then, I ordered tests and tested at home with 23andMe on Dec 2, 2017, AncestryDNA on Dec 12, 2017, and Living DNA on June 23, 2018. So I now have five sets of my own Raw Data from different testing companies that I can compare.

You never know what the companies are doing, so just to make sure, I downloaded my Build 37 raw data from Family Tree DNA again and compared it with the download I did on Jan 12, 2017. Nothing had changed. The files were identical. That’s good. 
   

Raw Data File Contents

All 5 companies list your SNP (Single Nucleotide Polymorphism) data, one per line. Some companies include some lines of text description at the top, followed by a title line naming the fields, followed by the SNP data. Here for example is the beginning of my Ancestry DNA raw data file:

image

And this is the beginning of my Family Tree DNA raw data file:

image

Here’s a comparison of the five DNA tests I took and the raw data files I got from them:

image

Family Tree DNA and MyHeritage DNA files are both set up similarly as .csv files (comma delimited) with fields put in double quotes. The other 3 companies use plain text files separating fields with a space or tab. Both types of files can easily be loaded into Excel and the fields will be placed properly into columns for you.

The first field for each SNP in all the files is the RSID (Reference SNP cluster Identifier) which basically is a name for the SNP. I checked, and in each raw data file, no RSID was listed more than once.

The RSID is followed by the chromosome number and the position in base pairs on the forward strand that the SNP is located on the chromosome. The position of the SNP can change when the powers that be come out with a new “build” of the genome. Several years ago, Build 36 was the standard, but most companies now use Build 37. They have already come out with a Build 38, but so far all of the companies are sticking to Build 37 because it really is a lot of work to change for little gain with regards to matching people to each other. All 5 of my raw data files are from Build 37, so (theoretically at least) the chromosome and position of any SNP should match. I’ll check that later in this article in the section: “RSIDs with more than one Position”.

The value of the SNP is called “result” by Family Tree DNA and MyHeritage DNA, “allele1 and allele2” by AncestryDNA, and “genotype” by 23andMe and Living DNA. Ancestry DNA puts a space between the two allele values. The other companies list the two alleles together as a single 2 character string.

The SNPs from all five companies are listed by chromosome and then by position within the chromosome. Chromosomes 1 to 22 (the autosomes) are listed first. The sex chromosomes X and Y and the mitochondrial MT follow.  Ancestry DNA numbers X as 23 and 25, Y as 24 and MT as 26. Ancestry uses 25 for the few SNPs that they probe that are in the pseudoautosomal region of the X and Y chromosomes. These are the tips of the X that actually combine with the Y chromosome just like autosomal genes do.
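If you are combining files in code, Ancestry DNA's numeric chromosome codes need to be translated before the files can be compared. Here is a minimal Python sketch of that mapping; the function names are my own invention for illustration, and folding chromosome 25 into X simply follows how 23andMe reports those SNPs.

```python
# Ancestry DNA's numeric codes for the non-autosomal chromosomes,
# as described in the paragraph above. 25 (pseudoautosomal) is folded into X.
ANCESTRY_CHROMOSOME_NAMES = {"23": "X", "24": "Y", "25": "X", "26": "MT"}


def normalize_chromosome(code: str) -> str:
    """Return '1'..'22', 'X', 'Y' or 'MT' for a chromosome field from any file."""
    code = code.strip().upper()
    return ANCESTRY_CHROMOSOME_NAMES.get(code, code)


def is_ancestry_pseudoautosomal(code: str) -> bool:
    """True if the field is Ancestry's chromosome 25."""
    return code.strip() == "25"
```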

Family Tree DNA embeds a 2nd title line between the last SNP on the 22nd chromosome and the first SNP on the X chromosome. Don’t get caught by this. Be sure to remove this second title line if you are analyzing a Family Tree DNA raw data file in a spreadsheet or with programming.
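For those who prefer programming to a spreadsheet, the file layouts described above can be read with one small routine. This is only a sketch under some assumptions: that description lines start with "#", and that any title line (including Family Tree DNA's second one) has "rsid" as its first field. The `Snp` type and `parse_raw_data` name are hypothetical.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Snp(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_raw_data(lines: Iterable[str]) -> Iterator[Snp]:
    """Yield SNPs from comma-delimited or space/tab-delimited raw data lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or description line (assumed to start with '#')
        if "," in line:
            fields = next(csv.reader([line]))  # handles the double-quoted fields
        else:
            fields = line.split()  # space or tab separated
        if len(fields) < 4 or fields[0].lower() == "rsid":
            continue  # title line, wherever it appears in the file
        # Ancestry has two allele fields; joining them gives a 2-character value.
        yield Snp(fields[0], fields[1], int(fields[2]), "".join(fields[3:]))
```

Because title lines are skipped wherever they occur, the second title line before the X chromosome in a Family Tree DNA file does no harm.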


RSIDs and SNPedia

The RSID, which you can think of as the name of the SNP, is usually represented by the letters “rs” followed by a number. The SNPedia has information on a fair percentage of these RSIDs and you can look them up to find out what that particular SNP has been found to do.  For example, the entry for rs1815739 in SNPedia will tell you that this SNP is on chromosome 11 at position 66560624, is part of Gene ACTN3, and is said to have an effect on muscle performance. Values of (C,C) could contribute to better performing muscles, (C, T) is a mix of muscle types, and (T,T)  could contribute to impaired muscle performance. Medical interpretation of SNPs is not something I have any experience with, so I will make no attempt to do that.

When testing companies test SNPs that do not already have an RSID defined, they often invent their own. 23andMe has used “i” followed by a number. Family Tree DNA and MyHeritage DNA have used “VG” followed by the chromosome number followed by “S” followed by a number. And Living DNA came up with a whole set of different RSID names, each of which must have some meaning to them. In my raw data, I found the following number of SNPs with these prefixes:

image
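A count like the one in the table above can be produced by grouping identifiers on their leading letters. The sketch below is one way to do it in Python; the helper names are hypothetical and it simply treats the run of letters at the start of the identifier as the prefix.

```python
import re
from collections import Counter


def rsid_prefix(rsid: str) -> str:
    """Return the leading letters of an identifier, e.g. 'rs', 'i' or 'VG'."""
    match = re.match(r"[A-Za-z]+", rsid)
    return match.group(0) if match else ""


def count_prefixes(rsids):
    """Count how many identifiers start with each prefix."""
    return Counter(rsid_prefix(r) for r in rsids)
```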

At the time I’m writing this, the number of SNPs defined in SNPedia is 109,335. SNPedia says that 49,082 of those are tested by Ancestry.com’s v2 platform and 24,761 by 23andMe’s v5 platform with 16,453 in common between them. There are 13,916 tested by Family Tree DNA. MyHeritage DNA has about 12,000 entries and Living DNA has about 22,000. They say that 1,504 of their defined SNPs are in common to most platforms.


Number of SNPs by Chromosome

All companies read and provide raw data for the SNPs from the autosomes (chromosomes 1 to 22) as well as the X chromosome. MyHeritage DNA, Ancestry DNA and 23andMe provide Y chromosome SNPs. Ancestry DNA and 23andMe provide mitochondrial (MT) SNPs.

Below is the number of SNPs by chromosome in my raw data:

image

You’ll notice that the FTDNA and MyHeritage numbers of SNPs are identical for all the autosomes and differ by only 16 for the X chromosome. That’s because both companies use the same chip and the same Gene By Gene lab (Gene By Gene is the parent company of Family Tree DNA). Differences in the reads between the two are indicative of the error rate in one set of raw data. My analysis last year that compared the two sets of raw data found 42 differences out of 702,442 autosomal SNPs, indicating an error rate less than 0.01%. MyHeritage does include some Y chromosome results in its raw data, but Family Tree DNA does not.


Ancestry’s X Chromosome in More Detail

Ancestry divides its X data into what it calls chromosomes 23 and 25. The latter is said to represent the pseudoautosomal region which I described earlier. The 27,973 X SNPs in my Ancestry DNA raw data are made up of 27,473 chromosome 23 SNPs and just 500 pseudoautosomal chromosome 25 SNPs.

This is the range of positions and counts of my designated chromosome 23 versus chromosome 25 SNPs:

image

Ancestry DNA’s Chromosome 25 regions in my raw data include 339 SNPs up to position 2,697,868 which is the starting tip of the X chromosome and is the first pseudoautosomal region. And then there are 63 SNPs at the ending tip of the chromosome in the second pseudoautosomal region.

For some reason, Ancestry DNA assigns 13 SNPs from 2,700,157 to 8,549,940 to chromosome 25 even though they are outside the official region (up to 2.7 Mbp), in a range where it also assigns 1,256 SNPs to chromosome 23. Then between 88,720,459 and 92,164,248, they have another 84 SNPs assigned to chromosome 25, and I’m not sure why.

The SNP designated 25 at position 117,610,641 in my raw data file is all alone and is likely an incorrect entry by Ancestry DNA.

138 of those Ancestry chromosome 25 SNPs are also included in my raw data from 23andMe, who simply include them as an X chromosome SNP and don’t differentiate them like Ancestry DNA does.


SNPs in common between companies

It is quite important to know how many SNPs are shared between companies. I compared my 5 sets of raw data in pairs and counted the SNPs shared. The numbers on the diagonal in bold are the number of SNPs in my raw data just from that company. The numbers below the diagonal are the number shared. The percentages above the diagonal are the percent shared out of the total SNPs that the two companies have = #shared / (#c1 + #c2 – #shared)

image

The first table shows the shared autosomal SNPs that I have between my raw data files from the five companies.

Below that are the comparable numbers from the Autosomal SNP comparison chart at the ISOGG Wiki. The FTDNA number 698,179 that I’ve marked in their chart has to be wrong because it can’t be less than the number FTDNA shares with MyHeritage. The numbers are fairly close to mine. I know from looking at several different people’s raw data from Family Tree DNA, that there is variation in the number of SNPs included in one company’s raw data from test to test.

Family Tree DNA and MyHeritage DNA provide identical autosomal SNPs. They share about 44% with AncestryDNA. 23andMe and Living DNA who both use the v5 chip share over 90% with each other, but only about 14% with the other companies. Only 110,231 autosomal SNPs were included in my raw data by all five companies.

Those low overlap percentages are what make it difficult to find matching segments between data from the v5 chip and data from the old chip. Some companies like Family Tree DNA do not yet accept transfers of raw data from 23andMe or Living DNA because of that. MyHeritage DNA uses imputation to estimate the missing SNPs. GEDmatch is still working to develop a more reliable method to compare v5 chip data with earlier data through its GEDmatch Genesis project.
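The percent-shared figure used in the tables, #shared / (#c1 + #c2 – #shared), is straightforward to compute once each file's SNPs are in a set. A small Python sketch follows; it assumes SNPs are identified by RSID, and the function name is hypothetical.

```python
def percent_shared(snps_a: set, snps_b: set) -> float:
    """Percent of SNPs shared out of all SNPs found in either file."""
    shared = len(snps_a & snps_b)
    total = len(snps_a) + len(snps_b) - shared  # size of the union
    return 100.0 * shared / total if total else 0.0
```

For example, two files of three SNPs each that have two SNPs in common share 2 / (3 + 3 – 2) = 50% of their combined SNPs.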

Here’s the same data, but for the X chromosome:

image

The ISOGG Wiki doesn’t yet have X data in their table for MyHeritage DNA, Living DNA or the new v5 chip of 23andMe.

Here are my tables for the Y chromosome and for mitochondrial.

image


RSIDs with more than one Position

All my raw data files were from Build 37 of the genome. So every RSID should map to one SNP on one specific chromosome at one position. That was true within any one set of raw data, where every RSID was just given once.

But once you combine multiple sets of raw data, you’ll find the same RSID tested in different files. This is the count of the number of RSIDs by the number of files each was found in:

image

So you would expect those RSIDs that are in more than one raw data file to be at the same position on the same chromosome in each file. It turns out that in my files 68 of those RSIDs are not at exactly the same position.

All but 1 are differences with the 23andMe raw data. And most of them are minor.

29 differences have the 23andMe position being just 1 less than the Living DNA position, e.g.  RSID rs498648 is on chromosome 1. In my 23andMe raw data file it is at position 176,957,452 and in my Living DNA file, it is at position 176,957,453. Now this is just 1 position different and isn’t important at all for genealogical purposes. But for a programmer who may want to develop tools for handling raw data, even a difference of one can cause a problem. None of these 29 differences have RSIDs that are in the other 3 raw data files or in SNPedia, so I can’t tell which one might be the correct one.

34 of the differences are very small ones on the mt chromosome where 23andMe is 1 more (31 times), or 2 more (twice) or 3 more (once) than the Ancestry DNA position. e.g. for RSID rs118203886 Ancestry DNA lists position 611 on chromosome 26, and 23andMe lists position 613 on chromosome MT. Of these RSIDs, 32 are listed in SNPedia and SNPedia agrees with Ancestry DNA in all cases.

One more difference is SNP rs3857360 which is in both my Family Tree DNA and my MyHeritage DNA raw data files as position 102,989,428 on chromosome 5, but has a position one higher at 23andMe. This SNP is not in SNPedia.

But there are four differences between 23andMe and Living DNA that concern me the most because the RSID is used for two completely different locations. These 4 are:

image

Two of the values at 23andMe are no-calls, but of the other two, one doesn’t match: it is TT at 23andMe and AA at Living DNA. That already is indicative that these might be different SNPs that one of the companies has named incorrectly. None of these four SNPs are in SNPedia.
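A programmer building tools for raw data could run the check from this section roughly as follows. This is a sketch, not the method used for the article: it assumes each file has already been reduced to (rsid, chromosome, position) tuples with chromosome names normalized (so Ancestry's 26 is already MT), and the function name is made up.

```python
from collections import defaultdict


def rsids_with_multiple_positions(files):
    """files maps a company name to an iterable of (rsid, chromosome, position).

    Returns {rsid: {(chromosome, position): [companies]}} for every RSID
    that is reported at more than one location.
    """
    seen = defaultdict(dict)
    for company, snps in files.items():
        for rsid, chromosome, position in snps:
            seen[rsid].setdefault((chromosome, position), []).append(company)
    return {rsid: locs for rsid, locs in seen.items() if len(locs) > 1}
```

The size of the position gap can then be used to separate the off-by-one cases from the ones where an RSID points at two completely different locations.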


Positions with more than one RSID

So there were only 68 RSIDs with different positions, and only 4 of them were bad.

However, there are many more positions that have more than one RSID.

I found quite a number of SNPs on a chromosome at a specific position, where a different RSID was used for that SNP.

image

From my 5 raw data files, I had as many as 4 different RSIDs at a specific position.

For example, Chromosome 7, position 117,174,424 has these RSIDs:

  1. rs78440224 in AncestryDNA and Living DNA raw data
  2. i5010947 in 23andMe raw data
  3. i5053851 in 23andMe raw data
  4. VG07S45007 in Family Tree DNA and MyHeritage DNA raw data.

And if you look up rs78440224 in SNPedia, sure enough, they say that SNP is named i5010947 and i5053851 by 23andMe. It doesn’t happen to mention the fourth name though. (And I was happy to see that all four of those SNPs in my raw data have the value GG, which is not the cystic fibrosis carrier.)

The i5010947 and i5053851 RSIDs in the 23andMe raw data file mean that there are two names for the same SNP in the same file. Cases like this will cause the position to occur more than once in the raw data file.
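The reverse check, finding positions that carry more than one name, is a grouping by chromosome and position. Here is a minimal Python sketch using the chromosome 7 example above as input; the function name is hypothetical.

```python
from collections import defaultdict


def positions_with_multiple_rsids(snps):
    """snps is an iterable of (rsid, chromosome, position) from all files combined.

    Returns {(chromosome, position): sorted list of RSIDs} for every
    position that has more than one distinct RSID.
    """
    names = defaultdict(set)
    for rsid, chromosome, position in snps:
        names[(chromosome, position)].add(rsid)
    return {loc: sorted(rsids) for loc, rsids in names.items() if len(rsids) > 1}


example = [
    ("rs78440224", "7", 117174424),  # AncestryDNA and Living DNA
    ("i5010947", "7", 117174424),    # 23andMe
    ("i5053851", "7", 117174424),    # 23andMe
    ("VG07S45007", "7", 117174424),  # Family Tree DNA and MyHeritage DNA
]
duplicates = positions_with_multiple_rsids(example)
```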


Analysis of the Allele Values

This is what we’ve really been trying to get to. Let’s first see what allele values there are from each company.

image

The allele pair corresponding to the alleles on the forward strand of both parents’ chromosomes is given as two letters, with A, C, G and T being the possible choices. Ancestry lists the two alleles as two separate letters, but I’ve put them together in the above table.

Since it is unknown which of the two letters belongs to which parent, the order of display of the two letters is arbitrary. The standard practice is to order the two letters alphabetically, so if you have the choice of AC or CA, then you would use AC. For the most part, the companies follow this standard, but you can see very odd exceptions, e.g. MyHeritage DNA and AncestryDNA both using TC and TG instead of CT and GT. Living DNA often uses both orderings, and unless they’ve thought up something innovative, I doubt the order for a specific value means anything.

23andMe includes values for insertions (II) and deletions (DD) and even has a few deletion/insertions (DI).

Two dashes “--” represent no-calls. These are positions where the values were not able to be determined. AncestryDNA uses two zeros: “00”. For matching purposes, no calls are treated as a match.

When a single letter is given, it is for a chromosome that is not in a pair. Since I’m a male, I have a single X chromosome from my mother and a single Y from my father and everybody’s mt chromosome comes just from their mother. 23andMe uses the single letter designation in this case, but the other companies duplicate the letter.

In order to compare allele values between companies to see if the readings are the same (in the next section), I’ll need to standardize the notation. I’ve chosen to use 2 letters and order them alphabetically as in the “Standardized” column of the above table.
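The standardization just described (2 letters, ordered alphabetically, one symbol for no calls) can be written as a short function. This Python sketch is only an illustration: the name `standardize` and the choice of "--" as the common no-call symbol are my assumptions, and insertion/deletion codes are simply sorted like any other letters.

```python
NO_CALL = "--"


def standardize(genotype: str) -> str:
    """Return a 2-character genotype with its letters in alphabetical order."""
    value = genotype.replace(" ", "").upper()  # Ancestry separates the alleles
    if not value or set(value) <= {"-", "0"}:
        return NO_CALL  # '--' from most companies, zeros from AncestryDNA
    if len(value) == 1:
        value *= 2  # 23andMe's single letter for X, Y and mt in a male
    return "".join(sorted(value))
```

With this, TC becomes CT, a single A becomes AA, and both "--" and "00" become "--".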

When a value cannot be determined during the test, it is given what is known as a no call and is denoted by two hyphens by most companies, but by two zeros by AncestryDNA. The percentage of no calls is a very important statistic and indicates the quality of the test results. A no call percentage of 3% or more is on the high side and the company may be willing to get new results from your sample or get you to re-test. My results from the five companies range from a low of 0.4% no calls at AncestryDNA to a high of 3.0% at 23andMe.

Below is my standardized table of counts for my autosomal chromosomes:

image

It’s interesting that Living DNA did not find any AT or CG values.

For the X chromosome below, I’ve marked the invalid values. Since I’m male and I only have one X chromosome, values with two different letters are impossible.

image

Next is the Y chromosome. There is a high number of no calls and invalids in the MyHeritage Y DNA data.

image

Only AncestryDNA and 23andMe include the mt chromosome in the raw data:

image


Comparing reads between companies

Now the interesting question. Do the different companies give the same values?

To do this, I re-sorted my combined file of results by chromosome and position, and merged the results for identical positions (SNPs with different RSIDs) together. If any of the readings of the SNPs at the same position conflicted, I was prepared to mark the value at the position as a no-call, but fortunately none did.

I did my analysis and summarized it with the following table:

image

So this table includes the 1,389,750 unique positions that were tested by my five companies. There were 3,346,178 readings in total, so that’s an average of 2.4 readings per position.

I’ve grouped the positions by the number of companies that read from that position from my five sets of raw data and then by the number of those reads that were no calls.

For example, the first line says that 111,872 positions were read by all five companies. Only 19 of those have a disagreement among the 5 companies. For those 19 positions where there are disagreements, I would change the value to a no call, so 19 x 5 = 95 values will get changed to a no call.

The second line says that 2,353 positions were read by all five companies, but in each case one of the companies had a no call. Only 14 of those have a disagreement among the 5 companies. A no call does not count as a disagreement. For the 2,339 agreements, the no call can be given the value that the other companies agreed upon. For the 14 positions where there are disagreements, I would change the value to a no call, so 14 x 4 = 56 values will get changed to a no call.
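The rule applied in these two examples (fill in no calls when everyone else agrees, blank out the position when companies disagree) can be sketched in Python as follows. It assumes the readings for a position have already been standardized, with "--" for a no call; the function name is hypothetical.

```python
NO_CALL = "--"


def consensus(readings):
    """readings: the standardized genotypes reported for one position.

    Returns the agreed value if all called readings agree, otherwise a
    no call (either nobody called it, or two companies disagree).
    """
    called = {r for r in readings if r != NO_CALL}
    if len(called) == 1:
        return called.pop()
    return NO_CALL
```

A no call never counts as a disagreement here, so four AA readings and one "--" give AA, while any two different called values give "--".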

In total, there are only disagreements between 2 or more companies at 665 positions, which is only 0.05%. That’s very good!

By doing this, I can assign 42,230 values to no call readings and only have to assign no calls to 1,692 readings. That reduces the number of no call readings from 73,127 to 73,127 – 42,230 + 1,692 = 32,589. So I have effectively reduced my percentage of no calls down from 2.19% to 0.97% of the readings the companies supplied to me.
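The arithmetic in the last few paragraphs is easy to re-check. The Python lines below just redo the calculation from the totals quoted in the text; the variable names are mine.

```python
total_positions = 1_389_750
total_readings = 3_346_178
no_calls_before = 73_127
filled_in = 42_230        # no calls given the value the other companies agreed on
newly_no_call = 1_692     # readings blanked out because companies disagreed

no_calls_after = no_calls_before - filled_in + newly_no_call
pct_before = round(100 * no_calls_before / total_readings, 2)
pct_after = round(100 * no_calls_after / total_readings, 2)
readings_per_position = round(total_readings / total_positions, 1)
pct_disagreeing_positions = round(100 * 665 / total_positions, 2)
```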


Creating a combined raw data file

Well, it seems like I should take the next step and create a raw data file from these 1,389,750 positions.

I noted, but forgot to correct my X’s and Y’s earlier that were impossible values for me because they were not double letters. So that adds 304 no calls to my X values and 57 no calls to my Y values.
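The correction for impossible X and Y values can also be automated. This sketch applies only to a male tester, follows the simple rule used in this article (two different letters on X or Y is treated as invalid), and assumes standardized 2-letter genotypes with chromosome names X and Y. The function name is hypothetical.

```python
NO_CALL = "--"


def fix_single_copy_value(chromosome: str, genotype: str) -> str:
    """For a male tester, turn two-different-letter X or Y values into no calls."""
    if chromosome in ("X", "Y") and len(set(genotype)) > 1:
        return NO_CALL
    return genotype
```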

Here’s a summary of what I’ve got, with a comparison of what I got from the five sets of raw data from the five companies. My percent no calls is shown on the bottom line.

image

Note that 23andMe gave me 4,301 mt readings but I only have 2,483. That’s because 23andMe’s mt data included many SNPs with identical positions and I merged SNPs with the same position into one. In all cases, the SNPs that got merged all had the same value.

Now which company’s raw data should I emulate? The goal would be to create a raw data file that other utilities can read. Since I’ve got v5 chip data, I likely should use either 23andMe or Living DNA’s format. 23andMe is the only company that includes insertions and deletions, so I’ll use their format and follow 23andMe’s naming convention and name the file: genome_Louis_Kessler_v5_Full_20180831124000.txt.

23andMe uses tabs rather than spaces in between fields, so I used my text editor and converted all the spaces to tabs.

The first 10 data lines in the original 23andMe data file I got from them were:

image

And the first 10 lines of the file I have manufactured are:

image

Note there are extra SNPs from the other companies, and that SNP i713426 whose value was a no call in the 23andMe file is now filled in because the value AA was provided to me by Living DNA.

So this file is 35 MB in size and has 1,389,770 lines that include my 1,389,750 SNPs plus 19 description lines and one title line at the top.
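I built my file with Excel and a text editor, but writing a tab-separated file like this from a program is simple. The Python sketch below is an illustration only: it assumes the description lines and the title line all begin with "#", and the function names and title text are my own choices rather than an official specification of the 23andMe format.

```python
def build_lines(description_lines, snps):
    """Return the lines of a tab-separated raw data file.

    snps is an iterable of (rsid, chromosome, position, genotype).
    """
    lines = [f"# {text}" for text in description_lines]
    lines.append("# rsid\tchromosome\tposition\tgenotype")  # title line (assumed)
    lines.extend(f"{rsid}\t{chrom}\t{pos}\t{gt}" for rsid, chrom, pos, gt in snps)
    return lines


def write_combined_file(path, description_lines, snps):
    """Write the file as plain ASCII with no byte order mark."""
    with open(path, "w", encoding="ascii", newline="\n") as out:
        out.write("\n".join(build_lines(description_lines, snps)) + "\n")
```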

And if you’re curious, the Excel file that I used to do all this analysis for this article is 186 MB in size.


Uploading to GEDmatch Genesis

I entered the file upload information and pressed the Upload button.

image 

It would not take the first file I tried uploading. I compared it to my raw data file from 23andMe and noticed that my file was UTF-8 with a byte order mark at the front of it. I saved the file as an ANSI/ASCII file and then GEDmatch Genesis accepted it without error and identified it as a 23andMe kit type V3.
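If your editor does not offer a way to save without the byte order mark, it can be stripped with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and it refuses to write the file if anything other than plain ASCII remains.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def resave_as_ascii(src_path: str, dst_path: str) -> None:
    """Copy a file, dropping the BOM and checking the rest is plain ASCII."""
    with open(src_path, "rb") as src:
        data = strip_bom(src.read())
    data.decode("ascii")  # raises UnicodeDecodeError if non-ASCII bytes remain
    with open(dst_path, "wb") as dst:
        dst.write(data)
```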

I’m not sure what I’ll do with it yet on GEDmatch Genesis. Maybe I’ll determine how it compares there with my 23andMe kit that I uploaded there back in January.

Any suggestions?


Meanwhile…Full Genome!!!

A full genome for less than $1000. That was the magic goal that labs had been trying to achieve.

A couple of days ago, while researching information for this article, I discovered an unbelievable deal by Dante Labs. They currently are offering Whole Genome Sequencing 30x marked down to $499 from $1000. I don’t know if that price is permanent or not, but it may be. They currently have a coupon code Dazzle4Rare you can use at checkout to save another $100. Global shipping is free.

That Dante Labs deal is so good, I couldn’t resist so I purchased a whole genome sequence for myself for just $399. I should get the test kit next week and then it will take about 10 weeks to process after the lab gets my sample.

Apparently during Amazon Prime Day, they offered it for $349.

So once I get the results back, I’ll see if I can compare the 1,389,750 SNPs that I put together here with the same alleles in my full genome and see what it tells me.

Followup: Sept 4, 2018: I was informed that in the last year or so, MyHeritage DNA stopped reporting about 30,000 SNPs from their tests and includes them now in your raw data as no calls. If I did the test again, I now might get close to 50,000 no calls (6.9%) rather than the 18,700 (2.6%) that I observed.

Followup: June 5, 2020:  I updated the first table comparing my 5 raw data files to include the date I took them and the chip that was used.

Followup: Nov 11, 2021:  Readers of this article might also be interested in my article from Apr 10, 2020: Determining the Accuracy of DNA Tests

Unlock the Past Seattle Conference Livestream and More - Tue, 14 Aug 2018

Gould Genealogy & History @GouldGenealogy is an Australian company started in 1976 by Alan Phillips. They have a physical store in Adelaide, but sell over 5,000 products for genealogists online at their website including books, ebooks, software and stationery. These include many publications they have exclusive rights to that you can’t get anywhere else.

Joining Alan in the company were his son Stephen and daughter Alona. In 2009 they created the company Unlock the Past, that works with local organizations to put on genealogy events and attract top-notch speakers. Initially, they provided this service for Australia, but have since expanded and have put events on worldwide.

Their latest venture is a one-day event in Seattle, Washington on Sept 6, 2018, being called: Unlock the Past in Seattle. It will have two streams – a DNA stream and an Irish/general stream. Speakers will include DNA experts Blaine Bettinger and Maurice Gleeson, as well as Cyndi Ingle of Cyndi’s list and Wayne Shepheard.

If you plan to be in or near Seattle that day, do go. If not, the exciting news is that they are now offering a livestreaming service for $65 USD that will allow you to watch 5 of the presentations live and the other 5 later. There’s also a $50 USD discount coupon on Gould’s Genealogy ebooks so if you take advantage of this, you get a real deal.

For more information about the Livestream, see their Media Release:

image

Following the day in Seattle, over 150 people will be joining the Unlock the Past team along with Maurice Gleeson, Cyndi Ingle and Dick Eastman for the 14th Unlock the Past Genealogy Cruise to Alaska, leaving the day after the Seattle event and returning to Seattle on Sept 14. There you will be able to choose from 43 talks from 18 expert speakers from the USA, Australia, New Zealand, and England.

image

I can personally vouch for these cruises as being the greatest thing a genealogist can do for themselves. I was a speaker on the 3rd and 10th cruises and they were both fantastic. A conference at sea mixes vacation with the pleasure of spending time with an interesting group of like-minded family finders that you quickly become good friends with. My wife who is not a genie joined me and loved the vacation, the comradery, and the time by herself when she knew I was safe listening to a talk. I recommended this cruise to Ed Thompson, developer of Evidentia, and he liked the idea and will be one of the speakers on, and enjoying, the Alaska cruise.

I am personally so sad not to be going on this particular cruise. It unfortunately overlapped with our Jewish High Holidays this year. So I’ll have to miss the opportunity to get together again with my friends from previous cruises and conferences. I’m hoping Alan continues on with the genealogy cruises for at least another year or two (before a well-deserved retirement) so that my wife and I can partake in at least one more.

image

I will, however, be flying to Kelowna, B.C. for the K&DGS Harvest Your Family Tree Conference from Sept 28 – 30, where I will be giving a talk about Double Match Triangulation. I look forward to hearing Blaine Bettinger and Cyndi Ingle speak and meeting them in person. I also look forward to getting together again with Helen Smith and Geoff Doherty (from my past cruises) and Dave Obee (from the OGS in Toronto and a Winnipeg talk of his), meeting the experts there on Canadian genealogy, and seeing who else I run into. Cyndi, Helen and Geoff will have earlier finished the Alaska cruise, so it will be fun to talk to them about that.

Disclaimer:  I am hoping that Alan will give me free access to the Seattle livestream for helping to promote it on my blog. Everything I write about above is my unbiased opinion and is not influenced in any way by whether I get the free access or not.

Behold’s Genetic Relationship Notation (BGRN) – Revised - Sun, 12 Aug 2018

A couple of years ago, I introduced the horrible acronym BGRN to represent a new notation for DNA relationships which I extended to also include non-genetic relationships. Using this notation, one can define precisely how one person is related to a second person. Using just the notation, I can programmatically determine the expected amount of DNA shared between the two people (autosomal, Y, X and mt), and can express in English how the second person is related to the first.

e.g. YXY(YX)xy = male person’s mother’s father’s sister’s son.

Note: I purchased the right to use this graphic

Back then I decided to make it a universal (non-English-centric) notation using the DNA X for a woman and Y for a man, using an uppercase letter for going up to a parent and a lowercase letter for going down to a child.

I was working to implement it into Behold last year, when I got diverted into Double Match Triangulator (DMT) development. Currently, I am trying to finish off DMT version 3.0 and I have found a need for the notation in DMT.

But as I was doing so, I realized something. If a computer is going to handle the notation, then a set of X’s and Y’s and x’s and y’s works fine. It’s not quite as good when people need to be able to read, enter and understand these values. In DMT, I’m going to allow people to enter the relationships of any of their matches that they know. People are not going to want to enter YXY(YX)xy. It is not simple enough and not understandable enough.

So here is my new version of Behold’s Genetic Relationship Notation (BGRN). It is an English-based version (sorry non-English speakers) that uses the initial letters of recognizable English words to designate the genetic connection.

For example, our YXY(YX)xy will in this new notation be:  MFRDS, which translates to: “the person’s mother’s father’s parents’ (both of them) daughter’s son”.  All the letters are uppercase. The “R” represents the paiR of paRents for both the F(ather) and the D(aughter) and indicates that the F and D are full siblings sharing both parents. Using a single letter R rather than grouping F and M together eliminates the need for parentheses like the (YX) had.

Let’s now define all the rules, as I did in the earlier XY version of the notation:


The Behold Genetic Relationship Notation (BGRN) Revised

Behold’s Genetic Relationship Notation defines a string of characters that represent how person A connects to person B. With this string and the sex of person A, you should be able to:

a) Determine the expected amount of DNA shared by the two people, and
b) Describe the relationship in words.

The basic genetic notation uses the following characters to make up the string:

  • F = father
  • M = mother
  • P = parent of unknown sex
  • R = pair of parents, to represent a pair of Common Ancestors (CA)
  • S = son
  • D = daughter
  • C = child of unknown sex
  • T = identical twin, e.g. FT means the “T” is the identical twin of the “F”.
  • Y = man, optional, only used in position 1 if the starting person is male.
  • X = woman, optional, only used in position 1 if the starting person is female.
  • ? = rest of the connection is not known

That’s it. 10 uppercase letters and a question mark in this revised notation, compared to the 4 uppercase and 4 lowercase letters, a number, a hyphen and parentheses of the original notation.

The sex of the starting person is optional. If included, it will be the first character of the string. This may be needed for some genetic analyses to allow determination of whether the Y or X or mt chromosome is possible to be shared between the starting and ending people.

The core rules of the revised notation, for purely genetic relationships, are:

  1. The string optionally starts with X or Y.
  2. This is followed by 0 or more of:  F, M, P.
  3. This may be followed by one R or by one T.
  4. This is followed by 0 or more S, D, C.
  5. It may end in a ?.
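These five rules translate almost directly into a regular expression. The Python sketch below is one possible encoding of them, not code taken from Behold or DMT; the names are hypothetical.

```python
import re

# Rule 1: optional X or Y.  Rule 2: zero or more F, M, P.
# Rule 3: at most one R or T.  Rule 4: zero or more S, D, C.
# Rule 5: optional trailing question mark.
BGRN_GENETIC = re.compile(r"[XY]?[FMP]*[RT]?[SDC]*\??")


def is_valid_genetic_bgrn(notation: str) -> bool:
    """True if the whole string follows the five core rules."""
    return BGRN_GENETIC.fullmatch(notation) is not None
```

Note that under these rules an empty string is also valid and would stand for the starting person.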

Below are some examples of the notation for genetic relationships and the full relationship in words (plus a simplified relationship in parenthesis) that can be generated from it:

YMF = a man’s mother’s father (or maternal grandfather)   
MFR = a person’s mother’s father’s parents (or great-grandparents)    
YMFRDS = a man’s mother’s father’s sister’s son (or 1C1R)
FDDDD = a person’s paternal half-sister’s daughter’s daughter’s daughter
    (or half-great-grand-niece)    
XSS = a woman’s son’s son (or grandson)
XFTD = a woman’s father’s identical twin’s daughter (or 1st cousin).

See how much easier these are to read and interpret than their representation in my original version of this notation, which was:  YXY, UXF(YX), YXY(YX)xy, U(Y)xxx, Xyy and XY2x.
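Generating the long-form wording from a string is a letter-by-letter lookup. The Python sketch below is a simplified illustration, not the code in Behold or DMT: it produces the literal wording only (so FD comes out as “father’s daughter”, not “paternal half-sister”), it does not handle a trailing “?”, and the names are hypothetical.

```python
WORDS = {
    "F": "father", "M": "mother", "P": "parent", "R": "parents",
    "S": "son", "D": "daughter", "C": "child", "T": "identical twin",
}
STARTS = {"Y": "a man", "X": "a woman"}


def possessive(word: str) -> str:
    """parents -> parents', father -> father's"""
    return word + "'" if word.endswith("s") else word + "'s"


def describe(notation: str) -> str:
    """Spell out a genetic BGRN string one step at a time."""
    start = "a person"
    if notation[:1] in STARTS:
        start, notation = STARTS[notation[0]], notation[1:]
    steps = [start] + [WORDS[letter] for letter in notation]
    return " ".join([possessive(w) for w in steps[:-1]] + [steps[-1]])
```

For example, describe("YMF") gives "a man's mother's father" and describe("FPRS") gives "a person's father's parent's parents' son".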

Here are examples of some common relationships:

M = mother
MM = maternal grandmother
PPPM = great-grandmother (unknown side)
PRD = aunt
PRCC = 1st cousin
PRCCC = 1st cousin, once removed (1C1R)
PPRCC = 1st cousin, once removed (the other way)
PRCCCC = 1st cousin, twice removed (1C2R)
PPPRCC = 1st cousin, twice removed (the other way)
PPRD = great-aunt
PPRCCC = 2nd cousin
PPRCCCC = 2nd cousin, once removed (2C1R)
PPPRCCC = 2nd cousin, once removed (the other way)

In the above examples, any of the P’s can be replaced by F’s or M’s, and any of the C’s can be replaced by S’s or D’s.

Hopefully, you’re getting the idea and this seems easier to read than trying to decipher a string of uppercase and lowercase X’s and Y’s.

I won’t go into the calculation of how much DNA is shared since it’s worthy of another post, but let me say that the expected values can be easily obtained from strings written in Behold Genetic Relationship Notation along with the sex of the starting person.


Extending the Notation to Non-Genetic Relations

I still would like to extend this notation to handle more than just Genetic relationships and include all possible genealogical relationships. So let’s define the additional notation:

  • f = non-genetic but legal father
  • m = non-genetic but legal mother
  • p = non-genetic but legal parent of unknown sex
  • r = non-genetic but legal pair of two parents
  • s = non-genetic but legal son
  • d = non-genetic but legal daughter
  • c = non-genetic but legal child of unknown sex
  • h = husband
  • w = wife
  • k = spouse of unknown sex
  • n = unmarried partner of any sex

The nice thing about this is all these non-genetic relationships are lowercase. So that means that as soon as you see a lowercase letter in a relationship, then you know the genetic link is broken and there will be zero DNA shared for this connection.
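That property makes the test for a broken genetic link a one-liner in code. A small Python sketch, with a hypothetical function name:

```python
def genetic_link_broken(notation: str) -> bool:
    """True if any step in the relationship is non-genetic (a lowercase letter)."""
    return any(letter.islower() for letter in notation)
```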

Examples of the extended notation and the relationship in words that can be generated from it:

n = Person’s partner.    
RD = Person’s sister.   
kRDcF = Person’s spouse’s sister’s adopted child’s father.
MMMhDhMh = Person’s mother’s mother’s mother’s husband’s daughter’s husband’s mother’s husband.
RDSFSwRCk = Person’s sister’s son’s paternal half-brother’s wife’s sibling’s spouse.

So BGRN can handle any relationship, no matter how complicated.

And if you notice, I’ve been careful to only include consonants as the letters of the notation. If any vowels would have been included, it would have been possible to create some relationships that would be real words in English, and that is risky as some not-so-desirable words could appear.

I am interested in hearing any and all comments, criticisms and suggestions.


Update: Sept 3, 2018:  I made the change of a pair of parents from “B” or “b” to “R” or “r”. The “B” was taken from “both parents”, but that phrase does not read well when you string them together as in:  “father’s parent’s both parents’ son”. So the “R” now is more indicative of “paRents” and the phrase FPRS can now be generated as:  “father’s parent’s parents’ son”.  Notice the subtlety of the apostrophe before or after the “s” in parents to indicate if there is more than one. If that is too subtle, it could be translated to “father’s parent’s pair of parents’ son”.  Because of this change, I also had to change the spouse or common-law partner of unknown sex from “r” to “t”.

Update: Nov 20, 2018:  I added the optional B and G at the start of the string to indicate the sex of the starting person if that is relevant to the relationship (e.g. for X or Y chromosome purposes).

Update: Jan 2, 2019:  Changed spouse or partner of unknown sex from “t” to “n” so that “t” will not be confused with “T” which is for identical twin.

Update: July 5, 2022:  Changed "t" to be unmarried partner of any sex and added "z" as spouse of unknown sex.

Update: July 14, 2022:  At various times, I’ve seen people ask how to distinguish the two types of removals of cousins, e.g. 3C2R. I like 2U3C to go from the current person up 2 generations and across 3 cousins. And 3C2D to first go across 3 cousins and then down 2 generations.

Update April 9, 2025: In order to create a Connection Count Relationship Notation (CCRN), I wanted to make B available, so I changed what were “B” and “G” to “Y” and “X”.  Also, I wanted Z, so I changed “z” to “k”.