Louis Kessler’s Behold Blog

My Whole Genome Sequencing. The VCF File - Wed, 6 Feb 2019

I received my results from my Dante Labs Whole Genome test last week. I purchased the test last August when I was able to get it for $399 USD. There were two health reports that I requested that are written in ancient Latin as far as my understanding of them goes. Then there were the VCF files which I was more interested in. The FASTQ and BAM files will be sent to me on a hard drive in a few weeks.

A Variant Call Format (VCF) file basically contains the differences between me and the “standard reference human”. There were two VCF files included in my results. One with my individual SNPs which was 143 MB and the other with Insertions and Deletions which was 43 MB. The individual SNP file is of most interest, because it is that file that contains the autosomal SNP data that DNA testing companies use for genealogical matching.

These files are in gz compressed format. When expanded (to 869 MB and 224 MB) they are standard text files and a bit of the individual SNP file looks like this:

[Image: excerpt from the individual SNP VCF file]

My VCF file has a header section of 141 lines. The first line of the file (not shown above) indicates that this file’s format is Version 4.2 of VCF. Another important line in the header is line 139 above, which specifies the reference genome to be ucsc.hg19.fasta. The ucsc is for the University of California Santa Cruz Genomics Institute, who maintain and make available genome information at genome.ucsc.edu. The hg19 refers to the hg19 assembly of the human genome, which is also called Build 37 and is the version of the genome currently used by most of the DNA testing companies. And fasta is a format that lists all the reference values of the genome.

The header in my VCF file is followed by 3,442,712 lines, one for each SNP where I differ from the reference value. “SNP” is an abbreviation for Single-nucleotide polymorphism. The “polymorphism” refers to something that can have more than one form, so when you hear SNP, think of a position on the genome where humans can differ from each other.
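As a rough check of line counts like these, a gzip-compressed VCF can be read directly without expanding it first. The sketch below is only illustrative: the function name and the file name in the usage note are made up, and it assumes every header line starts with "#" (the "##" metadata lines plus the "#CHROM" column line), so the header count it reports may or may not be defined the same way as my 141.

```python
import gzip


def count_vcf_lines(path):
    """Return (header_lines, record_lines) for a gzip-compressed VCF file."""
    header = 0
    records = 0
    # "rt" opens the compressed file as text, decompressing on the fly.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                header += 1   # "##" metadata lines and the "#CHROM" column line
            elif line.strip():
                records += 1  # one non-empty line per variant position
    return header, records
```

Calling count_vcf_lines("my_snp.vcf.gz") on a hypothetical file name would return the two counts as a tuple.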

Each line contains:

  • #CHROM, the chromosome number of the SNP.  My file includes data for Chromosomes 1 to 22, X and Y.
  • POS, the position of the SNP on the chromosome
  • ID, the RSID of the SNP, i.e. a name it is given to reference it.  In my VCF file from Dante, no RSIDs are given and the ID is shown as a period on every line. That’s not a problem, since most DNA matching is done by position, not by RSIDs, which can change positions between Builds.
  • REF, the value of that position on that chromosome in the reference genome, which is one of A, C, G and T. This is usually the SNP value that most people have, e.g. if REF = A, then the pair AA will be the reference value for that SNP, i.e. A from their father and A from their mother.
  • ALT, the alternative values that I have. Usually it is one value, one of A, C, G and T and is different from the REF value. Occasionally it is two values, both different from the REF value, e.g. REF = A, ALT = C,T
  • QUAL, is a number estimating the quality of the read that was done in my test for that SNP. A higher number is better quality. 
  • FILTER, is an evaluation as to whether that SNP’s value is reliable. My file only included SNPs with a filter value of PASS.
  • INFO and FORMAT, contain detailed information about the read at that SNP. The most important field is the AC (allele count) field in INFO. If AC=2, then the ALT value will be both values of the pair. Otherwise the REF value will be the leftover value. When there are two ALT values, both of them make up the pair. e.g.:
    • REF=A, ALT=C, AC=2, then SNP=CC
    • REF=A, ALT=C, AC=1, then SNP=AC
    • REF=A, ALT=C,T, AC=1,1, then SNP=CT

So from this file, using the REF, ALT and AC values on each line, I can compute the SNP value for the position given on the chromosome.
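That computation can be sketched in a few lines of Python. This is a minimal illustration, not the code I actually used: the function name is made up, AC is passed in as the raw text from the INFO field, and a line with two ALT values is assumed to mean that both alleles differ from the reference.

```python
def genotype(ref, alt, ac):
    """Compute the two-letter SNP value from the REF, ALT and AC fields."""
    alts = alt.split(",")
    if len(alts) == 2:
        # Two ALT values: both alleles differ from the reference.
        alleles = [alts[0], alts[1]]
    elif ac.split(",")[0] == "2":
        # AC=2: the ALT value is both values of the pair.
        alleles = [alts[0], alts[0]]
    else:
        # AC=1: the REF value is the leftover value.
        alleles = [ref, alts[0]]
    return "".join(alleles)


print(genotype("A", "C", "2"))    # REF=A, ALT=C, AC=2
print(genotype("A", "C", "1"))    # REF=A, ALT=C, AC=1
print(genotype("A", "C,T", "1"))  # REF=A, two ALT values
```

The three calls correspond to the three examples in the list above.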

These are the counts of each computed SNP value for my file:

[Image: table of counts of each computed SNP value]

Remember that the above counts of homozygous readings (where both alleles are the same: AA, CC, GG or TT) do not include any SNPs which have the same reference value. If they are the same as the reference value, then they are not included in the VCF file.

Also note that since I’m a male, only one allele should be shown for the X and Y chromosomes. I should not have any heterozygous (alleles are different) readings there. These might either be errors in the reads, or maybe they are reading the pseudo-autosomal regions on the X and Y where crossover might occur. I’m not sure why the number of my homozygous variants for Y is so low. But for genealogical matching purposes, I’m more interested in 1 to 22 and X.

The 1000 Genomes Project Consortium in 2015 found over 84.7 million SNPs among 2,504 individuals from 26 populations. They also found that “a typical genome differs from the reference human genome at 4.1 million to 5.0 million sites” out of the 3.3 billion base pairs, so that’s only about 0.14%. That means that about 99.86% of our genomes are identical. These “sites” will include my 3,442,712 SNPs in the table above, as well as the 867,091 inserts and deletions from my other VCF file. So my total is 4,309,803 sites, which is in the correct range.
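The arithmetic behind that paragraph can be checked with a few lines of Python. The variable names are made up for illustration, and the 3.3 billion figure is the genome size used in this post.

```python
snps = 3_442_712           # lines in my SNP VCF file
indels = 867_091           # inserts and deletions in my other VCF file
genome_size = 3_300_000_000

total_sites = snps + indels
fraction = total_sites / genome_size

# Range reported for a typical genome by the 1000 Genomes Project.
low, high = 4_100_000, 5_000_000

print(f"{total_sites:,} sites")
print(f"{fraction:.4%} of the genome")
print("within typical range:", low <= total_sites <= high)
```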

   

Comparing VCF values to my Raw Data

I’ve tested my DNA with 5 companies that have provided me with raw DNA results. The companies gave me results for between 618,640 SNPs (Living DNA) and 720,816 SNPs (MyHeritage DNA). There was overlap in what SNPs the companies tested. When I took the results of all 5 tests and combined them into one raw data file, I ended up with 1,389,750 unique SNPs.

A whole genome test is a test of all your DNA. My Dante WGS results provide me with values for all positions on all my chromosomes. These will come in 2 huge files I will receive soon on a hard drive.

The VCF files that I’m talking about in this article tell me what differs from the reference, so it is logical to assume that every position not in the VCF file has the reference value. But that won’t always be true, because the VCF contains only the SNPs that have “PASS” as the Filter value. We don’t know what the values are for those that are not marked as PASS from just the VCF. In fact, I don’t even know how many are not marked PASS, whether it is a lot or a few. Since this is a 30x (30 times coverage) WGS test, I would assume that the vast majority of the positions have been read correctly. Once I get the FASTQ and BAM files, I’ll see if I can look at this in more detail.

My VCF file contains 471,923 SNPs that are in my combined raw data. So 34.0% of my combined raw data are specified in the VCF file. The other 3,837,880 of the 4,309,803 sites in my two VCF files are ones that none of the 5 DNA testing companies had tested. We’ll ignore those for now.

Here’s a summary of the 471,923 SNPs in common between my VCF file and my combined raw data file:

[Image: summary table of the 471,923 SNPs in common]

Of these, 98.0% were the same as they were in my combined raw data file.

The “New” column represents 6,321 SNPs that were no-calls in my combined raw data file, so my VCF allows me to define those.

The “Verify” column represents 228 SNPs that had disagreements between two or more of the raw data files, so I had set them to a no-call. The VCF could prove to be a tie-breaker in this case, but I’ll just continue to call these no-calls just to be safe.

The “Diff” column represents 2,798 SNPs that had a value in my combined raw data file, but the VCF value disagrees with it.

I could use this information to improve my raw data. I could assign values to the 6,321 no-calls, but I should then also turn the 2,798 disagreeing values into no-calls. That would still reduce my overall number of no-calls by 3,523, from 20,688 (1.5%) to 17,165 (1.2%).
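The comparison can be sketched as follows. This is an illustration under assumptions, not the code I ran: the function names, the "--" no-call marker, the "Same" label and the separate set of positions with disagreements between companies are all made up for the example. SNPs are keyed by (chromosome, position), and allele order is ignored when comparing.

```python
from collections import Counter

NO_CALL = "--"  # assumed marker for a no-call in the combined raw data


def normalize(value):
    """Sort the two alleles so that TC and CT compare as equal."""
    return "".join(sorted(value))


def classify(vcf_calls, raw_calls, conflicts):
    """Count Same / New / Verify / Diff for SNPs present in both inputs.

    vcf_calls and raw_calls map (chromosome, position) to a two-letter value.
    conflicts is the set of positions that were set to no-call because two
    or more raw data files disagreed.
    """
    counts = Counter()
    for key, vcf_value in vcf_calls.items():
        if key not in raw_calls:
            continue  # not tested by any of the companies
        raw_value = raw_calls[key]
        if key in conflicts:
            counts["Verify"] += 1
        elif raw_value == NO_CALL:
            counts["New"] += 1
        elif normalize(raw_value) == normalize(vcf_value):
            counts["Same"] += 1
        else:
            counts["Diff"] += 1
    return counts


def no_calls_after_update(before, newly_called, newly_rejected):
    """No-calls left after filling in the New SNPs and rejecting the Diff SNPs."""
    return before - newly_called + newly_rejected


print(no_calls_after_update(20_688, 6_321, 2_798))
```

The last line reproduces the no-call arithmetic from the paragraph above.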


How Can Genetic Genealogists Use a VCF file

Two ways:

1. Upload the VCF file to a DNA matching service that accepts it.

2. Use it to create a raw data file which you can then upload to a DNA matching service that accepts it.


Uploading a VCF file to GEDmatch Genesis

One would hope that if they did a whole genome test, they would be able to upload their whole genome data to one of the companies that do DNA matching.

The only company that currently takes VCF uploads is GEDmatch Genesis. I was patient and waited the 5 minutes until the browser responded after I hit the Upload button. Then it didn’t take very long for GEDmatch to load the file, and it provided this processing:

[Image: GEDmatch Genesis upload processing output]

I made that kit “Research” and waited a day until GEDmatch completed the matching for the kit. Once the results came back, I found a problem.

The GEDmatch File Diagnostic Utility run on my combined raw data which I had previously uploaded gives this:

[Image: GEDmatch diagnostic output for my combined raw data file]

When I run the diagnostics on my VCF file from Dante, I get this:

[Image: GEDmatch diagnostic output for my Dante VCF file]

As correctly reported by the diagnostics, the All 5 file has 1,389,750 SNPs in it, and the WGS file has 3,442,712 SNPs in it.

The diagnostic then reports that my All 5 file has 1,128,146 usable SNPs which are then slimmed to 813,196 SNPs. The slimmed SNPs are the ones that GEDmatch Genesis uses for matching. They are the ones that are the most different between people and give you the most “bang for the buck”.

But my VCF file only had 590,334 usable SNPs which get slimmed to only 231,588 SNPs. That is way less than my All 5 file has. A WGS tests the whole genome, so it should give more SNPs than any other test or even combined tests give. So something was wrong.

Also, when I did a One to Many of my WGS kit, it matched most closely to my All 5 kit, which it should. But then it was closely followed by a whole bunch of kits of other people who are matching me close to identically. All those kits appear to be other whole genome tests.

It then became obvious to me that GEDmatch Genesis is only using the variant SNPs from the VCF file.  The reason why I get complete matches with other WGS kits is that if two people both have a variant at a position, then there is an extremely high probability that their variants are the same. And all GEDmatch is comparing between WGS files are variants.

The procedure that GEDmatch, or anyone else who wants to load a VCF file, needs to follow is this:

  1. If a line in the VCF file has one REF value and one ALT value, then
    • If the INFO field contains:  “AC=1”, then you take the two of them.  e.g.  REF=T, ALT=C, then value is TC (or CT if you sort alphabetically)
    • If the INFO field contains:  “AC=2”, then you use the ALT value twice.  e.g.  REF=T, ALT=C, then value is CC.
  2. If a line in the VCF file has one REF and two ALT values, then you take both the ALT values.  e.g.  REF=T, ALT=C,G, then value is CG.  There are only a few hundred of these in my VCF file.
  3. If a SNP that they use is not in the VCF file, then use the reference. e.g. REF=C, to give the value CC.  They’ll need to have a reference table with the Build 37 genome reference values for all the SNPs that they use. This table would be the same for everyone.
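The three steps above can be sketched as follows. This is only an illustration of the procedure, not GEDmatch’s or anyone else’s actual code: the function name, the dictionary layouts and the "--" no-call marker are assumptions. Note that step 3 assumes a position missing from the VCF really does match the reference, which will not hold for positions that were read but failed the PASS filter.

```python
NO_CALL = "--"  # assumed marker when neither the VCF nor the reference has a value


def vcf_to_raw(vcf_records, reference, tested_positions):
    """Build raw data values for the SNPs a matching service uses.

    vcf_records maps (chromosome, position) to (REF, ALT, AC) as text.
    reference maps (chromosome, position) to the Build 37 reference value.
    tested_positions lists the (chromosome, position) keys to output.
    """
    raw = {}
    for key in tested_positions:
        if key in vcf_records:
            ref, alt, ac = vcf_records[key]
            alts = alt.split(",")
            if len(alts) == 2:
                value = alts[0] + alts[1]      # step 2: take both ALT values
            elif ac.split(",")[0] == "2":
                value = alts[0] + alts[0]      # step 1, AC=2: ALT value twice
            else:
                value = ref + alts[0]          # step 1, AC=1: REF plus ALT
        elif key in reference:
            value = reference[key] * 2         # step 3: not in VCF, use reference
        else:
            value = NO_CALL
        raw[key] = "".join(sorted(value))      # sort alphabetically, e.g. TC -> CT
    return raw


records = {
    ("1", 100): ("T", "C", "1"),
    ("1", 200): ("T", "C", "2"),
    ("1", 300): ("T", "C,G", "1,1"),
}
reference = {("1", 400): "C"}
positions = [("1", 100), ("1", 200), ("1", 300), ("1", 400), ("1", 500)]
print(vcf_to_raw(records, reference, positions))
```

The reference dictionary plays the role of the reference table in step 3, which would be the same for everyone.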

I reported this to GEDmatch and John Olson replied back and confirmed that they are not adding the reference values. He said the VCF upload will have to wait until they get caught up on their Genesis conversion issues.


Using DNA Kit Studio to Create a Raw Data File from a VCF

Wilhelm H. created a wonderful little program called DNA Kit Studio that includes a VCF to RAW converter in it.

[Image: DNA Kit Studio]

It originally did not accept my VCF from Dante. I contacted Wilhelm and the reason was that Dante did not include RSID values. Wilhelm made the change and sent me a beta of the program for me to try. It now created the raw data file, and correctly did steps 1a, 1b, and 2, above.  But he, like GEDmatch, also was not including the reference genome value for the other positions.

I gave Wilhelm links to a couple of open source sites that have most of the reference values for the SNPs that 23andMe and Ancestry test for. And likely when I get the rest of my whole genome data (the FASTQ and BAM files), I’ll figure out how to determine all the reference values myself.

If you can’t wait for Wilhelm to finish his update to his VCF to RAW converter, or if you don’t want to do the task yourself, you could use Wilhelm’s service and he’ll convert it for you for a small fee.


Conclusion:  Is a WGS Test Useful for Matching?

For the purposes of matching, it really only takes a raw data file from any of the major DNA testing companies to get you going. GEDmatch and some of the testing companies will accept uploads and you can get into most databases with just the one test.

You will get slightly more accurate matches at GEDmatch Genesis if you take a test from two companies, one using the old chip (AncestryDNA, Family Tree DNA or MyHeritage DNA) and one using the new chip (23andMe or Living DNA) and then use a tool like DNA Kit Studio to combine them before uploading.

But currently, I don’t see that the WGS test provides enough added utility to make it something genetic genealogists need for matching purposes.


9 Comments

1. vanlaargen (vanlaargen)
Netherlands flag
Joined: Sat, 1 Sep 2018
3 blog comments, 0 forum posts
Posted: Sun, 10 Feb 2019

“It originally did not accept my VCF from Dante. I contacted Wilhelm and the reason was that Dante did not include RSID values. Wilhelm made the change and sent me a beta of the program for me to try. It now created the raw data file, and correctly did steps 1a, 1b, and 2, above. But he, like GEDmatch, also was not including the reference genome value for the other positions.”

Well, this explains why I couldn’t get it to work back in September. After successfully creating a merged file of my mother’s 23andMev3, MyHeritage, Ancestry and LivingDNA files using DNA Kit Studio, I failed at doing it with my DanteLabs vcf files. I recently learned DanteLabs’ North-American customers get their results from an Illumina machine, while European customers get their results from a BGI machine. I wonder if this will cause issues too.

I did have my eye on Thomas Krahn’s Extract23 script at the time. In preparation, I used my mother’s merged file to add the missing SNPs to Thomas’ 23andMe_V3_hg19_ref.tab.gz file. After I got a PC upgrade (additional SSD drive and 64GB of RAM) a couple of days ago, I tried again yesterday to create a 4-company file from my DanteLabs hg19 BAM file using extract23.

Diagnostic for my Mother’s Merged 23andMev3-Ancestry-LivingDNA-MyHeritage file made using DNAKitStudio:
https://genesis.gedmatch.com/v_diag2.php?kit_num=EB3068844

Diagnostic for my 23andMev3-Ancestry-LivingDNA-MyHeritage DanteLabs file made using adjusted Extract23:
https://genesis.gedmatch.com/v_diag2.php?kit_num=DX3001878

2. Louis Kessler (lkessler)
United States flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Tue, 12 Feb 2019

Interesting. You got over 100,000 more usable SNPs from your combined file than I did. That’s a lot. I wonder why.

And thanks for your link to Krahn’s program. I’m a Windows guy, but his template will be of interest to me. I’ll likely be custom programming my own extract once I get my BAM file.

Yes, Dante uses equipment from different companies in their labs in America vs Europe. The process may be different but it shouldn’t make a real difference in results since they’re both whole genome.

3. teepean (teepean)
Finland flag
Joined: Tue, 19 Mar 2019
1 blog comment, 0 forum posts
Posted: Tue, 19 Mar 2019

vanlaargen: Could you share your adjusted 23andMe_V3_hg19_ref.tab.gz, please?

4. blzlovr (blzlovr)
United States flag
Joined: Thu, 2 May 2019
1 blog comment, 0 forum posts
Posted: Thu, 2 May 2019

I received my data from Dante Labs and didn’t know what to do with it. The medical information was very similar to Genos and Promethease. The kit I bought included mtDNA but I couldn’t figure out a way to view it, so I asked customer service and they added a VCF file. They said I could get the BAM and FastQ files on a hard drive (not sure of the charge on that and hoping they meant USB drive?). I found your blog by googling mtDNA and VCF files. It’s looking like I really can’t do anything with the VCF file and I will order BAM and FastQ files. Husband did Big Y and analyzed at Yfull, but has not done any mtDNA testing and my hope was to use Dante Labs for that. I like your blog, easy to understand.

5. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Mon, 6 May 2019

blzlovr: These files are not easy to work with. I’m trying to directly analyze them myself (I have some skills with this sort of analysis) since I don’t like the black box analysis tools that the genomic genealogists use. Getting down and dirty and playing with the ants is easier for me than just getting specific ants here and there or counts of the number of red and black ants. It’s the way I learn best as well.

As far as what you can do with it? Well, the number one reason to do a WGS is for medical purposes, and I’m not really interested in that. I’m trying to see how it might be able to help in any way for genealogical purposes, but it’s more of an interesting exploration and learning tour that I’m on.

I’ve done my full mtDNA and Y-DNA tests at FTDNA so I’ve got enough there and have no need to get that data out of my WGS. The VCF Dante supplies surprisingly leaves out the mtDNA variants, but if you copy the VCF file link and change the “snp.vcf.gz” at the end to “raw.snp.vcf.gz” and download that, you’ll get a slightly bigger file with chrM (mt) listed before chr1. The mt-chromosome isn’t very big and I only have 46 variants, which is typical for mtDNA, and the variants in that file were correct. So you can get your mtDNA variants from your Dante test that way.

6. starkruzr (starkruzr)
United States flag
Joined: Fri, 12 Jul 2019
2 blog comments, 0 forum posts
Posted: Fri, 12 Jul 2019

Hi Louis,

I’m a systems engineer who designs and builds systems for life science applications (https://bioteam.net/bio/jarett-deangelis/) and I’m really interested in doing some advanced analysis to brush up on my bioinformatics application skills. I also went with Dante and just received my VCFs (there doesn’t seem to be a “button” to push to order my FASTQs and BAMs, sent an email). My dad did Ancestry and I’m curious to see if there are any interesting cross-referencing applications we could do together.

Thanks,
Jarett

7. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sat, 13 Jul 2019

Hi Jarett. My interest is more in the genealogical aspects of DNA testing. I’m currently working hard to finish and release version 3 of DMT. http://www.doublematchtriangulator.com - I am not a genomics expert, but I’m learning.

8. starkruzr (starkruzr)
United States flag
Joined: Fri, 12 Jul 2019
2 blog comments, 0 forum posts
Posted: Sat, 13 Jul 2019

Holy crap, that’s awesome! How does this work under the hood? Are you leveraging open source libraries/applications that already exist on Windows?

9. Louis Kessler (lkessler)
Canada flag
Joined: Sun, 9 Mar 2003
287 blog comments, 245 forum posts
Posted: Sat, 13 Jul 2019

Jarett, no, I wrote all the code myself. Double Matching is a technique I’ve never seen used before, so I’m discovering new ways to utilize it as I go.

 

The Following 3 Sites Have Linked Here

  1. Promises and Limitations of Genetic Genealogy, by Debbie Kennett, in Advanced Genetic Genealogy, page 354 : Sat, 13 Apr 2019
    "Debbie Parker Wayne and Louis Kessler are writing about the WGS journey on their blogs."

  2. Superkit Sunday - Judy Russell - The Legal Genealogist : Sun, 14 Apr 2019
    See generally, Louis Kessler, “My Whole Genome Sequencing. The VCF File,” Behold Genealogy, posted 6 Feb 2019 ( : accessed 14 Apr 2019) (“The slimmed SNPs are the ones that GEDmatch Genesis uses for matching. They are the ones that are the most different between people and give you the most ‘bang for the buck’”)

  3. Ive ordered a Dante Labs Whole Genome kit. How do I extract files usable in genealogy? - Comment by Rob Judd - WikiTree G2G : Mon, 19 Aug 2019
    "This excellent article covers the exact ground I was enquiring about. Further, the author references a free tool that can be used to convert VCF files into various formats used by the common genealogy testing companies, such as 23andMe, AncestryDNA and FTDNA/MyHeritage."
