GATK.GRCh38 vs GRCh38

I am doing both RNASeq and Sarek and need both of them to use the same genome. I used GATK.GRCh38 on Sarek and then noticed that this does not exist in RNASeq, so I used GRCh38. Are they the same thing?

iGenomes GATK.GRCh38 and GRCh38 (really NCBI/GRCh38) are definitely not the same, the former has 3366(!) contigs, the latter 195. The difference is that the GATK genome is optimized for variant calling, where reads mapped to a similar but wrong locus can have serious negative effects. The GATK genome therefore contains a lot of “decoy” sequences, e.g. from viruses commonly found in humans, as well as “alternate” versions of certain loci that show high variability betwen individuals (MHC, etc.).
In RNAseq, on the other hand, too many alternate contigs are not necessarily desirable, because reads need not match perfectly to the genome, as long as they are in the correct gene, you will get usable results, and spreading reads over several contigs has the potential to do more harm than good (although the details vary with your exact application and goals).
What do the two genomes have in common? Most importantly, they both follow the UCSC naming convention (chr1, chr2, etc.), so they are “compatible” in that sense. In fact, NCBI/GRCh38 is a subset of GATK.GRCh38, i.e. everything in NCBI/GRCh38 is also in the GATK version. I have not checked the sequences, but the basic human GRCh38 assembly has been stable since its release in December 2013, and “patches”, which are periodically added, are not used in most routine analyses, so it is pretty safe to assume that the sequences that the genome version have in common are functionally equivalent.
With that in mind, it seems perfectly reasonable to me to use the NCBI/GRCh38 assembly for RNAseq and the GATK version for variant calling. One issue to keep in mind is that if you use iGenomes in the nf-core/rnaseq pipeline with --genome GRCh38, you will use a GTF file (containing gene annotations) that is very old and misses certain features (gene_biotype) as well, so I would not recommend that. I would get a compatible new GTF instead. If you are using VEP in sarek, the VEP cache has a version that corresponds to an Ensembl reslease (newest currently in sarek is 111), so if you are getting a GTF from Ensembl (they have alternate versions with chr prefix, even though the Ensembl genomes don’t use that themselves), it makes sense to use the same release, which ensures that the gene IDs are fully compatible. You can then use the --fasta option in rnaseq with the iGenomes NCBi/GRCh38 genome sequence and the --gft option with the GTF file of your choice.

4 Likes

Thank you very much @tdanhorn , this is great advice and information.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.