bcbio Homo sapiens genome build

Details on the assembly sources used for the default bcbio hg38 genome.

Reference genome for alignment (seq/)

The hg38 (GRCh38) reference genome build is from the 1000 Genomes project. It is derived from the NCBI set with HLA and decoy alternative alleles. This genome was chosen because it is considered the best current reference for variant calling. This is the reference genome FASTA used by the aligners, including HISAT2, STAR, bowtie2, and bwa.

Relevant links:

Additional links containing version information:

HISAT2 pre-built index version information:

See also:

Reference transcriptome for RNA-seq (rnaseq/)

bcbio currently uses the latest Ensembl annotation for the hg38 transcripts FASTA.

This file is used by salmon and kallisto for transcript-level quantification.

Here’s the cloudbiolinux code used to generate the transcripts FASTA. This isn’t currently documented well enough in that repo.

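# Ensembl release 94 GTF for GRCh38
url=http://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens/Homo_sapiens.GRCh38.94.gtf.gz
mkdir -p rnaseq
# Ensembl-to-UCSC contig name mapping (column 1: Ensembl name, column 2: UCSC name)
remap_url=http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh38_ensembl2UCSC.txt
# Build one sed substitution per contig whose two names differ
wget --no-check-certificate -qO- $remap_url | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
# Download and decompress the GTF
wget --no-check-certificate -qO- $url | gunzip -c > hg38.gtf
# Rewrite contig names to UCSC style, then remove the intermediate files
sed -f remap.sed hg38.gtf > rnaseq/hg38.gtf
rm remap.sed
rm hg38.gtf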
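
To make the renaming step above easier to follow, here is a minimal sketch that runs the same awk program on a tiny, made-up mapping and GTF fragment. The file names (toy_map.txt, toy_remap.sed, toy.gtf) and their contents are illustrative assumptions, not data from the real Ensembl or ChromosomeMappings files.

```shell
# Toy mapping in a two-column layout (Ensembl name, then UCSC name).
# These rows are illustrative only; the third row has identical names.
printf '1\tchr1\nMT\tchrM\nchrZ\tchrZ\n' > toy_map.txt

# Same awk program as in the recipe: one sed rule per row whose names differ.
awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' toy_map.txt > toy_remap.sed

# Toy GTF fragment using Ensembl-style contig names in the first column.
printf '1\tensembl\tgene\t100\t200\n10\tensembl\tgene\t5\t50\nMT\tensembl\tgene\t1\t9\n' > toy.gtf

# Apply the generated rules and collect the first column on one line.
remapped=$(sed -f toy_remap.sed toy.gtf | cut -f1 | paste -sd ' ' -)
echo "$remapped"

# Remove the toy files.
rm toy_map.txt toy_remap.sed toy.gtf
```

In this toy run the row with identical names produces no rule, so only two substitutions are written. The `^1` rule is anchored at the start of the line but not at the end of the contig field, so it also matches the line beginning with `10`; here that still gives `chr10` because the UCSC name is just the Ensembl name with a `chr` prefix.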

Current hg38 recipes:

Recipes per workflow: