Reference genome for alignment
The hg38 (UCSC; GRCh38 in Ensembl) reference genome build comes from the 1000 Genomes project. It is derived from the NCBI set and includes HLA and decoy alternative alleles. This build was chosen because it is considered the best current reference for variant calling. Its FASTA is the reference used by the aligners, including HISAT2, STAR, bowtie2, and bwa.
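As a quick sanity check on a downloaded copy of this FASTA, the sketch below counts contigs by category from the header lines. It is a minimal illustration, not part of bcbio-nextgen: the function names are hypothetical, and the naming conventions it relies on (HLA contigs start with `HLA-`, decoys end in `_decoy`, alternate loci end in `_alt`) are assumptions about the 1000 Genomes analysis set that should be confirmed against the actual file.

```python
from collections import Counter


def classify_contig(name):
    """Bucket a contig name using the ASSUMED naming conventions:
    'HLA-' prefix, '_decoy' suffix, '_alt' suffix."""
    if name.startswith("HLA-"):
        return "hla"
    if name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    # Everything else: primary chromosomes plus unplaced/unlocalized contigs.
    return "primary_or_unplaced"


def summarize_fasta_headers(lines):
    """Count contigs per category from an iterable of FASTA lines."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split()
        if not parts:
            continue  # malformed header with no name
        # The contig name is the first whitespace-delimited token.
        counts[classify_contig(parts[0])] += 1
    return counts


# Small in-memory example; the header names are illustrative only.
example_lines = [
    ">chr1  AC:CM000663.2",
    "ACGT",
    ">chr1_KI270762v1_alt",
    "ACGT",
    ">chrUn_JTFH01000001v1_decoy",
    "ACGT",
    ">HLA-A*01:01:01:01",
    "ACGT",
]
example_counts = summarize_fasta_headers(example_lines)
print(dict(example_counts))
```

To run it on a real file, pass an open file handle instead of the list, for example `summarize_fasta_headers(open(path))`, where `path` points at your local copy of the reference.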
Relevant links, with version information:
- 1000 Genomes FTP
- NCBI FTP
HISAT2 pre-built index version information:
- HISAT2 (v2.0.1) reference with Ensembl release 78 exon and splice site annotations.
- Derived from NCBI set with HLA and decoy alternative alleles.
- Source: 1000 Genomes GRCh38 reference genome
- Build date: 2015-12-07
Reference transcriptome for RNA-seq
bcbio-nextgen currently uses the latest Ensembl GRCh38 reference genome for the hg38 transcripts FASTA and GTF.
Note that the annotations are remapped from Ensembl to UCSC (see the gtf.yaml recipe link below).
The transcripts FASTA is used by salmon and kallisto for transcript-level quantification.
YAML recipe links
Current hg38 recipes:
Recipes per workflow, defined in CloudBioLinux:
- seq.yaml: Alignment/variation. Note that the STAR index is built per install using this recipe.
- hisat2.yaml: HISAT2 uses a pre-built index.
- bwa.yaml: bwa uses a pre-built index.
- transcripts.yaml: RNA-seq transcripts are from the latest Ensembl reference.
- gtf.yaml: This step uses mappings defined in the ChromosomeMappings repo to remap from Ensembl to UCSC.
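The Ensembl-to-UCSC remapping mentioned for gtf.yaml can be sketched roughly as follows. This is an illustration of the idea, not the recipe's actual code. It assumes the mapping file is a two-column, tab-separated table (Ensembl name, then UCSC name), that entries with an empty UCSC column have no equivalent, and that GTF records on unmapped contigs are dropped; the function names are hypothetical.

```python
def load_mapping(lines):
    """Parse an ASSUMED two-column, tab-separated Ensembl-to-UCSC table."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip entries where either column is missing or empty.
        if len(fields) >= 2 and fields[0] and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def remap_gtf(gtf_lines, mapping):
    """Yield GTF lines with the first column (seqname) renamed.

    Comment lines pass through unchanged. Records whose seqname has no
    mapping are dropped, which is an assumption of this sketch.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        seqname, sep, rest = line.partition("\t")
        new_name = mapping.get(seqname)
        if not sep or new_name is None:
            continue  # malformed or unmapped record
        yield new_name + sep + rest


# Illustrative in-memory inputs; real inputs would be file handles.
mapping_lines = ["1\tchr1\n", "MT\tchrM\n", "NO_UCSC_NAME\t\n"]
gtf_lines = [
    "#!genome-build GRCh38\n",
    "1\tensembl\tgene\t11869\t14409\t.\t+\t.\tgene_id \"G1\";\n",
    "MT\tensembl\tgene\t3307\t4262\t.\t+\t.\tgene_id \"G2\";\n",
    "NO_UCSC_NAME\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id \"G3\";\n",
]
remapped = list(remap_gtf(gtf_lines, load_mapping(mapping_lines)))
for record in remapped:
    print(record, end="")
```

Only the first column changes, so coordinates and attributes are preserved exactly. Whether unmapped contigs should be dropped or kept under their Ensembl names is a design choice to check against the recipe itself.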