Pre-formatted Metagenomic and Human reference datasets
Reference datasets can be created from existing FASTA files using RTG commands alone, but making full use of RTG functionality can require additional configuration files. For example, the metagenomic species abundance tools work best when provided with species taxonomy information, and the sex-aware capabilities of the human variant calling pipeline require a reference configuration that specifies sex chromosome information and the location of the pseudoautosomal regions (PARs).
To get our customers up and running more smoothly, we have provided several commonly used datasets in a pre-formatted form. The following datasets are available:
Human reference genomes:
- 1000g_v37_phase2.sdf.zip (996.8 MB) (chromosomes named as "1", "2", etc.)
- hg19.sdf.zip (984.4 MB) (chromosomes named as "chr1", "chr2", etc.)
- GRCh38.sdf.zip (999.2 MB) (chromosomes named as "chr1", "chr2", etc. This is the "no alt analysis set")
- GRCh38_hs38d1.sdf.zip (1001.2 MB) (chromosomes named as "chr1", "chr2", etc. This is the "no alt analysis set" plus decoys)
Metagenomics support databases:
- references-filter.zip (984.4 MB)
- references-protein.zip (4.90 GB)
- references-species.zip (5.52 GB)
The "Pipeline Commands" section of the user manual has instructions on how to use these databases, as well as how you can create and use your own databases.
Once downloaded and unzipped, you can run the rtg sdfstats command on the dataset for more information.
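As a rough illustration of the download-unzip-inspect step, here is a minimal shell sketch. The helper names sdf_dir_for and describe_sdf are hypothetical. The sketch also assumes that each archive unpacks to a directory named after the zip file without its ".zip" suffix (for example hg19.sdf), which should be checked against the actual archive contents.

```shell
# Hypothetical helper: derive the SDF directory name from an archive path.
# Assumption: the archive unpacks to a directory named like the zip file
# minus its ".zip" suffix (e.g. hg19.sdf.zip -> hg19.sdf).
sdf_dir_for() {
    basename "$1" .zip
}

# Hypothetical helper: unzip a downloaded archive into the current
# directory, then print summary statistics for the resulting SDF.
describe_sdf() {
    archive="$1"
    sdf_dir="$(sdf_dir_for "$archive")"
    unzip -q "$archive" || return 1
    rtg sdfstats "$sdf_dir"
}

# Example invocation (requires unzip and rtg on the PATH):
#   describe_sdf hg19.sdf.zip
```

If the archive unpacks to a differently named directory, pass that directory to rtg sdfstats directly instead of relying on the derived name.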
The RTG genome simulation utilities can use genetic maps encoding linkage disequilibrium information. While users may supply their own files, the following zipfile contains genetic map files for human build 37 that can be used with the rtg childsim and rtg pedsim commands. See the contained README for information on the original data source, and see the user manual for more information about how to use these genetic maps: