Iliad SNP Array
TL;DR setup
Input | Output |
IDAT data | quality-controlled VCF |
Please make sure that your conda environment for Iliad is activated: conda activate iliadEnv or mamba activate iliadEnv
Modify the workdirPath parameter in the configuration file to the appropriate path leading up to and including /Iliad, with a final forward slash, e.g. /Path/To/Iliad/.
The configuration file is found in config/config.yaml.
#####################################
#####################################
#####################################
# # # USER INPUT VARIABLES # # #
#####################################
#####################################
#####################################
# You must insert your /PATH/TO/Iliad/
# use 'pwd' command to find your current working directory when you are inside of Iliad directory
# e.g. /path/to/Iliad/ <---- must include forward slash at the end of working directory path
# must include forward slash, '/', at the end of working directory path
workdirPath: /Insert/path/to/Iliad/
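The two requirements stated in the comments above (the path ends in the Iliad directory and carries a trailing forward slash) can be checked before launching anything. The following is only a minimal sketch; the function name check_workdir_path is hypothetical and is not part of Iliad.

```python
from pathlib import PurePosixPath


def check_workdir_path(workdir_path: str) -> list[str]:
    """Return a list of problems found in a workdirPath value (empty if none)."""
    problems = []
    # The configuration comments require a trailing forward slash.
    if not workdir_path.endswith("/"):
        problems.append("path must end with a forward slash")
    # The path should lead up to and include the Iliad directory itself.
    if PurePosixPath(workdir_path).name != "Iliad":
        problems.append("last directory should be 'Iliad'")
    return problems


print(check_workdir_path("/Path/To/Iliad/"))      # → []
print(check_workdir_path("/Path/To/Iliad-main"))  # two problems reported
```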
Some other pre-set parameters that you might consider changing to suit your project include:
Homo sapiens GRCh38 release 104 reference genome
ref:
  species: homo_sapiens
  release: 104
  build: GRCh38
Illumina MEGA microarray GRCh38 support and product files
urlProductFiles:
  manifest: https://webdata.illumina.com/downloads/productfiles/multiethnic-global-8/v1-0/build38/multi-ethnic-global-8-d2-bpm.zip
  mzip: multi-ethnic-global-8-d2-bpm.zip
  cluster: https://webdata.illumina.com/downloads/productfiles/multiethnic-global-8/v1-0/infinium-multi-ethnic-global-8-d1-cluster-file.zip
  czip: infinium-multi-ethnic-global-8-d1-cluster-file.zip
urlSupportFiles:
  physicalGeneticCoordinates: https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/multiethnic-global/multi-ethnic-global-8-d2-physical-genetic-coordinates.zip
  pzip: multi-ethnic-global-8-d2-physical-genetic-coordinates.zip
  rsidConversion: https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/multiethnic-global/multi-ethnic-global-8-d2-b150-rsids.zip
  rzip: multi-ethnic-global-8-d2-b150-rsids.zip
  rfile: Multi-EthnicGlobal_D2_b150_rsids.txt
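In the defaults above, each zip entry (mzip, czip, pzip, rzip) repeats the final file name of the URL listed just before it. If you point the URLs at another array's files, it seems reasonable to assume the two values should be changed together; this is inferred from the defaults, not from documented behaviour. A small sketch of that consistency check, with hypothetical helper names:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def zip_name_from_url(url: str) -> str:
    """Return the final path component of a download URL."""
    return PurePosixPath(urlparse(url).path).name


def mismatched_zip_names(section, pairs):
    """List (url_key, zip_key) pairs whose zip name differs from the URL's file name."""
    return [(u, z) for u, z in pairs if zip_name_from_url(section[u]) != section[z]]


# Values copied from the urlProductFiles defaults shown above.
url_product_files = {
    "manifest": "https://webdata.illumina.com/downloads/productfiles/multiethnic-global-8/v1-0/build38/multi-ethnic-global-8-d2-bpm.zip",
    "mzip": "multi-ethnic-global-8-d2-bpm.zip",
    "cluster": "https://webdata.illumina.com/downloads/productfiles/multiethnic-global-8/v1-0/infinium-multi-ethnic-global-8-d1-cluster-file.zip",
    "czip": "infinium-multi-ethnic-global-8-d1-cluster-file.zip",
}

print(mismatched_zip_names(url_product_files, [("manifest", "mzip"), ("cluster", "czip")]))  # → []
```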
Place your data into the /Iliad/data/snp_array/idat/ directory.
Since this module is NOT the main snakefile, Snakemake will NOT automatically detect it without the --snakefile flag.
(Please make sure that your conda environment for Iliad is activated: conda activate iliadEnv or mamba activate iliadEnv)
$ snakemake --snakefile workflow/snpArray_Snakefile --cores 1
Combine this with other user-specified Snakemake flags, such as --cores.
If you plan to run on a local machine or self-built server without a job scheduler, the default command is the following:
$ snakemake -p --use-singularity --use-conda --snakefile workflow/snpArray_Snakefile --cores 1 --jobs 1 --default-resources mem_mb=10000 --latency-wait 120
However, there is a file included in the Iliad directory, named snpArray-snakemake.sh, that will be useful in batch job submission.
Below is an example Snakemake workflow submission with the SLURM job scheduler.
Please read the shell variables at the top of the script and customize them to your own paths and resource needs.
$ sbatch snpArray-snakemake.sh
If you would like more in-depth information and descriptions, please continue to the next sections below. Otherwise, you have completed the TL;DR setup section.
Information
This tutorial introduces the genome-wide SNP array data processing module of the Iliad workflow, which is developed using the Snakemake workflow language.
Please visit the Snakemake documentation for specific details; the Snakemake project also provides informational slides. In general, though, each module is composed of rules. These rules define how output files are generated from input files, and Snakemake automatically determines the dependencies among the rules. A DAG (directed acyclic graph) of jobs is built on each run to account for all of the samples and jobs that will be executed, either via a job scheduler or on local cores. Jobs execute in parallel when more than one job is allowed at a time.
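To make the idea of rules and a DAG concrete, here is a toy sketch in plain Python. It is not how Snakemake is implemented, and the rule and file names are hypothetical; it only shows how dependencies can be derived by matching one rule's inputs to another rule's outputs and then ordered topologically.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: name -> (input files, output files).
rules = {
    "download_reference": ([], ["reference.fa"]),
    "idat_to_gtc": (["sample_Grn.idat", "sample_Red.idat"], ["sample.gtc"]),
    "gtc_to_vcf": (["sample.gtc", "reference.fa"], ["sample.vcf"]),
    "quality_control": (["sample.vcf"], ["sample.qc.vcf"]),
}


def build_dependencies(rules):
    """Map each rule to the set of rules that produce its input files."""
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    return {
        name: {producer[f] for f in inputs if f in producer}
        for name, (inputs, _) in rules.items()
    }


dependencies = build_dependencies(rules)
# A valid execution order: every rule appears after the rules it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Rules with no dependency on each other (here, download_reference and idat_to_gtc) are the ones a scheduler could run in parallel.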
Because of the Snakemake workflow system design, the Iliad workflow is scalable from single core machines to HPC clusters with job schedulers.
The SNP array module is designed to process target data in your lab. Iliad is currently limited to Illumina microarray raw data processing and is configured for the human genotyping Infinium Multi-Ethnic Global-8 Kit (MEGA). External test runs performed on Google Cloud Platform (GCP) helped us ensure that no bioinformatics knowledge is needed to run this module.
SNP Array Module Rule Graph
Background
Genome-wide microarray genotyping remains one of the most widely used methods to obtain genotypic information on collected DNA samples, despite the growing popularity and accessibility of genotyping by sequencing.
To make a comprehensive genomic pipeline, we wanted to give researchers the means to keep using this large body of data, which remains important for many analyses.
GWAS data can be used in many more applications than gene identification, such as ancestry estimation, historical population reconstruction, clinical genetic testing for diagnostic purposes, forensic analyses, and new method validation for sequencing data.
This module is currently limited to Illumina microarrays on the basis of the software tools and support and product file downloads. It is configured to the MEGA microarray, meaning download files are pointed to MEGA support files and product files. It does possess the capability to be adapted to other microarrays. Pull requests and contributions are welcomed.
Basics
The raw files from an Illumina array scanner are bead array files in raw intensity data (.idat) format.
These .idat files are converted into Genotype Call (.gtc) files using the iaap-cli software. This software has an end-user license agreement (EULA) and is not included in or distributed by Iliad. If the user chooses to configure a download of the program, it is downloaded independently of the Iliad repository distribution.
The .gtc files are converted to a .vcf using the bcftools plug-in gtc2vcf.
This requires a reference genome assembly and Iliad downloads the user-configured reference genome fasta files.
Iliad is configured to download Homo sapiens GRCh38 release 104 as default.
Processing the .vcf is critical for compatibility with other genetic data.
Custom Python scripts are called in rules to rename unconventional loci names to standardized rs IDs using dbSNP files.
The default configuration file is set to download human_9606_b151_GRCh38p7 All_20180418.vcf.gz.
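As an illustration of the renaming idea only (this is not the script shipped with Iliad), the sketch below swaps array-specific locus names in the VCF ID column for rs IDs. It assumes a hypothetical two-column, tab-separated lookup table with a "Name" and "RsID" header and "." for loci without an rs ID; the real inputs and their layout may differ.

```python
def load_rsid_map(lines):
    """Parse tab-separated 'Name<TAB>RsID' lines into a dict (assumed layout)."""
    mapping = {}
    for line in lines:
        name, rsid = line.rstrip("\n").split("\t")[:2]
        # Skip the header row and loci without a usable rs ID.
        if name == "Name" or not rsid.startswith("rs"):
            continue
        mapping[name] = rsid
    return mapping


def rename_vcf_ids(vcf_lines, mapping):
    """Yield VCF lines with the ID column (3rd field) replaced where a mapping exists."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        fields[2] = mapping.get(fields[2], fields[2])
        yield "\t".join(fields)


# Hypothetical locus names and rs IDs, for illustration only.
table = ["Name\tRsID\n", "1:1000-A-G\trs123\n", "exm99\t.\n"]
vcf = ["##fileformat=VCFv4.2\n", "1\t1000\t1:1000-A-G\tA\tG\t.\t.\t.\n"]
renamed = list(rename_vcf_ids(vcf, load_rsid_map(table)))
print(renamed[1].split("\t")[2])  # → rs123
```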
Once the .vcf is processed, quality controls are performed based on the GenTrain and ClusterSep scores.
Default thresholds, along with other default workflow configurations, can be found in the configuration file: config/config.yaml.
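A threshold filter of this kind can be sketched as follows. Everything specific here is an assumption: the INFO tag names GenTrain_Score and Cluster_Sep, the use of "greater than or equal to", and the threshold values in the example are placeholders, not the defaults from config/config.yaml.

```python
def parse_info(info_field):
    """Split a VCF INFO string such as 'A=1;B=2' into a dict of strings."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def passes_qc(vcf_line, min_gentrain, min_cluster_sep):
    """Return True if both scores are present and meet the given thresholds."""
    info = parse_info(vcf_line.rstrip("\n").split("\t")[7])
    try:
        gentrain = float(info["GenTrain_Score"])   # assumed tag name
        cluster_sep = float(info["Cluster_Sep"])   # assumed tag name
    except (KeyError, ValueError):
        return False  # treat missing or non-numeric scores as a failure
    return gentrain >= min_gentrain and cluster_sep >= min_cluster_sep


# Hypothetical record and placeholder thresholds.
record = "1\t1000\trs123\tA\tG\t.\t.\tGenTrain_Score=0.8;Cluster_Sep=0.5\n"
print(passes_qc(record, min_gentrain=0.7, min_cluster_sep=0.3))  # → True
```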
In-depth Setup
Once the installation of Iliad and its two dependencies has been completed, you will find your new working directory within the PATH/TO/Iliad folder.
Make sure your current working directory is this cloned repo, as stated in the installation instructions.
If the repository was not cloned in that fashion, there is a chance that your directory will be improperly named Iliad-main.
$ cd Iliad
In that working directory you will find a number of directories with files and code to run each of the module pipelines.
FIRST, there is a data/snp_array/idat directory with a readme.md file. You must place all of your .idat files in this folder.
There should be two .idat files for each sample: one green (_Grn.idat) and one red (_Red.idat).
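Before starting a run, it can be worth confirming that every sample really has both files. The sketch below is one way to do that; the function name and the sample names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def find_unpaired_idats(idat_dir):
    """Return {sample_prefix: [missing suffixes]} for samples lacking a Grn or Red file."""
    suffixes = ("_Grn.idat", "_Red.idat")
    seen = {}
    for path in Path(idat_dir).glob("*.idat"):
        for suffix in suffixes:
            if path.name.endswith(suffix):
                prefix = path.name[: -len(suffix)]
                seen.setdefault(prefix, set()).add(suffix)
    return {
        prefix: sorted(set(suffixes) - found)
        for prefix, found in seen.items()
        if len(found) < len(suffixes)
    }


# Demonstration on a temporary directory with made-up sample names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("S1_Grn.idat", "S1_Red.idat", "S2_Grn.idat"):
        (Path(tmp) / name).touch()
    unpaired = find_unpaired_idats(tmp)

print(unpaired)  # → {'S2': ['_Red.idat']}
```

On real data you would call find_unpaired_idats("data/snp_array/idat") from inside the Iliad directory.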
SECOND, there is a configuration file with some default parameters. However, you MUST at least change the workdirPath parameter to the appropriate path leading up to and including /Iliad, with a final forward slash, e.g. /Path/To/Iliad/. The configuration file is found in config/config.yaml.
workdirPath: /Path/To/Iliad/
Some other parameters that are pre-set and you might consider changing to your project needs include:
Homo sapiens GRCh38 release 104 reference genome
ref:
  species: homo_sapiens
  release: 104
  build: GRCh38
Illumina MEGA microarray GRCh38 support and product files
urlProductFiles:
  manifest: https://webdata.illumina.com/downloads/productfiles/multiethnic-global-8/v1-0/build38/multi-ethnic-global-8-d2-bpm.zip
  mzip: multi-ethnic-global-8-d2-bpm.zip
  cluster: https://webdata.illumina.com/downloads/productfiles/multiethnic-global-8/v1-0/infinium-multi-ethnic-global-8-d1-cluster-file.zip
  czip: infinium-multi-ethnic-global-8-d1-cluster-file.zip
urlSupportFiles:
  physicalGeneticCoordinates: https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/multiethnic-global/multi-ethnic-global-8-d2-physical-genetic-coordinates.zip
  pzip: multi-ethnic-global-8-d2-physical-genetic-coordinates.zip
  rsidConversion: https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/multiethnic-global/multi-ethnic-global-8-d2-b150-rsids.zip
  rzip: multi-ethnic-global-8-d2-b150-rsids.zip
  rfile: Multi-EthnicGlobal_D2_b150_rsids.txt
THIRD, each module pipeline has a specific Snakefile.
Snakemake will automatically detect the main snakefile, which is named exactly as such and found in the workflow directory: workflow/Snakefile.
Iliad reserves the main snakefile for the main module, specifically the raw sequence read data module.
This means the user must specify which Snakefile will be invoked with the following:
$ snakemake --snakefile workflow/snpArray_Snakefile
Combine this with other user-specified Snakemake flags, such as --cores.
In this module, the SNP Array Snakefile is also located in the workflow directory: workflow/snpArray_Snakefile.
Users must invoke this snpArray_Snakefile in order to run the workflow for this SNP array module.
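If you launch the workflow from a Python wrapper rather than typing the command, the same invocation can be assembled as an argument list. This is only a sketch: build_snakemake_command is a hypothetical helper, and it uses only flags that already appear in this tutorial.

```python
import shlex


def build_snakemake_command(snakefile="workflow/snpArray_Snakefile", cores=1, extra_flags=()):
    """Assemble the snakemake argument list for the SNP array module."""
    command = ["snakemake", "--snakefile", snakefile, "--cores", str(cores)]
    command.extend(extra_flags)  # any further user-specified flags
    return command


cmd = build_snakemake_command(extra_flags=["-p", "--use-conda", "--latency-wait", "120"])
print(shlex.join(cmd))
# → snakemake --snakefile workflow/snpArray_Snakefile --cores 1 -p --use-conda --latency-wait 120
```

The resulting list could be passed to subprocess.run(cmd, check=True) from inside the Iliad directory with the iliadEnv environment active.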
If you plan to run on a local machine or self-built server without a job scheduler, the default command is the following:
$ snakemake -p --use-singularity --use-conda --cores 1 --jobs 1 --snakefile workflow/snpArray_Snakefile --default-resources mem_mb=10000 --latency-wait 120
However, there is a file included in the Iliad directory, named snpArray-snakemake.sh, that will be useful in batch job submission.
Below is an example Snakemake workflow submission with the SLURM job scheduler.
Please read the shell variables at the top of the script and customize them to your own paths and resource needs.
$ sbatch snpArray-snakemake.sh