wespipeline package

Submodules

wespipeline.align module

class wespipeline.align.BwaAlignFastq(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for aligning fastq files against the reference genome.

It requires the output of both the wespipeline.reference.ReferenceGenome and wespipeline.fastq.GetFastq higher level tasks in order to proceed with the alignment.

If wespipeline.utils.GlobalParams.exp_name is set, it will be used to name the Sam file produced.

Parameters

none

Output:

A luigi.LocalTarget instance for the aligned sam file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output
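The completion rule described above can be pictured with a small sketch that does not depend on Luigi. The helper `is_complete` is hypothetical and only mimics the rule for local file paths; treating an empty list as incomplete is a choice made for this sketch.

```python
import os


def is_complete(output_paths):
    """Return True only when every expected output path exists."""
    paths = list(output_paths)
    if not paths:
        # Choice made for this sketch: with nothing declared, report "not done".
        return False
    return all(os.path.exists(path) for path in paths)
```

With several workers on different machines, a check like this only gives the same answer everywhere if the paths live on shared storage, which is the point of the implementation note.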

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen
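As a rough illustration of what such an argument list could look like for an alignment, here is a minimal sketch. The function `bwa_mem_args` and its parameters are hypothetical and are not part of wespipeline; only the `bwa mem` subcommand and its `-t` (threads) option come from the Bwa command line.

```python
def bwa_mem_args(reference_fa, fastq_files, cpus=1):
    """Build an argument list for `bwa mem` (hypothetical helper)."""
    fastq_files = list(fastq_files)
    if not 1 <= len(fastq_files) <= 2:
        raise ValueError("expected one (single end) or two (paired end) fastq files")
    # Every element must be a string, since the list is handed to subprocess.Popen.
    # bwa mem writes the alignment to standard output, so the caller still has
    # to redirect it into the sam file.
    return ["bwa", "mem", "-t", str(cpus), reference_fa] + fastq_files
```

For example, `bwa_mem_args("hg19.fa", ["r1.fastq", "r2.fastq"], cpus=4)` yields a list starting with `"bwa", "mem", "-t", "4"`.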

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.align.FastqAlign(*args, **kwargs)

Bases: wespipeline.utils.MetaOutputHandler, luigi.task.Task

Higher level task for the alignment of fastq files.

Local files are given preference over running the alignment, in order to reduce computational overhead.

Alignment is done with the Bwa mem utility.

Parameters
  • sam_local_file (str) – String indicating the location of a local Sam file for the alignment.

  • cpus (int) – Integer indicating the number of cpus that can be used for the alignment.

Output:

A dict mapping keys to luigi.LocalTarget instances for each of the processed files. The following keys are available:

‘sam’ : Local file with the alignment.

cpus = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

sam_local_file = <luigi.parameter.Parameter object>

wespipeline.fastq module

class wespipeline.fastq.FastqcQualityCheck(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for creating a quality report on fastq files.

The report is created using the Fastqc utility, resulting in an html report and a zip folder containing more detailed information about the quality of the reads.

Parameters

fastq_file (str) – Path for the fastq file to be analyzed.

Output:

html (luigi.LocalTarget) : File containing the report for fastqc quality.
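To make the two report files concrete, the sketch below derives their expected names from the input path. It assumes the naming scheme Fastqc uses by default (`<name>_fastqc.html` and `<name>_fastqc.zip`, written next to the input unless an output directory is given); the helper `fastqc_report_paths` is hypothetical and only handles `.fastq` and `.fq` inputs, optionally gzipped.

```python
import os


def fastqc_report_paths(fastq_file, out_dir=None):
    """Guess the html and zip report paths for a fastq file (hypothetical helper)."""
    base = os.path.basename(fastq_file)
    if base.endswith(".gz"):
        base = base[: -len(".gz")]
    for ext in (".fastq", ".fq"):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    if out_dir is None:
        # Assumption: reports are written next to the input file.
        out_dir = os.path.dirname(fastq_file)
    return (
        os.path.join(out_dir, base + "_fastqc.html"),
        os.path.join(out_dir, base + "_fastqc.zip"),
    )
```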

fastq_file = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

class wespipeline.fastq.GetFastq(*args, **kwargs)

Bases: wespipeline.utils.MetaOutputHandler, luigi.task.Task

Higher level task for the retrieval of the experiment fastq files.

Three different sources for the fastq files are accepted: an existing local file, an NCBI accession number for the reads, and an external url indicating the location of the resources. The sources are searched in that order: local files are preferred over external resources in order to reduce computational overhead, and an NCBI accession number is preferred over an external url for reproducibility reasons.

Parameters
  • fastq1_local_file (str) – String indicating the location of a local compressed fastq file.

  • fastq2_local_file (str) – String indicating the location of a local compressed fastq file.

  • fastq1_url (str) – Url indicating the location of the resource for the compressed fastq file.

  • fastq2_url (str) – Url indicating the location of the resource for the compressed fastq file.

  • paired_end (bool) – Case-insensitive boolean indicating whether the reads are paired_end.

  • compressed (bool) – Case-insensitive boolean indicating whether the reads are compressed.

  • create_report (bool) – Case-insensitive boolean indicating whether to create a quality check report.

Output:

A dict mapping keys to luigi.LocalTarget instances for each of the processed files. The following keys are available:

‘fastq1’ : Local fastq file with the experiment’s reads.

‘fastq2’ : In case of paired end experiments, a local fastq file with the experiment’s reads.
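The search order described for this task (local file first, then NCBI accession number, then external url) can be sketched as a plain function. The name `pick_fastq_source` and the returned labels are hypothetical; the real task expresses this choice through the tasks it requires.

```python
def pick_fastq_source(local_file="", accession_number="", url=""):
    """Return a (kind, value) pair for the preferred fastq source."""
    if local_file:
        # Cheapest option: nothing has to be downloaded.
        return ("local", local_file)
    if accession_number:
        # Preferred over a url for reproducibility reasons.
        return ("ncbi", accession_number)
    if url:
        return ("url", url)
    raise ValueError("no fastq source given")
```

Calling `pick_fastq_source(accession_number="SRR000001", url="http://example.org/r1.fastq.gz")` selects the accession number, since no local file is set.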

accession_number = <luigi.parameter.Parameter object>
compressed = <luigi.parameter.BoolParameter object>
create_report = <luigi.parameter.BoolParameter object>
fastq1_local_file = <luigi.parameter.Parameter object>
fastq1_url = <luigi.parameter.Parameter object>
fastq2_local_file = <luigi.parameter.Parameter object>
fastq2_url = <luigi.parameter.Parameter object>
paired_end = <luigi.parameter.BoolParameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

class wespipeline.fastq.SraToolkitFastq(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for downloading fastq files from the NCBI archive.

If the reads are paired end, the output will consist of two separate fastq files.

The output file(s) will be named after the accession number and, in the case of paired end reads, a suffix identifying each of the two fastq files.

Parameters
  • accession_number (str) – NCBI accession number for the experiment.

  • paired_end (bool) – Case-insensitive boolean indicating whether the reads are paired_end.

Output:

A dict mapping keys to luigi.LocalTarget instances for each of the processed files. The following keys are available:

‘fastq1’ : Local fastq file with the experiment’s reads.

‘fastq2’ : In case of paired end experiments, a local fastq file with the experiment’s reads.
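The naming rule for the downloaded files might look like the sketch below. The suffixes `_1` and `_2` are an assumption, chosen because that is how `fastq-dump --split-files` names paired end files; the helper `sra_fastq_names` is hypothetical and the exact names used by the task may differ.

```python
def sra_fastq_names(accession_number, paired_end=False):
    """Map output keys to fastq file names for an accession (hypothetical helper)."""
    if paired_end:
        return {
            "fastq1": accession_number + "_1.fastq",
            "fastq2": accession_number + "_2.fastq",
        }
    # Single end reads: one file named after the accession number alone.
    return {"fastq1": accession_number + ".fastq"}
```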

accession_number = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

paired_end = <luigi.parameter.BoolParameter object>
program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

class wespipeline.fastq.UncompressFastqgz(*args, **kwargs)

Bases: luigi.task.Task

Task for uncompressing fastq files.

The task uses utils.UncompressFile for uncompressing into fastq. If both fastq_local_file and fastq_url are set, the local file will have preference; thus reducing the overhead in the process.

Parameters
  • fastq_local_file (str) – String indicating the location of a local compressed fastq file.

  • fastq_url (str) – Url indicating the location of the resource for the compressed fastq file.

  • output_file (str) – String indicating the desired location and name of the uncompressed output fastq file.

Output:

A luigi.LocalTarget instance for the uncompressed fastq file.

fastq_local_file = <luigi.parameter.Parameter object>
fastq_url = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

output_file = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

wespipeline.processalign module

class wespipeline.processalign.AlignProcessing(*args, **kwargs)

Bases: wespipeline.utils.MetaOutputHandler, luigi.task.Task

Higher level task for processing the alignment of the fastq files.

Local files are given preference over processing the alignment, in order to reduce computational overhead.

If the bam and bai local files are set, they will be used instead of the

Alignment is done with the Bwa mem utility.

Parameters
  • bam_local_file (str) – String indicating the location of a local bam file with the sorted alignment. If set, this file will not be created.

  • bai_local_file (str) – String indicating the location of a local bai file with the index for the alignment. If set, this file will not be created.

  • no_dup_bam_local_file (str) – String indicating the location of a local bam file without the duplicates. If set, this file will not be created.

  • no_dup_bai_local_file (str) – String indicating the location of a local file with the index for the bam file without duplicates. If set, this file will not be created.

  • cpus (int) – Integer indicating the number of cpus that can be used for the alignment.

Output:

A dict mapping keys to luigi.LocalTarget instances for each of the processed files. The following keys are available:

‘bam’ : Local file with the sorted alignment.

‘bai’ : Local file with the alignment index.

‘bamNoDup’ : Local sorted file with duplicates removed.

‘indexNoDup’ : Local file with the index for sorted alignment without duplicates.
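The four outputs correspond to a chain of four external commands: sort, index, remove duplicates, index again. The sketch below lists one plausible set of argument lists for that chain. It assumes samtools for sorting and indexing and a `picard` wrapper script on the PATH; the file names derived from `exp_name` and the helper `processing_commands` are hypothetical, and the actual tasks may use different tools or options.

```python
def processing_commands(exp_name, cpus=1):
    """Return the argument lists for a sort/index/dedup/index chain (sketch)."""
    sam = exp_name + ".sam"
    bam = exp_name + ".bam"
    no_dup = exp_name + "_nodup.bam"
    metrics = exp_name + "_dup_metrics.txt"
    return [
        # Sort the alignment and write it as bam.
        ["samtools", "sort", "-@", str(cpus), "-o", bam, sam],
        # Index the sorted bam.
        ["samtools", "index", bam, bam + ".bai"],
        # Drop duplicates with Picard MarkDuplicates.
        ["picard", "MarkDuplicates", "I=" + bam, "O=" + no_dup,
         "M=" + metrics, "REMOVE_DUPLICATES=true"],
        # Index the duplicate-free bam.
        ["samtools", "index", no_dup, no_dup + ".bai"],
    ]
```

Each inner list is in the form expected by subprocess.Popen, so every step could be wrapped by its own external program task.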

bai_local_file = <luigi.parameter.Parameter object>
bam_local_file = <luigi.parameter.Parameter object>
cpus = <luigi.parameter.IntParameter object>
no_dup_bai_local_file = <luigi.parameter.Parameter object>
no_dup_bam_local_file = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

class wespipeline.processalign.IndexBam(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for indexing the Bam file.

The wespipeline.utils.GlobalParams.exp_name will be used to name the Bai file produced.

Parameters

none

Output:

A luigi.LocalTarget instance for the index Bai file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.processalign.IndexNoDup(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for indexing the Bam file without duplicates.

The wespipeline.utils.GlobalParams.exp_name will be used to name the Bai file produced.

Parameters

none

Output:

A luigi.LocalTarget instance for the index Bai file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.processalign.PicardMarkDuplicates(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for removing duplicates from the Bam file.

The wespipeline.utils.GlobalParams.exp_name will be used to name the Bam file produced.

Parameters

none

Output:

A luigi.LocalTarget instance for the Bam file without the duplicates.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.processalign.SortSam(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for sorting the alignment sam file.

It requires the output of the wespipeline.align.FastqAlign step.

The wespipeline.utils.GlobalParams.exp_name will be used to name the Bam file produced.

Parameters

none

Output:

A luigi.LocalTarget instance for the sorted Bam file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

wespipeline.reference module

class wespipeline.reference.BwaIndex(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for indexing the reference genome .fa file with the bwa index utility.

Indexing the reference genome helps reduce access time drastically.

Parameters

None

Output:

A set of five files are result of indexing the reference genome. The extensions for each of the files are ‘.amb’, ‘.ann’, ‘.bwt’, ‘.pac’, ‘.sa’.
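The five index files can be enumerated from the reference path as in the sketch below. It assumes the default behaviour of `bwa index`, where each extension is appended to the full name of the .fa file; the helper `bwa_index_files` is hypothetical, and the task may place or name the files differently.

```python
BWA_INDEX_EXTENSIONS = (".amb", ".ann", ".bwt", ".pac", ".sa")


def bwa_index_files(reference_fa):
    """List the index files expected next to a reference genome (sketch)."""
    # Assumption: extensions are appended to the complete file name,
    # e.g. hg19.fa -> hg19.fa.amb, not hg19.amb.
    return [reference_fa + ext for ext in BWA_INDEX_EXTENSIONS]
```

A list like this is what the completion check would need to inspect: the index only counts as built when all five files exist.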

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.reference.FaidxIndex(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for indexing the reference genome .fa file with the samtools faidx utility.

Indexing the reference genome helps reduce access time drastically.

Parameters

None

Output:

A luigi.LocalTarget for the .fai index file for the reference genome.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.reference.GetProgram(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for downloading and giving execution permissions to the 2bit program.

The task gives execute permissions to the conversion utility for 2bit files to be converted to fa files which can then be used for aligning the sequences.

The source for the program is ftp://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa.

Parameters

none

Output:

A luigi.LocalTarget for the executable.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.reference.GetReferenceFa(*args, **kwargs)

Bases: wespipeline.utils.MetaOutputHandler, luigi.task.WrapperTask

Task used for obtaining the reference genome .fa file.

This task will retrieve an external genome or use a provided local one, and convert it from 2bit format to .fa if necessary.

Parameters
  • ref_url (str) – Url for the resource with the reference genome.

  • reference_local_file (str) – Path for the reference genome 2bit file. If given, the ref_url parameter will be ignored.

  • from2bit (bool) – Case-insensitive boolean indicating whether the reference genome is in 2bit format. Defaults to false.

Output:

A luigi.LocalTarget for the reference genome fa file.
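The decision this task makes from its three parameters can be summarised as a short plan. The function `reference_plan` and its step labels are hypothetical; in the pipeline, each step would correspond to a required task rather than a string.

```python
def reference_plan(ref_url="", reference_local_file="", from2bit=False):
    """Return the ordered steps needed to obtain the reference .fa file (sketch)."""
    if not reference_local_file and not ref_url:
        raise ValueError("either reference_local_file or ref_url is required")
    steps = []
    if reference_local_file:
        # A local file wins; ref_url is ignored in this case.
        steps.append(("use_local", reference_local_file))
    else:
        steps.append(("download", ref_url))
    if from2bit:
        steps.append(("convert_2bit_to_fa", None))
    return steps
```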

from2bit = <luigi.parameter.BoolParameter object>
ref_url = <luigi.parameter.Parameter object>
reference_local_file = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.reference.PicardDict(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for creating a dict file from the reference genome .fa file with the picard utility.

Parameters

None

Output:

A luigi.LocalTarget for the .dict file for the reference genome.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.reference.ReferenceGenome(*args, **kwargs)

Bases: wespipeline.utils.MetaOutputHandler, luigi.task.Task

Higher level task for retrieving the reference genome.

Local files are given preference over downloading the reference. However, the indexing of the reference genome is always done using GlobalParams.exp_name and GlobalParams.base_dir for determining the filenames and the location of the new files respectively.

The indexing is done using both Samtools and Bwa toolkits.

Parameters
  • reference_local_file (str) – Optional string indicating the location for the reference genome. If set, it will not be downloaded.

  • ref_url (str) – Url for the download of the reference genome.

  • from2bit (bool) – A boolean [True, False] indicating whether the reference genome must be converted from 2bit.

Output:

A dict mapping keys to luigi.LocalTarget instances for each of the processed files. The following keys are available:

‘faidx’ : Local file with the index, result of indexing with Samtools.

‘bwa’ : Set of five files, result of indexing the reference genome with Bwa.

‘fa’ : Local file with the reference genome.

from2bit = <luigi.parameter.BoolParameter object>
ref_url = <luigi.parameter.Parameter object>
reference_local_file = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

class wespipeline.reference.TwoBitToFa(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for converting 2bit files to the fa format.

The task will use a local executable, or require the task for obtaining it, and run it on the reference genome.

Parameters
  • ref_url (str) – Url for the resource with the reference genome.

  • reference_local_file (str) – Path for the reference genome 2bit file. If given, the ref_url parameter will be ignored.

Output:

A luigi.LocalTarget for the reference genome fa file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

ref_url = <luigi.parameter.Parameter object>
reference_local_file = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

wespipeline.utils module

class wespipeline.utils.GlobalParams(*args, **kwargs)

Bases: luigi.task.Config

Task used for specifying globally accessible parameters.

Parameters defined in this class are task independent, and their number should be kept low.

Parameters
  • exp_name (str) – Name for the experiment. Useful for defining file names.

  • log_dir (str) – Absolute path for the logs of the application.

  • base_dir (str) – Absolute path to the directory where files are expected to appear if not specified differently.

base_dir = <luigi.parameter.Parameter object>
exp_name = <luigi.parameter.Parameter object>
log_dir = <luigi.parameter.Parameter object>
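Since this class derives from luigi.task.Config, its values would normally be supplied through a configuration-file section named after the class. The sketch below parses such a section with the standard library only, to show the expected shape; the values in `EXAMPLE_CFG` and the helper `read_global_params` are made up for illustration.

```python
import configparser

# Hypothetical contents of a Luigi configuration file.
EXAMPLE_CFG = """
[GlobalParams]
exp_name = sample01
log_dir = /data/wes/logs
base_dir = /data/wes
"""


def read_global_params(cfg_text):
    """Return the GlobalParams section of a configuration text as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return dict(parser["GlobalParams"])
```

Tasks that name their files after the experiment would then all see the same `exp_name` without it being passed from task to task.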
class wespipeline.utils.GunzipFile(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task for unzipping compressed files.

Gunzip will always do the process in place, removing the extension.

Parameters

input_file (str) – Absolute path to the compressed file.

input_file = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.utils.LocalFile(*args, **kwargs)

Bases: luigi.task.Task

Helper task for making tasks depend on existing local files.

No extra processing is done in the task. It allows making tasks dependent on files using the same strategy as with other tasks.

Parameters

file (str) – Absolute path to the file to be tested.

file = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

run()

The task run method, to be overridden in a subclass.

See Task.run

class wespipeline.utils.MetaOutputHandler

Bases: object

Helper class for propagating inputs in WrapperTasks

output()
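One way such a helper could work is to report the inputs of a task as its outputs, so that a higher level task exposes whatever its required tasks produced. The sketch below shows that idea without Luigi; both class names are hypothetical, and the actual implementation of MetaOutputHandler may differ.

```python
class PropagateInputs:
    """Hypothetical mixin: report the inputs of a task as its outputs."""

    def output(self):
        # In Luigi, Task.input() returns the outputs of the required tasks.
        return self.input()


class FakeAlignTask(PropagateInputs):
    """Stand-in for a wrapper-like task; not a real Luigi task."""

    def input(self):
        return {"sam": "/data/wes/sample01.sam"}
```

Here `FakeAlignTask().output()` returns the same dict that `input()` provides, which matches the "dict mapping keys to targets" shape documented for the higher level tasks.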
class wespipeline.utils.UncompressFile(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task for unzipping compressed files to a desired location.

Gunzip will always do the process in place, removing the extension. This task allows selecting the destination.

This operation

Parameters
  • input_file (str) – Absolute path to the compressed file.

  • output_file (str) – Absolute path to the desired final location.

  • copy (bool) – Case-insensitive boolean indicating whether to copy or to move the file. Defaults to false.
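A minimal sketch of decompressing to a chosen destination, using Python's gzip module in place of the external gunzip program. The function name uncompress_to and the file names are hypothetical, and the reading of copy=False as "remove the compressed original afterwards" is an assumption, not confirmed behaviour of the task.

```python
import gzip
import os
import shutil
import tempfile


def uncompress_to(input_file, output_file, copy=False):
    """Decompress a gzip file to output_file.

    Assumption: copy=False behaves like a move, so the compressed
    original is removed afterwards; copy=True leaves it in place.
    """
    with gzip.open(input_file, "rb") as src, open(output_file, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not copy:
        os.remove(input_file)
    return output_file


def demo():
    """Exercise both modes on a small payload in a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "reads.fastq.gz")
        with gzip.open(archive, "wb") as handle:
            handle.write(b"@read1\nACGT\n")
        kept = uncompress_to(archive, os.path.join(tmp, "copy.fastq"), copy=True)
        with open(kept, "rb") as handle:
            payload = handle.read()
        still_there = os.path.exists(archive)
        uncompress_to(archive, os.path.join(tmp, "moved.fastq"))
        removed = not os.path.exists(archive)
    return payload, still_there, removed
```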

copy = <luigi.parameter.BoolParameter object>
input_file = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

output_file = <luigi.parameter.Parameter object>
program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.utils.Wget(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task for downloading files using the tool wget.

Parameters
  • url (str) – URL indicating the location of the resource to be retrieved.

  • output_file (str) – Absolute path for the destination of the retrieved resource.
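As an illustration of how the two parameters could map to a program_args() list, the sketch below builds a wget command line. The exact arguments used by the task are not shown in this reference, so the function name and argument order are assumptions; -O is the standard wget option for choosing the output file.

```python
def wget_program_args(url, output_file):
    """Build an argument list suitable for subprocess.Popen."""
    # -O writes the downloaded document to the given file.
    return ["wget", "-O", output_file, url]
```

For example, wget_program_args("https://example.org/ref.fa.gz", "/data/ref.fa.gz") yields a list beginning with "wget"; both the URL and the path are placeholders.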

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

output_file = <luigi.parameter.Parameter object>
program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

url = <luigi.parameter.Parameter object>

wespipeline.vcf module

class wespipeline.vcf.DeepvariantCallVariants(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for identifying variants in the bam file provided using DeepVariant.

Parameters

model_type (str) – A string defining the model to use for the variant calling. Valid options are [WGS,WES,PACBIO].
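A small sketch of checking model_type against the options listed above. The helper name is hypothetical, and exact, case-sensitive matching is an assumption; the task itself may treat the value differently.

```python
VALID_MODEL_TYPES = ("WGS", "WES", "PACBIO")


def check_model_type(model_type):
    """Return model_type unchanged, or raise if it is not a listed option."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(
            "model_type must be one of {}, got {!r}".format(
                ", ".join(VALID_MODEL_TYPES), model_type
            )
        )
    return model_type
```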

Dependencies:

ReferenceGenome AlignProcessing

Output:

A luigi.LocalTarget instance for the index vcf file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.vcf.DeepvariantDockerTask(*args, **kwargs)

Bases: luigi.contrib.docker_runner.DockerTask

Task used for identifying variants in the bam file provided using DeepVariant.

Parameters

model_type (str) – A string defining the model to use for the variant calling. Valid options are [WGS,WES,PACBIO].

Dependencies:

ReferenceGenome AlignProcessing

Output:

A luigi.LocalTarget instance for the index vcf file.

BIN_VERSION = '0.8.0'
property binds

Override this to mount local volumes, in addition to the /tmp/luigi which gets defined by default. This should return a list of strings. e.g. [‘/hostpath1:/containerpath1’, ‘/hostpath2:/containerpath2’]
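The list-of-strings format described above can be produced from a mapping of host paths to container paths. This is a sketch only: the helper name and the example paths are hypothetical, and the directories actually mounted by the task are not shown in this reference.

```python
def make_binds(mounts):
    """Turn {host_path: container_path} into 'host:container' strings."""
    return ["{}:{}".format(host, container) for host, container in mounts.items()]
```

For example, make_binds({"/data/ref": "/ref", "/data/bam": "/input"}) returns ["/data/ref:/ref", "/data/bam:/input"].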

property command
create_gvcf = <luigi.parameter.BoolParameter object>
property image
model_type = <luigi.parameter.Parameter object>
property mount_tmp
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.vcf.DockerGatkCallVariants(*args, **kwargs)

Bases: luigi.contrib.docker_runner.DockerTask

Task used for identifying variants in the bam file provided using DeepVariant.

Parameters

model_type (str) – A string defining the model to use for the variant calling. Valid options are [WGS,WES,PACBIO].

Dependencies:

ReferenceGenome AlignProcessing

Output:

A luigi.LocalTarget instance for the index vcf file.

BIN_VERSION = '0.8.0'
property binds

Override this to mount local volumes, in addition to the /tmp/luigi which gets defined by default. This should return a list of strings. e.g. [‘/hostpath1:/containerpath1’, ‘/hostpath2:/containerpath2’]

property command
property image
model_type = <luigi.parameter.Parameter object>
property mount_tmp
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.vcf.FreebayesCallVariants(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for identifying variants in the bam file provided using Freebayes.

The wespipeline.utils.GlobalParams.exp_name will be used for giving name to the vcf produced.

Parameters

none

Dependencies:

ReferenceGenome AlignProcessing

Output:

A luigi.LocalTarget instance for the index vcf file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.vcf.GatkCallVariants(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for identifying variants in the bam file provided using GATK.

The wespipeline.utils.GlobalParams.exp_name will be used for giving name to the vcf produced.

Parameters

none

Dependencies:

ReferenceGenome AlignProcessing

Output:

A luigi.LocalTarget instance for the index vcf file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.vcf.PlatypusCallVariants(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for identifying variants in the bam file provided using Platypus.

The wespipeline.utils.GlobalParams.exp_name will be used for giving name to the vcf produced.

Parameters

none

Dependencies:

ReferenceGenome AlignProcessing

Output:

A luigi.LocalTarget instance for the index vcf file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

class wespipeline.vcf.VariantCalling(*args, **kwargs)

Bases: wespipeline.utils.MetaOutputHandler, luigi.task.Task

Higher level task for the alignment of fastq files.

Local files are given preference over processing the alignment in order to reduce computational overhead.

If the bam and bai local files are set, they will be used instead of the

Alignment is done with the Bwa mem utility.

Parameters
  • use_platypus (bool) – A case-insensitive boolean indicating whether to use Platypus for variant calling.

  • use_freebayes (bool) – A case-insensitive boolean indicating whether to use Freebayes for variant calling.

  • use_samtools (bool) – A case-insensitive boolean indicating whether to use Samtools for variant calling.

  • use_gatk (bool) – A case-insensitive boolean indicating whether to use Gatk for variant calling.

  • use_deepvariant (bool) – A case-insensitive boolean indicating whether to use DeepVariant for variant calling.

  • vcf_local_files (string) – A comma-delimited list of vcf files to be used instead of running the variant calling tools.

  • cpus (int) – Number of cpus that are available for each of the methods selected.
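The comma-delimited vcf_local_files value can be turned into a list of paths as sketched below. The helper name is hypothetical, and stripping whitespace and skipping empty entries are assumptions about how the value is meant to be interpreted.

```python
def split_vcf_local_files(value):
    """Split a comma-delimited string into a list of paths, dropping blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

For example, split_vcf_local_files("a.vcf, b.vcf,,c.vcf") returns ["a.vcf", "b.vcf", "c.vcf"], and an empty string returns an empty list.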

Output:

A dict mapping keys to luigi.LocalTarget instances for each of the processed files. The following keys are available:

  • ‘platypus’ : Local file with the variant calls obtained using Platypus.

  • ‘freebayes’ : Local file with the variant calls obtained using Freebayes.

  • ‘Varscan’ : Local sorted file with variant calls obtained using Varscan.

  • ‘gatk’ : Local file with the variant calls obtained using GATK.

  • ‘deepvariant’ : Local file with the variant calls obtained using DeepVariant.

cpus = <luigi.parameter.IntParameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

use_deepvariant = <luigi.parameter.BoolParameter object>
use_freebayes = <luigi.parameter.BoolParameter object>
use_gatk = <luigi.parameter.BoolParameter object>
use_platypus = <luigi.parameter.BoolParameter object>
use_varscan = <luigi.parameter.BoolParameter object>
vcf_local_files = <luigi.parameter.Parameter object>
class wespipeline.vcf.VarscanCallVariants(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for identifying variants in the bam file provided using Varscan.

The wespipeline.utils.GlobalParams.exp_name will be used for giving name to the vcf produced.

Parameters

none

Dependencies:

ReferenceGenome AlignProcessing

Output:

A luigi.LocalTarget instance for the index vcf file.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

wespipeline.vcfanalysis module

class wespipeline.vcfanalysis.DockerVTnormalizeVCF(*args, **kwargs)

Bases: luigi.contrib.docker_runner.DockerTask

VERSION = '0.57721--hdf88d34_2'
biallelic_block_substitutions = <luigi.parameter.BoolParameter object>
biallelic_clumped_variant = <luigi.parameter.BoolParameter object>
property binds

Override this to mount local volumes, in addition to the /tmp/luigi which gets defined by default. This should return a list of strings. e.g. [‘/hostpath1:/containerpath1’, ‘/hostpath2:/containerpath2’]

property command
decomposes_multiallelic_variants = <luigi.parameter.BoolParameter object>
property image
property mount_tmp
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

vcf = <luigi.parameter.Parameter object>
class wespipeline.vcfanalysis.NormalizeVcfFiles(*args, **kwargs)

Bases: wespipeline.utils.MetaOutputHandler, luigi.task.Task

docstring for NormalizeVcfFiles

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

class wespipeline.vcfanalysis.VTnormalizeVCF(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

out = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

vcf = <luigi.parameter.Parameter object>
class wespipeline.vcfanalysis.VariantCallingAnalysis(*args, **kwargs)

Bases: luigi.task.Task

Higher level task for comparing variant calls.

Comparing variant calls is a delicate task that increases in complexity when dealing with diploid sequences (such as the human genome), where different variants can appear at the same position in each chromosome of a pair.

The normalization is done with vt, and the comparison with VcfTools.

Parameters

None

Output:

None. The resulting files are not provided as task output. Each of the n vcf files is analyzed and compared in pairs. This gives a total of 2n-1 files.
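Pairwise comparison of a set of vcf files can be enumerated with itertools.combinations. This sketch only lists the unordered pairs; the helper name and file names are hypothetical, and it makes no claim about which pairs or how many result files the task actually produces.

```python
from itertools import combinations


def comparison_pairs(vcf_files):
    """Return every unordered pair of files, in input order."""
    return list(combinations(vcf_files, 2))
```

Each pair could then be handed to a comparison step taking two vcf paths, such as the vcf1 and vcf2 parameters of VcftoolsCompare.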

normalize = <luigi.parameter.BoolParameter object>
output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

class wespipeline.vcfanalysis.VcftoolsCompare(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for comparing a pair of vcf files using VcfTools.

Parameters
  • vcf1 (str) – Absolute path to the first file to be compared.

  • vcf2 (str) – Absolute path to the second file to be compared.
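One way the two paths could be mapped to a VcfTools command line is sketched below. The options --vcf, --diff, --diff-site and --out are standard vcftools options, but whether the task uses this particular combination is an assumption, and out_prefix is a hypothetical extra argument.

```python
def vcftools_compare_args(vcf1, vcf2, out_prefix):
    """Build a vcftools site-level comparison command as an argument list."""
    return [
        "vcftools",
        "--vcf", vcf1,       # first file to be compared
        "--diff", vcf2,      # second file to be compared
        "--diff-site",       # report differences per site
        "--out", out_prefix, # prefix for the result files
    ]
```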

Dependencies:

None

Output:

A luigi.LocalTarget instance for the result of comparing the files.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

vcf1 = <luigi.parameter.Parameter object>
vcf2 = <luigi.parameter.Parameter object>
class wespipeline.vcfanalysis.VcftoolsDepthAnalysis(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for extracting basic statistics for the variant calls using VcfTools.

Parameters

vcf (str) – Absolute path to the file with the variant annotations.
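The VcfTools statistics tasks in this module each take a single vcf path. A sketch of the corresponding command lines, assuming the depth and frequency analyses map to the standard vcftools options --depth and --freq; the helper name, the stat argument and out_prefix are hypothetical.

```python
STAT_FLAGS = {"depth": "--depth", "freq": "--freq"}


def vcftools_stat_args(vcf, stat, out_prefix):
    """Build a vcftools statistics command as an argument list."""
    if stat not in STAT_FLAGS:
        raise ValueError("unknown statistic: {!r}".format(stat))
    return ["vcftools", "--vcf", vcf, STAT_FLAGS[stat], "--out", out_prefix]
```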

Dependencies:

None

Output:

A luigi.LocalTarget instance for the file with the vcf statistics.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

vcf = <luigi.parameter.Parameter object>
class wespipeline.vcfanalysis.VcftoolsFreqAnalysis(*args, **kwargs)

Bases: luigi.contrib.external_program.ExternalProgramTask

Task used for extracting basic statistics for the variant calls using VcfTools.

Parameters

vcf (str) – Absolute path to the file with the variant annotations.

Dependencies:

None

Output:

A luigi.LocalTarget instance for the file with the vcf statistics.

output()

The output that this Task produces.

The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

program_args()

Override this method to map your task parameters to the program arguments

Returns

list to pass as args to subprocess.Popen

requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

vcf = <luigi.parameter.Parameter object>

Module contents

wespipeline.name = 'wespipeline_pkg'