step_pipeline package

Submodules

step_pipeline.batch module

This module contains Hail Batch-specific extensions of the Pipeline and Step classes

class step_pipeline.batch.BatchStepType(value)

Bases: enum.Enum

Constants that represent different Batch Step types.

PYTHON = 'python'
BASH = 'bash'
class step_pipeline.batch.BatchPipeline(name=None, config_arg_parser=None, backend=Backend.HAIL_BATCH_SERVICE)

Bases: step_pipeline.pipeline.Pipeline

This class contains Hail Batch-specific extensions of the Pipeline class

property backend

Returns either Backend.HAIL_BATCH_SERVICE or Backend.HAIL_BATCH_LOCAL

new_step(name=None, step_number=None, arg_suffix=None, depends_on=None, image=None, cpu=None, memory=None, storage=None, always_run=False, timeout=None, output_dir=None, reuse_job_from_previous_step=None, localize_by=Localize.COPY, delocalize_by=Delocalize.COPY)

Creates a new pipeline Step.

Parameters
  • name (str) – A short name for this Step.

  • step_number (int) – Optional Step number which serves as another alias for this step in addition to name.

  • arg_suffix (str) – Optional suffix for the command-line args that will be created for forcing or skipping execution of this Step.

  • depends_on (Step) – Optional upstream Step that this Step depends on.

  • image (str) – Docker image to use for this Step.

  • cpu (str, float, int) – CPU requirements. If the value is numeric, it is interpreted as the number of CPU cores.

  • memory (str, float, int) – Memory requirements. The memory expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi. Omitting a suffix means the value is in bytes. For the ServiceBackend, the values ‘lowmem’, ‘standard’, and ‘highmem’ are also valid arguments. ‘lowmem’ corresponds to approximately 1 Gi/core, ‘standard’ corresponds to approximately 4 Gi/core, and ‘highmem’ corresponds to approximately 7 Gi/core. The default value is ‘standard’.

  • storage (str, int) – Disk size. The storage expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi. Omitting a suffix means the value is in bytes. For the ServiceBackend, jobs requesting one or more cores receive 5 GiB of storage for the root file system /. Jobs requesting a fraction of a core receive the same fraction of 5 GiB of storage. If you need additional storage, you can explicitly request more storage using this method and the extra storage space will be mounted at /io. Batch automatically writes all ResourceFile to /io. The default storage size is 0 Gi. The minimum storage size is 0 Gi and the maximum storage size is 64 Ti. If storage is set to a value between 0 Gi and 10 Gi, the storage request is rounded up to 10 Gi. All values are rounded up to the nearest Gi.

  • always_run (bool) – Set the Step to always run, even if dependencies fail.

  • timeout (float, int) – Set the maximum amount of time this job can run for before being killed.

  • output_dir (str) – Optional default output directory for Step outputs.

  • reuse_job_from_previous_step (Step) – Optionally, reuse the batch.Job object from this other upstream Step.

  • localize_by (Localize) – If specified, this will be the default Localize approach used by Step inputs.

  • delocalize_by (Delocalize) – If specified, this will be the default Delocalize approach used by Step outputs.

Returns

The new BatchStep object.

Return type

BatchStep
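
For example, a Step with explicit resource requirements can be created like this (a minimal sketch; the pipeline name, image, and resource values are placeholders):

from step_pipeline.constants import Backend
from step_pipeline.io import Localize
from step_pipeline.main import pipeline

sp = pipeline("my pipeline", backend=Backend.HAIL_BATCH_SERVICE)

# create a Step that runs in a custom Docker image with 4 cores and 16 GiB of memory
s1 = sp.new_step(
    name="align reads",
    image="docker.io/example/bwa:latest",  # placeholder image
    cpu=4,
    memory="16Gi",
    storage="50Gi",
    localize_by=Localize.COPY,
)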

gcloud_project(gcloud_project)

Set the requester-pays project.

Parameters

gcloud_project (str) – The name of the Google Cloud project to be billed when accessing requester-pays buckets.

cancel_after_n_failures(cancel_after_n_failures)

Set the cancel_after_n_failures value.

Parameters

cancel_after_n_failures (int) – Automatically cancel the batch after N failures have occurred.

default_image(default_image)

Set the default Docker image to use for Steps in this pipeline.

Parameters
  • default_image (str) – Default Docker image to use for Bash jobs. This must be the full name of the image, including any repository prefix and tags if desired.

default_python_image(default_python_image)

Set the default image for Python Jobs.

Parameters

default_python_image (str) – The Docker image to use for Python jobs. The image specified must have the dill package installed. If default_python_image is not specified, then a Docker image will automatically be created for you with the base image hailgenetics/python-dill:[major_version].[minor_version]-slim and the Python packages specified by python_requirements will be installed. The default name of the image is batch-python with a random string for the tag unless python_build_image_name is specified. If the ServiceBackend is the backend, the locally built image will be pushed to the repository specified by image_repository.

default_memory(default_memory)

Set the default memory usage.

Parameters

default_memory (int, str) – Memory setting to use by default if not specified by a Step. Only applicable if a docker image is specified for the LocalBackend or the ServiceBackend. See Job.memory().

default_cpu(default_cpu)

Set the default cpu requirement.

Parameters

default_cpu (float, int, str) – CPU setting to use by default if not specified by a job. Only applicable if a docker image is specified for the LocalBackend or the ServiceBackend. See Job.cpu().

default_storage(default_storage)

Set the default storage disk size.

Parameters

default_storage (str, int) – Storage setting to use by default if not specified by a job. Only applicable for the ServiceBackend. See Job.storage().

default_timeout(default_timeout)

Set the default job timeout duration.

Parameters

default_timeout – Maximum time in seconds for a job to run before being killed. Only applicable for the ServiceBackend. If None, there is no timeout.
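
Taken together, the setters above configure pipeline-wide defaults that individual Steps can override. A minimal sketch (assuming sp is a BatchPipeline; all values are placeholders):

sp.gcloud_project("my-gcloud-project")  # billed when accessing requester-pays buckets
sp.cancel_after_n_failures(3)
sp.default_image("ubuntu:22.04")
sp.default_cpu(1)
sp.default_memory("standard")           # ~4 Gi/core on the ServiceBackend
sp.default_storage("10Gi")
sp.default_timeout(3600)                # seconds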

run()

Batch-specific code for submitting the pipeline to the Hail Batch backend

class step_pipeline.batch.BatchStep(pipeline, name=None, step_number=None, arg_suffix=None, image=None, cpu=None, memory=None, storage=None, always_run=False, timeout=None, output_dir=None, reuse_job_from_previous_step=None, localize_by=Localize.COPY, delocalize_by=Delocalize.COPY)

Bases: step_pipeline.pipeline.Step

This class contains Hail Batch-specific extensions of the Step class

cpu(cpu)

Set the CPU requirement for this Step.

Parameters

cpu (str, float, int) – CPU requirements. If the value is numeric, it is interpreted as the number of CPU cores.

memory(memory)

Set the memory requirement for this Step.

Parameters

memory (str, float, int) – Memory requirements. The memory expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi. Omitting a suffix means the value is in bytes. For the ServiceBackend, the values ‘lowmem’, ‘standard’, and ‘highmem’ are also valid arguments. ‘lowmem’ corresponds to approximately 1 Gi/core, ‘standard’ corresponds to approximately 4 Gi/core, and ‘highmem’ corresponds to approximately 7 Gi/core. The default value is ‘standard’.

storage(storage)

Set the disk size for this Step.

Parameters

storage (str, int) – Disk size. The storage expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi. Omitting a suffix means the value is in bytes. For the ServiceBackend, jobs requesting one or more cores receive 5 GiB of storage for the root file system /. Jobs requesting a fraction of a core receive the same fraction of 5 GiB of storage. If you need additional storage, you can explicitly request more storage using this method and the extra storage space will be mounted at /io. Batch automatically writes all ResourceFile to /io. The default storage size is 0 Gi. The minimum storage size is 0 Gi and the maximum storage size is 64 Ti. If storage is set to a value between 0 Gi and 10 Gi, the storage request is rounded up to 10 Gi. All values are rounded up to the nearest Gi.

always_run(always_run)

Set the always_run parameter for this Step.

Parameters

always_run (bool) – Set the Step to always run, even if dependencies fail.

timeout(timeout)

Set the timeout for this Step.

Parameters

timeout (float, int) – Set the maximum amount of time this job can run for before being killed.
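
The setters above adjust a single Step's resources after it has been created by new_step(..). A minimal sketch (assuming s1 is a BatchStep returned by new_step(..)):

s1.cpu(8)               # 8 CPU cores
s1.memory("highmem")    # ~7 Gi/core on the ServiceBackend
s1.storage("50Gi")      # extra storage mounted at /io
s1.always_run(True)     # run even if upstream Steps fail
s1.timeout(2 * 3600)    # kill the job after 2 hours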

step_pipeline.constants module

class step_pipeline.constants.Backend(value)

Bases: enum.Enum

Constants that represent possible pipeline execution backends

HAIL_BATCH_LOCAL = 'hbl'
HAIL_BATCH_SERVICE = 'hbs'
TERRA = 'terra'
CROMWELL = 'cromwell'

step_pipeline.io module

This module contains classes and methods related to data input & output.

class step_pipeline.io.Localize(value)

Bases: enum.Enum

Constants that represent different options for how to localize files into the running container. Each 2-tuple consists of a name for the localization approach and the subdirectory in which to place localized files.

COPY = ('copy', 'local_copy')

COPY uses the execution backend’s default approach to localizing files

GSUTIL_COPY = ('gsutil_copy', 'local_copy')

GSUTIL_COPY runs ‘gsutil cp’ to localize file(s) from a google bucket path. This requires gsutil to be available inside the execution container.

HAIL_HADOOP_COPY = ('hail_hadoop_copy', 'local_copy')

HAIL_HADOOP_COPY uses the Hail hadoop API to copy file(s) from a google bucket path. This requires python3 and Hail to be installed inside the execution container.

HAIL_BATCH_GCSFUSE = ('hail_batch_gcsfuse', 'gcsfuse')

HAIL_BATCH_GCSFUSE uses the Hail Batch gcsfuse function to mount a google bucket into the execution container as a network drive, without copying the files. The Hail Batch service account must have read access to the bucket.

HAIL_BATCH_GCSFUSE_VIA_TEMP_BUCKET = ('hail_batch_gcsfuse_via_temp_bucket', 'gcsfuse')

HAIL_BATCH_GCSFUSE_VIA_TEMP_BUCKET is useful for situations where you’d like to use gcsfuse to localize files and your personal gcloud account has read access to the source bucket, but the Hail Batch service account cannot be granted read access to that bucket. Since it’s possible to run ‘gsutil cp’ under your personal credentials within the execution container, but Hail Batch gcsfuse always runs under the Hail Batch service account credentials, this workaround:

  1. runs ‘gsutil cp’ under your personal credentials to copy the source files to a temporary bucket that you control and to which you have granted the Hail Batch service account read access

  2. uses gcsfuse to mount the temporary bucket

  3. performs the computational steps on the mounted data

  4. deletes the source files from the temporary bucket when the Batch job completes

This localization approach may be useful for situations where you need a large number of jobs and each job processes a very small piece of a large data file (eg. a few loci in a cram file).

Copying the large file(s) from the source bucket to a temp bucket in the same region is fast and inexpensive, and only needs to happen once before the jobs run. Each job can then avoid allocating a large disk and waiting for the large file to be copied into the container. This approach requires gsutil to be available inside the execution container.
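
For example, an input can be localized this way by passing the constant to a Step’s input(..) method (a hedged sketch; the bucket path and command are placeholders, and s1 is assumed to be an existing Step whose input(..) and command(..) methods are documented below under step_pipeline.pipeline.Step):

from step_pipeline.io import Localize

# mount a large cram via a temporary bucket instead of copying it into the container
cram = s1.input(
    "gs://source-bucket/sample1.cram",  # placeholder path
    localize_by=Localize.HAIL_BATCH_GCSFUSE_VIA_TEMP_BUCKET,
)
s1.command(f"samtools view {cram.local_path} chr21:5000000-5001000 > subset.sam")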

get_subdir_name()

Returns the subdirectory name passed to the constructor

class step_pipeline.io.Delocalize(value)

Bases: enum.Enum

Constants that represent different options for how to delocalize file(s) from a running container.

COPY = 'copy'

COPY uses the execution backend’s default approach to delocalizing files

GSUTIL_COPY = 'gsutil_copy'

GSUTIL_COPY runs ‘gsutil cp’ to copy the path to a google bucket destination. This requires gsutil to be available inside the execution container.

HAIL_HADOOP_COPY = 'hail_hadoop_copy'

HAIL_HADOOP_COPY uses the hail hadoop API to copy file(s) to a google bucket path. This requires python3 and hail to be installed inside the execution container.

class step_pipeline.io.InputType(value)

Bases: enum.Enum

Constants that represent the type of a step.input_value(..) arg.

STRING = 'string'
FLOAT = 'float'
INT = 'int'
BOOL = 'boolean'
class step_pipeline.io.InputSpecBase(name=None)

Bases: abc.ABC

This is the InputSpec parent class, with subclasses implementing specific types of input specs which contain metadata about inputs to a Pipeline Step.

property name
property uuid
class step_pipeline.io.InputValueSpec(value=None, name=None, input_type=InputType.STRING)

Bases: step_pipeline.io.InputSpecBase

An InputValueSpec stores metadata about an input that’s not a file path

property value
property input_type
class step_pipeline.io.InputSpec(source_path=None, name=None, localize_by=None, localization_root_dir=None)

Bases: step_pipeline.io.InputSpecBase

An InputSpec stores metadata about an input file or directory

property source_path
property source_bucket
property source_path_without_protocol
property source_dir
property filename
property local_path
property local_dir
property localize_by
class step_pipeline.io.OutputSpec(local_path=None, output_dir=None, output_path=None, name=None, delocalize_by=None)

Bases: object

An OutputSpec stores metadata about an output file or directory from a Step

property output_path
property output_dir
property filename
property name
property local_path
property local_dir
property delocalize_by

step_pipeline.main module

This module contains the pipeline(..) function which is the main gateway for users to access the functionality in the step_pipeline library

step_pipeline.main.pipeline(name=None, backend=Backend.HAIL_BATCH_SERVICE, config_file_path='~/.step_pipeline')

Creates a pipeline object.

Usage:

with step_pipeline("my pipeline") as sp:
    s = sp.new_step(..)
    ... step definitions ...

# or alternatively:

sp = step_pipeline("my pipeline")
s = sp.new_step(..)
... step definitions ...
sp.run()
Parameters
  • name (str) – Pipeline name.

  • backend (Backend) – The backend to use for executing the pipeline.

  • config_file_path (str) – path of a configargparse config file.

Returns

An object that you can use to create Steps by calling .new_step(..) and then execute the pipeline by calling .run().

Return type

Pipeline

step_pipeline.pipeline module

class step_pipeline.pipeline.Pipeline(name=None, config_arg_parser=None)

Bases: abc.ABC

Pipeline represents the execution pipeline. This base class contains only generalized code that is not specific to any particular execution backend. It has public methods for creating Steps, as well as some private methods that implement the general aspects of traversing the execution graph (DAG) and transferring all steps to a specific execution backend.

get_config_arg_parser()

Returns the configargparse.ArgumentParser object used by the Pipeline to define command-line args. This is a drop-in replacement for argparse.ArgumentParser with some extra features such as support for config files and environment variables. See https://github.com/bw2/ConfigArgParse for more details. You can use this to add and parse your own command-line arguments the same way you would using argparse. For example:

p = pipeline.get_config_arg_parser()
p.add_argument("--my-arg")
args = pipeline.parse_args()

parse_args()

Parse command line args defined up to this point. This method can be called more than once.

Returns

argparse args object.

abstract new_step(name, step_number=None)

Creates a new pipeline Step. Subclasses must implement this method.

Parameters
  • name (str) – A short name for the step.

  • step_number (int) – Optional step number.

gcloud_project(gcloud_project)
cancel_after_n_failures(cancel_after_n_failures)
default_image(default_image)
default_python_image(default_python_image)
default_memory(default_memory)
default_cpu(default_cpu)
default_storage(default_storage)
default_timeout(default_timeout)
default_output_dir(default_output_dir)

Set the default output_dir for pipeline Steps.

Parameters

default_output_dir (str) – Output directory

abstract run()

Submits a pipeline to an execution engine such as Hail Batch. Subclasses must implement this method. They should use this method to perform initialization of the specific execution backend and then call self._transfer_all_steps(..).

check_input_glob(glob_path)

This method is useful for checking the existence of multiple input files and caching the results. The glob_path can use glob syntax (ie. wildcards, as in gs://bucket/**/sample*.cram).

Parameters

glob_path (str) – Local file path or gs:// Google Storage path. The path can contain wildcards (*).

Returns

List of metadata dicts like:

[
    {
        'path': 'gs://bucket/dir/file.bam.bai',
        'size_bytes': 2784,
        'modification_time': 'Wed May 20 12:52:01 EDT 2020',
    },
]

Return type

list
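
For example (a sketch; the bucket path is a placeholder):

# check which index files already exist before defining Steps that would regenerate them
for file_stats in sp.check_input_glob("gs://bucket/dir/*.bam.bai"):
    print(file_stats["path"], file_stats["size_bytes"], file_stats["modification_time"])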

export_pipeline_graph(output_svg_path=None)

Renders the pipeline execution graph diagram based on the Steps defined so far.

Parameters

output_svg_path (str) – Path where to write the SVG image with the execution graph diagram. If not specified, it will be based on the pipeline name.

class step_pipeline.pipeline.Step(pipeline, name, step_number=None, arg_suffix=None, output_dir=None, localize_by=None, delocalize_by=None, add_force_command_line_args=True, add_skip_command_line_args=True)

Bases: abc.ABC

Represents a set of commands or sub-steps which together produce some output file(s), and which can be skipped if the output files already exist (and are newer than any input files, unless a --force arg is used). A Step’s input and output files must be stored in some persistent location, like a local disk or GCS.

Using Hail Batch as an example, a Step typically corresponds to a single Hail Batch Job. Sometimes a Job can be reused to run multiple steps (for example, where step 1 creates a VCF and step 2 tabixes it).

name(name)

Set the short name for this Step.

Parameters

name (str) – Name

command(command)

Add a shell command to this Step.

Parameters

command (str) – A shell command to execute as part of this Step

input_glob(glob_path, name=None, localize_by=None)

Specify input file(s) to this Step using glob syntax (ie. using wildcards as in gs://bucket/**/sample*.cram)

Parameters
  • glob_path (str) – The path of the input file(s) or directory to localize, optionally including wildcards.

  • name (str) – Optional name for this input.

  • localize_by (Localize) – How this path should be localized.

Returns

An object that describes the specified input file or directory.

Return type

InputSpec

input_value(value=None, name=None, input_type=None)

Specify a Step input that is something other than a file path.

Parameters
  • value – The input’s value.

  • name (str) – Optional name for this input.

  • input_type (InputType) – The value’s type.

Returns

An object that contains the input value, name, and type.

Return type

InputValueSpec
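
A minimal sketch of a non-file input (assuming s1 is an existing Step; the name and value are placeholders):

from step_pipeline.io import InputType

min_quality = s1.input_value(30, name="min_quality", input_type=InputType.INT)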

input(source_path=None, name=None, localize_by=None)

Specifies an input file or directory.

Parameters
  • source_path (str) – Path of input file or directory to localize.

  • name (str) – Optional name for this input.

  • localize_by (Localize) – How this path should be localized.

Returns

An object that describes the specified input file or directory.

Return type

InputSpec

inputs(source_path, *source_paths, name=None, localize_by=None)

Specifies one or more input file or directory paths.

Parameters
  • source_path (str) – Path of input file or directory to localize.

  • name (str) – Optional name to apply to all these inputs.

  • localize_by (Localize) – How these paths should be localized.

Returns

A list of InputSpec objects that describe these input files or directories. The list will contain one entry for each passed-in source path.

Return type

list

use_the_same_inputs_as(other_step, localize_by=None)

Copy the inputs of another step, while optionally changing the localize_by approach. This is a utility method to make it easier to specify inputs for a Step that is very similar to a previously-defined step.

Parameters
  • other_step (Step) – The Step object to copy inputs from.

  • localize_by (Localize) – Optionally specify how these inputs should be localized. If not specified, the value from other_step will be reused.

Returns

A list of new InputSpec objects that describe the inputs copied from other_step. The returned list will contain one entry for each input of other_step.

Return type

list

use_previous_step_outputs_as_inputs(previous_step, localize_by=None)

Define Step inputs to be the output paths of an upstream Step and explicitly mark this Step as downstream of previous_step by calling self.depends_on(previous_step).

Parameters
  • previous_step (Step) – A Step that’s upstream of this Step in the pipeline.

  • localize_by (Localize) – Specify how these inputs should be localized. If not specified, the default localize_by value for the pipeline will be used.

Returns

A list of new InputSpec objects that describe the inputs defined based on the outputs of previous_step. The returned list will contain one entry for each output of previous_step.

Return type

list
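
For example, a downstream Step can take all outputs of an upstream Step as its inputs (a hedged sketch; the image and command are placeholders, and s1 is assumed to be an upstream Step that produced a VCF):

s2 = sp.new_step("tabix vcf", image="docker.io/example/htslib:latest")  # placeholder image
vcf_inputs = s2.use_previous_step_outputs_as_inputs(s1, localize_by=Localize.COPY)
s2.command(f"tabix -p vcf {vcf_inputs[0].local_path}")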

output_dir(path)

If an output path is specified as a relative path, it will be relative to this dir.

Parameters

path (str) – Directory path.

output(local_path, output_path=None, output_dir=None, name=None, delocalize_by=None)

Specify a Step output file or directory.

Parameters
  • local_path (str) – The file or directory path within the execution container’s file system.

  • output_path (str) – Optional destination path to which the local_path should be delocalized.

  • output_dir (str) – Optional destination directory to which the local_path should be delocalized. It is expected that either output_path will be specified, or an output_dir value will be provided as an argument to this method or previously (such as by calling the step.output_dir(..) setter method). If both output_path and output_dir are specified and output_path is a relative path, it is interpreted as being relative to output_dir.

  • name (str) – Optional name for this output.

  • delocalize_by (Delocalize) – How this path should be delocalized.

Returns

An object describing this output.

Return type

OutputSpec

outputs(local_path, *local_paths, output_dir=None, name=None, delocalize_by=None)

Define one or more outputs.

Parameters
  • local_path (str) – The file or directory path within the execution container’s file system.

  • output_dir (str) – Optional destination directory to which the given local_path(s) should be delocalized.

  • name (str) – Optional name for the output(s).

  • delocalize_by (Delocalize) – How the path(s) should be delocalized.

Returns

A list of OutputSpec objects that describe these outputs. The list will contain one entry for each passed-in path.

Return type

list
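
Putting the input, command, and output methods together, a typical Step body might look like this (a hedged sketch; the image, paths, and command are placeholders):

s1 = sp.new_step("checksum", image="ubuntu:22.04")
cram = s1.input("gs://bucket/sample1.cram", localize_by=Localize.COPY)
s1.command(f"md5sum {cram.local_path} > sample1.cram.md5")
s1.output("sample1.cram.md5", output_dir="gs://bucket/results")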

depends_on(upstream_step)

Marks this Step as being downstream of another Step in the pipeline, meaning that this Step can only run after the upstream_step has completed successfully.

Parameters

upstream_step (Step) – The upstream Step this Step depends on.

has_upstream_steps()

Returns True if this Step has upstream Steps that must run before it runs (ie. that it depends on)

post_to_slack(message, channel=None, slack_token=None)

Posts the given message to Slack. Requires python3 and pip to be installed in the execution environment.

Parameters
  • message (str) – The message to post.

  • channel (str) – The Slack channel to post to.

  • slack_token (str) – Slack auth token.
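
A minimal sketch (the message and channel are placeholders; s1 is assumed to be an existing Step):

s1.post_to_slack("variant calling finished", channel="#pipelines")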

switch_gcloud_auth_to_user_account(gcloud_credentials_path=None, gcloud_user_account=None, gcloud_project=None, debug=False)

This method adds commands to this Step to switch gcloud auth from the Batch-provided service account to the user’s personal account.

This is useful if subsequent commands need to access google buckets to which the user’s personal account has access, but to which the Batch service account cannot be granted access for whatever reason.

For this to work, you must first:

  1. create a google bucket that only you have access to - for example: gs://weisburd-gcloud-secrets/

  2. on your local machine, make sure you’re logged in to gcloud by running:

    gcloud auth login

  3. copy your local ~/.config directory (which caches your gcloud auth credentials) to the secrets bucket from step 1:

    gsutil -m cp -r ~/.config/ gs://weisburd-gcloud-secrets/

  4. grant your default Batch service-account read access to your secrets bucket so it can download these credentials into each docker container.

  5. make sure gcloud & gsutil are installed inside the docker images you use for your Batch jobs

  6. call this method at the beginning of your batch job:

Example

step.switch_gcloud_auth_to_user_account(
    "gs://weisburd-gcloud-secrets",
    "weisburd@broadinstitute.org",
    "seqr-project",
)

Parameters
  • gcloud_credentials_path (str) – Google bucket path that contains your gcloud auth .config folder.

  • gcloud_user_account (str) – The user account to activate (ie. “weisburd@broadinstitute.org”).

  • gcloud_project (str) – This will be set as the default gcloud project within the container.

  • debug (bool) – Whether to add extra “gcloud auth list” commands that are helpful for troubleshooting issues with the auth steps.

record_memory_cpu_and_disk_usage(output_dir, time_interval=5, export_json=True, export_graphs=False, install_glances=True)

Add commands that run the ‘glances’ python tool to record memory, CPU, disk usage, and other profiling stats in the background at regular intervals.

Parameters
  • output_dir (str) – Profiling data will be written to this directory.

  • time_interval (int) – How frequently to update the profiling data files.

  • export_json (bool) – Whether to export a glances.json file to output_dir.

  • export_graphs (bool) – Whether to export .svg graphs.

  • install_glances (bool) – If True, a command will be added to first install the ‘glances’ python library inside the execution container.
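
A minimal sketch (the output directory is a placeholder; s1 is assumed to be an existing Step):

s1.record_memory_cpu_and_disk_usage("gs://bucket/profiling", time_interval=10, export_graphs=True)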

step_pipeline.utils module

This module contains misc. utility functions used by other modules.

step_pipeline.utils.are_any_inputs_missing(step, verbose=False)

Returns True if any of the Step’s inputs don’t exist

step_pipeline.utils.are_outputs_up_to_date(step, verbose=False)

Returns True if all of the Step’s outputs already exist and are newer than all inputs
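
These checks can also be called directly, for example (a sketch, assuming s1 is a fully-defined Step):

from step_pipeline.utils import are_any_inputs_missing, are_outputs_up_to_date

if are_any_inputs_missing(s1, verbose=True):
    print("some inputs are missing")
elif are_outputs_up_to_date(s1):
    print("outputs are up to date, so this Step can be skipped")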

exception step_pipeline.utils.GoogleStorageException

Bases: Exception

step_pipeline.utils.check_gcloud_storage_region(gs_path, expected_regions=('US', 'US-CENTRAL1'), gcloud_project=None, ignore_access_denied_exception=True, verbose=True)

Checks whether the given Google Storage path is located in one of the expected_regions. The default includes “US-CENTRAL1” since that is the region where the Hail Batch cluster is located. Localizing data from other regions will be slower and result in egress charges.

Parameters
  • gs_path (str) – The google storage gs:// path to check. Only the bucket portion of the path matters, so other parts of the path can contain wildcards (*), etc.

  • expected_regions (tuple) – a set of acceptable storage regions. If gs_path is not in one of these regions, this method will raise a StorageRegionException.

  • gcloud_project (str) – (optional) if specified, it will be added to the gsutil command with the -u arg.

  • ignore_access_denied_exception (bool) – If True, this method returns silently if it encounters an AccessDenied error.

  • verbose (bool) – print more detailed log output

Raises

StorageRegionException – If the given gs_path is not stored in one of the expected_regions.
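
For example (a sketch; the path and project are placeholders):

from step_pipeline.utils import check_gcloud_storage_region

# raises an exception if the bucket is outside the expected regions
check_gcloud_storage_region(
    "gs://bucket/dir/*.cram",
    expected_regions=("US", "US-CENTRAL1"),
    gcloud_project="my-project",
)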

step_pipeline.wdl module

This module contains Cromwell/Terra-specific extensions of the Pipeline and Step classes

class step_pipeline.wdl.WdlPipeline(name=None, config_arg_parser=None, backend=Backend.TERRA)

Bases: step_pipeline.pipeline.Pipeline

This class extends the Pipeline class to add support for generating a WDL and will later add support for running it using Cromwell or Terra.

property backend

Returns either Backend.CROMWELL or Backend.TERRA

new_step(name=None, step_number=None, depends_on=None, image=None, cpu=None, memory=None, storage=None, localize_by=Localize.COPY, delocalize_by=Delocalize.COPY, **kwargs)

Creates a new pipeline Step.

Parameters
  • name (str) – A short name for this Step.

  • step_number (int) – Optional Step number which serves as another alias for this step in addition to name.

  • depends_on (Step) – Optional upstream Step that this Step depends on.

  • image (str) – Docker image to use for this Step.

  • cpu (str, float, int) – CPU requirements. If the value is numeric, it is interpreted as the number of CPU cores.

  • memory (str, float, int) – Memory requirements. The memory expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi. Omitting a suffix means the value is in bytes. For the ServiceBackend, the values ‘lowmem’, ‘standard’, and ‘highmem’ are also valid arguments. ‘lowmem’ corresponds to approximately 1 Gi/core, ‘standard’ corresponds to approximately 4 Gi/core, and ‘highmem’ corresponds to approximately 7 Gi/core. The default value is ‘standard’.

  • storage (str, int) – Disk size. The storage expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi. Omitting a suffix means the value is in bytes.

  • localize_by (Localize) – If specified, this will be the default Localize approach used by Step inputs.

  • delocalize_by (Delocalize) – If specified, this will be the default Delocalize approach used by Step outputs.

  • **kwargs – other keyword args can be provided, but are ignored.

Returns

The new WdlStep object.

Return type

WdlStep

run_for_each_row(table)

Run the pipeline in parallel for each row of the given table

run()

Generate WDL

class step_pipeline.wdl.WdlStep(pipeline, name=None, step_number=None, image=None, cpu=None, memory=None, storage=None, output_dir=None, localize_by=Localize.COPY, delocalize_by=Delocalize.COPY)

Bases: step_pipeline.pipeline.Step

This class contains Cromwell/Terra-specific extensions of the Step class

cpu(cpu)

Set the CPU requirement for this Step.

Parameters

cpu (str, float, int) – CPU requirements. If the value is numeric, it is interpreted as the number of CPU cores.

memory(memory)

Set the memory requirement for this Step.

Parameters

memory (str, float, int) – Memory requirements. The memory expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi. Omitting a suffix means the value is in bytes. For the ServiceBackend, the values ‘lowmem’, ‘standard’, and ‘highmem’ are also valid arguments. ‘lowmem’ corresponds to approximately 1 Gi/core, ‘standard’ corresponds to approximately 4 Gi/core, and ‘highmem’ corresponds to approximately 7 Gi/core. The default value is ‘standard’.

storage(storage)

Set the disk size for this Step.

Parameters

storage (str, int) – Disk size. The storage expression must be of the form {number}{suffix} where valid optional suffixes are K, Ki, M, Mi, G, Gi, T, Ti, P, and Pi.

Module contents