Failure handling

Failure handling controls how workflows respond when jobs or steps fail. This includes whether to continue execution, cancel remaining work, or mark steps as non-critical.

Overview

By default:

  • Jobs: If a job fails, dependent jobs don't run.
  • Steps: If a step fails, the job fails and subsequent steps don't run.
  • Workflow: If any job fails, other running jobs are canceled (cancel-on-failure behavior).

You can customize this behavior at the workflow level and at the step level.
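To illustrate the first default (a failed job prevents its dependents from running), here is a minimal Python sketch. It is not the platform's implementation: the `depends_on` mapping, the `deploy` and `report` job names, and the assumption that skipping propagates transitively through the dependency chain are all hypothetical and used only for illustration.

```python
from collections import deque


def jobs_skipped_after_failure(depends_on, failed_job):
    """Return the set of jobs that will not run because `failed_job` failed.

    `depends_on` maps each job name to the jobs it depends on. This is a
    hypothetical representation, not the workflow schema.
    """
    # Invert the mapping: for each job, which jobs depend on it directly?
    dependents = {}
    for job, upstream in depends_on.items():
        for parent in upstream:
            dependents.setdefault(parent, []).append(job)

    # Walk the dependents breadth-first (assumes skipping is transitive).
    skipped = set()
    queue = deque([failed_job])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in skipped:
                skipped.add(child)
                queue.append(child)
    return skipped


# Hypothetical graph: train needs validate, deploy needs train.
graph = {"validate": [], "train": ["validate"], "deploy": ["train"], "report": []}
print(sorted(jobs_skipped_after_failure(graph, "validate")))
```

In this sketch `report` has no dependency on `validate`, so it is not skipped by the dependency rule; whether it keeps running is governed by cancel_on_failure.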

Schema

See Schema Reference for cancel_on_failure (workflow-level) and continue_on_error (step-level) definitions.

Workflow-Level: cancel_on_failure

Controls whether to cancel all running jobs when any job fails.

Field: cancel_on_failure (optional boolean)

Default: true (cancel all jobs when any job fails)

Values:

  • true: Cancel all running jobs when any job fails, regardless of dependencies (default).
  • false: Continue running all jobs even if some fail.

Examples

Default Behavior (cancel-on-failure enabled)

jobs:
  validate:
    steps:
      - name: Check data quality
  train:
    steps:
      - name: Train model

If validate fails while train is still running, train is canceled.

Disable Cancel-on-Failure

cancel_on_failure: false

jobs:
  train-xgboost:
    steps:
      - name: Train model
  train-random-forest:
    steps:
      - name: Train model

If train-xgboost fails, train-random-forest continues running.
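The effect of the flag on jobs that are still running can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the scheduler's code: the function name `states_after_failure` and the state strings ("running", "failed", "canceled") are illustrative.

```python
def states_after_failure(job_states, failed_job, cancel_on_failure=True):
    """Return a new mapping of job states after `failed_job` fails.

    `job_states` maps job names to illustrative state strings.
    """
    updated = dict(job_states)
    updated[failed_job] = "failed"
    if cancel_on_failure:
        # Cancel every other job that is still running, whether or not
        # it depends on the failed job.
        for name, state in updated.items():
            if name != failed_job and state == "running":
                updated[name] = "canceled"
    return updated


states = {"train-xgboost": "running", "train-random-forest": "running"}
print(states_after_failure(states, "train-xgboost"))
print(states_after_failure(states, "train-xgboost", cancel_on_failure=False))
```

With the default, the sibling job ends up canceled; with cancel_on_failure set to false it stays running.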

Step-Level: continue_on_error

Controls whether a job continues executing when a step fails.

Field: continue_on_error (optional boolean)

Default: false (job fails when step fails)

Values:

  • true: Continue job execution even if this step fails.
  • false: Fail the job if this step fails (default).

Examples

Optional Model Validation

jobs:
  train:
    name: Train and Validate
    steps:
      - name: Train model
      - name: Run optional validation
        continue_on_error: true  # Report metrics but don't block
      - name: Save model

Optional validation reports metrics but doesn't block model saving.
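The step-level semantics can be pictured as a simple loop over the steps of a job. The following Python sketch is an assumption-laden model rather than the runner's actual logic: the `Step` dataclass, the `run_job` function, and the use of exceptions to signal step failure are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # raises an exception to signal failure
    continue_on_error: bool = False  # mirrors the step-level field


def run_job(steps: List[Step]) -> Tuple[bool, List[Tuple[str, str]]]:
    """Run steps in order; return (job_succeeded, per-step results)."""
    results = []
    for step in steps:
        try:
            step.action()
            results.append((step.name, "succeeded"))
        except Exception:
            results.append((step.name, "failed"))
            if not step.continue_on_error:
                # A critical step failed: the job fails and later steps don't run.
                return False, results
    return True, results


def failing():
    raise RuntimeError("validation metrics below threshold")


job_ok, log = run_job([
    Step("Train model", lambda: None),
    Step("Run optional validation", failing, continue_on_error=True),
    Step("Save model", lambda: None),
])
print(job_ok, log)
```

Here the validation step is recorded as failed, yet "Save model" still runs and the job is reported as successful.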

Interaction Between cancel_on_failure and continue_on_error

A failed step with continue_on_error: true does not fail its job. If every step either succeeds or fails with continue_on_error: true, the job is considered successful and does not trigger cancel_on_failure.

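| Scenario            | cancel_on_failure | continue_on_error | Behavior                       |
|---------------------|-------------------|-------------------|--------------------------------|
| Critical step fails | true              | false             | Job fails, other jobs canceled |
| Critical step fails | false             | false             | Job fails, other jobs continue |
| Optional step fails | true              | true              | Job succeeds, other jobs run   |
| Optional step fails | false             | true              | Job succeeds, other jobs run   |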
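The four scenarios above reduce to one rule: cancel_on_failure only matters once a job has actually failed, and a step with continue_on_error: true never fails the job. A minimal Python sketch of that rule follows, assuming a single failing step per job; the function name `resolve` and the returned strings are illustrative.

```python
def resolve(cancel_on_failure, continue_on_error):
    """Return (job outcome, effect on other jobs) when one step fails."""
    if continue_on_error:
        # The failure is tolerated, so the job never reaches the failed state.
        return ("job succeeds", "other jobs run")
    if cancel_on_failure:
        return ("job fails", "other jobs canceled")
    return ("job fails", "other jobs continue")


for cancel in (True, False):
    for tolerate in (False, True):
        print(cancel, tolerate, resolve(cancel, tolerate))
```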

Concurrency Cancellations vs Failures

Workflows canceled by concurrency control are treated differently from failed workflows:

  • State: Canceled workflows have the state cancelled (not failed).
  • cancel_on_failure: Concurrency cancellations do NOT trigger cancel_on_failure in other workflows.
  • Use case: Concurrency control is for resource management, not error handling.

See Concurrency Control for details on managing concurrent workflow executions.

