
Failure handling

Failure handling controls how workflows respond when jobs or steps fail. This includes whether to continue execution, cancel remaining work, or mark steps as non-critical.

Overview

By default:

  • Jobs: If a job fails, dependent jobs don't run
  • Steps: If a step fails, the job fails and subsequent steps don't run
  • Workflow: If any job fails, other running jobs are canceled (cancel-on-failure behavior)

You can customize this behavior at the workflow level and at the step level.

Workflow-level: cancel_on_failure

Controls whether to cancel all running jobs when any job fails.

Field: cancel_on_failure (optional boolean)

Default: true (cancel all jobs when any job fails)

Value    Behavior
true     Cancel all running jobs when any job fails, regardless of dependencies (default)
false    Continue running all jobs even if some fail

Examples

Default behavior (cancel-on-failure enabled)

jobs:
  validate:
    steps:
      - name: Check data quality
        run: python validate.py

  train:
    steps:
      - name: Train model
        run: python train.py

If validate fails, train is canceled.

Disable cancel-on-failure

cancel_on_failure: false

jobs:
  train-xgboost:
    steps:
      - name: Train model
        run: python train_xgboost.py

  train-random-forest:
    steps:
      - name: Train model
        run: python train_rf.py

If train-xgboost fails, train-random-forest continues running.
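The effect of cancel_on_failure on jobs that are already running can be sketched in a few lines of Python. This is only a model of the behavior described above, not the workflow engine's implementation: the function name on_job_failure and the state labels "running", "failed", and "canceled" are assumptions chosen for illustration.

```python
def on_job_failure(failed_job, job_states, cancel_on_failure=True):
    """Return the job states after `failed_job` fails.

    job_states maps job name -> state label. The labels are illustrative
    assumptions, not values taken from the workflow engine.
    """
    updated = dict(job_states)
    updated[failed_job] = "failed"
    if cancel_on_failure:
        # Default behavior: every job that is still running is canceled.
        updated = {
            name: "canceled" if state == "running" else state
            for name, state in updated.items()
        }
    return updated


states = {"train-xgboost": "running", "train-random-forest": "running"}
print(on_job_failure("train-xgboost", states))
print(on_job_failure("train-xgboost", states, cancel_on_failure=False))
```

With the default setting the second training job ends up canceled; with cancel_on_failure set to false it keeps its running state.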

Step-level: continue_on_error

Controls whether a job continues executing when a step fails.

Field: continue_on_error (optional boolean)

Default: false (job fails when step fails)

Value    Behavior
true     Continue job execution even if this step fails
false    Fail the job if this step fails (default)

Examples

Optional model validation

jobs:
  train:
    name: Train and Validate
    steps:
      - name: Train model
        run: python train.py

      - name: Run optional validation
        continue_on_error: true  # Report metrics but don't block
        run: python optional_validation.py

      - name: Save model
        run: python save_model.py

Optional validation reports metrics but doesn't block model saving.
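The step-level rule can be modeled as a simple loop: steps run in order, and the job stops at the first failing step that is not marked continue_on_error. The sketch below is a simplified model under that assumption; the Step type and the run_job function are hypothetical names, and each step's outcome is supplied as a flag instead of being executed.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str
    succeeds: bool                   # simulated outcome of the step
    continue_on_error: bool = False  # same default as the workflow field


def run_job(steps):
    """Return (job_status, names_of_steps_that_ran)."""
    executed = []
    for step in steps:
        executed.append(step.name)
        if not step.succeeds and not step.continue_on_error:
            # A failing critical step fails the job and skips the rest.
            return "failed", executed
    return "succeeded", executed


job = [
    Step("Train model", succeeds=True),
    Step("Run optional validation", succeeds=False, continue_on_error=True),
    Step("Save model", succeeds=True),
]
print(run_job(job))
```

Here the optional validation step fails, yet all three steps run and the job is reported as succeeded. If "Train model" failed instead, the loop would return immediately and the later steps would not run.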

Interaction between cancel_on_failure and continue_on_error

A step with continue_on_error: true does not fail its job when the step fails. If the only steps that fail are marked this way, the job is still considered successful and does not trigger cancel_on_failure.

Scenario               cancel_on_failure    continue_on_error    Behavior
Critical step fails    true                 false                Job fails, other jobs canceled
Critical step fails    false                false                Job fails, other jobs continue
Optional step fails    true                 true                 Job succeeds, other jobs run
Optional step fails    false                true                 Job succeeds, other jobs run
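The four rows of the table reduce to a small decision function. The following is a minimal sketch of that logic; the function name outcome_when_step_fails is hypothetical, and the returned strings simply repeat the wording of the table.

```python
def outcome_when_step_fails(cancel_on_failure, continue_on_error):
    """Return (job_result, effect_on_other_jobs) for a failing step."""
    if continue_on_error:
        # The failure is absorbed, so cancel_on_failure is never consulted.
        return "job succeeds", "other jobs run"
    if cancel_on_failure:
        return "job fails", "other jobs canceled"
    return "job fails", "other jobs continue"


for cancel in (True, False):
    for cont in (False, True):
        print(cancel, cont, outcome_when_step_fails(cancel, cont))
```

The ordering of the checks mirrors the rule stated above: continue_on_error is evaluated first, and cancel_on_failure only matters once a job has actually failed.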

Concurrency cancellations vs failures

A workflow canceled by concurrency control is treated differently from a failed workflow:

Aspect                        Concurrency cancellation    Failure
State                         cancelled                   failed
Triggers cancel_on_failure    No                          Yes
Use case                      Resource management         Error handling

For details on managing concurrent workflow executions, see Concurrency.

Complete example

id: ml-pipeline
name: ML Training Pipeline

cancel_on_failure: false  # Allow independent jobs to continue

jobs:
  preprocess:
    steps:
      - name: Validate input data
        run: python validate.py

      - name: Preprocess data
        run: python preprocess.py

  train-model-a:
    depends_on: [preprocess]
    steps:
      - name: Train model A
        run: python train_a.py

      - name: Run optional metrics
        continue_on_error: true
        run: python optional_metrics.py

      - name: Save model
        run: python save.py

  train-model-b:
    depends_on: [preprocess]
    steps:
      - name: Train model B
        run: python train_b.py

  evaluate:
    depends_on: [train-model-a, train-model-b]
    steps:
      - name: Compare models
        run: python compare.py

In this example:

  • If train-model-a fails, train-model-b continues running (because cancel_on_failure: false)
  • If the optional metrics step fails, the job continues to save the model
  • evaluate runs only if both training jobs succeed
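To trace how a failure propagates through the depends_on graph of this pipeline, the job dependencies can be modeled as a dictionary and walked in order. This is a rough sketch that assumes cancel_on_failure: false, as in the example; the simulate function and the state labels "succeeded", "failed", and "not run" are hypothetical and not part of the workflow schema.

```python
# Job name -> jobs it depends on, listed in dependency order.
PIPELINE = {
    "preprocess": [],
    "train-model-a": ["preprocess"],
    "train-model-b": ["preprocess"],
    "evaluate": ["train-model-a", "train-model-b"],
}


def simulate(pipeline, failing_jobs):
    """Return the final state of each job, given the set of jobs that fail.

    Assumes cancel_on_failure is false, so a failure only affects
    the jobs that depend on the failed job.
    """
    states = {}
    for job, deps in pipeline.items():
        if any(states[dep] != "succeeded" for dep in deps):
            states[job] = "not run"  # a dependency failed or never ran
        elif job in failing_jobs:
            states[job] = "failed"
        else:
            states[job] = "succeeded"
    return states


print(simulate(PIPELINE, {"train-model-a"}))
```

With train-model-a failing, train-model-b still succeeds and evaluate does not run, which matches the first and third points above. The sketch relies on the dictionary listing each job after its dependencies; a fuller model would sort the jobs topologically first.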
