Failure handling
Failure handling controls how workflows respond when jobs or steps fail. This includes whether to continue execution, cancel remaining work, or mark steps as non-critical.
Overview
By default:
- Jobs: If a job fails, dependent jobs don't run
- Steps: If a step fails, the job fails and subsequent steps don't run
- Workflow: If any job fails, other running jobs are canceled (cancel-on-failure behavior)
You can customize this behavior at the workflow level and at the step level.
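As a rough illustration of the three defaults listed above, the sketch below sets no failure-handling fields at all. The job names, step names, and script names are hypothetical. The `depends_on` field is taken from the complete example at the end of this page.

```yaml
# No failure-handling fields are set, so every default applies.
jobs:
  prepare:
    steps:
      - name: Download data
        run: python download.py   # If this step fails, "Clean data" does not run
      - name: Clean data
        run: python clean.py
  train:
    depends_on: [prepare]         # Does not run if prepare fails
    steps:
      - name: Train model
        run: python train.py
  report:
    steps:
      - name: Build report        # Independent job, but canceled if still running when another job fails
        run: python report.py
```

Under these assumptions, a failure in prepare stops its remaining steps, prevents train from starting, and cancels report if it is still running.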
Workflow-level: cancel_on_failure
Controls whether to cancel all running jobs when any job fails.
Field: cancel_on_failure (optional boolean)
Default: true (cancel all jobs when any job fails)
| Value | Behavior |
|---|---|
| true | Cancel all running jobs when any job fails, regardless of dependencies (default) |
| false | Continue running all jobs even if some fail |
Examples
Default behavior (cancel-on-failure enabled)
jobs:
  validate:
    steps:
      - name: Check data quality
        run: python validate.py
  train:
    steps:
      - name: Train model
        run: python train.py
If validate fails, train is canceled.
Disable cancel-on-failure
cancel_on_failure: false
jobs:
  train-xgboost:
    steps:
      - name: Train model
        run: python train_xgboost.py
  train-random-forest:
    steps:
      - name: Train model
        run: python train_rf.py
If train-xgboost fails, train-random-forest continues running.
Step-level: continue_on_error
Controls whether a job continues executing when a step fails.
Field: continue_on_error (optional boolean)
Default: false (job fails when step fails)
| Value | Behavior |
|---|---|
| true | Continue job execution even if this step fails |
| false | Fail the job if this step fails (default) |
Examples
Optional model validation
jobs:
  train:
    name: Train and Validate
    steps:
      - name: Train model
        run: python train.py
      - name: Run optional validation
        continue_on_error: true  # Report metrics but don't block
        run: python optional_validation.py
      - name: Save model
        run: python save_model.py
Optional validation reports metrics but doesn't block model saving.
Interaction between cancel_on_failure and continue_on_error
A failing step with continue_on_error: true does not cause its job to fail. If the only steps that fail are marked continue_on_error: true, the job is considered successful and does not trigger cancel_on_failure.
| Scenario | cancel_on_failure | continue_on_error | Behavior |
|---|---|---|---|
| Critical step fails | true | false | Job fails, other jobs canceled |
| Critical step fails | false | false | Job fails, other jobs continue |
| Optional step fails | true | true | Job succeeds, other jobs run |
| Optional step fails | false | true | Job succeeds, other jobs run |
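To make the third row of the table concrete, the following sketch writes out the default `cancel_on_failure: true` explicitly and marks one step as optional. The job names, step names, and script names are hypothetical. Only the `cancel_on_failure` and `continue_on_error` fields and the overall layout come from this page.

```yaml
cancel_on_failure: true            # Default value, written out for clarity
jobs:
  train:
    steps:
      - name: Profile data         # Hypothetical optional step
        continue_on_error: true    # A failure here does not fail the job
        run: python profile_data.py
      - name: Train model          # Critical step: a failure here fails the job
        run: python train.py
  report:
    steps:
      - name: Build report
        run: python report.py
```

If only the Profile data step fails, train still succeeds and report is not canceled. If Train model fails instead, train fails and report is canceled if it is still running.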
Concurrency cancellations vs failures
Workflows canceled by concurrency control are distinct from failed workflows:
| Aspect | Concurrency cancellation | Failure |
|---|---|---|
| State | cancelled | failed |
| Triggers cancel_on_failure | No | Yes |
| Use case | Resource management | Error handling |
For details on managing concurrent workflow executions, see Concurrency.
Complete example
id: ml-pipeline
name: ML Training Pipeline
cancel_on_failure: false  # Allow independent jobs to continue
jobs:
  preprocess:
    steps:
      - name: Validate input data
        run: python validate.py
      - name: Preprocess data
        run: python preprocess.py
  train-model-a:
    depends_on: [preprocess]
    steps:
      - name: Train model A
        run: python train_a.py
      - name: Run optional metrics
        continue_on_error: true
        run: python optional_metrics.py
      - name: Save model
        run: python save.py
  train-model-b:
    depends_on: [preprocess]
    steps:
      - name: Train model B
        run: python train_b.py
  evaluate:
    depends_on: [train-model-a, train-model-b]
    steps:
      - name: Compare models
        run: python compare.py
In this example:
- If train-model-a fails, train-model-b continues running (because cancel_on_failure: false)
- If the optional metrics step fails, the job continues to save the model
- evaluate only runs if both training jobs succeed