Failure handling
Failure handling controls how workflows respond when jobs or steps fail. This includes whether to continue execution, cancel remaining work, or mark steps as non-critical.
Overview
By default:
- Jobs: If a job fails, dependent jobs don't run.
- Steps: If a step fails, the job fails and subsequent steps don't run.
- Workflow: If any job fails, other running jobs are canceled (cancel-on-failure behavior).
You can customize this behavior at the workflow level (cancel_on_failure) and at the step level (continue_on_error).
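As a rough illustration, the defaults listed above are equivalent to setting both fields explicitly. This is a minimal sketch that uses only the fields named on this page; the job and step names are placeholders, and each step is reduced to its name.

```yaml
# Workflow-level default: a failed job cancels the other running jobs.
cancel_on_failure: true
jobs:
  train:                           # placeholder job name
    steps:
      - name: Train model          # placeholder step name
        # Step-level default: a failed step fails the job.
        continue_on_error: false
      - name: Save model           # skipped if "Train model" fails
```

Omitting both fields should give the same behavior, since these are the documented defaults.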
Schema
See Schema Reference for cancel_on_failure (workflow-level) and continue_on_error (step-level) definitions.
Workflow-Level: cancel_on_failure
Controls whether to cancel all running jobs when any job fails.
Field: cancel_on_failure (optional boolean)
Default: true (cancel all jobs when any job fails)
Values:
- true: Cancel all running jobs when any job fails, regardless of dependencies (default).
- false: Continue running all jobs even if some fail.
Examples
Default Behavior (cancel-on-failure enabled)
jobs:
  validate:
    steps:
      - name: Check data quality
  train:
    steps:
      - name: Train model
If validate fails, train is canceled.
Disable Cancel-on-Failure
cancel_on_failure: false
jobs:
  train-xgboost:
    steps:
      - name: Train model
  train-random-forest:
    steps:
      - name: Train model
If train-xgboost fails, train-random-forest continues running.
Step-Level: continue_on_error
Controls whether a job continues executing when a step fails.
Field: continue_on_error (optional boolean)
Default: false (job fails when step fails)
Values:
- true: Continue job execution even if this step fails.
- false: Fail the job if this step fails (default).
Examples
Optional Model Validation
jobs:
  train:
    name: Train and Validate
    steps:
      - name: Train model
      - name: Run optional validation
        continue_on_error: true  # Report metrics but don't block
      - name: Save model
Optional validation reports metrics but doesn't block model saving.
Interaction Between cancel_on_failure and continue_on_error
A failed step with continue_on_error: true does not fail its job. If the only steps that fail are marked continue_on_error: true, the job runs to completion, is considered successful, and does not trigger cancel_on_failure.
| Scenario | cancel_on_failure | continue_on_error | Behavior |
|---|---|---|---|
| Critical step fails | true | false | Job fails, other jobs canceled |
| Critical step fails | false | false | Job fails, other jobs continue |
| Optional step fails | true | true | Job succeeds, other jobs run |
| Optional step fails | false | true | Job succeeds, other jobs run |
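The rows above can be combined in a single workflow. The following is a minimal sketch, assuming two independent jobs and using only the fields described on this page; the job and step names are placeholders borrowed from the earlier examples, and each step is reduced to its name.

```yaml
# A failed job does not cancel the other running jobs.
cancel_on_failure: false
jobs:
  train-xgboost:
    steps:
      # Critical step: a failure here fails this job only,
      # because cancel_on_failure is false.
      - name: Train model
      # Optional step: a failure here does not fail the job.
      - name: Run optional validation
        continue_on_error: true
      - name: Save model
  train-random-forest:
    steps:
      - name: Train model
```

With this configuration, a failure in "Run optional validation" leaves train-xgboost successful, and a failure in its "Train model" step fails train-xgboost while train-random-forest keeps running.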
Concurrency Cancellations vs Failures
Workflows cancelled due to concurrency control are different from failed workflows:
- State: Cancelled workflows have state cancelled (not failed).
- cancel_on_failure: Concurrency cancellations do NOT trigger cancel_on_failure in other workflows.
- Use case: Concurrency control is for resource management, not error handling.
See Concurrency Control for details on managing concurrent workflow executions.