Matrix jobs
Matrix jobs enable parallel execution of a job with different parameter combinations. Define variables with multiple values, and the system creates a separate job instance for every combination in the Cartesian product of those values.
Defining matrices
Basic matrix
A matrix with a single variable creates multiple parallel job instances:
jobs:
  train-model:
    name: Train with Different Algorithms
    matrix:
      algorithm: [xgboost, lightgbm, random_forest, neural_net]
    runner: gpu-large
    timeout: "2h"
    steps:
      - name: Train model
        env:
          ALGORITHM: ${{ .matrix.algorithm }}
        run: |
          python train.py --algorithm $ALGORITHM --data ./data/train.csv
This creates 4 job instances, one for each algorithm.
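Following the bracketed naming pattern shown in the multi-variable example below, these instances would likely be identified as:
- train-model[algorithm:xgboost]
- train-model[algorithm:lightgbm]
- train-model[algorithm:random_forest]
- train-model[algorithm:neural_net]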
Multi-variable matrix
Multiple variables create a Cartesian product of all combinations:
jobs:
  hyperparameter-search:
    name: Hyperparameter Tuning
    matrix:
      learning_rate: ["0.001", "0.01", "0.1"]
      batch_size: ["32", "64", "128"]
    runner: gpu-large
    timeout: "3h"
    steps:
      - name: Train with hyperparameters
        env:
          LR: ${{ .matrix.learning_rate }}
          BATCH: ${{ .matrix.batch_size }}
        run: |
          echo "Training with learning_rate=$LR and batch_size=$BATCH"
          python train.py --lr $LR --batch-size $BATCH --epochs 100
      - name: Upload model
        upload:
          path: models/model.pkl
          destination: drive://models/lr-${{ .matrix.learning_rate }}-batch-${{ .matrix.batch_size }}/
This creates 9 job instances (3 learning rates × 3 batch sizes):
- hyperparameter-search[batch_size:32,learning_rate:0.001]
- hyperparameter-search[batch_size:32,learning_rate:0.01]
- hyperparameter-search[batch_size:32,learning_rate:0.1]
- hyperparameter-search[batch_size:64,learning_rate:0.001]
- And so on...
Matrix with workflow calls
Matrices work with reusable workflows, passing matrix variables as inputs:
jobs:
  evaluate-models:
    name: Evaluate on Multiple Datasets
    matrix:
      model: [xgboost, lightgbm, random_forest]
      dataset: [train, validation, test]
    workflow:
      name: model-evaluation
      inputs:
        model_type: ${{ .matrix.model }}
        dataset_name: ${{ .matrix.dataset }}
        metrics: "accuracy,f1,auc"
This creates 9 job instances (3 models × 3 datasets), each calling the model-evaluation workflow with different parameters.
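For reference, here is a minimal sketch of what the called model-evaluation workflow might declare, assuming it follows the same top-level id/name/inputs/jobs layout as the complete example later on this page. The evaluate.py script, the runner choice, and the input descriptions are illustrative and not defined anywhere on this page:
id: model-evaluation
name: Model Evaluation
inputs:
  model_type:
    type: string
    required: true
    description: Model family to evaluate
  dataset_name:
    type: string
    required: true
    description: Dataset split to evaluate on
  metrics:
    type: string
    required: true
    description: Comma-separated list of metrics to compute
jobs:
  evaluate:
    name: Evaluate Model
    runner: gpu-large
    steps:
      - name: Run evaluation
        env:
          MODEL: ${{ .inputs.model_type }}
          DATASET: ${{ .inputs.dataset_name }}
          METRICS: ${{ .inputs.metrics }}
        run: python evaluate.py --model $MODEL --dataset $DATASET --metrics $METRICS
Each matrix instance calls the workflow independently, so every run receives a single model/dataset pair as its inputs.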
Matrix expressions
Accessing matrix variables
Access matrix variables using the ${{ .matrix.variable_name }} expression syntax, consistent with other workflow expressions.
Format: ${{ .matrix.<variable_name> }}
Example:
jobs:
  process:
    matrix:
      region: [us-east, eu-west, ap-south]
      data_type: [transactions, events]
    steps:
      - name: Download data
        download:
          source: drive://data/${{ .matrix.region }}/${{ .matrix.data_type }}/
          path: ./data/
      - name: Process data
        env:
          REGION: ${{ .matrix.region }}
          TYPE: ${{ .matrix.data_type }}
        run: python process.py --region $REGION --type $TYPE
Job expansion and execution
Parallel execution
All matrix job instances run in parallel. There is no automatic sequencing or max-parallel limit; every combination starts as soon as the job's dependencies are satisfied.
Dependency handling
When a job depends on a matrix job, it waits for all matrix instances to complete successfully.
Example:
jobs:
  train:
    matrix:
      model: [xgboost, lightgbm, random_forest]
    steps:
      - run: python train.py --model ${{ .matrix.model }}
  evaluate:
    depends_on: [train]  # Waits for all 3 training instances
    steps:
      - run: python evaluate_all.py --models-dir ./models/
Complete example
id: hyperparameter-optimization
name: Train Models with Different Hyperparameters
inputs:
  dataset_bucket:
    type: string
    required: true
    description: Training dataset bucket UUID
env:
  SCRIPTS_REPO: "https://github.com/h2oai/ml-training"
jobs:
  train-models:
    name: Train with Hyperparameters
    matrix:
      algorithm: [xgboost, lightgbm, random_forest]
      max_depth: ["5", "10", "15"]
    runner: gpu-large
    timeout: "3h"
    steps:
      - name: Download training data
        download:
          source: drive://${{ .inputs.dataset_bucket }}/train.csv
          path: ./data/train.csv
      - name: Clone training scripts
        run: git clone --depth 1 $SCRIPTS_REPO scripts
      - name: Train model
        env:
          ALGORITHM: ${{ .matrix.algorithm }}
          MAX_DEPTH: ${{ .matrix.max_depth }}
        run: |
          echo "Training $ALGORITHM with max_depth=$MAX_DEPTH"
          python scripts/train.py \
            --algorithm $ALGORITHM \
            --max-depth $MAX_DEPTH \
            --data ./data/train.csv \
            --output ./models/
      - name: Upload trained model
        upload:
          path: ./models/
          destination: drive://models/${{ .matrix.algorithm }}-depth${{ .matrix.max_depth }}/
      - name: Upload metrics
        upload:
          path: metrics.json
          destination: drive://metrics/${{ .matrix.algorithm }}-depth${{ .matrix.max_depth }}.json
This creates 9 parallel training runs (3 algorithms × 3 max_depth values).
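If you need a single follow-up job that compares all nine runs, the dependency pattern from the previous section applies unchanged. The following addition to the jobs section is a sketch only: compare_models.py is a hypothetical script, and the job assumes the per-run metrics were uploaded to drive://metrics/ as in the steps above.
jobs:
  # ... train-models job as defined above ...
  compare-models:
    name: Compare Hyperparameter Runs
    depends_on: [train-models]  # runs once, after all 9 matrix instances complete successfully
    steps:
      - name: Download all metrics
        download:
          source: drive://metrics/
          path: ./metrics/
      - name: Clone training scripts
        run: git clone --depth 1 $SCRIPTS_REPO scripts
      - name: Compare runs
        run: python scripts/compare_models.py --metrics-dir ./metrics/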