
vLLM

This page describes how to deploy a Hugging Face Large Language Model (LLM) on H2O MLOps.

note

This page shows how to use a Model Hub repository ID to create and deploy an experiment in H2O MLOps. If you already have a vLLM Config artifact, you can follow the standard process outlined in our Quickstart guide.

  1. Connect to H2O MLOps.

    The first step is to connect to H2O MLOps. To connect from outside the H2O Cloud, run the following code:

    import time
    import h2o_mlops
    import h2o_mlops.options

    mlops = h2o_mlops.Client(
        h2o_cloud_url="<h2o_cloud_url>",
        refresh_token="<refresh_token>",
    )
    • h2o_cloud_url: This is the same URL used to access the H2O Cloud homepage.
    • refresh_token: For information on how to retrieve your refresh token (also known as a platform token), see API authentication.
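    To verify that the client can reach H2O MLOps, you can issue a simple read call once the client is created. The snippet below is a minimal sketch that assumes the projects collection exposes a list() method, analogous to the experiments.list() and deployments.list() calls used later in this guide.

    # Hedged connectivity check: assumes mlops.projects.list() is available
    # in your version of the h2o_mlops client.
    print(mlops.projects.list())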
  2. Specify the Hugging Face model.

    The second step is to define the Hugging Face model along with other experiment parameters.

    vllm_name = "my-vllm"
    non_vllm_name = "my-non-vllm"
    hub_model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # Hugging Face model repo ID
    passphrase = "passphrase"
    • vllm_name and non_vllm_name: Define the names for the vLLM experiment and the non-vLLM experiment, respectively.
    • hub_model_id: Specifies the Hugging Face repository ID of the model to deploy.
    • passphrase: Defines the security passphrase used to authenticate requests to the deployment; see the sketch after this list for reading it from the environment.
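    Hard-coding the passphrase is fine for a demo, but in practice you would typically read it from the environment instead. The following is a minimal sketch using only the Python standard library; the variable name H2O_MLOPS_PASSPHRASE is illustrative, not an H2O MLOps convention.

    import os

    # Read the passphrase from an environment variable, falling back to the demo value.
    # H2O_MLOPS_PASSPHRASE is a hypothetical variable name chosen for this example.
    passphrase = os.environ.get("H2O_MLOPS_PASSPHRASE", "passphrase")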
  3. Create a project.

    Create an H2O MLOps project named LLMOps Demo.

    project = mlops.projects.create(name="LLMOps Demo")
  4. Create experiments.

    Create two experiments to compare vLLM and non-vLLM types.

    vllm_experiment = project.experiments.create_vllm(
        hub_model_id=hub_model_id,
        name=vllm_name,
    )
    non_vllm_experiment = project.experiments.create(
        data="../tests/e2e_bdd/features/data/mojo.zip",  # path to a Driverless AI MOJO artifact
        name=non_vllm_name,
    )
  5. List created experiments.

    To list all experiments in the project, run the following code:

    project.experiments.list()

    Output:

       | name        | uid                                  | tags
    ---+-------------+--------------------------------------+------
     0 | my-non-vllm | 5c247e7f-b594-4d78-969e-1fc2ddf68838 |
     1 | my-vllm     | 70e34587-e73f-4db4-b1bd-f6062a537ceb |

    To list only vLLM experiments, run:

    project.experiments.list(filter_vllm=True)

    Output:

       | name    | uid                                  | tags
    ---+---------+--------------------------------------+------
     0 | my-vllm | 70e34587-e73f-4db4-b1bd-f6062a537ceb |
  6. Check the vLLM configuration for experiments.

    Inspect the configuration for vllm_experiment.

    print(vllm_experiment.vllm_config)

    Output:

    {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}

    Inspect the configuration for non_vllm_experiment.

    print(non_vllm_experiment.vllm_config)

    Output:

    None

    vllm_config for non_vllm_experiment is None because this is not a vLLM-based experiment.

  7. Create and register models.

    Create two models and register them with their corresponding experiments.

    vllm_model = project.models.create(name=vllm_name)
    non_vllm_model = project.models.create(name=non_vllm_name)

    vllm_model.register(vllm_experiment)
    non_vllm_model.register(non_vllm_experiment)
  8. Retrieve the list of tolerations.

    The following command retrieves the list of tolerations allowed in the H2O MLOps environment:

    mlops.allowed_tolerations

    Output:

    ['gpu', 'spot-cpu']
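    Before deploying the vLLM model in the next step, you can confirm that a GPU-capable toleration is actually available in your environment. A minimal sketch, assuming the toleration is named "gpu" as in the output above:

    # Fail fast if no GPU toleration is available for the vLLM deployment.
    assert "gpu" in mlops.allowed_tolerations, "No 'gpu' toleration available in this environment"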
  9. Deploy created models.

    This step deploys both the vLLM model and the non-vLLM model to H2O MLOps. When deploying the vLLM model, specify one of the GPU-based tolerations allowed in the H2O MLOps environment.

    vllm_deployment = project.deployments.deploy_vllm(
        name=vllm_name,
        model=vllm_model,
        security_options=h2o_mlops.options.SecurityOptions(
            passphrase=passphrase,
        ),
        kubernetes_options=h2o_mlops.options.KubernetesOptions(
            limits={"nvidia.com/gpu": "1"},
            toleration="gpu",  # specify one of the GPU-based tolerations allowed by MLOps
        ),
    )

    non_vllm_deployment = project.deployments.create_single(
        name=non_vllm_name,
        model=non_vllm_model,
        scoring_runtime=mlops.runtimes.scoring.list(
            artifact_type="dai_mojo_pipeline",
            uid="dai_mojo_runtime",
        )[0],
        security_options=h2o_mlops.options.SecurityOptions(
            passphrase=passphrase,
        ),
    )
    note

    Attempting to use deploy_vllm with non-vLLM models will raise a ValueError. Ensure that only LLM-based models are deployed using deploy_vllm.
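    One way to guard against this, sketched below, is to check the experiment's vllm_config (which is None for non-vLLM experiments, as shown in step 6) before calling deploy_vllm.

    # Hedged sketch: only use deploy_vllm for experiments that carry a vLLM config.
    if non_vllm_experiment.vllm_config is None:
        print("my-non-vllm is not a vLLM experiment; deploy it with create_single instead")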

  10. List deployments.

    To list all deployed models, run the following code:

    project.deployments.list()

    Output:

       | name        | mode         | uid
    ---+-------------+--------------+--------------------------------------
     0 | my-vllm     | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
     1 | my-non-vllm | Single Model | b80c3041-19f0-464e-9705-ce4a81f7090c

    To list only vLLM deployments, run:

    project.deployments.list(filter_vllm=True)

    Output:

       | name    | mode         | uid
    ---+---------+--------------+--------------------------------------
     0 | my-vllm | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
  11. Print deployment information.

    Let's compare the following deployment details between the vLLM and non-vLLM deployments.

    • Deployment details

    • Scorer details

    • Resource allocation details

    new_line_char = "\n"

    def print_deployment_info(deployment, deployment_type):
        # Wait until the deployment reports healthy, surfacing any failure immediately.
        while not deployment.is_healthy():
            deployment.raise_for_failure()
            time.sleep(5)
        print(f"{'=' * 30}{deployment_type} Deployment{'=' * 30}")
        print(f"Deployment Name: {deployment.name}")
        print(f"Deployment UID: {deployment.uid}")
        print(f"Deployment Status: {deployment.status()}")
        print(f"Scorer API Key: {deployment.security_options.passphrase}")
        print(f"Scorer API Base URL: {deployment.scorer_api_base_url}")
        print(f"OpenAI Base URL: {deployment.openai_base_url}")
        print(f"Configuration: {deployment.experiments[0].vllm_config}")
        print(f"Resources: [{str(deployment.kubernetes_options).replace(new_line_char, ', ')}]")
        print()

    To print the deployment information for the vLLM deployment, run:

    print_deployment_info(vllm_deployment, "vLLM")

    Output:

    ==============================vLLM Deployment==============================
    Deployment Name: my-vllm
    Deployment UID: 6ea17ba2-ae79-4852-a295-3aea553c1495
    Deployment Status: HEALTHY
    Scorer API Key: passphrase
    Scorer API Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495
    OpenAI Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495/v1
    Configuration: {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
    Resources: [replicas: 1, requests: {}, limits: {'nvidia.com/gpu': '1'}, affinity: , toleration: gpu]

    To print the deployment information for the non-vLLM deployment, run:

    print_deployment_info(non_vllm_deployment, "Non-vLLM")

    Output:

    ==============================Non-vLLM Deployment==============================
    Deployment Name: my-non-vllm
    Deployment UID: b80c3041-19f0-464e-9705-ce4a81f7090c
    Deployment Status: HEALTHY
    Scorer API Key: passphrase
    Scorer API Base URL: https://model.cloud-dev.h2o.dev/b80c3041-19f0-464e-9705-ce4a81f7090c
    OpenAI Base URL: None
    Configuration: None
    Resources: [replicas: 1, requests: {}, limits: {}, affinity: , toleration: ]
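
    With both deployments healthy, you can send a request to the vLLM deployment. It exposes an OpenAI-compatible endpoint at the OpenAI Base URL shown above, with the deployment passphrase acting as the API key. The following is a minimal sketch using the openai Python package; it assumes the served model is addressed by its Hugging Face repo ID, which may differ depending on your deployment's vLLM config.

    from openai import OpenAI

    # Point the OpenAI client at the deployment's OpenAI-compatible endpoint;
    # the passphrase configured earlier is used as the API key.
    client = OpenAI(
        base_url=vllm_deployment.openai_base_url,
        api_key=passphrase,
    )

    # Assumption: the model is addressed by its Hugging Face repo ID.
    response = client.chat.completions.create(
        model=hub_model_id,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)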
