vLLM
This page describes how to deploy a Hugging Face Large Language Model (LLM) on H2O MLOps.
It walks through using a Hugging Face Model Hub repository ID to create and deploy an experiment. If you already have a vLLM config artifact, you can follow the standard process outlined in our Quickstart guide.
-
Connect to H2O MLOps.
To connect to H2O MLOps from outside the H2O Cloud, run the following code:
import time
import h2o_mlops
import h2o_mlops.options
mlops = h2o_mlops.Client(
    h2o_cloud_url="<h2o_cloud_url>",
    refresh_token="<refresh_token>",
)

- h2o_cloud_url: This is the same URL used to access the H2O Cloud homepage.
- refresh_token: For information on how to retrieve your refresh token (also known as a platform token), see API authentication.
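To verify that the client has connected, you can read a lightweight property such as mlops.allowed_tolerations (used again later in this walkthrough):

print(mlops.allowed_tolerations)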
-
Specify the Hugging Face model.
Define the Hugging Face model along with the other experiment parameters.
vllm_name = "my-vllm"
non_vllm_name = "my-non-vllm"
hub_model_id = "TheBloke/Llama-2-7B-Chat-AWQ" # Hugging Face model repo id
passphrase = "passphrase"vllm_name
andnon_vllm_name
: Defines the name for the vLLM (Virtual Large Language Model) experiment, and the non-vLLM experiment.hub_model_id
: Specifies the Hugging Face repo ID of a model to be deployed.passphrase
: Defines the security passphrase used for authentication the deployment.
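Hardcoding a passphrase is fine for a walkthrough, but in practice you may prefer to read it from the environment. A minimal sketch using only the standard library (the MLOPS_PASSPHRASE variable name is just an example):

import os

# Fall back to the demo value when the variable is unset (hypothetical variable name).
passphrase = os.environ.get("MLOPS_PASSPHRASE", "passphrase")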
-
Create a project.
Create an H2O MLOps project named LLMOps Demo.

project = mlops.projects.create(name="LLMOps Demo")
-
Create experiments.
Create two experiments to compare vLLM and non-vLLM types.
vllm_experiment = project.experiments.create_vllm(
    hub_model_id=hub_model_id,
    name=vllm_name,
)
non_vllm_experiment = project.experiments.create(
    data="../tests/e2e_bdd/features/data/mojo.zip",  # path to a Driverless AI MOJO pipeline artifact
    name=non_vllm_name,
)
-
List created experiments.
To list all experiments in the project, run the following code:
project.experiments.list()
Output:
| name | uid | tags
--+-------------+--------------------------------------+--------
0 | my-non-vllm | 5c247e7f-b594-4d78-969e-1fc2ddf68838 |
1 | my-vllm | 70e34587-e73f-4db4-b1bd-f6062a537ceb |

To list only vLLM experiments, run:
project.experiments.list(filter_vllm=True)
Output:
| name | uid | tags
--+---------+--------------------------------------+--------
0 | my-vllm | 70e34587-e73f-4db4-b1bd-f6062a537ceb |
-
Check the vLLM configuration for experiments.
Inspect the configuration for vllm_experiment.

print(vllm_experiment.vllm_config)
Output:
{'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
Inspect the configuration for non_vllm_experiment.

print(non_vllm_experiment.vllm_config)
Output:
None
vllm_config for non_vllm_experiment is None because this is not a vLLM-based experiment.
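Because vllm_config is None for non-vLLM experiments, it can double as a type check. A small sketch under that assumption (and assuming experiments expose a name attribute alongside vllm_config):

for experiment in (vllm_experiment, non_vllm_experiment):
    kind = "vLLM" if experiment.vllm_config is not None else "non-vLLM"
    print(f"{experiment.name}: {kind}")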
-
Create and register models.
Create two models and register them with the experiments.
vllm_model = project.models.create(name=vllm_name)
non_vllm_model = project.models.create(name=non_vllm_name)
vllm_model.register(vllm_experiment)
non_vllm_model.register(non_vllm_experiment)
-
Retrieve the list of tolerations.
The following command retrieves the list of tolerations allowed in the H2O MLOps environment:
mlops.allowed_tolerations
Output:
['gpu', 'spot-cpu']
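Since deploying a vLLM model below requires a GPU-capable toleration, it can be worth asserting up front that one is available. A small sketch (the "gpu" name is taken from the output above and may differ per environment):

toleration = "gpu"
assert toleration in mlops.allowed_tolerations, (
    f"{toleration!r} is not an allowed toleration in this environment"
)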
-
Deploy created models.
This step deploys both the vLLM model and the non-vLLM model to H2O MLOps. When deploying the vLLM model, ensure that one of the GPU-based, MLOps-allowed tolerations is specified.
vllm_deployment = project.deployments.deploy_vllm(
    name=vllm_name,
    model=vllm_model,
    security_options=h2o_mlops.options.SecurityOptions(
        passphrase=passphrase,
    ),
    kubernetes_options=h2o_mlops.options.KubernetesOptions(
        limits={"nvidia.com/gpu": "1"},
        toleration="gpu",  # specify one of the GPU-based, MLOps-allowed tolerations
    ),
)
non_vllm_deployment = project.deployments.create_single(
    name=non_vllm_name,
    model=non_vllm_model,
    scoring_runtime=mlops.runtimes.scoring.list(
        artifact_type="dai_mojo_pipeline",
        uid="dai_mojo_runtime",
    )[0],
    security_options=h2o_mlops.options.SecurityOptions(
        passphrase=passphrase,
    ),
)

Note: Attempting to use deploy_vllm with non-vLLM models will raise a ValueError. Ensure that only LLM-based models are deployed using deploy_vllm.
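To see this guard in action, you can wrap the call in a try/except. A sketch based on the behavior described in the note (the exact message, and whether additional options are required before validation, may vary):

try:
    project.deployments.deploy_vllm(
        name="should-fail",
        model=non_vllm_model,
        security_options=h2o_mlops.options.SecurityOptions(passphrase=passphrase),
    )
except ValueError as error:
    print(f"Rejected as expected: {error}")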
-
List deployments.
To list all deployed models, run the following code:
project.deployments.list()
Output:
| name | mode | uid
----+-------------+--------------+--------------------------------------
0 | my-vllm | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
1 | my-non-vllm | Single Model | b80c3041-19f0-464e-9705-ce4a81f7090c

To list only vLLM deployments, run:
project.deployments.list(filter_vllm=True)
Output:
| name | mode | uid
----+---------+--------------+--------------------------------------
0 | my-vllm | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
-
Print deployment information.
Let's compare the important details between the vLLM and non-vLLM deployments, such as the following:
- Deployment details
- Scorer details
- Resource allocation details
new_line_char = "\n"

def print_deployment_info(deployment, deployment_type):
    # Wait until the deployment reports healthy; raise_for_failure() surfaces
    # any deployment error instead of looping forever.
    while not deployment.is_healthy():
        deployment.raise_for_failure()
        time.sleep(5)
    print(f"{'=' * 30}{deployment_type} Deployment{'=' * 30}")
    print(f"Deployment Name: {deployment.name}")
    print(f"Deployment UID: {deployment.uid}")
    print(f"Deployment Status: {deployment.status()}")
    print(f"Scorer API Key: {deployment.security_options.passphrase}")
    print(f"Scorer API Base URL: {deployment.scorer_api_base_url}")
    print(f"OpenAI Base URL: {deployment.openai_base_url}")
    print(f"Configuration: {deployment.experiments[0].vllm_config}")
    print(f"Resources: [{str(deployment.kubernetes_options).replace(new_line_char, ', ')}]")
    print()

To print the deployment information for the vLLM deployment, run:
print_deployment_info(vllm_deployment, "vLLM")
Output:
==============================vLLM Deployment==============================
Deployment Name: my-vllm
Deployment UID: 6ea17ba2-ae79-4852-a295-3aea553c1495
Deployment Status: HEALTHY
Scorer API Key: passphrase
Scorer API Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495
OpenAI Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495/v1
Configuration: {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
Resources: [replicas: 1, requests: {}, limits: {'nvidia.com/gpu': '1'}, affinity: , toleration: gpu]

To print the deployment information for the non-vLLM deployment, run:
print_deployment_info(non_vllm_deployment, "Non-vLLM")
Output:
==============================Non-vLLM Deployment==============================
Deployment Name: my-non-vllm
Deployment UID: b80c3041-19f0-464e-9705-ce4a81f7090c
Deployment Status: HEALTHY
Scorer API Key: passphrase
Scorer API Base URL: https://model.cloud-dev.h2o.dev/b80c3041-19f0-464e-9705-ce4a81f7090c
OpenAI Base URL: None
Configuration: None
Resources: [replicas: 1, requests: {}, limits: {}, affinity: , toleration: ]
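Once the vLLM deployment is healthy, its OpenAI Base URL works with any OpenAI-compatible client, using the passphrase as the API key. A minimal sketch with the openai Python package, under the assumption that the deployment exposes the standard chat completions route and accepts the Hugging Face repo ID as the model name:

from openai import OpenAI

client = OpenAI(
    base_url=vllm_deployment.openai_base_url,  # e.g. https://.../<deployment-uid>/v1
    api_key=passphrase,  # the scorer API key shown above
)
response = client.chat.completions.create(
    model=hub_model_id,  # assumption: the model name matches the Hugging Face repo ID
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)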