
vLLM

This page describes how to deploy a Hugging Face Large Language Model (LLM) on H2O MLOps.

note

This page shows how to use a Model Hub repository ID to create and deploy an experiment in H2O MLOps. If you already have a vLLM Config artifact, you can follow the standard process outlined in our Quickstart guide.

  1. Connect to H2O MLOps.

    The first step is to connect to H2O MLOps. To connect from outside the H2O Cloud, run the following code:

    import time
    import h2o_mlops
    import h2o_mlops.options

    mlops = h2o_mlops.Client(
        h2o_cloud_url="<h2o_cloud_url>",
        refresh_token="<refresh_token>",
    )
    • h2o_cloud_url: This is the same URL used to access the H2O Cloud homepage.
    • refresh_token: For information on how to retrieve your refresh token (also known as a platform token), see API authentication.
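    To verify that the client can reach H2O MLOps, you can issue a simple read call once the client is created. The snippet below is a minimal sketch that assumes the projects collection exposes a list() method, analogous to the experiments.list() and deployments.list() calls used later in this guide.

    # Hedged connectivity check: assumes mlops.projects.list() is available
    # in your version of the h2o_mlops client.
    print(mlops.projects.list())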
  2. Specify the Hugging Face model.

    The second step is to define the Hugging Face model along with other experiment parameters.

    vllm_name = "my-vllm"
    non_vllm_name = "my-non-vllm"
    hub_model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # Hugging Face model repo ID
    passphrase = "passphrase"
    • vllm_name and non_vllm_name: Define the names for the vLLM experiment and the non-vLLM experiment, respectively.
    • hub_model_id: Specifies the Hugging Face repository ID of the model to deploy.
    • passphrase: Defines the security passphrase used to authenticate requests to the deployment; see the sketch after this list for reading it from the environment.
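    Hard-coding the passphrase is fine for a demo, but in practice you would typically read it from the environment instead. The following is a minimal sketch using only the Python standard library; the variable name H2O_MLOPS_PASSPHRASE is illustrative, not an H2O MLOps convention.

    import os

    # Read the passphrase from an environment variable, falling back to the demo value.
    # H2O_MLOPS_PASSPHRASE is a hypothetical variable name chosen for this example.
    passphrase = os.environ.get("H2O_MLOPS_PASSPHRASE", "passphrase")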
  3. Create a project.

    Create an H2O MLOps project named LLMOps Demo.

    project = mlops.projects.create(name="LLMOps Demo")
  4. Create experiments.

    Create two experiments to compare vLLM and non-vLLM types.

    vllm_experiment = project.experiments.create_vllm(
        hub_model_id=hub_model_id,
        name=vllm_name,
    )
    non_vllm_experiment = project.experiments.create(
        data="../tests/e2e_bdd/features/data/mojo.zip",  # path to a Driverless AI MOJO artifact
        name=non_vllm_name,
    )
  5. List created experiments.

    To list all experiments in the project, run the following code:

    project.experiments.list()

    Output:

       | name        | uid                                  | tags
    ---+-------------+--------------------------------------+------
     0 | my-non-vllm | 5c247e7f-b594-4d78-969e-1fc2ddf68838 |
     1 | my-vllm     | 70e34587-e73f-4db4-b1bd-f6062a537ceb |

    To list only vLLM experiments, run:

    project.experiments.list(filter_vllm=True)

    Output:

       | name    | uid                                  | tags
    ---+---------+--------------------------------------+------
     0 | my-vllm | 70e34587-e73f-4db4-b1bd-f6062a537ceb |
  6. Check the vLLM configuration for experiments.

    Inspect the configuration for vllm_experiment.

    print(vllm_experiment.vllm_config)

    Output:

    {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}

    Inspect the configuration for non_vllm_experiment.

    print(non_vllm_experiment.vllm_config)

    Output:

    None

    vllm_config for non_vllm_experiment is None because this is not a vLLM-based experiment.

  7. Create and register models.

    Create two models and register them with their corresponding experiments.

    vllm_model = project.models.create(name=vllm_name)
    non_vllm_model = project.models.create(name=non_vllm_name)

    vllm_model.register(vllm_experiment)
    non_vllm_model.register(non_vllm_experiment)
  8. Retrieve the list of tolerations.

    The following command retrieves the list of tolerations allowed in the H2O MLOps environment:

    mlops.allowed_tolerations

    Output:

    ['gpu', 'spot-cpu']
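    Before deploying the vLLM model in the next step, you can confirm that a GPU-capable toleration is actually available in your environment. A minimal sketch, assuming the toleration is named "gpu" as in the output above:

    # Fail fast if no GPU toleration is available for the vLLM deployment.
    assert "gpu" in mlops.allowed_tolerations, "No 'gpu' toleration available in this environment"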
  9. Deploy created models.

    This step deploys both the vLLM model and the non-vLLM model to H2O MLOps. When deploying the vLLM model, specify one of the GPU-based tolerations allowed in the H2O MLOps environment.

    vllm_deployment = project.deployments.deploy_vllm(
        name=vllm_name,
        model=vllm_model,
        security_options=h2o_mlops.options.SecurityOptions(
            passphrase=passphrase,
        ),
        kubernetes_options=h2o_mlops.options.KubernetesOptions(
            limits={"nvidia.com/gpu": "1"},
            toleration="gpu",  # specify one of the GPU-based tolerations allowed by MLOps
        ),
    )

    non_vllm_deployment = project.deployments.create_single(
        name=non_vllm_name,
        model=non_vllm_model,
        scoring_runtime=mlops.runtimes.scoring.list(
            artifact_type="dai_mojo_pipeline",
            uid="dai_mojo_runtime",
        )[0],
        security_options=h2o_mlops.options.SecurityOptions(
            passphrase=passphrase,
        ),
    )
    note

    Attempting to use deploy_vllm with non-vLLM models will raise a ValueError. Ensure that only LLM-based models are deployed using deploy_vllm.
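    One way to guard against this, sketched below, is to check the experiment's vllm_config (which is None for non-vLLM experiments, as shown in step 6) before calling deploy_vllm.

    # Hedged sketch: only use deploy_vllm for experiments that carry a vLLM config.
    if non_vllm_experiment.vllm_config is None:
        print("my-non-vllm is not a vLLM experiment; deploy it with create_single instead")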

  10. List deployments.

    To list all deployed models, run the following code:

    project.deployments.list()

    Output:

       | name        | mode         | uid
    ---+-------------+--------------+--------------------------------------
     0 | my-vllm     | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
     1 | my-non-vllm | Single Model | b80c3041-19f0-464e-9705-ce4a81f7090c

    To list only vLLM deployments, run:

    project.deployments.list(filter_vllm=True)

    Output:

       | name    | mode         | uid
    ---+---------+--------------+--------------------------------------
     0 | my-vllm | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
  11. Print deployment information.

    Let's compare the following deployment details between the vLLM and non-vLLM deployments.

    • Deployment details

    • Scorer details

    • Resource allocation details

    new_line_char = "\n"

    def print_deployment_info(deployment, deployment_type):
        # Wait until the deployment reports healthy, surfacing any failure immediately.
        while not deployment.is_healthy():
            deployment.raise_for_failure()
            time.sleep(5)
        print(f"{'=' * 30}{deployment_type} Deployment{'=' * 30}")
        print(f"Deployment Name: {deployment.name}")
        print(f"Deployment UID: {deployment.uid}")
        print(f"Deployment Status: {deployment.status()}")
        print(f"Scorer API Key: {deployment.security_options.passphrase}")
        print(f"Scorer API Base URL: {deployment.scorer_api_base_url}")
        print(f"OpenAI Base URL: {deployment.openai_base_url}")
        print(f"Configuration: {deployment.experiments[0].vllm_config}")
        print(f"Resources: [{str(deployment.kubernetes_options).replace(new_line_char, ', ')}]")
        print()

    To print the deployment information for the vLLM deployment, run:

    print_deployment_info(vllm_deployment, "vLLM")

    Output:

    ==============================vLLM Deployment==============================
    Deployment Name: my-vllm
    Deployment UID: 6ea17ba2-ae79-4852-a295-3aea553c1495
    Deployment Status: HEALTHY
    Scorer API Key: passphrase
    Scorer API Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495
    OpenAI Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495/v1
    Configuration: {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
    Resources: [replicas: 1, requests: {}, limits: {'nvidia.com/gpu': '1'}, affinity: , toleration: gpu]

    To print the deployment information for the non-vLLM deployment, run:

    print_deployment_info(non_vllm_deployment, "Non-vLLM")

    Output:

    ==============================Non-vLLM Deployment==============================
    Deployment Name: my-non-vllm
    Deployment UID: b80c3041-19f0-464e-9705-ce4a81f7090c
    Deployment Status: HEALTHY
    Scorer API Key: passphrase
    Scorer API Base URL: https://model.cloud-dev.h2o.dev/b80c3041-19f0-464e-9705-ce4a81f7090c
    OpenAI Base URL: None
    Configuration: None
    Resources: [replicas: 1, requests: {}, limits: {}, affinity: , toleration: ]
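
    With both deployments healthy, you can send a request to the vLLM deployment. It exposes an OpenAI-compatible endpoint at the OpenAI Base URL shown above, with the deployment passphrase acting as the API key. The following is a minimal sketch using the openai Python package; it assumes the served model is addressed by its Hugging Face repo ID, which may differ depending on your deployment's vLLM config.

    from openai import OpenAI

    # Point the OpenAI client at the deployment's OpenAI-compatible endpoint;
    # the passphrase configured earlier is used as the API key.
    client = OpenAI(
        base_url=vllm_deployment.openai_base_url,
        api_key=passphrase,
    )

    # Assumption: the model is addressed by its Hugging Face repo ID.
    response = client.chat.completions.create(
        model=hub_model_id,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)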
