
LLMOps

Introduction

Large Language Model Operations (LLMOps) encompasses the specialized practices, techniques, and tools required to effectively manage, deploy, and maintain Large Language Model (LLM) applications. LLMOps ensures the efficient integration of LLMs into existing workflows while addressing their unique challenges and operational requirements.

Benefits of LLMOps

  • LLMOps facilitates the seamless integration of LLMs into the organization, aligning them with existing processes.
  • LLMOps ensures a smooth transition across different lifecycle phases, from ideation and development to deployment.
  • LLMOps provides efficient, scalable and risk-controlled management of LLM applications, enabling organizations to maximize benefits while minimizing risks.

MLOps vs. LLMOps

MLOps focuses on managing the operational aspects of traditional machine learning models, whereas LLMOps specializes in addressing the distinct challenges associated with LLMs.

Key differences between MLOps and LLMOps

Feature                 | LLMOps                                           | MLOps
------------------------+--------------------------------------------------+----------------------------------------------------------------
Model scale             | Handles large-scale models                       | Manages smaller models
Data focus              | Primarily processes text data                    | Works with diverse data types (structured, image, audio, etc.)
Pre-trained models      | Often leverages pre-trained models               | Typically does not rely on pre-trained models
Optimization techniques | Uses prompt engineering and fine-tuning          | Employs feature engineering and model selection
Generalization          | Supports broad, multi-domain applications        | Designed for specific, task-oriented models
Predictability          | Can be unpredictable and prone to hallucinations | More predictable in generating outputs
Output format           | Generates text-based responses                   | Produces task-specific outputs such as labels or probabilities

LLMOps on H2O MLOps

Using H2O MLOps, users can take any pre-trained LLM from the Hugging Face Hub, deploy it, and obtain an OpenAI-compatible API endpoint for easy integration into applications.

Deployment

The specified model is automatically downloaded from the Hugging Face Hub and deployed on H2O MLOps. Follow the steps below to get started:

  1. Install h2o_mlops

    To install h2o_mlops, refer to the Python Client installation guide.

    Note: You must install h2o_mlops version 1.3.0 or later.

  2. Connect to H2O MLOps.

    To connect to H2O MLOps from outside the H2O Cloud, use the following code:

    import time
    import h2o_mlops
    import h2o_mlops.options

    mlops = h2o_mlops.Client(
        h2o_cloud_url="<h2o_cloud_url>",
        refresh_token="<refresh_token>",
    )
    • h2o_cloud_url: This is the same URL used to access the H2O Cloud homepage.
    • refresh_token: For information on how to retrieve your refresh token (also known as a platform token), see API authentication.
  3. Specify the Hugging Face model.

    The second step is to define the Hugging Face model along with other experiment parameters.

    vllm_name = "my-vllm"
    hub_model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # Hugging Face model repo ID
    passphrase = "passphrase"
    • vllm_name: Defines the name of the vLLM experiment.
    • hub_model_id: Specifies the Hugging Face repository ID of the model to be deployed.
    • passphrase: Defines the security passphrase used to authenticate requests to the deployment.
  4. Create a project.

    Create an H2O MLOps project named LLMOps Demo.

    project = mlops.projects.create(name="LLMOps Demo")
  5. Create a vLLM experiment.

    vllm_experiment = project.experiments.create_vllm(
        hub_model_id=hub_model_id,
        name=vllm_name,
    )
  6. List created vLLM experiments.

    To list all the vLLM experiments in the project, use the following code:

    project.experiments.list(filter_vllm=True)

    Output:

     | name        | uid                                  | tags
    --+-------------+--------------------------------------+--------
    0 | my-vllm | 70e34587-e73f-4db4-b1bd-f6062a537ceb |
  7. Check the vLLM configuration for the experiment.

    Inspect the configuration for vllm_experiment.

    print(vllm_experiment.vllm_config)

    Output:

    {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
  8. Create and register the model.

    Create a model and register it with the experiment.

    vllm_model = project.models.create(name=vllm_name)

    vllm_model.register(vllm_experiment)
  9. Retrieve the list of tolerations.

    The following command retrieves the list of tolerations allowed in the H2O MLOps environment:

    mlops.allowed_tolerations

    Output:

    ['gpu', 'spot-cpu']
  10. Deploy the vLLM model.

    This step deploys the vLLM model to H2O MLOps. During deployment, make sure to specify one of the GPU-based tolerations allowed by MLOps.

    vllm_deployment = project.deployments.deploy_vllm(
        name=vllm_name,
        model=vllm_model,
        security_options=h2o_mlops.options.SecurityOptions(
            passphrase=passphrase,
        ),
        kubernetes_options=h2o_mlops.options.KubernetesOptions(
            limits={"nvidia.com/gpu": "1"},
            toleration="gpu",  # specify one of the GPU-based tolerations allowed by MLOps
        ),
    )
  11. List vLLM deployments.

    To list all the deployed vLLM models, use the following code:

    project.deployments.list(filter_vllm=True)

    Output:

       | name        | mode         | uid
    ----+-------------+--------------+--------------------------------------
    0 | my-vllm | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
  12. Print deployment information.

    Once the deployment is healthy, print important details such as:

    • Deployment details

    • Scorer details

    • Resource allocation details

    new_line_char = "\n"

    def print_deployment_info(deployment, deployment_type):
        # Wait for the deployment to become healthy, raising an error if it fails.
        while not deployment.is_healthy():
            deployment.raise_for_failure()
            time.sleep(5)
        print(f"{'=' * 30}{deployment_type} Deployment{'=' * 30}")
        print(f"Deployment Name: {deployment.name}")
        print(f"Deployment UID: {deployment.uid}")
        print(f"Deployment Status: {deployment.status()}")
        print(f"Scorer API Key: {deployment.security_options.passphrase}")
        print(f"Scorer API Base URL: {deployment.scorer_api_base_url}")
        print(f"OpenAI Base URL: {deployment.openai_base_url}")
        print(f"Configuration: {deployment.experiments[0].vllm_config}")
        print(f"Resources: [{str(deployment.kubernetes_options).replace(new_line_char, ', ')}]")
        print()

    To print the deployment information for the vLLM deployment, run:

    print_deployment_info(vllm_deployment, "vLLM")

    Output:

    ==============================vLLM Deployment==============================
    Deployment Name: my-vllm
    Deployment UID: 6ea17ba2-ae79-4852-a295-3aea553c1495
    Deployment Status: HEALTHY
    Scorer API Key: passphrase
    Scorer API Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495
    OpenAI Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495/v1
    Configuration: {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
    Resources: [replicas: 1, requests: {}, limits: {'nvidia.com/gpu': '1'}, affinity: , toleration: gpu]
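
The OpenAI Base URL printed above can be called with any OpenAI-compatible client, or directly over HTTP. The following is a minimal sketch using the requests library; it assumes the endpoint accepts the deployment passphrase as a bearer token, mirroring how the openai client in the Usage section below authenticates.

    import requests

    # Minimal sketch: call the OpenAI-compatible chat completions endpoint directly.
    # Assumes the passphrase is accepted as a bearer token, the same way the
    # openai client in the Usage section sends it.
    response = requests.post(
        f"{vllm_deployment.openai_base_url}/chat/completions",
        headers={"Authorization": f"Bearer {vllm_deployment.security_options.passphrase}"},
        json={
            "model": vllm_name,
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    print(response.json()["choices"][0]["message"]["content"])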

Usage

Once deployed, the model is exposed through a basic OpenAI-compatible API endpoint. Follow the steps below to get started:

  1. Install openai

    To install the openai package, run the following command:

    pip install openai
  2. Initialize OpenAI

    Initialize the OpenAI client with the deployment's API credentials to interact with the model.

    from openai import OpenAI

    openai = OpenAI(
        api_key=vllm_deployment.security_options.passphrase,
        base_url=vllm_deployment.openai_base_url,
    )
  3. Retrieve the model

    List the available models and select the deployed one.

    model = openai.models.list().data[0].id
    model

    Output:

    'my-vllm'
  4. Start a chat session

    Implement a simple interactive chat session with the deployed model.

    print("Chat session started. Type '-1' to exit.\n")
    messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    ]

    while True:
    user_input = input("User: ")

    if user_input == "-1":
    print("Exiting chat session.")
    break

    print(f"User: {user_input}\n")
    messages.append({"role": "user", "content": user_input})

    stream = openai.chat.completions.create(
    model=model,
    messages=messages[-3:],
    max_tokens=4000,
    stream=True,
    temperature=0,
    )

    response_content = ""
    print("Assistant: ", end="")
    for chunk in stream:
    response_content += chunk.choices[0].delta.content or ""
    print(chunk.choices[0].delta.content or "", end="")
    print("\n")

    messages.append({"role": "assistant", "content": response_content})

    Output:

    Chat session started. Type '-1' to exit.

    User: Hi

    Assistant: Hello! It's nice to meet you! I'm here to help with any questions or tasks you may have. How can I assist you today? Do you have a specific question or topic you'd like to discuss?

    User: Who're you?

    Assistant: Hello! I'm just an AI assistant trained by Meta AI to help with a variety of tasks, such as answering questions, providing information, and completing tasks. I'm here to help you with any questions or tasks you may have, so feel free to ask me anything!

    User: What's 1 + 1?

    Assistant: Sure! The answer to 1 + 1 is 2.

    Exiting chat session.
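
Beyond the interactive loop above, the same client supports plain single-turn requests. The following minimal, non-streaming sketch reuses the openai client and the model retrieved in the earlier steps; the prompt text is only illustrative.

    # A minimal non-streaming request, assuming `openai` and `model` from the steps above.
    completion = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize what LLMOps is in one sentence."},
        ],
        max_tokens=200,
        temperature=0,
    )
    print(completion.choices[0].message.content)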
