LLMOps
Introduction
Large Language Model Operations (LLMOps) encompasses the specialized practices, techniques, and tools required to effectively manage, deploy, and maintain Large Language Model (LLM) applications. LLMOps ensures the efficient integration of LLMs into existing workflows while addressing their unique challenges and operational requirements.
Benefits of LLMOps
- LLMOps facilitates the seamless integration of LLMs into an organization, aligning them with existing processes.
- LLMOps ensures a smooth transition across different lifecycle phases, from ideation and development to deployment.
- LLMOps provides efficient, scalable, and risk-controlled management of LLM applications, enabling organizations to maximize benefits while minimizing risk.
MLOps vs. LLMOps
MLOps focuses on managing the operational aspects of traditional machine learning models, whereas LLMOps specializes in addressing the distinct challenges associated with LLMs.
Key differences between MLOps and LLMOps
| Feature | LLMOps | MLOps |
|---|---|---|
| Model scale | Handles large-scale models | Manages smaller models |
| Data focus | Primarily processes text data | Works with diverse data types (structured, image, audio, etc.) |
| Pre-trained models | Often leverages pre-trained models | Typically does not rely on pre-trained models |
| Optimization techniques | Uses prompt engineering and fine-tuning | Employs feature engineering and model selection |
| Generalization | Supports broad, multi-domain applications | Designed for specific, task-oriented models |
| Predictability | Can be unpredictable and prone to hallucinations | More predictable in generating outputs |
| Output format | Generates text-based responses | Produces task-specific outputs such as labels or probabilities |
LLMOps on H2O MLOps
Using H2O MLOps, users can take any pre-trained LLM from Hugging Face Hub, deploy it seamlessly on H2O MLOps, and obtain an OpenAI-compatible API endpoint for easy integration into applications.
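As a quick preview, the condensed sketch below strings together the calls that the Deployment and Usage sections walk through step by step. The URL and token are placeholders, and the model repo ID, names, and passphrase are illustrative values reused from those sections.

```python
import h2o_mlops
import h2o_mlops.options
from openai import OpenAI

# Connect to H2O MLOps (see the Deployment section for details on the URL and token).
mlops = h2o_mlops.Client(
    h2o_cloud_url="<h2o_cloud_url>",
    refresh_token="<refresh_token>",
)

# Create a project, register a Hugging Face model as a vLLM experiment, and deploy it.
project = mlops.projects.create(name="LLMOps Demo")
experiment = project.experiments.create_vllm(
    hub_model_id="TheBloke/Llama-2-7B-Chat-AWQ", name="my-vllm"
)
model = project.models.create(name="my-vllm")
model.register(experiment)
deployment = project.deployments.deploy_vllm(
    name="my-vllm",
    model=model,
    security_options=h2o_mlops.options.SecurityOptions(passphrase="passphrase"),
    kubernetes_options=h2o_mlops.options.KubernetesOptions(
        limits={"nvidia.com/gpu": "1"}, toleration="gpu"
    ),
)

# Once the deployment is healthy, it exposes an OpenAI-compatible endpoint.
client = OpenAI(api_key="passphrase", base_url=deployment.openai_base_url)
```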
Deployment
The specified model is automatically downloaded from the Hugging Face Hub and deployed on H2O MLOps. Follow the steps below to get started:
- Install `h2o_mlops`.

  To install `h2o_mlops`, refer to the Python Client installation guide.

  Note: You must install `h2o_mlops` version 1.3.0 or later.
- Connect to H2O MLOps.

  To connect to H2O MLOps from outside the H2O Cloud, use the following code:

  ```python
  import time

  import h2o_mlops
  import h2o_mlops.options

  mlops = h2o_mlops.Client(
      h2o_cloud_url="<h2o_cloud_url>",
      refresh_token="<refresh_token>",
  )
  ```

  - `h2o_cloud_url`: The same URL used to access the H2O Cloud homepage.
  - `refresh_token`: For information on how to retrieve your refresh token (also known as a platform token), see API authentication.
- Specify the Hugging Face model.

  Define the Hugging Face model along with the other experiment parameters.

  ```python
  vllm_name = "my-vllm"
  hub_model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # Hugging Face model repo ID
  passphrase = "passphrase"
  ```

  - `vllm_name`: Defines the name for the vLLM (Virtual Large Language Model) experiment.
  - `hub_model_id`: Specifies the Hugging Face repo ID of the model to be deployed.
  - `passphrase`: Defines the security passphrase used to authenticate requests to the deployment.
- Create a project.

  Create an H2O MLOps project named `LLMOps Demo`.

  ```python
  project = mlops.projects.create(name="LLMOps Demo")
  ```
- Create a vLLM experiment.

  ```python
  vllm_experiment = project.experiments.create_vllm(
      hub_model_id=hub_model_id,
      name=vllm_name,
  )
  ```
- List created vLLM experiments.

  To list all the vLLM experiments in the project, use the following code:

  ```python
  project.experiments.list(filter_vllm=True)
  ```

  Output:

  ```
      | name    | uid                                  | tags
  ----+---------+--------------------------------------+--------
    0 | my-vllm | 70e34587-e73f-4db4-b1bd-f6062a537ceb |
  ```
- Check the vLLM configuration for the experiment.

  Inspect the configuration of `vllm_experiment`.

  ```python
  print(vllm_experiment.vllm_config)
  ```

  Output:

  ```
  {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
  ```
- Create and register the model.

  Create a model and register it with the experiment.

  ```python
  vllm_model = project.models.create(name=vllm_name)
  vllm_model.register(vllm_experiment)
  ```
- Retrieve the list of tolerations.

  The following command retrieves the list of tolerations allowed in the H2O MLOps environment:

  ```python
  mlops.allowed_tolerations
  ```

  Output:

  ```
  ['gpu', 'spot-cpu']
  ```
- Deploy the created vLLM model.

  This step deploys the vLLM model to H2O MLOps. During deployment, ensure that one of the GPU-based tolerations allowed by MLOps is specified; a small pre-deployment check of the toleration is sketched after these steps.

  ```python
  vllm_deployment = project.deployments.deploy_vllm(
      name=vllm_name,
      model=vllm_model,
      security_options=h2o_mlops.options.SecurityOptions(
          passphrase=passphrase,
      ),
      kubernetes_options=h2o_mlops.options.KubernetesOptions(
          limits={"nvidia.com/gpu": "1"},
          toleration="gpu",  # specify one of the GPU-based, MLOps-allowed tolerations
      ),
  )
  ```
- List vLLM deployments.

  To list all the deployed vLLM models, use the following code:

  ```python
  project.deployments.list(filter_vllm=True)
  ```

  Output:

  ```
      | name    | mode         | uid
  ----+---------+--------------+--------------------------------------
    0 | my-vllm | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
  ```
- Print deployment information.

  Once the deployment is healthy, print important details, such as:

  - Deployment details
  - Scorer details
  - Resource allocation details

  A quick check of the OpenAI-compatible endpoint itself is sketched after these steps.

  ```python
  new_line_char = "\n"

  def print_deployment_info(deployment, deployment_type):
      # Wait until the deployment is healthy, surfacing any failure along the way.
      while not deployment.is_healthy():
          deployment.raise_for_failure()
          time.sleep(5)
      print(f"{'=' * 30}{deployment_type} Deployment{'=' * 30}")
      print(f"Deployment Name: {deployment.name}")
      print(f"Deployment UID: {deployment.uid}")
      print(f"Deployment Status: {deployment.status()}")
      print(f"Scorer API Key: {deployment.security_options.passphrase}")
      print(f"Scorer API Base URL: {deployment.scorer_api_base_url}")
      print(f"OpenAI Base URL: {deployment.openai_base_url}")
      print(f"Configuration: {deployment.experiments[0].vllm_config}")
      print(f"Resources: [{str(deployment.kubernetes_options).replace(new_line_char, ', ')}]")
      print()
  ```

  To print the deployment information for the vLLM deployment, run:

  ```python
  print_deployment_info(vllm_deployment, "vLLM")
  ```

  Output:

  ```
  ==============================vLLM Deployment==============================
  Deployment Name: my-vllm
  Deployment UID: 6ea17ba2-ae79-4852-a295-3aea553c1495
  Deployment Status: HEALTHY
  Scorer API Key: passphrase
  Scorer API Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495
  OpenAI Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495/v1
  Configuration: {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
  Resources: [replicas: 1, requests: {}, limits: {'nvidia.com/gpu': '1'}, affinity: , toleration: gpu]
  ```
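As noted in the deployment step above, the toleration passed to `KubernetesOptions` must be one of the tolerations allowed in your environment. A minimal pre-deployment check, sketched here using only the `mlops.allowed_tolerations` list shown earlier (the `gpu` name is illustrative and can differ between environments):

```python
# Validate the chosen toleration against the environment before deploying.
toleration = "gpu"  # illustrative; pick a GPU-based entry from allowed_tolerations
if toleration not in mlops.allowed_tolerations:
    raise ValueError(
        f"Toleration '{toleration}' is not allowed; choose one of {mlops.allowed_tolerations}"
    )
```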
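Because the deployment exposes an OpenAI-compatible API, you can also sanity-check the endpoint with a plain HTTP request before wiring it into an application. This is only a sketch, not part of the H2O MLOps client: it assumes the endpoint implements the standard OpenAI `/models` route with bearer-token authentication (as used in the Usage section below) and that the `requests` package is installed.

```python
import requests

# List the served models over plain HTTP; the passphrase acts as the bearer token.
response = requests.get(
    f"{vllm_deployment.openai_base_url}/models",
    headers={"Authorization": f"Bearer {vllm_deployment.security_options.passphrase}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # should include the deployed model, e.g. 'my-vllm'
```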
Usage
Once deployed, the model is exposed through a basic OpenAI-compatible API endpoint. Follow the steps below to get started:
- Install `openai`.

  To install the `openai` package, run the following command:

  ```
  pip install openai
  ```
- Initialize the OpenAI client.

  Use the OpenAI client with the deployment's API credentials to interact with the model.

  ```python
  from openai import OpenAI

  openai = OpenAI(
      api_key=vllm_deployment.security_options.passphrase,
      base_url=vllm_deployment.openai_base_url,
  )
  ```
- Retrieve the model.

  List the available models and select the deployed one.

  ```python
  model = openai.models.list().data[0].id
  model
  ```

  Output:

  ```
  'my-vllm'
  ```
- Start a chat session.

  Implement a simple interactive chat session with the deployed model. A simpler, non-streaming request is sketched after these steps.

  ```python
  print("Chat session started. Type '-1' to exit.\n")

  messages = [
      {"role": "system", "content": "You are a helpful assistant."},
  ]

  while True:
      user_input = input("User: ")
      if user_input == "-1":
          print("Exiting chat session.")
          break
      print(f"User: {user_input}\n")
      messages.append({"role": "user", "content": user_input})
      stream = openai.chat.completions.create(
          model=model,
          messages=messages[-3:],
          max_tokens=4000,
          stream=True,
          temperature=0,
      )
      response_content = ""
      print("Assistant: ", end="")
      for chunk in stream:
          response_content += chunk.choices[0].delta.content or ""
          print(chunk.choices[0].delta.content or "", end="")
      print("\n")
      messages.append({"role": "assistant", "content": response_content})
  ```

  Output:

  ```
  Chat session started. Type '-1' to exit.

  User: Hi

  Assistant: Hello! It's nice to meet you! I'm here to help with any questions or tasks you may have. How can I assist you today? Do you have a specific question or topic you'd like to discuss?

  User: Who're you?

  Assistant: Hello! I'm just an AI assistant trained by Meta AI to help with a variety of tasks, such as answering questions, providing information, and completing tasks. I'm here to help you with any questions or tasks you may have, so feel free to ask me anything!

  User: What's 1 + 1?

  Assistant: Sure! The answer to 1 + 1 is 2.

  Exiting chat session.
  ```
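For programmatic use where streaming is not needed, a single non-streaming request against the same deployment is simpler. A minimal sketch reusing the `openai` client and `model` from the steps above (the prompt is illustrative):

```python
# Single, non-streaming chat completion against the deployed model.
completion = openai.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's 1 + 1?"},
    ],
    max_tokens=256,
    temperature=0,
)
print(completion.choices[0].message.content)
```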