
LLMOps

Introduction

Large Language Model Operations (LLMOps) encompasses the specialized practices, techniques, and tools required to effectively manage, deploy, and maintain Large Language Model (LLM) applications. LLMOps ensures the efficient integration of LLMs into existing workflows while addressing their unique challenges and operational requirements.

Benefits of LLMOps

  • LLMOps facilitates the seamless integration of LLMs into the organization, aligning them with existing processes.
  • LLMOps ensures a smooth transition across different lifecycle phases, from ideation and development to deployment.
  • LLMOps provides efficient, scalable and risk-controlled management of LLM applications, enabling organizations to maximize benefits while minimizing risks.

MLOps vs. LLMOps

MLOps focuses on managing the operational aspects of traditional machine learning models, whereas LLMOps specializes in addressing the distinct challenges associated with LLMs.

Key differences between MLOps and LLMOps

Feature                 | LLMOps                                           | MLOps
------------------------+--------------------------------------------------+----------------------------------------------------------------
Model scale             | Handles large-scale models                       | Manages smaller models
Data focus              | Primarily processes text data                    | Works with diverse data types (structured, image, audio, etc.)
Pre-trained models      | Often leverages pre-trained models               | Typically does not rely on pre-trained models
Optimization techniques | Uses prompt engineering and fine-tuning          | Employs feature engineering and model selection
Generalization          | Supports broad, multi-domain applications        | Designed for specific, task-oriented models
Predictability          | Can be unpredictable and prone to hallucinations | More predictable in generating outputs
Output format           | Generates text-based responses                   | Produces task-specific outputs such as labels or probabilities

LLMOps on H2O MLOps

Using H2O MLOps, users can take any pre-trained LLM from the Hugging Face Hub, deploy it, and obtain an OpenAI-compatible API endpoint for easy integration into applications.

Deployment

The specified model is automatically downloaded from the Hugging Face Hub and deployed on H2O MLOps. Follow the steps below to get started:

  1. Install h2o_mlops

    To install h2o_mlops, refer to the Python Client installation guide.

    Note: You must install h2o_mlops version 1.3.0 or later.

  2. Connect to H2O MLOps.

    To connect to H2O MLOps from outside the H2O Cloud, use the following code:

    import time
    import h2o_mlops
    import h2o_mlops.options

    mlops = h2o_mlops.Client(
        h2o_cloud_url="<h2o_cloud_url>",
        refresh_token="<refresh_token>",
    )
    • h2o_cloud_url: This is the same URL used to access the H2O Cloud homepage.
    • refresh_token: For information on how to retrieve your refresh token (also known as a platform token), see API authentication.
  3. Specify the Hugging Face model.

    The second step is to define the Hugging Face model along with other experiment parameters.

    vllm_name = "my-vllm"
    hub_model_id = "TheBloke/Llama-2-7B-Chat-AWQ"  # Hugging Face model repo ID
    passphrase = "passphrase"
    • vllm_name: Defines the name of the vLLM experiment.
    • hub_model_id: Specifies the Hugging Face repository ID of the model to be deployed.
    • passphrase: Defines the security passphrase used to authenticate requests to the deployment.
  4. Create a project.

    Create an H2O MLOps project named LLMOps Demo.

    project = mlops.projects.create(name="LLMOps Demo")
  5. Create a vLLM experiment.

    vllm_experiment = project.experiments.create_vllm(
        hub_model_id=hub_model_id,
        name=vllm_name,
    )
  6. List created vLLM experiments.

    To list all the vLLM experiments in the project, use the following code:

    project.experiments.list(filter_vllm=True)

    Output:

     | name        | uid                                  | tags
    --+-------------+--------------------------------------+--------
    0 | my-vllm | 70e34587-e73f-4db4-b1bd-f6062a537ceb |
  7. Check the vLLM configuration for the experiment.

    Inspect the configuration for vllm_experiment.

    print(vllm_experiment.vllm_config)

    Output:

    {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
  8. Create and register the model.

    Create a model and register it with the experiment.

    vllm_model = project.models.create(name=vllm_name)

    vllm_model.register(vllm_experiment)
  9. Retrieve the list of tolerations.

    The following command retrieves the list of tolerations allowed in the H2O MLOps environment:

    mlops.allowed_tolerations

    Output:

    ['gpu', 'spot-cpu']
  10. Deploy the vLLM model.

    This step deploys the vLLM model to H2O MLOps. During deployment, make sure to specify one of the GPU-based tolerations allowed by MLOps.

    vllm_deployment = project.deployments.deploy_vllm(
        name=vllm_name,
        model=vllm_model,
        security_options=h2o_mlops.options.SecurityOptions(
            passphrase=passphrase,
        ),
        kubernetes_options=h2o_mlops.options.KubernetesOptions(
            limits={"nvidia.com/gpu": "1"},
            toleration="gpu",  # specify one of the GPU-based tolerations allowed by MLOps
        ),
    )
  11. List vLLM deployments.

    To list all the deployed vLLM models, use the following code:

    project.deployments.list(filter_vllm=True)

    Output:

       | name        | mode         | uid
    ----+-------------+--------------+--------------------------------------
    0 | my-vllm | Single Model | 6ea17ba2-ae79-4852-a295-3aea553c1495
  12. Print deployment information.

    Once the deployment is healthy, print important details such as:

    • Deployment details

    • Scorer details

    • Resource allocation details

    new_line_char = "\n"

    def print_deployment_info(deployment, deployment_type):
        # Wait for the deployment to become healthy, raising an error if it fails.
        while not deployment.is_healthy():
            deployment.raise_for_failure()
            time.sleep(5)
        print(f"{'=' * 30}{deployment_type} Deployment{'=' * 30}")
        print(f"Deployment Name: {deployment.name}")
        print(f"Deployment UID: {deployment.uid}")
        print(f"Deployment Status: {deployment.status()}")
        print(f"Scorer API Key: {deployment.security_options.passphrase}")
        print(f"Scorer API Base URL: {deployment.scorer_api_base_url}")
        print(f"OpenAI Base URL: {deployment.openai_base_url}")
        print(f"Configuration: {deployment.experiments[0].vllm_config}")
        print(f"Resources: [{str(deployment.kubernetes_options).replace(new_line_char, ', ')}]")
        print()

    To print the deployment information for the vLLM deployment, run:

    print_deployment_info(vllm_deployment, "vLLM")

    Output:

    ==============================vLLM Deployment==============================
    Deployment Name: my-vllm
    Deployment UID: 6ea17ba2-ae79-4852-a295-3aea553c1495
    Deployment Status: HEALTHY
    Scorer API Key: passphrase
    Scorer API Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495
    OpenAI Base URL: https://model.cloud-dev.h2o.dev/6ea17ba2-ae79-4852-a295-3aea553c1495/v1
    Configuration: {'model': 'TheBloke/Llama-2-7B-Chat-AWQ', 'name': 'my-vllm'}
    Resources: [replicas: 1, requests: {}, limits: {'nvidia.com/gpu': '1'}, affinity: , toleration: gpu]
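
The OpenAI Base URL printed above can be called with any OpenAI-compatible client, or directly over HTTP. The following is a minimal sketch using the requests library; it assumes the endpoint accepts the deployment passphrase as a bearer token, mirroring how the openai client in the Usage section below authenticates.

    import requests

    # Minimal sketch: call the OpenAI-compatible chat completions endpoint directly.
    # Assumes the passphrase is accepted as a bearer token, the same way the
    # openai client in the Usage section sends it.
    response = requests.post(
        f"{vllm_deployment.openai_base_url}/chat/completions",
        headers={"Authorization": f"Bearer {vllm_deployment.security_options.passphrase}"},
        json={
            "model": vllm_name,
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    print(response.json()["choices"][0]["message"]["content"])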

Usage

Once deployed, the model is exposed through a basic OpenAI-compatible API endpoint. Follow the steps below to get started:

  1. Install openai

    To install the openai package, run the following command:

    pip install openai
  2. Initialize OpenAI

    Initialize the OpenAI client with the deployment's API credentials to interact with the model.

    from openai import OpenAI

    openai = OpenAI(
        api_key=vllm_deployment.security_options.passphrase,
        base_url=vllm_deployment.openai_base_url,
    )
  3. Retrieve the model

    List the available models and select the deployed one.

    model = openai.models.list().data[0].id
    model

    Output:

    'my-vllm'
  4. Start a chat session

    Implement a simple interactive chat session with the deployed model.

    print("Chat session started. Type '-1' to exit.\n")
    messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    ]

    while True:
    user_input = input("User: ")

    if user_input == "-1":
    print("Exiting chat session.")
    break

    print(f"User: {user_input}\n")
    messages.append({"role": "user", "content": user_input})

    stream = openai.chat.completions.create(
    model=model,
    messages=messages[-3:],
    max_tokens=4000,
    stream=True,
    temperature=0,
    )

    response_content = ""
    print("Assistant: ", end="")
    for chunk in stream:
    response_content += chunk.choices[0].delta.content or ""
    print(chunk.choices[0].delta.content or "", end="")
    print("\n")

    messages.append({"role": "assistant", "content": response_content})

    Output:

    Chat session started. Type '-1' to exit.

    User: Hi

    Assistant: Hello! It's nice to meet you! I'm here to help with any questions or tasks you may have. How can I assist you today? Do you have a specific question or topic you'd like to discuss?

    User: Who're you?

    Assistant: Hello! I'm just an AI assistant trained by Meta AI to help with a variety of tasks, such as answering questions, providing information, and completing tasks. I'm here to help you with any questions or tasks you may have, so feel free to ask me anything!

    User: What's 1 + 1?

    Assistant: Sure! The answer to 1 + 1 is 2.

    Exiting chat session.
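
Beyond the interactive loop above, the same client supports plain single-turn requests. The following minimal, non-streaming sketch reuses the openai client and the model retrieved in the earlier steps; the prompt text is only illustrative.

    # A minimal non-streaming request, assuming `openai` and `model` from the steps above.
    completion = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize what LLMOps is in one sentence."},
        ],
        max_tokens=200,
        temperature=0,
    )
    print(completion.choices[0].message.content)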
