Key terms

H2O LLM DataStudio uses several key terms throughout its documentation. Each term is explained in the sections below.

LLM (Large Language Model)

An LLM is an advanced AI model that excels at natural language understanding and generation, built on a large neural network and trained on extensive text data.

Data curation

Data curation refers to converting unstructured data, such as PDF, DOC, audio, and video files, into structured formats such as question-answer pairs or summaries. The goal is to make the information in those files easier to work with.
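
For illustration, a single curated record derived from a PDF might pair a generated question with its answer and a pointer back to the source. This is only an example shape; the exact record schema used by H2O LLM DataStudio may differ.

```python
# Illustrative curated record; the exact schema in H2O LLM DataStudio may differ.
curated_record = {
    "question": "What warranty period does the product carry?",
    "answer": "The product is covered by a two-year limited warranty.",
    "source": "warranty_policy.pdf",  # original unstructured document
    "page": 3,                        # location of the supporting passage
}
```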

Tokenization

A process that breaks down text into smaller units, called tokens, to facilitate natural language processing tasks. Each token can be a word, subword, or character, allowing language models to analyze and understand text more effectively.
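
As a minimal sketch, the Hugging Face transformers library (used here purely for illustration, not as H2O LLM DataStudio's internal tokenizer) can split a sentence into subword tokens and the integer IDs a model consumes:

```python
# Minimal tokenization sketch using the Hugging Face "transformers" library.
# Illustration only; this is not H2O LLM DataStudio's internal tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Tokenization breaks text into smaller units."

tokens = tokenizer.tokenize(text)   # subword strings, e.g. ['token', '##ization', ...]
token_ids = tokenizer.encode(text)  # integer IDs the model actually consumes
print(tokens)
print(token_ids)
```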

Truncation

The process of shortening text by removing characters from the beginning or end of a sequence, commonly used to fit text within specified length constraints or to prepare input data for language models with fixed input size.
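
A simple sketch of truncating a token sequence to a fixed length; keeping the beginning of the sequence is shown here, and some workflows keep the end instead:

```python
# Truncate a token sequence to a maximum length.
def truncate(tokens, max_length, keep="head"):
    if len(tokens) <= max_length:
        return tokens
    return tokens[:max_length] if keep == "head" else tokens[-max_length:]

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
print(truncate(tokens, 5))               # ['the', 'quick', 'brown', 'fox', 'jumps']
print(truncate(tokens, 5, keep="tail"))  # ['jumps', 'over', 'the', 'lazy', 'dog']
```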

RLHF (Reinforcement Learning with Human Feedback)

A fine-tuning technique for Large Language Models (LLMs) in which human feedback guides further training: annotators rank or rate model outputs, a reward model is trained on those preferences, and the LLM is then optimized with reinforcement learning to produce responses that score highly under that reward model. RLHF is commonly used to align model behavior with human expectations for helpfulness and safety.
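
A typical preference record used to train the reward model pairs one prompt with a preferred and a rejected response. The field names below are illustrative, not a DataStudio schema.

```python
# Illustrative RLHF preference record (field names are examples, not a fixed schema).
preference_record = {
    "prompt": "Summarize the refund policy in one sentence.",
    "chosen": "Customers may request a full refund within 30 days of purchase.",
    "rejected": "Refunds are complicated and depend on many factors.",
}
# A reward model learns to score "chosen" higher than "rejected";
# the LLM is then fine-tuned with RL to maximize that learned reward.
```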

Augmentation

Augmentation lets you blend your datasets with publicly available datasets to add variety. In some cases, you can combine your datasets with RLHF-related datasets to add more domain coverage. The Augment tab shows a catalog of datasets that can be used immediately in the Prepare pipeline. You can also bring your own datasets for the augmentation process. For more information, see Augment.
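
As a rough sketch of the idea (not H2O LLM DataStudio's internal implementation), blending two question-answer datasets can be as simple as aligning their columns and concatenating the records:

```python
# Rough sketch of dataset blending with pandas; not DataStudio's implementation.
import pandas as pd

own_data = pd.DataFrame({
    "question": ["What is the warranty period?"],
    "answer": ["Two years from the date of purchase."],
})
public_data = pd.DataFrame({
    "question": ["What is tokenization?"],
    "answer": ["Splitting text into smaller units called tokens."],
})

# Stack the aligned records into one training set.
augmented = pd.concat([own_data, public_data], ignore_index=True)
print(augmented)
```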

HAMC (H2O AI Managed Cloud)

H2O AI Managed Cloud (HAMC) is the main platform for users, while the H2O Admin Center is the companion application that customers use to configure their dedicated cloud deployment for their business use cases. The H2O Admin Center supports firewall management and user management.

The fully managed edition of H2O AI Cloud offers the same features for making, operating, and innovating with your own AI. H2O.ai handles all infrastructure and software management, allowing customers to focus on solving their business problems with AI.

Key features of the managed offering include:

  • A completely dedicated cloud environment for each customer.
  • All operation activities (installation, upgrades, day-to-day operations) are handled by H2O.ai.
  • Monitoring and automatic alerts of resource consumption.
  • Quick onboarding process.
  • Protection of public resources with several layers of security including DDoS protection, web application firewall, and firewalls.
  • Optional control over who can access resources by include-listing specific IP addresses.

Most users prefer to manage the platform in a self-service manner, so they are given access to their account settings as well as to the inbound/outbound rules for moving data in and out of the platform.

H2O Managed Cloud is designed for high availability and dependability, providing tools to:

  • Develop AI-based applications
  • Deploy AI-based applications
  • Manage AI-based applications
  • Maintain AI-based applications

Protecting the confidentiality, integrity, and availability of your systems and data is a top priority.

Relevance score

The relevance score is a numerical value that quantifies how closely a text segment relates to a query. It is used to assess how well a specific piece of content matches a given search or context, enabling better sorting and prioritization of information. Several approaches can calculate this score: the BERT approach uses a transformer-based model for a deep understanding of context, the regex approach relies on pattern matching, and the FinBERT approach uses a model fine-tuned for financial text.
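
As a toy illustration of the regex (pattern-matching) approach only, a score can be derived from how many query terms appear in a segment; the BERT and FinBERT approaches instead compare model-based representations of the query and the segment.

```python
# Toy illustration of a regex/pattern-matching relevance score.
# H2O LLM DataStudio's actual scoring logic may differ.
import re

def regex_relevance(query: str, segment: str) -> float:
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return 0.0
    hits = sum(1 for t in terms if re.search(rf"\b{re.escape(t)}\b", segment.lower()))
    return hits / len(terms)  # fraction of query terms found in the segment

print(regex_relevance("warranty period", "The product carries a two-year warranty period."))  # 1.0
print(regex_relevance("warranty period", "Shipping takes five business days."))               # 0.0
```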

Sampling ratio

Sampling ratio refers to the proportion of data selected from a larger dataset to be processed or analyzed. In the context of Smart chunking, the sampling ratio determines the fraction of the entire document that will be used to generate records. A lower ratio means that only a small subset of the data is sampled, which can be useful for handling very large datasets more efficiently. Conversely, a higher ratio involves more data, potentially improving the accuracy of the model but at the cost of increased processing time. If set to 0, the system automatically determines the optimal sampling ratio based on the document size and other factors.
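
A minimal sketch of how a sampling ratio might be applied to document chunks; the auto heuristic shown for a ratio of 0 is purely illustrative and not DataStudio's actual logic.

```python
# Minimal sketch of applying a sampling ratio to document chunks.
# The auto heuristic for ratio == 0 is illustrative only.
import random

def sample_chunks(chunks, ratio):
    if ratio == 0:
        # Hypothetical auto mode: sample a smaller fraction of very large documents.
        ratio = 1.0 if len(chunks) <= 100 else 0.2
    k = max(1, int(len(chunks) * ratio))
    return random.sample(chunks, k)

chunks = [f"chunk_{i}" for i in range(1000)]
print(len(sample_chunks(chunks, 0.1)))  # ~100 chunks sampled
print(len(sample_chunks(chunks, 0)))    # auto heuristic picks the ratio
```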

Robust evaluation dataset

A robust evaluation dataset is a special set of test data used to evaluate how well a system can handle different types of questions. It includes new versions of the original questions, created by slightly changing how they are asked, but keeping the same answers. This ensures that the system is tested not only on the original questions but also on various forms of those questions, helping to check if it can consistently give the right answers even when the questions are worded differently.
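
For illustration, a robust evaluation entry might keep one expected answer while varying how the question is phrased; the structure below is an example, not a fixed schema.

```python
# Illustrative robust-evaluation entry: several phrasings of one question, one expected answer.
robust_eval_entry = {
    "answer": "The warranty lasts two years.",
    "question_variants": [
        "How long is the warranty?",
        "What is the duration of the warranty period?",
        "For how many years is the product covered by warranty?",
    ],
}
# A system is considered robust if it returns the same correct answer
# for every variant of the question.
```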

