Handling PII in h2oGPTe
Overview
Protecting personally identifiable information (PII) is essential for the safe use of generative AI systems. Enterprise h2oGPTe includes configurable features for detecting and handling PII. These capabilities are designed for organizations working with sensitive data and can be customized to fit specific use cases.
The capabilities below can be customized and applied to the entire platform by the Kubernetes admin. In most environments, PII detection is turned off by default.
All capabilities can also be customized at the collection level through the UI and APIs.
Options for Sanitization
When PII is detected, h2oGPTe supports the following actions:
- Take No Action (Allow Pass-through)
  PII is identified but left in place. Useful for internal/testing scenarios or when full transparency is required.
- Redact or Sanitize
  Sensitive parts are replaced with a placeholder, such as the detected type of PII (e.g., [PHONENUMBER]) or a mask (e.g., [XXXXX]), before being used anywhere in the system.
- Fail
  The action is halted and a failure is returned. Use where strict, fail-safe privacy controls are mandated.
You can set the appropriate action at the collection level based on risk tolerance, user role, or business process. The chosen action then applies to all of the specified types of PII in that collection.
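The three actions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the h2oGPTe implementation or API: the names `PiiAction`, `PiiDetectedError`, and `apply_pii_action` are hypothetical, and the detected spans are assumed to come from some separate detector as non-overlapping `(start, end, label)` tuples.

```python
from enum import Enum


class PiiAction(Enum):
    """Hypothetical names for the three actions described above."""
    ALLOW = "allow"    # take no action, pass the text through
    REDACT = "redact"  # replace each detected span with a placeholder
    FAIL = "fail"      # halt and report a failure


class PiiDetectedError(Exception):
    """Raised when the FAIL action is selected and PII is present."""


def apply_pii_action(text, spans, action):
    """Apply `action` to `text`.

    `spans` is a list of (start, end, label) tuples from some detector;
    the spans are assumed not to overlap.
    """
    if not spans or action is PiiAction.ALLOW:
        return text
    if action is PiiAction.FAIL:
        labels = sorted({label for _, _, label in spans})
        raise PiiDetectedError(f"PII detected: {', '.join(labels)}")
    # REDACT: replace from the end so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


redacted = apply_pii_action(
    "Call 555-0100 now", [(5, 13, "PHONENUMBER")], PiiAction.REDACT
)
```

Here `redacted` is `"Call [PHONENUMBER] now"`. With `PiiAction.FAIL` the same call would raise `PiiDetectedError` instead of returning text.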
Where PII Redaction Happens
PII detection and redaction can occur at the following stages:
- Document ingestion
  When a document is added to a collection or chat session, the system uses the collection's settings to detect and handle PII. This applies to the original document as well as to extracted or embedded data.
  Note: When choosing Redact during Document Ingestion for audio files, the PII is only sanitized after it has been transcribed.
  When documents are added to the vector database (used for retrieval-augmented generation, or RAG), PII can be detected and sanitized before storage.
- On User Inputs
  When a user interacts with an LLM or Agent, all queries can be checked and handled for PII before reaching the model. This applies to direct user inputs and to the System Prompt.
- On Model Outputs
  Before an LLM output is displayed or sent to users, PII can be detected and handled.
You can configure PII handling independently for each stage depending on your compliance and privacy requirements.
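As a rough illustration of independent per-stage settings, the sketch below models one collection's policy as a small dataclass. The class name `StagePiiPolicy`, its field names, the action strings, and the defaults are all assumptions made for this example; they are not h2oGPTe settings or API parameters.

```python
from dataclasses import dataclass

# Hypothetical action names mirroring the three sanitization options.
VALID_ACTIONS = {"allow", "redact", "fail"}


@dataclass(frozen=True)
class StagePiiPolicy:
    """Hypothetical per-stage PII policy for one collection."""
    document_ingestion: str = "allow"
    user_input: str = "allow"
    model_output: str = "allow"

    def __post_init__(self):
        # Reject typos early instead of silently ignoring a stage.
        for stage, action in vars(self).items():
            if action not in VALID_ACTIONS:
                raise ValueError(f"{stage}: unknown action {action!r}")

    def action_for(self, stage):
        """Return the configured action for a stage name."""
        return getattr(self, stage)


# Redact at ingestion, fail on risky user input, leave outputs untouched.
policy = StagePiiPolicy(document_ingestion="redact", user_input="fail")
```

Keeping one field per stage makes it explicit that a strict setting on user inputs does not imply anything about ingestion or model outputs.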
PII Detection Methods
h2oGPTe supports four methods for finding PII in your data. Each method has its own advantages and can be customized for your use case. It's common to use multiple methods in a single collection.
- Regular expressions (regex)
  Detects PII using predefined patterns or templates. Suitable for standardized formats such as email addresses or Social Security numbers.
- Presidio labels (SpaCy/Presidio-style)
  Uses Microsoft's Presidio engine to identify entities such as PERSON, EMAIL_ADDRESS, or LOCATION using a mix of rules and natural language processing models. This method is useful for free-text data where PII appears in various forms.
- ModernBERT-based custom model
  A transformer model fine-tuned by H2O.ai for identifying PII in complex or unstructured text. This method can be retrained to support different languages or domains.
  Note: This model is regularly updated to improve its performance. Recent enhancements have increased the accuracy and reliability of PII detection, with particular improvements for identifying entities like phone numbers.
- Custom PII entities for tabular data
  Supports detection of PII in structured datasets by defining column-level identifiers (for example, SSN, DOB). Useful for spreadsheets and databases that include organization-specific PII.
Each method can be tailored to your data and privacy needs, for example by tuning regex patterns, updating your list of Presidio labels, retraining the ModernBERT model, or specifying custom entities for your tabular data.
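To make the regex and column-level methods more concrete, here is a minimal sketch using only the Python standard library. The patterns, the labels in `PII_PATTERNS`, the `PII_COLUMNS` set, and both function names are assumptions for illustration; they are deliberately simplistic and do not reflect the patterns or configuration format that h2oGPTe uses.

```python
import re

# Illustrative patterns only; real deployments need broader, tested patterns.
PII_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical column-level identifiers for tabular data.
PII_COLUMNS = {"SSN", "DOB"}


def find_pii_spans(text):
    """Return sorted (start, end, label) tuples for every pattern match."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def find_pii_columns(header):
    """Return the column names that match a configured identifier."""
    return [name for name in header if name.strip().upper() in PII_COLUMNS]


text = "Mail jo@example.com, SSN 123-45-6789"
spans = find_pii_spans(text)
columns = find_pii_columns(["name", "ssn", "city"])
```

Regex matching works on the text itself, while the column check looks only at header names, which is why the two are often combined for files that mix free text and tables.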