Version: v1.7.5-1 🚧

Rate limits and fairness

Overview

Enterprise h2oGPTe provides configurable resource limits and fairness policies that keep any single user from exhausting shared resources. These controls include:

Collection and document limits: Per-user caps on the number of collections and documents
LLM cost controls: Per-user spending limits on a rolling 24-hour and lifetime basis
Chat fairness: Three-tier priority queuing and per-user rate limiting for chat requests
Crawl fairness: Concurrent job limits and priority management for document ingestion
MCP rate limiting: Per-user request limits for Model Context Protocol (MCP) endpoints

note

All settings on this page require administrator privileges. Settings marked Overridable can be customized per role through Roles and Permissions.

Access rate limit settings

In Enterprise h2oGPTe, click Account Circle.
Select System Dashboard.
In the Configuration section, click System settings.
Select the LIMITS category tab.

Collection and document limits

These settings control the maximum number of collections and documents each user can create.

Setting	Overridable	Description
`collection_limit`	No	System-wide maximum number of collections.
`collection_limit_per_user`	Yes	Maximum collections per user.
`document_limit_per_user`	Yes	Maximum documents per user.
`agents_document_limit_per_user`	Yes	Maximum documents created by agents per user.
`default_collection_size_limit`	No	Default maximum storage per collection (in bytes). Range: 1 MB to 10 GB. See Collection Lifecycle for configuration examples.
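
Because `default_collection_size_limit` is expressed in bytes, it can help to compute and range-check the value before submitting it. The following is a minimal sketch, not part of the product: the helper name `collection_size_value` is hypothetical, and it assumes binary units (1 MB = 1024² bytes), which this page does not specify.

```python
# Assumption: binary units (1 MB = 1024**2 bytes). The page does not state
# whether the documented range uses binary or decimal units.
MB = 1024 ** 2
GB = 1024 ** 3

MIN_SIZE_BYTES = 1 * MB    # lower bound from the table above
MAX_SIZE_BYTES = 10 * GB   # upper bound from the table above


def collection_size_value(size_bytes: int) -> str:
    """Range-check a size in bytes and return it as a string value.

    Hypothetical helper: the string form mirrors the "string_value"
    payloads used elsewhere on this page.
    """
    if not MIN_SIZE_BYTES <= size_bytes <= MAX_SIZE_BYTES:
        raise ValueError(
            f"{size_bytes} bytes is outside the 1 MB to 10 GB range"
        )
    return str(size_bytes)


# Example: a 2 GB default collection size limit, expressed in bytes.
print(collection_size_value(2 * GB))
```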

Configure collection limits

# Set per-user collection limit
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/collection_limit_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "500"}'

# Set per-user document limit
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/document_limit_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "5000"}'

LLM cost controls

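LLM cost controls track and cap the cost of LLM usage per user. Cost tracking is always active. When a user reaches a limit, the system rejects that user's further LLM requests. For the 24-hour cap, requests resume as earlier usage ages out of the rolling window or after an administrator raises the cap. The lifetime cap does not reset, so requests resume only after an administrator raises it.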

Setting	Overridable	Description
`max_llm_cost_per_user_per_24h`	Yes	Rolling 24-hour cost cap per user. Set to `-1` to disable.
`max_llm_cost_per_user`	Yes	Lifetime cost cap per user. Set to `-1` to disable.
`max_llm_cost_per_guest`	Yes	Cost cap for guest users. Set to `-1` to disable.
`llm_cost_units`	No	Currency unit for cost tracking (for example, `USD`).
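
To illustrate how a rolling 24-hour cap and a lifetime cap interact, here is a minimal sketch of the check described above. It is an illustration only, not the product's implementation: the function name `is_request_allowed` and the in-memory usage list are hypothetical, and it assumes a request is rejected once spending reaches the cap.

```python
from datetime import datetime, timedelta

DISABLED = -1  # per the table above, -1 disables a cap


def is_request_allowed(usage, now, cap_24h, cap_lifetime):
    """Decide whether a user's next LLM request would be accepted.

    usage: list of (timestamp, cost) tuples for one user (hypothetical store).
    """
    window_start = now - timedelta(hours=24)
    # Only usage inside the last 24 hours counts toward the rolling cap.
    spent_24h = sum(cost for ts, cost in usage if ts > window_start)
    # All usage counts toward the lifetime cap.
    spent_total = sum(cost for _, cost in usage)

    if cap_24h != DISABLED and spent_24h >= cap_24h:
        return False
    if cap_lifetime != DISABLED and spent_total >= cap_lifetime:
        return False
    return True


now = datetime(2024, 1, 2, 12, 0)
usage = [
    (now - timedelta(hours=30), 20.0),  # outside the rolling window
    (now - timedelta(hours=1), 10.0),   # inside the rolling window
]
print(is_request_allowed(usage, now, cap_24h=25, cap_lifetime=1000))
```

In this sketch the older entry no longer counts toward the 24-hour cap but still counts toward the lifetime cap, which is why the two limits can trip independently.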

Configure LLM cost limits

# Set 24-hour cost cap
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/max_llm_cost_per_user_per_24h" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "25"}'

# Set lifetime cost cap
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/max_llm_cost_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "1000"}'

note

LLM cost limits are overridable per role. Use Roles and Permissions to set different cost limits for different user groups.

Chat fairness

Chat fairness provides priority-based queuing and per-user rate limiting for chat requests. When enabled, the system uses a three-tier priority model that maintains fair access across users:

High priority: Users with no recent activity receive the fastest response times.
Normal priority: Active users below the heavy-use threshold receive standard response times.
Low priority: The system deprioritizes heavy users who exceed the activity threshold, giving other users fair access.

Starvation prevention: Every N requests, where N is set by `chat_fairness_starvation_interval`, the scheduler rotates queue priority so that lower-priority queues are served first and do not wait indefinitely.
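
The priority tiers and the starvation rotation can be sketched as follows. This is a simplified model under stated assumptions, not the actual scheduler: the names `Priority`, `classify`, and `FairScheduler` are hypothetical, a user is treated as heavy only when the request count strictly exceeds the threshold, and "rotation" is modeled as reversing the serving order on every Nth served request.

```python
from collections import deque
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 0
    NORMAL = 1
    LOW = 2


def classify(recent_requests: int, heavy_threshold: int) -> Priority:
    """Map a user's recent request count to a priority tier."""
    if recent_requests == 0:
        return Priority.HIGH      # no recent activity
    if recent_requests > heavy_threshold:
        return Priority.LOW       # assumption: strictly above the threshold
    return Priority.NORMAL


class FairScheduler:
    """Toy three-queue scheduler with periodic priority rotation."""

    def __init__(self, starvation_interval: int = 30):
        self.queues = {p: deque() for p in Priority}
        self.starvation_interval = starvation_interval
        self.served = 0

    def submit(self, request, priority: Priority) -> None:
        self.queues[priority].append(request)

    def next_request(self):
        order = list(Priority)
        # On every Nth served request, serve the lowest-priority queue first.
        if (self.served + 1) % self.starvation_interval == 0:
            order.reverse()
        for priority in order:
            if self.queues[priority]:
                self.served += 1
                return self.queues[priority].popleft()
        return None  # nothing queued


scheduler = FairScheduler(starvation_interval=3)
for name in ("h1", "h2", "h3"):
    scheduler.submit(name, Priority.HIGH)
scheduler.submit("l1", Priority.LOW)
print([scheduler.next_request() for _ in range(4)])
```

With an interval of 3, the third request served comes from the low-priority queue even though high-priority work is still waiting.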

The following environment variables control chat fairness behavior at deployment time:

Setting	Description
`chat_fairness_enabled`	Top-level toggle for chat fairness.
`chat_rate_limit_per_minute`	Maximum chat requests per user per minute. Exceeding this limit returns an HTTP 429 response.
`chat_fresh_user_window_minutes`	Time window (in minutes) used to decide whether a user has recent activity. Users with no activity in this window are classified as high priority.
`chat_heavy_user_threshold`	Number of requests before a user is classified as low priority.
`chat_fairness_starvation_interval`	Number of requests between priority rotations. Default: 30.
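
The per-minute limit in `chat_rate_limit_per_minute` can be pictured as a per-user counter over the last 60 seconds. The sketch below uses a sliding window, which is an assumption: this page does not say whether the system uses a sliding window, a fixed window, or a token bucket. The class name `PerMinuteLimiter` is hypothetical.

```python
from collections import defaultdict, deque


class PerMinuteLimiter:
    """Sliding-window limiter: at most `limit` requests per user per 60 s."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.hits = defaultdict(deque)  # user -> timestamps in seconds

    def check(self, user: str, now: float) -> int:
        """Return the HTTP status a request at time `now` would receive."""
        window = self.hits[user]
        # Drop timestamps that are 60 seconds old or older.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.limit:
            return 429  # Too Many Requests
        window.append(now)
        return 200


limiter = PerMinuteLimiter(limit_per_minute=2)
print([limiter.check("alice", t) for t in (0.0, 1.0, 2.0, 60.0)])
```

Each user has an independent window, so one user hitting the limit does not affect others.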

note

These settings are environment variables configured at deployment time. They are not runtime-configurable through the /api/v1/configurations endpoint.

The following setting is runtime-configurable through the /api/v1/configurations endpoint:

Setting	Overridable	Description
`chat_max_concurrent_per_user`	Yes	Maximum simultaneous active chat requests per user.

important

Configure chat fairness limits appropriate to your user base. Setting the concurrent limit and rate limit prevents any single user from monopolizing chat resources at the expense of others.
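
A per-user concurrency cap such as `chat_max_concurrent_per_user` can be modeled as a counter of in-flight requests per user. The following is a minimal sketch under that assumption; the class name `ConcurrencyGate` and its methods are hypothetical and do not reflect the product's internals.

```python
import threading
from collections import Counter


class ConcurrencyGate:
    """Caps the number of simultaneously active requests per user."""

    def __init__(self, max_concurrent_per_user: int):
        self.max_concurrent = max_concurrent_per_user
        self.active = Counter()         # user -> in-flight request count
        self._lock = threading.Lock()   # protects the counter across threads

    def try_acquire(self, user: str) -> bool:
        """Reserve a slot for the user; return False if the cap is reached."""
        with self._lock:
            if self.active[user] >= self.max_concurrent:
                return False
            self.active[user] += 1
            return True

    def release(self, user: str) -> None:
        """Free a slot when a request finishes."""
        with self._lock:
            if self.active[user] > 0:
                self.active[user] -= 1


gate = ConcurrencyGate(max_concurrent_per_user=2)
print([gate.try_acquire("alice") for _ in range(3)])
```

A caller would pair each successful `try_acquire` with a `release` in a `try`/`finally` block so that failed requests do not leak slots.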

Crawl fairness

Crawl fairness controls concurrent document ingestion jobs per user. When enabled, the system manages job priority to prevent any single user from monopolizing ingestion resources.

The following environment variables control crawl fairness behavior at deployment time:

Setting	Description
`crawl_fairness_enabled`	Top-level toggle for crawl fairness.
`crawl_fresh_user_window_minutes`	Time window (in minutes) for standard priority classification.
`crawl_heavy_user_jobs_threshold`	Number of jobs before a user is deprioritized.

note

These settings are environment variables configured at deployment time. They are not runtime-configurable through the /api/v1/configurations endpoint.

The following setting is runtime-configurable through the /api/v1/configurations endpoint:

Setting	Overridable	Description
`crawl_max_concurrent_per_user`	Yes	Maximum concurrent document ingestion jobs per user.

How crawl fairness works

Users below the heavy-use threshold use the standard ingestion queue.
The system routes users who exceed the threshold to a deprioritized queue, letting other users' jobs proceed first.
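
The two-queue routing described above can be sketched as follows. This is an illustrative model only: the function names `route_job` and `next_job` are hypothetical, and the sketch assumes the deprioritized queue is served only when the standard queue is empty, which this page does not state explicitly.

```python
from collections import deque


def route_job(job, user_job_count, heavy_threshold, standard, deprioritized):
    """Place a job in the standard or deprioritized queue.

    user_job_count: jobs the user has submitted in the recent window.
    Assumption: a user is heavy only when strictly above the threshold.
    """
    if user_job_count > heavy_threshold:
        deprioritized.append(job)
    else:
        standard.append(job)


def next_job(standard, deprioritized):
    """Pick the next job, preferring the standard queue."""
    if standard:
        return standard.popleft()
    if deprioritized:
        return deprioritized.popleft()
    return None


standard, deprioritized = deque(), deque()
route_job("heavy-user-job", 12, 10, standard, deprioritized)
route_job("light-user-job", 2, 10, standard, deprioritized)
print(next_job(standard, deprioritized))
```

Even though the heavy user's job was submitted first, the light user's job is picked first.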

MCP rate limiting

Model Context Protocol (MCP) endpoints have dedicated rate limiting to control request volume and payload size.

Setting	Description
`mcp_rate_limit`	Maximum requests per user per minute. Exceeding this limit returns an HTTP 429 response.
`mcp_max_body_size_mb`	Maximum request body size in MB. Exceeding this limit returns an HTTP 413 response.
`mcp_max_concurrent_jobs`	Maximum concurrent blocking job waits per user.

note

MCP rate limit settings are environment variables configured on the mux deployment. They are not runtime-configurable through the /api/v1/configurations endpoint.
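
The status codes in the table above suggest a pre-check of roughly the following shape. This sketch is an assumption-laden illustration: the function name `mcp_precheck` is hypothetical, the order of the two checks is a guess, and 1 MB is taken as 1024 × 1024 bytes.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def mcp_precheck(body: bytes, requests_in_last_minute: int,
                 rate_limit: int, max_body_size_mb: int) -> int:
    """Return the HTTP status an MCP request would receive.

    rate_limit corresponds to mcp_rate_limit and max_body_size_mb to
    mcp_max_body_size_mb; the check order here is an assumption.
    """
    if len(body) > max_body_size_mb * BYTES_PER_MB:
        return 413  # Payload Too Large
    if requests_in_last_minute >= rate_limit:
        return 429  # Too Many Requests
    return 200


print(mcp_precheck(b"{}", requests_in_last_minute=0,
                   rate_limit=60, max_body_size_mb=1))
```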

Configure rate limits with the Python SDK

The following example sets collection, document, and LLM cost limits using the Python SDK:

from h2ogpte import H2OGPTE

admin = H2OGPTE(address="https://<YOUR_DOMAIN>", api_key="<API_KEY>")

# Set collection limit per user (overridable per role)
admin.set_global_configuration(
    "collection_limit_per_user", "500", can_overwrite=True, is_public=True
)

# Set document limit per user (overridable per role)
admin.set_global_configuration(
    "document_limit_per_user", "5000", can_overwrite=True, is_public=True
)

# Set LLM cost limits (overridable per role)
admin.set_global_configuration(
    "max_llm_cost_per_user_per_24h", "25", can_overwrite=True, is_public=True
)
admin.set_global_configuration(
    "max_llm_cost_per_user", "1000", can_overwrite=True, is_public=True
)
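
When several limits share the same flags, the calls above can be driven from a single dictionary. The sketch below uses only the `set_global_configuration` call shown in the example, with the same arguments. `DryRunClient` and `apply_limits` are hypothetical names introduced here so the snippet runs without a server.

```python
from typing import Dict, List, Tuple

# Values taken from the example above.
LIMITS: Dict[str, str] = {
    "collection_limit_per_user": "500",
    "document_limit_per_user": "5000",
    "max_llm_cost_per_user_per_24h": "25",
    "max_llm_cost_per_user": "1000",
}


class DryRunClient:
    """Hypothetical stand-in that records calls instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, bool, bool]] = []

    def set_global_configuration(self, name, value,
                                 can_overwrite=False, is_public=False):
        self.calls.append((name, value, can_overwrite, is_public))


def apply_limits(client, limits: Dict[str, str]) -> None:
    """Apply each limit as an overridable, public global configuration."""
    for name, value in limits.items():
        client.set_global_configuration(
            name, value, can_overwrite=True, is_public=True
        )


client = DryRunClient()
apply_limits(client, LIMITS)
for call in client.calls:
    print(call)
```

To apply the limits for real, pass the `admin` client from the example above to `apply_limits` in place of `DryRunClient`.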

Related topics

System Settings - Manage global configuration settings including limits
Roles and Permissions - Configure per-role overrides for overridable limit settings
Collection Lifecycle - Collection expiration and size limits
