Rate limits and fairness
Overview
Enterprise h2oGPTe provides configurable resource limits and fair-use policies that maintain fair access across users and prevent resource exhaustion. These controls include:
- Collection and document limits: Per-user caps on the number of collections and documents
- LLM cost controls: Per-user spending limits on a rolling 24-hour and lifetime basis
- Chat fairness: Three-tier priority queuing and per-user rate limiting for chat requests
- Crawl fairness: Concurrent job limits and priority management for document ingestion
- MCP rate limiting: Per-user request limits for Model Context Protocol (MCP) endpoints
All settings on this page require administrator privileges. Administrators can customize settings marked Overridable per role through Roles and Permissions.
Access rate limit settings
- In Enterprise h2oGPTe, click Account Circle.
- Select System Dashboard.
- In the Configuration section, click System settings.
- Select the LIMITS category tab.
Collection and document limits
These settings control the maximum number of collections and documents each user can create.
| Setting | Overridable | Description |
|---|---|---|
| collection_limit | No | System-wide maximum number of collections. |
| collection_limit_per_user | Yes | Maximum collections per user. |
| document_limit_per_user | Yes | Maximum documents per user. |
| agents_document_limit_per_user | Yes | Maximum documents created by agents per user. |
| default_collection_size_limit | No | Default maximum storage per collection (in bytes). Range: 1 MB to 10 GB. See Collection Lifecycle for configuration examples. |
Configure collection limits
# Set per-user collection limit
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/collection_limit_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "500"}'

# Set per-user document limit
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/document_limit_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "5000"}'
LLM cost controls
LLM cost controls track and cap the cost of LLM usage per user. Cost tracking is always active. When a user reaches a limit, the system rejects their additional LLM requests. For the rolling 24-hour cap, requests are accepted again once older charges age out of the window. A lifetime cap stays in force until an administrator increases it.
| Setting | Overridable | Description |
|---|---|---|
| max_llm_cost_per_user_per_24h | Yes | Rolling 24-hour cost cap per user. Set to -1 to disable. |
| max_llm_cost_per_user | Yes | Lifetime cost cap per user. Set to -1 to disable. |
| max_llm_cost_per_guest | Yes | Cost cap for guest users. Set to -1 to disable. |
| llm_cost_units | No | Currency unit for cost tracking (for example, USD). |
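The interaction between the rolling 24-hour cap, the lifetime cap, and the -1 "disabled" value can be illustrated with a minimal sketch. The `CostTracker` class and its method names are hypothetical and exist only for illustration; they are not part of the h2oGPTe API, and the actual server-side accounting may differ.

```python
from collections import deque
from typing import Deque, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # rolling 24-hour window


class CostTracker:
    """Hypothetical per-user cost tracker; not the h2oGPTe implementation."""

    def __init__(self, cap_24h: float, cap_lifetime: float) -> None:
        self.cap_24h = cap_24h            # -1 disables the rolling cap
        self.cap_lifetime = cap_lifetime  # -1 disables the lifetime cap
        self.lifetime_cost = 0.0
        self.events: Deque[Tuple[float, float]] = deque()  # (timestamp, cost)

    def _cost_last_24h(self, now: float) -> float:
        # Drop charges that have aged out of the rolling window.
        while self.events and self.events[0][0] <= now - WINDOW_SECONDS:
            self.events.popleft()
        return sum(cost for _, cost in self.events)

    def allows_request(self, now: float) -> bool:
        """Return False once either enabled cap has been reached."""
        if self.cap_24h != -1 and self._cost_last_24h(now) >= self.cap_24h:
            return False
        if self.cap_lifetime != -1 and self.lifetime_cost >= self.cap_lifetime:
            return False
        return True

    def record(self, now: float, cost: float) -> None:
        """Record the cost of a completed LLM request."""
        self.events.append((now, cost))
        self.lifetime_cost += cost
```

In this sketch a user blocked by the 24-hour cap becomes eligible again without administrator action once enough charges fall out of the window, while the lifetime total only ever grows.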
Configure LLM cost limits
# Set 24-hour cost cap
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/max_llm_cost_per_user_per_24h" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "25"}'

# Set lifetime cost cap
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/max_llm_cost_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "1000"}'
LLM cost limits are overridable per role. Use Roles and Permissions to set different cost limits for different user groups.
Chat fairness
Chat fairness provides priority-based queuing and per-user rate limiting for chat requests. When enabled, the system assigns each user to one of three priority tiers:
- High priority: Users with no recent activity receive the fastest response times.
- Normal priority: Active users below the heavy-use threshold receive standard response times.
- Low priority: Heavy users who exceed the activity threshold are deprioritized so that other users keep fair access.
To prevent starvation, every N requests (N is set by chat_fairness_starvation_interval) the scheduler rotates queue priority so that lower-priority queues are served first.
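The tier assignment and starvation rotation described above can be sketched as follows. This is an illustrative model only: `classify`, `FairScheduler`, and their parameters are hypothetical names, the sketch assumes "exceeds the threshold" means strictly greater than, and it assumes rotation simply reverses the tier order for one request. The real scheduler may behave differently.

```python
from collections import deque
from typing import Deque, Dict, Optional

TIERS = ["high", "normal", "low"]


def classify(recent_requests: int, heavy_threshold: int) -> str:
    """Map a user's request count in the recent-activity window to a tier."""
    if recent_requests == 0:
        return "high"    # no recent activity
    if recent_requests > heavy_threshold:
        return "low"     # heavy user (assumed: strictly above the threshold)
    return "normal"


class FairScheduler:
    """Hypothetical three-queue scheduler; not the h2oGPTe implementation."""

    def __init__(self, starvation_interval: int = 30) -> None:
        self.starvation_interval = starvation_interval
        self.queues: Dict[str, Deque[str]] = {tier: deque() for tier in TIERS}
        self.served = 0

    def submit(self, request_id: str, tier: str) -> None:
        self.queues[tier].append(request_id)

    def next_request(self) -> Optional[str]:
        turn = self.served + 1
        # On every Nth served request, scan the queues lowest priority first.
        rotate = turn % self.starvation_interval == 0
        order = list(reversed(TIERS)) if rotate else TIERS
        for tier in order:
            if self.queues[tier]:
                self.served = turn
                return self.queues[tier].popleft()
        return None  # all queues are empty
```

With an interval of 3, for example, two high-priority requests are served, then one waiting low-priority request, then high-priority service resumes.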
The following environment variables control chat fairness behavior at deployment time:
| Setting | Description |
|---|---|
| chat_fairness_enabled | Top-level toggle for chat fairness. |
| chat_rate_limit_per_minute | Maximum chat requests per user per minute. Exceeding this limit returns an HTTP 429 response. |
| chat_fresh_user_window_minutes | Look-back window (in minutes) used to decide whether a user has recent activity. Users with no activity in this window are classified as high priority. |
| chat_heavy_user_threshold | Number of requests before a user is classified as low priority. |
| chat_fairness_starvation_interval | Number of requests between priority rotations. Default: 30. |
Because these settings are environment variables set at deployment time, they cannot be changed at runtime through the /api/v1/configurations endpoint.
The following setting is runtime-configurable through the /api/v1/configurations endpoint:
| Setting | Overridable | Description |
|---|---|---|
| chat_max_concurrent_per_user | Yes | Maximum simultaneous active chat requests per user. |
Configure chat fairness limits appropriate to your user base. Setting the concurrent limit and rate limit prevents any single user from monopolizing chat resources at the expense of others.
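To show how a per-minute rate limit and a concurrency limit combine, here is a minimal sliding-window sketch. `ChatLimiter` and its return values are hypothetical. The documentation states that exceeding the per-minute limit produces an HTTP 429 response, but it does not state what happens when the concurrency limit is reached, so the sketch reports that case with a separate label and makes no claim about the actual behavior.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class ChatLimiter:
    """Hypothetical per-user limiter; not the h2oGPTe implementation."""

    def __init__(self, rate_limit_per_minute: int, max_concurrent: int) -> None:
        self.rate_limit_per_minute = rate_limit_per_minute
        self.max_concurrent = max_concurrent
        self.recent: Dict[str, Deque[float]] = defaultdict(deque)  # start times
        self.active: Dict[str, int] = defaultdict(int)             # in-flight count

    def try_start(self, user: str, now: float) -> str:
        """Return "ok", "rate_limited" (HTTP 429), or "too_many_concurrent"."""
        window = self.recent[user]
        # Forget requests that started more than 60 seconds ago.
        while window and window[0] <= now - 60:
            window.popleft()
        if len(window) >= self.rate_limit_per_minute:
            return "rate_limited"
        if self.active[user] >= self.max_concurrent:
            return "too_many_concurrent"
        window.append(now)
        self.active[user] += 1
        return "ok"

    def finish(self, user: str) -> None:
        """Mark one of the user's active requests as completed."""
        self.active[user] = max(0, self.active[user] - 1)
```

Limits are tracked per user, so one user reaching either limit does not affect requests from anyone else.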
Crawl fairness
Crawl fairness controls concurrent document ingestion jobs per user. When enabled, the system manages job priority to prevent any single user from monopolizing ingestion resources.
The following environment variables control crawl fairness behavior at deployment time:
| Setting | Description |
|---|---|
| crawl_fairness_enabled | Top-level toggle for crawl fairness. |
| crawl_fresh_user_window_minutes | Time window (in minutes) for standard priority classification. |
| crawl_heavy_user_jobs_threshold | Number of jobs before a user is deprioritized. |
Because these settings are environment variables set at deployment time, they cannot be changed at runtime through the /api/v1/configurations endpoint.
The following setting is runtime-configurable through the /api/v1/configurations endpoint:
| Setting | Overridable | Description |
|---|---|---|
| crawl_max_concurrent_per_user | Yes | Maximum concurrent document ingestion jobs per user. |
How crawl fairness works
- Users below the heavy-use threshold use the standard ingestion queue.
- The system routes users who exceed the threshold to a deprioritized queue, letting other users' jobs proceed first.
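The two-queue routing above can be modeled in a few lines. `CrawlQueues` is a hypothetical name, and the sketch assumes the deprioritized queue is served only when the standard queue is empty. The documentation does not specify the exact scheduling rule.

```python
from collections import deque
from typing import Deque, Optional


class CrawlQueues:
    """Hypothetical two-queue router; not the h2oGPTe implementation."""

    def __init__(self, heavy_jobs_threshold: int) -> None:
        self.heavy_jobs_threshold = heavy_jobs_threshold
        self.standard: Deque[str] = deque()
        self.deprioritized: Deque[str] = deque()

    def submit(self, job_id: str, user_recent_jobs: int) -> str:
        """Queue a job and return the name of the queue it was routed to."""
        if user_recent_jobs > self.heavy_jobs_threshold:
            self.deprioritized.append(job_id)
            return "deprioritized"
        self.standard.append(job_id)
        return "standard"

    def next_job(self) -> Optional[str]:
        # Assumption: standard jobs always run before deprioritized ones.
        if self.standard:
            return self.standard.popleft()
        if self.deprioritized:
            return self.deprioritized.popleft()
        return None
```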
MCP rate limiting
Model Context Protocol (MCP) endpoints have dedicated rate limiting to control request volume and payload size.
| Setting | Description |
|---|---|
| mcp_rate_limit | Maximum requests per user per minute. Exceeding this limit returns an HTTP 429 response. |
| mcp_max_body_size_mb | Maximum request body size in MB. Exceeding this limit returns an HTTP 413 response. |
| mcp_max_concurrent_jobs | Maximum concurrent blocking job waits per user. |
MCP rate limit settings are environment variables configured on the mux deployment. They are not runtime-configurable through the /api/v1/configurations endpoint.
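Clients that call MCP endpoints may want to handle these responses explicitly. The sketch below shows one possible client-side approach: retry with exponential backoff on HTTP 429, and check the payload size before sending to avoid HTTP 413. The helper names, the `send` callable, and the retry parameters are all hypothetical, and the sketch assumes 1 MB means 1024 * 1024 bytes, which the documentation does not confirm.

```python
import time
from typing import Callable


def body_fits(body: bytes, max_body_size_mb: int) -> bool:
    """Client-side pre-check against the body size limit (assumes 1 MB = 1024 * 1024 bytes)."""
    return len(body) <= max_body_size_mb * 1024 * 1024


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429 or attempts run out.

    send is a hypothetical callable that performs one request and returns
    the HTTP status code. Other statuses, including 413, are returned
    immediately because retrying the same payload would not help.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Wait base_delay, then 2x, then 4x, and so on before each retry.
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.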
Configure rate limits with the Python SDK
The following example sets collection, document, and LLM cost limits using the Python SDK:
from h2ogpte import H2OGPTE

admin = H2OGPTE(address="https://<YOUR_DOMAIN>", api_key="<API_KEY>")

# Set collection limit per user (overridable per role)
admin.set_global_configuration(
    "collection_limit_per_user", "500", can_overwrite=True, is_public=True
)

# Set document limit per user (overridable per role)
admin.set_global_configuration(
    "document_limit_per_user", "5000", can_overwrite=True, is_public=True
)

# Set LLM cost limits (overridable per role)
admin.set_global_configuration(
    "max_llm_cost_per_user_per_24h", "25", can_overwrite=True, is_public=True
)
admin.set_global_configuration(
    "max_llm_cost_per_user", "1000", can_overwrite=True, is_public=True
)
Related topics
- System Settings - Manage global configuration settings including limits
- Roles and Permissions - Configure per-role overrides for overridable limit settings
- Collection Lifecycle - Collection expiration and size limits