Version: v1.7.3-14 🚧

Rate limits and fairness

Overview

Enterprise h2oGPTe provides configurable resource limits and fair-use policies that maintain fair access across users and prevent resource exhaustion. These controls include:

  • Collection and document limits: Per-user caps on the number of collections and documents
  • LLM cost controls: Per-user spending limits on a rolling 24-hour and lifetime basis
  • Chat fairness: Three-tier priority queuing and per-user rate limiting for chat requests
  • Crawl fairness: Concurrent job limits and priority management for document ingestion
  • MCP rate limiting: Per-user request limits for Model Context Protocol (MCP) endpoints
Note: All settings on this page require administrator privileges. Settings marked as Overridable can be customized per role through Roles and Permissions.

Access rate limit settings

  1. In Enterprise h2oGPTe, click Account Circle.
  2. Select System Dashboard.
  3. In the Configuration section, click System settings.
  4. Select the LIMITS category tab.

Collection and document limits

These settings control the maximum number of collections and documents each user can create.

| Setting | Overridable | Description |
|---|---|---|
| collection_limit | No | System-wide maximum number of collections. |
| collection_limit_per_user | Yes | Maximum collections per user. |
| document_limit_per_user | Yes | Maximum documents per user. |
| agents_document_limit_per_user | Yes | Maximum documents created by agents per user. |
| default_collection_size_limit | No | Default maximum storage per collection (in bytes). Range: 1 MB to 10 GB. See Collection Lifecycle for configuration examples. |

Configure collection limits

# Set per-user collection limit
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/collection_limit_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "500"}'

# Set per-user document limit
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/document_limit_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "5000"}'

LLM cost controls

LLM cost controls track and cap each user's LLM spending. Cost tracking is always active. When a user reaches a limit, the system rejects that user's further LLM requests until the 24-hour rolling window resets the limit or an administrator raises it.

| Setting | Overridable | Description |
|---|---|---|
| max_llm_cost_per_user_per_24h | Yes | Rolling 24-hour cost cap per user. Set to -1 to disable. |
| max_llm_cost_per_user | Yes | Lifetime cost cap per user. Set to -1 to disable. |
| max_llm_cost_per_guest | Yes | Cost cap for guest users. Set to -1 to disable. |
| llm_cost_units | No | Currency unit for cost tracking (for example, USD). |

Configure LLM cost limits

# Set 24-hour cost cap
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/max_llm_cost_per_user_per_24h" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "25"}'

# Set lifetime cost cap
curl -X PUT "https://<YOUR_DOMAIN>/api/v1/configurations/max_llm_cost_per_user" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"string_value": "1000"}'

Note: LLM cost limits are overridable per role. Use Roles and Permissions to set different cost limits for different user groups.
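
To make the interaction between the rolling and lifetime caps concrete, here is a minimal sketch of how the two could be evaluated together. It is an illustration only and not h2oGPTe's implementation: the `CostTracker` class and its methods are hypothetical, and the only detail taken from the settings above is that `-1` disables a cap.

```python
from collections import deque
from dataclasses import dataclass, field

DAY_SECONDS = 24 * 60 * 60
DISABLED = -1  # per the settings above, -1 disables a cap


@dataclass
class CostTracker:
    """Hypothetical per-user tracker; not part of the h2oGPTe API."""

    cap_24h: float = DISABLED
    cap_lifetime: float = DISABLED
    lifetime_total: float = 0.0
    recent: deque = field(default_factory=deque)  # (timestamp, cost) pairs

    def _spent_last_24h(self, now: float) -> float:
        # Drop entries that have aged out of the rolling window.
        while self.recent and self.recent[0][0] <= now - DAY_SECONDS:
            self.recent.popleft()
        return sum(cost for _, cost in self.recent)

    def allows_request(self, now: float) -> bool:
        """Return True while the user is under every enabled cap."""
        if self.cap_lifetime != DISABLED and self.lifetime_total >= self.cap_lifetime:
            return False
        if self.cap_24h != DISABLED and self._spent_last_24h(now) >= self.cap_24h:
            return False
        return True

    def record(self, now: float, cost: float) -> None:
        # Every recorded cost counts toward both the window and the lifetime total.
        self.recent.append((now, cost))
        self.lifetime_total += cost


tracker = CostTracker(cap_24h=25, cap_lifetime=1000)
tracker.record(now=0, cost=25)
blocked = tracker.allows_request(now=3600)               # cap reached inside the window
allowed_later = tracker.allows_request(now=DAY_SECONDS + 1)  # the spend has aged out
```

In this sketch the 24-hour cap frees up on its own as old entries leave the window, while the lifetime total only grows, so a user who reaches the lifetime cap stays blocked until the cap itself is raised.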

Chat fairness

Chat fairness provides priority-based queuing and per-user rate limiting for chat requests. When enabled, the system assigns each user to one of three priority tiers to maintain fair access:

  • High priority: Users with no recent activity receive the fastest response times.
  • Normal priority: Active users below the heavy-use threshold receive standard response times.
  • Low priority: Heavy users who exceed the activity threshold are deprioritized so that other users keep fair access.

To prevent starvation, the scheduler rotates queue priority every N requests (N is set by chat_fairness_starvation_interval) so that lower-priority queues get served first.

The following environment variables control chat fairness behavior at deployment time:

| Setting | Description |
|---|---|
| chat_fairness_enabled | Top-level toggle for chat fairness. |
| chat_rate_limit_per_minute | Maximum chat requests per user per minute. Exceeding this limit returns an HTTP 429 response. |
| chat_fresh_user_window_minutes | Time window (in minutes) for classifying a user as high priority. |
| chat_heavy_user_threshold | Number of requests before a user is classified as low priority. |
| chat_fairness_starvation_interval | Number of requests between priority rotations. Default: 30. |

Note: These settings are environment variables configured at deployment time. They are not runtime-configurable through the /api/v1/configurations endpoint.

The following setting is runtime-configurable through the /api/v1/configurations endpoint:

| Setting | Overridable | Description |
|---|---|---|
| chat_max_concurrent_per_user | Yes | Maximum simultaneous active chat requests per user. |

Important: Configure chat fairness limits appropriate to your user base. Setting the concurrent limit and rate limit prevents any single user from monopolizing chat resources at the expense of others.
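
The tier model and the starvation rotation described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the h2oGPTe scheduler: `Tier`, `classify`, and `FairScheduler` are hypothetical names, the sketch assumes activity is counted inside the window set by chat_fresh_user_window_minutes, and it treats "exceeds the threshold" as strictly greater than chat_heavy_user_threshold.

```python
from collections import deque
from enum import IntEnum


class Tier(IntEnum):
    HIGH = 0
    NORMAL = 1
    LOW = 2


def classify(recent_requests: int, heavy_user_threshold: int) -> Tier:
    """Map a user's request count inside the fresh-user window to a tier."""
    if recent_requests == 0:
        return Tier.HIGH  # no recent activity
    if recent_requests > heavy_user_threshold:
        return Tier.LOW  # heavy user (assumed: strictly above the threshold)
    return Tier.NORMAL


class FairScheduler:
    """Hypothetical three-queue scheduler with periodic priority rotation."""

    def __init__(self, starvation_interval: int = 30):
        self.queues = {tier: deque() for tier in Tier}
        self.starvation_interval = starvation_interval
        self.served = 0

    def submit(self, tier: Tier, request: str) -> None:
        self.queues[tier].append(request)

    def next_request(self):
        order = list(Tier)  # HIGH, NORMAL, LOW
        # On every Nth served request, check the lowest-priority queue first.
        if (self.served + 1) % self.starvation_interval == 0:
            order.reverse()
        for tier in order:
            if self.queues[tier]:
                self.served += 1
                return self.queues[tier].popleft()
        return None  # nothing queued


scheduler = FairScheduler(starvation_interval=3)
for name in ("h1", "h2", "h3"):
    scheduler.submit(Tier.HIGH, name)
scheduler.submit(Tier.LOW, "l1")
served_order = [scheduler.next_request() for _ in range(4)]
```

With an interval of 3, the third request served comes from the low-priority queue even though high-priority work is still waiting, which is the behavior the starvation setting is meant to guarantee.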

Crawl fairness

Crawl fairness controls concurrent document ingestion jobs per user. When enabled, the system manages job priority to prevent any single user from monopolizing ingestion resources.

The following environment variables control crawl fairness behavior at deployment time:

| Setting | Description |
|---|---|
| crawl_fairness_enabled | Top-level toggle for crawl fairness. |
| crawl_fresh_user_window_minutes | Time window (in minutes) for standard priority classification. |
| crawl_heavy_user_jobs_threshold | Number of jobs before a user is deprioritized. |

Note: These settings are environment variables configured at deployment time. They are not runtime-configurable through the /api/v1/configurations endpoint.

The following setting is runtime-configurable through the /api/v1/configurations endpoint:

| Setting | Overridable | Description |
|---|---|---|
| crawl_max_concurrent_per_user | Yes | Maximum concurrent document ingestion jobs per user. |
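
As a hedged sketch, the two runtime-configurable concurrency settings (chat_max_concurrent_per_user and crawl_max_concurrent_per_user) could be updated with the same PUT request shape used by the curl examples on this page. The assumption here is that these settings accept the same `string_value` payload; the domain and the values `"3"` and `"2"` are placeholders, and `build_config_request` is a hypothetical helper. The code only builds the requests so that they can be inspected before anything is sent.

```python
import json
import urllib.request


def build_config_request(domain: str, api_key: str, name: str, value: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request for one configuration key."""
    return urllib.request.Request(
        url=f"https://{domain}/api/v1/configurations/{name}",
        data=json.dumps({"string_value": value}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )


# Placeholder domain, key, and values; replace them with your own.
requests_to_send = [
    build_config_request("h2ogpte.example.com", "<API_KEY>", "chat_max_concurrent_per_user", "3"),
    build_config_request("h2ogpte.example.com", "<API_KEY>", "crawl_max_concurrent_per_user", "2"),
]

# To apply the settings, send each request, for example:
#     for req in requests_to_send:
#         urllib.request.urlopen(req)
```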

How crawl fairness works

  • Users below the heavy-use threshold use the standard ingestion queue.
  • The system routes users who exceed the threshold to a deprioritized queue, letting other users' jobs proceed first.
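
The routing rule above can be sketched in a few lines. This is an illustration under assumptions rather than the actual ingestion scheduler: `IngestionQueues` is a hypothetical class, the job count is assumed to be measured inside the window set by crawl_fresh_user_window_minutes, and "exceeds" is treated as strictly greater than crawl_heavy_user_jobs_threshold.

```python
from collections import deque


class IngestionQueues:
    """Hypothetical two-queue model: standard jobs run before deprioritized ones."""

    def __init__(self, heavy_user_jobs_threshold: int):
        self.heavy_user_jobs_threshold = heavy_user_jobs_threshold
        self.standard = deque()
        self.deprioritized = deque()

    def enqueue(self, job_id: str, user_recent_jobs: int) -> str:
        """Route a job by its owner's recent job count and return the queue name."""
        if user_recent_jobs > self.heavy_user_jobs_threshold:
            self.deprioritized.append(job_id)
            return "deprioritized"
        self.standard.append(job_id)
        return "standard"

    def next_job(self):
        # Jobs from users under the threshold always proceed first.
        if self.standard:
            return self.standard.popleft()
        if self.deprioritized:
            return self.deprioritized.popleft()
        return None


queues = IngestionQueues(heavy_user_jobs_threshold=2)
queues.enqueue("heavy-job", user_recent_jobs=5)
queues.enqueue("light-job", user_recent_jobs=1)
first = queues.next_job()  # the light user's job runs first despite arriving later
```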

MCP rate limiting

Model Context Protocol (MCP) endpoints have dedicated rate limiting to control request volume and payload size.

| Setting | Description |
|---|---|
| mcp_rate_limit | Maximum requests per user per minute. Exceeding this limit returns an HTTP 429 response. |
| mcp_max_body_size_mb | Maximum request body size in MB. Exceeding this limit returns an HTTP 413 response. |
| mcp_max_concurrent_jobs | Maximum concurrent blocking job waits per user. |

Note: MCP rate limit settings are environment variables configured on the mux deployment. They are not runtime-configurable through the /api/v1/configurations endpoint.
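
Clients that call rate-limited endpoints may want to handle these status codes explicitly. The sketch below shows one common client-side pattern, exponential backoff on HTTP 429. It is not part of h2oGPTe: `call_with_backoff` is a hypothetical helper, `send` stands for any function that performs the request and returns the HTTP status code, and the delay values are arbitrary. An HTTP 413 response is returned without retrying, because resending the same oversized body would fail again.

```python
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429 or attempts run out."""
    status = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Wait 1x, 2x, 4x, ... the base delay before each retry.
        sleep(base_delay * 2 ** (attempt - 1))
        status = send()
    return status


# Demonstration with canned responses and a recording sleep function,
# so the example runs instantly and without a server.
canned = iter([429, 429, 200])
waited = []
final_status = call_with_backoff(lambda: next(canned), sleep=waited.append)
```

The `sleep` parameter is injectable only to keep the demonstration fast; in real use the default `time.sleep` applies.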

Configure rate limits with the Python SDK

The following example sets collection, document, and LLM cost limits using the Python SDK:

from h2ogpte import H2OGPTE

admin = H2OGPTE(address="https://<YOUR_DOMAIN>", api_key="<API_KEY>")

# Set collection limit per user (overridable per role)
admin.set_global_configuration(
    "collection_limit_per_user", "500", can_overwrite=True, is_public=True
)

# Set document limit per user (overridable per role)
admin.set_global_configuration(
    "document_limit_per_user", "5000", can_overwrite=True, is_public=True
)

# Set LLM cost limits (overridable per role)
admin.set_global_configuration(
    "max_llm_cost_per_user_per_24h", "25", can_overwrite=True, is_public=True
)
admin.set_global_configuration(
    "max_llm_cost_per_user", "1000", can_overwrite=True, is_public=True
)
