Version: Next

A Chat session's settings


You can personalize a Chat session using various settings. These settings, for example, let you adjust the system prompt and choose which Large Language Model (LLM) to use for generating responses.


  1. In the Enterprise h2oGPTe navigation menu, click Chats.
  2. In the Chat sessions table, select the Chat session you want to customize.
  3. Click Settings.
  4. Customize the Chat session according to your requirements. For more detailed information about each setting, see Chat settings.
  5. Click Update to apply the changes.

Chat settings

The Chat settings tab includes the following settings:

Prompt template to use

This setting lets you choose a prompt template to customize the prompts used within the Collection. You can create your own prompt template on the Prompts page and apply it to your Collection.

LLM to use

This setting lets you choose the Large Language Model (LLM) to generate responses.

Generation approach (RAG type to use)

This setting lets you select the generation approach for responses. Enterprise h2oGPTe provides various methods to generate responses:

  • LLM Only (no RAG)

    This option generates a response to answer the user's query solely based on the Large Language Model (LLM) without considering supporting Document contexts from the collection.

  • RAG (Retrieval Augmented Generation)

    This option uses a neural/lexical hybrid search to find relevant contexts in the collection based on the user's query, and then generates a response from them. It is applicable when the prompt is easily understood and the context contains enough information to come up with a correct answer.

    RAG first performs a vector search for chunks similar to the query, sorts them by a distance metric, and keeps a limited number of the closest ones. By default, Enterprise h2oGPTe chooses the top 25 chunks using lexical distance and the top 25 using neural distance. The distance metric is calculated from the cross-entropy loss of the BAAI/bge-reranker-large model. These chunks are passed to the selected LLM to answer the user's query. Note that Enterprise h2oGPTe lets you view the exact prompt passed to the LLM.

  • HyDE RAG (Hypothetical Document Embeddings)

    This option extends RAG with neural/lexical hybrid search by using both the user's query and an initial LLM response to find relevant contexts in the collection for generating the final response. It requires two LLM calls. It is applicable when the prompt is somewhat ambiguous or the context does not contain enough information to come up with a correct answer.

    HyDE (Hypothetical Document Embeddings) is essentially the same as RAG except that it does not simply search for the embeddings with the smallest distance to the query. Instead, it first asks an LLM to try to answer the question. It then uses the question and the hypothetical answer to search for the nearest chunks.

    Example question: What are the implications of high interest rate?

    • RAG: Searches for chunks in the document with a small distance to the embedding of the question: "What are the implications of high interest rate?"

    • HyDE RAG:

      1. Asks an LLM: "What are the implications of high interest rate?"
      2. LLM answers: "High interest rates can have several implications, including: higher borrowing cost, slower economic growth, increased savings rate, higher returns on investment, exchange rate fluctuation, ..."
      3. RAG searches for chunks in the document with a small distance to the embedding of the question AND the answer from Step 2. This effectively increases the number of potentially relevant chunks.

  • HyDE RAG+ (Combined HyDE+RAG)

    This option uses RAG with neural/lexical hybrid search, using both the user's query and the HyDE RAG response to find relevant contexts in the collection for generating a response. It requires three LLM calls. It is applicable when the prompt is very ambiguous or the context contains conflicting information, making it very difficult to come up with a correct answer.

  • RAG+ (RAG without LLM context limit)

    This option uses RAG (Retrieval Augmented Generation) with neural/lexical hybrid search, using the user's query to find relevant contexts in the collection for generating a response. It uses the recursive summarization technique to overcome the LLM's context limitations, so the process requires multiple LLM calls. It is applicable when the prompt asks for a summary of the context or for a lengthy answer, such as a procedure, that might require processing multiple large pieces of information.

    The vector search is performed as in RAG, but this time k neighboring chunks are added to the retrieved chunks. The returned chunks are then sorted in the order they appear in the document so that neighboring chunks stay together. The expanded set of chunks is essentially a filtered sub-document of the original document that is more pertinent to the user's question. Enterprise h2oGPTe then summarizes this sub-document while trying to answer the user's question. This step uses the summary API, which applies the prompt to each context-filling chunk of the sub-document. It then joins two or more of the resulting answers and applies the same prompt to them, recursively reducing until only one answer remains.

    The benefit of this additional complexity is that if the answer is spread throughout the document, this mode can include more information from the original document, as well as neighboring chunks for additional context.
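
The retrieval steps described above for RAG and HyDE RAG can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the toy `lexical_score` and `neural_score` functions stand in for the real lexical and neural distances, the two top-k lists are merged as a simple union, and `llm` is a hypothetical callable. It is not the Enterprise h2oGPTe implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,?!:;\"'").lower() for w in text.split())
    return [w for w in words if w]


def lexical_score(search_text, chunk):
    """Toy lexical score: number of distinct terms shared with the chunk."""
    return len(set(tokenize(search_text)) & set(tokenize(chunk)))


def embed(text):
    """Toy stand-in for a neural embedding: counts of character trigrams."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def neural_score(search_text, chunk):
    """Cosine similarity between the toy embeddings."""
    a, b = embed(search_text), embed(chunk)
    dot = sum(a[key] * b[key] for key in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_retrieve(search_text, chunks, top_k=25):
    """Indices of the top_k lexical and top_k neural chunks (assumed union)."""
    indices = range(len(chunks))
    lexical = sorted(
        indices, key=lambda i: lexical_score(search_text, chunks[i]), reverse=True
    )[:top_k]
    neural = sorted(
        indices, key=lambda i: neural_score(search_text, chunks[i]), reverse=True
    )[:top_k]
    return sorted(set(lexical) | set(neural))


def hyde_search_text(question, llm):
    """HyDE: search with the question AND a hypothetical LLM answer."""
    hypothetical_answer = llm(question)  # first LLM call
    return question + " " + hypothetical_answer


chunks = [
    "Central banks raise interest rates to fight inflation.",
    "Higher borrowing costs slow economic growth.",
    "The company cafeteria opens at nine.",
]
question = "What are the implications of high interest rate?"


def fake_llm(prompt):
    """Hypothetical LLM stub that returns a canned hypothetical answer."""
    return "Higher borrowing costs and slower economic growth."


print("RAG hits: ", hybrid_retrieve(question, chunks, top_k=1))
print("HyDE hits:", hybrid_retrieve(hyde_search_text(question, fake_llm), chunks, top_k=1))
```

In this toy setup, the hypothetical answer shares terms such as "borrowing" and "growth" with the second chunk, so the HyDE search text can reach a chunk that the bare question has no words in common with.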

Number of neighboring chunks to include for RAG+

This setting lets you determine the number of neighboring chunks to include when using the RAG+ (RAG without LLM context limit) generation approach. RAG+ adds the chunks that surround each retrieved chunk to the context so that responses are more accurate and contextually relevant.

When you increase the number of neighboring chunks, you expand the context available for the RAG+ model to generate responses. This can enhance the accuracy of the generated outputs because the model has access to a broader range of information. By considering more surrounding context, the model can better understand the nuances and intricacies of the input query, resulting in more informed and relevant responses.

However, it's essential to note that higher values for this setting come with trade-offs. Increasing the number of neighboring chunks requires more time and computational resources. Each additional chunk expands the amount of data the model needs to process, leading to longer processing times and potentially increasing the number of language model (LLM) calls necessary to generate a response.

Therefore, when adjusting this setting, you need to strike a balance between accuracy and efficiency. Higher values can indeed improve the quality of responses, but they also come with increased computational costs. It's essential to consider your specific use case and resources available when determining the optimal value for this setting.
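
The trade-off can be made concrete with a rough sketch. It assumes that "k neighboring chunks" means k chunks on each side of every retrieved chunk, that each chunk is processed in its own LLM call, and that answers are merged two at a time; the function names are hypothetical, and the real system may pack several chunks into one context and merge more answers per call.

```python
import math


def expand_with_neighbors(hit_indices, k, num_chunks):
    """Add k chunks on each side of every hit, returned in document order."""
    expanded = set()
    for i in hit_indices:
        first = max(0, i - k)
        last = min(num_chunks - 1, i + k)
        expanded.update(range(first, last + 1))
    return sorted(expanded)


def count_reduce_calls(num_parts, fan_in=2):
    """LLM calls for a recursive reduce: one per part, then one per merge."""
    calls = num_parts
    remaining = num_parts
    while remaining > 1:
        remaining = math.ceil(remaining / fan_in)
        calls += remaining
    return calls


hits = [3, 10, 11]  # indices of retrieved chunks in a 20-chunk document
for k in (0, 1, 2):
    sub_document = expand_with_neighbors(hits, k, num_chunks=20)
    print(k, len(sub_document), count_reduce_calls(len(sub_document)))
```

Under these assumptions, going from k = 0 to k = 2 grows the sub-document from 3 to 11 chunks and the number of LLM calls from 6 to 23, which is the cost side of the balance described above.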


Temperature

This setting lets you adjust the temperature parameter, which affects the variability of the model's text generation. Raising the temperature softens (flattens) the probability distribution over the vocabulary, which encourages the model to produce more diverse and creative responses.

A higher temperature value makes the model more willing to take risks and explore less likely word choices. This can result in more unpredictable but more imaginative outputs. Conversely, lower temperatures produce more conservative and predictable responses, favoring high-probability words.

Adjusting the temperature parameter is particularly useful when you want to inject more variability into the generated text. For example, a higher temperature can inspire a broader range of ideas in creative writing or brainstorming scenarios. However, a lower temperature might be preferable to ensure accuracy in tasks requiring precise or factual information.
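
The effect of temperature on the probability distribution can be sketched with the standard temperature-scaled softmax. The logits below are made-up values for three candidate tokens, and the assumption is that the selected LLM applies temperature in this conventional way.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it more evenly across the candidates.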

Max. number of output tokens

This setting lets you control the maximum number of tokens the model can generate as output. The model can process only a limited number of tokens (words or subwords) at once, and this limit covers both the input text you provide and the generated output.

This setting is crucial because it determines the length of the responses the model can provide. By default, the model limits the number of tokens in its output to ensure it can handle the input text and generate a coherent response. However, for detailed answers or to avoid incomplete responses, you may need to allow for longer responses.

Increasing the number of output tokens expands the model's capacity to generate longer responses. However, this expansion comes with a trade-off: it may require sacrificing some input context. In other words, allocating more tokens to the output might mean reducing the number of tokens available for processing the input text. This trade-off is important to consider because it can affect the quality and relevance of the model's responses.
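
The trade-off is simple arithmetic, sketched below. The context window size, prompt overhead, and chunk size are hypothetical numbers chosen for illustration; actual limits depend on the selected LLM.

```python
def input_token_budget(context_window, max_output_tokens, prompt_overhead=0):
    """Tokens left for input context after reserving room for the output."""
    budget = context_window - max_output_tokens - prompt_overhead
    if budget < 0:
        raise ValueError("output reservation exceeds the context window")
    return budget


def chunks_that_fit(budget, tokens_per_chunk):
    """Number of whole context chunks that fit into the input budget."""
    return budget // tokens_per_chunk


# Hypothetical model with a 4096-token context window, 200 tokens of
# system prompt and question, and 256-token document chunks.
for max_output in (1024, 2048):
    budget = input_token_budget(4096, max_output, prompt_overhead=200)
    print(max_output, budget, chunks_that_fit(budget, 256))
```

With these numbers, doubling the output limit from 1024 to 2048 tokens reduces the input budget from 2872 to 1848 tokens, so only 7 chunks fit instead of 11.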

Include self-reflection using gpt-4-1106-preview

This setting lets you engage in self-reflection with the model's responses. With self-reflection, the model reviews both the prompt you've given and the response it generates. It's particularly useful for spot checks, especially when working with less computationally expensive models.

Self-reflection lets you assess the quality and relevance of the model's output in the context of the input prompt. By reviewing both the prompt and the generated response, you can quickly identify any inconsistencies, errors, or areas for improvement.

Additionally, self-reflection becomes even more valuable when the responses are generated by a less expensive model. Because such models have limitations compared to larger ones such as GPT-4, having gpt-4-1106-preview review their responses helps ensure that the responses meet your expectations despite these constraints.