
Task 8: Surrogate models tab

Overview

In the Surrogate models tab, you'll explore a widely used data mining and engineering technique: surrogate models, which are simple models used to explain a more complex one. For example, a decision tree surrogate model is trained to predict the outputs of the more intricate H2O Driverless AI model from the original inputs. Surrogate models provide a heuristic understanding of the complex model, but they are approximations and lack mathematical precision. Nonetheless, they are valuable tools for explanation and debugging, offering insights into global and local predictions and into model residuals or errors. Note that surrogate models operate in the original feature space, so what they reveal about the behavior of the H2O Driverless AI model is expressed in terms of the original input features.

Let's better understand the H2O Driverless AI model by observing the decision tree surrogate model.

  1. In the Surrogate models tab, click the DECISION TREE tile.
     Image: Surrogate models tab containing a decision tree surrogate model
note

Understanding the decision tree surrogate model

A decision tree surrogate model is a decision tree trained to mimic the predictions of the more complex model. Tracing the decision paths within the surrogate tree gives analysts insight into the decision-making processes of the H2O Driverless AI model.
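To make the idea concrete, here is a minimal sketch in plain Python of fitting a one-split surrogate (a decision stump) to the predictions of a complex model. Everything in it is an assumption for illustration: `black_box` is a made-up stand-in for the complex model, `LIMIT_BAL` is a hypothetical second feature, and the data is synthetic. It is not how H2O Driverless AI builds its surrogate tree.

```python
from statistics import mean

def black_box(pay_1, limit_bal):
    """Hypothetical stand-in for the complex model: returns a default probability."""
    base = 0.10 + 0.12 * max(pay_1, 0)
    return min(0.95, base + (0.05 if limit_bal < 50_000 else 0.0))

def fit_stump(rows, targets, feature_names):
    """Find the single split (feature, threshold) that minimises squared error."""
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[j] < thr]
            right = [t for r, t in zip(rows, targets) if r[j] >= thr]
            l_mean, r_mean = mean(left), mean(right)
            sse = (sum((t - l_mean) ** 2 for t in left)
                   + sum((t - r_mean) ** 2 for t in right))
            if best is None or sse < best["sse"]:
                best = {"feature": name, "threshold": thr,
                        "left": l_mean, "right": r_mean, "sse": sse}
    return best

# Synthetic inputs in the original feature space.
rows = [(p, l) for p in (-1, 0, 1, 2, 3, 4) for l in (20_000, 100_000)]
# The surrogate is trained on the complex model's predictions, not on true labels.
targets = [black_box(*r) for r in rows]

stump = fit_stump(rows, targets, ["PAY_1", "LIMIT_BAL"])
print(stump["feature"], stump["threshold"])
```

The key design point is that the surrogate's training target is the complex model's output, so the resulting split describes the model's behavior rather than the raw data.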

  • Variable interactions: Variables that appear directly above or below one another in the decision tree may interact strongly in the H2O Driverless AI model.
  • Variable importance: Variables that appear higher in the decision tree, or more often, are typically more important to the model's decisions.
  • Thick lines: A thick line along a path to a terminal node indicates a commonly traversed decision path.
  • Thin lines: A thin line indicates a relatively infrequent decision path.
  • Terminal nodes: Each terminal node represents a distinct outcome (for our model, each terminal node shows a default probability).
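The thick and thin lines can be thought of as counts of how many rows follow each path. The sketch below illustrates that idea with a hypothetical two-level tree and made-up rows; the tree structure, the `PAY_2` feature, the leaf labels, and the data are all assumptions, not values read from the tutorial's surrogate model.

```python
from collections import Counter

# Hypothetical surrogate tree: internal nodes split on a feature, leaves carry a label.
TREE = {
    "feature": "PAY_1", "threshold": 1.5,
    "left": {
        "feature": "PAY_2", "threshold": 1.5,
        "left": {"leaf": "low"},
        "right": {"leaf": "medium"},
    },
    "right": {"leaf": "high"},
}

def leaf_for(row, node=TREE):
    """Follow the splits until a terminal node is reached and return its label."""
    while "leaf" not in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

# Made-up rows for illustration only.
rows = [
    {"PAY_1": -1, "PAY_2": -1},
    {"PAY_1": 0, "PAY_2": 0},
    {"PAY_1": -1, "PAY_2": 2},
    {"PAY_1": 2, "PAY_2": 2},
    {"PAY_1": 0, "PAY_2": -1},
    {"PAY_1": 1, "PAY_2": 0},
]

# A path that many rows follow would be drawn thick; a rare one would be drawn thin.
leaf_counts = Counter(leaf_for(r) for r in rows)
for leaf, n in leaf_counts.most_common():
    print(f"{leaf}: {n}/{len(rows)} rows")
```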

The root mean square error (RMSE) of 0.08 and the R-squared (R2) value of 0.84 indicate that the decision tree surrogate model approximates the H2O Driverless AI model reasonably well. With a low RMSE and a relatively high R2, it is reasonable to treat this surrogate model as a fairly reliable, though still approximate, guide to the model's behavior.

rmse-and-r2.png
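For reference, both metrics compare the surrogate's predictions with the predictions of the model it approximates. The sketch below uses the standard textbook definitions of RMSE and R2 on made-up numbers; whether the tab computes them in exactly this way (for example, on which data split) is an assumption.

```python
from math import sqrt
from statistics import mean

def rmse(model_preds, surrogate_preds):
    """Root mean square error between the two sets of predictions."""
    return sqrt(mean((m - s) ** 2 for m, s in zip(model_preds, surrogate_preds)))

def r_squared(model_preds, surrogate_preds):
    """Share of the variance in the model's predictions that the surrogate explains."""
    avg = mean(model_preds)
    ss_res = sum((m - s) ** 2 for m, s in zip(model_preds, surrogate_preds))
    ss_tot = sum((m - avg) ** 2 for m in model_preds)
    return 1 - ss_res / ss_tot

# Made-up predictions for illustration only.
model_preds = [0.10, 0.20, 0.60, 0.80]
surrogate_preds = [0.15, 0.15, 0.70, 0.70]

print(round(rmse(model_preds, surrogate_preds), 4))
print(round(r_squared(model_preds, surrogate_preds), 4))
```

A lower RMSE and an R2 closer to 1 both mean the surrogate tracks the complex model more closely.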

PAY_1 is the feature used in the first split of the tree. Given that position, PAY_1 is likely one of the most important features in the H2O Driverless AI model, if not the most important.

  1. Click the terminal node with the highest probability; it should turn red upon clicking.
     Image: Highest probability path in the decision tree surrogate model

Before moving forward, recall that PAY_1 represents an individual's repayment status for September 2005 in the credit card dataset. Each value in the PAY_1 column indicates whether the payment was on time or delayed: -1 indicates timely payment, and values from 1 to 9 denote the length of the delay in months. For instance, a value of 1 indicates a one-month delay, while 9 signifies a delay of nine months or longer.
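The encoding described above can be written down as a small lookup helper. This is only a sketch of the scale as stated in this tutorial; the function name is hypothetical, and any value outside the ones described here is reported as undocumented rather than guessed.

```python
def describe_pay_1(value):
    """Translate a PAY_1 code into the meaning given in this tutorial."""
    if value == -1:
        return "paid on time"
    if 1 <= value <= 8:
        return f"payment delayed by {value} month{'s' if value > 1 else ''}"
    if value == 9:
        return "payment delayed by nine months or more"
    # Values not described in this tutorial are left uninterpreted.
    return "not described in this tutorial"

for code in (-1, 1, 2, 9):
    print(code, "->", describe_pay_1(code))
```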

With that in mind, the relevance of PAY_1 becomes apparent. According to the selected node, being more than one month late (PAY_1 >= 1.500, which for month-count values means a delay of two months or more) steers individuals to the side of the tree where higher default probabilities are prevalent.
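Because the PAY_1 codes described above are month counts, a threshold of 1.5 falls between a one-month and a two-month delay. The short sketch below, which assumes only the codes listed in this tutorial (-1 and 1 through 9), shows which of them end up on each side of the split.

```python
SPLIT_THRESHOLD = 1.5  # the threshold shown on the selected path

# Only the PAY_1 codes described in this tutorial: -1 and 1 through 9.
described_values = [-1] + list(range(1, 10))

higher_risk_side = [v for v in described_values if v >= SPLIT_THRESHOLD]
lower_risk_side = [v for v in described_values if v < SPLIT_THRESHOLD]

print("PAY_1 >= 1.5:", higher_risk_side)
print("PAY_1 <  1.5:", lower_risk_side)
```

A one-month delay (value 1) stays on the lower side of the split; only delays of two months or more satisfy the condition.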

As a cross-check, the decision tree surrogate model shows that the most commonly traversed paths lead to low default probabilities, which is consistent with the observation that defaulting is relatively uncommon.

You have now explored the Surrogate models tab. In Task 9, you will learn how to create a new model diagnostic based on the successfully completed experiment.

