Version: v0.8.0

Analysis settings

H2O AutoInsights selects certain analysis types before starting an overall H2O AutoInsights analysis as part of the Analysis flow. Users can customize the default settings for all selected analysis types. All supported analysis types have specific default settings.

For the most part, H2O AutoInsights uses the default settings of a particular analysis type. When necessary, it adjusts them to extract as many insights as possible, based on the column data types of the selected dataset. However, users can always customize all settings of a particular analysis type right before starting an overall H2O AutoInsights analysis.

info

An H2O AutoInsights analysis is composed of supported analysis types.

Access analysis settings

To access and customize the settings of selected analysis types:

  1. Before starting an analysis (before you click Analyze), click Customize....
  2. In the Insight Types menu, click the analysis type you want to customize.

To learn about the settings of each analysis type, see the sections below.

Save/view insights

Save

In the H2O AutoInsights dashboard, you can view all generated insights of an overall H2O AutoInsights analysis. To save a particular insight from the dashboard:

  1. In an insight card, click More Options.
  2. Click Star.
    info

    The H2O AutoInsights dashboard appears right after completing an overall H2O AutoInsights analysis.

View

All starred insights are saved and can always be viewed on the Starred page. To access the Starred page:

  1. Click Menu.
  2. In the H2O AutoInsights menu, click Starred.

View completed analyses

To view completed analyses in H2O AutoInsights:

  1. Click Menu.
  2. In the H2O AutoInsights menu, click Analysis.

On the Analysis page, all completed analyses are listed by name in the Completed Analyses table.

Analysis types settings

Frequency analysis

Below are all the available settings for a Frequency analysis.

  1. Number of insights
    • The number of insights H2O AutoInsights will generate around a frequency analysis. H2O AutoInsights will sort these insights locally by variance.

Top and bottom analysis

Below are all the available settings for a Top and bottom analysis.

  1. Number of insights
    • The number of insights H2O AutoInsights will generate around a top and bottom analysis. H2O AutoInsights will sort these insights locally by the coefficient of variation of the category counts.
  2. Maximum cardinality
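
The ranking metric mentioned above, the coefficient of variation of the category counts, can be illustrated with a minimal sketch. The function name, the sample columns, and the descending sort order are assumptions made for illustration; they do not describe the internal implementation of H2O AutoInsights.

```python
from collections import Counter
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return (standard deviation / mean) of the category counts of one column."""
    counts = list(Counter(values).values())
    avg = mean(counts)
    return pstdev(counts) / avg if avg else 0.0


# Hypothetical categorical columns.
columns = {
    "region": ["east", "east", "east", "west"],
    "channel": ["web", "store", "web", "store"],
}

# Assumption: columns with more uneven category counts are ranked first.
ranked = sorted(
    columns,
    key=lambda name: coefficient_of_variation(columns[name]),
    reverse=True,
)
print(ranked)
```

In this sketch, a column whose categories occur equally often has a coefficient of variation of 0 and is ranked last.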

Measure by measure analysis

Below are all the available settings for a Measure by measure analysis.

  1. Number of insights
    • The number of insights H2O AutoInsights will generate around a measure by measure analysis.
      info

      A pair of variables (X, Y) can contain multiple sub-insights based on the additional insights selection. The maximum number of insights set here does not account for sub-insights and only relates to unique variable pairs used in the X and Y-axis.

  2. Minimum cardinality for the measures
  3. Fit regression line
  4. Additional insight choices
    Options
    • Default
    • Add dimension: Adding a dimension lets the points in the scatter plot be colored by the categories of a categorical column.
    • Add measure: Adding a measure lets the points in the scatter plot be sized by another numeric variable.
    • Add dimension and size by additional measure
  5. Remove outliers: Removing outliers helps visualize the relationship better.
  6. Contamination % of outliers: Specify the possible percentage of anomalies (outliers) in the dataset.
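
To make the contamination setting more concrete, the sketch below flags a fixed percentage of points as outliers before plotting. It uses distance from the median as a stand-in scoring rule; the actual outlier detection method used by H2O AutoInsights is not described here, so the scoring rule and all names are assumptions.

```python
import math
from statistics import median


def flag_outliers(values, contamination_pct):
    """Flag about contamination_pct percent of points, farthest from the median first."""
    n_outliers = math.floor(len(values) * contamination_pct / 100)
    center = median(values)
    # Indices ordered from the most to the least distant point.
    by_distance = sorted(
        range(len(values)),
        key=lambda i: abs(values[i] - center),
        reverse=True,
    )
    flagged = set(by_distance[:n_outliers])
    return [i in flagged for i in range(len(values))]


values = [10, 11, 9, 10, 12, 95, 10, 11, 9, 10]
mask = flag_outliers(values, contamination_pct=10)
kept = [v for v, is_outlier in zip(values, mask) if not is_outlier]
print(kept)
```

A higher contamination percentage removes more points, so the value should reflect how many anomalies you realistically expect in the dataset.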

Correlation analysis

Below are all the available settings for a Correlation analysis.

  1. Include plots
    Options
    • Heatmap
    • Network graph
  2. Maximum number of columns
    • Select the number of columns to include in the correlation plot. When the original number of columns in the data is higher than the set value, a sub-sampling technique is applied to select the columns that best capture the data's maximum variance.
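
The exact sub-sampling technique is not specified in this section. As a rough illustration of the idea only, the sketch below keeps the columns with the highest variance when the dataset has more columns than the configured maximum. The function name and the ranking rule are assumptions, and raw variance depends on the scale of each column.

```python
from statistics import pvariance


def select_columns(data, max_columns):
    """Keep at most max_columns columns, preferring those with the highest variance."""
    if len(data) <= max_columns:
        return list(data)
    ranked = sorted(data, key=lambda name: pvariance(data[name]), reverse=True)
    keep = set(ranked[:max_columns])
    # Preserve the original column order in the result.
    return [name for name in data if name in keep]


# Hypothetical numeric columns.
data = {
    "a": [1, 1, 1, 1],
    "b": [1, 5, 9, 13],
    "c": [2, 3, 2, 3],
}
print(select_columns(data, max_columns=2))
```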

Dimension by dimension analysis

Below are all the available settings for a Dimension by dimension analysis.

  1. Number of insights
    • Insights are sorted locally by matrix cardinality.
  2. Maximum cardinality
    • Cardinality is the number of possible values a feature can assume. If the cardinality of a variable is higher than the maximum set, the heatmaps will be replaced with a barplot.
  3. Color heat map cells by
    Options
    • Aggregation of a numeric variable
    • Count across categories
  4. Aggregation method
    • The metric selected here is used to aggregate numeric data; this is represented by the intensity of a cell in the heatmap.
      Options
      • Average
      • Sum
      • Minimum
      • Maximum
  5. Additional insights using calendar heatmaps
    • Calendar heatmaps visualize data across time components like the day of the week, month, hour, etc.
  6. Plot type
    Options
    • Heatmap
    • Stacked bars
    • 100% Stacked bars
    • Grouped bars
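
The interaction between Maximum cardinality and the heatmap can be sketched as follows. The helper names are hypothetical; the sketch only shows how cardinality is counted, how the fallback to a bar plot could be decided, and what "Count across categories" means for a pair of categorical columns.

```python
from collections import Counter


def cardinality(values):
    """Number of distinct values a column takes."""
    return len(set(values))


def choose_plot(x, y, max_cardinality):
    """Fall back to a bar plot when either column exceeds the maximum cardinality."""
    if cardinality(x) > max_cardinality or cardinality(y) > max_cardinality:
        return "barplot"
    return "heatmap"


def count_matrix(x, y):
    """Count how often each (x category, y category) pair occurs."""
    return Counter(zip(x, y))


x = ["a", "b", "a", "c"]
y = ["u", "v", "u", "u"]
print(choose_plot(x, y, max_cardinality=2))
print(count_matrix(x, y))
```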

Measure by dimension analysis

Below are all the available settings for a Measure by dimension analysis.

  1. Number of insights
  2. Maximum cardinality
  3. Aggregation method
    • The metric selected for this method aggregates the numeric data within each category and is visualized only in the bar plot.
      Options
      • Average
      • Sum
      • Minimum
      • Maximum
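
As an illustration of what the Aggregation method options compute, the sketch below groups a numeric column by the categories of another column and applies the selected method. The mapping and function names are assumptions; the option labels are taken from the list above.

```python
from collections import defaultdict
from statistics import mean

# Option labels from the settings list mapped to plain Python functions.
AGGREGATIONS = {"Average": mean, "Sum": sum, "Minimum": min, "Maximum": max}


def aggregate_by_category(categories, measures, method):
    """Aggregate the measure values within each category."""
    groups = defaultdict(list)
    for category, value in zip(categories, measures):
        groups[category].append(value)
    aggregate = AGGREGATIONS[method]
    return {category: aggregate(values) for category, values in groups.items()}


categories = ["a", "b", "a"]
measures = [1, 2, 3]
print(aggregate_by_category(categories, measures, "Average"))
print(aggregate_by_category(categories, measures, "Sum"))
```

Each resulting value would correspond to the height of one bar in the bar plot.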

Clustering analysis

Below are all the available settings for a Clustering analysis.

  1. Auto cluster
  2. Maximum number of clusters
  3. Explanation fidelity
    • Explanation fidelity controls the faithfulness and complexity of the explanation. Ideally, one would like 100% faithfulness, but sometimes this affects the readability of the explanation in the current setting.

Anomaly detection multivariable analysis

Below are all the available settings for an Anomaly detection multivariable analysis.

  1. Imputation method for missing values
    Options
    • Default: The default method uses mean to replace missing values for numerical and transformed categorical columns.
    • Model-based imputation: Model-based imputation uses iterative predictive models to replace missing values.
  2. Estimated percentage of outliers (contamination)
    • The estimated percentage influences the maximum number of points classified as anomalies (outliers).
  3. Explanation fidelity
    • Explanation fidelity controls the faithfulness and complexity of the explanation. Ideally, one would like 100% faithfulness, but sometimes this affects the readability of the explanation in the current setting.
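
The Default imputation method replaces missing values with the mean. A minimal sketch of that idea for a single numeric column is shown below; representing missing values as `None` and the function name are assumptions made for illustration.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to average; return the column unchanged.
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


print(impute_mean([1, None, 3]))
```

Model-based imputation, by contrast, predicts each missing value from the other columns, which this sketch does not attempt.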

Time series analysis

Below are all the available settings for a Time series analysis.

  1. Number of insights
  2. Maximum cardinality
    • Cardinality is the number of possible values a feature can assume. The Time Series Analysis skips a column if the cardinality is higher than the selected value.
  3. Include categoricals
    • Categoricals that are not marked as time series identifiers will be included in the Time Series Analysis.
  4. Include counts as measure
    • H2O AutoInsights will treat counts of the categories of a categorical variable at every date unit as a measure.
  5. Aggregation method
    • The selected method will aggregate the numeric columns at a date unit level.
      Options
      • Average
      • Sum
      • Minimum
      • Maximum
  6. Date aggregation level
    • H2O AutoInsights will aggregate the data at the selected date unit level.
      Options
      • Auto
      • Daily
      • Weekly
      • Monthly
      • Quarterly
      • Yearly
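
The combination of Aggregation method and Date aggregation level can be sketched as grouping rows by a date unit and then aggregating each group. The key formats and function names below are assumptions, and the Auto option is omitted because its selection logic is not described here.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def period_key(day, level):
    """Map a date to the label of the date unit it falls into."""
    if level == "Daily":
        return day.isoformat()
    if level == "Weekly":
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if level == "Monthly":
        return f"{day.year}-{day.month:02d}"
    if level == "Quarterly":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if level == "Yearly":
        return str(day.year)
    raise ValueError(f"Unsupported level: {level}")


def aggregate(rows, level, method):
    """Group (date, value) rows by date unit and aggregate each group."""
    groups = defaultdict(list)
    for day, value in rows:
        groups[period_key(day, level)].append(value)
    return {key: method(values) for key, values in groups.items()}


rows = [(date(2023, 1, 5), 10), (date(2023, 1, 20), 30), (date(2023, 4, 2), 5)]
print(aggregate(rows, "Monthly", mean))
print(aggregate(rows, "Quarterly", sum))
```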

Keywords or phrases analysis

Below are all the available settings for a Keywords or phrases analysis.

  1. Number of top N-grams
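
To illustrate what a "top N-grams" count refers to, the sketch below counts word N-grams across a few texts and returns the most frequent ones. Whitespace tokenization, lowercasing, and the function name are simplifying assumptions rather than the product's actual text pipeline.

```python
from collections import Counter


def top_ngrams(texts, n, top_n):
    """Return the top_n most frequent word n-grams across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n)


texts = ["the quick fox", "the quick dog", "a quick fox"]
print(top_ngrams(texts, n=2, top_n=2))
```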

Word embeddings analysis

Below are all the available settings for a Word embeddings analysis.

  1. Embedding algorithm
    • Select either Word2Vec or FastText (an extension of Word2Vec) to learn word representations. Word2Vec treats each word as the smallest unit during training, whereas FastText treats each word as a set of character N-grams, so a word vector is the sum of the vectors of its character N-grams.
      Options
      • Word2Vec
      • FastText
  2. Dimensionality reduction technique
    • Dimensionality reduction techniques are applied to the word embeddings for visualization.
      Options
      • PCA (Principal Component Analysis)
      • UMAP (Uniform Manifold Approximation and Projection)
  3. Minimum frequency of words
  4. Number of iterations over corpus
  5. Number of workers
  6. Dimension of embeddings
  7. Fix unicode
  8. Remove URL
  9. Remove email
  10. Remove phone numbers
  11. Remove numerals
  12. Remove currency
  13. Remove punctuation
  14. Remove accents
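
The difference between the two embedding algorithms comes down to the character N-grams that FastText extracts from each word. The sketch below lists those N-grams for a single word, using the `<` and `>` boundary markers that FastText adds around a word. The function name and default N-gram lengths are illustrative assumptions, and no embedding is trained here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("where", min_n=3, max_n=3))
```

Because related words share many of these N-grams, FastText can produce vectors for rare or unseen words, whereas Word2Vec only has vectors for whole words seen during training.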

Sentiment analysis

info

There are no customizable settings for a Sentiment analysis.

Topic modeling analysis

Below are all the available settings for a Topic modeling analysis.

  1. Number of topics
  2. Alpha: The higher the Alpha value, the more evenly documents are distributed across the topics. A lower value indicates that only a few topics dominate the dataset.
  3. Eta: The lower the Eta, the fewer words the topics will contain.
  4. Tokens occurred in at least N documents: The selected value will keep words that are contained in at least N documents.
  5. Tokens occurred in more than N percentage of documents: The selected value (N) will keep words that are contained in no more than the fraction N of documents (a fraction of the total corpus size, not an absolute number).
  6. First N most frequent tokens: The selected value (N) will keep only the first N most frequent tokens.
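
The three token filters above can be sketched together as a single vocabulary-pruning step. The parameter names `no_below`, `no_above`, and `keep_n` are borrowed from gensim's `Dictionary.filter_extremes`, which applies similar filters; whether H2O AutoInsights uses that function is an assumption, and the sketch below is a plain-Python approximation rather than the actual implementation.

```python
from collections import Counter


def filter_tokens(docs, no_below, no_above, keep_n):
    """Keep tokens by document frequency, then keep the keep_n most frequent."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)  # no_above is a fraction of the corpus size
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]


docs = [
    ["data", "model", "the"],
    ["data", "the", "plot"],
    ["the", "model", "data"],
    ["the", "rare"],
]
print(filter_tokens(docs, no_below=2, no_above=0.75, keep_n=10))
```

In this example, a token that appears in every document is dropped as too common, and tokens that appear in a single document are dropped as too rare.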

Geographic analysis

Below are all the available settings for a Geographic analysis.

  1. Number of insights: Insights are sorted locally by the coefficient of variation.
  2. Aggregation method: Data is aggregated at geo levels; the selected method aggregates the numeric columns.
    Options
    • Average
    • Sum
    • Minimum
    • Maximum
info
  • For a Geographic analysis to be activated, your dataset needs to have two columns named latitude and longitude. If your dataset specifies the latitude and longitude columns with different names, you can specify the appropriate columns on the Additional column settings card.
  • The following geographic dimensions are supported:
    • City (only for New York City)
    • U.S. States (at the state level)
    • Countries (at the country level)
