Analysis settings
As part of the Analysis flow, H2O AutoInsights selects certain analysis types before starting an overall H2O AutoInsights analysis. Each supported analysis type has its own default settings, and users can customize those defaults for every selected analysis type.
For the most part, H2O AutoInsights uses the default settings of a given analysis type and, when necessary, customizes them to extract as many insights as possible. This customization is driven by the column data types of the selected dataset. Users can still customize all settings of a particular analysis type right before starting an overall H2O AutoInsights analysis.
An H2O AutoInsights analysis is composed of supported analysis types.
Access analysis settings
To access and customize the settings of selected analysis types:
- Before starting an analysis (before you click Analyze), click Customize....
- In the Insight Types menu, click the analysis type you want to customize.
To learn about the settings of each analysis type, see the sections below.
Save/view insights
Save
In the H2O AutoInsights dashboard, you can view all generated insights of an overall H2O AutoInsights analysis. To save a particular insight from the dashboard:
- In an insight card, click More Options.
- Click Star.
Note: The H2O AutoInsights dashboard appears right after an overall H2O AutoInsights analysis completes.
View
All starred insights are saved and can always be viewed on the Starred page. To access the Starred page:
- Click Menu.
- In the H2O AutoInsights menu, click Starred.
View completed analyses
To view completed analyses in H2O AutoInsights:
- Click Menu.
- In the H2O AutoInsights menu, click Analysis.
On the Analysis page, all completed analyses are listed by name in the Completed Analyses table.
Analysis types settings
Frequency analysis
Below are all the available settings for a Frequency analysis.
- Number of insights
- The number of insights H2O AutoInsights will generate around a frequency analysis. H2O AutoInsights will sort these insights locally by variance.
Top and bottom analysis
Below are all the available settings for a Top and bottom analysis.
- Number of insights
- The number of insights H2O AutoInsights will generate around a top and bottom analysis. H2O AutoInsights will sort these insights locally by the coefficient of variation of the category counts.
- Maximum cardinality
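To make the sorting criterion concrete, the sketch below computes the coefficient of variation (standard deviation divided by mean) of a column's category counts. The function name `count_cv` and the use of the population standard deviation are assumptions made for illustration; H2O AutoInsights may compute the statistic differently.

```python
from collections import Counter
from statistics import mean, pstdev


def count_cv(values):
    """Coefficient of variation of the category counts of one column."""
    counts = list(Counter(values).values())
    # Population standard deviation is an assumption of this sketch.
    return pstdev(counts) / mean(counts)


balanced = ["a", "b", "c", "a", "b", "c"]  # every category appears twice
skewed = ["a", "a", "a", "a", "a", "b"]    # one category dominates

print(count_cv(balanced))  # 0.0
print(count_cv(skewed))    # larger value: counts are far from uniform
```

A column whose categories are evenly represented scores 0, while a column dominated by one category scores higher.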
Measure by measure analysis
Below are all the available settings for a Measure by measure analysis.
- Number of insights
- The number of insights H2O AutoInsights will generate around a measure by measure analysis.
Note: A pair of variables (X, Y) can contain multiple sub-insights, depending on the additional insights selected. The maximum number of insights set here does not count sub-insights; it applies only to the unique variable pairs used on the X and Y axes.
- Minimum cardinality for the measures
- Fit regression line
- Additional insight choices. Options:
- Default
- Add dimension: Adding a dimension lets the points in the scatter plot be colored by the categories of a categorical column.
- Add measure: Adding a measure lets the points in the scatter plot be sized by another numeric variable.
- Add dimension and size by additional measure
- Remove outliers: Removing outliers helps visualize the relationship better.
- Contamination % of outliers: Specify the possible percentage of anomalies (outliers) in the dataset.
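The effect of the contamination percentage can be sketched as a cap on how many points are treated as outliers and removed from the scatter plot. The scoring rule below (distance from the per-axis medians) and the function name `remove_outliers` are stand-ins chosen for this example; the detector H2O AutoInsights actually uses is not described here.

```python
from statistics import median


def remove_outliers(points, contamination_pct):
    """Drop the most extreme points, at most contamination_pct percent of them.

    The score (distance from the per-axis medians) is a stand-in for this
    sketch, not the detector used by H2O AutoInsights.
    """
    n_outliers = int(len(points) * contamination_pct / 100)
    if n_outliers == 0:
        return list(points)
    mx = median(x for x, _ in points)
    my = median(y for _, y in points)
    # Rank point indices from most to least extreme.
    order = sorted(
        range(len(points)),
        key=lambda i: abs(points[i][0] - mx) + abs(points[i][1] - my),
        reverse=True,
    )
    flagged = set(order[:n_outliers])
    return [p for i, p in enumerate(points) if i not in flagged]


points = [(i, i) for i in range(1, 10)] + [(100, -50)]
cleaned = remove_outliers(points, contamination_pct=10)
print(len(points), len(cleaned))  # 10 9
```

With ten points and a contamination of 10%, exactly one point (the far-away one) is removed, which tightens the axes around the remaining relationship.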
Correlation analysis
Below are all the available settings for a Correlation analysis.
- Include plots. Options:
- Heatmap
- Network graph
- Maximum number of columns
- Select the number of columns to include in the correlation plot. When the original number of columns in the data is higher than the set value, a sub-sampling technique is applied to select the columns that best capture the data's maximum variance.
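As a rough illustration of limiting the correlation plot to a maximum number of columns, the sketch below keeps the columns with the highest variance. The exact sub-sampling technique used by H2O AutoInsights is not specified on this page, so ranking by raw variance, along with the name `select_columns`, is an assumption.

```python
from statistics import pvariance


def select_columns(table, max_columns):
    """Keep at most max_columns numeric columns, preferring high variance.

    table maps column names to lists of numbers. Ranking by raw variance is
    an assumption of this sketch.
    """
    if len(table) <= max_columns:
        return list(table)
    ranked = sorted(table, key=lambda name: pvariance(table[name]), reverse=True)
    keep = set(ranked[:max_columns])
    # Preserve the original column order for the plot.
    return [name for name in table if name in keep]


table = {
    "a": [1, 2, 3, 4],
    "b": [10, 10, 10, 10],   # constant column, zero variance
    "c": [0, 100, 0, 100],
    "d": [5, 6, 5, 6],
}
print(select_columns(table, max_columns=2))  # ['a', 'c']
```

Raw variance depends on each column's scale, so a real implementation would likely standardize the columns first.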
Dimension by dimension analysis
Below are all the available settings for a Dimension by dimension analysis.
- Number of insights
- Insights are sorted locally by matrix cardinality.
- Maximum cardinality
- Cardinality is the number of possible values a feature can assume. If the cardinality of a variable is higher than the maximum set, the heatmaps will be replaced with a barplot.
- Color heat map cells by. Options:
- Aggregation of a numeric variable
- Count across categories
- Aggregation method
- The metric selected here is used to aggregate numeric data; this is represented by the intensity of a cell in the heatmap. Options:
- Average
- Sum
- Minimum
- Maximum
- Additional insights using calendar heatmaps
- Calendar heatmaps visualize data across time components like the day of the week, month, hour, etc.
- Plot type. Options:
- Heatmap
- Stacked bars
- 100% Stacked bars
- Grouped bars
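The settings above can be sketched as follows: cells are colored either by a count across categories or by an aggregated numeric variable, and the plot falls back to a bar plot when a dimension exceeds the maximum cardinality. All names (`heatmap_cells`, `plot_kind`, the sample columns) are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

AGGREGATIONS = {
    "Average": lambda xs: sum(xs) / len(xs),
    "Sum": sum,
    "Minimum": min,
    "Maximum": max,
}


def heatmap_cells(rows, dim_x, dim_y, measure=None, method="Average"):
    """Return {(x_category, y_category): cell value} for a heatmap.

    With measure=None the cells hold counts across categories; otherwise the
    measure is aggregated with the chosen method.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row[dim_x], row[dim_y])].append(row[measure] if measure else 1)
    aggregate = AGGREGATIONS[method] if measure else sum
    return {cell: aggregate(values) for cell, values in groups.items()}


def plot_kind(rows, dim_x, dim_y, max_cardinality):
    """Fall back to a bar plot when either dimension has too many categories."""
    for dim in (dim_x, dim_y):
        if len({row[dim] for row in rows}) > max_cardinality:
            return "barplot"
    return "heatmap"


rows = [
    {"region": "East", "tier": "Gold", "sales": 10.0},
    {"region": "East", "tier": "Gold", "sales": 30.0},
    {"region": "West", "tier": "Silver", "sales": 5.0},
]
print(heatmap_cells(rows, "region", "tier"))                      # counts
print(heatmap_cells(rows, "region", "tier", "sales", "Average"))  # averages
print(plot_kind(rows, "region", "tier", max_cardinality=1))       # barplot
```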
Measure by dimension analysis
Below are all the available settings for a Measure by dimension analysis.
- Number of insights
- Maximum cardinality
- Aggregation method
- The metric selected for this method aggregates the numeric data within each category and is visualized only in the bar plot. Options:
- Average
- Sum
- Minimum
- Maximum
Clustering analysis
Below are all the available settings for a Clustering analysis.
- Auto cluster
- Maximum number of clusters
- Explanation fidelity
- Explanation fidelity controls the faithfulness and complexity of the explanation. Ideally, one would like 100% faithfulness, but sometimes this affects the readability of the explanation in the current setting.
Anomaly detection multivariable analysis
Below are all the available settings for an Anomaly detection multivariable analysis.
- Imputation method for missing values. Options:
- Default: The default method uses mean to replace missing values for numerical and transformed categorical columns.
- Model-based imputation: Model-based imputation uses iterative predictive models to replace missing values.
- Estimated percentage of outliers (contamination)
- The estimated percentage influences the maximum number of points classified as anomalies (outliers).
- Explanation fidelity
- Explanation fidelity controls the faithfulness and complexity of the explanation. Ideally, one would like 100% faithfulness, but sometimes this affects the readability of the explanation in the current setting.
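The Default imputation method described above (replacing missing values with the mean) can be sketched in a few lines. Representing missing values as `None` and the function name `impute_mean` are assumptions of this example; model-based imputation is not shown.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)  # assumes at least one observed value
    return [fill if v is None else v for v in column]


print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```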
Time series analysis
Below are all the available settings for a Time series analysis.
- Number of insights
- Maximum cardinality
- Cardinality is the number of possible values a feature can assume. The Time Series Analysis skips a column if the cardinality is higher than the selected value.
- Include categoricals
- Categoricals that are not marked as time series identifiers will be included in the Time Series Analysis.
- Include counts as measure
- H2O AutoInsights will treat counts of the categories of a categorical variable at every date unit as a measure.
- Aggregation method
- The selected method will aggregate the numeric columns at a date unit level. Options:
- Average
- Sum
- Minimum
- Maximum
- Date aggregation level
- H2O AutoInsights will use the selected date unit level to aggregate the data. Options:
- Auto
- Daily
- Weekly
- Monthly
- Quarterly
- Yearly
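How the aggregation method and the date aggregation level interact can be sketched as follows: each date is mapped to the start of its unit, and the values that fall in the same unit are combined. The function names and the Monday week start are assumptions of this example, and the Auto level is not modeled.

```python
from collections import defaultdict
from datetime import date


def date_unit(day, level):
    """Map a date to the first day of its Daily/Weekly/Monthly/Quarterly/Yearly unit."""
    if level == "Daily":
        return day
    if level == "Weekly":
        # Weeks starting on Monday is an assumption of this sketch.
        return date.fromordinal(day.toordinal() - day.weekday())
    if level == "Monthly":
        return day.replace(day=1)
    if level == "Quarterly":
        return date(day.year, 3 * ((day.month - 1) // 3) + 1, 1)
    if level == "Yearly":
        return date(day.year, 1, 1)
    raise ValueError(f"unsupported level: {level}")


def aggregate_by_unit(series, level, method=sum):
    """Aggregate (date, value) pairs per date unit with the given method."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[date_unit(day, level)].append(value)
    return {unit: method(values) for unit, values in sorted(buckets.items())}


series = [
    (date(2024, 1, 15), 10),
    (date(2024, 1, 31), 5),
    (date(2024, 2, 1), 7),
    (date(2024, 5, 20), 1),
]
monthly = aggregate_by_unit(series, "Monthly")         # Sum per month
quarterly = aggregate_by_unit(series, "Quarterly", max)  # Maximum per quarter
print(monthly)
print(quarterly)
```

Here the two January values are summed into one monthly point, while the quarterly view with Maximum keeps only the largest value of each quarter.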
Keywords or phrases analysis
Below are all the available settings for a Keywords or phrases analysis.
- Number of top N-grams
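For readers unfamiliar with N-grams, the sketch below counts word N-grams (sequences of N consecutive words) and returns the most frequent ones. Lowercasing, whitespace tokenization, and the name `top_ngrams` are simplifying assumptions; the tokenization used by H2O AutoInsights may differ.

```python
from collections import Counter


def top_ngrams(texts, n, top_k):
    """Return the top_k most frequent word n-grams across texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common(top_k)


texts = ["great battery life", "battery life is short", "great screen"]
print(top_ngrams(texts, n=2, top_k=1))  # [('battery life', 2)]
```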
Word embeddings analysis
Below are all the available settings for a Word embeddings analysis.
- Embedding algorithm
- Either Word2Vec or FastText (an extension of Word2Vec) can be selected to learn word representations. While Word2Vec treats words as the smallest entity during training, FastText treats each word as composed of character N-grams, so a word's vector is the sum of the vectors of its character N-grams. Options:
- Word2Vec
- FastText
- Dimensionality reduction technique
- Dimensionality reduction techniques are applied to the word embeddings for visualization. Options:
- PCA (Principal Component Analysis)
- UMAP (Uniform Manifold Approximation and Projection)
- Minimum frequency of words
- Number of iterations over corpus
- Number of workers
- Dimension of embeddings
- Fix unicode
- Remove URL
- Remove email
- Remove phone numbers
- Remove numerals
- Remove currency
- Remove punctuation
- Remove accents
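The difference between the two embedding algorithms can be illustrated with a toy version of the FastText idea: a word is broken into character N-grams, and its vector is the sum of the N-gram vectors. The vectors below are made-up numbers and the function names are hypothetical. Real FastText implementations typically combine several N-gram lengths and the whole word, whereas this sketch uses a single length.

```python
def char_ngrams(word, n):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def word_vector(word, n, ngram_vectors):
    """Sum the vectors of the word's character n-grams (unknown n-grams are skipped)."""
    dims = len(next(iter(ngram_vectors.values())))
    total = [0.0] * dims
    for gram in char_ngrams(word, n):
        for i, x in enumerate(ngram_vectors.get(gram, [0.0] * dims)):
            total[i] += x
    return total


print(char_ngrams("where", 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

toy_vectors = {"<hi": [1.0, 0.0], "hi>": [0.5, 2.0]}  # made-up values
print(word_vector("hi", 3, toy_vectors))  # [1.5, 2.0]
```

Because vectors are built from sub-word pieces, a word never seen during training can still receive a vector from the N-grams it shares with known words, which is not possible when whole words are the smallest unit.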
Sentiment analysis
There are no customizable settings for a Sentiment analysis.
Topic modeling analysis
Below are all the available settings for a Topic modeling analysis.
- Number of topics
- Alpha: The higher the alpha value, the more equal the number of documents will be across the topics. Setting a lower value will indicate very few topics dominate the dataset.
- Eta: The lower the Eta, the fewer words the topics will contain.
- Tokens occurred in at least N documents: The selected value will keep words that are contained in at least N documents.
- Tokens occurred in more than N percentage of documents: The selected value (N) will keep words that are contained in no more than a fraction N of the documents (N is a fraction of the total corpus size, not an absolute number).
- First N most frequent tokens: The selected value (N) will keep only the first N most frequent tokens.
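The three token filters above can be sketched as a single vocabulary-pruning step. The parameter names `no_below`, `no_above`, and `keep_n` are borrowed from gensim's `Dictionary.filter_extremes` for familiarity; nothing on this page confirms which library H2O AutoInsights uses, and ranking by document frequency is an assumption of this sketch.

```python
from collections import Counter


def filter_tokens(documents, no_below, no_above, keep_n):
    """Return the vocabulary kept after the three frequency filters.

    documents: list of token lists.
    no_below:  keep tokens found in at least this many documents.
    no_above:  keep tokens found in at most this fraction of documents.
    keep_n:    then keep only the keep_n most frequent tokens.
    """
    doc_freq = Counter(token for doc in documents for token in set(doc))
    limit = no_above * len(documents)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= limit]
    # Rank by document frequency (an assumption), ties broken alphabetically.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]


documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
print(filter_tokens(documents, no_below=2, no_above=0.75, keep_n=10))  # ['cat', 'sat']
```

In this toy corpus, "the" is dropped because it appears in every document, and the rare words are dropped because they appear in only one.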
Geographic analysis
Below are all the available settings for a Geographic analysis.
- Number of insights: Insights are sorted locally by a statistical metric, the coefficient of variation.
- Aggregation method: Data is aggregated at geo levels; the selected method aggregates the numeric columns. Options:
- Average
- Sum
- Minimum
- Maximum
- For a Geographic analysis to be activated, your dataset needs to have two columns named latitude and longitude. If your dataset specifies the latitude and longitude columns with different names, you can specify the appropriate columns on the Additional column settings card.
- The following geographic dimensions are supported:
- City (only for New York City)
- U.S. States (at the state level)
- Countries (at the country level)