Version: Next

Analyze a dataset

Overview

H2O AutoInsights provides several analysis types, each of which generates a distinct kind of insight for a dataset.

caution

H2O AutoInsights supports tabular data. Image, video, and audio data are not currently supported.

Instructions

To analyze a dataset, follow these steps:

  1. In H2O AutoInsights, click Home.
  2. Click Datasets.
  3. In the Datasets table, click the name of the dataset you want to analyze. Imported datasets appear in this table.
    note

    To learn how to import a dataset to H2O AutoInsights, see Import a dataset.

  4. Click Analyze.
  5. In the Enter a name for your analysis box, enter a name for the analysis.
  6. Click Save.
  7. (Optional) Transform column(s).
    note
    • H2O AutoInsights lets you control the treatment of numerical (measure) and categorical (dimension) type columns. If you skip this step, the H2O AutoInsights auto transformation engine handles the numeric-to-categorical conversion.
    • To learn more about column transformations, see Data (column) transformations.
  8. Click Skip/next.
  9. Select column(s) to analyze.
    note

    After you specify whether to transform specific data columns, H2O AutoInsights lets you select the columns to analyze. H2O AutoInsights divides the dataset columns into three categories: measure, dimension, and temporal.

    caution

    At least one column needs to be selected.

    • (Optional) In the Measures tab, select the checkbox of a column to analyze.
    • (Optional) In the Dimensions tab, select the checkbox of a column to analyze.
    • (Optional) In the Temporal tab, select the checkbox of a column to analyze.
  10. (Optional) Click the + Additional options tab.
    note
    • In the + Additional options tab, you can:
      • Specify a reference target column to generate insights highlighting interactions between the selected reference column and other columns
      • Customize columns for the following two analysis types: time series analysis and geographic analysis
    • To learn more about the + Additional options tab, see + Additional options
  11. Click Next.
  12. (Optional) Click Customize... to adjust the analysis settings (see Analyses settings).
    note
    • After you select the column(s) to analyze, H2O AutoInsights lets you customize the settings of most of the analysis types selected for the overall analysis of the dataset
  13. Click Analyze.

Data (column) transformations

+ Additional options

Overview

When preparing the settings of a dataset analysis, H2O AutoInsights provides additional column settings, referred to as + Additional options, that enable you to:

  • Specify a reference target column to generate insights that highlight interactions between the selected reference column and other columns
  • Customize columns for the following analysis types: time series analysis and geographic analysis
note

To learn about each additional option, see Options.

Additional column settings

Options

Select a reference (Target) column

Defines a reference target column to use for an overall H2O AutoInsights analysis. Selecting a reference target column generates insights highlighting interactions between the reference target column and other columns.

Select time series groups

Defines additional categories to use for a time series analysis. Selecting additional categories auto aggregates data to uncover hidden patterns across different groups.

Select a column for Latitude

Defines the latitude column to use for a geographic analysis. Selecting the latitude column allows the analysis to generate geographic insights that are rendered in interactive and visual maps.

Select a column for Longitude

Defines the longitude column to use for a geographic analysis. Selecting the longitude column allows the analysis to generate geographic insights that are rendered in interactive and visual maps.

Analyses settings

Overview

H2O AutoInsights selects certain analysis types for the overall analysis of a dataset. For the most part, it uses the default settings of each analysis type and adjusts them only when necessary, based on the column data types of the selected dataset, to extract as many insights as possible. Right before starting a dataset analysis, you can customize the default setting values of each chosen analysis type (except for sentiment analysis, which has no customizable settings).

Frequency analysis

Below are all the available settings for a frequency analysis.

  1. Number of insights: The number of insights H2O AutoInsights generates from a frequency analysis. H2O AutoInsights sorts these insights locally by variance.

Top and bottom analysis

Below are all the available settings for a top and bottom analysis.

  1. Number of insights: The number of insights H2O AutoInsights generates from a top and bottom analysis. H2O AutoInsights sorts these insights locally by the coefficient of variation of the category counts.
  2. Maximum cardinality
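
As a rough illustration of the sorting metric mentioned above, the coefficient of variation of category counts is the standard deviation of the counts divided by their mean. The sketch below is an assumption about how such a score could be computed, not the product's implementation; the function name `count_cv` and the use of the population standard deviation are illustrative choices.

```python
from collections import Counter
from statistics import mean, pstdev


def count_cv(values):
    """Coefficient of variation (std / mean) of the category counts.

    Assumption: population standard deviation; the product may use
    a different estimator.
    """
    counts = list(Counter(values).values())
    return pstdev(counts) / mean(counts)


balanced = ["a", "b", "a", "b"]   # counts: a=2, b=2
skewed = ["a", "a", "a", "b"]     # counts: a=3, b=1

print(count_cv(balanced))
print(count_cv(skewed))
```

A column whose categories occur equally often scores 0, while uneven category counts give a larger score.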

Measure by measure analysis

Below are all the available settings for a measure by measure analysis.

  1. Number of insights: The number of insights H2O AutoInsights generates from a measure by measure analysis.
    info

    A pair of variables (X, Y) can contain multiple sub-insights based on the additional insights selection. The maximum number of insights set here does not account for sub-insights and only relates to the unique variable pairs used on the X and Y axes.

  2. Minimum cardinality for the measures
  3. Fit regression line
  4. Additional insight choices
    Options
    • Default
    • Add dimension: Adding a dimension lets the points in the scatter plot be colored by the categories of a categorical column.
    • Add measure: Adding a measure lets the points in the scatter plot be sized by another numeric variable.
    • Add dimension and size by additional measure
  5. Remove outliers: Removing outliers helps visualize the relationship better.
  6. Contamination % of outliers: Specify the possible percentage of anomalies (outliers) in the dataset.
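
To make the contamination setting more concrete, the sketch below flags the given percentage of points that lie farthest from the median. This is a deliberately simple stand-in and not the outlier detector used by H2O AutoInsights, which this page does not describe; `flag_outliers` and `contamination_pct` are hypothetical names.

```python
from statistics import median


def flag_outliers(values, contamination_pct):
    """Return the sorted indices of the points flagged as outliers.

    Assumption: a point's outlier score is its absolute distance from
    the median, and contamination_pct percent of the points are flagged.
    """
    n_flag = round(len(values) * contamination_pct / 100)
    center = median(values)
    ranked = sorted(
        range(len(values)),
        key=lambda i: abs(values[i] - center),
        reverse=True,
    )
    return sorted(ranked[:n_flag])


data = [10, 11, 9, 10, 12, 95, 10, 11, 9, -40]
print(flag_outliers(data, 20))
```

Raising the percentage flags more points, so the value caps how many points can be treated as outliers.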

Correlation analysis

Below are all the available settings for a correlation analysis.

  1. Include plots
    Options
    • Heatmap
    • Network graph
  2. Maximum number of columns: Select the number of columns to include in the correlation plot. When the original number of columns in the data is higher than the set value, a sub-sampling technique is applied to select the columns that best capture the data's maximum variance.
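
The page does not specify the sub-sampling technique behind the Maximum number of columns setting. As one simple reading of "columns that best capture the data's maximum variance", the sketch below keeps the columns with the highest variance. The function name and the choice of raw population variance are assumptions for illustration only.

```python
from statistics import pvariance


def top_variance_columns(table, max_columns):
    """Keep the max_columns column names with the highest variance.

    Assumption: raw (unscaled) population variance is the ranking
    criterion; a real implementation may scale columns first.
    """
    variances = {name: pvariance(col) for name, col in table.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:max_columns]


table = {
    "age": [30, 31, 29, 30],
    "income": [40, 90, 10, 60],
    "score": [1, 5, 3, 7],
}
print(top_variance_columns(table, 2))
```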

Dimension by dimension analysis

Below are all the available settings for a dimension by dimension analysis.

  1. Number of insights: Insights are sorted locally by matrix cardinality.
  2. Maximum cardinality: Cardinality is the number of distinct values a feature can assume. If the cardinality of a variable is higher than the maximum set here, the heatmap is replaced with a bar plot.
  3. Color heat map cells by
    Options
    • Aggregation of a numeric variable
    • Count across categories
  4. Aggregation method: The metric selected here is used to aggregate numeric data; this is represented by the intensity of a cell in the heatmap.
    Options
    • Average
    • Sum
    • Minimum
    • Maximum
  5. Additional insights using calendar heatmaps: Calendar heatmaps visualize data across time components like the day of the week, month, hour, etc.
  6. Plot type
    Options
    • Heatmap
    • Stacked bars
    • 100% stacked bars
    • Grouped bars
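
The sketch below illustrates two of the settings above under stated assumptions: the maximum cardinality check that decides between a heatmap and a bar plot, and the "Count across categories" cell values. The function names are hypothetical and the code is not taken from H2O AutoInsights.

```python
from collections import Counter


def choose_plot(col_a, col_b, max_cardinality):
    """Fall back to a bar plot when either column has too many distinct values."""
    cardinalities = (len(set(col_a)), len(set(col_b)))
    return "barplot" if max(cardinalities) > max_cardinality else "heatmap"


def count_matrix(col_a, col_b):
    """Count how often each pair of categories occurs together."""
    return Counter(zip(col_a, col_b))


color = ["red", "blue", "red", "red"]
size = ["S", "S", "M", "S"]

cells = count_matrix(color, size)
print(choose_plot(color, size, max_cardinality=5))
print(cells[("red", "S")])
```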

Measure by dimension analysis

Below are all the available settings for a measure by dimension analysis.

  1. Number of insights
  2. Maximum cardinality
  3. Aggregation method: The metric selected for this method aggregates the numeric data within each category and is visualized only in the bar plot.
    Options
    • Average
    • Sum
    • Minimum
    • Maximum
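
As an illustration of the aggregation methods listed above, the following sketch groups a numeric column by a categorical column and applies the selected method to each group. The mapping of option names to functions is an assumption, and `aggregate` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Assumed mapping from the option labels to aggregation functions.
AGGREGATIONS = {"Average": mean, "Sum": sum, "Minimum": min, "Maximum": max}


def aggregate(categories, values, method):
    """Aggregate the numeric values within each category."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    fn = AGGREGATIONS[method]
    return {category: fn(vals) for category, vals in groups.items()}


region = ["east", "west", "east", "west"]
sales = [10, 20, 30, 60]

print(aggregate(region, sales, "Average"))
print(aggregate(region, sales, "Maximum"))
```

Each resulting value would correspond to the height of one bar in the bar plot.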

Clustering analysis

Below are all the available settings for a clustering analysis.

  1. Auto cluster
  2. Maximum number of clusters
  3. Explanation fidelity: Controls the trade-off between the faithfulness and the complexity of the explanation. Ideally, an explanation is 100% faithful, but higher faithfulness can reduce the readability of the explanation.

Anomaly detection multivariable analysis

Below are all the available settings for an anomaly detection multivariable analysis.

  1. Imputation method for missing values
    Options
    • Default: The default method uses the mean to replace missing values in numerical columns and in transformed categorical columns.
    • Model-based imputation: Model-based imputation uses iterative predictive models to replace missing values.
  2. Estimated percentage of outliers (contamination): The estimated percentage influences the maximum number of points classified as anomalies (outliers).
  3. Explanation fidelity: Controls the trade-off between the faithfulness and the complexity of the explanation. Ideally, an explanation is 100% faithful, but higher faithfulness can reduce the readability of the explanation.
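
The default imputation method described above can be sketched for a single numerical column as follows. Representing missing values as `None` and the name `impute_mean` are assumptions for illustration; model-based imputation is not shown.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


print(impute_mean([1.0, None, 3.0]))
```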

Time series analysis

Below are all the available settings for a time series analysis.

  1. Number of insights
  2. Maximum cardinality: Cardinality is the number of distinct values a feature can assume. The time series analysis skips a column if its cardinality is higher than the selected value.
  3. Include categoricals: Categoricals that are not marked as time series identifiers are included in the time series analysis.
  4. Include counts as measure: H2O AutoInsights treats counts of the categories of a categorical variable at every date unit as a measure.
  5. Aggregation method: The selected method aggregates the numeric columns at a date unit level.
    Options
    • Average
    • Sum
    • Minimum
    • Maximum
  6. Date aggregation level: H2O AutoInsights uses the selected date unit level to aggregate the data.
    Options
    • Auto
    • Daily
    • Weekly
    • Monthly
    • Quarterly
    • Yearly
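
To show how the aggregation method and the date aggregation level interact, the sketch below aggregates a numeric column at a monthly level. It is an illustrative assumption about the behavior described above, not the product's code; `aggregate_monthly` is a hypothetical name and only the Monthly level is covered.

```python
from collections import defaultdict
from datetime import date


def aggregate_monthly(rows, method=sum):
    """Group (date, value) rows by (year, month) and aggregate each group."""
    buckets = defaultdict(list)
    for day, value in rows:
        buckets[(day.year, day.month)].append(value)
    return {key: method(vals) for key, vals in sorted(buckets.items())}


rows = [
    (date(2023, 1, 5), 10),
    (date(2023, 1, 20), 5),
    (date(2023, 2, 1), 7),
]

print(aggregate_monthly(rows))              # Sum per month
print(aggregate_monthly(rows, method=max))  # Maximum per month
```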

Keywords or phrases analysis

Below are all the available settings for a keywords or phrases analysis.

  1. Number of top N-grams
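
For readers unfamiliar with N-grams, the sketch below counts word N-grams across a few texts and returns the most frequent ones. Tokenizing on whitespace and lowercasing are simplifying assumptions, and `top_ngrams` is a hypothetical helper rather than part of H2O AutoInsights.

```python
from collections import Counter


def top_ngrams(texts, n, top_k):
    """Return the top_k most frequent word n-grams with their counts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n shifted copies of the token list yields the n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_k)


texts = ["the quick fox", "the quick dog", "a quick fox"]
print(top_ngrams(texts, n=2, top_k=2))
```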

Word embeddings analysis

Below are all the available settings for a word embeddings analysis.

  1. Embedding algorithm: Select either Word2Vec or FastText (an extension of Word2Vec) to learn word representations. Word2Vec treats each word as the smallest unit during training, whereas FastText treats each word as composed of character N-grams, so a word vector is the sum of the vectors of its character N-grams.
    Options
    • Word2Vec
    • FastText
  2. Dimensionality reduction technique: Dimensionality reduction techniques are applied to the word embeddings for visualization.
    Options
    • PCA (Principal Component Analysis)
    • UMAP (Uniform Manifold Approximation and Projection)
  3. Minimum frequency of words
  4. Number of iterations over corpus
  5. Number of workers
  6. Dimension of embeddings
  7. Fix unicode
  8. Remove URL
  9. Remove email
  10. Remove phone numbers
  11. Remove numerals
  12. Remove currency
  13. Remove punctuation
  14. Remove accents
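
The difference between the two embedding algorithms described in the Embedding algorithm setting comes down to how FastText splits words. The sketch below extracts character N-grams in the style of FastText, which wraps each word in `<` and `>` boundary markers. It is simplified: FastText also keeps the whole wrapped word as its own unit, and the N-gram lengths used by H2O AutoInsights are not stated on this page, so the defaults of 3 to 6 are an assumption.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("where", min_n=3, max_n=3))
```

Because related words share many character N-grams, FastText can build vectors for rare or unseen words, which Word2Vec cannot do.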

Sentiment analysis

caution

There are no customizable settings for a sentiment analysis.

Topic modeling analysis

Below are all the available settings for a topic modeling analysis.

  1. Number of topics
  2. Alpha: The higher the alpha value, the more evenly documents are distributed across the topics. A lower value indicates that very few topics dominate the dataset.
  3. Eta: The lower the Eta, the fewer words the topics contain.
  4. Tokens occurred in at least N documents: The selected value keeps words that are contained in at least N documents.
  5. Tokens occurred in more than N percentage of documents: The selected value (N) keeps words that are contained in no more than N percent of documents (a fraction of the total corpus size, not an absolute number).
  6. First N most frequent tokens: The selected value (N) keeps only the first N most frequent tokens.
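
The last three settings above act as successive vocabulary filters. The sketch below shows one way to combine them, assuming document frequency is the count of documents containing a token and that ties are broken alphabetically. The function and parameter names are hypothetical and are not taken from H2O AutoInsights.

```python
from collections import Counter


def filter_tokens(docs, min_docs, max_doc_fraction, keep_n):
    """Filter a vocabulary by document frequency.

    Keeps tokens found in at least min_docs documents and in no more than
    max_doc_fraction of all documents, then keeps the keep_n most frequent.
    """
    doc_freq = Counter(token for doc in docs for token in set(doc))
    limit = max_doc_fraction * len(docs)
    kept = [t for t, df in doc_freq.items() if min_docs <= df <= limit]
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]


docs = [
    ["cat", "sat", "mat"],
    ["cat", "ran"],
    ["cat", "sat"],
    ["dog", "ran"],
]
print(filter_tokens(docs, min_docs=2, max_doc_fraction=0.5, keep_n=10))
```

Here "cat" is dropped for being too common (3 of 4 documents), while "mat" and "dog" are dropped for being too rare.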

Geographic analysis (that is, geospatial analysis)

Below are all the available settings for a geographic analysis.

  1. Number of insights: Insights are sorted locally by a statistical metric, the coefficient of variation.
  2. Aggregation method: Data is aggregated at geographic levels; the selected method aggregates the numeric columns.
    Options
    • Average
    • Sum
    • Minimum
    • Maximum
caution
  • For a geographic analysis to be activated, your dataset needs to have two columns named latitude and longitude. If your dataset specifies the latitude and longitude columns with different names, you can specify the appropriate columns on the + Additional options tab (when preparing the settings for a dataset analysis).
  • The following geographic dimensions are supported:
    • City (only for New York City)
    • U.S. States (at the state level)
    • Countries (at the country level)
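
The column requirement in the caution above can be sketched as a small check: use the default column names latitude and longitude unless other names are supplied (as you would on the + Additional options tab). This is an illustrative assumption; `resolve_geo_columns` is hypothetical, and whether the product matches names case-sensitively is not stated on this page.

```python
def resolve_geo_columns(columns, latitude=None, longitude=None):
    """Return the (latitude, longitude) column names, or None if either is missing.

    Assumption: exact, case-sensitive name matching.
    """
    lat = latitude or "latitude"
    lon = longitude or "longitude"
    if lat not in columns or lon not in columns:
        return None
    return lat, lon


print(resolve_geo_columns(["latitude", "longitude", "price"]))
print(resolve_geo_columns(["lat", "lng", "price"]))
print(resolve_geo_columns(["lat", "lng", "price"], latitude="lat", longitude="lng"))
```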
