DHistogram (h2o-algos 3.46.0 API)

java.lang.Object
- water.Iced<DHistogram>
- - hex.tree.DHistogram

All Implemented Interfaces:

java.io.Externalizable, java.io.Serializable, java.lang.Cloneable, water.Freezable<DHistogram>
```
public final class DHistogram
extends water.Iced<DHistogram>
```
A Histogram, computed in parallel over a Vec.
A DHistogram bins every value added to it, and computes the Vec min and max (for use in the next split), and the response mean and variance for each bin. DHistograms are initialized with a min, max and number of elements to be added (all of which are generally available from a Vec). Bins run from min to max in uniform sizes. If the DHistogram can determine that fewer bins are needed (e.g. boolean columns run from 0 to 1, but only ever take on 2 values, so only 2 bins are needed), then fewer bins are used.
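As a rough illustration of uniform binning, the sketch below maps a value to a bin index over the half-open range [min, maxEx). The class `UniformBinSketch` and its `bin` helper are hypothetical and are not the DHistogram implementation; NA handling and quantile or random split points are left out.

```java
// Hypothetical helper, not part of h2o-algos: uniform binning over [min, maxEx).
final class UniformBinSketch {
    /** Maps x to a bin index in [0, nbins); values outside the range are clamped. */
    static int bin(double x, double min, double maxEx, int nbins) {
        double step = (maxEx - min) / nbins;  // width of a single bin
        int idx = (int) ((x - min) / step);   // truncate toward zero
        if (idx < 0) return 0;                // below min: first bin
        if (idx >= nbins) return nbins - 1;   // at or above maxEx: last bin
        return idx;
    }

    public static void main(String[] args) {
        System.out.println(bin(2.0, 0.0, 10.0, 5));   // prints 1
        System.out.println(bin(9.99, 0.0, 10.0, 5));  // prints 4
    }
}
```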
DHistograms are shared per-node and atomically updated. There's an add call to help cross-node reductions. The data is stored in primitive arrays, so it can be sent over the wire.
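The pattern of lock-free per-bin updates plus an element-wise merge can be sketched as follows. This is only an illustration under stated assumptions: `SharedBinsSketch` is a hypothetical class, and it uses `AtomicLongArray` with a compare-and-set loop to hold double bits, which is a stand-in and not necessarily how DHistogram updates its primitive arrays.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical class, not part of h2o-algos: atomically updated bins with a merge.
final class SharedBinsSketch {
    private final AtomicLongArray bits;  // each slot holds the raw bits of a double

    SharedBinsSketch(int nbins) {
        bits = new AtomicLongArray(nbins);  // all-zero bits represent 0.0
    }

    /** Adds delta to bin i with a compare-and-set retry loop. */
    void addAtomic(int i, double delta) {
        long old, next;
        do {
            old = bits.get(i);
            next = Double.doubleToRawLongBits(Double.longBitsToDouble(old) + delta);
        } while (!bits.compareAndSet(i, old, next));
    }

    double get(int i) {
        return Double.longBitsToDouble(bits.get(i));
    }

    /** Reduction step: folds another histogram's bins into this one. */
    void add(SharedBinsSketch other) {
        for (int i = 0; i < bits.length(); i++) addAtomic(i, other.get(i));
    }

    public static void main(String[] args) throws InterruptedException {
        SharedBinsSketch h = new SharedBinsSketch(1);
        Runnable work = () -> { for (int k = 0; k < 1000; k++) h.addAtomic(0, 1.0); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(h.get(0));  // both threads' updates are retained
    }
}
```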
If we are successively splitting rows (e.g. in a decision tree), then a fresh DHistogram for each split will dynamically re-bin the data. Each successive split will logarithmically divide the data. At the first split, outliers will end up in their own bins - but perhaps some central bins may be very full. At the next split(s) - if they happen at all - the full bins will get split, and again until (with a log number of splits) each bin holds roughly the same amount of data. This 'UniformAdaptive' binning resolves a lot of problems with picking the proper bin count or limits - generally a few more tree levels will equal any fancy but fixed-size binning strategy.
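The effect of re-binning after a split can be seen on a tiny, made-up column. In the sketch below (the class `RebinSketch` and its `counts` helper are hypothetical), the first histogram isolates the outlier but crowds the remaining values into one bin; a fresh histogram over the child range spreads them out again.

```java
import java.util.Arrays;

// Hypothetical helper, not part of h2o-algos: counts values per uniform bin.
final class RebinSketch {
    /** Counts the values that fall in [min, maxEx) using nbins uniform bins. */
    static int[] counts(double[] data, double min, double maxEx, int nbins) {
        int[] c = new int[nbins];
        double step = (maxEx - min) / nbins;
        for (double x : data) {
            if (x < min || x >= maxEx) continue;     // row belongs to another split
            int idx = (int) ((x - min) / step);
            c[Math.min(idx, nbins - 1)]++;           // guard against rounding at the edge
        }
        return c;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 1000};
        // Root histogram over [1, 1001): the outlier is alone, one bin is very full.
        System.out.println(Arrays.toString(counts(data, 1, 1001, 4)));  // [4, 0, 0, 1]
        // Child histogram over [1, 5): the full bin is divided evenly.
        System.out.println(Arrays.toString(counts(data, 1, 5, 4)));     // [1, 1, 1, 1]
    }
}
```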
Support for histogram split points based on quantiles (or random points) is available as well, via _histoType.

See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`DHistogram.NASplitDir` Split direction for missing values.

Field Summary

Fields
Modifier and Type	Field and Description
`boolean`	`_absoluteSplitPts`
`boolean`	`_checkFloatSplits`
`water.Key<hex.tree.DHistogram.HistoSplitPoints>`	`_globalSplitPointsKey`
`SharedTreeModel.SharedTreeParameters.HistogramType`	`_histoType`
`boolean`	`_initNA`
`boolean`	`_intOpt`
`byte`	`_isInt`
`double`	`_maxEx`
`protected double`	`_maxIn`
`double`	`_min`
`protected double`	`_min2`
`int`	`_minInt`
`double`	`_minSplitImprovement`
`java.lang.String`	`_name`
`char`	`_nbin`
`double`	`_pred1`
`double`	`_pred2`
`long`	`_seed`
`double`	`_step`
`protected Divergence`	`_upliftMetric`
`protected boolean`	`_useUplift`
`protected double[]`	`_vals`
`protected int`	`_vals_dim`
`protected int`	`_valsDimUplift`
`protected double[]`	`_valsUplift`
`static int`	`INT_NA`

Method Summary

Modifier and Type	Method and Description
`int`	`actNBins()`
`void`	`add(DHistogram dsh)`
`int`	`bin(double col_data)`
`double`	`binAt(int b)`
`double`	`bins(int b)`
`double`	`denNA()`
`double`	`find_maxEx()`
`static double`	`find_maxEx(double maxIn, int isInt)`
`double`	`find_maxIn()`
`double`	`find_min()`
`double[]`	`getRawVals()`
`boolean`	`hasNABin()`
`void`	`init()`
`void`	`init(double[] vals)`
`void`	`init(double[] vals, double[] valsUplift)`
`static DHistogram[]`	`initialHist(water.fvec.Frame fr, int ncols, int nbins, DHistogram[] hs, long seed, SharedTreeModel.SharedTreeParameters parms, water.Key<hex.tree.DHistogram.HistoSplitPoints>[] globalSplitPointsKey, Constraints cs, boolean checkFloatSplits, GlobalInteractionConstraints ics)` The initial histogram bins are set up from the Vec rollups.
`static DHistogram`	`make(java.lang.String name, int nbins, byte isInt, double min, double maxEx, boolean intOpt, boolean hasNAs, long seed, SharedTreeModel.SharedTreeParameters parms, water.Key<hex.tree.DHistogram.HistoSplitPoints> globalSplitPointsKey, Constraints cs, boolean checkFloatSplits, double[] customSplitPoints)`
`int`	`nbins()`
`double`	`nomNA()`
`int`	`nonEmptyBins()`
`double`	`numControlNA()`
`double`	`numTreatmentNA()`
`void`	`reducePrecision()` Cast bin values (except for sums of weights) to floats to drop least significant bits.
`double`	`respControlNA()`
`double`	`respTreatmentNA()`
`double`	`seP1NA()` Squared Error for NA bucket and prediction value _pred1
`double`	`seP2NA()` Squared Error for NA bucket and prediction value _pred2
`java.lang.String`	`toString()`
`static boolean`	`useIntOpt(water.fvec.Vec v, SharedTreeModel.SharedTreeParameters parms, Constraints cs)` Determines if histogram making can use integer optimization when extracting data.
`double`	`var(int b)` Computes the sample variance within a given bin.
`double`	`w(int i)`
`double`	`wNA()`
`double`	`wY(int i)`
`double`	`wYNA()`
`double`	`wYY(int i)`
`double`	`wYYNA()`

Methods inherited from class water.Iced
asBytes, clone, copyOver, frozenType, read, readExternal, readJSON, reloadFromBytes, toJsonBytes, toJsonString, write, writeExternal, writeJSON

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Detail

INT_NA
```
public static final int INT_NA
```
See Also:

Constant Field Values

_name

```
public final transient java.lang.String _name
```

_minSplitImprovement

```
public final double _minSplitImprovement
```

_isInt
```
public final byte _isInt
```

_intOpt
```
public final boolean _intOpt
```

_nbin
```
public char _nbin
```

_step
```
public double _step
```

_min
```
public final double _min
```

_maxEx
```
public final double _maxEx
```

_minInt
```
public final int _minInt
```

_initNA
```
public final boolean _initNA
```

_pred1
```
public final double _pred1
```

_pred2
```
public final double _pred2
```

_vals
```
protected double[] _vals
```

_vals_dim
```
protected final int _vals_dim
```

_useUplift
```
protected final boolean _useUplift
```

_valsUplift
```
protected double[] _valsUplift
```

_valsDimUplift
```
protected final int _valsDimUplift
```
See Also:

Constant Field Values

_upliftMetric

```
protected final Divergence _upliftMetric
```

_min2
```
protected double _min2
```

_maxIn
```
protected double _maxIn
```

_histoType

```
public SharedTreeModel.SharedTreeParameters.HistogramType _histoType
```

_checkFloatSplits
```
public final boolean _checkFloatSplits
```

_seed
```
public final long _seed
```

_absoluteSplitPts

```
public transient boolean _absoluteSplitPts
```

_globalSplitPointsKey

```
public water.Key<hex.tree.DHistogram.HistoSplitPoints> _globalSplitPointsKey
```

Method Detail

w
```
public double w(int i)
```

wY
```
public double wY(int i)
```

wYY
```
public double wYY(int i)
```

wNA
```
public double wNA()
```

wYNA
```
public double wYNA()
```

wYYNA
```
public double wYYNA()
```

getRawVals
```
public double[] getRawVals()
```

seP1NA
```
public double seP1NA()
```
Squared Error for NA bucket and prediction value _pred1

Returns:

se

seP2NA
```
public double seP2NA()
```
Squared Error for NA bucket and prediction value _pred2

Returns:

se
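For reference, a weighted squared error against a fixed prediction p can be written in terms of three running sums, since Σ w·(y − p)² = Σ w·y² − 2p·Σ w·y + p²·Σ w. The sketch below only illustrates that identity; the class `NaSquaredErrorSketch` is hypothetical, and it is an assumption (suggested by the accessor names `wNA`, `wYNA` and `wYYNA`) that the NA bucket tracks sums of this kind. It does not claim to show how `seP1NA()` or `seP2NA()` are computed internally.

```java
// Hypothetical helper, not part of h2o-algos: squared error from running sums.
final class NaSquaredErrorSketch {
    /**
     * @param w    sum of weights
     * @param wY   sum of weight * response
     * @param wYY  sum of weight * response^2
     * @param pred fixed prediction value
     * @return sum of weight * (response - pred)^2
     */
    static double se(double w, double wY, double wYY, double pred) {
        return wYY - 2.0 * pred * wY + pred * pred * w;
    }

    public static void main(String[] args) {
        // Responses {1, 3} with unit weights: w = 2, wY = 4, wYY = 10.
        System.out.println(se(2.0, 4.0, 10.0, 2.0));  // prints 2.0
    }
}
```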

denNA
```
public double denNA()
```

nomNA
```
public double nomNA()
```

numTreatmentNA
```
public double numTreatmentNA()
```

respTreatmentNA
```
public double respTreatmentNA()
```

numControlNA
```
public double numControlNA()
```

respControlNA
```
public double respControlNA()
```

bin
```
public int bin(double col_data)
```

binAt
```
public double binAt(int b)
```

nbins
```
public int nbins()
```

actNBins
```
public int actNBins()
```

bins
```
public double bins(int b)
```

nonEmptyBins
```
public int nonEmptyBins()
```

hasNABin
```
public boolean hasNABin()
```

init
```
public void init()
```

init
```
public void init(double[] vals)
```

init

```
public void init(double[] vals,
                 double[] valsUplift)
```

add
```
public void add(DHistogram dsh)
```

find_min
```
public double find_min()
```

find_maxIn
```
public double find_maxIn()
```

find_maxEx
```
public double find_maxEx()
```

find_maxEx

```
public static double find_maxEx(double maxIn,
                                int isInt)
```

initialHist

```
public static DHistogram[] initialHist(water.fvec.Frame fr,
                                       int ncols,
                                       int nbins,
                                       DHistogram[] hs,
                                       long seed,
                                       SharedTreeModel.SharedTreeParameters parms,
                                       water.Key<hex.tree.DHistogram.HistoSplitPoints>[] globalSplitPointsKey,
                                       Constraints cs,
                                       boolean checkFloatSplits,
                                       GlobalInteractionConstraints ics)
```

The initial histogram bins are set up from the Vec rollups.

Parameters:

fr - frame with column data
ncols - number of columns
nbins - number of bins
hs - an array of histograms to be initialized
seed - seed to reproduce
parms - parameters of the model
globalSplitPointsKey - array of global split-points keys
cs - monotone constraints (could be null)
checkFloatSplits -

Returns:

array of DHistogram objects

make

```
public static DHistogram make(java.lang.String name,
                              int nbins,
                              byte isInt,
                              double min,
                              double maxEx,
                              boolean intOpt,
                              boolean hasNAs,
                              long seed,
                              SharedTreeModel.SharedTreeParameters parms,
                              water.Key<hex.tree.DHistogram.HistoSplitPoints> globalSplitPointsKey,
                              Constraints cs,
                              boolean checkFloatSplits,
                              double[] customSplitPoints)
```

useIntOpt

```
public static boolean useIntOpt(water.fvec.Vec v,
                                SharedTreeModel.SharedTreeParameters parms,
                                Constraints cs)
```

Determines if histogram making can use integer optimization when extracting data.

Parameters:

v - input Vec
parms - algo params
cs - constraints specification

Returns:

can we use integer representation for extracted data?

toString
```
public java.lang.String toString()
```
Overrides:

toString in class java.lang.Object

var
```
public double var(int b)
```
Computes the sample variance within a given bin.

Parameters:

b - bin id

Returns:

sample variance (>= 0)
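A sample variance can be derived from three per-bin sums without revisiting the rows. The following is a minimal sketch, assuming that `w`, `wY` and `wYY` stand for the sum of weights, the weighted sum of responses and the weighted sum of squared responses; the class `BinVarianceSketch` is hypothetical, and the clamping to zero reflects the documented `>= 0` guarantee rather than a confirmed implementation detail.

```java
// Hypothetical helper, not part of h2o-algos: sample variance from running sums.
final class BinVarianceSketch {
    /** Returns the sample variance, or 0 when there is at most one unit of weight. */
    static double var(double w, double wY, double wYY) {
        if (w <= 1.0) return 0.0;                     // variance undefined for n <= 1
        double v = (wYY - wY * wY / w) / (w - 1.0);   // (sum y^2 - (sum y)^2 / n) / (n - 1)
        return Math.max(0.0, v);                      // rounding can push the result below 0
    }

    public static void main(String[] args) {
        // Responses {1, 2, 3} with unit weights: w = 3, wY = 6, wYY = 14.
        System.out.println(var(3.0, 6.0, 14.0));  // prints 1.0
    }
}
```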

reducePrecision
```
public void reducePrecision()
```
Cast bin values (except for sums of weights) to floats to drop least significant bits. Improves reproducibility (drop bits most affected by floating point error).
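The rounding step itself is a round trip through `float`. The sketch below assumes a flat array in which every group of `dim` entries starts with a sum of weights that must be left untouched; that layout and the class `ReducePrecisionSketch` are hypothetical and chosen only to make the example concrete.

```java
// Hypothetical helper, not part of h2o-algos: drop low-order bits via a float cast.
final class ReducePrecisionSketch {
    /** Rounds every entry to float precision except the first entry of each group. */
    static void reduce(double[] vals, int dim) {
        for (int i = 0; i < vals.length; i++) {
            if (i % dim == 0) continue;       // assumed position of the sum of weights
            vals[i] = (double) (float) vals[i];
        }
    }

    public static void main(String[] args) {
        double[] vals = {0.1, 0.1, 0.1};      // one group: {w, wY, wYY}
        reduce(vals, 3);
        System.out.println(vals[0] == 0.1);   // weight sum unchanged: true
        System.out.println(vals[1] == 0.1);   // rounded to float precision: false
    }
}
```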
