public final class DHistogram extends water.Iced<DHistogram>
A `DHistogram` bins every value added to it, and computes the vec min and max
(for use in the next split), and the response mean and variance for each bin.
`DHistogram`s are initialized with a min, max and number of elements to be
added (all of which are generally available from a Vec). Bins run from min to
max in uniform sizes. If the `DHistogram` can determine that fewer bins are
needed (e.g. boolean columns run from 0 to 1, but only ever take on 2 values,
so only 2 bins are needed), then fewer bins are used.
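The uniform binning described above can be illustrated with a minimal sketch. This is not the H2O implementation: the class `UniformBinSketch`, its constructor arguments and the clamping of out-of-range values are assumptions made for illustration. Only the idea of uniform bins between a min and an exclusive max, and of using fewer bins for integer columns with few distinct values, comes from the description.

```java
/** Hypothetical illustration of uniform binning; not the H2O code. */
final class UniformBinSketch {
    final double min;    // inclusive lower bound of the column
    final double maxEx;  // exclusive upper bound of the column
    final int nbins;     // number of bins actually used
    final double step;   // bins per unit of column value

    UniformBinSketch(double min, double maxEx, int requestedBins, boolean isInt) {
        if (!(maxEx > min)) throw new IllegalArgumentException("maxEx must exceed min");
        this.min = min;
        this.maxEx = maxEx;
        int n = requestedBins;
        if (isInt) {
            // Assumption: an integer column has at most (maxEx - min) distinct
            // values, so more bins than that would always be empty.
            long distinct = (long) (maxEx - min);
            if (distinct >= 1 && distinct < n) n = (int) distinct;
        }
        this.nbins = Math.max(1, n);
        this.step = this.nbins / (maxEx - min);
    }

    /** Bin index for a value; out-of-range values are clamped to the end bins. */
    int bin(double x) {
        int idx = (int) ((x - min) * step);
        return Math.max(0, Math.min(nbins - 1, idx));
    }

    /** Lower edge of bin b. */
    double binAt(int b) {
        return min + b / step;
    }
}
```

With a boolean column (`min = 0`, `maxEx = 2`) and 20 requested bins, this sketch settles on 2 bins, matching the example in the text.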
`DHistogram`s are shared per-node, and atomically updated. There's an `add`
call to help cross-node reductions. The data is stored in primitive arrays, so
it can be sent over the wire.
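A rough sketch of the two ideas in this paragraph, atomic per-node updates and an element-wise `add` for reductions, is given here. It is an illustration under assumptions, not the H2O code: the class `SharedHistogramSketch` and the methods `addToBin`, `get` and `toArray` are hypothetical, and the compare-and-set loop over `AtomicLongArray` is a stand-in for whatever atomic update mechanism the library actually uses.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical illustration of a shared, mergeable histogram; not the H2O code. */
final class SharedHistogramSketch {
    // Each double is kept as its raw long bits so it can be updated with
    // compare-and-set. All-zero bits decode to 0.0, so bins start empty.
    private final AtomicLongArray bits;

    SharedHistogramSketch(int nbins) {
        bits = new AtomicLongArray(nbins);
    }

    /** Atomically adds delta to bin b, retrying if another thread got in first. */
    void addToBin(int b, double delta) {
        long old, upd;
        do {
            old = bits.get(b);
            upd = Double.doubleToRawLongBits(Double.longBitsToDouble(old) + delta);
        } while (!bits.compareAndSet(b, old, upd));
    }

    double get(int b) {
        return Double.longBitsToDouble(bits.get(b));
    }

    /** Reduction step: folds another histogram's bins into this one. */
    void add(SharedHistogramSketch other) {
        if (other.bits.length() != bits.length())
            throw new IllegalArgumentException("bin counts differ");
        for (int b = 0; b < bits.length(); b++) addToBin(b, other.get(b));
    }

    /** Plain primitive array, e.g. for serialization. */
    double[] toArray() {
        double[] out = new double[bits.length()];
        for (int b = 0; b < out.length; b++) out[b] = get(b);
        return out;
    }
}
```

Because the merge is a plain element-wise sum, partial histograms can be combined in any order.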
If we are successively splitting rows (e.g. in a decision tree), then a fresh
`DHistogram` for each split will dynamically re-bin the data. Each successive
split will logarithmically divide the data. At the first split, outliers will
end up in their own bins, but perhaps some central bins may be very full. At
the next split(s), if they happen at all, the full bins will get split, and
again until (with a log number of splits) each bin holds roughly the same
amount of data. This 'UniformAdaptive' binning resolves a lot of problems with
picking the proper bin count or limits: generally a few more tree levels will
equal any fancy but fixed-size binning strategy.
Support for histogram split points based on quantiles (or random points) is
available as well, via `_histoType`.
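The re-binning effect can be seen in a small, self-contained sketch. It is only an illustration of the idea: the class `AdaptiveRebinSketch` and its methods are hypothetical, row counts stand in for the per-bin statistics, and the exclusive upper bound of the child histogram is taken as `max + 1`, which assumes integer-valued data.

```java
import java.util.Arrays;

/** Hypothetical illustration of re-binning after a split; not the H2O code. */
final class AdaptiveRebinSketch {

    /** Counts rows per uniform bin over [min, maxEx); rows outside are skipped. */
    static int[] histogram(double[] rows, double min, double maxEx, int nbins) {
        int[] counts = new int[nbins];
        double step = nbins / (maxEx - min);
        for (double x : rows) {
            if (x < min || x >= maxEx) continue;
            counts[Math.min(nbins - 1, (int) ((x - min) * step))]++;
        }
        return counts;
    }

    /**
     * Builds a fresh histogram for the rows that fell into [lo, hiEx),
     * using the min and max observed among those rows as the new range.
     */
    static int[] rebin(double[] rows, double lo, double hiEx, int nbins) {
        double[] child = Arrays.stream(rows).filter(x -> x >= lo && x < hiEx).toArray();
        double min = Arrays.stream(child).min().orElse(lo);
        double max = Arrays.stream(child).max().orElse(lo);
        // Assumption: integer-valued data, so the exclusive bound is max + 1.
        return histogram(child, min, max + 1, nbins);
    }

    public static void main(String[] args) {
        double[] rows = {1, 2, 3, 4, 5, 6, 7, 8, 1000};
        // The outlier gets its own bin and the rest crowd into bin 0.
        System.out.println(Arrays.toString(histogram(rows, 0, 1024, 4))); // [8, 0, 0, 1]
        // A fresh histogram over the crowded bin spreads the rows evenly.
        System.out.println(Arrays.toString(rebin(rows, 0, 256, 4)));      // [2, 2, 2, 2]
    }
}
```

One level of re-binning is enough here because the data has a single outlier; more skewed data would need further levels.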
Modifier and Type | Class and Description |
---|---|
`static class` | `DHistogram.NASplitDir` Split direction for missing values. |
Modifier and Type | Field and Description |
---|---|
`boolean` | `_absoluteSplitPts` |
`boolean` | `_checkFloatSplits` |
`water.Key<hex.tree.DHistogram.HistoSplitPoints>` | `_globalSplitPointsKey` |
`SharedTreeModel.SharedTreeParameters.HistogramType` | `_histoType` |
`boolean` | `_initNA` |
`boolean` | `_intOpt` |
`byte` | `_isInt` |
`double` | `_maxEx` |
`protected double` | `_maxIn` |
`double` | `_min` |
`protected double` | `_min2` |
`int` | `_minInt` |
`double` | `_minSplitImprovement` |
`java.lang.String` | `_name` |
`char` | `_nbin` |
`double` | `_pred1` |
`double` | `_pred2` |
`long` | `_seed` |
`double` | `_step` |
`protected Divergence` | `_upliftMetric` |
`protected boolean` | `_useUplift` |
`protected double[]` | `_vals` |
`protected int` | `_vals_dim` |
`protected int` | `_valsDimUplift` |
`protected double[]` | `_valsUplift` |
`static int` | `INT_NA` |
Modifier and Type | Method and Description |
---|---|
`int` | `actNBins()` |
`void` | `add(DHistogram dsh)` |
`int` | `bin(double col_data)` |
`double` | `binAt(int b)` |
`double` | `bins(int b)` |
`double` | `denNA()` |
`double` | `find_maxEx()` |
`static double` | `find_maxEx(double maxIn, int isInt)` |
`double` | `find_maxIn()` |
`double` | `find_min()` |
`double[]` | `getRawVals()` |
`boolean` | `hasNABin()` |
`void` | `init()` |
`void` | `init(double[] vals)` |
`void` | `init(double[] vals, double[] valsUplift)` |
`static DHistogram[]` | `initialHist(water.fvec.Frame fr, int ncols, int nbins, DHistogram[] hs, long seed, SharedTreeModel.SharedTreeParameters parms, water.Key<hex.tree.DHistogram.HistoSplitPoints>[] globalSplitPointsKey, Constraints cs, boolean checkFloatSplits, GlobalInteractionConstraints ics)` The initial histogram bins are set up from the Vec rollups. |
`static DHistogram` | `make(java.lang.String name, int nbins, byte isInt, double min, double maxEx, boolean intOpt, boolean hasNAs, long seed, SharedTreeModel.SharedTreeParameters parms, water.Key<hex.tree.DHistogram.HistoSplitPoints> globalSplitPointsKey, Constraints cs, boolean checkFloatSplits, double[] customSplitPoints)` |
`int` | `nbins()` |
`double` | `nomNA()` |
`int` | `nonEmptyBins()` |
`double` | `numControlNA()` |
`double` | `numTreatmentNA()` |
`void` | `reducePrecision()` Cast bin values (except for sums of weights) to floats to drop least significant bits. |
`double` | `respControlNA()` |
`double` | `respTreatmentNA()` |
`double` | `seP1NA()` Squared error for the NA bucket and prediction value `_pred1`. |
`double` | `seP2NA()` Squared error for the NA bucket and prediction value `_pred2`. |
`java.lang.String` | `toString()` |
`static boolean` | `useIntOpt(water.fvec.Vec v, SharedTreeModel.SharedTreeParameters parms, Constraints cs)` Determines if histogram making can use integer optimization when extracting data. |
`double` | `var(int b)` Compute the sample variance within a given bin. |
`double` | `w(int i)` |
`double` | `wNA()` |
`double` | `wY(int i)` |
`double` | `wYNA()` |
`double` | `wYY(int i)` |
`double` | `wYYNA()` |
public static final int INT_NA
public final transient java.lang.String _name
public final double _minSplitImprovement
public final byte _isInt
public final boolean _intOpt
public char _nbin
public double _step
public final double _min
public final double _maxEx
public final int _minInt
public final boolean _initNA
public final double _pred1
public final double _pred2
protected double[] _vals
protected final int _vals_dim
protected final boolean _useUplift
protected double[] _valsUplift
protected final int _valsDimUplift
protected final Divergence _upliftMetric
protected double _min2
protected double _maxIn
public SharedTreeModel.SharedTreeParameters.HistogramType _histoType
public final boolean _checkFloatSplits
public final long _seed
public transient boolean _absoluteSplitPts
public water.Key<hex.tree.DHistogram.HistoSplitPoints> _globalSplitPointsKey
public double w(int i)
public double wY(int i)
public double wYY(int i)
public double wNA()
public double wYNA()
public double wYYNA()
public double[] getRawVals()
public double seP1NA()
public double seP2NA()
public double denNA()
public double nomNA()
public double numTreatmentNA()
public double respTreatmentNA()
public double numControlNA()
public double respControlNA()
public int bin(double col_data)
public double binAt(int b)
public int nbins()
public int actNBins()
public double bins(int b)
public int nonEmptyBins()
public boolean hasNABin()
public void init()
public void init(double[] vals)
public void init(double[] vals, double[] valsUplift)
public void add(DHistogram dsh)
public double find_min()
public double find_maxIn()
public double find_maxEx()
public static double find_maxEx(double maxIn, int isInt)
public static DHistogram[] initialHist(water.fvec.Frame fr, int ncols, int nbins, DHistogram[] hs, long seed, SharedTreeModel.SharedTreeParameters parms, water.Key<hex.tree.DHistogram.HistoSplitPoints>[] globalSplitPointsKey, Constraints cs, boolean checkFloatSplits, GlobalInteractionConstraints ics)
fr - frame with column data
ncols - number of columns
nbins - number of bins
hs - an array of histograms to be initialized
seed - seed for reproducibility
parms - parameters of the model
globalSplitPointsKey - array of global split-points keys
cs - monotone constraints (could be null)
checkFloatSplits -
public static DHistogram make(java.lang.String name, int nbins, byte isInt, double min, double maxEx, boolean intOpt, boolean hasNAs, long seed, SharedTreeModel.SharedTreeParameters parms, water.Key<hex.tree.DHistogram.HistoSplitPoints> globalSplitPointsKey, Constraints cs, boolean checkFloatSplits, double[] customSplitPoints)
public static boolean useIntOpt(water.fvec.Vec v, SharedTreeModel.SharedTreeParameters parms, Constraints cs)
v - input Vec
parms - algo params
cs - constraints specification
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public double var(int b)
b - bin id
public void reducePrecision()
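How `var(int b)` and `reducePrecision()` could operate on the accumulated sums is sketched here. The sketch rests on assumptions that this page does not confirm: a flat array holding the triple (sum of weights, weighted sum of responses, weighted sum of squared responses) per bin, and the usual sample-variance formula over those sums. The class `BinStatsSketch` and its `addRow` method are hypothetical.

```java
/** Hypothetical illustration of per-bin statistics; not the H2O code. */
final class BinStatsSketch {
    static final int DIM = 3;  // assumed layout per bin: {w, wY, wYY}
    final double[] vals;

    BinStatsSketch(int nbins) {
        vals = new double[DIM * nbins];
    }

    /** Accumulates one row with the given weight and response into bin b. */
    void addRow(int b, double weight, double y) {
        vals[DIM * b]     += weight;
        vals[DIM * b + 1] += weight * y;
        vals[DIM * b + 2] += weight * y * y;
    }

    double w(int b)   { return vals[DIM * b]; }
    double wY(int b)  { return vals[DIM * b + 1]; }
    double wYY(int b) { return vals[DIM * b + 2]; }

    /** Sample variance of the response in bin b, computed from the three sums. */
    double var(int b) {
        double n = w(b);
        if (n <= 1) return 0;  // not enough rows for a sample variance
        // Clamp at zero to absorb small negative values from rounding.
        return Math.max(0, (wYY(b) - wY(b) * wY(b) / n) / (n - 1));
    }

    /** Rounds every value except the sums of weights to float precision. */
    void reducePrecision() {
        for (int i = 0; i < vals.length; i++) {
            if (i % DIM != 0) vals[i] = (float) vals[i];
        }
    }
}
```

Rounding through `float` discards the low-order bits of each sum while leaving the weight totals exact, which is the behavior the method summary describes.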