Available in: GBM, XGBoost
Specify the column (y-axis) sampling rate (without replacement). The acceptable value range is 0.0 to 1.0, and the default is 1. Higher values may improve training accuracy, while test accuracy may improve when either columns or rows are sampled. For details, refer to “Stochastic Gradient Boosting” (Friedman, 1999).
The following illustrates how column sampling is implemented.
For an example model with 100 columns, using:
col_sample_rate_per_tree=0.754 (samples 75.4% of columns per tree)
col_sample_rate=0.8 (samples 80% of columns per split)
For each tree, the floor is used to determine the number of columns that are randomly picked: in this example, floor(0.754 * 100) = 75 out of 100. The floor is then used again to determine the number of columns that are randomly chosen for each split decision out of those 75: floor(0.754 * 0.8 * 100) = 60.
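The arithmetic above can be reproduced with a short Python sketch. It is only an illustration of the two-stage floor calculation described here, not the library's internal code; the function name sample_columns and the seed argument are hypothetical and exist only for this example.

```python
import math
import random


def sample_columns(n_cols, col_sample_rate_per_tree, col_sample_rate, seed=0):
    """Illustrative two-stage column sampling (hypothetical helper)."""
    rng = random.Random(seed)
    # Stage 1: number of columns available to one tree.
    n_tree = math.floor(col_sample_rate_per_tree * n_cols)
    # Stage 2: number of columns considered at one split of that tree.
    n_split = math.floor(col_sample_rate_per_tree * col_sample_rate * n_cols)
    # Both draws are without replacement; the split draw comes from the tree's columns.
    tree_cols = rng.sample(range(n_cols), n_tree)
    split_cols = rng.sample(tree_cols, n_split)
    return tree_cols, split_cols


tree_cols, split_cols = sample_columns(100, 0.754, 0.8)
print(len(tree_cols), len(split_cols))  # 75 60
```

Because the products are floating-point values, a rate that lands exactly on an integer boundary could round down by one in this sketch; the example values avoid that case.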
Row and column sampling (sample_rate and col_sample_rate) can improve generalization and lead to lower validation and test set errors. Good general values for large datasets are around 0.7 to 0.8 (sampling 70-80 percent of the data) for both parameters. Column sampling per tree (col_sample_rate_per_tree) can also be used. Note that col_sample_rate_per_tree is multiplicative with col_sample_rate, so setting both parameters to 0.8, for example, results in 64% of columns being considered at any given split.
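As a quick check of the multiplicative relationship, the following Python sketch computes the effective fraction of columns considered per split. The helper name effective_column_fraction is hypothetical and is used only for illustration.

```python
def effective_column_fraction(col_sample_rate_per_tree, col_sample_rate):
    """Fraction of all columns considered at a split (hypothetical helper)."""
    return col_sample_rate_per_tree * col_sample_rate


frac = effective_column_fraction(0.8, 0.8)
print(f"{frac:.0%}")  # 64%
```

When tuning both parameters together, it can help to keep this product in mind, since two moderate rates combine into a noticeably smaller share of columns per split.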