Group by¶
The group_by
function lets you group one or more columns and apply a function to the result. Specifically, the group_by
function performs the following actions on an H2O Frame:
Splits the data into groups based on some criteria.
Applies a function to each group independently.
Combines the results into an H2OFrame.
The result is a new H2OFrame with columns equivalent to the number of groups created. The returned groups are sorted by the natural group-by column sort.
Group by parameters¶
The group_by
function accepts the following parameters:
Python and R¶
by
: Theby
option can take a list of columns if you want to group by more than one column to compute the summary.
H2OFrame
: This specifies the H2OFrame that you want the group by operation to be performed on.
Python only¶
na
: This option controls the treatment of NA values during the calculation. It can be one of:all
(default): Any NA values are used in the calculation as-is (which usually results in the final result being NA, too).ignore
: NA entries are not included in calculations, but the total number of entries is taken as the total number of rows. For example,mean([1, 2, 3, nan], na="ignore")
will produce1.5
.rm
: NA entries are skipped during the calculations, reducing the total effective count of entries. For example,mean([1, 2, 3, nan], na="rm")
will produce2
.
R only¶
gb.control
: In R, thegb.control
option specifies how to handle NA values in the dataset as well as how to name output columns. Note that to specify a list of column names in thegb.control
list, you must add thecol.names
argument.na.methods
: This option controls the treatment of NA values during the calculation. It can be one of:all
(default): Any NA values are used in the calculation as-is (which usually results in the final result being NA, too).ignore
: NA entries are not included in calculations, but the total number of entries is taken as the total number of rows. For example,mean([1, 2, 3, nan], na="ignore")
will produce1.5
.rm
: NA entries are skipped during the calculations, reducing the total effective count of entries. For example,mean([1, 2, 3, nan], na="rm")
will produce2
.
nrow
: Specify the name of the generated column.
Note
If a list smaller than the number of columns groups is supplied, then the list will be padded by ignore
.
Aggregations¶
In addition to the above parameters, any number of the following aggregations can be chained together in the group_by
function:
count
: Count the number of rows in each group of a GroupBy object.max
: Calculate the maximum of each column specified incol
for each group of a GroupBy object.mean
: Calculate the mean of each column specified incol
for each group of a GroupBy object.min
: Calculate the minimum of each column specified incol
for each group of a GroupBy object.mode
: Calculate the mode of each column specified incol
for each group of a GroupBy object.sd
: Calculate the standard deviation of each column specified incol
for each group of a GroupBy object.ss
: Calculate the sum of squares of each column specified incol
for each group of a GroupBy object.sum
: Calculate the sum of each column specified incol
for each group of a GroupBy object.var
: Calculate the variance of each column specified incol
for each group of a GroupBy object.
Note
If no arguments are given to the aggregation (e.g. max()
in grouped.sum(col="X1", na="all").mean(col="X5", na="all").max()
), then it is assumed that the aggregation should apply to all columns except the GroupBy columns.
Once the aggregation operations are complete, calling the GroupBy object with a new set of aggregations will yield no effect. You must generate a new GroupBy object in order to apply a new aggregation on it. In addition, certain aggregations are only defined for numerical or categorical columns. An error will be thrown for calling aggregation on the wrong data types.
Examples¶
The following examples in Python and R show how to find the months with the highest cancellation using group_by
.
import h2o
h2o.init()
# Upload the airlines dataset
air = h2o.import_file("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")
air.dim
[43978, 31]
# Find number of flights by airport
origin_flights = air.group_by("Origin")
origin_fights.count()
origin_fights.get_frame()
Origin nrow
-------- ------
ABE 59
ABQ 876
ACY 31
...
# Find number of flights per month based on the origin
cols = ["Origin","Month"]
flights_by_origin_month = air.group_by(by=cols).count(na ="all")
flights_by_origin_month.get_frame()
Origin Month nrow
-------- ------- ------
ABE 1 59
ABQ 1 846
ABQ 10 30
...
# Find months with the highest cancellation ratio
cancellation_by_month = air.group_by(by='Month').sum('Cancelled', na="all")
flights_by_month = air.group_by('Month').count(na="all")
cancelled = cancellation_by_month.get_frame()['sum_Cancelled']
flights = flights_by_month.get_frame()['nrow']
month_count = flights_by_month.get_frame()['Month']
ratio = cancelled/flights
month_count.cbind(ratio)
Month sum_Cancelled
------- ---------------
1 0.0254175
10 0.00950475
[2 rows x 2 columns]
# Use group_by with multiple columns. Summarize the destination,
# arrival delays, and departure delays for an origin
cols_1 = ['Origin', 'Dest', 'IsArrDelayed', 'IsDepDelayed']
cols_2 = ["Dest", "IsArrDelayed", "IsDepDelayed"]
air[cols_1].group_by(by='Origin').sum(cols_2, na="ignore").get_frame()
Origin sum_Dest sum_IsDepDelayed sum_IsArrDelayed
-------- ---------- ------------------ ------------------
ABE 5884 30 40
ABQ 84505 370 545
ACY 3131 7 9
ALB 3646 50 49
AMA 317 6 4
ANC 100 1 0
...
library(h2o)
h2o.init()
# Import the airlines data set and display a summary.
airlines_url <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
airlines <- h2o.importFile(path = airlines_url)
summary(airlines)
# Find number of flights by airport
origin_flights <- h2o.group_by(data = airlines, by = "Origin", nrow("Origin"), gb.control = list(na.methods = "rm"))
origin_flights_df <- as.data.frame(origin_flights)
origin_flights_df
Origin nrow
1 ABE 59
2 ABQ 876
3 ACY 31
...
# Find number of flights per month
flights_by_month <- h2o.group_by(data = airlines,
by = "Month",
nrow("Month"),
gb.control = list(na.methods = "rm"))
flights_by_month_df <- as.data.frame(flights_by_month)
flights_by_month_df
Month nrow
1 1 41979
2 10 1999
# Find the number of flights in a given month based on the origin
cols <- c("Origin","Month")
flights_by_origin_month <- h2o.group_by(data = airlines,
by = cols,
nrow("Month"),
gb.control = list(na.methods = "rm"))
flights_by_origin_month_df <- as.data.frame(flights_by_origin_month)
flights_by_origin_month_df
Origin Month nrow
1 ABE 1 59
2 ABQ 1 846
3 ABQ 10 30
4 ACY 1 31
5 ALB 1 75
...
# Find months with the highest cancellation ratio
which(colnames(airlines)=="Cancelled")
[1] 22
cancellations_by_month <- h2o.group_by(data = airlines,
by = "Month",
sum("Cancelled"),
gb.control=list(na.methods="rm"))
cancellation_rate <- cancellations_by_month$sum_Cancelled/flights_by_month$nrow
rates_table <- h2o.cbind(flights_by_month$Month,cancellation_rate)
rates_table_df <- as.data.frame(rates_table)
rates_table_df
Month sum_Cancelled
1 1 0.025417471
2 10 0.009504752
# Use group_by with multiple columns. Summarize the destination,
# arrival delays, and departure delays for an origin
cols <- c("Dest", "IsArrDelayed", "IsDepDelayed")
origin_flights <- h2o.group_by(data = airlines[c("Origin",cols)],
by = "Origin",
sum(cols),
gb.control = list(na.methods = "ignore", col.names = NULL))
# Note a warning because col.names null
res <- h2o.cbind(lapply(cols, function(x){h2o.group_by(airlines, by = "Origin", sum(x))}))[,c(1,2,4,6)]
res
Origin sum_Dest sum_IsArrDelayed sum_IsDepDelayed
1 ABE 5884 40 30
2 ABQ 84505 545 370
3 ACY 3131 9 7
4 ALB 3646 49 50
5 AMA 317 4 6
6 ANC 100 0 1
The following R code shows the options by-variable with gb.control
.
# Import H2O-3:
library(h2o)
h2o.init()
# Import the airlines dataset:
airlines.hex <- h2o.importFile("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")
# View quantiles and histograms:
quantile(x = airlines.hex$ArrDelay, na.rm = TRUE)
h2o.hist(airlines.hex$ArrDelay)
# Find the number of flights by airport:
originFlights <- h2o.group_by(data = airlines.hex, by = "Origin", nrow("Origin"), gb.control <- list(na.methods = "rm"))
originFlights.R <- as.data.frame(originFlights)
# Find the number of flights per month:
flightsByMonth <- h2o.group_by(data = airlines.hex, by = "Month", nrow("Month"), gb.control <- list(na.methods = "rm"))
flightsByMonth.R <- as.data.frame(flightsByMonth)
# Find months with the highest cancellation ratio:
which(colnames(airlines.hex)=="Cancelled")
cancellationsByMonth <- h2o.group_by(data = airlines.hex, by = "Month", sum("Cancelled"), gb.control <- list(na.methods = "rm"))
cancellation_rate <- cancellationsByMonth$sum_Cancelled/flightsByMonth$nrow
rates_table <- h2o.cbind(flightsByMonth$Month, cancellation_rate)
rates_table.R <- as.data.frame(rates_table)
# Construct test and train sets using sampling:
airlines.split <- h2o.splitFrame(data = airlines.hex, ratio = 0.85)
airlines.train <- airlines.split[[1]]
airlines.test <- airlines.split[[2]]
# Display a summary using table-like functions:
h2o.table(airlines.train$Cancelled)
h2o.table(airlines.test$Cancelled)
# Set the predictor and response variables:
Y <- "IsDepDelayed"
X <- c("Origin", "Dest", "DayofMonth", "Year", "UniqueCarrier", "DayOfWeek", "Month", "DepTime", "ArrTime", "Distance")
# Define the data for the model and display the results:
airlines.glm <- h2o.glm(training_frame = airlines.train, x = X, y = Y, family = "binomial", alpha = 0.5)
# View the model information (training statistics, performance, important variables):
summary(airlines.glm)
# Predict using the GLM model:
pred <- h2o.predict(object = airlines.glm, newdata = airlines.test)
# Look at the summary of predictions (probability of TRUE class p1):
summary(pred$p1)