Merging Two Datasets -------------------- You can use the `merge` function to combine two datasets that share a common column name. By default, all columns in common are used as the merge key; uncommon will be ignored. Also, if you want to use only a subset of the columns in common, rename the other columns so the columns are unique in the merged result. Note that in order for a merge to work in multinode clusters, one of the datasets must be small enough to exist in every node. .. tabs:: .. code-tab:: r R # Currently, this function only supports `all.x = TRUE`. All other permutations will fail. library(h2o) h2o.init() # Create two simple, two-column R data frames by inputting values, ensuring that both have a common column (in this case, "fruit"). left <- data.frame(fruit = c('apple', 'orange', 'banana', 'lemon', 'strawberry', 'blueberry'), color = c('red', 'orange', 'yellow', 'yellow', 'red', 'blue')) right <- data.frame(fruit = c('apple', 'orange', 'banana', 'lemon', 'strawberry', 'watermelon'), citrus = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)) # Create the H2O data frames from the inputted data. left_frame <- as.h2o(left) print(left_frame) fruit color 1 apple red 2 orange orange 3 banana yellow 4 lemon yellow 5 strawberry red 6 blueberry blue [6 rows x 2 columns] right_frame <- as.h2o(right) print(right_frame) fruit citrus 1 apple FALSE 2 orange TRUE 3 banana FALSE 4 lemon TRUE 5 strawberry FALSE 6 watermelon FALSE [6 rows x 2 columns] # Merge the data frames. The result is a single dataset with three columns. new_frame <- h2o.merge(left_frame, right_frame, all.x = TRUE) print(new_frame) fruit color citrus 1 blueberry blue 2 apple red FALSE 3 banana yellow FALSE 4 lemon yellow TRUE 5 orange orange TRUE 6 strawberry red FALSE [6 rows x 3 columns] .. code-tab:: python import h2o h2o.init() import numpy as np # Create a dataset by inputting raw data. df1 = h2o.H2OFrame.from_python({'A':['Hello', 'World', 'Welcome', 'To', 'H2O', 'World'], 'n': [0,1,2,3,4,5]}) df1.describe A n ------- --- Hello 0 World 1 Welcome 2 To 3 H2O 4 World 5 [6 rows x 2 columns] # Generate a random dataset from python. df2 = h2o.H2OFrame.from_python([[x] for x in np.random.randint(0, 10, size=20).tolist()], column_names=['n']) df2.describe n --- nan 0 8 6 1 7 8 5 1 3 [21 rows x 1 column] # Merge the first dataset into the second dataset. Note that only columns # in common are merged (i.e, values in df2 greater than 5 will not be merged). df3 = df2.merge(df1) df3.describe n A --- ------- nan Hello 3 To 3 To 0 Hello 5 World 3 To 0 Hello 5 World 1 World 2 Welcome [14 rows x 2 columns] # Merge all of df2 into df1. Note that this will result in missing values for # column A, which does not include values greater than 5. df4 = df2.merge(df1, all_x=True) df4.describe n A --- ----- nan Hello 0 Hello 8 6 1 World 7 8 5 World 1 World 3 To [21 rows x 2 columns]