Dataset format: Graph node regression


The dataset for a graph node regression experiment needs to be organized in a specific folder structure and format.

Dataset format

Dataset structure

Here is an example of what the folder structure should look like: (1)

├───train.csv (2)

├───meta_node.csv (3)

├───feature_dataframe_for_nodetype1.csv (4)
├───feature_dataframe_for_nodetype2.csv (4)
│ ...

├───meta_relation.csv (5)

├───feature_dataframe_for_relation1.csv (6)
├───feature_dataframe_for_relation2.csv (6)
│ ...

├───image_folder_name1 (7)
│ ├───image_name1.image_extension
│ ├───image_name2.image_extension
│ ...

├───image_folder_name2 (7)
│ ├───image_name1.image_extension
│ ├───image_name2.image_extension
│ ...


  1. (component 1) Create a zip file that will contain all the necessary files and folders.

  2. Inside the zip file, include the following files and folders:

    • (component 2) train.csv: This is the training data file in CSV format. It should contain the following columns:

      • node_id: This column should contain the IDs of the target node type.
      • label1, label2, label3, ...: One or more label columns containing the numerical labels (targets).

        H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new graph node regression experiment.

      • fold (optional): This column should contain cross-validation fold indexes.

        The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

    • (component 3) meta_node.csv: This file contains metadata about the node types used in the graph. It should have the following columns:

      • node_type: This column should contain the names of the node types.
      • file_name: This column should contain the feature dataframe file names for every node type.
      • is_target: This column should indicate which node type to train and predict on. It should have binary 0/1 values, with 1 referring to the target node type.
    • (component 4) feature_dataframe_for_nodetype1.csv, feature_dataframe_for_nodetype2.csv, ...: These are CSV files that store the features for each node type in the graph. Each file should have the following columns:

      • node_id: This column should contain the node IDs.
      • Numerical features (optional): Any number of numerical features with column names ending with _num.
      • Categorical features (optional): Any number of categorical features with column names ending with _cat.
      • Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
      • Text features (optional): Any number of text features with column names ending with _txt.
      • Image features (optional): Any number of image features with column names ending with _img. The image feature columns should only contain image_name.image_extension.
    • (component 5) meta_relation.csv: This file contains metadata about the relations between nodes in the graph. It should have the following columns:

      • src_node_type: This column should contain the names of the source node types.
      • dst_node_type: This column should contain the names of the destination node types. The destination node type can be the same as the source node type.
      • edge_type: This column should contain the names of the edge types.
      • file_name: This column should contain the feature dataframe file names for every relation.
      • is_directional: This column indicates whether the relation is directional or not.
    • (component 6) feature_dataframe_for_relation1.csv, feature_dataframe_for_relation2.csv, ...: These are CSV files that store the features for each relation in the graph. Each file should have the following columns:

      • src_node_id: This column should contain the source node IDs.
      • dst_node_id: This column should contain the destination node IDs.
      • Numerical features (optional): Any number of numerical features with column names ending with _num.
      • Categorical features (optional): Any number of categorical features with column names ending with _cat.
      • Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
      • Text features (optional): Any number of text features with column names ending with _txt.
      • Image features (optional): Any number of image features with column names ending with _img. The image feature columns should only contain image_name.image_extension.
    • (component 7) image_folder_name1, image_folder_name2, ...: These are folders that contain the image files used as features in the graph. Each folder should be named after the corresponding image feature column in the feature dataframes.
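
The column-naming convention described above can be checked mechanically before uploading a dataset. The following is a minimal sketch, not part of H2O Hydrogen Torch: the helper names `feature_kind` and `check_header` and the example column names are hypothetical, and the rule that every non-ID column must carry one of the five documented suffixes is an assumption made for illustration.

```python
# Hypothetical helpers for illustration; not part of H2O Hydrogen Torch.

# Suffixes documented for node and relation feature dataframes.
SUFFIX_KINDS = {
    "_num": "numerical",
    "_cat": "categorical",
    "_dtm": "datetime",
    "_txt": "text",
    "_img": "image",
}

# ID columns used by node files (node_id) and relation files (src/dst).
ID_COLUMNS = {"node_id", "src_node_id", "dst_node_id"}


def feature_kind(column):
    """Return the feature kind implied by a column's suffix, or None."""
    for suffix, kind in SUFFIX_KINDS.items():
        if column.endswith(suffix):
            return kind
    return None


def check_header(columns):
    """Split a CSV header into recognized features and unrecognized columns.

    Assumption: every column that is not an ID column should end with one
    of the documented suffixes.
    """
    features = {}
    unrecognized = []
    for column in columns:
        if column in ID_COLUMNS:
            continue
        kind = feature_kind(column)
        if kind is None:
            unrecognized.append(column)
        else:
            features[column] = kind
    return features, unrecognized


# Example header for a hypothetical node feature file.
features, unrecognized = check_header(
    ["node_id", "country_cat", "age_num", "photo_img", "notes"]
)
print(features)
print(unrecognized)
```

In this sketch, a column such as `notes` is reported as unrecognized because it has no suffix. Whether such a column should be renamed or dropped depends on the dataset.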

Important notes

Please note the following:

  • Train, validation and test CSV files: Besides the train.csv file, you can have multiple CSV files in the zip file that you can use as training, validation, and test dataframes.
    • A train CSV file must follow the format described in the Components section.
    • A validation CSV file must follow the same format as a train CSV file.
    • A test CSV file must follow the same format as a train CSV file but does not require label columns.
      • For inference with a graph node regression model, the inference dataset only needs to contain a test CSV file, provided that the graph structure is the same as the one used in the training dataset. If the test graph is different, you need to include all the graph data; that is, the dataset needs to contain components 3 - 7 (described in the Components section).
  • meta_relation.csv: When defining the columns for the meta_relation.csv file, recall the following: In the context of graph theory, a relation refers to a connection or association between two nodes in a graph. Three components define a relation, and together, these components define the nature of the relationship between nodes in the graph:
    1. Source node type (src_node_type): This is the type of node from which the relation originates. For example, in a social network graph, the source node type could be "user."
    2. Destination node type (dst_node_type): This is the type of node that the relation points to. In the same social network example, the destination node type could be "group."
    3. Edge type (edge_type): This is the type of connection or association between the source and destination nodes. In the social network example, the edge type could be "joins" or "belongs to."
  • meta_node.csv and meta_relation.csv: The meta_node.csv and meta_relation.csv files are mandatory and their names cannot be modified.
  • Image folders: Image folders should contain the image files specified in the image feature columns of the feature dataframes.
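
A quick consistency check on the zip file can catch a missing metadata file, or a file_name entry that points to nothing, before an upload. The sketch below is illustrative only and uses just the Python standard library. The function names are hypothetical. It assumes that all files sit at the root of the archive, that file_name holds the full file name including the .csv extension, and that is_directional is encoded as 0/1; none of these details are confirmed by the format description. The demo data reuses the social network example (user, group, joins).

```python
import csv
import io
import zipfile

# File names that the format description says are mandatory.
REQUIRED_FILES = ("meta_node.csv", "meta_relation.csv")


def read_csv_member(archive, name):
    """Read one CSV member of the archive into a list of row dicts."""
    text = archive.read(name).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def find_problems(zip_bytes):
    """Return human-readable problems found in a dataset archive."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())
        for meta in REQUIRED_FILES:
            if meta not in names:
                problems.append(f"missing mandatory file: {meta}")
                continue
            for row in read_csv_member(archive, meta):
                # Assumption: file_name is the full member name in the zip.
                if row["file_name"] not in names:
                    problems.append(
                        f"{meta} references missing file: {row['file_name']}"
                    )
    return problems


def build_demo_zip(members):
    """Build an in-memory zip from a {name: text} mapping (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Hypothetical demo data; group.csv is deliberately left out.
demo = build_demo_zip({
    "meta_node.csv": (
        "node_type,file_name,is_target\n"
        "user,user.csv,1\n"
        "group,group.csv,0\n"
    ),
    "meta_relation.csv": (
        "src_node_type,dst_node_type,edge_type,file_name,is_directional\n"
        "user,group,joins,user_joins_group.csv,1\n"
    ),
    "user.csv": "node_id\nuser1\n",
    "user_joins_group.csv": "src_node_id,dst_node_id\nuser1,group1\n",
})
problems = find_problems(demo)
print(problems)
```

The demo archive omits group.csv on purpose, so the check reports that meta_node.csv references a file that is not present.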


To help users understand the dataset format for a graph node regression model, we provide a hypothetical dataset. This dataset focuses on predicting the publication year of papers in the graph and includes two node types: "paper" and "author," along with three relations: "author_wrote_paper," "author_collaborated_author," and "paper_cited_paper." The structure of the dataset is organized within a zip file as follows:

├───train.csv (1)
├───validation.csv (2)

├───meta_node.csv (3)

├───paper.csv (4)
├───author.csv (5)

├───meta_relation.csv (6)

├───author_wrote_paper.csv (7)
├───author_collaborated_author.csv (8)
├───paper_cited_paper.csv (9)

└───photo_img (10)

  1. The train.csv file contains the labeled data used for training the model and has the following structure:


    It is normal for the graph to contain nodes that are unlabeled or not of interest. Not all nodes in the dataset need to have labels or be relevant to the task at hand.

    For example, the given dataset contains a paper with the ID "paper1." This paper is neither labeled nor predicted in the task of predicting the publication year, so "paper1" is an unlabeled node that is not of interest for this particular regression task.

    Such nodes are common in real-world datasets, for reasons such as missing data, incomplete information, or a task that focuses on specific nodes or subsets of the graph. Machine learning models can handle these scenarios by ignoring or excluding such nodes during the training and prediction phases.

  2. The validation.csv file contains the labeled data used for validating the model and has the following structure:

  3. The meta_node.csv file contains information about the node types and their corresponding files. It has the following structure:

  4. The paper.csv file contains details about the papers in the dataset and has the following structure:

    paper1 | title 1 | abstract 1
    paper2 |         | abstract 2
    paper3 | title 3 | abstract 3
    paper4 | title 4 |

    For this node type, there are four nodes and two text features. Note that some features may have missing values.

  5. The author.csv file contains information about the authors in the dataset and has the following structure:


    For this node type, there are three nodes, and it includes one categorical feature, one numerical feature, and one image feature. Because of the image column name, you need to store the images in the "photo_img" directory.

  6. The meta_relation.csv file provides information about the relations between different node types. It has the following structure:


    For directional relations like "author_wrote_paper," there is no need to add reverse relations like "paper_writtenby_author."

  7. The author_wrote_paper.csv file contains the connections between authors and papers and has the following structure:

  8. The author_collaborated_author.csv file contains information about collaborations between authors and has the following structure:

  9. The paper_cited_paper.csv file contains information about citations between papers and has the following structure:


    Note that even in non-directional relations, each edge is defined with a source and a destination. However, there is no need to add reverse edges. For example, if "paper4" cites "paper3," there is no need to include "paper3" being cited by "paper4."

  10. The photo_img folder (directory) contains the image files used as features in the graph. The folder name must be the same as the image feature column name.
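
Because reverse edges are not needed, an edge list for a non-directional relation such as author_collaborated_author can be reduced to one row per node pair before it is written to CSV. The following is a minimal sketch of that cleanup step. The function name `drop_reverse_duplicates` and the node IDs are hypothetical, and the sketch only illustrates the idea; it is not how H2O Hydrogen Torch processes edges internally.

```python
def drop_reverse_duplicates(edges):
    """Keep one edge per unordered node pair, preserving first-seen order.

    `edges` is an iterable of (src_node_id, dst_node_id) tuples. Intended
    for non-directional relations only; for directional relations the
    direction carries meaning and reverse edges must not be merged.
    """
    seen = set()
    kept = []
    for src, dst in edges:
        key = frozenset((src, dst))  # same key for (a, b) and (b, a)
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, dst))
    return kept


# Hypothetical collaboration edges; the second row repeats the first
# in the opposite direction.
edges = [
    ("author1", "author2"),
    ("author2", "author1"),
    ("author2", "author3"),
]
deduplicated = drop_reverse_duplicates(edges)
print(deduplicated)
```

The first occurrence of each pair is kept, so the direction stored in the file is whichever one appeared first in the input.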
