Version: v1.4.0

Dataset format: Graph node classification

Overview

The dataset for a graph node classification experiment needs to be organized in a specific folder structure and format.

Dataset format

Dataset structure

Here is an example of what the folder structure should look like:

folder_name.zip (1)

├───train.csv (2)

├───meta_node.csv (3)

├───feature_dataframe_for_nodetype1.csv (4)
├───feature_dataframe_for_nodetype2.csv (4)
│ ...

├───meta_relation.csv (5)

├───feature_dataframe_for_relation1.csv (6)
├───feature_dataframe_for_relation2.csv (6)
│ ...

├───image_folder_name1 (7)
│ ├───image_name1.image_extension
│ ├───image_name2.image_extension
│ ...

├───image_folder_name2 (7)
│ ├───image_name1.image_extension
│ ├───image_name2.image_extension
│ ...
...
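
The structure above can be packaged with Python's standard zipfile module. The following is a minimal sketch, not an official tool: the helper name zip_dataset and the placeholder files are illustrative, and it assumes the files sit at the root of the archive, as the tree suggests.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_dataset(source_dir, zip_path):
    """Zip every file under source_dir, storing paths relative to source_dir."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store "train.csv", "image_folder_name1/image_name1.png", ...
                archive.write(path, path.relative_to(source_dir).as_posix())
    return zip_path


# Tiny demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "folder_name"
    (root / "image_folder_name1").mkdir(parents=True)
    (root / "train.csv").write_text("node_id,label\n")
    (root / "meta_node.csv").write_text("node_type,file_name,is_target\n")
    (root / "image_folder_name1" / "image_name1.png").write_bytes(b"")  # placeholder
    zip_path = zip_dataset(root, Path(tmp) / "folder_name.zip")
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())

print(names)
```

Storing paths relative to the dataset folder keeps the CSV files and image folders at the top level of the zip file rather than nested under an extra directory.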

Components

  1. (Component 1) Create a zip file (for example, folder_name.zip) that will contain all the necessary files and folders.
  2. Inside the zip file, include the following files and folders:
    • (Component 2) train.csv: This is the training data file in CSV format. It should contain the following columns:
      • node_id: This column should contain the IDs of the target node type.
      • label1, label2, label3, ...: One or more label columns, holding either one-hot encoded multi-class labels or multi-label targets. For multi-class and binary classification, a single label column is sufficient.
        Note
        • H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive. In multi-label problems, each class is an independent label, so a node can carry several labels at once. Given N label columns, only a single column can be set to 1 per row in a multi-class problem, while in a multi-label problem any number of columns, from none to all, can be set to 1.
        • For binary classification experiments utilizing precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is transformed into multiple columns using one-hot encoding. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
      • fold (optional): This column should contain cross-validation fold indexes.
        note

        The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

    • (Component 3) meta_node.csv: This file contains metadata about the node types used in the graph. It should have the following columns:
      • node_type: This column should contain the names of the node types.
      • file_name: This column should contain the feature dataframe file names for every node type.
      • is_target: This column should indicate which node type to train and predict on. It should have binary 0/1 values, with 1 referring to the target node type. There must be one and only one target node type.
    • (Component 4) feature_dataframe_for_nodetype1.csv, feature_dataframe_for_nodetype2.csv, ...: These are CSV files that store the features for each node type in the graph. Each file should have the following columns:
      • node_id: This column should contain the node IDs.
      • Numerical features (optional): Any number of numerical features with column names ending with _num.
      • Categorical features (optional): Any number of categorical features with column names ending with _cat.
      • Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
      • Text features (optional): Any number of text features with column names ending with _txt.
      • Image features (optional): Any number of image features with column names ending with _img. The image feature columns should only contain image_name.image_extension.
    • (Component 5) meta_relation.csv: This file contains metadata about the relations between nodes in the graph. It should have the following columns:
      • src_node_type: This column should contain the names of the source node types.
      • dst_node_type: This column should contain the names of the destination node types. The destination node type can be the same as the source node type.
      • edge_type: This column should contain the names of the edge types.
        note

        In the case of multiple relations between the same source and destination node types, they can be differentiated by using different edge types.

      • file_name: This column should contain the feature dataframe file names for every relation.
      • is_directional: This column indicates whether the relation is directional or not.
    • (Component 6) feature_dataframe_for_relation1.csv, feature_dataframe_for_relation2.csv, ...: These are CSV files that store the features for each relation in the graph. Each file should have the following columns:
      • src_node_id: This column should contain the source node IDs.
      • dst_node_id: This column should contain the destination node IDs.
      • Numerical features (optional): Any number of numerical features with column names ending with _num.
      • Categorical features (optional): Any number of categorical features with column names ending with _cat.
      • Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
      • Text features (optional): Any number of text features with column names ending with _txt.
      • Image features (optional): Any number of image features with column names ending with _img. The image feature columns should only contain image_name.image_extension.
    • (Component 7) image_folder_name1, image_folder_name2, ...: These are folders that contain the image files used as features in the graph. Each folder should be named after the corresponding image feature column in the feature dataframes.
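
Two of the rules above lend themselves to a quick pre-upload check: feature columns must end with one of the listed suffixes, and meta_node.csv must flag exactly one target node type. The sketch below is an illustration only. The helper names unexpected_columns and target_node_types are hypothetical and not part of H2O Hydrogen Torch.

```python
import csv
import io

# Suffixes listed in the Components section for feature columns.
FEATURE_SUFFIXES = ("_num", "_cat", "_dtm", "_txt", "_img")


def unexpected_columns(header, id_columns):
    """Return columns that are neither ID columns nor suffixed feature columns."""
    return [
        name
        for name in header
        if name not in id_columns and not name.endswith(FEATURE_SUFFIXES)
    ]


def target_node_types(meta_node_csv_text):
    """Return the node types flagged with is_target == 1 in meta_node.csv content."""
    reader = csv.DictReader(io.StringIO(meta_node_csv_text))
    return [row["node_type"] for row in reader if row["is_target"].strip() == "1"]


# Sample meta_node.csv content with one target node type.
meta_node = "node_type,file_name,is_target\npaper,paper.csv,1\nauthor,author.csv,0\n"
targets = target_node_types(meta_node)

# "year" lacks a recognized suffix, so it is reported.
bad = unexpected_columns(["node_id", "title_txt", "year"], id_columns={"node_id"})

print(targets, bad)
```

A dataset passes the target rule when the returned list has exactly one entry. For relation feature files, pass {"src_node_id", "dst_node_id"} as id_columns.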

Important notes

Please note the following:

  • Train, validation, and test CSV files: Besides the train.csv file, you can have multiple CSV files in the zip file that you can use as training, validation, and test dataframes.
    • A train CSV file must follow the format described in the Components section.
    • A validation CSV file must follow the same format as a train CSV file.
    • A test CSV file must follow the same format as a train CSV file but does not require a label column(s).
      • For inference with a graph node classification model, the inference dataset only needs to contain a test CSV file, as long as the graph structure is the same as the one utilized in the training dataset. If the test graph is different, you need to include all the graph data; that is, the dataset needs to contain components 3 - 7 (described in the Components section).
  • meta_relation.csv: When defining the columns for the meta_relation.csv file, recall that in graph theory a relation is a connection or association between two nodes in a graph. Together, the following three components define the nature of the relationship between nodes:
    1. Source node type (src_node_type): This is the type of node from which the relation originates. For example, in a social network graph, the source node type could be "user."
    2. Destination node type (dst_node_type): This is the type of node that the relation points to. In the same social network example, the destination node type could be "group."
    3. Edge type (edge_type): This is the type of connection or association between the source and destination nodes. In the social network example, the edge type could be "joins" or "belongs to."
  • meta_node.csv and meta_relation.csv: The meta_node.csv and meta_relation.csv files are mandatory, and their names cannot be modified.
  • Image folders: Image folders should contain the image files specified in the image feature columns of the feature dataframes.
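
The image folder rule can be checked before uploading. The sketch below assumes that each image is stored at "<image column name>/<file name>" relative to the root of the zip file, which is how the Components section reads. The function name missing_images is hypothetical.

```python
import csv
import io


def missing_images(feature_csv_text, archive_names):
    """List 'column/file' paths referenced by _img columns but absent from the archive."""
    names = set(archive_names)
    missing = []
    for row in csv.DictReader(io.StringIO(feature_csv_text)):
        for column, value in row.items():
            # Skip non-image columns and empty (missing) cells.
            if column and column.endswith("_img") and value:
                expected = f"{column}/{value}"
                if expected not in names:
                    missing.append(expected)
    return missing


# Sample feature dataframe with one image column.
author_csv = (
    "node_id,job_title_cat,age_num,photo_img\n"
    "author1,professor,41,a1.jpg\n"
    "author2,researcher,,a2.jpg\n"
    "author3,student,28,a3.png\n"
)
# Names as returned by zipfile.ZipFile.namelist(); a3.png is deliberately absent.
archive_names = ["author.csv", "photo_img/a1.jpg", "photo_img/a2.jpg"]

missing = missing_images(author_csv, archive_names)
print(missing)
```

An empty result means every image named in the feature dataframe was found in its folder.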

Example

We present a hypothetical dataset for graph node classification to aid users in understanding the dataset format. This dataset comprises two node types, namely "paper" and "author," and three relations: "author_wrote_paper," "author_collaborated_author," and "paper_cited_paper." The objective is to predict the research categories of papers in the graph. The dataset is organized within a zip file, and its structure is detailed below.

hypothetical_graph_node_classification.zip

├───train.csv (1)
├───validation.csv (2)

├───meta_node.csv (3)

├───paper.csv (4)
├───author.csv (5)

├───meta_relation.csv (6)

├───author_wrote_paper.csv (7)
├───author_collaborated_author.csv (8)
├───paper_cited_paper.csv (9)

└───photo_img (10)
    ├───a1.jpg
    ├───a2.jpg
    └───a3.png

  1. The train.csv file contains labeled data used for training the model and has the following structure:

    | node_id | label      |
    |---------|------------|
    | paper2  | physics    |
    | paper4  | statistics |

    Note

    It is normal for a graph to contain nodes that are unlabeled and of no interest to the task. Not every node in the dataset needs to have a label or receive a prediction.

    For example, the dataset contains a paper with the ID "paper1" that appears in neither train.csv nor validation.csv. This paper is neither labeled nor predicted, so "paper1" is an unlabeled node of no interest for this particular classification task.

    Such nodes are common in real-world datasets. They can result from missing data, incomplete information, or a task that focuses on specific nodes or subsets of the graph. Machine learning models can handle these scenarios by ignoring or excluding such nodes during the training and prediction phases.

  2. The validation.csv file contains labeled data used for validating the model and has the following structure:

    | node_id | label   |
    |---------|---------|
    | paper3  | physics |

  3. The meta_node.csv file contains information about node types and their corresponding files. It has the following structure:

    | node_type | file_name  | is_target |
    |-----------|------------|-----------|
    | paper     | paper.csv  | 1         |
    | author    | author.csv | 0         |

  4. The paper.csv file contains details about papers in the dataset and has the following structure:

    | node_id | title_txt | abstract_txt |
    |---------|-----------|--------------|
    | paper1  | title 1   | abstract 1   |
    | paper2  |           | abstract 2   |
    | paper3  | title 3   | abstract 3   |
    | paper4  | title 4   |              |

    note

    For this node type, there are four nodes and two text features. Note that some features may have missing values.

  5. The author.csv file contains information about authors in the dataset and has the following structure:

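    | node_id | job_title_cat | age_num | photo_img |
    |---------|---------------|---------|-----------|
    | author1 | professor     | 41      | a1.jpg    |
    | author2 | researcher    |         | a2.jpg    |
    | author3 | student       | 28      | a3.png    |
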
    note

    For this node type, there are three nodes, and it includes one categorical feature, one numerical feature, and one image feature. Because of the image column name, you need to store the images in the "photo_img" directory.

  6. The meta_relation.csv file provides information about relations between different node types. It has the following structure:

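    | src_node_type | dst_node_type | edge_type    | file_name                      | is_directional |
    |---------------|---------------|--------------|--------------------------------|----------------|
    | author        | paper         | wrote        | author_wrote_paper.csv         | 1              |
    | author        | author        | collaborated | author_collaborated_author.csv | 0              |
    | paper         | paper         | cited        | paper_cited_paper.csv          | 1              |
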
    note

    For directional relations like "author_wrote_paper," there is no need to add reverse relations like "paper_writtenby_author."

  7. The author_wrote_paper.csv file contains connections between authors and papers and has the following structure:

    | src_node_id | dst_node_id |
    |-------------|-------------|
    | author1     | paper1      |
    | author1     | paper2      |
    | author2     | paper3      |
    | author2     | paper4      |
    | author3     | paper3      |

  8. The author_collaborated_author.csv file contains information about collaborations between authors and has the following structure:

    | src_node_id | dst_node_id | year_dtm   |
    |-------------|-------------|------------|
    | author2     | author3     | 2018-09-22 |

  9. The paper_cited_paper.csv file contains information about citations between papers and has the following structure:

    | src_node_id | dst_node_id |
    |-------------|-------------|
    | paper4      | paper3      |
    | paper2      | paper1      |

    note

    Even in non-directional relations, each edge is defined with a source node and a destination node. However, there is no need to add reverse edges. For example, if "paper4" cites "paper3," there is no need to include "paper3" being cited by "paper4."

  10. The photo_img folder (directory) contains the image files used as features in the graph. The folder name must be the same as the image feature column name.
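
The ten components of this example can be assembled into a zip file with Python's standard csv and zipfile modules. The following is a sketch under stated assumptions: the rows are copied from the example tables above, empty strings stand for the missing values, and the image files are empty placeholders rather than real images. The names TABLES, IMAGES, and build_example_zip are illustrative only.

```python
import csv
import io
import zipfile

# Rows copied from the example tables; empty strings mark missing values.
TABLES = {
    "train.csv": [["node_id", "label"], ["paper2", "physics"], ["paper4", "statistics"]],
    "validation.csv": [["node_id", "label"], ["paper3", "physics"]],
    "meta_node.csv": [
        ["node_type", "file_name", "is_target"],
        ["paper", "paper.csv", 1],
        ["author", "author.csv", 0],
    ],
    "paper.csv": [
        ["node_id", "title_txt", "abstract_txt"],
        ["paper1", "title 1", "abstract 1"],
        ["paper2", "", "abstract 2"],
        ["paper3", "title 3", "abstract 3"],
        ["paper4", "title 4", ""],
    ],
    "author.csv": [
        ["node_id", "job_title_cat", "age_num", "photo_img"],
        ["author1", "professor", 41, "a1.jpg"],
        ["author2", "researcher", "", "a2.jpg"],
        ["author3", "student", 28, "a3.png"],
    ],
    "meta_relation.csv": [
        ["src_node_type", "dst_node_type", "edge_type", "file_name", "is_directional"],
        ["author", "paper", "wrote", "author_wrote_paper.csv", 1],
        ["author", "author", "collaborated", "author_collaborated_author.csv", 0],
        ["paper", "paper", "cited", "paper_cited_paper.csv", 1],
    ],
    "author_wrote_paper.csv": [
        ["src_node_id", "dst_node_id"],
        ["author1", "paper1"],
        ["author1", "paper2"],
        ["author2", "paper3"],
        ["author2", "paper4"],
        ["author3", "paper3"],
    ],
    "author_collaborated_author.csv": [
        ["src_node_id", "dst_node_id", "year_dtm"],
        ["author2", "author3", "2018-09-22"],
    ],
    "paper_cited_paper.csv": [
        ["src_node_id", "dst_node_id"],
        ["paper4", "paper3"],
        ["paper2", "paper1"],
    ],
}
# The folder name matches the image feature column name in author.csv.
IMAGES = ["photo_img/a1.jpg", "photo_img/a2.jpg", "photo_img/a3.png"]


def build_example_zip():
    """Return the example dataset as the bytes of a zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, rows in TABLES.items():
            text = io.StringIO()
            csv.writer(text, lineterminator="\n").writerows(rows)
            archive.writestr(file_name, text.getvalue())
        for image_path in IMAGES:
            archive.writestr(image_path, b"")  # placeholder, not a real image
    return buffer.getvalue()


zip_bytes = build_example_zip()
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
    paper_csv = archive.read("paper.csv").decode()

print(len(names))
```

Writing the returned bytes to a file with a .zip extension yields an archive laid out like the tree at the start of this example, apart from the placeholder images.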

