Version: v1.4.0

Dataset format: Graph node classification

Overview

The dataset for a graph node classification experiment needs to be organized in a specific folder structure and format.

Dataset format

Dataset structure

Here is an example of what the folder structure should look like:

folder_name.zip (1)
│   
├───train.csv (2)
│   
├───meta_node.csv (3)
│   
├───feature_dataframe_for_nodetype1.csv (4)
├───feature_dataframe_for_nodetype2.csv (4)
│   ...
│   
├───meta_relation.csv (5)
│   
├───feature_dataframe_for_relation1.csv (6)
├───feature_dataframe_for_relation2.csv (6)
│   ...
│   
├───image_folder_name1 (7)
│   ├───image_name1.image_extension
│   ├───image_name2.image_extension
│   ...
│   
├───image_folder_name2 (7)
│   ├───image_name1.image_extension
│   ├───image_name2.image_extension
│   ...
...

Components

(Component 1) Create a zip file (for example, folder_name.zip) that will contain all the necessary files and folders.
Inside the zip file, include the following files and folders:
- (Component 2) train.csv: This is the training data file in CSV format. It should contain the following columns:
  - node_id: This column should contain the IDs of the target node type.
  - label1, label2, label3, ...: One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.
    Note
    H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive. In multi-label problems, each class is a separate label, so a node can carry several labels at once. For N label columns, only one column can be set to 1 in a multi-class problem, while any number of columns, from none to all, can be set to 1 in a multi-label problem.
    
    For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column must contain 0/1 values. If the column contains string values, it is transformed into multiple columns with a one-hot encoder. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
  - fold (optional): This column should contain cross-validation fold indexes.
    note
    The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
- (Component 3) meta_node.csv: This file contains metadata about the node types used in the graph. It should have the following columns:
  - node_type: This column should contain the names of the node types.
  - file_name: This column should contain the feature dataframe file names for every node type.
  - is_target: This column should indicate which node type to train and predict on. It should have binary 0/1 values, with 1 referring to the target node type. There must be one and only one target node type.
- (Component 4) feature_dataframe_for_nodetype1.csv, feature_dataframe_for_nodetype2.csv, ...: These are CSV files that store the features for each node type in the graph. Each file should have the following columns:
  - node_id: This column should contain the node IDs.
  - Numerical features (optional): Any number of numerical features with column names ending with _num.
  - Categorical features (optional): Any number of categorical features with column names ending with _cat.
  - Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
  - Text features (optional): Any number of text features with column names ending with _txt.
  - Image features (optional): Any number of image features with column names ending with _img. The image feature columns should only contain image_name.image_extension.
- (Component 5) meta_relation.csv: This file contains metadata about the relations between nodes in the graph. It should have the following columns:
  - src_node_type: This column should contain the names of the source node types.
  - dst_node_type: This column should contain the names of the destination node types. The destination node type can be the same as the source node type.
  - edge_type: This column should contain the names of the edge types.
    Note
    In the case of multiple relations between the same source and destination node types, they can be differentiated by using different edge types.
  - file_name: This column should contain the feature dataframe file names for every relation.
  - is_directional: This column indicates whether the relation is directional or not.
- (Component 6) feature_dataframe_for_relation1.csv, feature_dataframe_for_relation2.csv, ...: These are CSV files that store the features for each relation in the graph. Each file should have the following columns:
  - src_node_id: This column should contain the source node IDs.
  - dst_node_id: This column should contain the destination node IDs.
  - Numerical features (optional): Any number of numerical features with column names ending with _num.
  - Categorical features (optional): Any number of categorical features with column names ending with _cat.
  - Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
  - Text features (optional): Any number of text features with column names ending with _txt.
  - Image features (optional): Any number of image features with column names ending with _img. The image feature columns should only contain image_name.image_extension.
- (Component 7) image_folder_name1, image_folder_name2, ...: These are folders that contain the image files used as features in the graph. Each folder should be named after the corresponding image feature column in the feature dataframes.
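The column naming rules above lend themselves to a quick pre-upload check. The following is a minimal sketch, not part of H2O Hydrogen Torch: the helper name `unrecognized_columns` and the sample data are hypothetical, and it assumes comma-delimited files. How the tool treats a column without a recognized suffix is not stated here, so the sketch only reports such columns.

```python
import csv
import io

# Suffixes listed in the Components section; ID columns carry no suffix.
FEATURE_SUFFIXES = ("_num", "_cat", "_dtm", "_txt", "_img")
ID_COLUMNS = {"node_id", "src_node_id", "dst_node_id"}


def unrecognized_columns(csv_text):
    """Return header names that are neither ID columns nor suffixed features."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [
        name for name in header
        if name not in ID_COLUMNS and not name.endswith(FEATURE_SUFFIXES)
    ]


# Hypothetical author feature file in which "age" lacks its _num suffix.
sample = "node_id,job_title_cat,age,photo_img\nauthor1,professor,41,a1.jpg\n"
bad_columns = unrecognized_columns(sample)
print(bad_columns)
```

Running the same helper over every feature dataframe before zipping can catch a misnamed column early.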

Important notes

Please note the following:

Train, validation, and test CSV files: Besides the train.csv file, you can have multiple CSV files in the zip file that you can use as training, validation, and test dataframes.
- A train CSV file must follow the format described in the Components section.
- A validation CSV file must follow the same format as a train CSV file.
- A test CSV file must follow the same format as a train CSV file but does not require label columns.
  - For inference with a graph node classification model, the inference dataset needs to contain only a test CSV file, provided the graph structure is the same as the one used in the training dataset. If the test graph is different, you need to include all the graph data; that is, the dataset needs to contain components 3 to 7 (described in the Components section).
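Because a test CSV file has the train format minus the labels, one way to produce it is to drop the label columns from a file that is already in train format. This is a minimal sketch using only the Python standard library; the helper name `drop_columns` is hypothetical and the files are assumed to be comma-delimited.

```python
import csv
import io


def drop_columns(csv_text, columns_to_drop):
    """Return csv_text without the named columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # dropped columns are ignored on write
    return out.getvalue()


labeled = "node_id,label\npaper2,physics\npaper4,statistics\n"
test_csv = drop_columns(labeled, {"label"})
print(test_csv)
```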
meta_relation.csv: When defining the columns for the meta_relation.csv file, recall the following: In the context of graph theory, a relation refers to a connection or association between two nodes in a graph. Three components define a relation, and together, these components define the nature of the relationship between nodes in the graph:
1. Source node type (src_node_type): This is the type of node from which the relation originates. For example, in a social network graph, the source node type could be "user."
2. Destination node type (dst_node_type): This is the type of node that the relation points to. In the same social network example, the destination node type could be "group."
3. Edge type (edge_type): This is the type of connection or association between the source and destination nodes. In the social network example, the edge type could be "joins" or "belongs to."
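As an illustration of the three components, a relation can be modeled as a triple. The sketch below is only a mental model, not how H2O Hydrogen Torch stores relations; the `Relation` class, the `duplicated` helper, and the "moderates" edge type are hypothetical. It shows that two relations between the same node types stay distinct as long as their edge types differ.

```python
from collections import Counter
from typing import NamedTuple


class Relation(NamedTuple):
    src_node_type: str
    edge_type: str
    dst_node_type: str


def duplicated(relations):
    """Return relations whose full (source, edge, destination) triple repeats."""
    counts = Counter(relations)
    return [relation for relation, n in counts.items() if n > 1]


relations = [
    Relation("user", "joins", "group"),
    Relation("user", "moderates", "group"),  # same node types, different edge type
]
print(duplicated(relations))
```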
meta_node.csv and meta_relation.csv: The meta_node.csv and meta_relation.csv files are mandatory, and their names cannot be modified.
Image folders: Image folders should contain the image files specified in the image feature columns of the feature dataframes.

Example

We present a hypothetical dataset for graph node classification to aid users in understanding the dataset format. This dataset comprises two node types, namely "paper" and "author," and three relations: "author_wrote_paper," "author_collaborated_author," and "paper_cited_paper." The objective is to predict the research categories of papers in the graph. The dataset is organized within a zip file, and its structure is detailed below.

hypothetical_graph_node_classification.zip
│   
├───train.csv (1)
├───validation.csv (2)
│   
├───meta_node.csv (3)
│   
├───paper.csv (4)
├───author.csv (5)
│   
├───meta_relation.csv (6)
│   
├───author_wrote_paper.csv (7)
├───author_collaborated_author.csv (8)
├───paper_cited_paper.csv (9)
│   
└───photo_img (10)
    ├───a1.jpg
    ├───a2.jpg
    └───a3.png
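A layout like the one above can be packaged with Python's standard `zipfile` module. The sketch below assumes that the files are expected at the top level of the archive, as the tree suggests; the helper name `zip_dataset` and the stand-in files are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir, zip_path):
    """Zip dataset_dir so that its files sit at the top level of the archive."""
    dataset_dir = Path(dataset_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(dataset_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the dataset folder, with "/" separators.
                archive.write(path, path.relative_to(dataset_dir).as_posix())
    return zip_path


# Build a tiny stand-in dataset in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    (root / "photo_img").mkdir(parents=True)
    (root / "train.csv").write_text("node_id,label\npaper2,physics\n")
    (root / "photo_img" / "a1.jpg").write_bytes(b"")  # placeholder, not a real image
    archive_path = zip_dataset(root, Path(tmp) / "dataset.zip")
    with zipfile.ZipFile(archive_path) as archive:
        names = sorted(archive.namelist())

print(names)
```

Writing the archive outside the dataset folder keeps the zip file from being added to itself.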

The train.csv file contains labeled data used for training the model and has the following structure:

node_id label
paper2 physics
paper4 statistics

Note
It is normal for a graph to contain nodes that are unlabeled or of no interest for the task at hand. Not every node in the dataset needs a label or a prediction.
For example, the dataset contains a paper with the ID "paper1." This paper is neither labeled nor predicted for the task of predicting the research category, so "paper1" is an unlabeled node of no interest for this particular classification task.
Such nodes are common in real-world datasets. They can result from missing data, incomplete information, or a task that focuses on specific nodes or subsets of the graph. Machine learning models can handle these scenarios by ignoring or excluding such nodes during the training and prediction phases.

The validation.csv file contains labeled data used for validating the model and has the following structure:

node_id label
paper3 physics

The meta_node.csv file contains information about node types and their corresponding files. It has the following structure:

node_type file_name is_target
paper paper.csv 1
author author.csv 0
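The rule that exactly one node type is the target can be checked before upload. A minimal sketch, assuming a comma-delimited meta_node.csv; the helper name `target_node_types` is hypothetical.

```python
import csv
import io


def target_node_types(meta_node_text):
    """Return the node types whose is_target value is 1."""
    rows = csv.DictReader(io.StringIO(meta_node_text))
    return [row["node_type"] for row in rows if row["is_target"].strip() == "1"]


meta_node = (
    "node_type,file_name,is_target\n"
    "paper,paper.csv,1\n"
    "author,author.csv,0\n"
)
targets = target_node_types(meta_node)
if len(targets) != 1:
    raise ValueError(f"expected exactly one target node type, found {targets}")
print(targets)
```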

The paper.csv file contains details about papers in the dataset and has the following structure:

node_id  title_txt  abstract_txt
paper1   title 1    abstract 1
paper2              abstract 2
paper3   title 3    abstract 3
paper4   title 4

note
For this node type, there are four nodes and two text features. Note that some features may have missing values.
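With train.csv, validation.csv, and paper.csv all shown, the unlabeled nodes mentioned earlier can be found as a set difference. This is a minimal sketch over the example data, assuming comma-delimited files; the helper name `node_ids` is hypothetical.

```python
import csv
import io


def node_ids(csv_text):
    """Return the set of node_id values in a CSV document."""
    return {row["node_id"] for row in csv.DictReader(io.StringIO(csv_text))}


paper_csv = (
    "node_id,title_txt,abstract_txt\n"
    "paper1,title 1,abstract 1\n"
    "paper2,,abstract 2\n"  # missing title
    "paper3,title 3,abstract 3\n"
    "paper4,title 4,\n"  # missing abstract
)
train_csv = "node_id,label\npaper2,physics\npaper4,statistics\n"
validation_csv = "node_id,label\npaper3,physics\n"

unlabeled = sorted(node_ids(paper_csv) - node_ids(train_csv) - node_ids(validation_csv))
print(unlabeled)
```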

The author.csv file contains information about authors in the dataset and has the following structure:

node_id  job_title_cat  age_num  photo_img
author1  professor      41       a1.jpg
author2  researcher              a2.jpg
author3  student        28       a3.png

Note
This node type has three nodes with one categorical feature, one numerical feature, and one image feature. Because the image feature column is named photo_img, the images must be stored in the "photo_img" folder.

The meta_relation.csv file provides information about relations between different node types. It has the following structure:

src_node_type dst_node_type edge_type file_name is_directional
author paper wrote author_wrote_paper.csv 1
author author collaborated author_collaborated_author.csv 0
paper paper cited paper_cited_paper.csv 1

note
For directional relations like "author_wrote_paper," there is no need to add reverse relations like "paper_writtenby_author."
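A simple consistency check is to confirm that every node type named in meta_relation.csv is also declared in meta_node.csv. The sketch below assumes this is required, which the format implies but does not state outright. The helper name `unknown_node_types` is hypothetical, and the files are assumed to be comma-delimited.

```python
import csv
import io


def unknown_node_types(meta_node_text, meta_relation_text):
    """Return node types used in meta_relation.csv but absent from meta_node.csv."""
    known = {
        row["node_type"] for row in csv.DictReader(io.StringIO(meta_node_text))
    }
    missing = set()
    for row in csv.DictReader(io.StringIO(meta_relation_text)):
        for column in ("src_node_type", "dst_node_type"):
            if row[column] not in known:
                missing.add(row[column])
    return sorted(missing)


meta_node = "node_type,file_name,is_target\npaper,paper.csv,1\nauthor,author.csv,0\n"
meta_relation = (
    "src_node_type,dst_node_type,edge_type,file_name,is_directional\n"
    "author,paper,wrote,author_wrote_paper.csv,1\n"
    "author,author,collaborated,author_collaborated_author.csv,0\n"
    "paper,paper,cited,paper_cited_paper.csv,1\n"
)
print(unknown_node_types(meta_node, meta_relation))
```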

The author_wrote_paper.csv file contains connections between authors and papers and has the following structure:

src_node_id dst_node_id
author1 paper1
author1 paper2
author2 paper3
author2 paper4
author3 paper3

The author_collaborated_author.csv file contains information about collaborations between authors and has the following structure:

src_node_id dst_node_id year_dtm
author2 author3 2018-09-22
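The accepted datetime formats for _dtm columns are not listed here. Assuming ISO 8601 dates like the one in the example, values can be sanity-checked with the standard library before upload. The helper name `parse_dtm` is hypothetical.

```python
from datetime import date


def parse_dtm(value):
    """Parse an ISO 8601 date string; return None for an empty cell."""
    return date.fromisoformat(value) if value else None


collaboration_date = parse_dtm("2018-09-22")
print(collaboration_date.year)
```

A value that does not follow the assumed format raises `ValueError`, which points to the offending cell.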

The paper_cited_paper.csv file contains information about citations between papers and has the following structure:

src_node_id dst_node_id
paper4 paper3
paper2 paper1

Note
Even in non-directional relations, each edge is stored with a source node and a destination node. However, there is no need to add reverse edges. For example, if "paper4" cites "paper3," there is no need to add a row for "paper3" being cited by "paper4."

The photo_img folder (directory) contains the image files used as features in the graph. The folder name must be the same as the image feature column name.
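Since the folder name must match the image feature column name, a short check can confirm that every file listed in an _img column exists in the matching folder. A minimal sketch; the helper name `missing_images` is hypothetical, and the placeholder files are not real images.

```python
import tempfile
from pathlib import Path


def missing_images(dataset_dir, column_name, file_names):
    """Return listed image names not found in the folder named after the column."""
    folder = Path(dataset_dir) / column_name
    return [name for name in file_names if name and not (folder / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "photo_img"
    folder.mkdir()
    for name in ("a1.jpg", "a2.jpg"):
        (folder / name).write_bytes(b"")  # empty placeholder files
    # a3.png is listed in the column but was never created above.
    missing = missing_images(tmp, "photo_img", ["a1.jpg", "a2.jpg", "a3.png"])

print(missing)
```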
