Dataset format: Graph node classification
Overview
The dataset for a graph node classification experiment needs to be organized in a specific folder structure and format.
Dataset format
Dataset structure
Here is an example of what the folder structure should look like:
folder_name.zip (1)
│
├───train.csv (2)
│
├───meta_node.csv (3)
│
├───feature_dataframe_for_nodetype1.csv (4)
├───feature_dataframe_for_nodetype2.csv (4)
│ ...
│
├───meta_relation.csv (5)
│
├───feature_dataframe_for_relation1.csv (6)
├───feature_dataframe_for_relation2.csv (6)
│ ...
│
├───image_folder_name1 (7)
│ ├───image_name1.image_extension
│ ├───image_name2.image_extension
│ ...
│
├───image_folder_name2 (7)
│ ├───image_name1.image_extension
│ ├───image_name2.image_extension
│ ...
...
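As a rough illustration, the archive could be assembled with Python's standard zipfile module. The helper name build_dataset_zip is hypothetical, and the sketch assumes that the CSV files and image folders sit directly under the source folder, so that they end up at the top level of the zip file as in the tree above.

```python
import zipfile
from pathlib import Path


def build_dataset_zip(source_dir: str, zip_path: str) -> list[str]:
    """Zip every file under source_dir, keeping paths relative to source_dir.

    Returns the archive member names that were written.
    Write the zip outside source_dir so that it does not include itself.
    """
    source = Path(source_dir)
    members = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store "train.csv", "image_folder/img.jpg", ... without the parent folder.
                arcname = path.relative_to(source).as_posix()
                archive.write(path, arcname)
                members.append(arcname)
    return members
```

Calling build_dataset_zip("my_graph_dataset", "my_graph_dataset.zip") (both names are placeholders) would produce an archive whose entries start at train.csv, meta_node.csv, and so on, rather than at a wrapping folder.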
Components
- (Component 1) Create a zip file (for example, folder_name.zip) that contains all the necessary files and folders.
- Inside the zip file, include the following files and folders:
  - (Component 2) train.csv: This is the training data file in CSV format. It should contain the following columns:
    - node_id: This column should contain the IDs of nodes of the target node type.
    - label1, label2, label3, ...: One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.
      Note:
      - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, while in multi-label problems each class is a separate label and a node can carry several of them. With N label columns, only a single column can be set to 1 in each row of a multi-class problem, while in a multi-label problem any number of columns, from none to all, can be set to 1.
      - For binary classification experiments utilizing precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is transformed into multiple columns using a one-hot encoder. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
    - fold (optional): This column should contain cross-validation fold indexes.
      Note: The fold column can include integers (0, 1, 2, ..., N-1 or 1, 2, 3, ..., N) or categorical values.
  - (Component 3) meta_node.csv: This file contains metadata about the node types used in the graph. It should have the following columns:
    - node_type: This column should contain the names of the node types.
    - file_name: This column should contain the feature dataframe file name for each node type.
    - is_target: This column should indicate which node type to train and predict on. It should have binary 0/1 values, with 1 referring to the target node type. There must be exactly one target node type.
  - (Component 4) feature_dataframe_for_nodetype1.csv, feature_dataframe_for_nodetype2.csv, ...: These are CSV files that store the features for each node type in the graph. Each file should have the following columns:
    - node_id: This column should contain the node IDs.
    - Numerical features (optional): Any number of numerical features with column names ending with _num.
    - Categorical features (optional): Any number of categorical features with column names ending with _cat.
    - Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
    - Text features (optional): Any number of text features with column names ending with _txt.
    - Image features (optional): Any number of image features with column names ending with _img. The image feature columns should contain only file names of the form image_name.image_extension.
  - (Component 5) meta_relation.csv: This file contains metadata about the relations between nodes in the graph. It should have the following columns:
    - src_node_type: This column should contain the names of the source node types.
    - dst_node_type: This column should contain the names of the destination node types. The destination node type can be the same as the source node type.
    - edge_type: This column should contain the names of the edge types.
      Note: When there are multiple relations between the same source and destination node types, they can be differentiated by using different edge types.
    - file_name: This column should contain the feature dataframe file name for each relation.
    - is_directional: This column indicates whether the relation is directional or not.
  - (Component 6) feature_dataframe_for_relation1.csv, feature_dataframe_for_relation2.csv, ...: These are CSV files that store the features for each relation in the graph. Each file should have the following columns:
    - src_node_id: This column should contain the source node IDs.
    - dst_node_id: This column should contain the destination node IDs.
    - Numerical features (optional): Any number of numerical features with column names ending with _num.
    - Categorical features (optional): Any number of categorical features with column names ending with _cat.
    - Pandas datetime features (optional): Any number of pandas datetime features with column names ending with _dtm.
    - Text features (optional): Any number of text features with column names ending with _txt.
    - Image features (optional): Any number of image features with column names ending with _img. The image feature columns should contain only file names of the form image_name.image_extension.
  - (Component 7) image_folder_name1, image_folder_name2, ...: These are folders that contain the image files used as features in the graph. Each folder should be named after the corresponding image feature column in the feature dataframes.
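The suffix convention for feature columns (components 4 and 6) can be checked before uploading. The following is a minimal sketch: the names FEATURE_SUFFIXES, ID_COLUMNS, and classify_columns are hypothetical, and the "unknown" category is an assumption made here for illustration, since this page does not say how columns without a recognized suffix are treated.

```python
# Suffixes taken from the Components section; all names here are illustrative.
FEATURE_SUFFIXES = {
    "_num": "numerical",
    "_cat": "categorical",
    "_dtm": "datetime",
    "_txt": "text",
    "_img": "image",
}
ID_COLUMNS = {"node_id", "src_node_id", "dst_node_id"}


def classify_columns(columns):
    """Map each column name to its feature kind, "id", or "unknown"."""
    kinds = {}
    for name in columns:
        if name in ID_COLUMNS:
            kinds[name] = "id"
            continue
        for suffix, kind in FEATURE_SUFFIXES.items():
            if name.endswith(suffix):
                kinds[name] = kind
                break
        else:
            # No known suffix matched: flag the column for review.
            kinds[name] = "unknown"
    return kinds


header = ["node_id", "job_title_cat", "age_num", "photo_img", "nickname"]
print(classify_columns(header))
```

In this made-up header, "nickname" has no recognized suffix and is reported as "unknown", which is a hint to rename it (for example, to nickname_txt) or drop it.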
Important notes
Please note the following:
- Train, validation, and test CSV files: Besides the train.csv file, the zip file can contain additional CSV files to use as training, validation, and test dataframes.
  - A train CSV file must follow the format described in the Components section.
  - A validation CSV file must follow the same format as a train CSV file.
  - A test CSV file must follow the same format as a train CSV file but does not require label columns.
  - For inference with a graph node classification model, the inference dataset only needs to contain a test CSV file, provided that the graph structure is the same as the one in the training dataset. If the test graph is different, you need to include all the graph data; that is, the dataset needs to contain components 3 to 7 (described in the Components section).
- meta_relation.csv: When defining the columns for the meta_relation.csv file, recall the following: In the context of graph theory, a relation refers to a connection or association between two nodes in a graph. Three components together define the nature of a relation between nodes:
  - Source node type (src_node_type): This is the type of node from which the relation originates. For example, in a social network graph, the source node type could be "user."
  - Destination node type (dst_node_type): This is the type of node that the relation points to. In the same social network example, the destination node type could be "group."
  - Edge type (edge_type): This is the type of connection or association between the source and destination nodes. In the social network example, the edge type could be "joins" or "belongs to."
- meta_node.csv and meta_relation.csv: The meta_node.csv and meta_relation.csv files are mandatory, and their names cannot be modified.
- Image folders: Image folders should contain the image files specified in the image feature columns of the feature dataframes.
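The rules for the two metadata files lend themselves to a quick consistency check. The sketch below uses the social network example from the notes above; the file names (user.csv, group.csv, user_joins_group.csv) and the function name check_meta are hypothetical. It also assumes that every node type named in meta_relation.csv should appear in meta_node.csv, which the format implies but does not state outright.

```python
import csv
import io

# Hypothetical metadata for the "user joins group" example.
META_NODE = """node_type,file_name,is_target
user,user.csv,1
group,group.csv,0
"""

META_RELATION = """src_node_type,dst_node_type,edge_type,file_name,is_directional
user,group,joins,user_joins_group.csv,1
"""


def check_meta(meta_node_text, meta_relation_text):
    """Return a list of problems found in the two metadata tables."""
    nodes = list(csv.DictReader(io.StringIO(meta_node_text)))
    relations = list(csv.DictReader(io.StringIO(meta_relation_text)))
    problems = []

    # There must be one and only one target node type.
    targets = [row["node_type"] for row in nodes if row["is_target"] == "1"]
    if len(targets) != 1:
        problems.append(f"expected exactly one target node type, found {len(targets)}")

    # Assumption: relation endpoints should be node types listed in meta_node.csv.
    known = {row["node_type"] for row in nodes}
    for row in relations:
        for column in ("src_node_type", "dst_node_type"):
            if row[column] not in known:
                problems.append(f"{column} {row[column]!r} is not listed in meta_node.csv")
    return problems


print(check_meta(META_NODE, META_RELATION))
```

An empty list means that neither check found a problem; it does not guarantee that the dataset will be accepted.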
Example
The following hypothetical dataset for graph node classification illustrates the dataset format. It comprises two node types, "paper" and "author," and three relations: "author_wrote_paper," "author_collaborated_author," and "paper_cited_paper." The objective is to predict the research categories of papers in the graph. The dataset is organized within a zip file, and its structure is detailed below.
hypothetical_graph_node_classification.zip
│
├───train.csv (1)
├───validation.csv (2)
│
├───meta_node.csv (3)
│
├───paper.csv (4)
├───author.csv (5)
│
├───meta_relation.csv (6)
│
├───author_wrote_paper.csv (7)
├───author_collaborated_author.csv (8)
├───paper_cited_paper.csv (9)
│
└───photo_img (10)
    ├───a1.jpg
    ├───a2.jpg
    └───a3.png
The train.csv file contains labeled data used for training the model and has the following structure:

| node_id | label      |
|---------|------------|
| paper2  | physics    |
| paper4  | statistics |

Note: It is normal for a graph to contain nodes that are unlabeled or not of interest. Not every node in the dataset needs to have a label or be relevant to the task at hand.
For example, this dataset contains a paper with the ID "paper1" that is neither labeled nor predicted on. For the task of predicting the research category, "paper1" is therefore an unlabeled node that is not of interest.
Such nodes are common in real-world datasets, for reasons such as missing data, incomplete information, or a task that focuses on specific nodes or subsets of the graph. Machine learning models can handle such scenarios by ignoring or excluding these nodes during the training and prediction phases.
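To see which nodes of the target type are unlabeled, the node IDs in the feature file can be compared with those in the labeled CSV files. The sketch below is illustrative only: it inlines the sample rows of paper.csv, train.csv, and validation.csv from this example (the latter two files are described further down), and the helper name node_ids is hypothetical.

```python
import csv
import io

# Sample rows from this example, inlined so the sketch is self-contained.
PAPER_CSV = """node_id,title_txt,abstract_txt
paper1,title 1,abstract 1
paper2,,abstract 2
paper3,title 3,abstract 3
paper4,title 4,
"""
TRAIN_CSV = """node_id,label
paper2,physics
paper4,statistics
"""
VALIDATION_CSV = """node_id,label
paper3,physics
"""


def node_ids(csv_text):
    """Return the node_id column of a CSV given as text."""
    return [row["node_id"] for row in csv.DictReader(io.StringIO(csv_text))]


# Nodes that exist in the graph but appear in no labeled file.
labeled = set(node_ids(TRAIN_CSV)) | set(node_ids(VALIDATION_CSV))
unlabeled = sorted(set(node_ids(PAPER_CSV)) - labeled)
print(unlabeled)  # ['paper1']
```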
The validation.csv file contains labeled data used for validating the model and has the following structure:

| node_id | label   |
|---------|---------|
| paper3  | physics |

The meta_node.csv file contains information about node types and their corresponding files. It has the following structure:

| node_type | file_name  | is_target |
|-----------|------------|-----------|
| paper     | paper.csv  | 1         |
| author    | author.csv | 0         |

The paper.csv file contains details about papers in the dataset and has the following structure:

| node_id | title_txt | abstract_txt |
|---------|-----------|--------------|
| paper1  | title 1   | abstract 1   |
| paper2  |           | abstract 2   |
| paper3  | title 3   | abstract 3   |
| paper4  | title 4   |              |

Note: For this node type, there are four nodes and two text features. Some features may have missing values.
The author.csv file contains information about authors in the dataset and has the following structure:

| node_id | job_title_cat | age_num | photo_img |
|---------|---------------|---------|-----------|
| author1 | professor     | 41      | a1.jpg    |
| author2 | researcher    |         | a2.jpg    |
| author3 | student       | 28      | a3.png    |

Note: For this node type, there are three nodes with one categorical feature, one numerical feature, and one image feature. Because the image column is named photo_img, the images need to be stored in the "photo_img" directory.
The meta_relation.csv file provides information about relations between different node types. It has the following structure:

| src_node_type | dst_node_type | edge_type    | file_name                      | is_directional |
|---------------|---------------|--------------|--------------------------------|----------------|
| author        | paper         | wrote        | author_wrote_paper.csv         | 1              |
| author        | author        | collaborated | author_collaborated_author.csv | 0              |
| paper         | paper         | cited        | paper_cited_paper.csv          | 1              |

Note: For directional relations like "author_wrote_paper," there is no need to add reverse relations like "paper_writtenby_author."
The author_wrote_paper.csv file contains connections between authors and papers and has the following structure:

| src_node_id | dst_node_id |
|-------------|-------------|
| author1     | paper1      |
| author1     | paper2      |
| author2     | paper3      |
| author2     | paper4      |
| author3     | paper3      |

The author_collaborated_author.csv file contains information about collaborations between authors and has the following structure:

| src_node_id | dst_node_id | year_dtm   |
|-------------|-------------|------------|
| author2     | author3     | 2018-09-22 |

The paper_cited_paper.csv file contains information about citations between papers and has the following structure:

| src_node_id | dst_node_id |
|-------------|-------------|
| paper4      | paper3      |
| paper2      | paper1      |

Note: Even in non-directional relations, each edge is listed with a source and a destination. However, there is no need to add reverse edges. For example, if "paper4" cites "paper3," there is no need to add a separate row for "paper3" being cited by "paper4."
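Before zipping the dataset, it can be worth confirming that every edge refers to node IDs that exist in the corresponding node files. This page does not say how edges pointing at unknown node IDs are handled, so the check below is a cautious assumption rather than a documented requirement. The function name dangling_edges is hypothetical, and the data is the author_wrote_paper.csv sample from above.

```python
import csv
import io

# Node IDs from author.csv and paper.csv in this example.
AUTHOR_IDS = {"author1", "author2", "author3"}
PAPER_IDS = {"paper1", "paper2", "paper3", "paper4"}

AUTHOR_WROTE_PAPER = """src_node_id,dst_node_id
author1,paper1
author1,paper2
author2,paper3
author2,paper4
author3,paper3
"""


def dangling_edges(edge_csv_text, src_ids, dst_ids):
    """Return (src, dst) pairs with an endpoint missing from the node files."""
    missing = []
    for row in csv.DictReader(io.StringIO(edge_csv_text)):
        src, dst = row["src_node_id"], row["dst_node_id"]
        if src not in src_ids or dst not in dst_ids:
            missing.append((src, dst))
    return missing


print(dangling_edges(AUTHOR_WROTE_PAPER, AUTHOR_IDS, PAPER_IDS))  # []
```

For a relation such as "paper_cited_paper," both ID sets would be the paper IDs, because the source and destination node types are the same.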
The photo_img folder (directory) contains the image files used as features in the graph. The folder name must be the same as the image feature column name.
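The link between an image column and its folder can be verified with a small script. The sketch below assumes the dataset has been laid out in a local directory before zipping; the function name missing_images is hypothetical.

```python
import csv
import io
from pathlib import Path


def missing_images(feature_csv_text, dataset_dir):
    """Return "<column>/<file>" entries that are referenced but not on disk."""
    root = Path(dataset_dir)
    missing = []
    for row in csv.DictReader(io.StringIO(feature_csv_text)):
        for column, value in row.items():
            # Image columns end with _img; the folder shares the column name.
            if column and column.endswith("_img") and value:
                if not (root / column / value).is_file():
                    missing.append(f"{column}/{value}")
    return missing
```

With the author.csv sample above, the function would look for photo_img/a1.jpg, photo_img/a2.jpg, and photo_img/a3.png under the given directory and list any that are absent.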