Dataset format: Text classification
The data for a text classification experiment can be formatted following format 1 or 2.
- Format 1: A CSV file.
csv_name.csv (1)(2)
- Format 2: A zip file containing a CSV file.
folder_name.zip (1)
└───csv_name.csv (2)
You can have multiple CSV files in the zip file that you can use as train, validation, and test dataframes:
- A train CSV file needs to follow the format described above
- A validation CSV file needs to follow the same format as a train CSV file
- A test CSV file needs to follow the same format as a train CSV file, but does not require label columns
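As a rough illustration of the zip layout described above, the following sketch packs train, validation, and test CSV files into one archive using only the Python standard library. The file names `train.csv`, `validation.csv`, and `test.csv`, the column names `text` and `label`, and the helper functions are assumptions made for this example; the documentation does not prescribe them.

```python
import csv
import io
import zipfile


def rows_to_csv(header, rows):
    """Serialize a header and data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def build_dataset_zip(zip_target, train_rows, validation_rows, test_texts):
    """Write train, validation, and test CSV files into one zip archive.

    zip_target can be a file path or a writable binary file object.
    """
    files = {
        # Hypothetical file names; pick whatever names suit your project.
        "train.csv": rows_to_csv(["text", "label"], train_rows),
        "validation.csv": rows_to_csv(["text", "label"], validation_rows),
        # The test file may omit the label column.
        "test.csv": rows_to_csv(["text"], [[text] for text in test_texts]),
    }
    with zipfile.ZipFile(zip_target, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return sorted(files)
```

Calling `build_dataset_zip("folder_name.zip", train_rows, validation_rows, test_texts)` writes the archive to disk, with the CSV files placed at the root of the zip as in the Format 2 layout.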
- (1) The available dataset connectors require the data for a text classification experiment to be in a zip or CSV file.
Note: To learn how to upload your zip or CSV file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A CSV file containing the following columns:
  - A text column containing the texts for the experiment
  - One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient.
    Note:
    - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems the classes are mutually exclusive, whereas in multi-label problems each class is a separate label that can apply independently of the others. With N label columns, only a single column can be set to 1 in a multi-class problem, while any number of columns, from none to all, can be set to 1 in a multi-label problem.
    - For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is transformed into multiple columns by a one-hot encoder. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
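The label and fold requirements above can be prepared before upload. The sketch below is one possible way to do so in plain Python and is not part of H2O Hydrogen Torch: all three helper names are hypothetical, the problem-type check is only a consistency heuristic over 0/1 label rows, and the round-robin fold assignment is a simplification (it does not stratify by label).

```python
def binarize_labels(labels, positive):
    """Map string labels to 0/1 so that binary metrics see a 0/1 column."""
    return [1 if label == positive else 0 for label in labels]


def label_problem_type(rows):
    """Report whether 0/1 label rows look multi-class or multi-label.

    Each row is a list of 0/1 values, one value per label column.
    """
    if any(value not in (0, 1) for row in rows for value in row):
        raise ValueError("label columns must contain only 0/1 values")
    # Mutually exclusive classes mean every row has exactly one column set to 1.
    if all(sum(row) == 1 for row in rows):
        return "multi-class"
    # Otherwise at least one row has no column, or several columns, set to 1.
    return "multi-label"


def assign_folds(n_rows, n_folds):
    """Assign integer fold indexes 0..n_folds-1 in round-robin order."""
    return [index % n_folds for index in range(n_rows)]
```

For example, `binarize_labels(["Positive", "Negative"], "Positive")` yields a 0/1 column suitable for a binary experiment that reports precision, recall, or an F-score, and `assign_folds(len(rows), 5)` yields values for an optional fold column.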
The amazon_reviews_text_classification.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text classification problem.
The first two rows of the CSV file are as follows:
text | label |
---|---|
GREAT!!!!! Review: I got this toy a couple of days ago and I ABSOLUTELY LOVE IT! It is so much more realistic looking than my other baby born comfort seat. All though I dont have a baby born I had one before but I sold it at a garage sale. So I use It for my berenguar baby doll. And it even has the buckle that goes across the shoulder like a real babies car seat!!!! DEFFINATELY WORTH THE MONEY!!!!!! | Positive |
This Or "Dixie Chicken" Presents Them At A Peak Review: Though lyrically the overall feel of this record is slightly provincial, it can still transport me to places I wanna be. Musically, this pop product from California is stylistically consistent. Yet the instrumentation is diverse and each member is resourceful. But it's Lowell George's vocals and slide guitar that are primarily at the center. He's not flashy and that's a positive. You get treated to 12-bar blues, a song of prescription meds for tripping and a blues with an accordian.But the three highlights are "Easy To Slip", a jaunty acoustic/electric number about lighting up and the sheer joy that memory drifting can project, "Teenage Nervous Breakdown" in which they switch to the domain of energy-driven rock and roll and the title track, a leisurely-paced country blues in which a generous helping of background vocals provides just the right amount of tension. | Positive |
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Demo (preprocessed) datasets.
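Before uploading your own CSV file, it can help to confirm that its header matches the expected layout. The following is a minimal sketch, assuming the text column is named `text` and the single label column is named `label` as in the example above; `check_columns` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io


def check_columns(csv_text, text_column="text", label_columns=("label",)):
    """Return the required column names that are missing from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    required = [text_column, *label_columns]
    return [name for name in required if name not in header]


# Made-up rows in the same two-column layout as the example file.
sample = 'text,label\n"Great toy, worth the money",Positive\n'
missing = check_columns(sample)  # an empty list means the header is complete
```

For a test CSV file without labels, pass `label_columns=()` so that only the text column is required.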