Dataset format: Text token classification
- Formats
- Example
- Conversions
The data for a text token classification experiment can be formatted following format 1 or 2.
- Format 1: A Parquet file.
parquet_name.pq (1)(2)
- Format 2: A zip file containing a Parquet file.
folder_name.zip (1)
│ └───parquet_name.pq (2)
Note
You can have multiple Parquet files in the zip file that you can use as train, validation, and test dataframes:
- A train Parquet file needs to follow the format described above
- A validation Parquet file needs to follow the same format as a train Parquet file
- A test Parquet file needs to follow the same format as a train Parquet file, but does not require a label column
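As a rough illustration of Format 2, the following sketch bundles already-written Parquet files into a single zip file using Python's standard `zipfile` module. The helper name `zip_parquet_files` is hypothetical, and storing each file at the root of the archive is an assumption based on the tree shown above.

```python
import zipfile
from pathlib import Path


def zip_parquet_files(parquet_paths, zip_path):
    """Bundle Parquet files into one zip file (hypothetical helper).

    Assumption: each file is stored at the root of the archive,
    mirroring the folder_name.zip tree shown above.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, parquet_paths):
            # arcname drops the parent directories so the file sits at the zip root
            archive.write(path, arcname=path.name)
    return Path(zip_path)
```

For example, `zip_parquet_files(["train.pq", "validation.pq", "test.pq"], "folder_name.zip")` would produce one archive holding the three dataframes, assuming those files already exist in the working directory.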
- The available dataset connectors require the data for a text token classification experiment to be in a zip or Parquet file.
Note
To learn how to upload your zip or Parquet file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A Parquet file containing the following columns:
- A text column containing tokenized text: each sample should have a list of string tokens
- A label column containing token labels for the tokenized text; each sample should have a list of token labels. Labels should be represented as categorical string values
- An optional fold column containing cross-validation fold indexes
Note
The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
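The column layout described above can be sketched with pandas as follows. This is a minimal illustration, not an official template: the sample sentences are invented, the column names `text`, `label`, and `fold` are placeholders, and the check that each sample has exactly one label per token is an assumption about what the format expects.

```python
import pandas as pd

# Hypothetical toy samples: (list of string tokens, list of string labels)
samples = [
    (["Alice", "visited", "Paris", "."], ["B-PER", "O", "B-LOC", "O"]),
    (["Bob", "works", "at", "Acme", "Corp"], ["B-PER", "O", "O", "B-ORG", "I-ORG"]),
    (["It", "rained", "."], ["O", "O", "O"]),
    (["Carol", "flew", "to", "Oslo"], ["B-PER", "O", "O", "B-LOC"]),
]

df = pd.DataFrame(samples, columns=["text", "label"])

# Optional fold column: integer fold indexes 0 .. N-1, assigned round-robin
n_folds = 2
df["fold"] = [i % n_folds for i in range(len(df))]


def check_format(frame):
    """Check that tokens and labels are lists of strings of equal length."""
    for tokens, labels in zip(frame["text"], frame["label"]):
        assert all(isinstance(token, str) for token in tokens)
        assert all(isinstance(label, str) for label in labels)
        # Assumption: one label per token
        assert len(tokens) == len(labels)


check_format(df)

# Writing the file needs a Parquet engine such as pyarrow:
# df.to_parquet("train.pq", engine="pyarrow", index=False)
```

The write call is left commented out so the sketch runs without a Parquet engine installed; uncomment it once pyarrow is available.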
The conll2003_text_token_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text token classification problem. The structure of the zip file is as follows:
conll2003_text_token_classification.zip
│ └───test.pq
│ └───train.pq
│ └───validation.pq
The following is a random row from the train.pq file:
id | text | pos_tags | chunk_tags | ner_tags |
---|---|---|---|---|
4158 | ['Nijmeh' 'of' 'Lebanon' 'beat' 'Nasr' 'of' 'Saudi' 'Arabia' '1-0' '(' 'halftime' '1-0' ')' 'in' 'their' 'Asian' 'club' 'championship' 'second' 'round' 'first' 'leg' 'tie' 'on' 'Saturday' '.'] | ['NNS' 'IN' 'NNP' 'VBD' 'NNP' 'IN' 'NNP' 'NNP' 'NNP' '(' 'NN' 'CD' ')' 'IN' 'PRP$' 'JJ' 'NN' 'NN' 'NN' 'NN' 'JJ' 'NN' 'NN' 'IN' 'NNP' '.'] | ['B-NP' 'B-VP' 'B-VP' 'I-VP' 'B-NP' 'I-NP' 'B-PP' 'B-NP' 'O' 'O' 'B-NP' 'B-NP' 'I-NP' 'I-NP' 'B-PP' 'B-NP' 'I-NP' 'B-NP' 'I-NP' 'B-VP' 'B-NP' 'B-PP' 'B-VP' 'O'] | ['B-ORG' 'O' 'B-LOC' 'O' 'B-ORG' 'O' 'B-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-MISC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'] |
Note
- The *_tags columns (pos_tags, chunk_tags, and ner_tags) are the available label columns. They can only be selected when running a text token classification experiment, and only one of them can be selected as the label column for a given experiment.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Demo (preprocessed) datasets.
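To make the relationship between the text column and a label column concrete, the sketch below pairs the first eight tokens of the row above with their ner_tags values and groups them into entity spans. It assumes the usual CoNLL-2003 reading of the tags (`B-` starts an entity, `I-` continues it, `O` is outside any entity); the function `extract_entities` is a hypothetical helper, not part of H2O Hydrogen Torch.

```python
def extract_entities(tokens, tags):
    """Group tokens into (entity_type, text) spans using B-/I-/O tags."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any open span first
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity
            current_tokens.append(token)
        else:
            # "O" tag (or an I- tag that matches no open entity): close the span
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# First eight tokens and ner_tags of the example row
tokens = ["Nijmeh", "of", "Lebanon", "beat", "Nasr", "of", "Saudi", "Arabia"]
ner_tags = ["B-ORG", "O", "B-LOC", "O", "B-ORG", "O", "B-LOC", "I-LOC"]

entities = extract_entities(tokens, ner_tags)
print(entities)
# [('ORG', 'Nijmeh'), ('LOC', 'Lebanon'), ('ORG', 'Nasr'), ('LOC', 'Saudi Arabia')]
```

Each label applies to the token at the same position, which is why the two lists are walked together with `zip`.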
Convert `CoNLL-2003` dataset
from pathlib import Path

import pandas as pd

try:
    import datasets
except ImportError:
    raise ImportError("Need datasets>=1.11.0 to download English CoNLL2003 data!")

dataset = datasets.load_dataset("conll2003")

for subset in dataset:
    out_path = Path(f"/data/conll2003/{subset}.pq")
    out_path.parent.mkdir(exist_ok=True, parents=True)
    df = pd.DataFrame(dataset[subset])

    # Decode the integer-encoded labels into their string names
    for feature in dataset[subset].features:
        if isinstance(dataset[subset].features[feature], datasets.Sequence):
            feat = dataset[subset].features[feature].feature
            if isinstance(feat, datasets.ClassLabel):
                df[feature] = df[feature].apply(feat.int2str)

    # Rename the token column to "text" and write one Parquet file per subset
    df.rename(columns={"tokens": "text"}, inplace=True)
    df.to_parquet(out_path, engine="pyarrow", index=False)