Version: v1.4.0

Dataset format: Text token classification

The data for a text token classification experiment must follow one of two formats: format 1 or format 2.

Format 1

A Parquet file.

parquet_name.pq (1)(2)

Format 2

A zip file containing a Parquet file.

folder_name.zip (1)
└───parquet_name.pq (2)

Note

You can include multiple Parquet files in the zip file and use them as train, validation, and test dataframes:

- A train Parquet file must follow the Parquet format described on this page.
- A validation Parquet file must follow the same format as a train Parquet file.
- A test Parquet file must follow the same format as a train Parquet file, but does not require a label column.
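
As a rough illustration of format 2, the following sketch packs several Parquet files into one zip file using Python's standard zipfile module. The helper names zip_parquet_files and demo are hypothetical, the demo writes placeholder bytes rather than real Parquet data, and placing the files at the archive root is an assumption based on the tree shown for format 2.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_parquet_files(parquet_paths, zip_path):
    """Write each Parquet file to the root of a new zip archive."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, parquet_paths):
            # arcname=path.name drops parent directories, so the file
            # sits at the archive root (assumed layout).
            archive.write(path, arcname=path.name)
    return zip_path


def demo():
    """Zip three placeholder files and return the sorted archive listing."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        paths = []
        for name in ("train.pq", "validation.pq", "test.pq"):
            path = tmp_dir / name
            # Placeholder content only; real files would be written with a Parquet writer.
            path.write_bytes(b"placeholder, not real Parquet data")
            paths.append(path)
        archive_path = zip_parquet_files(paths, tmp_dir / "folder_name.zip")
        with zipfile.ZipFile(archive_path) as archive:
            return sorted(archive.namelist())
```

Calling demo() returns the names stored in the archive, which is a quick way to confirm that no parent folders were included.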

(1) The available dataset connectors require the data for a text token classification experiment to be in a zip or Parquet file.

Note
To learn how to upload your zip or Parquet file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

(2) A Parquet file containing the following columns:
- A text column containing tokenized text: each sample should have a list of string tokens.
- A label column containing token labels for the tokenized text: each sample should have a list of token labels, represented as categorical string values.
- An optional fold column containing cross-validation fold indexes.
  Note
  The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
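
The column requirements above can be checked before any file is written. The following is a minimal sketch in plain Python, not part of H2O Hydrogen Torch: the helpers check_samples and add_folds and the column names label and fold are hypothetical, and the one-label-per-token check is an assumption inferred from the description of the label column.

```python
def check_samples(samples, text_col="text", label_col="label"):
    """Return a list of problems found; an empty list means the samples look consistent."""
    problems = []
    for i, sample in enumerate(samples):
        tokens = sample[text_col]
        labels = sample[label_col]
        if not all(isinstance(tok, str) for tok in tokens):
            problems.append(f"sample {i}: tokens must be strings")
        if not all(isinstance(lab, str) for lab in labels):
            problems.append(f"sample {i}: labels must be strings")
        # Assumption: every token carries exactly one label.
        if len(tokens) != len(labels):
            problems.append(f"sample {i}: {len(tokens)} tokens but {len(labels)} labels")
    return problems


def add_folds(samples, n_folds, fold_col="fold"):
    """Return copies of the samples with round-robin fold indexes 0..n_folds-1."""
    return [{**sample, fold_col: i % n_folds} for i, sample in enumerate(samples)]


samples = [
    {"text": ["Nijmeh", "of", "Lebanon"], "label": ["B-ORG", "O", "B-LOC"]},
    {"text": ["Saudi", "Arabia"], "label": ["B-LOC", "I-LOC"]},
    {"text": ["on", "Saturday", "."], "label": ["O", "O", "O"]},
]
samples_with_folds = add_folds(samples, n_folds=2)
```

If pandas and a Parquet engine such as pyarrow are installed, records like these could then be written with pd.DataFrame(samples_with_folds).to_parquet(...), in the same way the conversion script on this page writes its output.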

The conll2003_text_token_classification.zip file is a preprocessed dataset available in H2O Hydrogen Torch, formatted for a text token classification problem. The structure of the zip file is as follows:

conll2003_text_token_classification.zip
├───test.pq
├───train.pq
└───validation.pq

The following is a random row from the train.pq file:

id	text	pos_tags	chunk_tags	ner_tags
4158	['Nijmeh' 'of' 'Lebanon' 'beat' 'Nasr' 'of' 'Saudi' 'Arabia' '1-0' '(' 'halftime' '1-0' ')' 'in' 'their' 'Asian' 'club' 'championship' 'second' 'round' 'first' 'leg' 'tie' 'on' 'Saturday' '.']	['NNS' 'IN' 'NNP' 'VBD' 'NNP' 'IN' 'NNP' 'NNP' 'NNP' '(' 'NN' 'CD' ')' 'IN' 'PRP$' 'JJ' 'NN' 'NN' 'NN' 'NN' 'JJ' 'NN' 'NN' 'IN' 'NNP' '.']	['B-NP' 'B-VP' 'B-VP' 'I-VP' 'B-NP' 'I-NP' 'B-PP' 'B-NP' 'O' 'O' 'B-NP' 'B-NP' 'I-NP' 'I-NP' 'B-PP' 'B-NP' 'I-NP' 'B-NP' 'I-NP' 'B-VP' 'B-NP' 'B-PP' 'B-VP' 'O']	['B-ORG' 'O' 'B-LOC' 'O' 'B-ORG' 'O' 'B-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-MISC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O']

Note

The *_tags columns (pos_tags, chunk_tags, and ner_tags) are the available label columns. Only one of them can be selected as the label column when running a text token classification experiment.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Demo (preprocessed) datasets.

Convert CoNLL-2003 dataset

from pathlib import Path

import pandas as pd

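try:
    import datasets
except ImportError as err:
    # Re-raise with a hint about the required package, keeping the original error as the cause
    raise ImportError(
        "Need datasets>=1.11.0 to download English CoNLL2003 data!"
    ) from err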

dataset = datasets.load_dataset("conll2003")

for subset in dataset:
    out_path = Path(f"/data/conll2003/{subset}.pq")
    out_path.parent.mkdir(exist_ok=True, parents=True)

    df = pd.DataFrame(dataset[subset])

    # Decode the integer-encoded labels back to their string names
    for feature in dataset[subset].features:
        if isinstance(dataset[subset].features[feature], datasets.Sequence):
            feat = dataset[subset].features[feature].feature

            if isinstance(feat, datasets.ClassLabel):
                df[feature] = df[feature].apply(feat.int2str)

    df.rename(columns={"tokens": "text"}, inplace=True)

    df.to_parquet(out_path, engine="pyarrow", index=False)
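
The decoding loop in the script above turns each list of integer label ids into a list of label strings, which is the form the label column requires. The sketch below is a plain-Python stand-in for that step, for illustration only: it is not the datasets library's implementation, and the LABEL_NAMES list is hypothetical (in the script, the names come from the dataset's ClassLabel feature).

```python
# Hypothetical label names; the real list comes from the dataset's ClassLabel feature.
LABEL_NAMES = ["O", "B-LOC", "I-LOC"]


def decode_labels(encoded, names=LABEL_NAMES):
    """Map a list of integer label ids to their string names."""
    return [names[i] for i in encoded]


decoded = decode_labels([1, 2, 0])
```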
