
Data preparation flow

Overview

Data preparation transforms and cleans your dataset. In H2O LLM DataStudio, the data preparation flow consists of six sequential steps: ingest data, build the workflow, configure the parameters, review and execute, analyze the output, and compare datasets.

Each step is summarized in the sections below.

Step 1: Ingest data

As the first step in the data preparation flow, upload your datasets to H2O LLM DataStudio in the format required for the project type. Select the dataset files and map the column names in your dataset to the columns the project type requires. Before uploading, make sure the dataset is in one of the application’s supported formats.

To learn about data ingestion, see Data ingestion.
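If a source file uses different column names than a project type expects, the columns can also be renamed before upload. The following is a minimal sketch using only the Python standard library. The column names in `COLUMN_MAP` are hypothetical and are not taken from H2O LLM DataStudio; substitute the names your project type requires.

```python
import csv
import io

# Hypothetical mapping from the source file's column names to the names
# a project type might expect; the real names depend on your project type.
COLUMN_MAP = {"prompt": "question", "response": "answer"}

def rename_columns(src, dst, column_map):
    """Copy CSV rows from src to dst, renaming header fields per column_map."""
    reader = csv.DictReader(src)
    # Columns that are not in the mapping keep their original names.
    fieldnames = [column_map.get(name, name) for name in reader.fieldnames]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow({column_map.get(k, k): v for k, v in row.items()})

# In-memory sample data standing in for a real file.
raw = io.StringIO("prompt,response,source\nWhat is 2+2?,4,quiz\n")
renamed = io.StringIO()
rename_columns(raw, renamed, COLUMN_MAP)
print(renamed.getvalue())
```

With real files, pass handles opened with `open(path, newline="")` in place of the `io.StringIO` objects.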

Step 2: Build the workflow

As the second step in the data preparation flow, drag the required steps into the desired sequence. The workflow builder lets you configure the order of the data preparation steps and define how the dataset is prepared and transformed according to your requirements.

To learn how to build the workflow, see Workflow builder.

Step 3: Configure the parameters

As the third step in the data preparation flow, customize the behavior of each function in the workflow by setting its parameters. You can keep the default parameters or configure them based on your requirements and the characteristics of your dataset.

To learn how to configure the parameters, see Configuration.
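Conceptually, an ordered workflow with default parameters and user overrides can be pictured as follows. This is an illustrative sketch only: the step names (`deduplicate`, `remove_short_text`) and their parameters are hypothetical, and the code does not reflect how H2O LLM DataStudio stores or applies its configuration.

```python
# Hypothetical step names and default parameters, for illustration only.
DEFAULTS = {
    "remove_short_text": {"min_length": 10},
    "deduplicate": {"case_sensitive": False},
}

def resolve_params(workflow, overrides):
    """Return (step, params) pairs in workflow order.

    Each step starts from its defaults; any override for that step wins.
    """
    resolved = []
    for step in workflow:
        params = {**DEFAULTS[step], **overrides.get(step, {})}
        resolved.append((step, params))
    return resolved

# The list order is the execution order chosen in the workflow.
workflow = ["deduplicate", "remove_short_text"]
plan = resolve_params(workflow, {"remove_short_text": {"min_length": 25}})
for step, params in plan:
    print(step, params)
```

Steps without an override keep their defaults, which mirrors the choice between using default parameters and configuring them yourself.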

Step 4: Review and execute

As the fourth step in the data preparation flow, review the configured parameters to confirm that they are correct. When you are satisfied, start the workflow by clicking the Run pipeline button at the bottom of the page. The application processes the dataset according to the defined steps and parameters.

To learn how to review and execute the workflow, see Review and execute.

Step 5: Analyze the output

As the fifth step in the data preparation flow, review and analyze the output to confirm that it meets your expectations. You can export the output dataset as a JSON or CSV file.

To learn about the resulting dataset, see Output.
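After exporting, the file can be loaded for further inspection outside the application. The sketch below uses only the Python standard library and assumes that a JSON export is a list of objects, one per row; that layout is an assumption, so adjust the parsing if your export is structured differently. The sample data is made up.

```python
import csv
import io
import json

def load_records(text, fmt):
    """Parse exported text into a list of dict records.

    Assumes a JSON export is a list of objects; adjust if yours differs.
    """
    if fmt == "json":
        return json.loads(text)
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")

# Made-up sample content standing in for exported files.
csv_text = "question,answer\nWhat is 2+2?,4\n"
json_text = '[{"question": "What is 2+2?", "answer": "4"}]'

csv_records = load_records(csv_text, "csv")
json_records = load_records(json_text, "json")
print(len(csv_records), sorted(csv_records[0]))
```

Note that CSV values are always read as strings, so numeric columns may need explicit conversion.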

Step 6: Compare datasets

Using the Insights tab, you can compare the input and output datasets and the new columns generated. To learn about comparing datasets, see Insights.
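A similar comparison can be approximated offline on exported data. The following sketch is not the Insights tab's implementation; it only reports row counts and which columns were added or dropped, using made-up records and a hypothetical `token_count` column.

```python
def column_names(rows):
    """Collect every column name that appears in any row."""
    names = set()
    for row in rows:
        names.update(row)
    return names

def compare_datasets(input_rows, output_rows):
    """Summarize row counts and column differences between two datasets."""
    in_cols = column_names(input_rows)
    out_cols = column_names(output_rows)
    return {
        "rows_in": len(input_rows),
        "rows_out": len(output_rows),
        "new_columns": sorted(out_cols - in_cols),
        "dropped_columns": sorted(in_cols - out_cols),
    }

# Made-up records; "token_count" is a hypothetical generated column.
input_rows = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 2+2?", "answer": "4"},
]
output_rows = [{"question": "What is 2+2?", "answer": "4", "token_count": 5}]

summary = compare_datasets(input_rows, output_rows)
print(summary)
```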

