
Create new feature set version API

A feature set is a collection of features. Users can create a new version of an existing feature set for various reasons.

When to create a new version of a feature set

A new major version of a feature set can be created for various reasons, for example:

  • If the schema of a feature set has changed, for example by changing the data type of one or more features, adding or deleting one or more features, or modifying a special data field on one or more features
  • A new version of a feature set may need to be derived from another feature set.
  • If the way a feature is calculated by an external tool has changed, affecting a feature stored in the Feature Store. The API lets you specify a list of affected features, and the version number of each affected feature is incremented.
  • Changing the partition columns, the primary key, or whether the time travel column is used as a partition column
  • Creating a new version of a feature set by back-filling it with data from another feature set version

What happens after creating a new version

  • The feature set's major version number is incremented.
  • For all the affected features, the version number is incremented.
  • If a schema is provided, the version number is incremented for all features whose type has changed.
  • A message describing the new version is added to the feature set and to the affected features.
  • If a new version of the feature set is derived, an automatic ingestion job will be triggered.

How to create a new version

The following command is used to create a new version of a feature set.

feature_set.create_new_version(...)

The following examples show how a new version can be created:

Create a new version on a schema change

fs = project.feature_sets.get("abc")

# Get current schema
schema = fs.schema.get()

# Change datatype
from featurestore.core.data_types import STRING
schema["xyz"].data_type = STRING
# Change special flag
schema["xyz"].sensitive = True

# Create new version
new_fs = fs.create_new_version(schema=schema, reason="some message", primary_key=[])
  • schema is the new schema of the feature set. Refer to Schema API for information on how to create the schema.
  • reason (optional) is your provided message describing the changes to the feature set. This message will be applied to the feature set and the affected features. By default, an auto-generated message will be populated describing the features added/removed/modified.
  • primary_key (optional) if not empty, sets a new primary key on the feature set
  • partition_by (optional) if not empty, sets new partition columns on the feature set (see the sketch after the note below)
  • time_travel_column_as_partition (optional) if true, the time travel column is used as a partition column in the new feature set version
  • backfill_options (optional) if specified, the Feature Store back-fills data from an older feature set version based on the configuration passed in this object
note

If the primary_key or partition_by arguments contain the same feature multiple times, only distinct values are used.
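
The optional key and partitioning arguments can be combined in a single create_new_version call. The following is a minimal sketch; the feature names "id" and "category" are illustrative and must exist in your schema:

fs = project.feature_sets.get("abc")

# Reuse the current schema unchanged
schema = fs.schema.get()

# Create a new version with a new primary key and new partition columns
new_fs = fs.create_new_version(
    schema=schema,
    reason="Repartition by category",
    primary_key=["id"],
    partition_by=["category"],
    time_travel_column_as_partition=False,
)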

Create a new version by specifying affected features

fs = project.feature_sets.get("abc")

# Create new data source
new_source = CSVFile("new file source")

# Create new version
new_fs = fs.create_new_version(affected_features=["xyz"], reason="Computation of feature XYZ changed")
  • affected_features is a list of feature names for which you are explicitly asking to increment the version number.
  • reason (optional) is your provided message describing the changes to the feature set. This message will be applied to the feature set and the affected features. By default, an auto-generated message will be populated describing the features added/removed/modified.

Create a new version by specifying affected features and schema

A new schema defines the new feature set version. For features that are marked as affected and are present in both the old and the new feature set version, the version number is incremented, as in the previous example (Create a new version by specifying affected features).
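
A minimal sketch of combining both arguments is shown below; the feature name "uvw" is illustrative:

fs = project.feature_sets.get("abc")

# Get and modify the current schema
schema = fs.schema.get()
from featurestore.core.data_types import STRING
schema["xyz"].data_type = STRING

# Provide the new schema and explicitly mark "uvw" as affected
new_fs = fs.create_new_version(
    schema=schema,
    affected_features=["uvw"],
    reason="Schema change for xyz; computation of uvw changed externally",
)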

Create a new version with backfilling

In H2O Feature Store, backfilling involves creating a new version of a feature set that includes data from a previous version, along with any necessary transformations such as feature mapping or filtering based on a time range.

User scenario:

You have a previous version (version 1.5) of a feature set that contains data from the past 5 years, and you want to create a new version (version 2.0) that only includes data from the past 2 years. To accomplish this, you need to use backfilling. You must specify the version (version 1.5) from which you want to use the data. Then you apply a time range filter on a "time travel" column in the feature set to select the data from the past 2 years. Once the filter is applied, the H2O Feature Store will create a new version of the feature set (version 2.0) that includes only the selected data.

fs = project.feature_sets.get("abc")

# Get current schema
schema = fs.schema.get()

# Change datatype
from featurestore.core.data_types import STRING
schema["xyz"].data_type = STRING
# Change special flag
schema["xyz"].sensitive = True

# Create new version with backfilling
backfill = BackfillOption(from_version="", from_date = None, to_date = None, spark_pipeline=None, feature_mapping = None)
new_fs = fs.create_new_version(schema=schema, reason="some message", backfill_options=backfill)
  • from_version is the version from which the backfill will be executed. If the argument refers to just a major version, e.g. "1", the corresponding latest minor version is used.
  • from_date is the start of the time range used to filter the back-filled data
  • to_date is the end of the time range used to filter the back-filled data
  • spark_pipeline is a transformation that will be applied to the data. Refer to Supported derived transformations for information on how to use transformations.
  • feature_mapping is a mapping of feature names to default values

Example:

import datetime
import featurestore.core.transformations as t
spark_pipeline_transformation = t.SparkPipeline("/path_to_pipeline/spark_pipeline.zip")
backfill = BackfillOption(from_version="1.1", from_date = datetime.datetime(2021, 2, 24, 00, 00), to_date = datetime.datetime(2021, 4, 2, 13, 33), spark_pipeline=spark_pipeline_transformation, feature_mapping = {"xyz": "test value"})
new_fs = fs.create_new_version(schema=schema, reason="some message", backfill_options=backfill)
note

The Spark pipeline transformation is triggered after all other options (from_date, to_date, feature_mapping) have been applied.
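
Returning to the user scenario above, a minimal sketch could look like the following (the version string and dates are illustrative):

import datetime

fs = project.feature_sets.get("abc")
schema = fs.schema.get()

# Back-fill from version 1.5, keeping only the data from the last two years
backfill = BackfillOption(
    from_version="1.5",
    from_date=datetime.datetime(2021, 6, 1),
    to_date=datetime.datetime(2023, 6, 1),
    spark_pipeline=None,
    feature_mapping=None,
)
new_fs = fs.create_new_version(schema=schema, reason="Keep only the last 2 years of data", backfill_options=backfill)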

