Create new feature set version API
A feature set is a collection of features. Users can create a new version of an existing feature set for various reasons.
When to create a new version of a feature set
A new major version of a feature set can be created for various reasons, for example:
- The schema of a feature set has changed, for example by changing the data type of one or more features, adding or deleting one or more features, or modifying a special data field in one or more features.
- A new version of a feature set may need to be derived from another feature set.
- The way a feature is calculated has changed in an external tool, affecting a feature in the Feature Store. The API lets you specify a list of affected features, and the version number of each affected feature is incremented.
- Partition columns, primary keys, or whether the time travel column is used as a partition column have changed.
- The user wants to create a new version of a feature set by back-filling it with data from another feature set version.
What happens after creating a new version
- The feature set's major version number is incremented, as illustrated in the sketch after this list.
- The version number of every affected feature is incremented.
- When a new schema is provided, the version number is incremented for all features whose type has changed.
- Messaging describing the new version is recorded on the feature set and the affected features.
- If the new version of the feature set is derived, an automatic ingestion job is triggered.
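To make the version bump concrete, here is a minimal Python sketch. The version attribute used in the comments is an assumption for illustration and is not documented on this page:

fs = project.feature_sets.get("abc")
# Assumed for illustration: the feature set reports its version, e.g. "1.3"
print(fs.version)
# Creating a new version bumps the feature set's major version
new_fs = fs.create_new_version(affected_features=["xyz"], reason="recomputed xyz")
# Assumed: the returned object reports the incremented version, e.g. "2.0"
print(new_fs.version)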
How to create a new version
The following command is used to create a new version of a feature set.
- Python
- Scala
feature_set.create_new_version(...)
featureSet.createNewVersion(...)
The following examples show how a new version can be created:
- Create a new version on a schema change
- Create a new version by specifying affected features
- Create a new version by specifying affected features and schema
- Create a new version with backfilling
Create a new version on a schema change
- Python
- Scala
fs = project.feature_sets.get("abc")
# Get current schema
schema = fs.schema.get()
# Change datatype
from featurestore.core.data_types import STRING
schema["xyz"].data_type = STRING
# Change special flag
schema["xyz"].sensitive = True
# Create new version
new_fs = fs.create_new_version(schema=schema, reason="some message", primary_key=[])
schema
is the new schema of the feature set. Refer to the Schema API for information on how to create the schema.
reason (optional)
is your provided message describing the changes to the feature set. This message is applied to the feature set and the affected features. By default, an auto-generated message is populated describing the features added, removed, and modified.
primary_key (optional)
if not empty, a new primary key is set on the feature set.
partition_by (optional)
if not empty, new partition columns are set on the feature set.
time_travel_column_as_partition (optional)
if true, the time travel column is used as a partition column in the new feature set version.
backfill_options (optional)
if specified, the Feature Store back-fills data from an older feature set version based on the configuration passed in this object.
val fs = project.featureSets.get("abc")
// Get current schema
val schema = fs.schema.get()
// Change datatype
import ai.h2o.featurestore.core.DataTypes.STRING
schema("xyz").dataType = STRING
// Change special flag
schema("xyz").sensitive = true
// Create new version
val newFs = fs.createNewVersion(schema=schema, reason="some message", primaryKey=Seq())
schema
is the new schema of the feature set. Refer to the Schema API for information on how to create the schema.
reason (optional)
is your provided message describing the changes to the feature set. This message is applied to the feature set and the affected features. By default, an auto-generated message is populated describing the features added, removed, and modified.
primaryKey (optional)
if not empty, a new primary key is set on the feature set.
partitionBy (optional)
if not empty, new partition columns are set on the feature set.
timeTravelColumnAsPartition (optional)
if true, the time travel column is used as a partition column in the new feature set version.
backfillOptions (optional)
if specified, the Feature Store back-fills data from an older feature set version based on the configuration passed in this object.
If the primary key or partition by arguments contain the same feature multiple times, only distinct values are used.
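For example, a duplicated feature name behaves as if it were passed once (a minimal sketch using the call shown above; the feature names are hypothetical):

# Duplicates are reduced to distinct values, so this call behaves the same
# as primary_key=["id"] and partition_by=["day"]
new_fs = fs.create_new_version(schema=schema, primary_key=["id", "id"], partition_by=["day", "day"])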
Create a new version by specifying affected features
- Python
- Scala
fs = project.feature_sets.get("abc")
# Create new data source
new_source = CSVFile("new file source")
# Create new version
new_fs = fs.create_new_version(data_source=new_source, affected_features=["xyz"], reason="Computation of feature XYZ changed")
affected_features
is a list of feature names for which you are explicitly asking to increment the version number.
reason (optional)
is your provided message describing the changes to the feature set. This message is applied to the feature set and the affected features. By default, an auto-generated message is populated describing the features added, removed, and modified.
val fs = project.featureSets.get("abc")
// Create new data source
val newSource = CSVFile("new file source")
// Create new version
val newFs = fs.createNewVersion(dataSource=newSource, affectedFeatures=Seq("xyz", "abc"), reason="new feature additions")
affectedFeatures
is a list of feature names for which you are explicitly asking to increment the version number.
reason (optional)
is your provided message describing the changes to the feature set. This message is applied to the feature set and the affected features. By default, an auto-generated message is populated describing the features added, removed, and modified.
Create a new version by specifying affected features and schema
A new schema defines the new feature set version. For features that are marked as affected and that exist in both the old and the new feature set version, the version number is incremented, as described in Create a new version by specifying affected features.
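No standalone snippet is shown above for this combined case; the following minimal Python sketch passes both arguments together, reusing the schema and affected_features parameters from the previous two examples:

fs = project.feature_sets.get("abc")
# Change the schema as in the schema-change example above
schema = fs.schema.get()
from featurestore.core.data_types import STRING
schema["xyz"].data_type = STRING
# Pass the new schema together with the explicitly affected feature
new_fs = fs.create_new_version(schema=schema, affected_features=["xyz"], reason="xyz type changed and recomputed")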
Create a new version with backfilling
In H2O Feature Store, backfilling involves creating a new version of a feature set that includes data from a previous version, along with any necessary transformations such as feature mapping or filtering based on a time range.
User scenario:
You have a previous version (version 1.5) of a feature set that contains data from the past 5 years, and you want to create a new version (version 2.0) that only includes data from the past 2 years. To accomplish this, you need to use backfilling. You must specify the version (version 1.5) from which you want to use the data. Then you apply a time range filter on a "time travel" column in the feature set to select the data from the past 2 years. Once the filter is applied, the H2O Feature Store will create a new version of the feature set (version 2.0) that includes only the selected data.
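Before the generic walkthrough below, here is a minimal Python sketch of this exact scenario. It uses the BackfillOption fields documented below; the two-year cutoff is computed for illustration, and BackfillOption is used without a shown import, as in the examples on this page:

import datetime

# Back-fill from version 1.5, keeping only the past 2 years of data
two_years_ago = datetime.datetime.now() - datetime.timedelta(days=2 * 365)
backfill = BackfillOption(from_version="1.5", from_date=two_years_ago, to_date=datetime.datetime.now())
new_fs = fs.create_new_version(schema=fs.schema.get(), reason="keep only the last 2 years", backfill_options=backfill)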
- Python
- Scala
fs = project.feature_sets.get("abc")
# Get current schema
schema = fs.schema.get()
# Change datatype
from featurestore.core.data_types import STRING
schema["xyz"].data_type = STRING
# Change special flag
schema["xyz"].sensitive = True
# Create new version with backfilling
backfill = BackfillOption(from_version="", from_date = None, to_date = None, spark_pipeline=None, feature_mapping = None)
new_fs = fs.create_new_version(schema=schema, reason="some message", backfill_options=backfill)
from_version
is the version from which the backfill is executed. If the argument refers only to a major version, e.g. "1", the corresponding latest minor version is used.
from_date
is the date from which data is selected.
to_date
is the date up to which data is selected.
spark_pipeline
is a transformation that is applied to the data. Refer to Supported derived transformations for information on how to use transformations.
feature_mapping
is a default value mapping for features.
val fs = project.featureSets.get("abc")
// Get current schema
val schema = fs.schema.get()
// Change datatype
import ai.h2o.featurestore.core.DataTypes.STRING
schema("xyz").dataType = STRING
// Change special flag
schema("xyz").sensitive = true
// Create new version with backfilling
val backfillOption = BackfillOption(fromVersion = "", fromDate = None, toDate = None, sparkPipeline = None, featureMapping = None)
val newFs = fs.createNewVersion(schema=schema, reason="some message", backfillOptions=Some(backfillOption))
fromVersion
is the version from which the backfill is executed. If the argument refers only to a major version, e.g. "1", the corresponding latest minor version is used (see the one-line illustration after this list).
fromDate
is the date from which data is selected.
toDate
is the date up to which data is selected.
sparkPipeline
is a transformation that is applied to the data. Refer to Supported derived transformations for information on how to use transformations.
featureMapping
is a default value mapping for features.
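As noted above, from_version (fromVersion in Scala) may name only a major version. A minimal Python illustration, following the documented constructor form (the resolved version number is hypothetical):

# "1" resolves to the latest minor version of major version 1, e.g. "1.7"
backfill = BackfillOption(from_version="1", from_date=None, to_date=None, spark_pipeline=None, feature_mapping=None)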
Example:
- Python
- Scala
import datetime
import featurestore.core.transformations as t
spark_pipeline_transformation = t.SparkPipeline("/path_to_pipeline/spark_pipeline.zip")
backfill = BackfillOption(from_version="1.1", from_date = datetime.datetime(2021, 2, 24, 00, 00), to_date = datetime.datetime(2021, 4, 2, 13, 33), spark_pipeline=spark_pipeline_transformation, feature_mapping = {"xyz": "test value"})
new_fs = fs.create_new_version(schema=schema, reason="some message", backfill_options=backfill)
The Spark pipeline transformation is triggered after all other options are applied: from_date, to_date, and feature_mapping.
import java.time.{LocalDateTime, ZoneOffset}
import ai.h2o.featurestore.core.entities.BackfillOption
import ai.h2o.featurestore.core.transformations.SparkPipeline
val sparkPipelineTransformation = SparkPipeline("/Users/adrian/h2o/spark_pipeline_2.zip")
val fromDate = LocalDateTime.parse("2021-02-24T00:00:00").toInstant(ZoneOffset.UTC)
val toDate = LocalDateTime.parse("2021-04-02T13:20:17").toInstant(ZoneOffset.UTC)
val backfillOption = BackfillOption(fromVersion = "1.1", fromDate = Some(fromDate), toDate = Some(toDate), sparkPipeline = Some(sparkPipelineTransformation),
  featureMapping = Some(Map("xyz" -> "test value")))
val newFs = fs.createNewVersion(schema=schema, reason="some message", backfillOptions=Some(backfillOption))
The Spark pipeline transformation is triggered after all other options are applied: fromDate, toDate, and featureMapping.