Feature set API
Registering a feature set
To register a feature set, you first need to obtain the schema. See Schema API for information on how to create the schema.
- Python
- Scala
project.feature_sets.register(schema, "feature_set_name", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", secret=False, partition_by=None, time_travel_column_as_partition=False, flow=None)
If the secret argument is set to True, the feature set is visible only to its owners (which also means all owners of the project where the feature set is being registered). Other users in the system cannot see the feature set in the output of the "list feature sets" call and cannot view the feature set details.
If the partition_by argument is not set, the time travel column will be used by Feature Store to partition the layout by each ingestion. If it is defined, time_travel_column_as_partition can additionally be set to True to use time-travel-based partitioning.
The feature_sets.register and feature_set.flow methods use the FeatureSetFlow enum. An enum (enumeration) defines a set of named values; it groups related values and makes code more readable and maintainable.
If the flow argument is set, it influences where data is stored. The following values (from the FeatureSetFlow enum) are supported:
- FeatureSetFlow.OFFLINE_ONLY - data is stored only in the offline feature store. Online ingestion and materialization are disabled.
- FeatureSetFlow.ONLINE_ONLY - data is stored only in the online feature store. Offline ingestion and materialization are disabled.
- FeatureSetFlow.OFFLINE_ONLINE_MANUAL - data is stored in both the offline and online Feature Store, but automatic materialization to online is disabled. Propagating data from online to offline is automated, but offline to online is manual and must be triggered by an online materialization call.
- FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC - data is stored in both the offline and online Feature Store, and automatic materialization to online is enabled. Data is propagated automatically in both directions (offline to online and online to offline), so you do not have to call materialize_online yourself.
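For illustration, a minimal registration sketch in Python, assuming a schema was already extracted via the Schema API; the source object, feature set name, column names, and the FeatureSetFlow import path are assumptions:
from featurestore import FeatureSetFlow  # import path assumed

schema = client.extract_schema_from_source(my_source)  # my_source is illustrative
fs = project.feature_sets.register(
    schema,
    "sensor_readings",
    description="Raw sensor data",
    primary_key=["sensor_id"],
    time_travel_column="event_time",
    secret=False,
    flow=FeatureSetFlow.OFFLINE_ONLY,  # offline storage only; online materialization disabled
)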
project.featureSets.register(schema, "feature_set_name", description="", primaryKey=Seq(), timeTravelColumn="", timeTravelColumnFormat="yyyy-MM-dd HH:mm:ss", secret=false, partitionBy="", timeTravelColumnAsPartition=false, flow="")
If the secret argument is set to true, the feature set is visible only to its owners (which also means all owners of the project where the feature set is being registered). Other users in the system cannot see the feature set in the output of the "list feature sets" call and cannot view the feature set details.
If the partitionBy argument is not set, the time travel column will be used by Feature Store to partition the layout by each ingestion. If it is defined, timeTravelColumnAsPartition can additionally be set to true to use time-travel-based partitioning.
The featureSets.register and featureSets.flow methods use the FeatureSetFlow enum. An enum (enumeration) defines a set of named values; it groups related values and makes code more readable and maintainable.
If the flow argument is set, it influences where data is stored. The following values (from the enum ai.h2o.featurestore.core.FeatureSetFlow) are supported:
- FeatureSetFlow.OFFLINE_ONLY - data is stored only in the offline feature store. Online ingestion and materialization are disabled.
- FeatureSetFlow.ONLINE_ONLY - data is stored only in the online feature store. Offline ingestion and materialization are disabled.
- FeatureSetFlow.OFFLINE_ONLINE_MANUAL - data is stored in both the offline and online Feature Store, but automatic materialization to online is disabled. Propagating data from online to offline is automated, but offline to online is manual and must be triggered by an online materialization call.
- FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC - data is stored in both the offline and online Feature Store, and automatic materialization to online is enabled. Data is propagated automatically in both directions (offline to online and online to offline), so you do not have to call materializeOnline yourself.
If the primary key or partition by arguments contain the same feature multiple times, only the distinct values are used.
If a value in the primary key, partition by, or time travel column arguments corresponds to two or more features, the most nested feature is selected by default. A specific feature can instead be selected by enclosing the feature name in backticks (`).
For example, suppose a feature set contains a feature named "test.data" and a second feature "test" with a nested feature "data". By default, the value "test.data" selects the nested feature "data". To select the feature named "test.data" instead, change the value to "`test.data`".
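A short Python sketch of this disambiguation during registration (the feature set and column names are hypothetical):
# selects the nested feature "data" inside the feature "test"
project.feature_sets.register(schema, "example_fs", primary_key=["test.data"])
# backticks select the single feature literally named "test.data"
project.feature_sets.register(schema, "example_fs", primary_key=["`test.data`"])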
Feature Store uses the time format used by Spark. The specification is available here.
To see naming conventions for feature set names, please visit Default naming rules.
To register a derived feature set, you first need to obtain the derived schema. See Schema API for information on how to create the schema.
- Python
- Scala
import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")
derived_schema = client.extract_derived_schema([parent_feature_set], spark_pipeline_transformation)
project.feature_sets.register(derived_schema, "derived_feature_set", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", secret=False, partition_by=None, time_travel_column_as_partition=False)
import ai.h2o.featurestore.core.transformations.SparkPipeline
val sparkPipelineTransformation = SparkPipeline("...")
val derivedSchema = client.extractDerivedSchema(Seq(parentFeatureSet), sparkPipelineTransformation)
project.featureSets.register(derivedSchema, "derived_feature_set", description="", primaryKey=Seq(), timeTravelColumn="", timeTravelColumnFormat="yyyy-MM-dd HH:mm:ss", secret=false, partitionBy="", timeTravelColumnAsPartition=false)
Features can be masked by setting Special Data fields in the schema. For further information, please visit Modify special data on a schema.
Setting any of the following attributes to true marks the feature for masking:
- spi - Sensitive Personal Information
- pci - Payment Card Industry
- rpi - Real Property Inventory
- demographic
- sensitive
Any of these special data tags enables the masking functionality and separates the sensitive consumer output (i.e., unmasked data) from the masked view that the consumer role sees. Which tag is selected is a matter of bookkeeping; it does not lead to different functionality.
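For illustration, a hedged Python sketch of flagging a feature, assuming the special data flags are exposed on features the same way as the legal fields shown in the updating section below (the feature set and column names are hypothetical):
fs = project.feature_sets.get("feature_set_name")
fs.features["ssn"].special_data.spi = True  # marks the feature for masking; exact field path may differ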
Feature Store does not support registering feature sets with the following characters in column names:
- , (comma)
- ; (semicolon)
- { or }
- ( or )
- new line character
- tab character
- =
Time travel column selection
You can specify a time travel column during the registration call. If the column is specified, Feature Store will use that column to obtain time travel data and will use it for incremental ingest purposes. The explicitly passed time travel column must be present in the schema passed to the registration call.
If the time travel column is not specified, a virtual one is created, so you can still do time travel on static feature sets. Each ingestion to this feature set is treated as a new batch of data with a new timestamp.
Use the following register method argument to specify the name of the time travel column explicitly:
- Python
- Scala
time_travel_column
timeTravelColumn
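For example, a minimal Python sketch, assuming the schema contains an event_time column (the feature set and column names are illustrative):
project.feature_sets.register(
    schema,
    "trades",
    primary_key=["trade_id"],
    time_travel_column="event_time",  # must be present in the schema passed to register
)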
Inferring the data type of date-time columns during feature set registration
File types without schema information: For file types that have no metadata about column types (e.g., CSV), Feature Store detects date-time columns as regular strings.
File types containing schema information: For file types that keep information about the data types (e.g., Parquet), Feature Store respects those types. If a date-time column is stored with a type of Timestamp or Date, Feature Store will respect that during the registration.
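As a sketch of the difference (the source objects are hypothetical; extract_schema_from_source is shown later in this page):
csv_schema = client.extract_schema_from_source(csv_source)          # date-time columns appear as string
parquet_schema = client.extract_schema_from_source(parquet_source)  # Timestamp/Date types are preserved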
Listing feature sets within a project
The list method does not return feature sets directly. Instead, it returns an iterator which obtains the feature sets lazily.
- Python
- Scala
project.feature_sets.list(query=None, advanced_search_options=None)
project.featureSets.list(query="", advancedSearchOption=Seq())
The query and advancedSearchOption arguments are optional and specify which feature sets should be returned. By default, no filtering options are applied.
To filter feature sets by name, description, or tags, use the query parameter.
- Python
- Scala
project.feature_sets.list(query="My feature")
project.featureSets.list(query="My feature")
The advancedSearchOption argument allows you to filter feature sets by feature name, description, or tags.
To provide the advancedSearchOption in your requests, follow these steps:
- Python
- Scala
from featurestore.core.search_operator import SearchOperator
from featurestore.core.search_field import SearchField
from featurestore import AdvancedSearchOption
search_options = [AdvancedSearchOption(search_operator=SearchOperator.SEARCH_OPERATOR_LIKE, search_field=SearchField.SEARCH_FIELD_FEATURE_NAME, search_value="super feature")]
project.feature_sets.list(advanced_search_options=search_options)
import ai.h2o.featurestore.core.SearchOperator
import ai.h2o.featurestore.core.SearchField
import ai.h2o.featurestore.core.entities.AdvancedSearchOption
val searchOptions = Seq(AdvancedSearchOption(SearchOperator.SEARCH_OPERATOR_LIKE, SearchField.SEARCH_FIELD_FEATURE_NAME, "super feature"))
project.featureSets.list(advancedSearchOption=searchOptions)
Both parameters can be used together, and the returned iterator fetches feature sets lazily, as shown in the sketch below.
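A combined Python sketch (reusing search_options from the example above):
# combine the free-text query with advanced search options
it = project.feature_sets.list(query="My feature", advanced_search_options=search_options)
first = next(it)      # feature sets are fetched lazily from the iterator
remaining = list(it)  # materialize the rest if needed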
You can also list all major versions of the feature set:
- Python
- Scala
fs.major_versions()
fs.majorVersions()
This call shows all major versions of the feature set (the current and previous ones).
You can also list all versions of the feature set:
- Python
- Scala
fs.list_versions()
fs.listVersions()
This call shows all versions of the feature set (the current and previous ones).
Obtaining a feature set
- Python
- Scala
fs = project.feature_sets.get("feature_set_name", version=None)
val fs = project.featureSets.get("feature_set_name")
or
val fs = project.featureSets.get("feature_set_name", "1.0")
If the version is not specified, the latest version of the feature set is returned.
You can also get the latest minor version of a feature set for a given major version:
- Python
- Scala
fs = project.feature_sets.get_major_version("feature_set_name", 2)
val fs = project.featureSets.getMajorVersion("feature_set_name", 2)
It is also possible to obtain a different version of a feature set from an existing feature set instance:
- Python
- Scala
fs = feature_set.get_version("2.1")
val fs = featureSet.getVersion("2.0")
Previewing data
You can preview a maximum of 100 rows and 50 features.
- Python
- Scala
fs.get_preview()
fs.getPreview()
Setting feature set permissions
Refer to Permissions for more information.
Deleting feature sets
- Python
- Scala
fs = project.feature_sets.get("name")
fs.delete()
val fs = project.featureSets.get("name")
fs.delete()
Deleting feature set major versions
- Python
- Scala
fs = project.feature_sets.get("name")
major_versions = fs.major_versions()
major_versions[0].delete()
val fs = project.featureSets.get("name")
val majorVersions = fs.majorVersions()
majorVersions(0).delete()
Updating feature set fields
To update a field, simply call the setter of that field. For example:
- Python
- Scala
fs = project.feature_sets.get("name")
fs.secret = False
fs.deprecated = True
fs.time_to_live.offline = 2
fs.special_data.legal.approved = True
fs.special_data.legal.notes = "Legal notes"
fs.features["col"].special_data.legal.approved = True
fs.features["col"].special_data.legal.notes = "Legal notes"
# Add a new tag to the feature set
fs.tags.append("new tag") # This will add the new tag to the list of existing tags
# Add new tags that will overwrite any existing tags
fs.tags = ["new tag 1", "new tag 2"] # This will overwrite the existing tags with the given list of values
# Assigning a string to tags is not supported
fs.tags = "new tag" # This operation is not supported as tags accepts only a list of strings as input
# Add a new value to the data source domains on the feature set
fs.data_source_domains.append("new domain") # This will add the new domain to the list of existing domains
# Add new domains that will overwrite any existing domains
fs.data_source_domains = ["new domain 1", "new domain 2"] # This will overwrite the existing domains with the given list of values
# Assigning a string to domain is not supported
fs.data_source_domains = "new domain" # This operation is not supported as domain accepts only a list of strings as input
val fs = project.featureSets.get("name")
fs.secret = false
fs.deprecated = true
fs.timeToLive.offline = 2
fs.specialData.legal.approved = true
fs.specialData.legal.notes = "Legal notes"
fs.features("col").specialData.legal.approved = true
fs.features("col").specialData.legal.notes = "Legal notes"
// Add a new tag to the feature set
fs.tags = fs.tags :+ "new tag" // This will add the new tag to the list of existing tags
// Add new tags that will overwrite any existing tags
fs.tags = Seq("new tag 1", "new tag 2") // This will overwrite the existing tags with the given seq of values
// Assigning a string to tags is not supported
fs.tags = "new tag" // This operation is not supported as tags accepts only a seq of strings as input
// Add a new value to the data source domains on the feature set
fs.dataSourceDomains = fs.dataSourceDomains :+ "new domain" // This will add the new domain to the list of existing domains
// Add new domains that will overwrite any existing domains
fs.dataSourceDomains = Seq("new domain 1", "new domain 2") // This will overwrite the existing domains with the given seq of values
// Assigning a string to domains is not supported
fs.dataSourceDomains = "new domain" // This operation is not supported as dataSourceDomains accepts only a seq of strings as input
The feature type can be changed as follows:
- Python
- Scala
from featurestore.core.entities.feature import CATEGORICAL
fs = project.feature_sets.get("name")
feature = fs.features["feature"]
feature.profile.feature_type = CATEGORICAL
import ai.h2o.featurestore.core.entities.Feature.CATEGORICAL
val fs = project.featureSets.get("name")
val feature = fs.features("feature")
feature.profile.featureType = CATEGORICAL
The following list of fields can be updated on the feature set object:
- Python
- Scala
- tags
- data_source_domains
- feature_set_type
- description
- application_name
- application_id
- deprecated
- process_interval
- process_interval_unit
- flow
- feature_set_state
- secret
- time_to_live.ttl_offline
- time_to_live.ttl_offline_interval
- time_to_live.ttl_online
- time_to_live.ttl_online_interval
- special_data.legal.approved
- special_data.legal.notes
- feature[].status
- feature[].profile.feature_type
- feature[].importance
- feature[].description
- feature[].special
- feature[].monitoring.anomaly_detection
- feature[].classifiers
feature_set_type has two values, RAW or ENGINEERED. It denotes whether the feature set was derived from raw or processed data. This classification exists for information purposes and does not affect Feature Store behavior.
time_to_live is currently respected for data in the online feature store only. It indicates the duration for which records remain stored before they are evicted.
- tags
- dataSourceDomains
- featureSetType
- description
- applicationName
- applicationId
- deprecated
- processInterval
- processIntervalUnit
- flow
- featureSetState
- secret
- timeToLive.ttlOffline
- timeToLive.ttlOfflineInterval
- timeToLive.ttlOnline
- timeToLive.ttlOnlineInterval
- specialData.legal.approved
- specialData.legal.notes
- feature[].status
- feature[].profile.featureType
- feature[].importance
- feature[].description
- feature[].special
- feature[].monitoring.anomalyDetection
- feature[].classifiers
featureSetType has two values, RAW or ENGINEERED. It denotes whether the feature set was derived from raw or processed data. This classification exists for information purposes and does not affect Feature Store behavior.
timeToLive is currently respected for data in the online feature store only. It indicates the duration for which records remain stored before they are evicted.
To retrospectively find out who updated a feature set and when, call:
- Python
- Scala
fs.last_updated_by
fs.last_updated_date_time
fs.lastUpdatedBy
fs.lastUpdatedDateTime
Recommendation and classifiers
Refer to the Recommendation API for more information.
New version API
Refer to the Create new feature set version API for more information.
Feature set schema API
Getting schema
To get a feature set's schema, run:
- Python
- Scala
fs = project.feature_sets.get("gdp")
fs.schema.get()
val fs = project.featureSets.get("gdp")
fs.schema.get()
Checking schema compatibility
To compare a feature set's schema with a new data source's schema, run:
- Python
- Scala
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.is_compatible_with(new_schema, compare_data_types=True)
val fs = project.featureSets.get("gdp")
val newSchema = client.extractSchemaFromSource(<source>)
fs.schema.isCompatibleWith(newSchema, compareDataTypes=true)
Parameters explanation:
- Python
- Scala
- new_schema: the new schema to check compatibility with.
- compare_data_types: accepts True/False; indicates whether data types need to be compared.
  - If compare_data_types is True, data types for features with the same name will be verified.
  - If compare_data_types is False, data types for features with the same name will not be verified.
- newSchema: the new schema to check compatibility with.
- compareDataTypes: accepts true/false; indicates whether data types need to be compared.
  - If compareDataTypes is true, data types for features with the same name will be verified.
  - If compareDataTypes is false, data types for features with the same name will not be verified.
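Putting this together, a short Python sketch that branches on the compatibility result (the source object is hypothetical):
new_schema = client.extract_schema_from_source(my_source)  # my_source is illustrative
if fs.schema.is_compatible_with(new_schema, compare_data_types=True):
    pass  # safe to ingest from the new source
else:
    pass  # schemas diverge; consider patching the schema (see below)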
Patching new schema
The patch operation checks for matching features between the new schema and the existing fs.schema. If there is a match, metadata such as special_data, description, etc., is copied into the new schema.
To patch the new schema with feature set's schema, run:
- Python
- Scala
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.patch_from(new_schema, compare_data_types=True)
val fs = project.featureSets.get("gdp")
val newSchema = client.extractSchemaFromSource(<source>)
fs.schema.patchFrom(newSchema, compareDataTypes=true)
Parameters explanation:
- Python
- Scala
- new_schema: the new schema that needs to be patched.
- compare_data_types: accepts True/False; indicates whether data types are to be compared while patching.
  - If compare_data_types is True, the data type from the feature set schema is retained for features with the same name and different types.
  - If compare_data_types is False, the data type from the new schema is retained for features with the same name and different types.
- newSchema: the new schema that needs to be patched.
- compareDataTypes: accepts true/false; indicates whether data types are to be compared while patching.
  - If compareDataTypes is true, the data type from the feature set schema is retained for features with the same name and different types.
  - If compareDataTypes is false, the data type from the new schema is retained for features with the same name and different types.
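A short usage sketch in Python (per the description above, the matched metadata is copied into new_schema; the source object is hypothetical):
new_schema = client.extract_schema_from_source(my_source)  # my_source is illustrative
fs.schema.patch_from(new_schema, compare_data_types=True)
# new_schema now carries over descriptions/special data for matching features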
Offline to online API
To push existing data from the offline Feature Store into the online one, run:
Blocking approach:
- Python
- Scala
feature_set.materialize_online()
featureSet.materializeOnline()
Non-Blocking approach:
- Python
- Scala
future = feature_set.materialize_online_async()
future = featureSet.materializeOnlineAsync()
The feature set must have a primary key and a time travel column defined in order to materialize the offline store into the online one.
More information about asynchronous methods is available at Asynchronous methods.
Subsequent materialization calls push only the records that are new since the last call.
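For example, a hedged sketch of a periodic sync loop in Python (the scheduling code is illustrative; only the materialize_online call is part of the API):
import time

while True:
    feature_set.materialize_online()  # pushes only records ingested since the last call
    time.sleep(3600)                  # re-sync hourly; the interval is illustrative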
Online to offline API
A background process periodically starts online-to-offline ingestion, but if you need to push existing data from the online Feature Store into the offline one earlier than scheduled, run:
Blocking approach:
- Python
- Scala
feature_set.start_online_offline_ingestion()
featureSet.startOnlineOfflineIngestion()
Non-Blocking approach:
- Python
- Scala
job = feature_set.start_online_offline_ingestion_async()
val job = featureSet.startOnlineOfflineIngestionAsync()
Feature set jobs API
You can get the list of jobs that are currently running for a specific feature set:
- Python
- Scala
You can also retrieve a specific type of job by specifying the job_type parameter.
from featurestore.core.job_types import INGEST, RETRIEVE, EXTRACT_SCHEMA
fs.get_active_jobs()
fs.get_active_jobs(job_type=INGEST)
You can also retrieve a specific type of job by specifying the jobType parameter.
import ai.h2o.featurestore.core.JobTypes.{INGEST, RETRIEVE, EXTRACT_SCHEMA}
fs.getActiveJobs()
fs.getActiveJobs(jobType=INGEST)
Refreshing feature set
To refresh the feature set to contain the latest information, call:
- Python
- Scala
fs.refresh()
fs.refresh()
Getting recommendations
To get recommendations, call:
- Python
- Scala
fs.get_recommendations()
fs.getRecommendations()
The following conditions must hold for recommendations:
- The feature set must have at least one classifier defined.
- The results will be based on the retrieve permissions of the user.
Marking feature as target variable
When feature sets are used to train ML models, it can be beneficial to know which feature was used as the model's target variable. To communicate this knowledge between different feature set users, you can mark a feature as a target variable (or discard that mark) and list the marked features.
- Python
- Scala
feature_state = fs.features["state"]
feature_state.mark_as_target_variable()
fs.list_features_used_as_target_variable()
feature_state.discard_as_target_variable()
val featureState = fs.features("state")
featureState.markAsTargetVariable()
fs.listFeaturesUsedAsTargetVariable()
featureState.discardAsTargetVariable()
Listing feature set users
From the feature set owner's perspective, it may be necessary to understand who is actually allowed to access and modify a given feature set. Therefore, there are convenience methods to list feature set users according to their rights. Each of these methods returns the users that have the specified or higher rights, together with their actual access rights and a resource type (project or feature set) specifying where the access permission comes from.
The list method does not return users directly. Instead, it returns an iterator which obtains the users lazily.
- Python
- Scala
# listing users by access rights
fs = project.feature_sets.get("training_fs")
owners = fs.list_owners()
editors = fs.list_editors()
sensitive_consumers = fs.list_sensitive_consumers()
consumers = fs.list_consumers()
viewers = fs.list_viewers()
# accessing returned element
owner = next(owners)
owner.user
owner.access_type
owner.resource_type
// listing users by access rights
val fs = project.featureSets.get("training_fs")
val owners = fs.listOwners()
val editors = fs.listEditors()
val sensitiveConsumers = fs.listSensitiveConsumers()
val consumers = fs.listConsumers()
val viewers = fs.listViewers()
// accessing returned element
val owner = owners.next
owner.user
owner.accessType
owner.resourceType
Artifacts
Refer to the Artifacts API for more information.
Derived feature sets
As mentioned in the beginning, a (derived) feature set can be defined in terms of other feature sets and a transformation. There are several convenience methods that help you find out the lineage of a given feature set.
Is the feature set a derived one or not?
- Python
- Scala
fs.is_derived()
fs.isDerived()
Which feature sets were used to define this derived feature set?
- Python
- Scala
parent_feature_sets = fs.get_parent_feature_sets()
val parentFeatureSets = fs.getParentFeatureSets()
To get a list of the derived feature set(s) that were built upon this feature set:
- Python
- Scala
derived_feature_sets = fs.get_derived_feature_sets()
val derivedFeatureSets = fs.getDerivedFeatureSets()
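Putting these together, a short lineage sketch in Python using the calls above:
# walk one level of lineage in both directions
if fs.is_derived():
    parents = fs.get_parent_feature_sets()   # what this feature set was built from
children = fs.get_derived_feature_sets()     # what was built on top of this feature set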
Open feature set in Web UI
This method opens the given feature set in the Feature Store Web UI.
- Python
- Scala
fs.open_website()
fs.openWebsite()
Optimizing feature set storage (Delta lake backend only)
In special cases, there can be a performance benefit when a feature set's data gets optimized. To manually enforce a storage optimization, use the following call. By default, feature set storage is optimized with Z-order optimization on the primary key(s). If optimization on a different list of features is needed, you can specify it explicitly when making the call.
The optimization call returns optimization metrics provided by the storage. Furthermore, a new minor feature set version is created; the updated feature set version contains the optimization input as one of its attributes.
- Python
- Scala
# z-order optimization for primary key(s) by default
response = fs.optimize_storage()
# show response details
response.optimization_metrics
# z-order optimization for specific columns
from featurestore import ZOrderByOptimization  # import path assumed by analogy with AdvancedSearchOption
fs.optimize_storage(ZOrderByOptimization(["name", "age"]))
# refresh version and show optimization input
fs.refresh()
fs.storage_optimization
// z-order optimization for primary key(s) by default
val response = fs.optimizeStorage()
// show response details
response.optimizationMetrics
// z-order optimization for specific columns
import ai.h2o.featurestore.core.ZOrderByOptimization
fs.optimizeStorage(ZOrderByOptimization(Seq("name", "age")))
// refresh version and show optimization input
fs.refresh()
fs.storageOptimization