Feature set API
Registering a feature set
To register a feature set, you first need to obtain the schema. See Schema API for information on how to create the schema.
- Python
project.feature_sets.register(schema, "feature_set_name", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", partition_by=None, time_travel_column_as_partition=False, flow=None)
If the partition_by argument is not set, the time travel column is used by Feature Store to partition the layout by each ingestion. If partition_by is defined, time_travel_column_as_partition can additionally be set to True to use time-travel-based partitioning on top of it.
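The partitioning rules above can be summarized with a small sketch (illustrative only, not Feature Store internals; partition_columns is a hypothetical helper):

```python
# Illustrative sketch of which columns end up partitioning the layout,
# per the rules described above.
def partition_columns(time_travel_column, partition_by=None,
                      time_travel_column_as_partition=False):
    if partition_by is None:
        # No explicit partitioning: the time travel column partitions each ingestion.
        return [time_travel_column]
    cols = list(partition_by)
    if time_travel_column_as_partition:
        # Optionally add time-travel-based partitioning on top.
        cols.append(time_travel_column)
    return cols
```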
The feature_sets.register and feature_set.flow methods use the enum FeatureSetFlow. An enum (enumeration) is a set of named constant values; enums group related values and make code more readable and maintainable.
If the flow argument is set, it influences where data is stored.
The following values (from the enum FeatureSetFlow) are supported:
- FeatureSetFlow.OFFLINE_ONLY - data is stored only in the offline Feature Store. Online ingestion and materialization are disabled.
- FeatureSetFlow.ONLINE_ONLY - data is stored only in the online Feature Store. Offline ingestion and materialization are disabled.
- FeatureSetFlow.OFFLINE_ONLINE_MANUAL - data is stored in both the offline and online Feature Store, but automatic materialization to online is disabled. Propagating data from online to offline is automated, but offline to online is manual and must be triggered by an online materialization call.
- FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC - data is stored in both the offline and online Feature Store, and automatic materialization to online is enabled. Data is propagated automatically in both directions (offline to online and online to offline), so you do not have to call materialize_online yourself.
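The four flows can be summarized with a stand-in sketch (the mirrored enum and the stores_used helper are hypothetical illustrations; the real FeatureSetFlow lives in the featurestore package):

```python
from enum import Enum, auto

# Stand-in mirror of FeatureSetFlow, for illustration only.
class FeatureSetFlow(Enum):
    OFFLINE_ONLY = auto()
    ONLINE_ONLY = auto()
    OFFLINE_ONLINE_MANUAL = auto()
    OFFLINE_ONLINE_AUTOMATIC = auto()

def stores_used(flow):
    """Return (offline store used, online store used, automatic online
    materialization) per the list above."""
    return {
        FeatureSetFlow.OFFLINE_ONLY: (True, False, False),
        FeatureSetFlow.ONLINE_ONLY: (False, True, False),
        FeatureSetFlow.OFFLINE_ONLINE_MANUAL: (True, True, False),
        FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC: (True, True, True),
    }[flow]
```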
If the primary key or partition by arguments contain the same feature multiple times, only distinct values are used.
If a value in primary key, partition by, or time travel column corresponds to two or more features, the most nested one is selected by default. A specific feature can be selected instead by enclosing the feature name in backticks (``).
For example, suppose a feature set contains a feature named "test.data" and a second feature "test" with a nested feature "data". By default, for the value "test.data", the nested feature "data" is selected. If the feature named "test.data" should be selected instead, the value should be changed to "`test.data`".
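The de-duplication rule can be illustrated in plain Python (Feature Store does this internally; the snippet is only a sketch of the behavior):

```python
# Repeated names in primary_key / partition_by collapse to distinct values;
# dict.fromkeys removes duplicates while preserving first-seen order.
primary_key = ["id", "region", "id"]
distinct = list(dict.fromkeys(primary_key))
```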
Feature Store uses the time format used by Spark. The specification is available here.
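For orientation, the default pattern "yyyy-MM-dd HH:mm:ss" is a Spark datetime pattern; the snippet below (illustrative only) parses a value in that layout using the equivalent Python strptime directives:

```python
from datetime import datetime

# Spark pattern "yyyy-MM-dd HH:mm:ss" corresponds to the Python strptime
# format "%Y-%m-%d %H:%M:%S".
value = "2024-03-01 12:30:00"
parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
```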
If a user wants to create feature sets that are accessible only by the owner and the users the owner has given permission to, the feature set should be created in a private project.
To see naming conventions for feature set names, please visit Default naming rules.
To register a derived feature set, you first need to obtain the derived schema. See Schema API for information on how to create the schema.
- Python
import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")
derived_schema = client.extract_derived_schema([parent_feature_set], spark_pipeline_transformation)
project.feature_sets.register(derived_schema, "derived_feature_set", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", partition_by=None, time_travel_column_as_partition=False)
Features can be masked by setting Special Data fields in the schema. For further information, please visit Modify special data on a schema.
Setting any of the following attributes to true marks the feature for masking:
- spi - Sensitive Personal Information
- pci - Payment Card Industry
- rpi - Real Property Inventory
- demographic
- sensitive
Any of the special data tags enables the masking functionality and separates the sensitive consumer output (i.e., unmasked data) from the masked view that the consumer role sees. Which tag is selected is a matter of bookkeeping rather than a difference in functionality.
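The masking rule can be sketched as follows (illustrative only; is_masked is a hypothetical helper, and the real check happens inside Feature Store):

```python
# Any one special-data tag set to True is enough to mark a feature for masking.
SPECIAL_TAGS = ("spi", "pci", "rpi", "demographic", "sensitive")

def is_masked(special_data):
    # special_data modeled as a plain dict of tag -> bool for this sketch
    return any(special_data.get(tag, False) for tag in SPECIAL_TAGS)
```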
Feature Store does not support registering feature sets with the following characters in column names:
- ,
- ;
- { or }
- ( or )
- new line character
- tab character
- =
Time travel column selection
You can specify a time travel column during the registration call. If the column is specified, Feature Store will use that column to obtain time travel data and will use it for incremental ingest purposes. The explicitly passed time travel column must be present in the schema passed to the registration call.
If the time travel column is not specified, a virtual one is created, so you can still do time travel on static feature sets. Each ingestion to this feature set is treated as a new batch of data with a new timestamp.
Use the following register method argument to specify the name of the time travel column explicitly:
- Python
time_travel_column
Inferring the data type of date-time columns during feature set registration
File types without schema information: For file types that have no metadata about column types (e.g., CSV), Feature Store detects date-time columns as regular strings.
File types containing schema information: For file types that keep information about the data types (e.g., Parquet), Feature Store respects those types. If a date-time column is stored with a type of Timestamp or Date, Feature Store will respect that during the registration.
Listing feature sets within a project
The list method does not return feature sets directly. Instead, it returns an iterator which obtains the feature sets lazily.
- Python
project.feature_sets.list(query=None, advanced_search_options=None)
The query and advanced_search_options arguments are optional and specify which feature sets should be returned. By default, no filtering options are applied.
To filter feature sets by name, description, or tags, use the query parameter.
- Python
project.feature_sets.list(query="My feature")
The advanced_search_options parameter allows filtering feature sets by feature name, description, or tags.
To provide advanced_search_options in your requests, follow these steps:
- Python
from featurestore.core.search_operator import SearchOperator
from featurestore.core.search_field import SearchField
from featurestore import AdvancedSearchOption
search_options = [AdvancedSearchOption(search_operator=SearchOperator.SEARCH_OPERATOR_LIKE, search_field=SearchField.SEARCH_FIELD_FEATURE_NAME, search_value="super feature")]
project.feature_sets.list(advanced_search_options=search_options)
Both parameters can be used together.
You can also list all major versions of the feature set:
- Python
fs.major_versions()
This call shows all major versions of the feature set (the current and previous ones).
You can also list all versions of the feature set:
- Python
fs.list_versions()
This call shows all versions of the feature set (the current and previous ones).
Obtaining a feature set
- Python
fs = project.feature_sets.get("feature_set_name", version=None)
If the version is not specified, the latest version of the feature set is returned.
You can also get the latest minor version of a feature set for a given major version:
- Python
fs = project.feature_sets.get_major_version("feature_set_name", 2)
It is also possible to obtain a different version of a feature set from an existing feature set instance:
- Python
fs = feature_set.get_version("2.1")
Previewing data
You can preview up to a maximum of 100 rows and 50 features.
- Python
fs.get_preview()
Setting feature set permissions
Refer to Permissions for more information.
Deleting feature sets
- Python
fs = project.feature_sets.get("name")
fs.delete()
Deleting feature set major versions
- Python
fs = project.feature_sets.get("name")
major_versions = fs.major_versions()
major_versions[0].delete()
Updating feature set fields
To update the field, simply call the setter of that field, for example:
- Python
fs = project.feature_sets.get("name")
fs.deprecated = True
fs.time_to_live.offline = 2
fs.special_data.legal.approved = True
fs.special_data.legal.notes = "Legal notes"
fs.features["col"].special_data.legal.approved = True
fs.features["col"].special_data.legal.notes = "Legal notes"
# Add a new tag to the feature set
fs.tags.append("new tag") # This will add the new tag to the list of existing tags
# Add new tags that will overwrite any existing tags
fs.tags = ["new tag 1", "new tag 2"] # This will overwrite the existing tags with the given list of values
# Assigning a string to tags is not supported
fs.tags = "new tag" # This operation is not supported as tags accepts only a list of strings as input
# Add a new value to the data source domains on the feature set
fs.data_source_domains.append("new domain") # This will add the new domain to the list of existing domains
# Add new domains that will overwrite any existing domains
fs.data_source_domains = ["new domain 1", "new domain 2"] # This will overwrite the existing domains with the given list of values
# Assigning a string to domain is not supported
fs.data_source_domains = "new domain" # This operation is not supported as domain accepts only a list of strings as input
Feature type can be changed by:
- Python
from featurestore.core.entities.feature import CATEGORICAL
fs = project.feature_sets.get("name")
feature = fs.features["feature"]
feature.profile.feature_type = CATEGORICAL
The following list of fields can be updated on the feature set object:
- Python
- tags
- data_source_domains
- feature_set_type
- description
- application_name
- application_id
- deprecated
- process_interval
- process_interval_unit
- flow
- feature_set_state
- time_to_live.ttl_offline
- time_to_live.ttl_offline_interval
- time_to_live.ttl_online
- time_to_live.ttl_online_interval
- special_data.legal.approved
- special_data.legal.notes
- feature[].status
- feature[].profile.feature_type
- feature[].importance
- feature[].description
- feature[].special
- feature[].monitoring.anomaly_detection
- feature[].classifiers
feature_set_type has two values, RAW or ENGINEERED. It denotes whether the feature set was derived from raw or processed data. This classification exists for informational purposes and does not affect Feature Store behavior.
time_to_live is currently respected for data in the online Feature Store only. It indicates the duration for which records remain stored before they are evicted.
To retrospectively find out who and when updated a feature set, call:
- Python
fs.last_updated_by
fs.last_updated_date_time
Recommendation and classifiers
Refer to the Recommendation API for more information.
New version API
Refer to the Create new feature set version API for more information.
Feature set schema API
Getting schema
To get feature set's schema, run:
- Python
fs = project.feature_sets.get("gdp")
fs.schema.get()
Checking schema compatibility
To compare feature set's schema with the new data source's schema, run:
- Python
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.is_compatible_with(new_schema, compare_data_types=True)
Parameter explanation:
- new_schema - the new schema to check compatibility with.
- compare_data_types - accepts True/False; indicates whether data types need to be compared.
  - If compare_data_types is True, data types for features with the same name are verified.
  - If compare_data_types is False, data types for features with the same name are not verified.
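What the compare_data_types switch controls can be sketched roughly as follows (illustrative only; this is not Feature Store's actual implementation, and schemas are modeled as plain name-to-type dicts):

```python
# Sketch: compare_data_types only affects features present in BOTH schemas.
def types_match(existing, new, compare_data_types=True):
    shared = set(existing) & set(new)
    if not compare_data_types:
        return True  # names only; types of shared features are not verified
    return all(existing[name] == new[name] for name in shared)
```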
Patching new schema
Patching a schema checks for matching features between the new schema and the existing fs.schema. If there is a match, metadata such as special_data and description is copied into the new_schema.
To patch the new schema with feature set's schema, run:
- Python
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.patch_from(new_schema, compare_data_types=True)
Parameter explanation:
- new_schema - the new schema that needs to be patched.
- compare_data_types - accepts True/False; indicates whether data types are to be compared while patching.
  - If compare_data_types is True, the data type from the feature set schema is retained for features with the same name and different types.
  - If compare_data_types is False, the data type from the new schema is retained for features with the same name and different types.
Offline to online API
To push existing data from the offline Feature Store into the online Feature Store, run:
Blocking approach:
- Python
feature_set.materialize_online()
Non-Blocking approach:
- Python
future = feature_set.materialize_online_async()
Feature set must have a primary key and time travel column defined in order to materialize the offline store into online.
More information about asynchronous methods is available at Asynchronous methods.
Subsequent calls to materialization only push the new records since the last call to online.
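Conceptually, the incremental behavior resembles a watermark filter (a sketch only; records_to_push and the ts field are hypothetical, not part of the API):

```python
# Only records newer than the last materialization watermark are pushed
# on subsequent calls.
def records_to_push(records, last_materialized_ts):
    return [r for r in records if r["ts"] > last_materialized_ts]
```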
Online to offline API
There is a background process that periodically starts online-to-offline ingestion, but if there is a need to push existing data from the online Feature Store into the offline Feature Store earlier than scheduled, run:
Blocking approach:
- Python
feature_set.start_online_offline_ingestion()
Non-Blocking approach:
- Python
job = feature_set.start_online_offline_ingestion_async()
Feature set jobs API
You can get the list of jobs that are currently processing for a specific feature set by running the methods below. You can also retrieve a specific type of job by specifying the job_type parameter.
- Python
from featurestore.core.job_types import INGEST, RETRIEVE, EXTRACT_SCHEMA
fs.get_active_jobs()
fs.get_active_jobs(job_type=INGEST)
Refreshing feature set
To refresh the feature set to contain the latest information, call:
- Python
fs.refresh()
Getting recommendations
To get recommendations, call:
- Python
fs.get_recommendations()
The following conditions must hold for recommendations:
- The feature set must have at least one classifier defined.
- The results will be based on the retrieve permissions of the user.
Marking feature as target variable
When feature sets are used to train ML models, it can be beneficial to know which feature was used as the model's target variable. To communicate this knowledge between different feature set users, a feature can be marked as (or discarded as) a target variable, and the marked features can be listed.
- Python
feature_state = fs.features["state"]
feature_state.mark_as_target_variable()
fs.list_features_used_as_target_variable()
feature_state.discard_as_target_variable()
Listing feature set users
From the feature set owner's perspective, it may be necessary to understand who is actually allowed to access and modify a given feature set. Therefore, there are convenience methods to list feature set users according to their rights. Each of these methods returns a list of users that have the specified or higher rights, their actual access rights, and a resource type (project or feature set) specifying where the access right permission comes from.
The list method does not return users directly. Instead, it returns an iterator which obtains the users lazily.
- Python
# listing users by access rights
fs = project.feature_sets.get("training_fs")
owners = fs.list_owners()
editors = fs.list_editors()
sensitive_consumers = fs.list_sensitive_consumers()
consumers = fs.list_consumers()
viewers = fs.list_viewers()
# accessing returned element
owner = next(owners)
owner.user
owner.access_type
owner.resource_type
Artifacts
Refer to the Artifacts API for more information.
Derived feature sets
As mentioned in the beginning, a (derived) feature set can be defined in terms of other feature sets and a transformation. There are several convenience methods that help you discover the lineage of a given feature set.
- In the Feature Store, lineage is preserved by tracking the ingest history. This allows users to identify the data source from which the ingest occurred.
- Users can create derived feature sets which are transformations of existing feature sets. This relationship is also preserved within the Feature Store.
Is the feature set a derived one or not?
- Python
fs.is_derived()
Which feature sets were used to define this derived feature set?
- Python
parent_feature_sets = fs.get_parent_feature_sets()
To get a list of derived feature set(s) that were built upon this feature set:
- Python
derived_feature_sets = fs.get_derived_feature_sets()
Open feature set in Web UI
This method opens the given feature set in the Feature Store Web UI.
- Python
fs.open_website()
Optimizing feature set storage (Delta lake backend only)
In special cases, there can be a performance benefit when a feature set's data is optimized. To manually enforce a storage optimization, use the following call. By default, feature set storage is optimized with Z-order optimization on the primary key(s). If optimization on a different list of features is needed, you can specify the optimization explicitly when making the call.
The optimization call returns optimization metrics provided by storage. Furthermore, a new minor feature set version gets created. The updated feature set version contains optimization input as one of its attributes.
- Python
# z-order optimization for primary key(s) by default
response = fs.optimize_storage()
# show response details
response.optimization_metrics
# z-order optimization for specific columns
fs.optimize_storage(ZOrderByOptimization(["name", "age"]))
# refresh version and show optimization input
fs.refresh()
fs.storage_optimization