Feature set API
Registering a feature set
To register a feature set, you first need to obtain the schema. See Schema API for information on how to create the schema.
- Python
project.feature_sets.register(schema, "feature_set_name", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", partition_by=None, time_travel_column_as_partition=False, flow=None)
If the partition_by argument is not set, the time travel column is used by Feature Store to partition the layout by each ingestion. If partition_by is defined, time_travel_column_as_partition can additionally be set to True to use time-travel-based partitioning on top of it.
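The partitioning rules above can be summarized with a small sketch (illustrative only, not Feature Store internals; partition_columns is a hypothetical helper):

```python
# Illustrative sketch of which columns end up partitioning the layout,
# per the rules described above.
def partition_columns(time_travel_column, partition_by=None,
                      time_travel_column_as_partition=False):
    if partition_by is None:
        # No explicit partitioning: the time travel column partitions each ingestion.
        return [time_travel_column]
    cols = list(partition_by)
    if time_travel_column_as_partition:
        # Optionally add time-travel-based partitioning on top.
        cols.append(time_travel_column)
    return cols
```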
The feature_sets.register and feature_set.flow methods use the enum FeatureSetFlow. An enum (enumeration) is a set of named constant values; enums group related values and make code more readable and maintainable.
If the flow argument is set, it influences where data is stored.
The following values (from the enum FeatureSetFlow) are supported:
- FeatureSetFlow.OFFLINE_ONLY - data is stored only in the offline Feature Store. Online ingestion and materialization are disabled.
- FeatureSetFlow.ONLINE_ONLY - data is stored only in the online Feature Store. Offline ingestion and materialization are disabled.
- FeatureSetFlow.OFFLINE_ONLINE_MANUAL - data is stored in both the offline and online Feature Store, but automatic materialization to online is disabled. Propagating data from online to offline is automated, but offline to online is manual and must be triggered by an online materialization call.
- FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC - data is stored in both the offline and online Feature Store, and automatic materialization to online is enabled. Data is propagated automatically in both directions (offline to online and online to offline), so you do not have to call materialize_online yourself.
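The four flows can be summarized with a stand-in sketch (the mirrored enum and the stores_used helper are hypothetical illustrations; the real FeatureSetFlow lives in the featurestore package):

```python
from enum import Enum, auto

# Stand-in mirror of FeatureSetFlow, for illustration only.
class FeatureSetFlow(Enum):
    OFFLINE_ONLY = auto()
    ONLINE_ONLY = auto()
    OFFLINE_ONLINE_MANUAL = auto()
    OFFLINE_ONLINE_AUTOMATIC = auto()

def stores_used(flow):
    """Return (offline store used, online store used, automatic online
    materialization) per the list above."""
    return {
        FeatureSetFlow.OFFLINE_ONLY: (True, False, False),
        FeatureSetFlow.ONLINE_ONLY: (False, True, False),
        FeatureSetFlow.OFFLINE_ONLINE_MANUAL: (True, True, False),
        FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC: (True, True, True),
    }[flow]
```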
If the primary key or partition by arguments contain the same feature multiple times, only distinct values are used.
If a value in primary key, partition by, or time travel column corresponds to two or more features, the most nested one is selected by default. A specific feature can be selected instead by enclosing the feature name in backticks (``).
For example, suppose a feature set contains a feature named "test.data" and a second feature "test" with a nested feature "data". By default, for the value "test.data", the nested feature "data" is selected. If the feature named "test.data" should be selected instead, the value should be changed to "`test.data`".
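The de-duplication rule can be illustrated in plain Python (Feature Store does this internally; the snippet is only a sketch of the behavior):

```python
# Repeated names in primary_key / partition_by collapse to distinct values;
# dict.fromkeys removes duplicates while preserving first-seen order.
primary_key = ["id", "region", "id"]
distinct = list(dict.fromkeys(primary_key))
```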
Feature Store uses the time format used by Spark. The specification is available here.
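For orientation, the default pattern "yyyy-MM-dd HH:mm:ss" is a Spark datetime pattern; the snippet below (illustrative only) parses a value in that layout using the equivalent Python strptime directives:

```python
from datetime import datetime

# Spark pattern "yyyy-MM-dd HH:mm:ss" corresponds to the Python strptime
# format "%Y-%m-%d %H:%M:%S".
value = "2024-03-01 12:30:00"
parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
```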
If a user wants to create feature sets that are accessible only by the owner and the users the owner has given permission to, the feature set should be created in a private project.
To see naming conventions for feature set names, please visit Default naming rules.
To register a derived feature set, you first need to obtain the derived schema. See Schema API for information on how to create the schema.
- Python
import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")
derived_schema = client.extract_derived_schema([parent_feature_set], spark_pipeline_transformation)
project.feature_sets.register(derived_schema, "derived_feature_set", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", partition_by=None, time_travel_column_as_partition=False)
Features can be masked by setting Special Data fields in the schema. For further information, please visit Modify special data on a schema.
Setting any of the following attributes to true marks the feature for masking:
- spi - Sensitive Personal Information
- pci - Payment Card Industry
- rpi - Real Property Inventory
- demographic
- sensitive
Any of the special data tags enables the masking functionality and separates the sensitive consumer output (i.e., unmasked data) from the masked view that the consumer role sees. Which tag is selected is a matter of bookkeeping rather than a difference in functionality.
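The masking rule can be sketched as follows (illustrative only; is_masked is a hypothetical helper, and the real check happens inside Feature Store):

```python
# Any one special-data tag set to True is enough to mark a feature for masking.
SPECIAL_TAGS = ("spi", "pci", "rpi", "demographic", "sensitive")

def is_masked(special_data):
    # special_data modeled as a plain dict of tag -> bool for this sketch
    return any(special_data.get(tag, False) for tag in SPECIAL_TAGS)
```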
Feature Store does not support registering feature sets with the following characters in column names:
- ,
- ;
- { or }
- ( or )
- new line character
- tab character
- =
Time travel column selection
You can specify a time travel column during the registration call. If the column is specified, Feature Store will use that column to obtain time travel data and will use it for incremental ingest purposes. The explicitly passed time travel column must be present in the schema passed to the registration call.
If the time travel column is not specified, a virtual one is created, so you can still do time travel on static feature sets. Each ingestion to this feature set is treated as a new batch of data with a new timestamp.
Use the following register method argument to specify the name of the time travel column explicitly:
- Python
time_travel_column
Inferring the data type of date-time columns during feature set registration
File types without schema information: For file types that have no metadata about column types (e.g., CSV), Feature Store detects date-time columns as regular strings.
File types containing schema information: For file types that keep information about the data types (e.g., Parquet), Feature Store respects those types. If a date-time column is stored with a type of Timestamp or Date, Feature Store will respect that during the registration.
Listing feature sets within a project
The list method does not return feature sets directly. Instead, it returns an iterator which obtains the feature sets lazily.
- Python
project.feature_sets.list(query=None, advanced_search_options=None)
The query and advanced_search_options arguments are optional and specify which feature sets should be returned. By default, no filtering options are applied.
To filter feature sets by name, description, or tags, use the query parameter.
- Python
project.feature_sets.list(query="My feature")
The advanced_search_options parameter allows filtering feature sets by feature name, description, or tags.
To provide advanced_search_options in your requests, follow these steps:
- Python
from featurestore.core.search_operator import SearchOperator
from featurestore.core.search_field import SearchField
from featurestore import AdvancedSearchOption
search_options = [AdvancedSearchOption(search_operator=SearchOperator.SEARCH_OPERATOR_LIKE, search_field=SearchField.SEARCH_FIELD_FEATURE_NAME, search_value="super feature")]
project.feature_sets.list(advanced_search_options=search_options)
Both parameters can be used together.
You can also list all major versions of the feature set:
- Python
fs.major_versions()
This call shows all major versions of the feature set (the current and previous ones).
You can also list all versions of the feature set:
- Python
fs.list_versions()
This call shows all versions of the feature set (the current and previous ones).
Obtaining a feature set
- Python
fs = project.feature_sets.get("feature_set_name", version=None)
If the version is not specified, the latest version of the feature set is returned.
You can also get the latest minor version of a feature set for a given major version:
- Python
fs = project.feature_sets.get_major_version("feature_set_name", 2)
It is also possible to obtain a different version of a feature set from an existing feature set instance:
- Python
fs = feature_set.get_version("2.1")
Previewing data
You can preview up to a maximum of 100 rows and 50 features.
- Python
fs.get_preview()
Setting feature set permissions
Refer to Permissions for more information.
Deleting feature sets
- Python
fs = project.feature_sets.get("name")
fs.delete()
Deleting feature set major versions
- Python
fs = project.feature_sets.get("name")
major_versions = fs.major_versions()
major_versions[0].delete()
Updating feature set fields
To update the field, simply call the setter of that field, for example:
- Python
fs = project.feature_sets.get("name")
fs.deprecated = True
fs.time_to_live.offline = 2
fs.special_data.legal.approved = True
fs.special_data.legal.notes = "Legal notes"
fs.features["col"].special_data.legal.approved = True
fs.features["col"].special_data.legal.notes = "Legal notes"
# Add a new tag to the feature set
fs.tags.append("new tag") # This will add the new tag to the list of existing tags
# Add new tags that will overwrite any existing tags
fs.tags = ["new tag 1", "new tag 2"] # This will overwrite the existing tags with the given list of values
# Assigning a string to tags is not supported
fs.tags = "new tag" # This operation is not supported as tags accepts only a list of strings as input
# Add a new value to the data source domains on the feature set
fs.data_source_domains.append("new domain") # This will add the new domain to the list of existing domains
# Add new domains that will overwrite any existing domains
fs.data_source_domains = ["new domain 1", "new domain 2"] # This will overwrite the existing domains with the given list of values
# Assigning a string to domain is not supported
fs.data_source_domains = "new domain" # This operation is not supported as domain accepts only a list of strings as input
Feature type can be changed by:
- Python
from featurestore.core.entities.feature import CATEGORICAL
fs = project.feature_sets.get("name")
feature = fs.features["feature"]
feature.profile.feature_type = CATEGORICAL
The following list of fields can be updated on the feature set object:
- Python
- tags
- data_source_domains
- feature_set_type
- description
- application_name
- application_id
- deprecated
- process_interval
- process_interval_unit
- flow
- feature_set_state
- time_to_live.ttl_offline
- time_to_live.ttl_offline_interval
- time_to_live.ttl_online
- time_to_live.ttl_online_interval
- special_data.legal.approved
- special_data.legal.notes
- feature[].status
- feature[].profile.feature_type
- feature[].importance
- feature[].description
- feature[].special
- feature[].monitoring.anomaly_detection
- feature[].classifiers
feature_set_type has two values, RAW or ENGINEERED. It denotes whether the feature set was derived from raw or processed data. This classification exists for informational purposes and does not affect Feature Store behavior.
time_to_live is currently respected for data in the online Feature Store only. It indicates the duration for which records remain stored before they are evicted.
To retrospectively find out who and when updated a feature set, call:
- Python
fs.last_updated_by
fs.last_updated_date_time
Recommendation and classifiers
Refer to the Recommendation API for more information.
New version API
Refer to the Create new feature set version API for more information.
Feature set schema API
Getting schema
To get feature set's schema, run:
- Python
fs = project.feature_sets.get("gdp")
fs.schema.get()
Checking schema compatibility
To compare feature set's schema with the new data source's schema, run:
- Python
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.is_compatible_with(new_schema, compare_data_types=True)
Parameter explanation:
- new_schema - the new schema to check compatibility with.
- compare_data_types - accepts True/False; indicates whether data types need to be compared.
  - If compare_data_types is True, data types for features with the same name are verified.
  - If compare_data_types is False, data types for features with the same name are not verified.
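What the compare_data_types switch controls can be sketched roughly as follows (illustrative only; this is not Feature Store's actual implementation, and schemas are modeled as plain name-to-type dicts):

```python
# Sketch: compare_data_types only affects features present in BOTH schemas.
def types_match(existing, new, compare_data_types=True):
    shared = set(existing) & set(new)
    if not compare_data_types:
        return True  # names only; types of shared features are not verified
    return all(existing[name] == new[name] for name in shared)
```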
Patching new schema
Patching a schema checks for matching features between the new schema and the existing fs.schema. If there is a match, metadata such as special_data and description is copied into the new_schema.
To patch the new schema with feature set's schema, run:
- Python
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.patch_from(new_schema, compare_data_types=True)
Parameter explanation:
- new_schema - the new schema that needs to be patched.
- compare_data_types - accepts True/False; indicates whether data types are to be compared while patching.
  - If compare_data_types is True, the data type from the feature set schema is retained for features with the same name and different types.
  - If compare_data_types is False, the data type from the new schema is retained for features with the same name and different types.
Offline to online API
To push existing data from the offline Feature Store into the online Feature Store, run:
Blocking approach:
- Python
feature_set.materialize_online()
Non-Blocking approach:
- Python
future = feature_set.materialize_online_async()
Feature set must have a primary key and time travel column defined in order to materialize the offline store into online.
More information about asynchronous methods is available at Asynchronous methods.
Subsequent calls to materialization only push the new records since the last call to online.
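Conceptually, the incremental behavior resembles a watermark filter (a sketch only; records_to_push and the ts field are hypothetical, not part of the API):

```python
# Only records newer than the last materialization watermark are pushed
# on subsequent calls.
def records_to_push(records, last_materialized_ts):
    return [r for r in records if r["ts"] > last_materialized_ts]
```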
Online to offline API
There is a background process that periodically starts online-to-offline ingestion, but if there is a need to push existing data from the online Feature Store into the offline Feature Store earlier than scheduled, run:
Blocking approach:
- Python
feature_set.start_online_offline_ingestion()
Non-Blocking approach:
- Python
job = feature_set.start_online_offline_ingestion_async()
Feature set jobs API
You can get the list of jobs that are currently processing for a specific feature set by running the methods below. You can also retrieve a specific type of job by specifying the job_type parameter.
- Python
from featurestore.core.job_types import INGEST, RETRIEVE, EXTRACT_SCHEMA
fs.get_active_jobs()
fs.get_active_jobs(job_type=INGEST)
Refreshing feature set
To refresh the feature set to contain the latest information, call:
- Python
fs.refresh()
Getting recommendations
To get recommendations, call:
- Python
fs.get_recommendations()
The following conditions must hold for recommendations:
- The feature set must have at least one classifier defined.
- The results will be based on the retrieve permissions of the user.
Marking feature as target variable
When feature sets are used to train ML models, it can be beneficial to know which feature was used as the model's target variable. To communicate this knowledge between different feature set users, a feature can be marked as (or discarded as) a target variable, and the marked features can be listed.
- Python
feature_state = fs.features["state"]
feature_state.mark_as_target_variable()
fs.list_features_used_as_target_variable()
feature_state.discard_as_target_variable()
Listing feature set users
From the feature set owner's perspective, it may be necessary to understand who is actually allowed to access and modify a given feature set. Therefore, there are convenience methods to list feature set users according to their rights. Each of these methods returns a list of users that have the specified or higher rights, their actual access rights, and a resource type (project or feature set) specifying where the access right permission comes from.
The list method does not return users directly. Instead, it returns an iterator which obtains the users lazily.
- Python
# listing users by access rights
fs = project.feature_sets.get("training_fs")
owners = fs.list_owners()
editors = fs.list_editors()
sensitive_consumers = fs.list_sensitive_consumers()
consumers = fs.list_consumers()
viewers = fs.list_viewers()
# accessing returned element
owner = next(owners)
owner.user
owner.access_type
owner.resource_type
Artifacts
Refer to the Artifacts API for more information.
Derived feature sets
As mentioned in the beginning, a (derived) feature set can be defined in terms of other feature sets and a transformation. There are several convenience methods that help you discover the lineage of a given feature set.
- In the Feature Store, lineage is preserved by tracking the ingest history. This allows users to identify the data source from which the ingest occurred.
- Users can create derived feature sets which are transformations of existing feature sets. This relationship is also preserved within the Feature Store.
Is the feature set a derived one or not?
- Python
fs.is_derived()
Which feature sets were used to define this derived feature set?
- Python
parent_feature_sets = fs.get_parent_feature_sets()
To get a list of derived feature set(s) that were built upon this feature set:
- Python
derived_feature_sets = fs.get_derived_feature_sets()
Open feature set in Web UI
This method opens the given feature set in the Feature Store Web UI.
- Python
fs.open_website()
Optimizing feature set storage (Delta lake backend only)
In special cases, there can be a performance benefit when a feature set's data is optimized. To manually enforce a storage optimization, use the following call. By default, feature set storage is optimized with Z-order optimization on the primary key(s). If optimization on a different list of features is needed, you can specify the optimization explicitly when making the call.
The optimization call returns optimization metrics provided by storage. Furthermore, a new minor feature set version gets created. The updated feature set version contains optimization input as one of its attributes.
- Python
# z-order optimization for primary key(s) by default
response = fs.optimize_storage()
# show response details
response.optimization_metrics
# z-order optimization for specific columns
fs.optimize_storage(ZOrderByOptimization(["name", "age"]))
# refresh version and show optimization input
fs.refresh()
fs.storage_optimization