Version: 0.19.3

Schema API

A schema is extracted from a data source. The schema represents the features of the feature set.

Creating the schema

  • create_from is available on the Schema class and creates a schema instance from a string-formatted schema
  • create_derived_from is available on the Schema class and creates a derived schema instance from a string-formatted schema, a parent feature set, and a transformation
  • to_string is available on a schema instance and serializes the schema object to string format

Usage

Create a schema from a string

A schema can be created from a string:

from featurestore import Schema
schema_str = "col1 string, col2 string, col3 integer"
schema = Schema.create_from(schema_str)

Create a derived schema from a string

from featurestore import Schema
import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")
schema_str = "id INT, text STRING, label DOUBLE, state STRING, date STRING, words ARRAY<STRING>"
# parent_feature_set is an existing feature set that the derived schema is based on
schema = Schema.create_derived_from(schema_str, [parent_feature_set], spark_pipeline_transformation)
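The schema strings above are comma-separated "name type" pairs, and a type may itself contain angle brackets, as in ARRAY<STRING>. The following is a minimal sketch of how such a string could be split into (name, type) pairs. It is an illustration only and not the Feature Store parser; split_schema_string is a hypothetical helper, and the MAP<STRING, INT> type in the usage note is a made-up example of a comma inside brackets, not a documented data type.

```python
from typing import List, Tuple


def split_schema_string(schema_str: str) -> List[Tuple[str, str]]:
    """Split "name TYPE, name TYPE" into (name, type) pairs.

    Commas inside angle brackets do not separate columns.
    Hypothetical helper; this is not the Feature Store implementation.
    """
    columns, current, depth = [], [], 0
    for ch in schema_str:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma ends the current column definition.
            columns.append("".join(current))
            current = []
        else:
            current.append(ch)
    columns.append("".join(current))

    pairs = []
    for column in columns:
        column = column.strip()
        if not column:
            continue
        # The first space separates the column name from its type.
        name, _, data_type = column.partition(" ")
        pairs.append((name, data_type.strip()))
    return pairs


print(split_schema_string("col1 string, col2 string, col3 integer"))
print(split_schema_string("id INT, words ARRAY<STRING>"))
```

Tracking bracket depth keeps a type such as the hypothetical MAP<STRING, INT> in one piece instead of splitting it at its inner comma.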

Create a schema from a data source

A schema can also be created from a data source. To see all supported data sources, see Supported data sources.

# Use credentials taken from environment variables
schema = client.extract_schema_from_source(source)
# Or pass credentials explicitly
schema = client.extract_schema_from_source(source, credentials)
note

An optional parameter, credentials, can be specified. If specified, these credentials are used instead of environment variables.

Create a schema from a feature set

feature_set = project.feature_sets.get("example")
schema = Schema.create_from(feature_set)

Create a derived schema from a parent feature set with applied transformation

A derived schema can be created from an existing feature set by applying a selected transformation. To see all supported transformations, see Supported derived transformation.

import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")

schema = client.extract_derived_schema([parent_feature_set], spark_pipeline_transformation)

Load schema from a feature set

You can also load a schema from an existing feature set:

schema = feature_set.schema.get()

Create a new schema by changing the data type of the current schema

from featurestore.core.data_types import STRING
schema["col"].data_type = STRING
# nested columns
schema["col1"].schema["col2"].data_type = STRING

Create a new schema by column selection

selected_schema = schema.select(features)  # keep only the given features
remaining_schema = schema.exclude(features)  # drop the given features
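The intended effect of select and exclude can be sketched on a plain list of column names. This is a stand-in that runs without a Feature Store backend. It assumes that select keeps only the named features, that exclude drops them, and that both preserve the original column order and leave the original schema unchanged; none of these details are confirmed by the text above, and select_columns and exclude_columns are hypothetical names.

```python
from typing import Iterable, List


def select_columns(columns: List[str], chosen: Iterable[str]) -> List[str]:
    """Return a new list with only the chosen columns, in original order."""
    chosen = set(chosen)
    return [c for c in columns if c in chosen]


def exclude_columns(columns: List[str], dropped: Iterable[str]) -> List[str]:
    """Return a new list without the dropped columns, in original order."""
    dropped = set(dropped)
    return [c for c in columns if c not in dropped]


columns = ["id", "text", "label", "state"]
print(select_columns(columns, ["label", "id"]))
print(exclude_columns(columns, ["label", "id"]))
```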

Create a new schema by adding a new feature schema

from featurestore.core.data_types import STRING
from featurestore import FeatureSchema
new_feature_schema = FeatureSchema("new_name", STRING)
# Append
new_schema = schema.append(new_feature_schema) # Append to the end
new_schema = schema.append(new_feature_schema, schema["old"]) # Append after old
# Prepend
new_schema = schema.prepend(new_feature_schema) # Prepend to the beginning
new_schema = schema.prepend(new_feature_schema, schema["old"]) # Prepend before old
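The positions described in the comments above (end, after old, beginning, before old) can be sketched with plain lists. This is only a model of the ordering and assumes the original schema is left untouched; append_after and prepend_before are hypothetical helpers, not Feature Store functions.

```python
from typing import List, Optional


def append_after(columns: List[str], new: str, after: Optional[str] = None) -> List[str]:
    """Return a copy with `new` at the end, or right after `after` if given."""
    result = list(columns)
    index = len(result) if after is None else result.index(after) + 1
    result.insert(index, new)
    return result


def prepend_before(columns: List[str], new: str, before: Optional[str] = None) -> List[str]:
    """Return a copy with `new` at the start, or right before `before` if given."""
    result = list(columns)
    index = 0 if before is None else result.index(before)
    result.insert(index, new)
    return result


columns = ["a", "old", "b"]
print(append_after(columns, "new_name"))
print(append_after(columns, "new_name", "old"))
print(prepend_before(columns, "new_name"))
print(prepend_before(columns, "new_name", "old"))
```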

Modify special data on a schema

schema["col1"].special_data.sensitive = True
schema["col2"].special_data.spi = True
# Nested feature modification
schema["col3"].schema["col4"].special_data.pci = True
note

The special data fields available on a feature schema are spi, pci, rpi, demographic and sensitive. These are boolean fields and can be set to True or False.
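When several flags need to be set at once, a small wrapper can guard against misspelled field names and non-boolean values. The sketch below is a hypothetical helper, not part of the Feature Store client. It uses only attribute assignment, so it should work with any object that exposes the five fields as attributes; a SimpleNamespace stands in for schema["col1"].special_data so that the example runs without a backend.

```python
from types import SimpleNamespace

# Field names as listed in the note above.
SPECIAL_DATA_FIELDS = ("spi", "pci", "rpi", "demographic", "sensitive")


def set_special_data(special_data, **flags):
    """Set several special data flags, rejecting unknown names and non-booleans."""
    # Validate everything first so a bad argument leaves the object unchanged.
    for name, value in flags.items():
        if name not in SPECIAL_DATA_FIELDS:
            raise ValueError(f"unknown special data field: {name}")
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be True or False, got {value!r}")
    for name, value in flags.items():
        setattr(special_data, name, value)


# Stand-in object (assumption); in real use this would be schema["col1"].special_data.
special_data = SimpleNamespace(
    spi=False, pci=False, rpi=False, demographic=False, sensitive=False
)
set_special_data(special_data, sensitive=True, pci=True)
print(special_data)
```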

Modify feature type

from featurestore.core.entities.feature import AUTOMATIC_DISCOVERY, NUMERICAL, TEXT
schema["col1"].feature_type = NUMERICAL
schema["col2"].feature_type = AUTOMATIC_DISCOVERY
# Nested feature modification
schema["col3"].schema["col4"].feature_type = TEXT

AUTOMATIC_DISCOVERY means that the backend determines the feature type automatically from the feature's data type. It is the default feature type for every feature in a schema.

Set feature description

It is also possible to provide a description for a feature schema. This description is propagated to the feature.

schema["col1"].description = "The best feature"

Set feature classifier

Features in a feature set can be tagged by a classifier from a predefined list. The classifier on the feature denotes the type of data stored in the feature.

client.classifiers.list()  # this returns all configured classifiers on the backend
schema["col1"].classifier = "emailId"

Save schema as string

A schema can be serialized to string format:

str_schema = schema.to_string()
