Schema API
A schema is extracted from a data source and describes the features of a feature set.
Creating the schema
Python
- create_from is available on the Schema class and creates a schema instance from a string-formatted schema.
- create_derived_from is available on the Schema class and creates a derived schema instance from a string-formatted schema and a parent feature set, together with a transformation.
- to_string is available on a schema instance and serializes the schema object to string format.

Scala
- createFrom is available on the Schema class and creates a schema instance from a string-formatted schema.
- createDerivedFrom is available on the Schema class and creates a derived schema instance from a string-formatted schema and a parent feature set, together with a transformation.
- toString is available on a schema instance and serializes the schema object to string format.
Usage
Create a schema from a string
A schema can be created from a string format:
Python

from featurestore import Schema

schema_str = "col1 string, col2 string, col3 integer"
schema = Schema.create_from(schema_str)

Scala

import ai.h2o.featurestore.core.Schema

val schemaStr = "col1 string, col2 string, col3 integer"
val schema = Schema.createFrom(schemaStr)
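As an illustration only (this is not the Feature Store implementation), the string format above is a comma-separated list of "name type" pairs, which could be parsed in plain Python along these lines:

```python
# Illustrative sketch only: split a "name type" comma-separated schema
# string into (name, type) pairs. Nested types that contain commas
# (e.g. STRUCT<a INT, b INT>) would need a real parser.
def parse_schema_string(schema_str):
    columns = []
    for part in schema_str.split(","):
        name, col_type = part.strip().split(None, 1)
        columns.append((name, col_type))
    return columns

print(parse_schema_string("col1 string, col2 string, col3 integer"))
# [('col1', 'string'), ('col2', 'string'), ('col3', 'integer')]
```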
Create a derived schema from a string
Python

from featurestore import Schema
import featurestore.transformations as t

spark_pipeline_transformation = t.SparkPipeline("...")
schema_str = "id INT, text STRING, label DOUBLE, state STRING, date STRING, words ARRAY<STRING>"
schema = Schema.create_derived_from(schema_str, [parent_feature_set], spark_pipeline_transformation)

Scala

import ai.h2o.featurestore.core.Schema
import ai.h2o.featurestore.core.transformations.SparkPipeline

val sparkPipelineTransformation = SparkPipeline("...")
val schemaStr = "id INT, text STRING, label DOUBLE, state STRING, date STRING, words ARRAY<STRING>"
val schema = Schema.createDerivedFrom(schemaStr, Seq(parentFeatureSet), sparkPipelineTransformation)
Create a schema from a data source
A schema can also be created from a data source. To see all supported data sources, see Supported data sources.
Python

schema = client.extract_schema_from_source(source)
schema = client.extract_schema_from_source(source, credentials)

Scala

val schema = client.extractSchemaFromSource(source)
val schema = client.extractSchemaFromSource(source, credentials)
An optional credentials parameter can be specified. If provided, these credentials are used instead of environment variables.
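The precedence described above can be sketched in plain Python; the function and environment variable names here are illustrative assumptions, not the actual Feature Store logic:

```python
import os

# Illustrative sketch only: explicitly passed credentials win over
# environment variables. The real client resolves credentials internally,
# and the variable names below are hypothetical.
def resolve_credentials(explicit=None):
    if explicit is not None:
        return explicit
    return {
        "access_key": os.environ.get("SOURCE_ACCESS_KEY"),
        "secret_key": os.environ.get("SOURCE_SECRET_KEY"),
    }
```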
Create a schema from a feature set
Python

feature_set = project.feature_sets.get("example")
schema = Schema.create_from(feature_set)

Scala

val featureSet = project.featureSets.get("example")
val schema = Schema.createFrom(featureSet)
Create a derived schema from a parent feature set with applied transformation
A derived schema can be created from an existing feature set using a selected transformation. To see all supported transformations, see Supported derived transformations.
Python

import featurestore.transformations as t

spark_pipeline_transformation = t.SparkPipeline("...")
schema = client.extract_derived_schema([parent_feature_set], spark_pipeline_transformation)

Scala

import ai.h2o.featurestore.core.transformations.SparkPipeline

val sparkPipelineTransformation = SparkPipeline("...")
val schema = client.extractDerivedSchema(Seq(parentFeatureSet), sparkPipelineTransformation)
Load schema from a feature set
You can also load a schema from an existing feature set:
Python

schema = feature_set.schema.get()

Scala

val schema = featureSet.schema.get()
Create a new schema by changing the data type of the current schema
Python

from featurestore.core.data_types import STRING

schema["col"].data_type = STRING
# nested columns
schema["col1"].schema["col2"].data_type = STRING

Scala

import ai.h2o.featurestore.core.DataTypes.STRING

schema("col").dataType = STRING
// nested columns
schema("col1").schema("col2").dataType = STRING
Create a new schema by column selection
Python

schema.select(features)
schema.exclude(features)

Scala

schema.select(features)
schema.exclude(features)
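To make the select/exclude semantics concrete, here is an illustrative plain-Python analogue operating on a simple list of (name, type) pairs; it is not the library's implementation:

```python
# Illustrative sketch only: select keeps the named columns,
# exclude drops them; both produce a new column list.
def select(columns, names):
    wanted = set(names)
    return [c for c in columns if c[0] in wanted]

def exclude(columns, names):
    unwanted = set(names)
    return [c for c in columns if c[0] not in unwanted]

cols = [("col1", "string"), ("col2", "string"), ("col3", "integer")]
print(select(cols, ["col1", "col3"]))  # [('col1', 'string'), ('col3', 'integer')]
print(exclude(cols, ["col2"]))         # [('col1', 'string'), ('col3', 'integer')]
```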
Create a new schema by adding a new feature schema
Python

from featurestore.core.data_types import STRING
from featurestore import FeatureSchema

new_feature_schema = FeatureSchema("new_name", STRING)
# Append
schema.append(new_feature_schema) # Append to the end
schema.append(new_feature_schema, schema["old"]) # Append after old
# Prepend
new_schema = schema.prepend(new_feature_schema) # Prepend to the beginning
new_schema = schema.prepend(new_feature_schema, schema["old"]) # Prepend before old

Scala

import ai.h2o.featurestore.core.DataTypes.STRING
import ai.h2o.featurestore.core.FeatureSchema

val newFeatureSchema = FeatureSchema("new_name", STRING)
// Append
schema.append(newFeatureSchema) // Append to the end
schema.append(newFeatureSchema, schema("old")) // Append after old
// Prepend
schema.prepend(newFeatureSchema) // Prepend to the beginning
schema.prepend(newFeatureSchema, schema("old")) // Prepend before old
Modify special data on a schema
Python

schema["col1"].special_data.sensitive = True
schema["col2"].special_data.spi = True
# Nested feature modification
schema["col3"].schema["col4"].special_data.pci = True

Scala

schema("col1").specialData.sensitive = true
schema("col2").specialData.spi = true
// Nested feature modification
schema("col3").schema("col4").specialData.pci = true
The special data fields available on the Schema object are spi, pci, rpi, demographic, and sensitive. These are boolean fields and can be set to true or false.
Modify feature type
Python

from featurestore.core.entities.feature import *

schema["col1"].feature_type = NUMERICAL
schema["col2"].feature_type = AUTOMATIC_DISCOVERY
# Nested feature modification
schema["col3"].schema["col4"].feature_type = TEXT

Scala

import ai.h2o.featurestore.core.entities.Feature._

schema("col1").featureType = NUMERICAL
schema("col2").featureType = AUTOMATIC_DISCOVERY
// Nested feature modification
schema("col3").schema("col4").featureType = TEXT
AUTOMATIC_DISCOVERY means that the feature type is determined automatically on the backend, based on the feature's data type. AUTOMATIC_DISCOVERY is the default value for all feature types in a schema.
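As a rough sketch of the idea (the actual backend rules are not documented here, so this mapping is an assumption for illustration only), automatic discovery derives a feature type from the data type:

```python
# Illustrative sketch only: one possible data-type -> feature-type
# mapping. The real backend logic may differ.
def discover_feature_type(data_type):
    numeric_types = {"integer", "long", "float", "double"}
    dt = data_type.lower()
    if dt in numeric_types:
        return "NUMERICAL"
    if dt == "string":
        return "TEXT"
    return "CATEGORICAL"

print(discover_feature_type("double"))  # NUMERICAL
```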
Set feature description
It is also possible to provide a description for a feature schema. This description is propagated to the feature.
Python

schema["col1"].description = "The best feature"

Scala

schema("col1").description = "The best feature"
Set feature classifier
Features in a feature set can be tagged with a classifier from a predefined list. The classifier on a feature denotes the type of data stored in that feature.
Python

client.classifiers.list() # this returns all configured classifiers on the backend
schema["col1"].classifier = "emailId"

Scala

client.classifiers.list() // this returns all configured classifiers on the backend
schema("col1").classifier = "emailId"
Save schema as string
A schema can be serialized to string format:
Python

str_schema = schema.to_string()

Scala

val strSchema = schema.toString()