
Supported data sources

Data must first be ingested into Feature Store before it can be used. Ingesting is the act of uploading data into Feature Store.

Feature Store supports reading data from the following protocols:

  • s3 (internally reuses the s3a client)
  • s3a
  • wasbs (encrypted) and wasb (legacy)
  • abfss (encrypted) and abfs (legacy)
  • http/https (data gets uploaded to internal storage)
  • drive (to read files from H2O Drive)
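
For illustration, the same source classes described below accept any of these schemes in their path arguments. The following is a minimal sketch; the bucket, container, and file names are placeholders:

from featurestore import CSVFile, ParquetFile

# Placeholder locations; each scheme maps to the corresponding storage backend
s3_source = CSVFile(path="s3a://my-bucket/raw/events.csv", delimiter=",")
azure_source = ParquetFile(path="abfss://container@account.dfs.core.windows.net/raw/events.parquet")
http_source = CSVFile(path="https://example.com/datasets/events.csv", delimiter=",")
drive_source = CSVFile(path="drive://uploads/events.csv", delimiter=",")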

CSV

CSV file format. Supported path locations are S3 bucket, Azure Blob Storage, HTTP/HTTPS URL and H2O Drive.

User API:

Parameters:

  • path: String - path to csv file
  • delimiter: String - values delimiter
source = CSVFile(path=..., delimiter=...)
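
For example, a hedged sketch of reading a semicolon-delimited CSV from S3 (bucket and file names are placeholders):

from featurestore import CSVFile

# Hypothetical semicolon-delimited file stored in S3
source = CSVFile(path="s3a://my-bucket/raw/users.csv", delimiter=";")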

CSV folder

CSV Folder source. Supported path locations are S3 bucket and Azure Blob Storage.

User API:

Parameters:

  • root_folder: String - path to the root folder

  • delimiter: String - values delimiter

  • filter_pattern: String - Pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern.

    • For example: filter_pattern="data/.*/.*/.*comp/.*" will match this file "data/1996-03-03/1/1679-comp/hello.json".
source = CSVFolder(root_folder=..., delimiter=..., filter_pattern=...)
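
As an illustration, the following sketch (with a placeholder bucket and the folder layout from the example above) reads every CSV file matching the pattern:

from featurestore import CSVFolder

# Placeholder bucket; the pattern matches files such as data/1996-03-03/1/1679-comp/report.csv
source = CSVFolder(
    root_folder="s3a://my-bucket/exports",
    delimiter=",",
    filter_pattern="data/.*/.*/.*comp/.*",
)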

Parquet

Parquet file format. Supported path locations are S3 bucket, Azure Blob Storage, HTTP/HTTPS URL and H2O Drive.

User API:

Parameters:

  • path: String - path to parquet file
source = ParquetFile(path=...)
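
For example (placeholder container and file names), a Parquet file in Azure Blob Storage could be referenced as:

from featurestore import ParquetFile

# Hypothetical Parquet file stored in Azure Blob Storage
source = ParquetFile(path="wasbs://container@account.blob.core.windows.net/data/transactions.parquet")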

Parquet folder

Parquet folder source. Supported path locations are S3 bucket and Azure Blob Storage.

User API:

Parameters:

  • root_folder: String - path to the root folder

  • filter_pattern: String - Pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern.

    • For example: filter_pattern="data/.*/.*/.*comp/.*" will match this file "data/1996-03-03/1/1679-comp/hello.json".
source = ParquetFolder(root_folder=..., filter_pattern=...)
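
A short sketch with a placeholder bucket, matching any .parquet file two levels below the root folder:

from featurestore import ParquetFolder

# Matches e.g. 2023-01-01/part-0001.parquet under the root folder
source = ParquetFolder(
    root_folder="s3a://my-bucket/exports",
    filter_pattern=r".*/.*\.parquet",
)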

JSON

JSON file format. Supported path locations are S3 bucket, Azure Blob Storage, HTTP/HTTPS URL and H2O Drive. Several JSON layouts are supported (see the multiline parameter below). By default, multiline is set to False.

User API:

Parameters:

  • path: String - path to JSON file
  • multiline: Boolean - True if each JSON entry spans multiple lines, otherwise False.
source = JSONFile(path=..., multiline=...)
note

Please keep in mind that a JSON object is an unordered set of name/value pairs. This means that using JSON files for extracting schema can produce a schema with a different order of features than that used in the file.
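
For example, a sketch of reading a hypothetical multiline JSON file over HTTPS (the URL is a placeholder):

from featurestore import JSONFile

# Each record in this hypothetical file spans multiple lines
source = JSONFile(path="https://example.com/datasets/products.json", multiline=True)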

JSON folder

JSON folder source. Supported path locations are S3 bucket and Azure Blob Storage.

User API:

Parameters:

  • root_folder: String - path to the root folder

  • multiline: Boolean - True if each JSON entry spans multiple lines, otherwise False.

  • filter_pattern: String - Pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern.

    • For example: filter_pattern="data/.*/.*/.*comp/.*" will match this file "data/1996-03-03/1/1679-comp/hello.json".
source = JSONFolder(root_folder=..., multiline=..., filter_pattern=...)

Please keep in mind that a JSON object is an unordered set of name/value pairs. This means that using JSON files for extracting schema can produce a schema with a different order of features than that used in the file.
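
A sketch with a placeholder Azure container, reading newline-delimited JSON files that match the example pattern:

from featurestore import JSONFolder

# Hypothetical container; one JSON record per line, so multiline=False
source = JSONFolder(
    root_folder="abfss://container@account.dfs.core.windows.net/logs",
    multiline=False,
    filter_pattern="data/.*/.*/.*comp/.*",
)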

MongoDB

Data stored in MongoDB can be accessed by Feature Store as well. For MongoDB authentication, the following environment variables are used to provide the user credentials:

  • MONGODB_USER
  • MONGODB_PASSWORD

User API:

Parameters:

  • connection_uri: String - a MongoDB server URI
    • E.g. connection_uri="mongodb+srv://my_cluster.mongodb.net/test"
  • database: String - Name of a database on the server
    • E.g. database="sample_guides"
  • collection: String - Name of a collection to read the data from
    • E.g. collection="planets"
source = MongoDbCollection(connection_uri=..., database=..., collection=...)
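
Putting the parameter examples above together, a sketch might look as follows; the credentials are placeholders and would normally be set outside the script:

import os
from featurestore import MongoDbCollection

# Credentials are read from the environment, not passed through the API
os.environ["MONGODB_USER"] = "reader"        # placeholder
os.environ["MONGODB_PASSWORD"] = "secret"    # placeholder

source = MongoDbCollection(
    connection_uri="mongodb+srv://my_cluster.mongodb.net/test",
    database="sample_guides",
    collection="planets",
)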

Delta table

Delta table format. The table can be stored in either S3 or Azure Blob Storage.

User API:

Parameters:

  • path: String - path to delta table
  • version: Int - (Optional) - version of the delta table
  • timestamp: String - (Optional) - timestamp of the data in the table
  • filter: DeltaTableFilter - (Optional) - Filter on the delta table
source = DeltaTable(path=..., version=..., timestamp=..., filter=...)

DeltaTableFilter API:

Parameters:

  • column: String - name of the column
  • operator: String - operator to be applied
  • value: String|Double|Boolean - value to be applied on the filter
delta_filter = DeltaTableFilter(column=..., operator=..., value=...)

Supported operators

The following operators are supported: ==, <, >, <=, and >=.

Valid parameter combinations

  1. Path
  2. Path, Version
  3. Path, Version, Filter
  4. Path, Timestamp
  5. Path, Timestamp, Filter
  6. Path, Filter
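
As an illustration of these combinations, the sketch below uses a placeholder S3 path; the timestamp format and the value types are assumptions, not confirmed by this page:

from featurestore import DeltaTable, DeltaTableFilter

path = "s3a://my-bucket/delta/transactions"   # placeholder path

latest = DeltaTable(path=path)                # 1. Path
pinned = DeltaTable(path=path, version=5)     # 2. Path, Version
filtered = DeltaTable(                        # 5. Path, Timestamp, Filter
    path=path,
    timestamp="2023-01-01 00:00:00",          # assumed timestamp format
    filter=DeltaTableFilter(column="amount", operator=">", value=100.0),
)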

JDBC

JDBC table format. Currently, we support the following JDBC connections:

  • PostgreSQL
  • Teradata

User API:

Parameters:

  • connection_url: String - connection string including the database name

  • table: String - table to fetch data from

  • query: String - query to fetch data from

  • partition_options: PartitionOptions - (Optional) parameters to enable parallel execution. These are applicable only when table is specified

    • PartitionOptions consists of: num_partitions, partition_column, lower_bound, upper_bound, and fetch_size
source = JdbcTable(connection_url=..., table=..., partition_options=PartitionOptions(num_partitions = ..., partition_column = ..., lower_bound = ..., upper_bound = ..., fetch_size=...))
source = JdbcTable(connection_url=..., query=...)

The format of the connection URL is a standard JDBC connection string, such as:

  • For Teradata, jdbc:teradata://host:port/database
  • For PostgreSQL, jdbc:postgresql://host:port/database

The database is a mandatory part of the connection string in the case of Feature Store. Note that only one of table or query can be specified at a time. Additionally, PartitionOptions can only be specified together with table, and if any of the partition options is set, all of them must be set. They describe how to partition the table when reading in parallel from multiple workers: partition_column must be a numeric, date, or timestamp column of the table in question, and lower_bound and upper_bound are only used to decide the partition stride, not to filter the rows of the table. All rows in the table are partitioned and returned. These options apply only to reading.
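
As a sketch (hostnames, database, table, and column names are placeholders, and the value types for the bounds are an assumption), a partitioned PostgreSQL read and a query-based read might look like this:

from featurestore import JdbcTable, PartitionOptions

# Table read split across 8 parallel partitions on a numeric column
source = JdbcTable(
    connection_url="jdbc:postgresql://db.example.com:5432/warehouse",
    table="public.orders",
    partition_options=PartitionOptions(
        num_partitions=8,
        partition_column="order_id",   # numeric, date, or timestamp column
        lower_bound=1,                 # used only to compute the stride
        upper_bound=1000000,           # not a row filter
        fetch_size=10000,
    ),
)

# Query-based read; partition options are not allowed here
recent = JdbcTable(
    connection_url="jdbc:postgresql://db.example.com:5432/warehouse",
    query="SELECT * FROM public.orders WHERE created_at > '2023-01-01'",
)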

Snowflake table

Extract data from Snowflake tables or queries.

User API:

Parameters:

  • table: String - table to fetch data from
  • database: String - Snowflake database
  • url: String - url to Snowflake instance
  • query: String - query to fetch data from
  • warehouse: String - Snowflake warehouse
  • schema: String - Snowflake schema
  • insecure: Boolean - if True, Snowflake will not perform SSL verification
  • proxy: Proxy object - proxy specification
  • role: String - Snowflake role
  • account: String - Snowflake account name
note

table and query parameters cannot be configured simultaneously.

from featurestore import *
proxy = Proxy(host=..., port=..., user=..., password=...)
source = SnowflakeTable(table=..., database=..., url=..., query=..., warehouse=..., schema=..., insecure=...,
proxy=..., role=..., account=...)
note

A proxy is an optional argument in the Snowflake data source API. If a proxy is not being used, the proxy configuration can simply be set to None.

The use of a proxy is possible for users only if the proxy feature is enabled by the administrator of the Snowflake account. Therefore, it is important to confirm whether proxy support is enabled before attempting to configure a proxy in the Snowflake data source API.
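
For illustration, a sketch without a proxy; all names are placeholders, and omitting the query argument when table is given is an assumption about the client defaults:

from featurestore import SnowflakeTable

source = SnowflakeTable(
    table="TRANSACTIONS",              # either table or query, never both
    database="ANALYTICS",
    url="https://myaccount.snowflakecomputing.com",
    warehouse="COMPUTE_WH",
    schema="PUBLIC",
    insecure=False,
    proxy=None,                        # no proxy in use
    role="ANALYST",
    account="myaccount",
)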

Snowflake Cursor object

Extract data from Snowflake tables or queries.

User API:

The Snowflake Cursor object is currently only supported in the Python client.

Parameters:

  • database: String - Snowflake database
  • url: String - url to Snowflake instance
  • warehouse: String - Snowflake warehouse
  • schema: String - Snowflake schema
  • snowflake_cursor: Object - Snowflake cursor
  • insecure: Boolean - if True, Snowflake will not perform SSL verification
  • proxy: Proxy object - proxy specification
  • role: String - Snowflake role
  • account: String - Snowflake account name
source = SnowflakeCursor(database=..., url=..., warehouse=..., schema=..., snowflake_cursor=..., insecure=...,
proxy=..., role=..., account=...)

Database snippet:

Internally, the Snowflake Cursor is converted to SnowflakeTable with query and is therefore saved in the same format in the database.
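
A sketch of passing a cursor created with the snowflake-connector-python package; the connection details are placeholders, and whether the cursor must already have an executed query is an assumption based on the conversion described above:

import snowflake.connector
from featurestore import SnowflakeCursor

connection = snowflake.connector.connect(
    user="ANALYST", password="secret", account="myaccount",   # placeholders
)
cursor = connection.cursor()
cursor.execute("SELECT * FROM ANALYTICS.PUBLIC.TRANSACTIONS")  # assumed: the cursor carries the query

source = SnowflakeCursor(
    database="ANALYTICS",
    url="https://myaccount.snowflakecomputing.com",
    warehouse="COMPUTE_WH",
    schema="PUBLIC",
    snowflake_cursor=cursor,
    insecure=False,
    proxy=None,
    role="ANALYST",
    account="myaccount",
)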

Spark Data Frame

When using Spark Data Frame as the source, several conditions must be met first. Read about the Spark dependencies to understand these requirements.

User API:

Parameters:

  • dataframe: DataFrame - Spark Data Frame instance
source = SparkDataFrame(dataframe=...)
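
A minimal sketch, assuming a local Spark session and a toy DataFrame (column names are placeholders):

from pyspark.sql import SparkSession
from featurestore import SparkDataFrame

# Build a small in-memory DataFrame to ingest
spark = SparkSession.builder.appName("feature-store-ingest").getOrCreate()
dataframe = spark.createDataFrame([(1, "alice"), (2, "bob")], schema=["id", "name"])

source = SparkDataFrame(dataframe=dataframe)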

Accessing H2O Drive Data

When the H2O Drive application is running in the same cloud environment as Feature Store, users can access the files they have uploaded to H2O Drive. To refer to those files, specify the scheme as drive. However, due to technical limitations, access to H2O Drive files is currently not possible when the user is authenticated to Feature Store via a PAT token.

Examples

source_1 = CSVFile("drive://example-file-1.csv")
source_2 = CSVFile("drive://my-subdirectory/example-file-2.csv")
