Import Dataset with Databricks Connector¶
Note: The Databricks connector is supported in Driverless AI 2.0.0 and later.
First, we'll initialize a client with our server credentials and store it in the variable dai.
In [1]:
Copied!
import driverlessai
dai = driverlessai.Client(address='http://localhost:12345', username="py", password="py")
import driverlessai
dai = driverlessai.Client(address='http://localhost:12345', username="py", password="py")
You can verify that the Databricks connector is enabled.
In [2]:
Copied!
dai.connectors.list()
dai.connectors.list()
Out[2]:
['upload', 'file', 'hdfs', 's3', 'recipe_file', 'recipe_url', 'databricks']
Driverless AI considers the Databricks connector an advanced connector. Thus, the create method requires a data_source_config argument to use it.
Use Catalog and Schema specified in the query¶
In this approach, you specify the full table path (catalog.schema.table) directly in the SQL query.
In [ ]:
Copied!
dataset = dai.datasets.create(
data='SELECT * FROM my_catalog.my_schema.my_table',
data_source='databricks',
data_source_config=dict(
databricks_warehouse_id='/sql/1.0/warehouses/your_warehouse_id',
databricks_workspace_instance_name='your_workspace.azuredatabricks.net',
databricks_personal_access_token='<your-databricks-pat>',
),
name='my-databricks-dataset',
force=True,
)
dataset.head()
dataset = dai.datasets.create(
data='SELECT * FROM my_catalog.my_schema.my_table',
data_source='databricks',
data_source_config=dict(
databricks_warehouse_id='/sql/1.0/warehouses/your_warehouse_id',
databricks_workspace_instance_name='your_workspace.azuredatabricks.net',
databricks_personal_access_token='',
),
name='my-databricks-dataset',
force=True,
)
dataset.head()
Complete 100.00% - [4/4] Computed stats for column tweet_id
Out[ ]:
| _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | airline_sentiment | airline_sentiment:confidence | negativereason | negativereason:confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6.8168e+08 | False | finalized | 3 | 2/25/15 19:12 | positive | 0.6401 | American | HybridMovementC | 0 | @AmericanAir Mad love http://t.co/4ojrSDWPkK NYC- | 2/22/15 13:05 | 5.69604e+17 | everywhere, all the time. | ||||||
| 6.81678e+08 | False | finalized | 3 | 2/25/15 18:49 | negative | 1 | Lost Luggage | 0.708 | American | j_baucom | 0 | @AmericanAir After being patient about my "delayed" bags for 5 weeks, I was told that it could take another 3 to 5 weeks. Unacceptable! | 2/23/15 12:39 | 5.6996e+17 | Atlantic Time (Canada) | ||||
| 6.81461e+08 | False | finalized | 3 | 2/25/15 1:51 | negative | 0.6837 | Late Flight | 0.3585 | US Airways | nhbonedoc | 0 | @USAirways don't worry. Used own initiative and arriving much earlier than you planned for me | 2/23/15 10:37 | 5.69929e+17 | |||||
| 6.81456e+08 | False | finalized | 3 | 2/25/15 5:48 | negative | 0.6636 | Bad Flight | 0.6636 | Southwest | hwulczyn | 0 | @SouthwestAir lets chat about flights to and from BWI and the Carolina area... Can a girl get break? I've had to resort to USAir _Ù÷Ð 2 a day?! | 2/19/15 16:11 | 5.68564e+17 | Eastern Time (US & Canada) | ||||
| 6.81457e+08 | False | finalized | 3 | 2/25/15 0:43 | neutral | 1 | Southwest | cassidychiarito | 0 | @SouthwestAir when will you have flights for October 2015 available? | [41.28597403, -72.44456912] | 2/18/15 5:42 | 5.68043e+17 | Atlantic Time (Canada) |
Use user defined Catalog and Schema¶
Here you manually specify the catalog and schema via data_source_config argument.
In [ ]:
Copied!
dataset_with_catalog = dai.datasets.create(
data='SELECT * FROM my_table',
data_source='databricks',
data_source_config=dict(
databricks_warehouse_id='/sql/1.0/warehouses/your_warehouse_id',
databricks_workspace_instance_name='your_workspace.azuredatabricks.net',
databricks_personal_access_token='<your-databricks-pat>',
databricks_catalog='my_catalog',
databricks_schema='my_schema',
),
name='my-databricks-catalog-dataset',
force=True,
)
dataset_with_catalog.head()
dataset_with_catalog = dai.datasets.create(
data='SELECT * FROM my_table',
data_source='databricks',
data_source_config=dict(
databricks_warehouse_id='/sql/1.0/warehouses/your_warehouse_id',
databricks_workspace_instance_name='your_workspace.azuredatabricks.net',
databricks_personal_access_token='',
databricks_catalog='my_catalog',
databricks_schema='my_schema',
),
name='my-databricks-catalog-dataset',
force=True,
)
dataset_with_catalog.head()
Complete 100.00% - [4/4] Computed stats for column tweet_id
Out[ ]:
| _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | airline_sentiment | airline_sentiment:confidence | negativereason | negativereason:confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6.8168e+08 | False | finalized | 3 | 2/25/15 19:12 | positive | 0.6401 | American | HybridMovementC | 0 | @AmericanAir Mad love http://t.co/4ojrSDWPkK NYC- | 2/22/15 13:05 | 5.69604e+17 | everywhere, all the time. | ||||||
| 6.81678e+08 | False | finalized | 3 | 2/25/15 18:49 | negative | 1 | Lost Luggage | 0.708 | American | j_baucom | 0 | @AmericanAir After being patient about my "delayed" bags for 5 weeks, I was told that it could take another 3 to 5 weeks. Unacceptable! | 2/23/15 12:39 | 5.6996e+17 | Atlantic Time (Canada) | ||||
| 6.81461e+08 | False | finalized | 3 | 2/25/15 1:51 | negative | 0.6837 | Late Flight | 0.3585 | US Airways | nhbonedoc | 0 | @USAirways don't worry. Used own initiative and arriving much earlier than you planned for me | 2/23/15 10:37 | 5.69929e+17 | |||||
| 6.81456e+08 | False | finalized | 3 | 2/25/15 5:48 | negative | 0.6636 | Bad Flight | 0.6636 | Southwest | hwulczyn | 0 | @SouthwestAir lets chat about flights to and from BWI and the Carolina area... Can a girl get break? I've had to resort to USAir _Ù÷Ð 2 a day?! | 2/19/15 16:11 | 5.68564e+17 | Eastern Time (US & Canada) | ||||
| 6.81457e+08 | False | finalized | 3 | 2/25/15 0:43 | neutral | 1 | Southwest | cassidychiarito | 0 | @SouthwestAir when will you have flights for October 2015 available? | [41.28597403, -72.44456912] | 2/18/15 5:42 | 5.68043e+17 | Atlantic Time (Canada) |