Import Dataset with Hive Connector
First, we'll initialize a client with our server credentials and store it in the variable dai.
In [22]:
import driverlessai
dai = driverlessai.Client(address='http://localhost:12345', username="py", password="py")
We can check that the Hive connector has been enabled on the Driverless AI server.
In [23]:
dai.connectors.list()
Out[23]:
['upload', 'file', 'hdfs', 's3', 'recipe_file', 'recipe_url', 'hive']
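If you script against more than one server, a small guard can fail fast when the connector is missing. This is a minimal sketch built only on the list call above; the contents depend entirely on how your server is configured:

if "hive" not in dai.connectors.list():
    raise RuntimeError("The Hive connector is not enabled on this Driverless AI server.")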
The Hive connector is considered an advanced connector, so the create methods require a data_source_config argument, as both examples below demonstrate.
User-Defined Hive Configuration
Here we manually specify the Hive configurations directory, authentication type, keytab file path, and Kerberos principal.
In [24]:
dataset_from_hive = dai.datasets.create(
    data_source="hive",
    name="From Hive user defined config",
    data="SELECT * FROM AirlinesTest WHERE distance > 0",
    data_source_config=dict(
        hive_conf_path="/opt/hive/current/conf",
        hive_auth_type="keytab",
        hive_keytab_path="/opt/hive/current/hive.keytab",
        hive_principal_user="kadmin/admin@KDC.LOCAL",
    ),
    force=True,
)
dataset_from_hive.head()
Complete 100.00% - [4/4] Computed stats for column isdepdelayed_rec
Out[24]:
| fyear | fmonth | fdayofmonth | fdayofweek | deptime | arrtime | uniquecarrier | origin | dest | distance | isdepdelayed | isdepdelayed_rec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| "f1987" | "f10" | "f15" | "f4" | 729 | 903 | "PS" | "SAN" | "SFO" | 447 | "NO" | -1 |
| "f1987" | "f10" | "f17" | "f6" | 741 | 918 | "PS" | "SAN" | "SFO" | 447 | "YES" | 1 |
| "f1987" | "f10" | "f22" | "f4" | 728 | 852 | "PS" | "SAN" | "SFO" | 447 | "NO" | -1 |
| "f1987" | "f10" | "f24" | "f6" | 929 | 1052 | "PS" | "SFO" | "RNO" | 192 | "YES" | 1 |
| "f1987" | "f10" | "f6" | "f2" | 1505 | 1607 | "PS" | "BUR" | "OAK" | 325 | "NO" | -1 |
Predefined Hive Configuration
Here we use a predefined configuration that was set up on the Driverless AI server. We only need to specify the configuration name along with the authentication type.
In [25]:
dataset_from_hive = dai.datasets.create(
    data_source="hive",
    name="From Hive pre-defined config",
    data="SELECT * FROM AirlinesTest WHERE distance > 0",
    data_source_config=dict(
        hive_default_config="kerberized",
        hive_auth_type="keytab",
    ),
    force=True,
)
dataset_from_hive.head()
Complete 100.00% - [4/4] Computed stats for column isdepdelayed_rec
Out[25]:
| fyear | fmonth | fdayofmonth | fdayofweek | deptime | arrtime | uniquecarrier | origin | dest | distance | isdepdelayed | isdepdelayed_rec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| "f1987" | "f10" | "f15" | "f4" | 729 | 903 | "PS" | "SAN" | "SFO" | 447 | "NO" | -1 |
| "f1987" | "f10" | "f17" | "f6" | 741 | 918 | "PS" | "SAN" | "SFO" | 447 | "YES" | 1 |
| "f1987" | "f10" | "f22" | "f4" | 728 | 852 | "PS" | "SAN" | "SFO" | 447 | "NO" | -1 |
| "f1987" | "f10" | "f24" | "f6" | 929 | 1052 | "PS" | "SFO" | "RNO" | 192 | "YES" | 1 |
| "f1987" | "f10" | "f6" | "f2" | 1505 | 1607 | "PS" | "BUR" | "OAK" | 325 | "NO" | -1 |