Driverless AI NLP Demo - Airline Sentiment Dataset

This notebook shows how to use the Driverless AI Python client to build a text classification model on the airline sentiment Twitter dataset.

Start by importing the required Python modules, including the Driverless AI client. If the client is not installed yet, download it from the Driverless AI GUI and install it.

The Python Client Documentation is available for reference.

[1]:

import pandas as pd
from sklearn import model_selection
import driverlessai

The first step is to connect to Driverless AI with Client. Enter your credentials and the URL of your Driverless AI instance.

[2]:

address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
dai = driverlessai.Client(address = address, username = username, password = password)
# Use the same username and password that you use to sign in through the GUI.
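
Hard-coded credentials are easy to leak when a notebook is shared. As a minimal sketch, the connection settings could be read from environment variables instead. The variable names `DAI_ADDRESS`, `DAI_USERNAME`, and `DAI_PASSWORD` and the helper `load_dai_settings` are hypothetical and are not part of the Driverless AI client.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = ("DAI_ADDRESS", "DAI_USERNAME", "DAI_PASSWORD")


def load_dai_settings(env=None):
    """Return keyword arguments for driverlessai.Client from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "address": env["DAI_ADDRESS"],
        "username": env["DAI_USERNAME"],
        "password": env["DAI_PASSWORD"],
    }


# Usage (requires a running Driverless AI server):
# dai = driverlessai.Client(**load_dai_settings())
```

The helper fails early with a list of the missing variables, which is usually easier to diagnose than a failed login.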

Upload the airline file to Driverless AI with the datasets.create command, then split the data into training and test sets.

[3]:

airlines = dai.datasets.create(data='https://h2o-public-test-data.s3.amazonaws.com/dai_release_testing/datasets/airline_sentiment_tweets.csv',
                               data_source='s3')

# 80/20 split, consistent with the row counts printed later (11712 and 2928)
ds_split = airlines.split_to_train_test(train_size=0.8,
                                        train_name='train',
                                        test_name='test')

Complete 100.00% - [4/4] Computing column statistics
Complete
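
To illustrate what a random train/test split does, here is a plain-Python sketch that runs locally. It is not how Driverless AI performs the split internally; the helper `split_rows`, its default seed, and the sample input are assumptions made for illustration.

```python
import random


def split_rows(rows, train_size, seed=1234):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0.0 < train_size < 1.0:
        raise ValueError("train_size must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same split
    n_train = int(len(shuffled) * train_size)
    return shuffled[:n_train], shuffled[n_train:]


# Eight stand-in row ids, three quarters of them assigned to training.
train_rows, test_rows = split_rows(range(8), train_size=0.75)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two lists, and a fixed seed makes the split repeatable across runs.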

Now let's look at some basic information about the dataset.

[9]:

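# Dataset handles returned by split_to_train_test
train = ds_split['train_dataset']
test = ds_split['test_dataset']

print('Train Dataset: ', train.shape)
print('Test Dataset: ', test.shape)

# Keep the column names so they can be included with the predictions later
ids = list(train.columns)
print(ids)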

Train Dataset:  (11712, 15)
Test Dataset:  (2928, 15)

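The experiment needs only two columns: text, which contains the text of the tweet, and airline_sentiment (the target column), which contains the sentiment of the tweet. The remaining columns can be dropped for this experiment.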
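
Rather than typing the list of columns to drop by hand, it can be derived from the columns to keep. The following is a minimal sketch; the helper `columns_to_drop` is hypothetical, and the column list is an illustrative subset rather than the full dataset schema.

```python
def columns_to_drop(all_columns, keep):
    """Return every column not in `keep`, preserving the original order."""
    keep = set(keep)
    missing = keep - set(all_columns)
    if missing:
        raise ValueError(f"Columns not found: {sorted(missing)}")
    return [c for c in all_columns if c not in keep]


# Illustrative subset of column names; in the notebook this would be train.columns.
example_columns = ["tweet_id", "airline_sentiment", "airline", "text", "tweet_created"]
print(columns_to_drop(example_columns, keep=["text", "airline_sentiment"]))
```

The result could then be passed as drop_columns, which avoids mismatches when a column name is mistyped.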

To use the CNN-based text features, enable the TensorFlow models and transformers through config_overrides.

[12]:

exp_preview = dai.experiments.preview(train_dataset=train,
                                      target_column='airline_sentiment',
                                      task='classification',
                                      drop_columns=["_unit_id", "_golden", "_unit_state", "_trusted_judgments", "_last_judgment_at",
                                      "airline_sentiment:confidence", "negativereason", "negativereason:confidence", "airline",
                                      "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count",
                                      "tweet_coord", "tweet_created", "tweet_id", "tweet_location", "user_timezone"],
                                      config_overrides="""
                                      enable_tensorflow='on'
                                      enable_tensorflow_charcnn='on'
                                      enable_tensorflow_textcnn='on'
                                      enable_tensorflow_textbigru='on'
                                      """)

ACCURACY [7/10]:
- Training data size: *11,712 rows, 4 cols*
- Feature evolution: *[Constant, DecisionTree, LightGBM, TensorFlow, XGBoostGBM]*, *3-fold CV**, 2 reps*
- Final pipeline: *Ensemble (6 models), 3-fold CV*

TIME [2/10]:
- Feature evolution: *8 individuals*, up to *42 iterations*
- Early stopping: After *5* iterations of no improvement

INTERPRETABILITY [8/10]:
- Feature pre-pruning strategy: Permutation Importance FS
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, CatOriginal, Cat, Frequent, Interactions, NumCatTE, Original, TextBiGRU, TextCNN, TextCharCNN, Text]

[Constant, DecisionTree, LightGBM, TensorFlow, XGBoostGBM] models to train:
- Model and feature tuning: *192*
- Feature evolution: *288*
- Final pipeline: *6*

Estimated runtime: *minutes*
Auto-click Finish/Abort if not done in: *1 day*/*7 days*
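
The config_overrides value used in the preview is a block of key='value' lines. As a sketch, such a string could be generated from a dictionary so that the settings stay easy to review. The helper `build_config_overrides` is hypothetical and handles only simple string values like the ones used in this notebook.

```python
def build_config_overrides(settings):
    """Format a dict of string settings as one key='value' line per entry."""
    lines = []
    for key, value in settings.items():
        text = str(value)
        if "'" in text or "\n" in text:
            raise ValueError(f"Unsupported value for {key!r}")
        lines.append(f"{key}='{text}'")
    return "\n".join(lines)


tensorflow_overrides = build_config_overrides({
    "enable_tensorflow": "on",
    "enable_tensorflow_charcnn": "on",
    "enable_tensorflow_textcnn": "on",
    "enable_tensorflow_textbigru": "on",
})
print(tensorflow_overrides)
```

The resulting string could be passed as config_overrides in place of the hand-written block.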

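The feature engineering search space in the preview includes the Text, TextCNN, TextCharCNN, and TextBiGRU transformers, so the text features are enabled for this experiment.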
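
This check can also be done programmatically on a copy of the preview text. The sketch below assumes the preview summary is available as a plain string (for example, pasted from the cell output); the helper `text_transformers` is hypothetical.

```python
def text_transformers(preview_text):
    """Extract transformer names starting with 'Text' from a preview summary."""
    marker = "Feature engineering search space:"
    for line in preview_text.splitlines():
        if marker in line and "[" in line and "]" in line:
            inside = line.split("[", 1)[1].rsplit("]", 1)[0]
            names = [name.strip() for name in inside.split(",")]
            return [name for name in names if name.startswith("Text")]
    return []


# Line copied from the preview output shown above.
preview_line = (
    "- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, "
    "CatOriginal, Cat, Frequent, Interactions, NumCatTE, Original, "
    "TextBiGRU, TextCNN, TextCharCNN, Text]"
)
print(text_transformers(preview_line))
```

An empty result would indicate that no text transformers are in the search space.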

Now the experiment can be started.

[13]:

model = dai.experiments.create(train_dataset=train,
                               target_column='airline_sentiment',
                               task='classification',
                               name="nlp_airline_sentiment_beta",
                               scorer='F1',
                               drop_columns=["tweet_id", "airline_sentiment_confidence", "negativereason", "negativereason_confidence", "airline", "airline_sentiment_gold", "name", "negativereason_gold", "retweet_count", "tweet_coord", "tweet_created", "tweet_location", "user_timezone", "airline_sentiment.negative", "airline_sentiment.neutral", "airline_sentiment.positive"],
                               accuracy=6,
                               time=2,
                               interpretability=5)

Experiment launched at: http://localhost:12345/#experiment?key=b971fe8a-e317-11ea-9088-0242ac110002
Complete 100.00% - Status: Complete

[14]:

print('Modeling completed for model ' + model.key)

Modeling completed for model b971fe8a-e317-11ea-9088-0242ac110002

[15]:

logs = model.log.download(dst_dir = '.', overwrite = True)
#logs = dai.datasets.download(model.log_file_path, '.')
print('Logs available at', logs)

Downloaded './h2oai_experiment_logs_b971fe8a-e317-11ea-9088-0242ac110002.zip'
Logs available at ./h2oai_experiment_logs_b971fe8a-e317-11ea-9088-0242ac110002.zip
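
The logs are downloaded as a zip archive. As a small sketch, the archive can be inspected with the standard library before extracting anything. The helper `list_log_files` is hypothetical, and the `.log` suffix filter is an assumption, since the exact contents of the archive are not shown in this notebook.

```python
import zipfile


def list_log_files(zip_path, suffix=".log"):
    """Return the sorted names of archive members that end with `suffix`."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(name for name in archive.namelist() if name.endswith(suffix))


# Usage with the path returned by model.log.download(...):
# print(list_log_files(logs))
```

Passing `suffix=""` lists every member of the archive.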

The predictions for the test set can be downloaded to the current folder.

[16]:

test_preds = model.predict(dataset = test, include_columns = ids).download(dst_dir = '.', overwrite = True)
print('Test set predictions available at', test_preds)

Complete
Downloaded './b971fe8a-e317-11ea-9088-0242ac110002_preds_9f438fac.csv'
Test set predictions available at ./b971fe8a-e317-11ea-9088-0242ac110002_preds_9f438fac.csv
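
Once the predictions file is on disk, a predicted label can be derived from the per-class probabilities. The sketch below assumes the file contains one probability column per class named `airline_sentiment.negative`, `airline_sentiment.neutral`, and `airline_sentiment.positive`, plus the original `airline_sentiment` column; this layout is an assumption and should be checked against the actual file. The sample rows are made up for illustration and are not model output.

```python
import csv
import io


def add_predicted_label(rows, target, classes):
    """Pick, for each row, the class whose '<target>.<class>' probability is highest."""
    labeled = []
    for row in rows:
        probs = {c: float(row[f"{target}.{c}"]) for c in classes}
        labeled.append({**row, "predicted": max(probs, key=probs.get)})
    return labeled


def accuracy(rows, target):
    """Fraction of rows whose predicted label equals the actual target value."""
    return sum(r["predicted"] == r[target] for r in rows) / len(rows)


# Tiny made-up sample in the assumed layout.
sample_csv = """airline_sentiment,airline_sentiment.negative,airline_sentiment.neutral,airline_sentiment.positive
negative,0.8,0.1,0.1
positive,0.2,0.3,0.5
neutral,0.6,0.3,0.1
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
labeled = add_predicted_label(rows, "airline_sentiment", ["negative", "neutral", "positive"])
print(accuracy(labeled, "airline_sentiment"))
```

To run this on the downloaded file, replace `io.StringIO(sample_csv)` with an open file handle for the path stored in test_preds.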
