
Task 2: Explore datasets

Overview

The synthesized dataset includes historical data for 45 Walmart stores located in different regions of the United States from February 5, 2010, to November 1, 2012. Let's explore the training and test datasets.

Training dataset

Training dataset content

Let's explore the training dataset content.

  1. In the Datasets table, click walmart_tts_small_train.csv.
  2. Select DETAILS.
  3. Click DATASET ROWS.
| Column | Description |
|---|---|
| Store | This column represents the identifier for the store. |
| Dept | This column represents the identifier for the department within the store. |
| Date | This column identifies the end date of a week (in other words, the end date of the week's sales record). |
| Weekly_Sales | This column represents the weekly sales for a given department in a Walmart store. |
| MarkDown1 | This column represents anonymized data related to Walmart's promotional markdowns. Markdown data is only available after November 2011 and is not always available for all stores. Any missing value is marked with an NA. A value of -1 might indicate anonymized or unspecified markdown data. |
| MarkDown2 | This column represents anonymized data related to Walmart's promotional markdowns. Markdown data is only available after November 2011 and is not always available for all stores. Any missing value is marked with an NA. A value of -1 might indicate anonymized or unspecified markdown data. |
| MarkDown3 | This column represents anonymized data related to Walmart's promotional markdowns. Markdown data is only available after November 2011 and is not always available for all stores. Any missing value is marked with an NA. A value of -1 might indicate anonymized or unspecified markdown data. |
| MarkDown4 | This column represents anonymized data related to Walmart's promotional markdowns. Markdown data is only available after November 2011 and is not always available for all stores. Any missing value is marked with an NA. A value of -1 might indicate anonymized or unspecified markdown data. |
| MarkDown5 | This column represents anonymized data related to Walmart's promotional markdowns. Markdown data is only available after November 2011 and is not always available for all stores. Any missing value is marked with an NA. A value of -1 might indicate anonymized or unspecified markdown data. |
| IsHoliday | This column represents a binary indicator (0 or 1) where 1 indicates the week contains a major holiday. |
| sample_weight | This column represents a weight assigned to each sample, emphasizing weeks (samples) that contain a holiday. In other words, if IsHoliday is 1, the sample_weight is 5. |
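
The column descriptions above can be illustrated with a minimal sketch. It assumes that markdown values arrive as text, with NA marking a missing value, and that a non-holiday week has a weight of 1; the tutorial only states the holiday weight of 5, so `DEFAULT_WEIGHT` is an assumption. The helper names and the sample rows are hypothetical and are not taken from the real file.

```python
from typing import Optional

HOLIDAY_WEIGHT = 5  # stated in the sample_weight description
DEFAULT_WEIGHT = 1  # assumption: the non-holiday weight is not stated in the tutorial


def parse_markdown(raw: str) -> Optional[float]:
    """Return None for a missing markdown value ("NA"), otherwise the number."""
    text = raw.strip()
    if text.upper() == "NA":
        return None
    return float(text)


def sample_weight(is_holiday: int) -> int:
    """Give a holiday week (IsHoliday == 1) a larger weight than a regular week."""
    return HOLIDAY_WEIGHT if int(is_holiday) == 1 else DEFAULT_WEIGHT


# Hypothetical rows for illustration only.
rows = [
    {"Date": "2012-02-03", "MarkDown1": "NA", "IsHoliday": 0},
    {"Date": "2012-02-10", "MarkDown1": "-1", "IsHoliday": 1},
]
for row in rows:
    row["MarkDown1"] = parse_markdown(row["MarkDown1"])
    row["sample_weight"] = sample_weight(row["IsHoliday"])
```

Note that -1 is kept as a number here rather than treated as missing, because the description only says it might indicate anonymized or unspecified data.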

Training dataset structure

When structuring the training dataset for a time series model, consider the following major points:

  • Time column

    A single column that represents the time component. This should be in a recognized date format (Date). For example, 2012-05-04 (YYYY-MM-DD).

    • The training dataset should be time-sorted, meaning that the observations should be ordered by the time variable (Date) in ascending order.
    • H2O Driverless AI expects the data to be in a long format for time series experiments, where each row represents a single observation at a specific point in time. This is different from the wide format, where each row represents a group of observations at different points in time.
  • Target column

    A column that represents the variable you want to predict. This is your dependent variable (Weekly_Sales).

  • Feature column(s)

    Any other independent variable that may influence the target variable. These can include categorical or numerical features (Store, Dept, MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5, and IsHoliday).

  • Optional: Sample weight column

    A column that indicates the observation weight or row weight for each sample in your dataset (sample_weight). This weight must be numeric and have values greater than or equal to zero.

    • The sample weight column is not mandatory, but it can help during training by emphasizing specific periods (such as holidays). It is often excluded from the test dataset to ensure a fair evaluation of the model. Whether to include sample_weight depends on the goals of your modeling task and on how much significance you want the model to give holiday weeks during training. In our case, we emphasize weeks that contain a holiday because the company might want to prepare for times of high demand.
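
The structural points above (a YYYY-MM-DD time column, ascending time order, and non-negative numeric weights) can be checked before uploading a file. The following is a minimal sketch that uses only the Python standard library and assumes each row is a dictionary keyed by column name. The function name `check_training_rows` and the sample rows are hypothetical; this is not part of H2O Driverless AI.

```python
from datetime import datetime


def check_training_rows(rows, time_col="Date", weight_col="sample_weight"):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    previous = None
    for i, row in enumerate(rows):
        # The time column must parse as YYYY-MM-DD.
        try:
            current = datetime.strptime(row[time_col], "%Y-%m-%d").date()
        except (TypeError, ValueError):
            problems.append(f"row {i}: {row[time_col]!r} is not YYYY-MM-DD")
            continue
        # Rows must be ordered by date, ascending (equal dates are allowed,
        # because long format has one row per store and department per week).
        if previous is not None and current < previous:
            problems.append(f"row {i}: {current} is earlier than {previous}")
        previous = current
        # The optional weight must be numeric and greater than or equal to zero.
        weight = row.get(weight_col)
        if weight is not None and (
            not isinstance(weight, (int, float)) or weight < 0
        ):
            problems.append(f"row {i}: invalid weight {weight!r}")
    return problems


# Hypothetical long-format rows: one observation per store, department, and week.
good_rows = [
    {"Store": 1, "Dept": 1, "Date": "2012-04-20", "sample_weight": 1},
    {"Store": 1, "Dept": 2, "Date": "2012-04-20", "sample_weight": 1},
    {"Store": 1, "Dept": 1, "Date": "2012-04-27", "sample_weight": 1},
]
bad_rows = [
    {"Store": 1, "Dept": 1, "Date": "2012-04-27", "sample_weight": -2},
    {"Store": 1, "Dept": 1, "Date": "04/20/2012"},
    {"Store": 1, "Dept": 1, "Date": "2012-04-20"},
]
good_problems = check_training_rows(good_rows)
bad_problems = check_training_rows(bad_rows)
```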

Test dataset

Test dataset content

Let's explore the test dataset.

  1. In the H2O Driverless AI navigation menu, click DATASETS.
  2. In the Datasets table, click walmart_tts_small_test.csv.
  3. Select DETAILS.
  4. Click DATASET ROWS.

Test dataset structure

When structuring the test dataset for a time series model, consider the following major points:

  • The test dataset should be a contiguous subset of the data after the training dataset and contain the same columns as the training dataset. In other words, the test dataset's observations should be later than the training dataset's observations. 
  • Depending on your use case, there could be a gap between the training and test datasets. This tutorial has no gap between the training and test datasets. We will discuss this further in the following section (Time gap between the training and test dataset).
  • The time column (Date) format in the test dataset should be identical to the one in the training dataset.
  • The test dataset does not need to contain the sample weight column. A sample_weight column is used as auxiliary data to aid the training process, and therefore, it only needs to be present in the training dataset.
    Note: The provided test dataset contains the weight column, but we will observe later how H2O Driverless AI makes it unavailable for testing (a feature of the automatic machine learning process).

  • It is recommended to have at least 20-30% of the data as the test dataset.
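
The checks listed above can be sketched as a small helper that compares a training split with a test split. It assumes each row is a dictionary keyed by column name and that both splits use YYYY-MM-DD dates; a differently formatted date makes `parse_day` raise an error. The function names and the sample rows are hypothetical.

```python
from datetime import datetime


def parse_day(text):
    """Parse a YYYY-MM-DD string; raises ValueError for any other format."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def check_test_split(train_rows, test_rows, time_col="Date",
                     weight_col="sample_weight"):
    """Summarize how the test split relates to the training split."""
    # The weight column is optional in the test dataset, so ignore it here.
    train_cols = set(train_rows[0]) - {weight_col}
    test_cols = set(test_rows[0]) - {weight_col}
    last_train = max(parse_day(r[time_col]) for r in train_rows)
    first_test = min(parse_day(r[time_col]) for r in test_rows)
    total = len(train_rows) + len(test_rows)
    return {
        "same_columns": train_cols == test_cols,
        "test_after_train": first_test > last_train,
        "test_fraction": len(test_rows) / total,
    }


# Hypothetical rows for illustration only.
train_rows = [
    {"Store": 1, "Date": "2012-04-13", "Weekly_Sales": 10.0, "sample_weight": 1},
    {"Store": 1, "Date": "2012-04-20", "Weekly_Sales": 11.0, "sample_weight": 1},
    {"Store": 1, "Date": "2012-04-27", "Weekly_Sales": 12.0, "sample_weight": 1},
]
test_rows = [
    {"Store": 1, "Date": "2012-05-04", "Weekly_Sales": 13.0},
]
report = check_test_split(train_rows, test_rows)
```

Here `test_fraction` is measured in rows; measuring it in weeks or another unit may suit some use cases better.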

Time gap between the training and test dataset

You can introduce a time gap between the end of the training dataset and the start of the test dataset for the following reasons:

  • Simulating real-world scenarios: A gap between the training and test datasets mimics the delay that often occurs between when a model is trained and when it is deployed. This helps ensure that the model performs well on truly unseen data, which it will encounter in practical use.

  • Avoiding temporal correlation: The gap helps to mitigate any short-term temporal correlations that might artificially enhance model performance during validation. This ensures that the model's performance metrics are reliable and will hold in real-world forecasting scenarios.

For the purposes of this tutorial, there is no gap between the training and test datasets.

  • Training dataset: Starts on 02/05/2010 and stops on 04/27/2012 (inclusive).
    • Counting both the start and end dates, the training dataset contains information for 117 weeks, because the first date (02/05/2010) represents the end of the first week in the training dataset.
  • Test dataset: Starts on 05/04/2012 and stops on 10/26/2012 (inclusive).
    • Counting both the start and end dates, the test dataset contains information for 26 weeks, because the first date (05/04/2012) represents the end of the first week in the test dataset.
  • Gap period: Because the first test week (ending 05/04/2012) immediately follows the last training week (ending 04/27/2012), there is no time gap between the training and test datasets.
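
The week counts and the zero-week gap above can be reproduced with simple date arithmetic. This is a minimal sketch using the Python standard library; the helper name `weeks_inclusive` is hypothetical, and the dates are the ones listed above.

```python
from datetime import date, timedelta

WEEK = timedelta(days=7)


def weeks_inclusive(first_week_end, last_week_end):
    """Count weekly records when both end dates are included."""
    return (last_week_end - first_week_end) // WEEK + 1


train_start, train_end = date(2010, 2, 5), date(2012, 4, 27)
test_start, test_end = date(2012, 5, 4), date(2012, 10, 26)

train_weeks = weeks_inclusive(train_start, train_end)
test_weeks = weeks_inclusive(test_start, test_end)

# Number of whole weeks missing between the last training week and the
# first test week; 0 means the test data starts the very next week.
gap_weeks = (test_start - train_end) // WEEK - 1

print(train_weeks, test_weeks, gap_weeks)  # 117 26 0
```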

With the above in mind, we will build the time series model for this tutorial, assuming that we can always obtain the latest available data.

