Machine Learning Solutions

Training the baseline model

As you know, we have selected the RandomForestRegressor algorithm. We will be using the scikit-learn library to train the model. These are the steps we need to follow:

  1. Splitting the training and testing dataset
  2. Splitting prediction labels for the training and testing datasets
  3. Converting the sentiment scores into a numpy array
  4. Training the ML model

So, let's implement each of these steps one by one.

Splitting the training and testing dataset

We have 10 years of data. For training purposes, we will use 8 years of data, that is, from 2007 to 2014. For testing purposes, we will use the remaining 2 years, that is, 2015 and 2016. You can refer to the code snippet in the following screenshot to implement this:

Figure 2.22: Splitting the training and testing dataset

As you can see from the preceding screenshot, our training dataset has been stored in the train dataframe and our testing dataset has been stored in the test dataframe.
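The following is a minimal sketch of this split, assuming the combined data is loaded into a pandas dataframe with a Date column; the file name and column name are placeholders, so adjust them to match your dataset:

    import pandas as pd

    # Hypothetical file name; replace it with the path to your combined dataset,
    # which holds the dates, sentiment scores, and adj close prices for 2007-2016.
    df = pd.read_csv("stock_sentiment_2007_2016.csv", parse_dates=["Date"])

    # Training set: 8 years of data (2007 to 2014).
    train = df[(df["Date"].dt.year >= 2007) & (df["Date"].dt.year <= 2014)]

    # Testing set: the remaining 2 years (2015 and 2016).
    test = df[(df["Date"].dt.year >= 2015) & (df["Date"].dt.year <= 2016)]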

Splitting prediction labels for the training and testing datasets

When we split the dataset into training and testing sets, we also need to store the adj close price separately, because this is the value we want to predict (indicated in the code as prices). These price values are the labels for our training data; because we provide the actual prices as labels, this becomes supervised training. You can refer to the following code for the implementation:

Figure 2.23: Splitting the prediction labels for training and testing datasets

Here, all attributes except the price form the feature vector, and the price serves as the label. The ML algorithm takes these feature vector and label pairs, learns the underlying patterns, and predicts the price for unseen data.
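A minimal sketch of this step follows, assuming the adj close price column is called prices in both dataframes (the column and variable names are illustrative):

    # The adj close prices become the prediction labels for supervised training.
    # 'prices' is an assumed column name; rename it to match your data.
    y_train = train["prices"].values
    y_test = test["prices"].values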

Converting the sentiment scores into a numpy array

Before we start training, there is one last necessary step to keep in mind: converting the sentiment analysis scores into the numpy array format. Once we set the price attribute as the prediction label, our feature vector contains only the sentiment scores and the date. So, in order to generate a proper feature vector, we convert the sentiment scores into a numpy array. The code snippet to implement this is provided in the following screenshot:

Figure 2.24: Code snippet for converting sentiment analysis score into the numpy array

As you can see from the code snippet, we have performed the same conversion operation for both the training dataset and the testing dataset.
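A sketch of this conversion might look as follows; the sentiment score column names and the numpy_df_train/numpy_df_test variable names are assumptions, so replace them with the ones generated in your sentiment analysis step:

    import numpy as np

    # Assumed sentiment score column names; adjust to your own dataset.
    sentiment_cols = ["compound", "neg", "neu", "pos"]

    # Build the feature vectors as numpy arrays for both datasets.
    # Blank (non-numeric) values in these columns will raise a ValueError here,
    # which is the error mentioned in the note below.
    numpy_df_train = np.array(train[sentiment_cols], dtype=float)
    numpy_df_test = np.array(test[sentiment_cols], dtype=float)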

Note

Note that if you get a ValueError, check the dataset; there is a chance that a column in the dataset contains a blank or null value.

Now, let's train our model!

Training the ML model

In the first iteration, we use the RandomForestRegressor algorithm, which is provided as part of the scikit-learn library. You can find the code for this in the following screenshot:

Figure 2.25: Code snippet for training using RandomForestRegressor

As you can see from the preceding screenshot, we have used the default values for all of the hyperparameters. For a more detailed description of these hyperparameters, you can refer to http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
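A minimal sketch of this training step, using the illustrative variable names from the previous sketches (numpy_df_train for the training feature vectors and y_train for the training labels), could look like this:

    from sklearn.ensemble import RandomForestRegressor

    # Instantiate the regressor with scikit-learn's default hyperparameters.
    regressor = RandomForestRegressor()

    # Fit on the training feature vectors (sentiment scores) and the
    # corresponding adj close prices.
    regressor.fit(numpy_df_train, y_train)

Once fit returns, predictions for the unseen testing feature vectors can be generated with regressor.predict(numpy_df_test).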

Now that our model has been trained, we need to test it using our testing dataset. Before we test, let's discuss the approach we will take to test our model.