Feature engineering
As discussed earlier, we want to predict the close price of the DJIA index for a particular trading day. In this section, we will perform intuition-driven feature selection for our basic stock price prediction model. We have already generated the training dataset, so we will now load the saved .pkl dataset, select features, and perform some minor data processing. We will also generate a sentiment score for each of the filtered NYTimes news articles and use these scores to train our baseline model. We will use the following Python dependencies:
- numpy
- pandas
- nltk
This section has the following steps:
- Loading the dataset
- Minor preprocessing
- Feature selection
- Sentiment analysis
So, let's begin coding!
Loading the dataset
We have saved the data in the pickle format, and now we need to load data from it. You can refer to the following code snippet:
You can refer to the code by clicking on this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/Stock_Price_Prediction.ipynb.
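Since the snippet itself appears in the linked notebook, here is a minimal sketch of loading a pickled dataframe with pandas. The file name, column names, and sample values below are assumptions for illustration; substitute the path of the .pkl file you saved earlier:

```python
import pandas as pd

# Tiny stand-in for the real dataset (the actual .pkl path and columns
# are assumptions; in practice you would load your own saved file).
sample = pd.DataFrame(
    {"adj close": [17949.37, 17813.39],
     "articles": [". Stocks rally on Fed news", ". Markets slip at open"]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05"]),
)
sample.to_pickle("sample_stock_news.pkl")

# Load the pickled dataframe back, exactly as we would with the real file.
df_stocks = pd.read_pickle("sample_stock_news.pkl")
print(df_stocks.head())
```

Note that `pd.read_pickle()` restores the dataframe with its index and dtypes intact, which is why the pickle format is convenient here.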
As you can see, in the dataframe output, there is a dot (.) before every article headline in the entire dataset, so we need to remove these dots. We will execute this change in the next section.
Minor preprocessing
As a part of minor preprocessing, we will be performing the following two changes:
- Converting the adj close prices into the integer format
- Removing the leftmost dot (.) from news headlines
Converting adj close price into the integer format
The adj close price is stored as a float, so here we will convert these float values to integers and store the converted values in a new price attribute in our pandas dataframe. Now, you may wonder why we consider only the adj close price. Bear with me for a while, and I will give you the reason for that. You can find the conversion code snippet in the following screenshot:
Tip
You can refer to the code at this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/Stock_Price_Prediction.ipynb.
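As a sketch of the conversion step (the column names and sample values are assumptions; the real dataframe comes from the pickle load):

```python
import pandas as pd

# Hypothetical sample rows standing in for the loaded dataset.
df_stocks = pd.DataFrame({"adj close": [17949.37, 17813.39]})

# Truncate each float price to an integer and keep it in a new
# 'price' column; the original 'adj close' column stays untouched.
df_stocks["price"] = df_stocks["adj close"].astype(int)
print(df_stocks)
```

Here `astype(int)` truncates toward zero (17949.37 becomes 17949); use `round()` first if you prefer rounding.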
Removing the leftmost dot from news headlines
In this section, we will see the implementation for removing the leftmost dot. We will use the lstrip() function to remove it. You can refer to the code snippet in the following screenshot:
Tip
You can refer to the code at this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/Stock_Price_Prediction.ipynb.
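The dot-removal step can be sketched as follows; the column name and sample headlines are assumptions for illustration:

```python
import pandas as pd

# Hypothetical headlines with the leading dot seen in the dataset.
df_stocks = pd.DataFrame(
    {"articles": [". Stocks rally on Fed news", ". Markets slip at open"]}
)

# lstrip('. ') strips any leading dots and spaces from each headline.
df_stocks["articles"] = df_stocks["articles"].str.lstrip(". ")
print(df_stocks["articles"].tolist())
```

Note that `lstrip()` treats its argument as a set of characters to strip, so `'. '` removes both the dot and the space that follows it.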
Now, let's move on to our next section, which is feature engineering.
Feature engineering
Feature selection is one of the most important aspects of feature engineering and of any Machine Learning (ML) application, so here we will focus on feature selection. In the previous section, I raised the question of why we select only the adj close price and not the close price. The answer lies in feature selection. We select the adj close price because it gives us a better idea of the final value of the DJIA index, since it accounts for stocks, mutual funds, dividends, and so on. In our dataset, the close price is mostly the same as the adj close price. However, if we were to use only the close price, we could not derive the adj close price for unseen data records: the adj close price may be equal to the close price or higher than it (because it includes stocks, mutual funds, dividends, and so on), and we cannot know how much higher it would be for unseen records where only the close price is available. If we use the adj close price instead, we know that the close price can be less than or equal to it, but never greater; the adj close price is, in effect, the maximum possible closing price. So, we will use the adj close price for development and for the baseline model, and we have renamed this column to price. You can refer to the following code snippet:
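The selection and renaming step can be sketched as follows; the column names and sample values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample rows; the real dataframe comes from the pickle load.
df_stocks = pd.DataFrame({
    "adj close": [17949.37, 17813.39],
    "close": [17948.00, 17812.50],
    "articles": ["Stocks rally on Fed news", "Markets slip at open"],
})

# Keep only the adjusted close and the headlines, renaming 'adj close'
# to 'price' and truncating it to an integer for the baseline model.
df = df_stocks[["adj close", "articles"]].rename(columns={"adj close": "price"})
df["price"] = df["price"].astype(int)
print(df.columns.tolist())  # → ['price', 'articles']
```

Selecting columns before renaming keeps the original dataframe intact, which is handy if you later want to compare against the raw close price.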
As a next step, we will now perform sentiment analysis on the news article dataset. We can use the sentiment score when we train our model. So, let's move on to the sentiment analysis part.
Sentiment analysis of NYTimes news articles
In order to implement sentiment analysis, we are using nltk's built-in sentiment analysis module. We will obtain negative, positive, and compound sentiment scores. We have used a lexicon-based approach: the words of each sentence are analyzed and, based on the SentiWordNet score, each word is assigned a specific sentiment score; the aggregate sentence-level score is then derived from these word scores.
Note
SentiWordNet is a lexical resource that contains sentiment scores for words.
We will cover details related to sentiment analysis in Chapter 5, Sentiment Analysis. You can refer to the following sentiment analysis code snippet:
All scores generated by the preceding code are stored in the dataframe, so you can see the aggregate score of news article headlines in the following screenshot:
By the end of this section, we have obtained the sentiment score for the NYTimes news articles dataset and combined these sentiment scores into the training dataset. So far, we have done minor preprocessing, selected the data attributes as per our intuition, and generated the sentiment scores. Now, we will select a machine learning algorithm and try to build the baseline model. So, let's move on to the next section.