Machine Learning Solutions

Understanding the dataset

In this section, we will look at the meaning of the data attributes, which will help us understand what kind of dataset we are dealing with and what kind of preprocessing it needs. We will cover the dataset in two parts:

  • Understanding the DJIA dataset
  • Understanding the NYTimes news article dataset

Understanding the DJIA dataset

In the DJIA dataset, we have seven data attributes. They are quite easy to understand, so let's look at each of them one by one:

  • Date: The first column gives the trading date, which appears in the YYYY-MM-DD format in the .csv file.
  • Open: This indicates the price at which the market opens, so it is the opening value for the DJIA index for that particular trading day.
  • High: This is the highest price for the DJIA index for a particular trading day.
  • Low: This is the lowest price for the DJIA index for a particular trading day.
  • Close: This is the price of the DJIA index at the close of the trading day.
  • Adj close: The adjusted closing price (adj close price) uses the closing price as a starting point and takes into account factors such as dividends, stock splits, and new stock offerings, so it gives a truer reflection of value than the raw closing price. Here is a simplified example: if a company pays a dividend of $5 per share and the closing price of its share is $100, then the adj close price becomes $95. Here, we are looking at the DJIA index value rather than an individual stock, so most of the time the closing price and the adj close price are the same.
  • Volume: These values indicate the trading volume recorded on the exchange for a particular trading day.

These are the basic details of the DJIA index dataset. We use historical data and try to predict future movement in the DJIA index.
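
To make the seven attributes concrete, here is a minimal sketch that parses DJIA-style rows with Python's standard csv module. The header names (in particular Adj Close), the SAMPLE_CSV rows, and the load_djia_rows helper are assumptions made for illustration; the numbers are made up and are not real index values.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows: the values are invented for illustration only.
# The header names are an assumption about how the .csv file is laid out.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2016-12-29,19800.00,19900.00,19750.00,19820.00,19820.00,170000000
2016-12-30,19830.00,19850.00,19700.00,19760.00,19760.00,270000000
"""


def load_djia_rows(csv_text):
    """Parse DJIA rows and convert each attribute to a suitable Python type."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            # Date is stored as YYYY-MM-DD in the file.
            "date": datetime.strptime(record["Date"], "%Y-%m-%d").date(),
            "open": float(record["Open"]),
            "high": float(record["High"]),
            "low": float(record["Low"]),
            "close": float(record["Close"]),
            "adj_close": float(record["Adj Close"]),
            "volume": int(record["Volume"]),
        })
    return rows


rows = load_djia_rows(SAMPLE_CSV)

# For an index, Close and Adj Close are usually identical; check that here.
close_matches_adj_close = all(r["close"] == r["adj_close"] for r in rows)

for r in rows:
    print(r["date"].isoformat(), r["open"], r["close"], r["volume"])
print("Close equals Adj Close on every row:", close_matches_adj_close)
```

With a real file, the same function could be given the file's text content instead of SAMPLE_CSV, provided the header row matches the assumed names.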

In the next section, we will look at the NYTimes news article dataset.

Understanding the NYTimes news article dataset

We used the NYTimes developer API to collect the news articles in JSON form. Here, we will look at the JSON response so that we can identify the most important data attributes to focus on. In the next figure, you can see the JSON response that we get from the NYTimes:

Figure 2.4: JSON response for news articles using the NYTimes developer tool

In this figure, we can see the JSON response for a single news article. As you can see, there is a top-level data attribute, response, that carries all the other data attributes. We will focus on the data attributes that are given inside the docs array. Don't worry; we will not use all of them. Here, we will focus on the following data attributes:

  • type_of_material: This attribute indicates what kind of material a particular news article is, whether it's a blog, a news article, an analysis, and so on.
  • headline: The headline data attribute has two sub-attributes. The main data attribute contains the actual headline of the news, and the kicker data attribute conveys the highlight of the article.
  • pub_date: This data attribute indicates the publication date of the news article. You can find this attribute in the second-last section of the docs array.
  • section_name: This data attribute appears in the last section of the preceding image. It provides the category of the news article.
  • news_desk: This data attribute also indicates the news category. When section_name is absent in a response, we will refer to this attribute.
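
The attribute selection described above can be sketched with Python's standard json module. This is a minimal illustration rather than the exact code used for the dataset: the SAMPLE_RESPONSE content is made up, the extract_article_fields helper is hypothetical, and the sketch assumes the headline object is stored under the headline key with main and kicker inside it.

```python
import json

# Hypothetical response: the structure follows the attributes discussed above,
# but every value is invented for illustration.
SAMPLE_RESPONSE = """
{
  "response": {
    "docs": [
      {
        "type_of_material": "News",
        "headline": {"main": "Example headline", "kicker": "Example kicker"},
        "pub_date": "2016-12-30T00:00:00Z",
        "section_name": "Business Day",
        "news_desk": "Business"
      },
      {
        "type_of_material": "Blog",
        "headline": {"main": "Another example headline"},
        "pub_date": "2016-12-31T00:00:00Z",
        "news_desk": "Business"
      }
    ]
  }
}
"""


def extract_article_fields(doc):
    """Keep only the attributes of interest from one entry of the docs array."""
    headline = doc.get("headline") or {}
    return {
        "type_of_material": doc.get("type_of_material"),
        "headline": headline.get("main"),
        "kicker": headline.get("kicker"),
        "pub_date": doc.get("pub_date"),
        # Fall back to news_desk when section_name is absent or empty.
        "category": doc.get("section_name") or doc.get("news_desk"),
    }


docs = json.loads(SAMPLE_RESPONSE)["response"]["docs"]
articles = [extract_article_fields(doc) for doc in docs]

for article in articles:
    print(article["pub_date"], article["category"], article["headline"])
```

Using dict.get throughout means that a missing attribute yields None instead of raising an error, which is convenient when some articles omit fields such as kicker or section_name.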

Now that we understand the data attributes, we can move on to the next section, which covers data preprocessing and data analysis.