The Machine Learning Workshop

Data Representation

The main objective of ML is to build models by interpreting data. To do so, the data must be supplied in a form that the computer can read. A scikit-learn model expects its input as a table or matrix of the required dimensions, which we will discuss in the following section.

Tables of Data

Most tables that are fed into ML problems are two-dimensional, meaning that they contain rows and columns. Conventionally, each row represents an observation (an instance), whereas each column represents a characteristic (feature) of each observation.

The following table is a fragment of one of scikit-learn's sample datasets. The purpose of the dataset is to differentiate among three types of iris plants based on their characteristics. Hence, in the following table, each row represents a plant and each column holds the value of one feature for every plant:

Figure 1.2: A table showing the first 10 instances of the iris dataset

By reviewing the first row of the preceding table, it is possible to determine that the observation corresponds to a plant with a sepal length of 5.1, a sepal width of 3.5, a petal length of 1.4, and a petal width of 0.2. The plant belongs to the setosa species.
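As a minimal sketch of this layout, a fragment of the table can be written out by hand as a Pandas DataFrame. The first row uses the values quoted above, and the remaining rows follow the standard iris data. The column names and the variable name iris_fragment are assumptions chosen for illustration and do not come from scikit-learn.

```python
import pandas as pd

# Hand-written fragment of the iris data. The first row matches the values
# quoted in the text; the column names are illustrative assumptions.
iris_fragment = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7],
        "sepal_width": [3.5, 3.0, 3.2],
        "petal_length": [1.4, 1.4, 1.3],
        "petal_width": [0.2, 0.2, 0.2],
        "species": ["setosa", "setosa", "setosa"],
    }
)

# Each row is one observation (a plant); each column is one feature.
n_rows, n_columns = iris_fragment.shape
print(n_rows, n_columns)

# Read the first observation back out of the table.
first_plant = iris_fragment.iloc[0]
print(first_plant["sepal_length"], first_plant["species"])
```

Here, iloc[0] selects a row by position, which mirrors how we read the first row of the table above.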

Note

When feeding images to a model, the tables become three-dimensional: the rows and columns represent the dimensions of the image in pixels, while the depth represents its color channels. If you are interested, feel free to find out more about convolutional neural networks.

Data in tables are also known as structured data. Unstructured data, on the other hand, refers to everything else that cannot be stored in a table-like database (that is, in rows and columns). This includes images, audio, videos, and text (such as emails or reviews). To be able to feed unstructured data into an ML algorithm, the first step should be to transform it into a format that the algorithm can understand (tables of data). For instance, images are converted into matrices of pixels, and text is encoded into numeric values.
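The following sketch illustrates both conversions with NumPy. The tiny image and the two reviews are made-up examples, and the word-count encoding shown is only one simple way to turn text into numbers, not the method used later in this book.

```python
import numpy as np

# A hypothetical 2x2 color image: rows x columns x color channels (RGB).
image = np.array(
    [
        [[255, 0, 0], [0, 255, 0]],
        [[0, 0, 255], [255, 255, 255]],
    ],
    dtype=np.uint8,
)
print(image.shape)  # (2, 2, 3)

# Two made-up reviews, encoded as word counts (one row per review,
# one column per word in the vocabulary).
reviews = ["good food good service", "bad service"]
vocabulary = sorted({word for review in reviews for word in review.split()})
counts = np.array(
    [[review.split().count(word) for word in vocabulary] for review in reviews]
)
print(vocabulary)
print(counts)
```

After the encoding, each review is a row and each word is a column, so the text has the same table-like shape as the iris data.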

Features and Target Matrices

For many data problems, one of the features of your dataset will be used as a label. This means that, out of all the features, this one is the target that the model should learn to predict from the others. For example, in the preceding table, we might choose the species as the target feature, so we would like the model to find patterns based on the other features to determine whether a plant belongs to the setosa species. Therefore, it is important to learn how to separate the target matrix from the features matrix.

Features Matrix: The features matrix comprises data from each instance for all features, except the target. It can be created using either a NumPy array or a Pandas DataFrame, and its dimensions are [n_i, n_f], where n_i denotes the number of instances (such as the people in the dataset) and n_f denotes the number of features (such as the demographics of each person). Generally, the features matrix is stored in a variable named X.

Note

Pandas is an open source library built for Python. It was created to tackle different tasks related to data manipulation and analysis. Likewise, NumPy is an open source Python library that is used to manipulate large multi-dimensional arrays. It also provides a large set of mathematical functions to operate over such arrays.

Target Matrix: Unlike the features matrix, the target matrix is usually one-dimensional since it only carries one feature for all instances, meaning that its length is n_i (the number of instances). Nevertheless, there are some occasions where multiple targets are required, so the dimensions of the matrix become [n_i, n_t], where n_t is the number of targets to consider.

Like the features matrix, the target matrix is usually created as a NumPy array or a Pandas Series. The values of the target array may be discrete or continuous. Generally, the target matrix is stored in a variable named Y.
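The dimensions described above can be checked directly with NumPy. The following is a minimal sketch using made-up values; the variable name Y_multi is an assumption introduced here to show the multiple-target case.

```python
import numpy as np

# Hypothetical features matrix: 4 instances (n_i = 4), 3 features (n_f = 3).
X = np.array(
    [
        [5.1, 3.5, 1.4],
        [4.9, 3.0, 1.4],
        [6.3, 3.3, 6.0],
        [5.8, 2.7, 5.1],
    ]
)

# Single target: one discrete value per instance, so the shape is (n_i,).
Y = np.array([0, 0, 1, 1])

# Multiple targets: two values per instance, so the shape is (n_i, n_t).
Y_multi = np.array(
    [
        [0, 0.2],
        [0, 0.2],
        [1, 2.5],
        [1, 1.9],
    ]
)

print(X.shape)        # (4, 3)
print(Y.shape)        # (4,)
print(Y_multi.shape)  # (4, 2)

# The features and target matrices must describe the same instances.
assert X.shape[0] == Y.shape[0] == Y_multi.shape[0]
```

Whatever the number of features or targets, the first dimension of every matrix is n_i, since each row refers to the same instance.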

Exercise 1.01: Loading a Sample Dataset and Creating the Features and Target Matrices

Note

All of the exercises and activities in this book will be primarily developed in Jupyter Notebooks. It is recommended to keep a separate Notebook for each assignment, unless advised otherwise. Also, to load a sample dataset, the seaborn library will be used, as it displays the data as a table. Other ways to load data will be explained in later sections.

In this exercise, we will be loading the tips dataset from the seaborn library and creating features and target matrices using it. Follow these steps to complete this exercise:

Note

For the exercises and activities within this chapter, ensure that you have Python 3.7, Seaborn 0.9, Jupyter 6.0, Matplotlib 3.1, NumPy 1.18, and Pandas 0.25 installed on your system.

  1. Open a Jupyter Notebook to complete this exercise. In the Command Prompt or Terminal, navigate to the desired path and use the following command:

    jupyter notebook

  2. Load the tips dataset using the seaborn library. To do so, you need to import the seaborn library and then use the load_dataset() function, as shown in the following code:

    import seaborn as sns

    tips = sns.load_dataset('tips')

    As we can see from the preceding code, after importing the library, an alias (sns) is given to make it easier to reference in the script.

    The load_dataset() function loads datasets from an online repository. The data from the dataset is stored in a variable named tips.

  3. Create a variable, X, to store the features. Use the drop() function to include all of the features but the target, which in this case is named tip. Then, print out the top 10 instances of the variable:

    X = tips.drop('tip', axis=1)

    X.head(10)

    Note

    The axis parameter in the preceding snippet denotes whether you want to drop the label from rows (axis = 0) or columns (axis = 1).

    The printed output should look as follows:

    Figure 1.3: A table showing the first 10 instances of the features matrix

  4. Print the shape of your new variable using the X.shape command:

    X.shape

    The output is as follows:

    (244, 6)

    The first value indicates the number of instances in the dataset (244), while the second value represents the number of features (6).

  5. Create a variable, Y, that will store the target values. There is no need to use a function for this. Use indexing to grab only the desired column. Indexing allows you to access a section of a larger element. In this case, we want to grab the column named tip. Then, we need to print out the top 10 values of the variable:

    Y = tips['tip']

    Y.head(10)

    The printed output should look as follows:

    Figure 1.4: A screenshot showing the first 10 instances of the target matrix

  6. Print the shape of your new variable using the Y.shape command:

    Y.shape

    The output is as follows:

    (244,)

    The shape should be one-dimensional with a length equal to the number of instances (244).

    Note

    To access the source code for this specific section, please refer to https://packt.live/2Y5dgZH.

    You can also run this example online at https://packt.live/3d0Hsco. You must execute the entire Notebook in order to get the desired result.

With that, you have successfully created the features and target matrices of a dataset.
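The two key operations of this exercise, dropping a column and indexing a column, can also be tried without downloading anything. The following sketch builds a small stand-in table with the same column names as the tips dataset. The variable names sample and without_first_row are assumptions introduced here, and the values are only illustrative.

```python
import pandas as pd

# Stand-in for sns.load_dataset('tips'): same column names, illustrative values.
sample = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01],
        "tip": [1.01, 1.66, 3.50],
        "sex": ["Female", "Male", "Male"],
        "smoker": ["No", "No", "No"],
        "day": ["Sun", "Sun", "Sun"],
        "time": ["Dinner", "Dinner", "Dinner"],
        "size": [2, 3, 3],
    }
)

# axis=1 drops the column labeled 'tip'; every row is kept.
X = sample.drop("tip", axis=1)

# axis=0 would instead drop the row whose index label is 0.
without_first_row = sample.drop(0, axis=0)

# Indexing with a single column name returns a one-dimensional Series.
Y = sample["tip"]

print(X.shape)                  # (3, 6)
print(without_first_row.shape)  # (2, 7)
print(Y.shape)                  # (3,)
```

Note that drop() returns a new DataFrame and leaves the original untouched, which is why the tip column is still available when Y is created.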

Generally, the preferred way to represent data is by using two-dimensional tables, where the rows represent the observations, also known as instances, and the columns represent the characteristics of those instances, commonly known as features.

For data problems that require target labels, the data table needs to be partitioned into a features matrix and a target matrix. The features matrix will contain the values of all features but the target, for each instance, making it a two-dimensional matrix. On the other hand, the target matrix will only contain the value of the target feature for all entries, making it a one-dimensional matrix.
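This partitioning can be wrapped in a small reusable function. The following is a sketch only: split_features_target is a hypothetical helper written for this example, not a Pandas or scikit-learn function, and the table it is applied to contains made-up values.

```python
import pandas as pd


def split_features_target(data, target):
    """Return (X, Y): Y is the `target` column, X is every other column."""
    if target not in data.columns:
        raise KeyError(f"column {target!r} not found in the table")
    X = data.drop(target, axis=1)  # two-dimensional features matrix
    Y = data[target]               # one-dimensional target matrix
    return X, Y


# Made-up table with two features and one label column.
data = pd.DataFrame(
    {
        "height": [1.62, 1.75, 1.80],
        "weight": [55.0, 72.5, 81.0],
        "label": ["a", "b", "b"],
    }
)

X, Y = split_features_target(data, "label")
print(X.ndim, Y.ndim)  # 2 1
print(len(X) == len(Y))
```

Checking that the target column exists before splitting gives a clearer error message if the name is misspelled.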

Activity 1.01: Selecting a Target Feature and Creating a Target Matrix

You want to analyze the Titanic dataset to see the survival rate of the passengers on different decks and see if you can prove a hypothesis stating that passengers on the lower decks were less likely to survive. In this activity, we will attempt to load a dataset and create the features and target matrices by choosing the appropriate target feature for the objective at hand.

Note

To choose the target feature, remember that the target should be the outcome that we want the data to explain. For instance, if we want to know what features play a role in determining a plant's species, the species should be the target value.

Follow these steps to complete this activity:

  1. Load the titanic dataset using the seaborn library. The first few rows should look like this:

    Figure 1.5: A table showing the first 10 instances of the Titanic dataset

  2. Select your preferred target feature for the goal of this activity.
  3. Create both the features matrix and the target matrix. Make sure that you store the data from the features matrix in a variable, X, and the data from the target matrix in another variable, Y.
  4. Print out the shape of each of the matrices, which should match the following values:

    Features matrix: (891, 14)

    Target matrix: (891,)

    Note

    The solution for this activity can be found on page 210.