Building Machine Learning Systems with Python

Classifying with Real-World Examples

The topic of this chapter is classification. In this setting of machine learning, you provide the system with examples of different classes of objects that you are interested in and then ask it to generalize to new examples where the class is not known. This may seem abstract, but you have probably already used this form of machine learning as a consumer, even if you were not aware of it: your email system likely has the ability to detect spam automatically. That is, the system analyzes all incoming emails and marks each one as either spam or not spam. Often, you, the end user, can manually tag emails as spam or not spam in order to improve the system's spam detection ability. This is exactly what we mean by classification: you provide examples of spam and non-spam emails and then use an automated system to classify incoming emails. This is one of the most important machine learning modes and is the topic of this chapter.

Working with text such as emails requires a specific set of techniques and skills, and we discuss those later in the book. For the moment, we will work with a smaller, easier-to-handle dataset. The example question for this chapter is: can a machine distinguish between flower species based on images? We will use two datasets where measurements of flower morphology are recorded along with the species for several specimens.

We will explore these small datasets in order to focus on the high-level concepts. The important elements of this chapter are the following:

  • What classification is
  • How scikit-learn can be used for classification and which classifier is a good solution for most problems
  • How to rigorously evaluate a classifier and avoid fooling ourselves
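
To make these three points concrete, here is a minimal sketch of the workflow, not necessarily the approach the chapter itself follows. It rests on several assumptions: the Iris dataset bundled with scikit-learn stands in for the flower data, a k-nearest neighbors classifier is used purely as an illustrative choice (it is not a claim about which classifier suits most problems), and the split ratio and random seed are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Iris (flower measurements plus species labels) stands in
# for the datasets used in the chapter.
data = load_iris()
features, labels = data.data, data.target

# Hold out a quarter of the examples. The classifier never sees them
# during training, so the score below is not inflated by memorization.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

# Assumption: k-nearest neighbors is an illustrative choice only.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Generalize to examples whose class the model was not told.
predictions = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Scoring on held-out examples rather than on the training data is the simplest guard against fooling ourselves: a model can reproduce its training labels perfectly and still generalize poorly.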