The Machine Learning Workshop
上QQ阅读APP看书,第一时间看更新

Scikit-Learn

Created in 2007 by David Cournapeau as part of a Google Summer of Code project, scikit-learn is an open source Python library made to facilitate the process of building models based on built-in ML and statistical algorithms, without the need for hardcoding. The main reasons for its popular use are its complete documentation, its easy-to-use API, and the many collaborators who work every day to improve the library.

Note

You can find the documentation for scikit-learn at http://scikit-learn.org.

Scikit-learn is mainly used to model data, and not as much to manipulate or summarize data. It offers its users an easy-to-use, uniform API to apply different models with little learning effort, and no real knowledge of the math behind it is required.

Note

Some of the math topics that you need to know about to understand the models are linear algebra, probability theory, and multivariate calculus. For more information on these models, visit https://towardsdatascience.com/the-mathematics-of-machine-learning-894f046c568.

The models that are available in the scikit-learn library fall into two categories, that is, supervised and unsupervised, both of which will be explained in depth later in this chapter. This form of category classification will help to determine which model to use for a particular dataset to get the most information out of it.

Besides its main use for predicting future behavior in supervised learning problems and clustering data in unsupervised learning problems, scikit-learn is also used for the following reasons:

  • To carry out cross-validation and performance metrics analysis to understand the results that have been obtained from the model, and thereby improve its performance
  • To obtain sample datasets to test algorithms on them
  • To perform feature extraction to extract features from images or text data

Although scikit-learn is considered the preferred Python library for beginners in the world of ML, there are several large companies around the world that use it because it allows them to improve their products or services by applying the models to already existing developments. It also permits them to quickly implement tests on new ideas.

Some of the leading companies that are using scikit-learn are as follows:

  • Spotify: One of the most popular music streaming applications, Spotify makes use of scikit-learn mainly due to the wide variety of algorithms that the framework offers, as well as how easy it is to implement the new models into their current developments. Scikit-learn has been used as part of its music recommendation model.
  • Booking.com: From developing recommendation systems to preventing fraudulent activities, among many other solutions, this travel metasearch engine has been able to use scikit-learn to explore a large number of algorithms that allow the creation of state-of-the-art models.
  • Evernote: This note-taking and management app uses scikit-learn to tackle several of the steps required to train a classification model, from data exploration to model evaluation.
  • Change.org: Thanks to the framework's ease of use and variety of algorithms, this non-profit organization has been able to create email marketing campaigns that reach millions of readers around the world.

    Note

    You can visit http://scikit-learn.org/stable/testimonials/testimonials.html to discover other companies that are using scikit-learn and see what they are using it for.

In conclusion, scikit-learn is an open source Python library that uses an API to apply most ML tasks (both supervised and unsupervised) to data problems. Its main use is for modeling data so that predictions can be made about unseen observations; nevertheless, it should not be limited to that as the library also allows users to predict outcomes based on the model being trained, as well as to analyze the performance of the model, among other features.

Advantages of Scikit-Learn

The following is a list of the main advantages of using scikit-learn for ML purposes:

  • Ease of use: Scikit-learn is characterized by a clean API, with a small learning curve in comparison to other libraries, such as TensorFlow or Keras. The API is popular for its uniformity and straightforward approach. Users of scikit-learn do not necessarily need to understand the math behind the models.
  • Uniformity: Its uniform API makes it very easy to switch from model to model as the basic syntax that's required for one model is the same for others.
  • Documentation/tutorials: The library is completely backed up by documentation, which is effortlessly accessible and easy to understand. Additionally, it also offers step-by-step tutorials that cover all of the topics required to develop any ML project.
  • Reliability and collaborations: As an open source library, scikit-learn benefits from the input of multiple collaborators who work each day to improve its performance. This participation from many experts from different contexts helps to develop not only a more complete library but also a more reliable one.
  • Coverage: As you scan the list of components that the library has, you will discover that it covers most ML tasks, ranging from supervised models such as performing a regression task to unsupervised models such as the ones used to cluster data into subgroups. Moreover, due to its many collaborators, new models tend to be added in relatively short amounts of time.

Disadvantages of Scikit-Learn

The following is a list of the main disadvantages of using scikit-learn for ML purposes:

  • Inflexibility: Due to its ease of use, the library tends to be inflexible. This means that users do not have much liberty in parameter tuning or model architecture, such as with the Gradient Boost algorithm and neural networks. This becomes an issue as beginners move to more complex projects.
  • Not good for deep learning: The performance of the library falls short when tackling complex ML projects. This is especially true for deep learning, as scikit-learn does not support deep neural networks with the necessary architecture or power.

    Note

    Deep learning is a part of ML and is based on the concept of artificial neural networks. It uses a sequence of layers to extract valuable information (features) from the input data. In subsequent sections of this book, you will learn about neural networks, which is the starting point of being able to develop deep learning solutions.

In general terms, scikit-learn is an excellent beginner's library as it requires little effort to learn its use and has many complementary materials thought to facilitate its application. Due to the contributions of several collaborators, the library stays up to date and is applicable to most current data problems.

On the other hand, it is a simple library that's not fit for more complex data problems such as deep learning. Likewise, it is not recommended for users who wish to take their abilities to a higher level by playing with the different parameters that are available in each model.

Other Frameworks

Other popular ML frameworks are as follows:

  • TensorFlow: Google's open source framework for ML, which to this day is still the most popular among data scientists. It is typically integrated with Python and is very good for developing deep learning solutions. Due to its popularity, the information that's available on the internet about the framework makes it very easy to develop different solutions, not to mention that it is backed by Google.
  • PyTorch: This was primarily developed by Facebook's AI Research lab as an open source deep learning framework. Although it is a fairly new framework (released in 2017), it has grown in popularity due to its ease of use and Pythonic nature. It allows easy code debugging thanks to the use of dynamic graph computations.
  • Keras: This is an open source deep learning framework that's typically good for those who are just starting out. Due to its simplicity, it is less flexible but ideal for prototyping simple concepts. Similar to scikit-learn, it has its own easy-to-use API.