Machine Learning with Scala Quick Start Guide
上QQ阅读APP看书,第一时间看更新

Description of the dataset

We will use a recently added cryotherapy dataset from the UCI machine learning repository. The dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/Cryotherapy+Dataset+#.

This dataset contains information about wart treatment results of 90 patients using cryotherapy. In case you don't know, a wart is a kind of skin problem caused by infection with a type of human papillomavirus. Warts are typically small, rough, and hard growths that are similar in color to the rest of the skin.

There are two available treatments for this problem:

  • Salicylic acid: A type of gel containing salicylic acid used in medicated band-aids.
  • Cryotherapy: A freezing liquid (usually nitrogen) is sprayed onto the wart. It will destroy the cells in the affected area. After the cryotherapy, usually, a blister develops, which eventually turns into a scab and falls off after a week or so.

There are 90 samples or instances that were either recommended to go through cryotherapy or be discharged without cryotherapy. There are seven attributes in the dataset:

  • sex: Patient gender, characterized by 1 (male) or 0 (female).
  • age: Patient age.
  • Time: Observation and treatment time in hours.
  • Number_of_Warts: Number of warts.
  • Type: Types of warts.
  • Area: The amount of affected area.
  • Result_of_Treatment: The recommended result of the treatment, characterized by either 1 (yes) or 0 (no). It is also the target column.

As you can understand, it is a classification problem because we will have to predict discrete labels. More specifically, it is a binary classification problem. Since this is a small dataset with only six features, we can start with a very basic classification algorithm called logistic regression, where the logistic function is applied to the regression to get the probabilities of it belonging in either class. We will learn more details about logistic regression and other classification algorithms in Chapter 3, Scala for Learning Classification. For this, we use the Spark ML-based implementation of logistic regression in Scala.