Machine Learning with Scala Quick Start Guide
上QQ阅读APP看书,第一时间看更新

Preprocessing and feature engineering

As per the dataset description on the UCI machine learning repository, there are no null values. Also, the Spark ML-based classifiers expect numeric values to model them. The good thing is that, as seen in the schema, all the required fields are numeric (that is, either integers or floating point values). Also, the Spark ML algorithms expect a label column, which in our case is Result_of_Treatment. Let's rename it to label using the Spark-provided withColumnRenamed() method:

//Spark ML algorithm expect a 'label' column, which is in our case 'Survived". Let's rename it to 'label'
CryotherapyDF = CryotherapyDF.withColumnRenamed("Result_of_Treatment", "label")
CryotherapyDF.printSchema()

All the Spark ML-based classifiers expect training data containing two objects called label (which we already have) and features. We have seen that we have six features. However, those features have to be assembled to create a feature vector. This can be done using the VectorAssembler() method. It is one kind of transformer from the Spark ML library. But first we need to select all the columns except the label column:

val selectedCols = Array("sex", "age", "Time", "Number_of_Warts", "Type", "Area")

Then we instantiate a VectorAssembler() transformer and transform as follows:

val vectorAssembler = new VectorAssembler()
.setInputCols(selectedCols)
.setOutputCol("features")
val numericDF = vectorAssembler.transform(CryotherapyDF)
.select("label", "features")
numericDF.show()

As expected, the last line of the preceding code segment shows the assembled DataFrame having label and features, which are needed to train an ML algorithm: