Java Deep Learning Projects

Titanic survival revisited with DL4J

In the preceding chapter, we solved the Titanic survival prediction problem using a Spark-based MLP. We also saw that a Spark-based MLP gives the user very little transparency into the layering structure; moreover, there was no explicit way to define the hyperparameters, and so on.

Therefore, what I have done is take the training dataset and perform some preprocessing and feature engineering on it. Then I randomly split the pre-processed dataset into training and test sets (to be precise, 70% for training and 30% for testing). First, we create the Spark session as follows:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "temp/") // change accordingly
        .appName("TitanicSurvivalPrediction")
        .getOrCreate();

In this chapter, we have seen that there are two CSV files. However, the test.csv file does not provide any ground truth. Therefore, I decided to use only train.csv, so that we can evaluate the model's performance. So let's read the training dataset using Spark's read() API:

Dataset<Row> df = spark.sqlContext()
        .read()
        .format("com.databricks.spark.csv")
        .option("header", "true") // use the first line of the file as the header
        .option("inferSchema", "true") // automatically infer data types
        .load("data/train.csv");
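Since we relied on inferSchema, it is worth a quick sanity check that Spark inferred the types we expect (this check is mine, not part of the original walkthrough):

df.printSchema(); // print the column names and the inferred types
df.show(5);       // peek at the first five rows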

We have seen in Chapter 1, Getting Started with Deep Learning, that the Age and Fare columns have many null values. So, instead of writing a UDF for each column, here I just replace the missing values of the Age and Fare columns with their means:

Map<String, Object> m = new HashMap<>();
m.put("Age", 30);    // approximate mean of the Age column
m.put("Fare", 32.2); // approximate mean of the Fare column
Dataset<Row> trainingDF1 = df.na().fill(m);
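Hardcoding 30 and 32.2 works here, but those values go stale if the data changes. As a minimal alternative sketch (my addition, using Spark SQL's avg function; the variable names are hypothetical), the means can be computed from the DataFrame itself:

import static org.apache.spark.sql.functions.avg;

// compute the column means from the data instead of hardcoding them
double meanAge = df.select(avg("Age")).first().getDouble(0);
double meanFare = df.select(avg("Fare")).first().getDouble(0);

Map<String, Object> imputeMap = new HashMap<>();
imputeMap.put("Age", meanAge);
imputeMap.put("Fare", meanFare);
Dataset<Row> trainingDF1 = df.na().fill(imputeMap); // equivalent to the fill above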

To get more detailed insights into handling missing/null values and machine learning, interested readers can take a look at Boyan Angelov's blog at https://towardsdatascience.com/working-with-missing-data-in-machine-learning-9c0a430df4ce.

For simplicity, we can drop a few more columns too, such as "PassengerId", "Name", "Ticket", and "Cabin":

Dataset<Row> trainingDF2 = trainingDF1.drop("PassengerId", "Name", "Ticket", "Cabin");

Now, here comes the tricky part. Like Spark ML-based estimators, DL4J-based networks also need the training data in numeric form. Therefore, we now have to convert the categorical features into numerics. For that, we can use a StringIndexer() transformer. We will create two of them, one each for the "Sex" and "Embarked" columns:

StringIndexer sexIndexer = new StringIndexer()
        .setInputCol("Sex")
        .setOutputCol("sexIndex")
        .setHandleInvalid("skip"); // skip rows with null values

StringIndexer embarkedIndexer = new StringIndexer()
        .setInputCol("Embarked")
        .setOutputCol("embarkedIndex")
        .setHandleInvalid("skip"); // skip rows with null values

Then we will chain them into a single pipeline. Next, we will perform the transformation operation:

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {sexIndexer, embarkedIndexer});

Then we will fit the pipeline, transform, and drop both the "Sex" and "Embarked" columns to get the transformed dataset:

Dataset<Row> trainingDF3 = pipeline.fit(trainingDF2)
        .transform(trainingDF2)
        .drop("Sex", "Embarked");
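It is worth a quick look at the indexed columns here. By default, StringIndexer assigns indices by label frequency, so the most frequent label in each column gets 0.0. A quick inspection (my addition):

trainingDF3.select("sexIndex", "embarkedIndex").show(5);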

Then our final pre-processed dataset will contain only numeric features. Note that DL4J treats the last column as the label column; that means it will consider "Pclass", "Age", "SibSp", "Parch", "Fare", "sexIndex", and "embarkedIndex" as features. Therefore, I placed the "Survived" column last:

Dataset<Row> finalDF = trainingDF3.select("Pclass", "Age", "SibSp", "Parch",
        "Fare", "sexIndex", "embarkedIndex", "Survived");
finalDF.show();

Then we randomly split the dataset into training and test sets at 70% and 30%, respectively; that is, 70% of the samples are used for training and the rest to evaluate the model:

Dataset<Row>[] splits = finalDF.randomSplit(new double[] {0.7, 0.3}); 
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
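One caveat worth noting (my note, not from the original text): randomSplit() without a seed produces a different split on every run. For a reproducible split, Spark's overload that takes a seed can be used instead:

// a fixed seed (12345L here is an arbitrary choice) makes the 70/30 split reproducible
Dataset<Row>[] splits = finalDF.randomSplit(new double[] {0.7, 0.3}, 12345L);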

Finally, we write both DataFrames out as separate CSV files to be used by DL4J:

trainingData
        .coalesce(1) // coalesce(1) writes the DataFrame as a single CSV part file
        .write()
        .format("com.databricks.spark.csv")
        .option("header", "false") // don't write the header
        .option("delimiter", ",") // comma separated
        .save("data/Titanic_Train.csv"); // Spark creates a directory with this name

testData
        .coalesce(1) // coalesce(1) writes the DataFrame as a single CSV part file
        .write()
        .format("com.databricks.spark.csv")
        .option("header", "false") // don't write the header
        .option("delimiter", ",") // comma separated
        .save("data/Titanic_Test.csv"); // Spark creates a directory with this name

Additionally, DL4J does not support headers in the training set, so I intentionally skipped writing them.
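To close the loop, here is a minimal sketch of how DL4J's DataVec API could read these files back in, assuming a Spark 2.x-style .csv part file inside the saved directory; the batch size, the label index of 7 (Survived is the eighth column), and the FileSplit extension filter are my assumptions, not code from the original text:

import java.io.File;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

int labelIndex = 7;  // "Survived" is the last of the eight columns (index 7)
int numClasses = 2;  // survived or not
int batchSize = 128; // arbitrary choice

// Spark's save() created a directory named Titanic_Train.csv; the extension
// filter picks up the .csv part file inside it. Note that initialize()
// throws IOException and InterruptedException.
RecordReader reader = new CSVRecordReader(); // no header lines to skip
reader.initialize(new FileSplit(new File("data/Titanic_Train.csv"),
        new String[] {"csv"}));

DataSetIterator trainIter =
        new RecordReaderDataSetIterator(reader, batchSize, labelIndex, numClasses);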