Phishing detection with logistic regression
In this section, we are going to build a phishing detector from scratch using logistic regression, a well-known statistical technique for binary classification (predicting one of two classes).
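To make the idea concrete, the following is a minimal sketch of how a trained logistic regression model turns a weighted sum of attributes into one of two classes. The function names, the weights, and the bias are hypothetical values chosen for illustration; a real model learns its weights from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the estimated probability reaches the threshold, else -1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= threshold else -1

# Hypothetical weights for three attributes (assumed, not learned from the dataset).
example_weights = [0.8, -0.4, 1.2]
example_bias = -0.1

label = predict_label([1, -1, 1], example_weights, example_bias)
print(label)
```

The sigmoid function squashes the weighted sum into a probability, and the threshold converts that probability into a class label.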
As in every machine learning project, we need data to feed our model. For our model, we are going to use the Phishing Websites Data Set from the UCI Machine Learning Repository. You can check it out at https://archive.ics.uci.edu/ml/datasets/Phishing+Websites.
The dataset is provided as an ARFF file. For easier manipulation, we have converted the dataset into a CSV file.
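One possible way to perform this kind of conversion is sketched below. It is not necessarily how the CSV file used in this section was produced. The helper arff_to_csv_rows is a hypothetical name, the sample text is a tiny made-up excerpt in ARFF style, and the parser assumes a simple, dense ARFF file with no quoted values.

```python
import csv
import io

def arff_to_csv_rows(arff_text):
    """Return (attribute_names, rows) parsed from a simple, dense ARFF document."""
    names, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines and ARFF comments
        lowered = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(',')])
        elif lowered.startswith('@attribute'):
            names.append(line.split()[1])
        elif lowered.startswith('@data'):
            in_data = True
    return names, rows

# Made-up three-attribute excerpt, for illustration only.
sample = """@relation phishing
@attribute having_IP_Address { -1,1 }
@attribute URL_Length { 1,0,-1 }
@attribute Result { -1,1 }
@data
-1,1,-1
1,0,1
"""

names, rows = arff_to_csv_rows(sample)

# Write only the data rows, so the result can be loaded without skipping a header.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(names)
print(csv_text)
```

In practice, the same rows could be written to a file opened with open('dataset.csv', 'w', newline='') instead of an in-memory buffer.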
As you probably noticed from the attributes, each line of the dataset is represented in the following format: {30 attributes (having_IP_Address, URL_Length, abnormal_URL, and so on)} + {1 attribute (Result)}.
For our model, we are going to import two machine learning libraries, NumPy and scikit-learn, which we already installed in Chapter 1, Introduction to Machine Learning in Pentesting.
Let's open the Python environment and load the required libraries:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.metrics import accuracy_score
Next, load the data:
>>> training_data = np.genfromtxt('dataset.csv', delimiter=',', dtype=np.int32)
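Before going further, it can help to confirm that the file was parsed as expected. The sketch below loads a tiny in-memory CSV (a made-up stand-in for dataset.csv) with the same genfromtxt arguments and inspects the result. With the real dataset, we would expect 31 columns: 30 attributes plus Result.

```python
import io
import numpy as np

# Made-up two-row sample standing in for dataset.csv.
csv_sample = io.StringIO("-1,1,1,-1\n1,0,-1,1\n")
sample_data = np.genfromtxt(csv_sample, delimiter=',', dtype=np.int32)

print(sample_data.shape)  # (rows, columns)
print(sample_data.dtype)
```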
Identify the inputs (all of the attributes, except for the last one) and the outputs (the last attribute):
>>> inputs = training_data[:,:-1]
>>> outputs = training_data[:, -1]
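The slicing above can be illustrated on a small array. The toy array below is made up for illustration and has four attributes plus a final Result column, instead of the 30 attributes in the real dataset.

```python
import numpy as np

# Toy stand-in for the dataset: 3 rows, 4 attributes plus a final Result column.
toy = np.array([
    [ 1, -1,  0,  1, -1],
    [-1,  1,  1,  0,  1],
    [ 1,  1, -1, -1, -1],
], dtype=np.int32)

toy_inputs = toy[:, :-1]   # every column except the last
toy_outputs = toy[:, -1]   # only the last column

print(toy_inputs.shape)
print(toy_outputs)
```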
In the previous chapter, we discussed the need to divide the dataset into training data and testing data:
>>> training_inputs = inputs[:2000]
>>> training_outputs = outputs[:2000]
>>> testing_inputs = inputs[2000:]
>>> testing_outputs = outputs[2000:]
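Slicing by position keeps the rows in file order. If the rows happen to be grouped by class, a shuffled split may give a more representative test set. The following is a sketch of that alternative using scikit-learn's train_test_split, which is a swapped-in technique and not the split used in this section. The demo arrays are randomly generated placeholders, and the 80/20 ratio and random_state are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Randomly generated placeholder data: 100 rows, 30 attributes, balanced labels.
rng = np.random.default_rng(0)
demo_inputs = rng.choice([-1, 0, 1], size=(100, 30))
demo_outputs = np.array([-1] * 50 + [1] * 50)

# Shuffle, then hold out 20% for testing while preserving the class balance.
train_X, test_X, train_y, test_y = train_test_split(
    demo_inputs, demo_outputs,
    test_size=0.2, random_state=42, stratify=demo_outputs)

print(train_X.shape, test_X.shape)
```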
Create the scikit-learn logistic regression classifier:
>>> classifier = LogisticRegression()
Train the classifier:
>>> classifier.fit(training_inputs, training_outputs)
Make predictions:
>>> predictions = classifier.predict(testing_inputs)
Let's print out the accuracy of our phishing detector model:
>>> accuracy = 100.0 * accuracy_score(testing_outputs, predictions)
>>> print("The accuracy of your Logistic Regression on testing data is: " + str(accuracy))
The accuracy of our model is approximately 85%, meaning that it correctly classified about 85 out of every 100 URLs in the testing data, phishing and legitimate alike. This is a reasonable result, but let's try to make an even better model with decision trees, using the same data.
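Accuracy counts every correct prediction, so it does not tell us how many phishing URLs were actually caught. The sketch below shows how a confusion matrix and recall can answer that question. The labels are made up for illustration, and the code assumes that -1 in the Result column marks a phishing URL and 1 marks a legitimate one, which should be checked against the dataset description.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels; assumption: -1 = phishing, 1 = legitimate.
true_labels = [-1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
predicted   = [-1, -1, -1,  1, 1, 1, 1, 1, 1, -1]

# Share of all URLs classified correctly.
acc = accuracy_score(true_labels, predicted)

# Share of phishing URLs that the model flagged as phishing.
phishing_recall = recall_score(true_labels, predicted, pos_label=-1)

# Rows are true classes, columns are predicted classes, ordered [-1, 1].
matrix = confusion_matrix(true_labels, predicted, labels=[-1, 1])

print(acc, phishing_recall)
print(matrix)
```

With the real model, the same calls could be applied to testing_outputs and predictions.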