![Python Machine Learning Cookbook(Second Edition)](https://wfqqreader-1252317822.image.myqcloud.com/cover/720/36698720/b_36698720.jpg)
How to do it...
Let's see how to tackle class imbalance:
- Let's import the libraries:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
import utilities
- Let's load the data (data_multivar_imbalance.txt):
input_file = 'data_multivar_imbalance.txt'
X, y = utilities.load_data(input_file)
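The `utilities` module ships with the book's code bundle and is not reproduced here. As a rough sketch, assuming `data_multivar_imbalance.txt` is a comma-separated file with the two features first and the label in the last column, a stand-in for `utilities.load_data` could look like this (the helper below is hypothetical, not the book's actual implementation):

```python
import numpy as np
import os
import tempfile

def load_data(input_file):
    """Load a comma-separated file; features in all but the last column."""
    data = np.loadtxt(input_file, delimiter=',')
    X, y = data[:, :-1], data[:, -1]
    return X, y

# Demonstrate with a tiny temporary file standing in for the real dataset
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,1\n")
    path = f.name
X, y = load_data(path)
os.remove(path)
```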
- Let's visualize the data. The code for visualization is exactly the same as it was in the previous recipe. You can also find it in the file named svm_imbalance.py, already provided to you:
# Separate the data into classes based on 'y'
class_0 = np.array([X[i] for i in range(len(X)) if y[i]==0])
class_1 = np.array([X[i] for i in range(len(X)) if y[i]==1])
# Plot the input data
plt.figure()
plt.scatter(class_0[:,0], class_0[:,1], facecolors='black', edgecolors='black', marker='s')
plt.scatter(class_1[:,0], class_1[:,1], facecolors='None', edgecolors='black', marker='s')
plt.title('Input data')
plt.show()
- If you run it, you will see the following:
![](https://epubservercos.yuewen.com/715797/19470380108815806/epubprivate/OEBPS/Images/77569d03-74a2-4fdb-a9c2-12a3e2e1870d.png?sign=1734484965-IlHo51iE0HqfYLxWi301MhOL27Ns7Rvx-0-f1a3a3b92d0c87e9cae77c3f740ed70f)
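Before correcting the imbalance, it can help to quantify it. The snippet below is a small sketch using synthetic labels (a 25/75 split chosen for illustration; the actual counts in `data_multivar_imbalance.txt` will differ):

```python
import numpy as np

# Synthetic labels standing in for the y loaded from the data file
y = np.array([0] * 25 + [1] * 75)

# Count the datapoints per class and compute the imbalance ratio
counts = np.bincount(y.astype(int))
ratio = counts.max() / counts.min()
print(counts)  # samples per class
print(ratio)   # majority-to-minority ratio
```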
- Let's build an SVM with a linear kernel. The code is the same as it was in the previous recipe, Building a nonlinear classifier using SVMs:
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=5)
params = {'kernel': 'linear'}
classifier = SVC(**params, gamma='auto')
classifier.fit(X_train, y_train)
utilities.plot_classifier(classifier, X_train, y_train, 'Training dataset')
plt.show()
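The `plot_classifier` helper is also part of the book's `utilities` module. Its core idea, presumably, is to evaluate the trained classifier on a dense grid covering the data range and draw the resulting regions; a minimal sketch of that grid-evaluation step, on synthetic data, looks like this:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic 2D training data standing in for X_train, y_train
rng = np.random.RandomState(5)
X_train = rng.randn(40, 2)
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

clf = SVC(kernel='linear', gamma='auto')
clf.fit(X_train, y_train)

# Build a 100x100 grid spanning the data range, then predict on every point;
# the predictions Z are what plt.contourf would shade as decision regions
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```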
- Let's print a classification report:
from sklearn.metrics import classification_report
# Predict on the test set first; y_test_pred is used in the report below
y_test_pred = classifier.predict(X_test)
target_names = ['Class-' + str(int(i)) for i in set(y)]
print("\n" + "#"*30)
print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, classifier.predict(X_train), target_names=target_names))
print("#"*30 + "\n")
print("#"*30)
print("\nClassification report on test dataset\n")
print(classification_report(y_test, y_test_pred, target_names=target_names))
print("#"*30 + "\n")
- If you run it, you will see the following:
![](https://epubservercos.yuewen.com/715797/19470380108815806/epubprivate/OEBPS/Images/1b10f85f-4ba2-4ac0-a432-98bbfddb5d4f.png?sign=1734484965-QEQeoUWQIxsFqXMUUJdSsXlz160Ust1x-0-688934c9e1b266f98530fba6d6d3e4c9)
- You might wonder why there's no boundary here! Well, this is because the classifier is unable to separate the two classes at all, which results in 0% precision for Class-0. You will also see a classification report printed on your Terminal, as shown in the following screenshot:
![](https://epubservercos.yuewen.com/715797/19470380108815806/epubprivate/OEBPS/Images/2cb87d6a-af82-46c4-aef3-87d7d634ec35.png?sign=1734484965-J9TPGJHq93B1iT4XUuRJ7kojbddNpVkr-0-828c02e57c62aa7e530330ce3eb4f9e9)
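To see why the precision for Class-0 comes out as zero, consider a toy example (not the book's data) in which the classifier predicts the majority class for every sample, so no prediction is ever Class-0:

```python
from sklearn.metrics import precision_score

# Toy labels: class 0 is the minority, and the classifier always predicts 1
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1]

# No sample is predicted as class 0, so its precision is 0/0;
# zero_division=0 makes scikit-learn report that as 0.0
p0 = precision_score(y_true, y_pred, pos_label=0, zero_division=0)
print(p0)  # 0.0
```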
- As we expected, Class-0 has 0% precision, so let's go ahead and fix this! In the Python file, search for the following line:
params = {'kernel': 'linear'}
- Replace the preceding line with the following:
params = {'kernel': 'linear', 'class_weight': 'balanced'}
- Setting the class_weight parameter to balanced makes the classifier count the number of datapoints in each class and adjust each class's weight inversely proportionally to its frequency, so that the imbalance doesn't adversely affect the performance.
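scikit-learn exposes this weighting scheme directly through `compute_class_weight`: the balanced weights are `n_samples / (n_classes * np.bincount(y))`, so the minority class receives the larger weight. A small sketch on synthetic labels (9 majority, 3 minority):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 9 samples of class 0, 3 of class 1
y = np.array([0] * 9 + [1] * 3)

# 'balanced' weights: n_samples / (n_classes * bincount(y))
# -> 12 / (2 * 9) = 0.667 for class 0, 12 / (2 * 3) = 2.0 for class 1
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(weights)
```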
- You will get the following output once you run this code:
![](https://epubservercos.yuewen.com/715797/19470380108815806/epubprivate/OEBPS/Images/716845e6-f486-4b75-9457-99b29dc43757.png?sign=1734484965-WyD4mwxrzF1hyUxQKcmoCH44WtNPbFjF-0-0f998a8f6aefee06eb558f8005d5f4ee)
- Let's look at the classification report:
![](https://epubservercos.yuewen.com/715797/19470380108815806/epubprivate/OEBPS/Images/b3380aa6-bae7-451e-a82c-82dba65509fe.png?sign=1734484965-EM4eWtvCM3lfi4qfJfDrRGhXh8DHXdkY-0-995150076ec88b75ebcd70a544948a03)
- As we can see, Class-0 is now detected with nonzero precision.
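After fitting, you can confirm what the balancing actually did: a fitted SVC stores the per-class multipliers in its class_weight_ attribute. The sketch below uses synthetic imbalanced data (10 minority, 90 majority samples) rather than the book's dataset:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(5)
# Imbalanced synthetic data: 10 samples of class 0, 90 of class 1
X = np.vstack([rng.randn(10, 2) + [2, 2], rng.randn(90, 2)])
y = np.array([0] * 10 + [1] * 90)

clf = SVC(kernel='linear', gamma='auto', class_weight='balanced')
clf.fit(X, y)

# class_weight_ holds the multipliers applied to C for each class;
# with 'balanced' the minority class gets the larger one:
# 100 / (2 * 10) = 5.0 for class 0, 100 / (2 * 90) ~ 0.56 for class 1
print(clf.class_weight_)
```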