Machine Learning for Cybersecurity Cookbook

How to do it…

In the following steps, we convert a corpus of text data into a numerical form that machine learning algorithms can work with:

First, load a textual dataset, reading it line by line:

with open("anonops_short.txt", encoding="utf8") as f:
    anonops_chat_logs = f.readlines()

Next, count the words (unigrams and bigrams) in the text using the hashing vectorizer, and then weight the counts using tf-idf:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Hash unigrams and bigrams into a fixed-size feature space (2**20 columns by default).
my_vector = HashingVectorizer(input="content", ngram_range=(1, 2))
X_train_counts = my_vector.fit_transform(anonops_chat_logs)
# Learn the idf weights from the hashed counts, then apply them.
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
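The same two-stage pipeline can be tried on a few in-memory strings before running it on the full log. The following is a minimal sketch, not part of the recipe: the three sentences in toy_docs are hypothetical stand-ins for chat log lines, and the toy_ names are introduced only for illustration. It relies on the default settings of both classes, under which the hashed feature space has 2**20 columns and TfidfTransformer rescales each row to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical stand-ins for chat log lines; not taken from the dataset.
toy_docs = [
    "the server is down again",
    "the attack on the server starts now",
    "join the channel for updates",
]

# The hashing vectorizer stores no vocabulary, so fit_transform learns nothing;
# it only maps each unigram and bigram to a column index via a hash function.
toy_vectorizer = HashingVectorizer(input="content", ngram_range=(1, 2))
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# tf-idf reweighting; the default norm="l2" rescales each row to unit length.
toy_tfidf = TfidfTransformer(use_idf=True).fit_transform(toy_counts)

# Euclidean length of each row of the sparse result.
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())

print(toy_tfidf.shape)
print(row_norms)
```

Because no vocabulary is kept, the column count stays fixed regardless of corpus size, at the cost of not being able to map a column back to the word that produced it.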

The end result is a sparse matrix in which each row is a vector representing one of the texts:

X_train_tf

<180830x1048576 sparse matrix of type '<class 'numpy.float64'>' with 3158166 stored elements in Compressed Sparse Row format>
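The summary line reports the shape and the number of stored elements, from which the sparsity of the matrix follows directly. As a rough sketch, the snippet below reads the same quantities from a small SciPy matrix and then repeats the arithmetic on the figures reported above. The 3 x 8 matrix toy and its values are hypothetical and serve only as a stand-in for X_train_tf.

```python
from scipy.sparse import csr_matrix

# Hypothetical 3 x 8 matrix standing in for X_train_tf; values are illustrative.
toy = csr_matrix([
    [0.0, 0.6, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0],
])

n_rows, n_cols = toy.shape             # (documents, hashed features)
stored = toy.nnz                       # number of explicitly stored values
density = stored / (n_rows * n_cols)   # fraction of cells that are stored
avg_per_row = stored / n_rows          # average stored values per document

# The same arithmetic applied to the figures reported for X_train_tf.
full_density = 3158166 / (180830 * 1048576)
full_avg_per_row = 3158166 / 180830

print(n_rows, n_cols, stored, density, avg_per_row)
print(full_density, full_avg_per_row)
```

On the reported figures, each row holds between 17 and 18 stored values on average out of 1,048,576 columns, which is why a sparse format is used rather than a dense array.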

print(X_train_tf)

The following is the output: