Machine Learning for Cybersecurity Cookbook

How to do it...

In the following steps, we curate a dataset and then use it to create a classifier to determine the file type. For demonstration purposes, we show how to obtain a collection of PowerShell scripts, Python scripts, and JavaScript files by scraping GitHub. A collection of samples obtained in this way can be found in the accompanying repository as PowerShellSamples.7z, PythonSamples.7z, and JavascriptSamples.7z. First, we will write the code for the JavaScript scraper:

  1. Begin by importing the PyGithub library in order to call the GitHub API. We also import the base64 module to decode the base64-encoded file contents that the API returns:
import os
from github import Github
import base64
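  2. We must supply our credentials, and then specify a query (in this case, for JavaScript) to select our repositories. The variable n sets how many repositories to scrape, and i counts those scraped so far:
username = "your_github_username"
# The GitHub API no longer accepts account passwords; supply a personal access token here
password = "your_password"
target_dir = "/path/to/JavascriptSamples/"
g = Github(username, password)
repositories = g.search_repositories(query='language:javascript')
n = 5  # number of repositories to scrape
i = 0  # repositories scraped so far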
  3. We loop over the repositories matching our criteria:
for repo in repositories:
    repo_name = repo.name
    target_dir_of_repo = os.path.join(target_dir, repo_name)
    print(repo_name)
    try:

  4. We create a directory for each repository matching our search criteria, and then read in its top-level contents:
        os.mkdir(target_dir_of_repo)
        i += 1
        contents = repo.get_contents("")
  5. We treat the contents as a queue: whenever we pop a directory, we append its entries to the queue, so that every file in the repository is eventually visited:
        while len(contents) > 0:
            file_content = contents.pop(0)
            if file_content.type == "dir":
                contents.extend(repo.get_contents(file_content.path))
            else:
  6. If we find a non-directory file, we check whether its extension is .js:
                filename = file_content.path
                extension = filename.split(".")[-1]
                if extension == "js":
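  7. If the extension is .js, we fetch the file, decode it, and write out a copy. Any error raised while processing a repository causes that repository to be skipped, and we stop after n repositories:
                    file_contents = repo.get_contents(file_content.path)
                    file_data = base64.b64decode(file_contents.content)
                    filename = filename.split("/")[-1]
                    with open(os.path.join(target_dir_of_repo, filename), "wb") as file_out:
                        file_out.write(file_data)
    except Exception:
        # Skip repositories that fail (for example, a folder that already exists)
        pass
    if i == n:
        break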
  8. Once finished, it is convenient to move all the JavaScript files into one folder.
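
Gathering the scraped files into one folder can also be scripted. The following is a minimal sketch under stated assumptions: flatten_samples is a hypothetical helper that is not part of the recipe, it copies files rather than moving them so that the per-repository folders stay intact, and it prefixes each file with its repository folder name so that files sharing a name (index.js, for instance) do not overwrite one another. The destination folder is assumed to lie outside the scraped folder.

```python
import os
import shutil


def flatten_samples(source_root, destination, extension=".js"):
    """Copy every file ending in `extension` under source_root into destination.

    Hypothetical helper: destination is assumed to be outside source_root.
    """
    os.makedirs(destination, exist_ok=True)
    copied = []
    for current_dir, _subdirs, filenames in os.walk(source_root):
        for name in sorted(filenames):
            if not name.endswith(extension):
                continue
            # Prefix with the containing folder (the repository name) to avoid
            # collisions between files that share a name across repositories.
            repo_name = os.path.basename(current_dir)
            target = os.path.join(destination, repo_name + "_" + name)
            shutil.copy2(os.path.join(current_dir, name), target)
            copied.append(target)
    return copied
```

For example, flatten_samples("/path/to/JavascriptSamples/", "/path/to/AllJavascript/") would collect the .js files of every scraped repository; both paths are placeholders.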

To obtain PowerShell samples, run the same code, changing the following:

target_dir = "/path/to/JavascriptSamples/"
repositories = g.search_repositories(query='language:javascript')

To the following:

target_dir = "/path/to/PowerShellSamples/"
repositories = g.search_repositories(query='language:powershell')

The extension check in the scraper must change as well, from js to ps1.

Similarly, for Python files, we use the following, with py as the extension to check for:

target_dir = "/path/to/PythonSamples/"
repositories = g.search_repositories(query='language:python')
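
Since the three scrapers differ only in the search query, the target directory, and the extension being checked, these settings can be kept in one place rather than edited by hand. The following is a minimal sketch, not the recipe's own code: SCRAPER_SETTINGS and scraper_config are hypothetical names, the base directory is a placeholder, and the ps1 and py extensions are our assumption about which files to keep for PowerShell and Python.

```python
import os

# Hypothetical per-language settings; the folder names follow the recipe,
# while the extensions for PowerShell and Python are assumptions.
SCRAPER_SETTINGS = {
    "javascript": {"extension": "js", "folder": "JavascriptSamples"},
    "powershell": {"extension": "ps1", "folder": "PowerShellSamples"},
    "python": {"extension": "py", "folder": "PythonSamples"},
}


def scraper_config(language, base_dir="/path/to"):
    """Return the query, extension, and target directory for one language."""
    settings = SCRAPER_SETTINGS[language]
    return {
        "query": "language:" + language,
        "extension": settings["extension"],
        "target_dir": os.path.join(base_dir, settings["folder"]),
    }
```

The returned values would then replace the hard-coded ones in the scraper: config["query"] is passed to g.search_repositories, config["target_dir"] takes the place of target_dir, and config["extension"] is compared against each file's extension.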