Downloading the JavaScript
There's one more step before we can point this at a site – we need to download the actual JavaScript! Before analyzing the source code using our scanjs wrapper, we need to pull it from the target page. Pulling the code once in a single, discrete process (and from a single URL) means that, even as we develop more tooling around attack-surface reconnaissance, we can hook this script up to other services: it could pull the JavaScript from a URL supplied by a crawler, it could feed JavaScript or other assets into other analysis tools, or it could analyze other page metrics.
So the simplest version of this script should do the following: take a URL, look at that page's source code to find all the JavaScript libraries it references, and then download those files to the specified location.
The first thing we need to do is grab the HTML from the URL of the page we're inspecting. Let's add some code that accepts the url and directory CLI arguments, and defines our target and where to store the downloaded JavaScript. Then, let's use the requests library to pull the data and Beautiful Soup to make the HTML string a searchable object:
#!/usr/bin/env python2.7
import os, sys
import requests
from bs4 import BeautifulSoup

# The target page and the directory to store the downloaded JavaScript
url = sys.argv[1]
directory = sys.argv[2]

# Pull the page and parse the HTML into a searchable object
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
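One practical wrinkle: the open() call later in the script will fail if the target directory doesn't exist yet, so it's worth creating it up front. A minimal guard (a small addition beyond the snippets here), using the os module we've already imported:

# Create the output directory if it isn't already there
if not os.path.isdir(directory):
    os.makedirs(directory)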
Then we need to iterate over each script tag and use its src attribute to download the file into the directory we've specified:
for script in soup.find_all('script'):
    # Only external scripts carry a src attribute; inline scripts are skipped
    if script.get('src'):
        download_script(script.get('src'))
That download_script() function might not ring a bell because we haven't written it yet. But that's what we want: a function that takes the src attribute path, builds the full link to the resource, and downloads it into the directory we've specified. (In the finished script, this definition needs to appear above the loop that calls it, since Python executes top to bottom.)
def download_script(uri):
    # Build the full address: a leading slash means a path relative to the target host
    address = url + uri if uri[0] == '/' else uri
    # Slice out everything between the last slash and the end of the js extension
    filename = address[address.rfind("/")+1:address.rfind("js")+2]
    # Fetch the script itself (not the page!) and write it to the target directory
    req = requests.get(address)
    with open(directory + '/' + filename, 'wb') as file:
        file.write(req.content)
Each line is pretty direct. After the function definition, the HTTP address of the script is built using a Python ternary: if the src attribute starts with /, it's a path relative to the target host and can just be appended onto the base URL; if it doesn't, it must already be a full, absolute link. (Protocol-relative sources that start with // would slip past this check, but it covers the common cases.) Ternaries can be funky but also powerfully expressive once you get the hang of them.
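If the ternary shape is new to you, here it is in isolation, with a hypothetical src value:

# General shape: value_if_true if condition else value_if_false
src = '/static/app.js'
kind = 'relative' if src[0] == '/' else 'absolute'   # 'relative'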
The second line of the function builds the filename by slicing the address string: address.rfind("/") finds the index of the last forward slash, address.rfind("js")+2 finds the index of the js file extension plus 2 so the slice doesn't cut off the js part, and the [begin:end] slicing syntax then creates a new string from just the characters between those indices.
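To make the slicing concrete, here's a worked example with a hypothetical address:

address = 'https://example.com/assets/app.js'
address.rfind('/') + 1     # 27, just past the last slash
address.rfind('js') + 2    # 33, one past the end of 'js'
address[27:33]             # 'app.js'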
Then, in the final lines, the script pulls the file at the assembled address using requests, creates a new file using a context manager, and writes the script's contents to directory/filename.js. With that, you have all of the JavaScript from a particular page saved inside the directory you passed in as an argument.
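Assuming the script is saved as download_js.py (a filename of our choosing), running it against a target looks like this:

python download_js.py https://www.example.com javascript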