
Get and clean up the data

You can get a CSV file of the data from https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ASE_2014_00CSA01&prodType=table. Just hit the Download button and click OK. The result is a CSV file that has lots of interesting information in it. If you open it, though, it doesn't really look like an easy-to-use data file.

A single data row looks like this:

00100000US,,United States,,00,Total for all sectors,,001,All firms,001,All firms,00,All firms,003,Equally veteran-/nonveteran-owned,319,Firms with 4 to 5 years in business,2014,12174,11571648,107722,2746052,6.3,15.3,17.8,16.4

So, we'll sanitize the data a bit before we start processing it with D3. There are many different ways you can do this. You can open the file in Excel and select the columns you want, you can use some command-line filtering utilities to get the required data, or you can even write a simple Python or R script to return the data you want. Since we're already working with JavaScript and we installed Node.js in Chapter 1, Getting Started with D3, let's write a simple script that filters our data. We won't filter too much; let's just get rid of the data we're not interested in:

We're not interested in the data for a specific industry sector, so we start by keeping only the rows whose NAICS.id is 00, which is the Total for all sectors aggregate.
Next, we'll drop the columns that aren't interesting for us. What we want are the columns that indicate gender, ethnic group, race, veteran status, and time in business, and finally, the column that contains the number of businesses.
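Before looking at the full script, the two steps can be sketched on in-memory rows, without D3 or the filesystem. This is only a minimal illustration: the column names are the ones the script below relies on, the values of sampleRow come from the sample row shown earlier (their alignment with those column names is an assumption), and sectorRow, isAllSectors, and pickColumns are hypothetical names introduced here.

```javascript
// One row as it might look after CSV parsing. Values are taken from the
// sample row shown earlier; matching them to these column names is an assumption.
var sampleRow = {
    'NAICS.id': '00',
    'NAICS.display-label': 'Total for all sectors',
    'SEX.id': '001',
    'VET_GROUP.id': '003',
    'YIBSZFI.id': '319',
    'FIRMPDEMP': '12174'
};

// Hypothetical sector-specific row that the first step should drop
var sectorRow = { 'NAICS.id': '11', 'SEX.id': '001', 'FIRMPDEMP': '1' };

// Step 1: keep only the rows that aggregate all sectors
function isAllSectors(row) {
    return row['NAICS.id'] === '00';
}

// Step 2: keep only the columns we care about, under friendlier names
function pickColumns(row) {
    return {
        sex: row['SEX.id'],
        vetGroup: row['VET_GROUP.id'],
        yearsInBusiness: row['YIBSZFI.id'],
        count: row['FIRMPDEMP']
    };
}

var cleaned = [sampleRow, sectorRow].filter(isAllSectors).map(pickColumns);
console.log(cleaned);
```

Only the first row survives the filter, and it comes out with four short property names instead of the original dotted column names.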

We use the following simple Node.js script for that:

var d3 = require('d3'); 
var fs = require('fs'); 
 
// read the data 
fs.readFile('./ASE_2014_00CSA02.csv', function (err, fileData) { 
    // stop early if the file could not be read 
    if (err) throw err; 
    var rows = d3.csvParse(fileData.toString()); 
 
    // filter out the sector specific stuff 
    var allSectors = rows.filter(function (row) { 
        return row['NAICS.id'] === '00' 
    }); 
 
    // remove unused columns, and make nice headers 
    var mapped = allSectors.map( function(el) { 
        return { 
            sex: el['SEX.id'], 
            sexLabel: el['SEX.display-label'], 
            ethnicGroup: el['ETH_GROUP.id'], 
            ethnicGroupLabel: el['ETH_GROUP.display-label'], 
            raceGroup: el['RACE_GROUP.id'], 
            raceGroupLabel: el['RACE_GROUP.display-label'], 
            vetGroup: el['VET_GROUP.id'], 
            vetGroupLabel: el['VET_GROUP.display-label'], 
            yearsInBusiness:  el['YIBSZFI.id'], 
            yearsInBusinessLabel:  el['YIBSZFI.display-label'], 
            count: el['FIRMPDEMP'] 
        } 
    }); 
 
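    // recent Node.js versions require a callback for fs.writeFile 
    fs.writeFile('./businessFiltered.csv', d3.csvFormat(mapped), function (err) { 
        if (err) throw err; 
    }); 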
});

In this script, we use the fs.readFile API of Node.js to read the downloaded file from the filesystem, and then use D3 to parse the CSV file. After parsing, we filter out the rows we don't want, and use map to convert each remaining row to a simpler object. Finally, we use the d3.csvFormat function to convert the data back to CSV and write it out with the fs.writeFile API call. To run this script yourself, navigate to the <DVD3>/src/chapter-02/data/ directory and run node ./cleanBusinesses.js. The result is a clean and easy-to-understand CSV to process in our visualization:

sex,sexLabel,ethnicGroup,ethnicGroupLabel,raceGroup,raceGroupLabel, ... 
001,All firms,001,All firms,00,All firms, ... 
001,All firms,001,All firms,00,All firms, ...

With this data, we can now very easily select specific groups to visualize by just filtering on the sex, ethnicGroup, raceGroup, and vetGroup properties.
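As a rough sketch of such a selection, the following snippet filters rows shaped like the cleaned CSV. The first row reuses values from the sample row shown earlier; the second row, its 001 veteran code, and the selectGroup helper are hypothetical and only serve the illustration. Note that values parsed from CSV are strings unless you convert them, so count is turned into a number before summing.

```javascript
// Rows shaped like the parsed businessFiltered.csv; every value is a string.
var rows = [
    // values taken from the sample row shown earlier
    { sex: '001', ethnicGroup: '001', raceGroup: '00', vetGroup: '003',
      yearsInBusiness: '319', count: '12174' },
    // hypothetical row with a different veteran status code
    { sex: '001', ethnicGroup: '001', raceGroup: '00', vetGroup: '001',
      yearsInBusiness: '319', count: '100' }
];

// Hypothetical helper: keep the rows whose properties match all given criteria
function selectGroup(data, criteria) {
    return data.filter(function (row) {
        return Object.keys(criteria).every(function (key) {
            return row[key] === criteria[key];
        });
    });
}

var selected = selectGroup(rows, { sex: '001', vetGroup: '003' });

// count is a string, so convert it before doing arithmetic
var total = selected.reduce(function (sum, row) {
    return sum + Number(row.count);
}, 0);

console.log(selected.length, total);
```

Passing the criteria as an object keeps the helper generic, so the same function can select on any combination of sex, ethnicGroup, raceGroup, and vetGroup.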