Java:Data Science Made Easy
上QQ阅读APP看书,第一时间看更新

Subsetting data

It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then create a new TreeSet object to hold the subset retrieved from the list. Next, we print out our original list:

Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; 
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new
ArrayList<>(Arrays.asList(nums)));
SortedSet<Integer> partNumsList;
out.println("Original List: " + fullNumsList.toString()
+ " " + fullNumsList.last());

The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12 and 46:

partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); 
out.println("SubSet of List: " + partNumsList.toString()
+ " " + partNumsList.size());

Our output follows:

Original List: [12, 14, 34, 44, 46, 52, 87, 123] 
SubSet of List: [12, 14, 34, 44]

Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

out.println("Original List: " + numsList.toString()); 
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList);
Set<Integer> partNumsList = fullNumsList
.stream()
.skip(5)
.collect(toCollection(TreeSet::new));
out.println("SubSet of List: " + partNumsList.toString());

When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:

Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]

At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analyzed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { 
br
.lines()
.filter(s -> !s.equals(""))
.forEach(s -> out.println(s));
} catch (IOException ex) {
// Handle exceptions
}