
Simple text cleaning
We will use the Moby Dick string shown earlier to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have inconsistent casing and extra leading or trailing spaces, and these methods make the text uniform. We also use the replaceAll method twice. The first call uses a regular expression to replace every run of digits, and of characters that are neither word nor whitespace characters, with a single space. The second call, inside a while loop, replaces back-to-back spaces with a single space until no consecutive spaces remain:
out.println(dirtyText);
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
// Collapse runs of spaces by replacing each double space ("  ") with a single one
while (dirtyText.contains("  ")) {
    dirtyText = dirtyText.replaceAll("  ", " ");
}
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely
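The steps above can be collected into one small method. The following is a minimal sketch, assuming the sample sentence from the output above as input; the class name SimpleCleanSketch and the method name clean are hypothetical and are not part of the original example:

```java
class SimpleCleanSketch {
    // Hypothetical helper: lowercase the text, replace digits and punctuation
    // with spaces, trim the ends, then collapse runs of spaces.
    static String clean(String dirtyText) {
        String text = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
        text = text.trim();
        // Each pass halves the runs of spaces; loop until no double space is left
        while (text.contains("  ")) {
            text = text.replaceAll("  ", " ");
        }
        return text;
    }

    public static void main(String[] args) {
        String sample = "Call me Ishmael. Some years ago- never mind how long precisely -";
        // prints: call me ishmael some years ago never mind how long precisely
        System.out.println(clean(sample));
    }
}
```

A single call such as replaceAll("\\s+", " ") would also collapse the runs without a loop, although it treats tabs and line breaks as whitespace too.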
Our next example produces the same result but lets the split method handle the whitespace instead of a loop. In this case, we first remove all of the numbers and other special characters. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method breaks text apart on a given delimiter. Here the delimiter is the regular expression [\\W]+, which matches one or more consecutive characters that are not word characters:
out.println(dirtyText);
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", "");
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+");
for (String clean : cleanText) {
    out.print(clean + " ");
}
This code produces the same output as shown previously.
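To make the intermediate array visible, here is a minimal sketch of the same chain wrapped in a method. The class name SplitSketch and the method name tokenize are hypothetical, and the sample sentence is taken from the truncated output shown earlier:

```java
class SplitSketch {
    // Hypothetical helper: remove digits and punctuation, normalize case,
    // trim the ends, then split on runs of non-word characters.
    static String[] tokenize(String dirtyText) {
        String stripped = dirtyText.replaceAll("[\\d[^\\w\\s]]+", "");
        return stripped.toLowerCase().trim().split("[\\W]+");
    }

    public static void main(String[] args) {
        String sample = "Call me Ishmael. Some years ago- never mind how long precisely -";
        String[] tokens = tokenize(sample);
        System.out.println(tokens.length + " tokens");
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
```

The trim call matters here: if the string began with a delimiter, split would return an empty string as the first element of the array.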
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text, but the split pattern [\\W\\d]+ also treats digits as delimiters, so no separate replaceAll step is needed. The join method joins every element of the words array and inserts a space between each pair of words:
out.println(dirtyText);
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
String cleanText = String.join(" ", words);
out.println(cleanText);
Again, this code produces the same output as shown previously. An alternative to String.join is available in Google Guava. Here is a simple implementation of the same process, this time using the Guava Joiner class:
out.println(dirtyText);
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
String cleanText = Joiner.on(" ").skipNulls().join(words);
out.println(cleanText);
The Joiner class provides additional options, such as skipNulls, which omits null elements from the result, as shown above. The output remains the same.
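To illustrate why skipping nulls can matter, the following sketch compares String.join with a null-skipping join written with the standard library only. It is not Guava's implementation; the class name JoinSketch, the helper joinSkippingNulls, and the sample array are hypothetical:

```java
class JoinSketch {
    // Hypothetical helper: join the non-null elements with a single space,
    // using only java.util streams (no Guava dependency).
    static String joinSkippingNulls(String[] words) {
        return java.util.Arrays.stream(words)
                .filter(java.util.Objects::nonNull)
                .collect(java.util.stream.Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String[] words = {"call", null, "me", "ishmael"};
        // String.join renders a null element as the text "null"
        System.out.println(String.join(" ", words));   // prints: call null me ishmael
        System.out.println(joinSkippingNulls(words));  // prints: call me ishmael
    }
}
```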