
The nitty gritty of cleaning text
Text processing is built on strings, so using a good string library is important. Unfortunately, the java.lang.String class has some limitations. To address them, you can either implement your own string functions as needed or use a third-party library.
Creating your own library can be useful, but you will basically be reinventing the wheel. It may be faster to write a short code sequence to implement some piece of functionality, but to do things right, you will also need to test it. Third-party libraries have already been tested and used on hundreds of projects, so they usually provide a more efficient way of getting text processing done.
There are several text processing APIs in addition to those found in Java. We will demonstrate two of these:
- Apache Commons: https://commons.apache.org/
- Guava: https://github.com/google/guava
Java provides considerable support for cleaning text data, including the methods of the String class. These methods are ideal for simple text cleaning and small amounts of data, but they can also be efficient with larger, more complex datasets. We will demonstrate several String class methods in a moment. Some of the most helpful String class methods are summarized in the following table:

Many text operations are simplified by the use of regular expressions. Regular expressions use standardized syntax to represent patterns in text, which can be used to locate and manipulate text matching the pattern.
A regular expression is simply a string itself. For example, the string Hello, my name is Sally can be used as a regular expression to find those exact words within a given text. This is very specific and not broadly applicable, but we can use a different regular expression to make our code more effective. Hello, my name is \\w will match any text that starts with Hello, my name is and is followed by a single word character (a letter, digit, or underscore). To match an entire name rather than just its first character, add the + quantifier: Hello, my name is \\w+.
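The difference between these patterns can be sketched with the standard java.util.regex classes. This is a minimal illustration rather than a definitive implementation; the class name RegexSketch, the helper firstMatch, and the sample sentence are hypothetical and chosen only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
    // Hypothetical helper: returns the first substring of text that the
    // regular expression matches, or null if there is no match.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Sample sentence assumed for illustration only.
        String text = "Hello, my name is Sally and I like to sail.";

        // A literal pattern matches only these exact words.
        System.out.println(firstMatch("Hello, my name is Sally", text));

        // "\\w" in Java source is the regex \w: exactly one word character.
        System.out.println(firstMatch("Hello, my name is \\w", text));

        // "\\w+" matches one or more word characters, so the whole name.
        System.out.println(firstMatch("Hello, my name is \\w+", text));
    }
}
```

Run as written, the second call stops after the S of Sally, while the first and third return the full phrase including the name.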
We will use several examples of more complex regular expressions, and some of the more useful syntax options are summarized in the following table. Note that each backslash must itself be escaped inside a Java string literal, so the regular expression \w, for example, is written as \\w in Java source code.

The size and source of text data vary widely from application to application, but the methods used to transform the data remain the same. In practice you may need to read data from a file, but for simplicity's sake, we will use a string containing the opening sentences of Herman Melville's Moby Dick for several examples within this chapter. Unless otherwise specified, the text will be assumed to be as shown next:
String dirtyText = "Call me Ishmael. Some years ago- never mind how";
dirtyText += " long precisely - having little or no money in my purse,";
dirtyText += " and nothing particular to interest me on shore, I thought";
dirtyText += " I would sail about a little and see the watery part of the world.";
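As a preview of the kind of cleaning this sample text needs, here is a minimal sketch that uses only String class methods: it lowercases the text, collapses every run of non-letter characters into a single space, and splits the result into words. The class name SimpleCleanSketch and the helpers normalize and tokenize are hypothetical, and treating everything outside a to z as a separator is an assumption that suits this English sample but not text in general.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SimpleCleanSketch {
    // Hypothetical helper: lowercases the text, replaces each run of
    // characters other than a-z with one space, and trims the ends.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^a-z]+", " ")
                   .trim();
    }

    // Hypothetical helper: splits the normalized text into words.
    static List<String> tokenize(String text) {
        String normalized = normalize(text);
        if (normalized.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(normalized.split(" "));
    }

    public static void main(String[] args) {
        String dirtyText = "Call me Ishmael. Some years ago- never mind how";
        dirtyText += " long precisely - having little or no money in my purse,";
        dirtyText += " and nothing particular to interest me on shore, I thought";
        dirtyText += " I would sail about a little and see the watery part of the world.";

        // Punctuation and the stray hyphens disappear; words are lowercase.
        System.out.println(normalize(dirtyText));
        System.out.println(tokenize(dirtyText));
    }
}
```

Because normalize collapses whole runs of separators, the sequence " - " in the sample produces one space rather than an empty token when the text is split.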