
Third-party tokenizers and libraries
Apache Commons is a collection of open source Java libraries that provide reusable code complementing the standard Java APIs. One popular class, found in Commons Lang, is StrTokenizer. It offers more control and flexibility than the standard StringTokenizer class. The following is a simple use of StrTokenizer:
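// Assumes Commons Lang 3: import org.apache.commons.lang3.text.StrTokenizer;
// Assumes: import static java.lang.System.out;
StrTokenizer tokenizer = new StrTokenizer(text);
while (tokenizer.hasNext()) {
    // Print each token followed by a single space
    out.print(tokenizer.next() + " ");
}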
This works much like StringTokenizer and, by default, splits tokens on whitespace (space, tab, newline, and form feed). Other constructors let us specify the delimiter and a quote character, which controls how quoted data is handled.
When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...
We can modify our constructor as follows:
StrTokenizer tokenizer = new StrTokenizer(text, ",");
The output for this implementation is:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.
Notice how the text is split wherever a comma appeared in the original. The delimiter can be a single char, a String as used here, or a more flexible StrMatcher object.
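For comparison, the same comma split can be sketched with the JDK's own StringTokenizer, which needs no extra library. This is only an illustrative sketch: the class name CommaTokens, the method splitOnCommas, and the shortened sample sentence are assumptions made for the example, and the explicit trim call is added here so that tokens do not keep the space that follows each comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class CommaTokens {
    // Splits text on commas with the JDK's StringTokenizer and trims each token.
    // (Hypothetical helper, not part of Commons or the original example.)
    static List<String> splitOnCommas(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, ",");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken().trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Shortened fragment of the Moby Dick sentence used above
        String text = "having little or no money in my purse, "
                + "and nothing particular to interest me on shore, "
                + "I thought I would sail about a little";
        for (String token : splitOnCommas(text)) {
            System.out.println(token);
        }
    }
}
```

Note that StringTokenizer silently skips empty tokens between adjacent delimiters, which is one of the behaviors that the more configurable third-party tokenizers let us control.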
Google Guava is an open source set of Java utility classes and methods. Its primary goal, as with many such libraries, is to relieve developers of writing basic utilities so they can focus on business logic. Two Guava tools are relevant in this chapter: the Joiner class and the Splitter class. Guava performs tokenization with the split method of its Splitter class. The following is a simple example:
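// Assumes: import com.google.common.base.Splitter;
Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults();
Iterable<String> words = simpleSplit.split(dirtyText);
for (String token : words) {
    // Print one token per line so adjacent tokens do not run together
    out.println(token);
}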
This splits the text on commas and produces output like our last example. We can change the argument of the on method to split on a character of our choosing. Notice the method chaining, which lets us omit empty strings and trim leading and trailing whitespace from each token. For these reasons, and for its other advanced capabilities, some developers consider Guava's Splitter one of the best tokenizing tools available for Java.
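To make the effect of omitEmptyStrings and trimResults concrete, the following sketch approximates the same pipeline using only JDK calls. It is not Guava's implementation: the class name SplitSketch, the method splitTrimOmitEmpty, and the sample string are assumptions made for illustration, and String.trim treats whitespace slightly differently from Guava's default matcher.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Approximates Splitter.on(',').omitEmptyStrings().trimResults()
    // with plain JDK calls. (Hypothetical helper for illustration only.)
    static List<String> splitTrimOmitEmpty(String text) {
        List<String> tokens = new ArrayList<>();
        // A limit of -1 keeps trailing empty pieces, so they are
        // filtered explicitly below rather than dropped by split itself.
        for (String piece : text.split(",", -1)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                tokens.add(trimmed);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input with stray spaces and an empty field between commas
        String dirtyText = " Call me Ishmael ,, Some years ago , ";
        for (String token : splitTrimOmitEmpty(dirtyText)) {
            System.out.println(token);
        }
    }
}
```

In this sketch each piece is trimmed before the emptiness check, so a field containing only spaces is dropped along with truly empty fields.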
LingPipe is a linguistic toolkit for language processing in Java. It provides more specialized support for text splitting through its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.