
Java core tokenizers
StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is not recommended for use in new development as the String class's split method is considered more efficient. While it does provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:
StringTokenizer tokenizer = new StringTokenizer(dirtyText," ");
while(tokenizer.hasMoreTokens()){
out.print(tokenizer.nextToken() + " ");
}
When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely...
StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StreamTokenizer or the split method. The String class split method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings and you can only specify one delimiter for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.