Go Machine Learning Projects

Tokenization

When dealing with natural language sentences, the first step is typically to tokenize the sentence. Consider this sentence: The child was learning a new word and was using it excessively. "Shan't!", she cried. We need to split it into the components that make it up. We call each component a token; hence the name of the process, tokenization. Here's one possible tokenization method, in which we do a simple strings.Split(a, " ").

Here's a simple program:

package main

import (
	"fmt"
	"strings"
)

func main() {
	a := "The child was learning a new word and was using it excessively. \"shan't!\", she cried"
	// dict is a set of the words seen so far.
	dict := make(map[string]struct{})
	words := strings.Split(a, " ")
	for _, word := range words {
		fmt.Println(word)
		dict[word] = struct{}{} // add the word to the set of words already seen before.
	}
}

This is the output we will get:

The
child
was
learning
a
new
word
and
was
using
it
excessively.
"shan't!",
she
cried

Now think about this in the context of building a dictionary of known words. Let's say we want to use the same set of English words to form a new sentence: she shan't be learning excessively. (Forgive the poor implications of the sentence.) We add it to our program and see whether its words show up in the dictionary:

package main

import (
	"fmt"
	"strings"
)

func main() {
	a := "The child was learning a new word and was using it excessively. \"shan't!\", she cried"
	dict := make(map[string]struct{})
	words := strings.Split(a, " ")
	for _, word := range words {
		dict[word] = struct{}{} // add the word to the set of words already seen before.
	}

	// Look up each word of the new sentence in the set built above.
	b := "she shan't be learning excessively."
	words = strings.Split(b, " ")
	for _, word := range words {
		_, ok := dict[word]
		fmt.Printf("Word: %v - %v\n", word, ok)
	}
}

This leads to the following result:

Word: she - true
Word: shan't - false
Word: be - false
Word: learning - true
Word: excessively. - true

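Note that shan't is reported as unseen because the dictionary only holds the token "shan't!", with its punctuation still attached, while excessively. is found only because the word happens to be followed by a period in both sentences. A superior tokenization algorithm would yield a result as follows: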

The
child
was
learning
a
new
word
and
was
using
it
excessively
.
"
sha
n't
!
"
,
she
cried

Note that the symbols and punctuation marks are now tokens in their own right. Note also that shan't is now split into two tokens: sha and n't. The word shan't is a contraction of shall and not; therefore, it is tokenized into two words. This tokenization strategy is specific to English. Another convenient feature of English is that words are separated by a boundary marker: the humble space. In languages with no word boundary markers, such as Chinese or Japanese, tokenization becomes significantly more complicated. Add to that languages such as Vietnamese, where there are markers for the boundaries of syllables but not of words, and you have a very complicated tokenizer on your hands.

The details of a good tokenization algorithm are fairly complicated, and tokenization is worthy of a book to itself, so we shan't cover it here.

The best part about the LingSpam corpus is that the tokenization has already been done. Note, however, that compound words and contractions, such as shan't in the example above, are not split into separate tokens; each is treated as a single word. For the purposes of a spam classifier, this is fine. However, when working on other types of NLP projects, the reader might want to consider better tokenization strategies.

Here is a final note about tokenization strategies: English is not a particularly regular language. Despite this, regular expressions are useful for small datasets. For this project, you may get away with the following regular expression:
const re = `([A-Z])(\.[A-Z])+\.?|\w+(-\w+)*|\$?\d+(\.\d+)?%?|\.\.\.|[][.,;"'?():-_` + "`]"