
Decoding the reviews
If you're curious (which we are), we can of course map the integers back to the exact words they correspond to, so that we can read what a review actually says. To do this, we must first back up our integer-encoded reviews. While this step is not essential, it is useful if we want to visually verify our network's predictions later on:
# Back up the integer-encoded reviews, so we can verify our network's
# predictions after vectorization
xtrain = x_train
xtest = x_test
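For context, this backup matters because the vectorization step rebinds x_train and x_test to multi-hot arrays, discarding the original integer sequences. The following is only a minimal sketch of such a step, assuming the 12,000-word vocabulary mentioned earlier; the exact implementation used in the surrounding chapter may differ:
import numpy as np

def vectorize_sequences(sequences, dimension=12000):
    # Multi-hot encode each review: position j is set to 1 if word j appears
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

# After a call such as x_train = vectorize_sequences(x_train), the original
# integer word indices are gone, which is why we keep xtrain as a copy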
Next, we need to recover the words corresponding to the integers that encode each review, as we saw earlier. The dictionary of words used to encode these reviews is included with the IMDB dataset. We simply retrieve it as the word_index variable and invert the mapping, which allows us to look up the word corresponding to each integer index:
# word_index maps words to integer indices; invert it to map indices to words
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
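As a quick sanity check (a hypothetical snippet, not part of the original listing), we can inspect a couple of entries of the inverted dictionary; index 1 should correspond to the most frequent word in the corpus:
# Hypothetical check: the lowest indices belong to the most frequent words
print(reverse_word_index.get(1))  # typically 'the'
print(reverse_word_index.get(2))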
The following function takes two arguments. The first one (n) denotes an integer referring to the nth review in a set. The second argument defines whether the nth review is taken from our training or test data. Then, it simply returns the string version of the review we specify.
This allows us to read out what a reviewer actually wrote. As we can see in our function, we need to shift the indices, which are offset by three positions. This is simply how the designers of the IMDB dataset chose to implement their encoding scheme, so it is of no practical relevance for other tasks. The offset of three positions occurs because indices 0, 1, and 2 are reserved for padding, the start of a sequence, and unknown values, respectively:
def decode_review(n, split='train'):
    # Map each integer back to its word; indices are offset by 3 because
    # 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown
    if split == 'train':
        decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in xtrain[n]])
    elif split == 'test':
        decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in xtest[n]])
    return decoded_review
Using this function, we can decode review number five from our training set, as shown in the following code. It turns out that this is a negative review, as denoted by its training label and inferred from its content. Note that the question marks simply indicate unknown values. Unknown values can occur inherently in the review (due to the use of emojis, for example) or due to the restrictions we have imposed (that is, if a word is not among the top 12,000 most frequent words in the corpus, as stated earlier):
print('Training label:',y_train[5])
decode_review(5, split='train')
Training label: 0.0
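In the same way, we can glance at a review from the test split. The following is a hypothetical usage example (the index is arbitrary, and it assumes the test labels were loaded as y_test alongside x_test):
# Hypothetical usage: decode an arbitrary review from the test split
print('Test label:', y_test[5])
print(decode_review(5, split='test'))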