
Important Python libraries
We will now discuss some of the most important Python libraries for NLP. We will delve deeper into some of these libraries in subsequent chapters.
NLTK
The Natural Language Toolkit (NLTK) is one of the most popular Python libraries for natural language processing. It was developed by Steven Bird and Edward Loper of the University of Pennsylvania. Built by academics and researchers, the library is intended to support research in NLP and comes with a suite of pedagogical resources that provide an excellent way to learn NLP. We will be using NLTK throughout this book, but first, let's explore some of its features.
However, before we do anything, we need to install the library by running the following command in the Anaconda Prompt:
pip install nltk
NLTK corpora
A corpus is a large body of text or linguistic data and is very important in NLP research, application development, and testing. NLTK allows users to access over 50 corpora and lexical resources (many of them mapped to ML-based applications). We can import any of the available corpora into our program and use NLTK functions to analyze the text in the imported corpus. More details about each corpus can be found here: http://www.nltk.org/book/ch02.html
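As a quick illustration, the following minimal sketch loads one of the bundled corpora (the Gutenberg corpus is used here purely as an example) and inspects its contents:
import nltk
nltk.download('gutenberg')  # only needed the first time
from nltk.corpus import gutenberg
print(gutenberg.fileids())                      # files bundled in the corpus
print(gutenberg.words('austen-emma.txt')[:10])  # first ten words of one text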
Text processing
As discussed previously, a key part of NLP is transforming text into mathematical objects. NLTK provides various functions that help us transform text into vectors. The most basic NLTK function for this purpose is tokenization, which splits a document into a list of units. These units can be words, individual characters, or sentences.
Refer to the following code snippet to perform tokenization using the NLTK library:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models; only needed the first time
text = "Who would have thought that computer programs would be analyzing human sentiments"
tokens = word_tokenize(text)
print(tokens)
Here's the output:
['Who', 'would', 'have', 'thought', 'that', 'computer', 'programs', 'would', 'be', 'analyzing', 'human', 'sentiments']
We have tokenized the preceding sentence using the word_tokenize() function of NLTK. Note that this does more than simply splitting the sentence by white space; punctuation marks and contractions are separated into their own tokens as well. The output is a list, which is the first step toward vectorization.
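If we need sentence-level units instead of words, NLTK also provides the sent_tokenize() function. The following minimal sketch (with an example sentence of our own) shows both functions side by side and illustrates how punctuation becomes separate tokens:
from nltk.tokenize import sent_tokenize, word_tokenize
text = "NLP is fun. It is also useful!"
print(sent_tokenize(text))  # splits the text into a list of sentences
print(word_tokenize(text))  # punctuation marks become separate tokens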
In our earlier discussion, we touched upon the computationally intensive nature of the vectorization approach due to the sheer size of the vectors. More words in a vector mean more dimensions that we need to work with. Therefore, we should strive to rationalize our vectors, and we can do that using other useful NLTK features such as stop word removal, lemmatization, and stemming.
The following is a partial list of English stop words in NLTK. Stop words are mostly connector words that do not contribute much to the meaning of the sentence:
import nltk
nltk.download('stopwords')  # only needed the first time
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)
Here's the output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Since NLTK provides us with a list of stop words, we can simply look up this list and filter out stop words from our word list:
newtokens = [word for word in tokens if word not in stopwords]
Here's the output:
['Who',
'would',
'thought',
'computer',
'programs',
'would',
'analyzing',
'human',
'sentiments']
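Notice that Who was not filtered out even though who appears in the stop word list; the list is all lowercase, so the membership check is case-sensitive. A common workaround, shown in the following minimal sketch, is to lowercase each token before looking it up:
newtokens = [word for word in tokens if word.lower() not in stopwords]
print(newtokens)
With this change, Who is removed as well.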
We can further refine our vector by using lemmatization and stemming, which are techniques used to reduce words to their root form. The rationale behind this step is that the imaginary n-dimensional space we are navigating doesn't need separate axes for a word and its inflected forms (for example, eat and eating don't need to be two separate axes). Therefore, we should reduce each inflected form to its root form. However, this approach has its critics because, in many cases, inflected word forms convey a different meaning than the root word. For example, the sentences "My manager promised me a promotion" and "He is a promising prospect" both use inflected forms of the root word promise, but in entirely different senses. Therefore, you should perform stemming and lemmatization only after considering their pros and cons.
The following code snippet shows an example of performing lemmatization using the NLTK library's WordNetLemmatizer module:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')  # WordNet data; only needed the first time
text = "Who would have thought that computer programs would be analyzing human sentiments"
tokens = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word) for word in tokens]
print(tokens)
Here's the output:
['Who', 'would', 'have', 'thought', 'that', 'computer', 'program', 'would', 'be', 'analyzing', 'human', 'sentiment']
Lemmatization is performed by looking up a word in WordNet's built-in root word map. If the word is not found, the input word is returned unchanged. However, the lemmatizer's performance here is underwhelming: it was only able to reduce programs and sentiments to their singular forms, while analyzing was left untouched. This is partly because the lemmatizer is entirely dependent on the root word mapping, and partly because lemmatize() treats every word as a noun unless it is told otherwise.
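The following minimal sketch shows the effect of passing a part of speech hint via the pos parameter, telling the lemmatizer that analyzing is a verb:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("analyzing"))            # 'analyzing' -- treated as a noun by default
print(lemmatizer.lemmatize("analyzing", pos="v"))   # 'analyze' -- reduced correctly as a verb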
Stemming is similar to lemmatization, but instead of looking up root words in a pre-built dictionary, it applies a set of rules to strip words down to their root form. For example, one rule states that any word with ing as a suffix will be reduced by removing that suffix.
The following code snippet shows an example of performing stemming using the NLTK library's PorterStemmer module:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
text = "Who would have thought that computer programs would be analyzing human sentiments"
tokens = word_tokenize(text.lower())
ps = PorterStemmer()
tokens = [ps.stem(word) for word in tokens]
print(tokens)
Here's the output:
['who', 'would', 'have', 'thought', 'that', 'comput', 'program', 'would', 'be', 'analyz', 'human', 'sentiment']
As per the preceding output, stemming was able to transform more words than lemmatization, but even this is far from perfect. In addition, you will notice that some stemmed words are not even English words. For example, analyz was derived from analyzing because the stemmer blindly applied the rule of removing ing.
The preceding examples show the challenges of reducing words correctly to their respective root forms using NLTK tools. Nevertheless, these techniques are quite popular for text preprocessing and vectorization. You can also create more sophisticated solutions by building on these basic functions to create your own lemmatizer and stemmer. In addition to these tools, NLTK has other features that are used for preprocessing, all of which we will discuss in subsequent chapters.
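As a simple example of building on these functions, the tokenization, stop word removal, and stemming steps covered earlier can be chained into a single reusable preprocessing function. The following is a minimal sketch (the function name and the exact sequence of steps are our own choices):
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(text):
    # Tokenize, lowercase, remove stop words, and stem in one pass
    stop_words = set(stopwords.words('english'))
    ps = PorterStemmer()
    tokens = word_tokenize(text.lower())
    return [ps.stem(word) for word in tokens if word not in stop_words]

print(preprocess("Who would have thought that computer programs would be analyzing human sentiments"))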
Part of speech tagging
Part of speech tagging (POS tagging) identifies the part of speech (noun, verb, adverb, and so on) of each word in a sentence. It is a crucial step for many NLP applications since, by identifying the POS of a word, we can deduce its contextual meaning. For example, the meaning of the word ground is different when it is used as a noun (The ground was sodden due to rain) than when it is used as an adjective (The restaurant's ground meat recipe is quite popular). We will get into the details of POS tagging and its applications, such as Named Entity Recognition (NER), in subsequent chapters.
Refer to the following code snippets to perform POS tagging using NLTK (note that the tagger model needs to be downloaded the first time by running nltk.download('averaged_perceptron_tagger')):
nltk.pos_tag(["your"])
Out[148]: [('your', 'PRP$')]
nltk.pos_tag(["beautiful"])
Out[149]: [('beautiful', 'NN')]
nltk.pos_tag(["eat"])
Out[150]: [('eat', 'NN')]
We can pass a word as a list to the pos_tag() function, which outputs the word and its part of speech. Note that when a word is tagged in isolation, the tagger has no surrounding context to rely on, which is why eat was tagged as a noun in the preceding example. We can generate the POS of each word in a sentence by iterating over the token list and applying the pos_tag() function individually. The following code is an example of how POS tagging can be done iteratively:
from nltk.tokenize import word_tokenize
text = "Usain Bolt is the fastest runner in the world"
tokens = word_tokenize(text)
[nltk.pos_tag([word]) for word in tokens]
Here's the output:
[[('Usain', 'NN')],
[('Bolt', 'NN')],
[('is', 'VBZ')],
[('the', 'DT')],
[('fastest', 'JJS')],
[('runner', 'NN')],
[('in', 'IN')],
[('the', 'DT')],
[('world', 'NN')]]
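Note that pos_tag() also accepts the entire token list in a single call. This gives the tagger the surrounding context to work with, which typically yields more accurate tags than tagging each word in isolation (for example, capitalized names are more likely to be recognized as proper nouns). The following is a minimal sketch of this approach:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')  # only needed the first time
text = "Usain Bolt is the fastest runner in the world"
tokens = word_tokenize(text)
print(nltk.pos_tag(tokens))  # tags the whole sentence in one call, using context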
The exhaustive list of NLTK POS tags can be accessed using the upenn_tagset() function of NLTK:
import nltk
nltk.download('tagsets') # need to download first time
nltk.help.upenn_tagset()
This prints the full list of Penn Treebank POS tags, along with a short description and examples for each tag.
Textblob
Textblob is a popular library used for sentiment analysis, part of speech tagging, translation, and so on. It is built on top of other libraries, including NLTK, and provides a simple interface, making it a must-have for NLP beginners. In this section, we would like you to dip your toes into this easy-to-use, yet versatile library. You can refer to Textblob's documentation, https://textblob.readthedocs.io/en/dev/, or visit its GitHub page, https://github.com/sloria/TextBlob, to get started with this library.
Sentiment analysis
Sentiment analysis is an important area of research within NLP that aims to analyze text and assess its sentiment. The Textblob library allows users to analyze the sentiment of a given piece of text in a very convenient way. Textblob library's documentation (https://textblob.readthedocs.io/en/dev/) is quite detailed, easy to read, and contains tutorials as well.
We can install the textblob library and download the associated corpora by running the following commands in the Anaconda Prompt:
pip install -U textblob
python -m textblob.download_corpora
Refer to the following code snippet to see how conveniently the library can be used to calculate sentiment:
from textblob import TextBlob
TextBlob("I love pizza").sentiment
Here's the output:
Sentiment(polarity=0.5, subjectivity=0.6)
Once the TextBlob class has been imported, all we need to do to calculate the sentiment is pass in the text that needs to be analyzed and access the sentiment property. This property outputs a named tuple containing a polarity score and a subjectivity score. The polarity score ranges from -1 to 1, with -1 being the most negative sentiment and 1 being the most positive. The subjectivity score ranges from 0 to 1, with a score of 0 implying that the statement is factual, whereas a score of 1 implies a highly subjective statement.
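Since the output is a named tuple, the polarity and subjectivity scores can also be read individually. The following is a minimal sketch using the same example sentence:
from textblob import TextBlob
blob = TextBlob("I love pizza")
print(blob.sentiment.polarity)      # 0.5
print(blob.sentiment.subjectivity)  # 0.6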
For the preceding statement, I love pizza, we get a polarity score of 0.5, implying a positive sentiment. The subjectivity of the preceding statement is also calculated as high, which seems correct. Let's analyze the sentiment of other sentences using Textblob:
TextBlob("The weather is excellent").sentiment
Here's the output:
Sentiment(polarity=1.0, subjectivity=1.0)
The polarity of the preceding statement was calculated as 1 due to the word excellent.
Now, let's look at an example of a highly negative statement. Here, the polarity score of -1 is due to the word terrible:
TextBlob("What a terrible thing to say").sentiment
Here's the output:
Sentiment(polarity=-1.0, subjectivity=1.0)
Note that in these examples, high polarity (in either direction) coincides with high subjectivity; strongly opinionated words such as excellent and terrible tend to be highly subjective as well.
Machine translation
Textblob uses the Google Translate API to provide a very simple interface for translating text. Simply use the translate() function to translate a given text into the desired language (from Google's catalog of languages). The to parameter of the translate() function determines the language that the text will be translated into, and the output matches what you would get from Google Translate. Note that newer releases of Textblob have deprecated this translation feature, so you may need an older version of the library to run the following example.
Here, we have translated a piece of text into three languages (French, Mandarin, and Hindi). The list of language codes can be obtained from https://cloud.google.com/translate/docs/basic/translating-text#language-params:
from textblob import TextBlob
languages = ['fr','zh-CN','hi']
for language in languages:
    print(TextBlob("Who knew translation could be fun").translate(to=language))
The output shows the sentence translated into French, Mandarin, and Hindi, respectively.
Part of speech tagging
Textblob's POS tagging functionality is built on top of NLTK's tagging function, but with some modifications. You can refer to NLTK's documentation on POS tagging for more details: https://www.nltk.org/book/ch05.html
The tags property performs POS tagging, like so:
TextBlob("The global economy is expected to grow this year").tags
Here's the output:
[('The', 'DT'),
('global', 'JJ'),
('economy', 'NN'),
('is', 'VBZ'),
('expected', 'VBN'),
('to', 'TO'),
('grow', 'VB'),
('this', 'DT'),
('year', 'NN')]
Since Textblob uses NLTK for POS tagging, the POS tags are the same as NLTK's; as shown in the previous section, the full list can be printed with nltk.help.upenn_tagset().
These are just a few popular applications of Textblob, and they demonstrate the ease of use and versatility of the library. There are many other applications of Textblob, and you are encouraged to explore them. A good place to start your Textblob journey and familiarize yourself with its other applications is the Textblob quickstart tutorial, which can be accessed at https://textblob.readthedocs.io/en/dev/quickstart.html.
VADER
Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon- and rule-based sentiment analysis tool whose accuracy has been shown to be considerably higher than that of existing lexicon-based sentiment analyzers. The model was developed by researchers at Georgia Tech, who published the methodology for building the lexicon in a very readable paper (http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf). It improves on other sentiment analyzers by accounting for colloquial terms, emoticons, slang, acronyms, and so on, which are used generously on social media. It also factors in the intensity of words rather than simply classifying them as positive or negative.
We can install VADER by running the following command in the Anaconda Prompt:
pip install vaderSentiment
The following is an example of VADER in action:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
First, we need to import the SentimentIntensityAnalyzer class from the vaderSentiment library and create an analyser object of that class. We can now pass any text to this object's polarity_scores() method and it will return the sentiment scores. Refer to the following example:
analyser.polarity_scores("This book is very good")
Here's the output:
{'neg': 0.0, 'neu': 0.556, 'pos': 0.444, 'compound': 0.4927}
Here, we can see that VADER outputs a negative score, a neutral score, a positive score, and an overall compound score. The compound score is what we are usually interested in: any score greater than 0.05 is generally treated as positive, any score less than -0.05 as negative, and anything in between as neutral:
analyser.polarity_scores("OMG! The book is so cool")
Here's the output:
{'neg': 0.0, 'neu': 0.604, 'pos': 0.396, 'compound': 0.5079}
While analyzing the preceding sentence, VADER correctly interpreted the colloquial terms (OMG and cool) and was able to quantify the excitement of the statement. The compound score is higher than that of the previous statement, which seems reasonable.
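Using the thresholds mentioned previously, we can wrap the compound score in a small convenience function that returns a sentiment label. The helper below is our own illustrative sketch, not part of the VADER library:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

def sentiment_label(text):
    # Map VADER's compound score to a label using the common 0.05 thresholds
    compound = analyser.polarity_scores(text)['compound']
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(sentiment_label("OMG! The book is so cool"))  # positive
print(sentiment_label("This book is terrible"))     # negative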