Python Natural Language Processing

Categorical or qualitative data attributes

Categorical or qualitative data attributes are as follows:

  • These data attributes are descriptive: they record qualities or labels rather than measured quantities
  • Examples include our written notes, the corpora provided by nltk, and a corpus that records different dog breeds, such as collie, shepherd, and terrier

There are two sub-types of categorical data attributes:

  • Ordinal data:
    • This type of data attribute is used to measure non-numeric concepts that have a natural order, such as satisfaction level, happiness level, and discomfort level.
    • For example, consider the following questions, each of which you answer by picking one of the given options:
      • Question 1: How do you feel today?
      • Options for Question 1:
        • Very bad
        • Bad
        • Good
        • Happy
        • Very happy
      • Now you choose one of the given options. Suppose you choose Good: nobody can convert how good you feel to a numeric value.
    • All the preceding options are non-numeric concepts with a clear order, so they fall under the category of ordinal data.
      • Question 2: How would you rate our hotel service?
      • Options for Question 2:
        • Bad
        • Average
        • Above average
        • Good
        • Great
    • Again, suppose you choose one of the given options. Each option measures your satisfaction level, and it is difficult to convert your answer to a numeric value because answers vary from person to person.
    • For instance, one person may say Good and another Above average even though both feel the same about the hotel service. In other words, the options can be ranked, but the difference between one option and the next is unknown, so you can't assign precise numerical values to this kind of data.
  • Nominal data:
    • This type of data attribute is used to record labels that don't overlap and have no inherent order.
    • Example: What is your gender? In this example, the answer is either male or female, and the answers do not overlap.
    • Take another example: What is the color of your eyes? The answer is black, brown, blue, or gray. (By the way, we are not considering the colored lenses available on the market!)
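
The distinction between the two sub-types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names FEELING_SCALE, EYE_COLORS, rank, is_higher, and same_category are hypothetical and do not come from nltk or any other library. The point it tries to show is that ordinal labels can be compared by position on their scale, whereas nominal labels support only equality and membership checks.

```python
# Ordinal scale: the ORDER of the labels is meaningful, the spacing is not.
# (Hypothetical names, taken from Question 1 above.)
FEELING_SCALE = ["Very bad", "Bad", "Good", "Happy", "Very happy"]

# Nominal categories: unordered labels, so a set is enough.
EYE_COLORS = {"black", "brown", "blue", "gray"}


def rank(label, scale):
    """Return the position of an ordinal label within its scale."""
    return scale.index(label)


def is_higher(a, b, scale):
    """Return True if label a ranks above label b on the ordinal scale."""
    return rank(a, scale) > rank(b, scale)


def same_category(a, b, categories):
    """Nominal labels can only be tested for equality, never ranked."""
    if a not in categories or b not in categories:
        raise ValueError("unknown category")
    return a == b


print(is_higher("Happy", "Good", FEELING_SCALE))    # True
print(same_category("blue", "brown", EYE_COLORS))   # False
```

Note that the rank returned for an ordinal label says only which option comes first; subtracting two ranks would not give a meaningful "distance" between feelings, for the reasons described above.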

In NLP-related applications, we will mainly deal with categorical data attributes. Deriving appropriate data points from a corpus that has categorical data attributes is therefore part of feature engineering. We will see more on this in Chapter 5, Feature Engineering and NLP Algorithms.
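
As a rough preview of what deriving data points from categorical attributes can look like, here is a minimal sketch in plain Python. It is not necessarily the approach taken in Chapter 5; it assumes two common, generic techniques: one-hot encoding for a nominal attribute and rank encoding for an ordinal one. The record and the names one_hot, ordinal_rank, EYE_COLORS, and SERVICE_SCALE are hypothetical and used only for illustration.

```python
# Hypothetical category lists; the order is fixed so that vectors line up.
EYE_COLORS = ["black", "brown", "blue", "gray"]                        # nominal
SERVICE_SCALE = ["Bad", "Average", "Above average", "Good", "Great"]   # ordinal


def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector with one slot per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def ordinal_rank(value, scale):
    """Encode an ordinal value as its position on the scale (order only)."""
    return scale.index(value)


# A made-up record that mixes both sub-types of categorical data.
record = {"eye_color": "blue", "service_rating": "Good"}

features = one_hot(record["eye_color"], EYE_COLORS) + [
    ordinal_rank(record["service_rating"], SERVICE_SCALE)
]
print(features)  # [0, 0, 1, 0, 3]
```

One-hot encoding avoids implying an order among nominal labels, while the rank keeps the order of ordinal labels; the rank should still not be read as an exact measurement.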

Some corpora contain both sub-types of categorical data.