Applied Supervised Learning with R
上QQ阅读APP看书,第一时间看更新

Reading Data from Various Data Formats

Data from digital systems is generated in various forms: browsing history on an e-commerce website, clickstream data, the purchase history of a customer, social media interactions, footfalls in a retail store, images from satellite and drones, and numerous other formats and types of data. We are living in an exciting time when technology is significantly changing lives, and enterprises are leveraging it to create their next data strategy to make better decisions.

It is not enough to be able to collect a huge amount of different types of data; we also need to leverage value out of it. A CCTV footage captured throughout a day will help the law and order teams of the government in improving the real-time surveillance of public places. The challenge remains in how we will process a large volume of heterogeneous data formats within a single system.

Transaction data in the Customer Relationship Management (CRM) application would mostly be tabular and feed in social media is mostly text, audio, video, and images.

We can categorize the data formats as structured—tabular data such as CSV and database tables; unstructured—textual data such as tweets, FB posts, and word documents; and semi-structured. Unlike textual, which is hard for machines to process and understand, semi-structured provides associated metadata, which makes it easy for computers to process it. It's popularly used with many web applications for data exchange, and JSON is an example of the semi-structured data format.

In this section, we will see how to load, process, and transform various data formats in R. Within the scope of this book, we will work with CSV, text, and JSON data.

CSV Files

CSV files are the most common type of data storage and exchange formats for structured data. R provides a method called read.csv() for reading data from a CSV file. It will read the data into a data.frame (more about it in the next section). There are many arguments that the method takes; the two required arguments are a path to the filename and sep, which specifies the character that separates the column values. The summary() method describes the six summary statistics, min, first quartile, median, mean, third quartile, and max.

In the following exercise, we'll read a CSV file and summarize its column.

Exercise 2: Reading a CSV File and Summarizing its Column

In this exercise, we will read the previously extracted CSV file and use the summary function to print the min, max, mean, median, 1st quartile, and 3rd quartile values of numeric variables and count the categories of the categorical variable.

Carry out these steps to read a CSV file and later summarize its columns:

  1. First, use the read.csv method and load the bank-full.csv into a DataFrame:

    df_bank_detail <- read.csv("bank-full.csv", sep = ';')

  2. Print the summary of the DataFrame:

    summary(df_bank_detail)

    The output is as follows:

    ## age job marital education

    ## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851

    ## 1st Qu.:33.00 management :9458 married :27214 secondary:23202

    ## Median :39.00 technician :7597 single :12790 tertiary :13301

    ## Mean :40.94 admin. :5171 unknown : 1857

    ## 3rd Qu.:48.00 services :4154

    ## Max. :95.00 retired :2264

JSON

JSON is the next most commonly used data format for sharing and storing data. It is unlike CSV files, which only deal with rows and columns of data where each has a definite number of columns. For example, in the e-commerce data of the customers, each row could be representing a customer with their information stored in separate columns. For a customer, if a column has no value, the field is stored as NULL.

JSON provides an added flexibility of having a varying number of fields for each customer. This type of flexibility relieves the developer from the burden of maintaining a schema as we have in traditional relational databases, wherein the same customer data might be spread across multiple tables to optimize for storage and querying time.

JSON is more of a key-value store type of storage, where all we care about is the keys (such as the name, age, and DOB) and their corresponding values. While this sounds flexible, proper care has to be taken, otherwise manageability might at times, go out of control. Fortunately, with the advent of big data technologies in recent days, many document stores (a subclass of the key-value store), popularly also known as NoSQL databases, are available for storing, retrieving, and processing data in such formats.

In the following exercise, the JSON file has data for cardamom (spices and condiments) cultivation district-wise in Tamil Nadu, India, for the year 2015-16. The keys include area (hectare), production (in quintals), and productivity (average yield per hectare).

The jsonlite package provides an implementation to read and convert a JSON file into DataFrame, which makes the analysis simpler. The fromJSON method reads a JSON file, and if the flatten argument in the fromJSON function is set to TRUE, it gives a DataFrame.

Exercise 3: Reading a JSON file and Storing the Data in DataFrame

In this exercise, we will read a JSON file and store the data in the DataFrame.

Perform the following steps to complete the exercise:

  1. Download the data from https://data.gov.in/catalog/area-production-productivity-spices-condiments-district-wise-tamil-nadu-year-2015-16.
  2. First, use the following command to install the packages required for the system of read the JSON file:

    install jsonlite package

    install.packages("jsonlite")

    library(jsonlite)

  3. Next, read the JSON file using the fromJSON method, as illustrated here:

    json_file <- "crop.json"

    json_data <- jsonlite::fromJSON(json_file, flatten = TRUE)

  4. The second element in the list contains the DataFrame with crop production value. Retrieve it from json_data and store as a DataFrame named crop_production:

    crop_production <- data.frame(json_data[[2]])

  5. Next, use the following command to rename the columns:

    colnames(crop_production) <- c("S.No","District","Area","Production","PTY")

  6. Now, print the top six rows using the head() function:

    head(crop_production)

    The output is as follows:

    ## S.No District Area Production PTY

    ## 1 1 Ariyalur NA NA NA

    ## 2 2 Coimbatore 808 26 0.03

    ## 3 3 Cuddalore NA NA NA

    ## 4 4 Dharmapuri NA NA NA

    ## 5 5 Dindigul 231 2 0.01

    ## 6 6 Erode NA NA NA

Text

Unstructured data is the language of the web. All the social media, blogs, web pages, and many other sources of information are textual and untidy to extract any meaningful information. An increasing amount of research work is coming out from the Natural Language Processing (NLP) field, wherein computers are becoming better in understanding not only the meaning of the word but also the context in which it's used in a sentence. The rise of computer chatbot, which responds to a human query, is the most sophisticated form of understanding textual information.

In R, we will use the tm text mining package to show how to read, process, and retrieve meaningful information from text data. We will use a small sample of the Amazon Food Review dataset in Kaggle (https://www.kaggle.com/snap/amazon-fine-food-reviews) for the exercise in this section.

In the tm package, collections of text documents are called Corpus. One implementation of Corpus in the tm package is VCorpus (volatile corpus). Volatile corpus is named after the fact that it's stored in-memory for fast processing. To check the metadata information of the VCorpus object, we can use the inspect() method. The following exercise uses the lapply method for looping through the first two reviews and casting the text as a character. You will learn more about the apply family of function in the The Apply Family of Functions section.

Exercise 4: Reading a CSV File with Text Column and Storing the Data in VCorpus

In this exercise, we will read a CSV file with the text column and store the data in VCorpus.

Perform the following steps to complete the exercise:

  1. First, let's load the text mining package from the R into the system to read the text file:

    library(tm)

  2. Now, read the first top 10 reviews from the file:

    review_top_10 <- read.csv("Reviews_Only_Top_10_Records.csv")

  3. To store the text column in VCorpus, use the following command:

    review_corpus <- VCorpus(VectorSource(review_top_10$Text))

  4. To inspect the structure of first two reviews, execute the following command:

    inspect(review_corpus[1:2])

    The output is as follows:

    ## <<VCorpus>>

    ## Metadata: corpus specific: 0, document level (indexed): 0

    ## Content: documents: 2

    ## [[1]]

    ## <<PlainTextDocument>>

    ## Metadata: 7

    ## Content: chars: 263

    ## [[2]]

    ## <<PlainTextDocument>>

    ## Metadata: 7

    ## Content: chars: 190

  5. Using lapply, cast the first review as character and print:

    lapply(review_corpus[1:2], as.character)

    ## $'1'

    ## [1] "I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most."

    ## $'2'

    ## [1] "Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as \"Jumbo\".

We will revisit the review_corpus dataset in a later section to show how to convert the unstructured textual information to structured tabular data.

Apart from CSV, Text, and JSON, there are numerous other data formats depending upon the source of data and its usage. R has a rich collection of libraries that helps with many formats. R can import not only the standard formats (apart from the previous three) such as HTML tables and XML but also formats specific to an analytical tool such as SAS and SPSS. This democratization led to a significant migration of industry experts who were earlier working in the propriety tools (costly and often found with only the large corporations) to open source analytical programming languages such as R and Python.