Practical Data Analysis
上QQ阅读APP看书,第一时间看更新

Datasource

Datasource is a term used for all the technology related to the extraction and storage of data. A datasource can be anything from a simple text file to a big database. The raw data can come from observation logs, sensors, transactions, or user's behavior.

In this section we will take a look into the most common forms for datasource and datasets.

A dataset is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data, as is shown in the following figure:

A dataset represents a physical implementation of a datasource; the common features of a dataset are as follows:

  • Dataset characteristics (such as multivariate or univariate)
  • Number of instances
  • Area (for example life, business, and so on)
  • Attribute characteristics (namely, real, categorical, and nominal)
  • Number of attributes
  • Associated tasks (such as classification or clustering)
  • Missing Values

Open data

Open data is data that can be used, re-use, and redistributed freely by anyone for any purpose. Following is a short list of repositories and databases for open data:

Tip

Other interesting sources of data come from the data mining and knowledge discovery competitions such as ACM-KDD Cup or Kaggle platform, in most cases the datasets are still available, even after the competition is closed.

Check out the ACM-KDD Cup at the link http://www.sigkdd.org/kddcup/index.php.

And Kaggle available at http://www.kaggle.com/competitions.

Text files

The text files are commonly used for storage of data, because it is easy to transform into different formats, and it is often easier to recover and continue processing the remaining contents than with other formats. Large amounts of data come in text format from logs, sensors, e-mails, and transactions. There are several formats for text files such as CSV (comma delimited), TSV (tab delimited), Extensible Markup Language (XML) and (JSON) (see the Data formats section).

Excel files

MS-Excel is probably the most used and also the most underrated data analysis tool. In fact Excel has some good points such as filtering, aggregation functions, and using Visual Basis for Application you can make Structured Query Language (SQL)—such as queries with the sheets or with an external database.

Excel provides us with some visualization tools and we can extend the analysis capabilities of Excel (Version 2010) by installing the Analysis ToolPak that includes functions for Regression, Correlation, Covariance, Fourier Analysis, and so on. For more information about the Analysis ToolPak check the link http://bit.ly/ZQKwSa.

Some Excel disadvantages are that missing values are handled inconsistently and there is no record of how an analysis was accomplished. In the case of the Analysis ToolPak, it can only work with one sheet at a time. That's why Excel is a poor choice for statistical analysis beyond the basic examples.

We can easily transform Excel files (.xls) into another text file format such as CSV, TSV, or even XML. To export the Excel sheet just go to File menu, select the option Save & Send, and in Change File Type select your preferred format such as CSV (Comma delimited).

SQL databases

A database is an organized collection of data. SQL is a database language for managing and manipulating data in Relational Database Management Systems (RDBMS). The Database Management Systems (DBMS) are responsible for maintaining the integrity and security of stored data, and for recovering information if the system fails. SQL Language is split into two subsets of instructions, the Data Definition Language (DDL) and Data Manipulation Language (DML).

The data is organized in schemas (database) and divided into tables related by logical relationships, where we can retrieve the data by making queries to the main schema, as is shown in the following screenshot:

DDL allows us to create, delete, and alter database tables. We can also define keys to specify relationships between tables, and implement constraints between database tables.

  • CREATE TABLE: This command creates a new table
  • ALTER TABLE: This command alters a table
  • DROP TABLE: This command deletes a table

DML is a language which enables users to access and manipulate data.

  • SELECT: This command retrieves data from the database
  • INSERT INTO: This command inserts new data into the database
  • UPDATE: This command modifies data in the database
  • DELETE: This command deletes data in the database

NoSQL databases

Not only SQL (NoSQL) is a term used in several technologies where the nature of the data does not require a relational model. NoSQL technologies allow working with a huge quantity of data, higher availability, scalability, and performance.

See Chapter 12, Data Processing and Aggregation with MongoDB and Chapter 13, Working with MapReduce, for extended examples of document store database MongoDB.

The most common types of NoSQL data stores are:

  • Document store: Data is stored and organized as a collection of documents. The model schema is flexible and each collection can handle any number of fields. For example, MongoDB uses a document of type BSON (binary format of JSON) and CouchDB uses a JSON document.
  • Key-value store: Data is stored as key-value pairs without a predefined schema. Values are retrieved from their keys. For example, Apache Cassandra, Dynamo, HBase, and Amazon SimpleDB.
  • Graph-based store: Data is stored in graph structures with nodes, edges, and properties using the computer science graph theory for storing and retrieving data. These kinds of databases are excellent to represent social network relationships. For example, Neo4js, InfoGrid, and Horton.

For more information about NoSQL see the following link:

http://nosql-database.org/

Multimedia

The increasing number of mobile devices makes it a priority of data analysis to acquire the ability to extract semantic information from multimedia datasources. Datasources include directly perceivable media such as audio, image, and video. Some of the applications for these kinds of datasources are as follows:

  • Content-based image retrieval
  • Content-based video retrieval
  • Movie and video classification
  • Face recognition
  • Speech recognition
  • Audio and music classification

In Chapter 5, Similarity-based Image Retrieval, we present a similarity-based image search engine using Caltech256 that is an image dataset with over 30,600 images.

Web scraping

When we want to obtain data, a good place to start is in the web. Web scraping refers to an application that processes the HTML of a web page to extract data for manipulation. Web scraping applications will simulate a person viewing a website with a browser. In the following example, we assume we want to get the current gold price from the website www.gold.org, as is shown in the following screenshot:

Then we need to inspect the Gold Spot Price element in the website, where we will find the following HTML tag:

<td class="value" id="spotpriceCellAsk">1,573.85</td>

We can observe an id, spotpriceCellAsk in the td tag; this is the element we will get with the next Python code.

Tip

For this example, we will use the library BeautifulSoup Version 4, in Linux we can install it from the system package manager, we need to open a Terminal and execute the next command:

$ apt-get install python-bs4

For windows we need to download the library from the following link:

http://crummy.com/software/BeautifulSoup/bs4/download/

To install it, just execute in the command line:

$ python setup.py install
  1. First we need to import the libraries BeautifulSoup and urllib.request
    from bs4 import BeautifulSoup
    import urllib.request
    from time import sleep
    from datetime import datetime
  2. Then we use the function getGoldPrice to retrieve the current price from the website, in order to do this we need to provide the URL to make the request and read the entire page.
    req = urllib.request.urlopen(url)
    page = req.read()
  3. Next, we use BeautifulSoup to parse the page (creating a list of all the elements of the page) and ask for the element td with the id, spotpriceCellAsk:
    scraping = BeautifulSoup(page)
    price= scraping.findAll("td",attrs={"id":"spotpriceCellAsk"})[0].text
  4. Now we return the variable price with the current gold price, this value changes every minute on the website, in this case, we want all the values in an hour, so we call the function getGoldPrice in a for loop 60 times, making the script wait 59 seconds between each call.
    for x in range(0,60):
    ...
      sleep(59)
  5. Finally, we save the result in a file goldPrice.out and include the current date time in the format HH:MM:SS (A.M. or P.M.), for example, 11:35:42PM, separated by a comma.
    with open("goldPrice.out","w") as f:
    ...
            sNow = datetime.now().strftime("%I:%M:%S%p")
            f.write("{0}, {1} \n ".format(sNow, getGoldPrice()))

The function datetime.now().strftime creates a string representing the time under the control of an explicit format string "%I:%M:%S%p", where %I represents hour as decimal number from 0 to 12, %M represents minute as a decimal number from 00 to 59, %S represents second as a decimal number from 00 to 61, and %p represent either A.M. or P.M.

A list of complete format directives can be found on the following link:

http://docs.python.org/3.2/library/datetime.html

The following is the full script:

from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime
def getGoldPrice():
    url = "http://gold.org"
    req = urllib.request.urlopen(url)
    page = req.read()
    scraping = BeautifulSoup(page)
    price= scraping.findAll("td",attrs={"id":"spotpriceCellAsk"})[0]
    .text
    return price

with open("goldPrice.out","w") as f:
    for x in range(0,60):
        sNow = datetime.now().strftime("%I:%M:%S%p")
        f.write("{0}, {1} \n ".format(sNow, getGoldPrice()))
        sleep(59)
Tip

You can download the full script (WebScraping.py) from the author's GitHub repository, which is available at https://github.com/hmcuesta/PDA_Book/tree/master/Chapter2

The output file, goldPrice.out, will look as follows:

11:35:02AM, 1481.25
11:36:03AM, 1481.26
11:37:02AM, 1481.28
11:38:04AM, 1481.25
11:39:03AM, 1481.22