
Datasource
Datasource is a term used for all the technology related to the extraction and storage of data. A datasource can be anything from a simple text file to a large database. The raw data can come from observation logs, sensors, transactions, or users' behavior.
In this section we will take a look at the most common forms of datasources and datasets.
A dataset is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data, as is shown in the following figure:

A dataset represents a physical implementation of a datasource; the common features of a dataset are as follows:
- Dataset characteristics (such as multivariate or univariate)
- Number of instances
- Area (for example life, business, and so on)
- Attribute characteristics (namely, real, categorical, and nominal)
- Number of attributes
- Associated tasks (such as classification or clustering)
- Missing values
Open data
Open data is data that can be used, re-used, and redistributed freely by anyone for any purpose. The following is a short list of repositories and databases for open data:
- Datahub is available at http://datahub.io/
- Book-Crossing Dataset is available at http://www.informatik.uni-freiburg.de/~cziegler/BX/
- World Health Organization is available at http://www.who.int/research/en/
- The World Bank is available at http://data.worldbank.org/
- NASA is available at http://data.nasa.gov/
- United States Government is available at http://www.data.gov/
- Machine Learning Datasets is available at http://bitly.com/bundles/bigmlcom/2
- Scientific Data from University of Muenster is available at http://data.uni-muenster.de/
- Hilary Mason's research-quality datasets are available at https://bitly.com/bundles/hmason/1
Tip
Other interesting sources of data come from data mining and knowledge discovery competitions such as the ACM KDD Cup or the Kaggle platform. In most cases, the datasets are still available even after the competition is closed.
Check out the ACM KDD Cup at http://www.sigkdd.org/kddcup/index.php.
Kaggle is available at http://www.kaggle.com/competitions.
Text files
Text files are commonly used for data storage because they are easy to transform into different formats, and it is often easier to recover and continue processing their contents than with other formats. Large amounts of data come in text format from logs, sensors, e-mails, and transactions. There are several text file formats, such as CSV (comma delimited), TSV (tab delimited), Extensible Markup Language (XML), and JavaScript Object Notation (JSON) (see the Data formats section).
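As a quick illustration, the following minimal Python sketch reads a CSV file with the standard csv module; the file name transactions.csv and its contents are hypothetical, and changing the delimiter argument is enough to handle TSV:
import csv

# read a comma-delimited text file row by row
with open("transactions.csv", "r") as f:
    reader = csv.reader(f)  # use csv.reader(f, delimiter="\t") for TSV
    header = next(reader)   # the first row holds the column names
    for row in reader:
        print(row)          # each row is a list of strings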
Excel files
MS Excel is probably the most used, and also the most underrated, data analysis tool. In fact, Excel has some good points such as filtering and aggregation functions, and, using Visual Basic for Applications (VBA), you can make Structured Query Language (SQL)-like queries against the sheets or against an external database.

Excel provides us with some visualization tools, and we can extend the analysis capabilities of Excel (Version 2010) by installing the Analysis ToolPak, which includes functions for Regression, Correlation, Covariance, Fourier Analysis, and so on. For more information about the Analysis ToolPak, check the link http://bit.ly/ZQKwSa.
Some Excel disadvantages are that missing values are handled inconsistently and there is no record of how an analysis was performed. Also, the Analysis ToolPak can only work with one sheet at a time. That's why Excel is a poor choice for statistical analysis beyond basic examples.
We can easily transform Excel files (.xls) into another text file format such as CSV, TSV, or even XML. To export an Excel sheet, just go to the File menu, select the Save & Send option, and under Change File Type select your preferred format, such as CSV (Comma delimited).
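If we prefer to do the conversion programmatically, the following is a minimal sketch using the pandas library (an assumption on our part, not something required by this section; pandas and an Excel reader engine such as xlrd must be installed, and the file names are hypothetical):
import pandas as pd

# read the first sheet of the workbook and write it out as comma-delimited text
df = pd.read_excel("report.xls", sheet_name=0)
df.to_csv("report.csv", index=False)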
SQL databases
A database is an organized collection of data. SQL is a database language for managing and manipulating data in Relational Database Management Systems (RDBMS). Database Management Systems (DBMS) are responsible for maintaining the integrity and security of stored data, and for recovering information if the system fails. The SQL language is split into two subsets of instructions: the Data Definition Language (DDL) and the Data Manipulation Language (DML).
The data is organized in schemas (databases) and divided into tables related by logical relationships, where we can retrieve the data by making queries to the main schema, as is shown in the following screenshot:

DDL allows us to create, delete, and alter database tables. We can also define keys to specify relationships between tables, and implement constraints between database tables:
- CREATE TABLE: This command creates a new table
- ALTER TABLE: This command alters a table
- DROP TABLE: This command deletes a table
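As a brief illustration of these DDL commands, the following sketch uses Python's built-in sqlite3 module; the database and table names are hypothetical:
import sqlite3

con = sqlite3.connect("example.db")
cur = con.cursor()
# create a new table
cur.execute("CREATE TABLE prices (id INTEGER PRIMARY KEY, metal TEXT, price REAL)")
# alter the table by adding a column
cur.execute("ALTER TABLE prices ADD COLUMN currency TEXT")
# delete the table
cur.execute("DROP TABLE prices")
con.commit()
con.close()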
DML is a language which enables users to access and manipulate data.
- SELECT: This command retrieves data from the database
- INSERT INTO: This command inserts new data into the database
- UPDATE: This command modifies data in the database
- DELETE: This command deletes data in the database
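The following sketch shows the four DML commands, again with sqlite3 and hypothetical data, this time against an in-memory database:
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE prices (metal TEXT, price REAL)")
# insert new data into the database
cur.execute("INSERT INTO prices VALUES (?, ?)", ("gold", 1573.85))
# modify data in the database
cur.execute("UPDATE prices SET price = ? WHERE metal = ?", (1574.10, "gold"))
# retrieve data from the database
for row in cur.execute("SELECT metal, price FROM prices"):
    print(row)
# delete data in the database
cur.execute("DELETE FROM prices WHERE metal = ?", ("gold",))
con.close()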
NoSQL databases
Not only SQL (NoSQL) is a term used for several technologies where the nature of the data does not require a relational model. NoSQL technologies allow working with huge quantities of data, with higher availability, scalability, and performance.
See Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce, for extended examples of the document store database MongoDB.
The most common types of NoSQL data stores are:
- Document store: Data is stored and organized as a collection of documents. The model schema is flexible and each collection can handle any number of fields. For example, MongoDB uses a document of type BSON (a binary format of JSON) and CouchDB uses a JSON document.
- Key-value store: Data is stored as key-value pairs without a predefined schema. Values are retrieved by their keys. For example, Apache Cassandra, Dynamo, HBase, and Amazon SimpleDB.
- Graph-based store: Data is stored in graph structures with nodes, edges, and properties, using graph theory for storing and retrieving data. These kinds of databases are excellent for representing social network relationships. For example, Neo4j, InfoGrid, and Horton.
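As a minimal sketch of the document store model, the following Python snippet stores and retrieves a schema-free document with the pymongo driver; it assumes a MongoDB server running locally, and the collection and field names are hypothetical:
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["test"]["prices"]
# insert a document; no schema has to be declared beforehand
collection.insert_one({"metal": "gold", "price": 1573.85})
print(collection.find_one({"metal": "gold"}))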
For more information about NoSQL databases, see the following link:
http://nosql-database.org/
Multimedia
The increasing number of mobile devices makes it a priority for data analysis to acquire the ability to extract semantic information from multimedia datasources. These datasources include directly perceivable media such as audio, images, and video. Some of the applications for these kinds of datasources are as follows:
- Content-based image retrieval
- Content-based video retrieval
- Movie and video classification
- Face recognition
- Speech recognition
- Audio and music classification
In Chapter 5, Similarity-based Image Retrieval, we present a similarity-based image search engine using Caltech256, an image dataset with over 30,600 images.
Web scraping
When we want to obtain data, a good place to start is the web. Web scraping refers to an application that processes the HTML of a web page to extract data for manipulation. Web scraping applications simulate a person viewing a website with a browser. In the following example, we assume we want to get the current gold price from the website www.gold.org, as is shown in the following screenshot:

Then we need to inspect the Gold Spot Price element on the website, where we will find the following HTML tag:
<td class="value" id="spotpriceCellAsk">1,573.85</td>
We can observe an id, spotpriceCellAsk, in the td tag; this is the element we will get with the following Python code.
Tip
For this example, we will use the BeautifulSoup library, Version 4. On Linux, we can install it from the system package manager; we need to open a Terminal and execute the following command:
$ apt-get install python-bs4
For Windows, we need to download the library from the following link:
http://crummy.com/software/BeautifulSoup/bs4/download/
To install it, just execute the following in the command line:
$ python setup.py install
- First, we need to import the required libraries: BeautifulSoup, urllib.request, sleep, and datetime:
from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime
- Then we use the function getGoldPrice to retrieve the current price from the website. In order to do this, we need to provide the URL to make the request and read the entire page:
req = urllib.request.urlopen(url)
page = req.read()
- Next, we use BeautifulSoup to parse the page (creating a tree of all the elements of the page) and ask for the td element with the id spotpriceCellAsk:
scraping = BeautifulSoup(page)
price = scraping.findAll("td", attrs={"id":"spotpriceCellAsk"})[0].text
- Now we return the variable price with the current gold price. This value changes every minute on the website; in this case, we want all the values in an hour, so we call the function getGoldPrice in a for loop 60 times, making the script wait 59 seconds between each call:
for x in range(0,60):
    ...
    sleep(59)
- Finally, we save the result in the file goldPrice.out and include the current date and time in the format HH:MM:SS (A.M. or P.M.), for example, 11:35:42PM, separated by a comma:
with open("goldPrice.out","w") as f:
    ...
    sNow = datetime.now().strftime("%I:%M:%S%p")
    f.write("{0}, {1} \n ".format(sNow, getGoldPrice()))
The function datetime.now().strftime creates a string representing the time under the control of an explicit format string, "%I:%M:%S%p", where %I represents the hour as a decimal number from 01 to 12, %M represents the minute as a decimal number from 00 to 59, %S represents the second as a decimal number from 00 to 61, and %p represents either A.M. or P.M.
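As a quick check of the format string (the printed value is, of course, illustrative):
from datetime import datetime

print(datetime.now().strftime("%I:%M:%S%p"))  # for example: 11:35:42PM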
A complete list of format directives can be found at the following link:
http://docs.python.org/3.2/library/datetime.html
The following is the full script:
from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime

def getGoldPrice():
    # request the page and read its HTML
    url = "http://gold.org"
    req = urllib.request.urlopen(url)
    page = req.read()
    # parse the page and extract the text of the td element
    # with the id spotpriceCellAsk
    scraping = BeautifulSoup(page)
    price = scraping.findAll("td", attrs={"id":"spotpriceCellAsk"})[0].text
    return price

# sample the price 60 times, writing a "time, price" line each minute
with open("goldPrice.out","w") as f:
    for x in range(0,60):
        sNow = datetime.now().strftime("%I:%M:%S%p")
        f.write("{0}, {1} \n ".format(sNow, getGoldPrice()))
        sleep(59)
Tip
You can download the full script (WebScraping.py) from the author's GitHub repository, which is available at https://github.com/hmcuesta/PDA_Book/tree/master/Chapter2
The output file, goldPrice.out, will look as follows:
11:35:02AM, 1481.25
11:36:03AM, 1481.26
11:37:02AM, 1481.28
11:38:04AM, 1481.25
11:39:03AM, 1481.22
…