
Getting started with OpenRefine
OpenRefine (formerly known as Google Refine) is a formatting tool very useful in data cleansing, data exploration, and data transformation. It is an open source web application which runs directly in your computer, skipping the problem of uploading your delicate information to an external server.
To start working with OpenRefine just run the application and open a browser in the URL available at http://127.0.0.1:3333/
.
Refer to Appendix, Setting Up the Infrastructure.
Firstly, we need to upload our data and click on Create Project. In the following screenshot, we can observe our dataset, in this case, we will use monthly sales of an alcoholic beverages company. The dataset format is an MS Excel (.xlsx
) worksheet with 160 rows.
We can download the original MS Excel file and the OpenRefine project from the author's GitHub repository available at the following URL:
https://github.com/hmcuesta/PDA_Book/tree/master/Chapter2

Text facet
Text facet is a very useful tool, similar to filter in a spreadsheet. Text facet groups unique text values into groups. This can help us to merge information and we can see values, which could be spelled in a lot of different ways.
Now we will create a text facet on the name column by clicking on that column's drop-down menu and select Facet | Text Facet. In the following screenshot we can see the column name grouped by its content. This is helpful to see the distribution of elements in the dataset. We will observe the number of choices (43 in this example) and we can sort the information by name or by count.

Clustering
We can cluster all the similar values by clicking on our text facet (refer to the previous screenshot), in this case we find: Guinness Lata DR 440ml 24x01 and Guinness Lata DR 440ml 24x01., obviously the dot in the second value is a typo. The option Cluster allows us to find this kind of dirty data easily. Now we just select the option Merge? and define New Cell Value, then we click on Merge Selected & Close as seen in the following screenshot:

We can play with the parameters of the Cluster option, such as changing the Method option from Key collision to nearest neighbor, selecting Rows in Cluster or the length variance of choices. Playing with the parameters we can find duplicate items in a data column and more complex misspells, as we can see in the following screenshot, where the values JW Black Label 750ml 12x01 and JW Bck Label 750ml 12x01 refer to the same product with a typo in the color.

Text filters
We may filter a column by using a specific text string or using a regular expression (Java's regular expressions). We will click on the option Find of the column we want to filter and then type our search string in the textbox in the left. For more information about Java's regular expressions visit the following URL:
Numeric facets
Numeric facet groups numbers into numeric range bins. You can customize numeric facets much the way you can customize text facets. For example, if the numeric values in a column are drawn from a power law distribution (refer to the first row in the following screenshot), then it's better to group them by their logs (refer to the second row in the following screenshot) using the following expression:
value.log()

Otherwise, if our values are periodic we could take the modulus by the period, to find a pattern, using the following expression:
mod(value, 6)
We can create a numeric facet from a text by taking the length of the string, using this expression.
value.length()
Transforming data
In our example, the column date uses a special date format 01.04.2013 and we want to replace . by /. Fixing this is pretty easy using a transform. We need to go to Column date | Edit Cells | Transform.
We will write a replace()
expression as follows:
replace(value,".","/")
Now just click on the button OK to apply the transformation.

Google Refine Expression Language (GREL) allows us to create complex validations. For example, in simple business logic when the column value reaches 10 units we make a discount of 5 percent, we do this with an if()
statement and some algebra:
if(value>10,value*.95,value)
Tip
Visit the link for a complete list of functions supported by the GREL available at https://code.google.com/p/google-refine/wiki/GRELFunctions.
Exporting data
We can export data from an existing OpenRefine project in several formats as follows:
- TSV
- CSV
- Excel
- HTML table
To export the file as a JSON, we need to select the option Export and Templating Export, where we can specify a JSON template as shown in the following screenshot:

Operation history
We can save all the transformations applied to our dataset just by clicking on the tab Undo/Redo and then select Extract this will show all the transformations applied to the current dataset (as shown in the following screenshot). Finally, we will copy the generated JSON and we will paste it in a text file.
To apply the transformations to another dataset we just need to open the dataset in OpenRefine, and then go to the tab Undo/Redo click on the button Apply and copy the JSON from the first project.
