Google Refine – power tool for working with messy data
Detailed Presentation can be viewed @
Google Refine Tutorial
1 Introduction
Data cleansing is the process of identifying wrong or inaccurate records in a data set and making appropriate corrections to them. It involves identifying incomplete, inaccurate, and incorrect parts of the data and then either replacing them with correct data or deleting them. Data cleansing results in data that is consistent with other standard data and is useful for performing various kinds of analysis. Errors in the data can be due to data entry mistakes by the user, failures during transmission of the data, or improper data definitions.
Google Refine is a web application, but unlike 99% of web applications, it is intended to be run on one's own machine and used by oneself. The server side maintains the state of the data (undo/redo history, long-running processes, etc.) while the client side maintains the state of the user interface (facets and their selections, view pagination, etc.). The client side makes GET and POST ajax calls to cause changes to the data and to fetch data and data-related state from the server side.
Google Refine is a powerful tool for cleansing data effectively in the browser. Its main features consist of
· Pulling data from various sources
· Cleaning the data using Transform/Clusters/Filters
· Linking to web URLs to get more useful data
· Connecting to various databases to reconcile the collected data
Some of the advantages of Google
Refine are
• Ease of use
• Works in any browser
• Extensive functionality
• Undo/Redo is simply awesome
2 Installation
Google
Refine is a desktop application in that you download it, install it, and run it
on your own computer. However, unlike most other desktop applications, it runs
as a small web server on your own computer and you point your web browser at
that web server in order to use Refine. So, think of Refine as a personal and
private web application.
Release Version
Install it as detailed below
for your operating system.
As long as Google Refine is
running, you can point your browser at http://127.0.0.1:3333/ to use it, and
you can even use it in several browser tabs and windows.
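To confirm from a script that the local server is up, a minimal sketch along the following lines may help. It only checks that something answers HTTP requests at the address given above; the helper name `is_refine_running` is made up for this example and is not part of Google Refine.

```python
import urllib.request

# The address comes from the text above; the helper name is hypothetical.
REFINE_URL = "http://127.0.0.1:3333/"


def is_refine_running(url=REFINE_URL, timeout=2.0):
    """Return True if an HTTP server answers at `url` without an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError and HTTPError (connection refused, timeout, 4xx/5xx)
        # are subclasses of OSError.
        return False
```

Calling `is_refine_running()` while Refine is running should return True. Note that the check cannot tell Refine apart from any other web server listening on the same port.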
Development Version (Advanced
users who can build from source)
If you want the latest and
greatest version, see How to get the development version.
Windows
Install: Once you have
downloaded the .zip file, uncompress it into a folder wherever you want (such
as in C:\Google-Refine).
Run: In that folder, run the
.exe file. You should see the Command window in which Google
Refine runs. By default, the Command window has a black background and
monospace text.
Shut down: When you need to
shut down Google Refine, switch to that Command window, and press Ctrl-C. Wait
until there's a message that says the shutdown is complete. That window might
close automatically, or you can close it yourself. If you get asked,
"Terminate all batch processes? Y/N", just press Y.
Mac OS X
Install: Once you have
downloaded the .dmg file, open it, and drag the Google Refine icon into the
Applications folder icon (just like you would normally install Mac
applications).
Run: To launch Google Refine,
go to the Applications folder and double click the Google Refine app. You'll
see the Google Refine app appear in your dock.
Shut down: You can switch to
the Google Refine app (clicking on its icon in the dock) and invoke its Quit
command.
Linux
Install / Run: Once you have
downloaded the tar.gz file, open a shell and type
tar xzf google-refine.tar.gz
cd google-refine
./refine
This will start Google Refine
and open your browser to its starting page.
Shut down: Press Ctrl-C in
the shell.
Running & Configuration
By default (and for security
reasons) Refine listens only for TCP requests coming from localhost (127.0.0.1).
If you want it to respond to TCP requests coming to any IP address the machine
has, run Refine like this from the command line:
./refine -i 0.0.0.0
3 Upgrading
Upgrading from 1.1 to 2.0 can be done by following the steps mentioned in the link below
4 Features
Some of the basic features of Google Refine include
- Importing: the formats currently supported (in version 2.0) include:
a) TSV, CSV, or values separated by a custom separator you specify
b) Excel (.xls, .xlsx)
c) XML, RDF as XML
d) JSON
e) Google Spreadsheets
f) RDF N3 triples
Once imported, the data is stored in Google Refine's own format, and your original data file is left undisturbed.
- Filtering
- Editing:
  - Editing cells, editing cells by Clustering
  - Editing columns, creating columns by Extending data
  - Editing rows
- Understanding expressions
- Understanding regular expressions
- Exporting
- History (undo/redo)
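To make the "custom separator" import option more concrete, here is a small Python sketch of reading delimited text with a separator of your choice. It illustrates the idea only and is not how Google Refine itself parses files; the function name and the sample rows are hypothetical.

```python
import csv
import io


def read_delimited(text, separator=","):
    """Parse delimited text with a header row into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return [dict(row) for row in reader]


# Hypothetical sample rows, not taken from the tutorial's data set.
sample = "Year|Type|Country\n1900|Drought|India\n1902|Earthquake|Guatemala\n"
rows = read_delimited(sample, separator="|")
print(rows[0]["Type"])
```

The same function handles CSV (`separator=","`) and TSV (`separator="\t"`), which is roughly why a single "separated values" importer can cover all three cases.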
5 Getting started
Let us get to know Google Refine in more detail by working through a simple project to cleanse data. The data to be used for Business Intelligence purposes is “Disasters worldwide from 1900-2008”. For a disaster to be entered into the database, it must meet at least one of the following criteria:
a) Ten (10) or more people reported killed.
b) Hundred (100) or more people reported affected.
c) Declaration of a state of emergency.
d) Call for international assistance.
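The "at least one of the following" rule can be written down as a simple check. The sketch below assumes a record with four fields; the field names are invented for this example, since the actual column names of the data set are not shown here.

```python
from dataclasses import dataclass


@dataclass
class DisasterReport:
    # Hypothetical field names; the real data set's columns may differ.
    killed: int = 0
    affected: int = 0
    emergency_declared: bool = False
    international_assistance_requested: bool = False


def qualifies(report):
    """Return True if the report meets at least one of the four criteria."""
    return (
        report.killed >= 10
        or report.affected >= 100
        or report.emergency_declared
        or report.international_assistance_requested
    )
```

For example, `qualifies(DisasterReport(killed=9, affected=99))` is False, while setting either `killed=10` or `emergency_declared=True` makes it True.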
6 My First Project
Step 1: Opening a File
Step 2: Browsing the data
Step 3: Select the File and ‘CREATE PROJECT’
Step 4: All the projects are listed
Step 5: Project data in Google Refine
7 Transformation made easy / Clustering
Step 1: Transforming the Type of Calamity data. Click on Type --> Text --> Text Facet
Step 2: The total number of rows imported is shown, along with the total number of different choices available for that column (18).
Step 3: On looking closer, we can find redundancies and duplicates in the data.
Step 4: To eliminate them, we need to create a new column via Type --> Edit Column --> Add column based on this column
Step 5: On entering the data as mentioned above, we can reduce the number of choices.
So the number of choices came down from 18 to 15. This means the redundant values were merged into a single type.
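What the text facet and the derived column do can be sketched outside Refine in a few lines of Python: count the distinct values, normalise them, and count again. The normalisation rule below (trim, collapse spaces, consistent capitalisation) is an assumption made for this example and not necessarily the one used in the steps above; the sample values are hypothetical as well.

```python
from collections import Counter


def text_facet(values):
    """Count how often each distinct value occurs, similar to a text facet."""
    return Counter(values)


def normalize(value):
    """Trim, collapse inner whitespace, and use consistent capitalisation."""
    return " ".join(value.split()).capitalize()


# Hypothetical column values; the tutorial's actual data is not reproduced here.
type_column = ["Flood", "flood ", "Drought", "Earthquake", " Drought", "Flood"]

before = text_facet(type_column)
after = text_facet(normalize(v) for v in type_column)

print(len(before), "choices before,", len(after), "choices after")
```

Values that differ only in spacing or letter case end up under the same choice, which is the same effect as the drop from 18 to 15 choices described above.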
Sometimes this is too complex to do manually. In that case, Google Refine provides a CLUSTERING option to cluster data based on different algorithms.
On selecting the “metaphone3” algorithm, we get the data as follows
The “fingerprint” algorithm, by contrast, is the strictest and safest. On selecting the Merge option, the two column values will be merged into one, with the value given in “new cell value”.
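To give a feel for why “fingerprint” is considered strict and safe, here is a rough Python sketch of a fingerprint-style key: values are grouped only when they are identical after trimming, lowercasing, removing punctuation and accents, and sorting their unique words. It approximates the general idea and should not be read as Refine's exact implementation; the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key; values with equal keys are merge candidates."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form (e.g. "é" becomes "e").
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so that word order and repeats do not matter.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values by key; keep only groups with more than one member."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Hypothetical values for illustration only.
print(cluster(["Wild Fires", "wild fires", "Fires, Wild", "Flood"]))
```

Because only values with exactly the same key are grouped, spelling variants such as a missing letter stay separate; catching those is where looser, sound-based methods like “metaphone3” come in.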
8 Reconciliation
Reconciliation goes a step further than just cleansing the data: it fetches more information about the data from a freely available online database (Freebase).
Reconcile --> Start Reconciling…
It will take a couple of minutes to connect and fetch the relevant information.
After a quick scan, the RECONCILE option suggests the most probable type (here, country), which we have to select to proceed further.
Now we can find a link in every row. Clicking that link leads to the online database.
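Conceptually, reconciling a cell means finding the best-matching entry in a reference database and attaching its identifier. The toy sketch below imitates that with a small in-memory table and fuzzy string matching; the table, the identifiers, and the function name are all made up for illustration and have nothing to do with Freebase's real data or API.

```python
import difflib

# Hypothetical stand-in for an online database: name -> identifier.
KNOWN_ENTITIES = {
    "Germany": "example:country/germany",
    "France": "example:country/france",
    "India": "example:country/india",
}


def reconcile(cell_value, cutoff=0.8):
    """Return (matched name, identifier) for the closest known entity, or None."""
    matches = difflib.get_close_matches(
        cell_value, list(KNOWN_ENTITIES), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    name = matches[0]
    return name, KNOWN_ENTITIES[name]
```

With this table, a misspelt cell such as "Germny" is matched to "Germany", while a value with no close counterpart returns None and would be left for manual review.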
9 Other Uses
Using Facebook data to find what users LIKED the most
and what they LIKED the least
Using Twitter data to find the time zone of each follower with Freebase