Clustering and Regression using WEKA

Roll No.: 10BM60097
Name: M.P. Vijaya Prabhu

Google Refine Tutorial


Google Refine – power tool for working with messy data

The detailed presentation can be viewed at Google Refine Tutorial.

1 Introduction

Data cleansing is the process of identifying wrong or inaccurate records in a data set and correcting them. It involves finding the incomplete, inaccurate, or incorrect parts of the data and then either replacing them with correct data or deleting them. The result is data that is consistent with other standard data and suitable for analysis. Errors can come from data entry mistakes by the user, failures during data transmission, or improper data definitions.
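The replace-or-delete idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not how Google Refine works internally; the function name `cleanse`, the field names, and the sample records are all hypothetical.

```python
def cleanse(records, required_fields, corrections):
    """Return (clean, rejected): corrected records and records dropped as incomplete."""
    clean, rejected = [], []
    for record in records:
        # Trim stray whitespace introduced at data entry.
        fixed = {key: value.strip() for key, value in record.items()}
        # Replace known-bad values with their corrected form.
        fixed = {key: corrections.get(value, value) for key, value in fixed.items()}
        # Set aside records that are still missing a required field.
        if any(not fixed.get(field) for field in required_fields):
            rejected.append(fixed)
        else:
            clean.append(fixed)
    return clean, rejected


# Hypothetical sample records, for illustration only.
records = [
    {"type": " Flood", "year": "1990"},
    {"type": "Fload", "year": "1991"},
    {"type": "Storm", "year": ""},
]
clean, rejected = cleanse(records, ["type", "year"], {"Fload": "Flood"})
print(clean)
print(rejected)
```

Keeping the rejected records in a separate list, rather than discarding them silently, makes it easy to review what was deleted.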
Google Refine is a web application, but unlike most web applications it is intended to run on your own machine and be used by you alone. The server side maintains the state of the data (undo/redo history, long-running processes, etc.), while the client side maintains the state of the user interface (facets and their selections, view pagination, etc.). The client side makes GET and POST Ajax calls to change the data and to fetch data and data-related state from the server side.

Google Refine is a powerful tool for cleansing data effectively through a web browser. Its main features are:
•         Pulling data from various sources
•         Cleaning the data using Transform/Clusters/Filters
•         Linking to web URLs to get more useful data
•         Connecting to various databases to reconcile the collected data




Some of the advantages of Google Refine are
•         Ease of use
•         Works in any browser
•         Extensive functionality
•         Undo/Redo is simply awesome

2 Installation


Google Refine is a desktop application in that you download it, install it, and run it on your own computer. However, unlike most other desktop applications, it runs as a small web server on your own computer and you point your web browser at that web server in order to use Refine. So, think of Refine as a personal and private web application.

Release Version
Install it as detailed below for your operating system.
As long as Google Refine is running, you can point your browser at http://127.0.0.1:3333/ to use it, and you can even use it in several browser tabs and windows.
Development Version (Advanced users who can build from source)
If you want the latest and greatest version, see How to get the development version.

Windows
Install: Once you have downloaded the .zip file, uncompress it into a folder wherever you want (such as in C:\Google-Refine).

Run: In that folder, run the .exe file. You should see the Command window in which Google Refine runs. By default, the Command window has a black background and text in a monospace font.

Shut down: When you need to shut down Google Refine, switch to that Command window, and press Ctrl-C. Wait until there's a message that says the shutdown is complete. That window might close automatically, or you can close it yourself. If you get asked, "Terminate all batch processes? Y/N", just press Y.

Mac OSX

Install: once you have downloaded the .dmg file, open it, and drag the Google Refine icon into the Applications folder icon (just like you would normally install Mac applications).

Run: to launch Google Refine, go to the Applications folder and double click the Google Refine app. You'll see the Google Refine app appear in your dock.

Shut down: You can switch to the Google Refine app (clicking on its icon in the dock) and invoke its Quit command.

Linux

Install / Run: Once you have downloaded the tar.gz file, open a shell and type

  tar xzf google-refine.tar.gz
  cd google-refine
  ./refine
This will start Google Refine and open your browser to its starting page.

Shut down: Press Ctrl-C in the shell.

Running & Configuration
By default (and for security reasons) Refine only listens for TCP requests coming from localhost (127.0.0.1). If you want it to respond to TCP requests coming to any IP address the machine has, run Refine like this from the command line:

./refine -i 0.0.0.0
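To confirm that Refine is actually up before pointing the browser at it, a quick TCP check can help. The sketch below uses only Python's standard library and assumes the default address and port mentioned earlier (127.0.0.1 and 3333); the helper name `is_listening` is made up for this example and is not part of Refine.

```python
import socket


def is_listening(host="127.0.0.1", port=3333, timeout=1.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        # create_connection raises OSError if the connection is refused or times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Refine reachable:", is_listening())
```

Note that this only shows that some process is accepting connections on that port; it does not verify that the process is Refine.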



3 Upgrading

Upgrading from version 1.1 to 2.0 can be done by following the steps mentioned in the link below

4 Features



Some of the basic features of Google Refine include
  • Importing
The formats currently supported (in version 2.0) include:
a)     TSV, CSV, or values separated by a custom separator you specify
b)    Excel (.xls, .xlsx)
c)     XML, RDF as XML
d)    JSON
e)     Google Spreadsheets
f)     RDF N3 triples

Once imported, the data is stored in Google Refine's own format, and your original data file is left undisturbed.

  • Filtering
  • Editing:
    • Editing cells, editing cells by Clustering
    • Editing columns, creating columns by Extending data
    • Editing rows
    • Understanding expressions
    • Understanding regular expressions
  • Exporting
  • History (undo/redo)
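As an illustration of the importing feature listed above, here is a rough Python sketch of reading values separated by a custom separator. It is not Refine's importer; the function name `read_rows` and the sample text are hypothetical. Like Refine, it works on a parsed copy and leaves the original text untouched.

```python
import csv
import io


def read_rows(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical pipe-separated sample, for illustration only.
sample = "type|year\nFlood|1990\nStorm|1991\n"
rows = read_rows(sample, delimiter="|")
print(rows)
```

Passing `delimiter="\t"` handles TSV input in the same way.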

5 Getting started

Let us explore Google Refine in more detail by working through a simple data cleansing project. The data set, intended for Business Intelligence use, is “Disasters worldwide from 1900-2008”. For a disaster to be entered into the database, it must meet at least one of the following criteria:

a) Ten (10) or more people reported killed.

b) Hundred (100) or more people reported affected.

c) Declaration of a state of emergency.

d) Call for international assistance.
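The four criteria above amount to a simple "at least one of" check. A minimal sketch in Python follows; the function and parameter names are assumptions made for this example and do not come from the data set.

```python
def qualifies(killed=0, affected=0, emergency_declared=False, assistance_requested=False):
    """Return True if a disaster meets at least one of the four entry criteria."""
    return (
        killed >= 10                # a) ten or more people reported killed
        or affected >= 100          # b) hundred or more people reported affected
        or emergency_declared       # c) declaration of a state of emergency
        or assistance_requested     # d) call for international assistance
    )


print(qualifies(killed=10))
print(qualifies(killed=9, affected=99))
```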

6 My First Project

Step  1 : Opening a File

Step 2 : Browsing the data





Step 3: Select the File and ‘CREATE PROJECT’

Step 4 : All the projects are listed

 


Step 5 : Project data in Google Refine

7 Transformation made easy / Clustering


Step 1: Transforming the Type of Calamity data. Click on Type --> Facet --> Text facet


Step 2: The total number of rows imported is shown, along with the number of distinct choices available for that column (18).

Step 3: On looking closer, we can find redundancies and duplicates in the data.
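Conceptually, a text facet is a count of each distinct value in a column. The Python sketch below shows the idea on a handful of made-up values (they are not taken from the disasters data set), which also makes the kind of redundancy mentioned in Step 3 visible.

```python
from collections import Counter


def text_facet(values):
    """Count how often each distinct value occurs, as a text facet does."""
    return Counter(values)


# Hypothetical column values; note the case and trailing-space variants.
types = ["Flood", "flood", "Storm", "Flood ", "Storm"]
facet = text_facet(types)
for value, count in facet.most_common():
    print(repr(value), count)
print("choices:", len(facet))
```

Printing with `repr` makes invisible differences such as a trailing space easy to spot.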

Step 4: To eliminate them, we create a new column: Type --> Edit column --> Add column based on this column

Step 5: On entering the details as shown above, we can reduce the number of choices.

So the number of choices came down from 18 to 15. This means the redundant values have been merged into a single type.
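The effect of adding a column based on an existing one can be sketched in Python as follows. This is only an analogy for what the dialog does; the helper names, the normalisation rule, and the sample rows are assumptions made for illustration, and the real transformation used in the project may differ.

```python
def add_column(rows, source, target, transform):
    """Return new rows with a target column derived from the source column."""
    # The input rows are left unmodified; each output row is a fresh dict.
    return [{**row, target: transform(row[source])} for row in rows]


def normalise(value):
    """Collapse extra whitespace and apply title case."""
    return " ".join(value.split()).title()


# Hypothetical rows with redundant spellings of the same type.
rows = [{"Type": "Flood"}, {"Type": "flood"}, {"Type": "Flood "}, {"Type": "Storm"}]
cleaned = add_column(rows, "Type", "Type clean", normalise)

before = len({row["Type"] for row in rows})
after = len({row["Type clean"] for row in cleaned})
print(before, "->", after)
```

Writing the result to a new column keeps the original values available for comparison, which mirrors the approach taken in Step 4.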

Sometimes this is too complex to do manually. In that case, Google Refine provides a CLUSTERING option that groups similar values using different algorithms. Two of them are shown below.

On selecting the “metaphone3” algorithm, we get the data as follows:

The “fingerprint” algorithm, by contrast, is the strictest and safest. On selecting the Merge option, the two values are merged into one, using the value given in “new cell value”.
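To give a feel for why fingerprint clustering is considered strict, here is a simplified Python approximation of the idea: each value is reduced to a key (trimmed, lowercased, punctuation removed, tokens sorted and de-duplicated), and values sharing a key are grouped. This is a sketch of the general technique, not Refine's actual implementation, and the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key so that trivially different spellings collide."""
    # Reduce accented characters to their closest ASCII form.
    ascii_value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase, and strip punctuation.
    cleaned = ascii_value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, drop duplicates, sort, and rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by fingerprint; keep only groups with more than one distinct value."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical values, for illustration only.
print(cluster(["Tropical Storm", "storm, tropical", "Flood"]))
```

Because two values only cluster when their keys match exactly, this approach rarely merges values that are genuinely different, at the cost of missing misspellings that a phonetic method might catch.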


8 Reconciliation

Reconciliation goes a step beyond cleansing the data: it retrieves more information about the data from a freely available online database (Freebase).
Reconcile --> Start reconciling…

It will take a couple of minutes to get connected and to get relevant information.

After a quick look at the data, the RECONCILE option suggests the most probable type (here, country), which we select to proceed further.


Now there is a link in every row. Clicking that link leads to the online database.





9 Other Uses

Using Facebook data to find what they LIKED

And what they LIKED the least

TWITTER DATA to find Time Zones of each follower using Freebase