
Cleaning Data with Open Refine

Download the latest stable version from http://openrefine.org/download.html and follow the instructions to install it. (It may install as Google Refine.)

Double-click the icon to start the program. It opens in a browser window, but you do not need to be online to use it.

Import the data file you want to clean by clicking the “browse” button in the This Computer tab. Select the file, and then back in the browser window click Next. Note: it is much more efficient to clean multiple files at the same time (e.g., multiple chapters of a biblical book). You can select more than one file in the browsing window if they are in the same folder.

Check the settings when you see a preview in the new window. Data should be parsed as Excel. When the preview looks correct, click the Create Project button in the upper right. If you are opening more than one file, you will first get a window about your file formats and the multiple files you selected. In that case, click the Parsing button in the upper right, which will take you to the data preview window, and then click Create Project.

You should now see your file with all the columns properly labeled. If you’re working with more than one file, Refine will add a column for the name of the original file.
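As a rough illustration of what that extra column amounts to, here is a minimal Python sketch that merges several tables and records the source file of each row. It is not how Refine itself is implemented; the function name, the file names, and the column name "File" are assumptions made for the example.

```python
def combine_files(tables, source_column="File"):
    """Merge several tables into one list of rows, recording each row's source.

    `tables` maps a file name to that file's rows (each row is a dict).
    `source_column` is a hypothetical name for the added column.
    """
    combined = []
    for file_name, rows in tables.items():
        for row in rows:
            new_row = {source_column: file_name}
            new_row.update(row)  # original columns follow the source column
            combined.append(new_row)
    return combined


# Hypothetical file names and placeholder cell values.
tables = {
    "chapter01.xlsx": [{"norm": "word1"}, {"norm": "word2"}],
    "chapter02.xlsx": [{"norm": "word1"}],
}
for row in combine_files(tables):
    print(row)
```

Keeping the file name on every row is what makes it possible to trace a problem found in the combined project back to the spreadsheet it came from.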

To clean up the normalized text (to be sure the normalizer normalized everything consistently), click on the menu arrow next to “norm.” Select Facet > Text facet.

In the left column, a list of all the contents of the norm cells will appear in alphabetical order. The number after each word is the number of times that piece of data appears in the norm column.

If you see data that does not look correct, click on it. Refine will bring up the row(s) in which that data appears. This is how you would find, for example, an item that was not tokenized properly.
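Conceptually, a text facet is a count of each distinct value in a column. The sketch below shows that idea in Python under simple assumptions: rows are dictionaries, the column is called "norm", and the function name is made up for the example. The sort uses Python's default string ordering, which may not match Refine's ordering exactly.

```python
from collections import Counter


def text_facet(rows, column="norm"):
    """Return (value, count) pairs for one column, sorted by value."""
    counts = Counter(row[column] for row in rows)
    return sorted(counts.items())


# Placeholder values standing in for the contents of the norm column.
rows = [{"norm": "beta"}, {"norm": "alpha"}, {"norm": "beta"}]
for value, count in text_facet(rows):
    print(value, count)
```

Values that appear only once or twice are often the ones worth checking first, since a normalization or tokenization error tends to produce a rare, odd-looking entry.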

You cannot create new rows in Google Refine. When you find data you will need to fix, the easiest thing to do is to flag it.

Do this for any data you will need to fix later.

When you are done checking the norm layer and flagging rows that need attention, close the norm filter in the left menu. (Click the little x next to norm.)

In the All column, click on the menu arrow and select Facet > Facet by flag. On the left you will see a Flagged Rows menu with the choices “false” and “true.” Click “true” to see all the rows you have flagged.
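The Flagged Rows menu can be thought of as a count of flagged and unflagged rows plus a filter on one of those two choices. The following Python sketch illustrates this; the "flagged" key and both function names are assumptions for the example, not Refine's internal representation.

```python
def flag_facet(rows, flag_key="flagged"):
    """Count rows under the two choices shown in the Flagged Rows menu."""
    counts = {"true": 0, "false": 0}
    for row in rows:
        counts["true" if row.get(flag_key) else "false"] += 1
    return counts


def select_flagged(rows, flag_key="flagged"):
    """Keep only the rows that were flagged (the effect of clicking "true")."""
    return [row for row in rows if row.get(flag_key)]


# Placeholder rows; a missing "flagged" key is treated as not flagged.
rows = [
    {"norm": "word1", "flagged": True},
    {"norm": "word2", "flagged": False},
    {"norm": "word3"},
]
print(flag_facet(rows))
print(select_flagged(rows))
```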

Use this list to correct the data in the ORIGINAL spreadsheet (not the one in Refine). You may need to add rows.

At this point, you may need to also add the morph layer to deal with compounds. You can easily see if they have been tokenized properly and where they are in Refine as you scroll through the list. Keep an eye out for ⲙⲛⲧ, ⲣⲉϥ, ⲁⲧ, ⲣⲙ/ⲣⲙⲛ, ⲣ, and compounds. See section 4.4 of http://coptic.pacific.edu/download/tools/SCRIPTORIUMDiplTranscriptionGuidelines.pdf.

Note: the entire word will need to be tagged for part of speech (for example, ⲙⲛⲧⲁⲧⲥⲱⲧⲙ is tagged as N). The individual morphs, however, will be tagged for language of origin (if a compound word has one Greek morph, for example).
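If the norm values have been exported as a list, a short script can help build a worklist of words that begin with one of the prefixes named above. This is only a sketch for finding candidates to review by hand: the names are made up for the example, the bare ⲣ prefix is left out on the assumption that it would match too many ordinary words, and a word that starts with one of these strings is not necessarily a compound.

```python
# Prefixes taken from the list above; the single-letter prefix is omitted
# here (an assumption) because it would match too many unrelated words.
COMPOUND_PREFIXES = ("ⲙⲛⲧ", "ⲣⲉϥ", "ⲁⲧ", "ⲣⲙⲛ", "ⲣⲙ")


def compound_candidates(norm_values, prefixes=COMPOUND_PREFIXES):
    """Return the distinct values that start with any listed prefix, sorted."""
    return sorted({value for value in norm_values if value.startswith(prefixes)})


# The example word comes from the note above; the result still needs manual review.
for word in compound_candidates(["ⲙⲛⲧⲁⲧⲥⲱⲧⲙ", "ⲙⲛⲧⲁⲧⲥⲱⲧⲙ"]):
    print(word)
```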

Here is a video example: http://youtu.be/IWu7vggwnOY

google_refine.txt · Last modified: 2015/09/09 08:51 by ctschroeder