OpenRefine: Reconcile, Extend, & Publish

Training on OpenRefine continues with instruction from expert Tom Morris. Over two weeks we have learned how to import files in a variety of formats, and explored the tools and techniques for reconciling names and other entries in our data. Reconciling is the process of matching elements in a data set to remote linked data sources.

We start the lesson by uploading a Drupal file and using the fixed-width option to format the data as a table. Selecting Columnize by key/value creates a column for each tag. We then use OpenRefine's reconciliation function to find matching names for our authors through the Freebase reconciliation service at http://reconcile.freebaseapps.com/reconcile. To increase the likelihood of a match, we can make use of the death date in our data and search in the namespace "Deceased Person". Before matching, the date of death needs to be converted to text and normalized: transforming the column to text and applying the expression value.toString().split('.')[0] turns 1887.0 into 1887.
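For anyone who wants to try the same normalization outside OpenRefine, here is a rough Python equivalent of that GREL expression (the function name is just for illustration):

    # Rough Python equivalent of value.toString().split('.')[0]:
    # turn a numeric death date such as 1887.0 into the plain text year "1887".
    def normalize_year(value):
        return str(value).split(".")[0]

    print(normalize_year(1887.0))  # prints 1887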

Reconciliation finds matching names, as well as names with slight variations, and displays them next to the data in the author column. More information is available by hovering or clicking to see the full entry, and with one click we can change all instances of a name. What is the reason for reconciliation? Once a name is matched, we can obtain its URI (or RDF identifier), which connects our data to the linked data universe and lets us enrich it with additional content!
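OpenRefine makes these reconciliation calls for us, but for the curious, here is a rough Python sketch of what a single query looks like. It assumes the requests library, a made-up author name, and an assumed Freebase type id; the Freebase service has since been retired, so treat it purely as an illustration of the request and response shape.

    import json
    import requests  # third-party HTTP library, assumed to be installed

    # One query against a reconciliation endpoint, roughly what OpenRefine
    # sends behind the scenes. The endpoint comes from the lesson; the name
    # and type id are illustrative assumptions.
    ENDPOINT = "http://reconcile.freebaseapps.com/reconcile"
    queries = {
        "q0": {
            "query": "Charles Dickens",          # hypothetical author name
            "type": "/people/deceased_person",   # assumed id for "Deceased Person"
            "limit": 3,
        }
    }

    resp = requests.get(ENDPOINT, params={"queries": json.dumps(queries)})
    for candidate in resp.json()["q0"]["result"]:
        # each candidate carries an identifier, a display name, and a match score
        print(candidate["id"], candidate["name"], candidate["score"])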

In the second half of class we test our knowledge of OpenRefine transformation techniques by working on Christine Eslao's spreadsheet of MARC holdings data (field 852) where the first indicator is blank. Many fields clearly contain LC-type call numbers and should be coded with first indicator 0. But how can we use OpenRefine to identify strings in LC call number format and distinguish them from other call number schemes?

The answer is… regular expressions. As defined in Wikipedia, a regular expression is "a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, i.e., like find and replace-type operations." As Tom shows us, regular expressions can get very complicated, but there are many sites that can help, such as RegexPal (http://regexpal.com), an online regular expression tester that you can use to highlight the matches of a regex against your data.
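As a tiny taste of the "find and replace" idea, here is a Python sketch (the sample string is made up):

    import re

    # Use a search pattern to find and replace: collapse any run of
    # whitespace into a single space.
    messy = "MARC   holdings\tdata"
    print(re.sub(r"\s+", " ", messy))  # prints: MARC holdings data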

We find a regex for LC call numbers at http://faculty.washington.edu/murata/sortcall and apply it to our call number data using a text filter. The filter gives us only those text strings that match the pattern. Further refining the regex (to \s*([A-XZ]{1,3})\s*(\d+\.*\d*)(.*)/i) gives us call numbers that begin with one to three capital letters (except Y), followed by digits.
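The same kind of filter can be sketched in Python for anyone working outside OpenRefine; the pattern mirrors the regex above, and the sample call numbers are invented for illustration.

    import re

    # Keep only strings that look like LC call numbers: one to three letters
    # (A-X or Z), followed by a numeric portion. The trailing /i flag from the
    # lesson's regex becomes re.IGNORECASE here.
    LC_PATTERN = re.compile(r"\s*([A-XZ]{1,3})\s*(\d+\.*\d*)(.*)", re.IGNORECASE)

    holdings = ["QA76.9.D3 D38 2005", "Film 2345 reel 6", "823.8 DIC"]

    for call_number in holdings:
        match = LC_PATTERN.match(call_number)
        if match:
            # groups: letter class, numeric portion, remainder of the string
            print(call_number, "->", match.groups())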

To learn more about this class, see the notes. In the next classes, Tom Morris and the class will dive into Python, and we will learn how to work with APIs and MARC records, and how to scrape websites.
