OpenRefine: A Tool for Print Collection Management

When I enrolled in the Fall 2014 Data Scientist Training for Librarians (DST4L) course, I wasn’t sure how relevant the course’s content would be to my current projects.  In preparing for the course’s first two sessions, which focused on OpenRefine (an open source tool for managing “messy” data), I quickly realized that here was a tool I could put to use immediately, for print management at my library (Tisch Library at Tufts University).

Like a great many academic libraries, mine has embarked on a “print management” program to help us manage ever-tightening space for our print books and bound journals.  Despite the ongoing expansion of eBook and eJournal platforms in our collections, we continue to purchase a substantial number of print books every year, and our shelving space will not expand in the face of demands for collaborative spaces, new technologies, and other uses.  So we must make painful decisions about trimming the collection, by deaccessioning items, moving them offsite, or both.

To help us make these painful collection decisions, we joined a growing number of libraries that have had their collections analyzed by Sustainable Collection Services (SCS), which provides an analytic tool called GreenGlass.  This web application enables libraries to analyze their collections not only on their own terms but also in the context of consortial libraries and the broader library landscape as defined by OCLC WorldCat.  The tool helps us understand our collection not only through the normalization of our catalog information but also through details on circulation and duplicates, and through comparisons to other libraries’ holdings, to the holdings of the HathiTrust Digital Library, and to CHOICE Review selections.  The resulting analyzed data can be downloaded to Excel for additional analysis and the generation of lists for review, withdrawal, or relocation.

Although the data generated by GreenGlass is wonderful, using it in Excel has limits.  First, the data set is extensive: for my own subjects of Engineering and Mathematics, I must review over 30,000 records, and such a large number of items in a spreadsheet becomes unwieldy.  Second, Excel is not a database, and it is easy, particularly with extremely large numbers of records, to overlook a record (row) or a data field (column), or to inadvertently edit a cell, which can be problematic if you modify a formula or other calculated value that should apply to all records.  Third, the data generated by GreenGlass is extensive but doesn’t include all the information that could enable me to make the most informed decisions.

OpenRefine seems to address these issues by converting the spreadsheet approach into more of a database and by providing capabilities to extend the analytical possibilities.  Among the advantages is that you don’t have to worry that you’ll miss records: OpenRefine treats all rows as a finite universe.  This enables you to manipulate them with greater confidence and so facilitates functions such as parsing or merging columns as well as creating columns containing calculated values based on the content of other columns.  Built-in tools enable you to facet, filter, and cluster information in a number of ways that provide insights, as well as to convert data into more useful formats.  For example, the GreenGlass column for publication year imports as text; a quick conversion to numeric data enables me to see the publication dates displayed as a timeline.  The ability to cluster similar values is especially useful for reviewing holdings by the Series column: if a holding is part of an important series that my library owns, its value for the collection may be greater than if it is a single volume.  A simple calculated column enables me to combine the WorldCat permalink pattern with the WorldCat number in each record, producing a live URL for every record in my spreadsheet.  Extensions such as NER or RDF enable linking to external datasets, so that I might be able to import additional information from external sources into the report.
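To make two of those transformations concrete, here is a minimal sketch in plain Python (not OpenRefine's own expression language) of the publication-year conversion and the WorldCat URL column described above. The sample rows, the column names "Publication Year" and "OCLC Number", and the permalink prefix are all assumptions made for illustration; they are not taken from an actual GreenGlass export.

```python
# Hypothetical rows standing in for a GreenGlass export.
# Column names and values are illustrative assumptions.
rows = [
    {"Title": "Linear Algebra", "Publication Year": "1987", "OCLC Number": "12345678"},
    {"Title": "Fluid Mechanics", "Publication Year": " 2003 ", "OCLC Number": "87654321"},
    {"Title": "Undated Report", "Publication Year": "", "OCLC Number": ""},
]

# Assumed permalink pattern: prefix followed by the OCLC (WorldCat) number.
WORLDCAT_BASE = "https://www.worldcat.org/oclc/"


def to_year(text):
    """Return the year as an int, or None when the cell is blank or not all digits."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def worldcat_url(oclc_number):
    """Build a WorldCat link from an OCLC number, or None when the number is missing."""
    oclc_number = oclc_number.strip()
    return WORLDCAT_BASE + oclc_number if oclc_number else None


# Add two calculated columns to every row, leaving the original columns untouched.
enriched = [
    {
        **row,
        "Year": to_year(row["Publication Year"]),
        "WorldCat URL": worldcat_url(row["OCLC Number"]),
    }
    for row in rows
]

for row in enriched:
    print(row["Title"], row["Year"], row["WorldCat URL"])
```

Because the new columns are computed for every row in one pass, no record can be skipped, which is the same property that makes the calculated columns in OpenRefine reassuring to work with.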


Figure: Clustering Excel data about book series in OpenRefine.
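For readers curious about how this kind of clustering can group variant spellings of a series title, the following is a simplified Python sketch of a "fingerprint" key-collision approach: each value is reduced to a normalized key, and values that share a key are proposed as one cluster. It is written in the spirit of OpenRefine's fingerprint method rather than as a copy of its implementation, and the series titles are made-up examples.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key: ASCII, lowercase, no punctuation, sorted unique words."""
    # Decompose accented characters and drop anything that is not plain ASCII.
    ascii_text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase, and strip punctuation.
    cleaned = re.sub(r"[^\w\s]", "", ascii_text.strip().lower())
    # Sort the unique words so that word order does not matter.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct values that share a fingerprint; return only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical Series column values with inconsistent punctuation, case, and word order.
series = [
    "Lecture Notes in Mathematics",
    "Lecture notes in mathematics.",
    "Mathematics, Lecture Notes in",
    "Graduate Texts in Mathematics",
]

clusters = cluster(series)
for group in clusters:
    print(group)
```

A sketch like this only suggests candidate groupings; as in OpenRefine, a person still decides whether the variants in a cluster really refer to the same series before merging them.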

GreenGlass and OpenRefine working together provide a powerful toolkit for analyzing the mathematics and engineering areas of my library’s collection.  This combination makes me optimistic that I can make more informed “print management” choices!
