Visualizations in R

Our last two Thursday sessions were an introduction to R, presented by Alex Storer, a Research Technology Consultant at Harvard’s Institute for Quantitative Social Science.  Alex tailored his presentation to our class by creating a custom tutorial based on data compiled by Chris Erdmann and Louise Rubin for the Wolbach Library.  The dataset, “Compiled List of NSF Grants Matched to ADS Records,” links NSF grants awarded from 1995 through November 2012 with corresponding astronomy articles in the NASA Astrophysics Data System (ADS).

Our first session began with a brief introduction to R and its impressively user-friendly integrated development environment, RStudio.  Then we jumped right into examples using the Wolbach data.  Alex showed us how to create data frames and work with their contents: creating subsets and tables, filtering out duplicates, performing calculations, and generating a simple histogram.  The second class picked up where the first left off, with the aggregate function, which Alex likened to pivot tables in Excel.  Our goal for the second class was to recreate in R a graph created independently by Jim Davenport, a PhD student in Astronomy at the University of Washington, who had come across the dataset online.  Alex showed us some minor data cleanup, the creation of a basic plot, some necessary aesthetic modifications, and the addition of a trend line.  Once the graph looked very close to Davenport’s original, we moved on to alternative analyses of the data, which showed an increase over time in the number of articles published shortly after a grant is awarded.  Our second class ended with a discussion of the cost of academic journal subscriptions to libraries and issues of copyright and access.
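For readers who want a feel for those steps, here is a minimal sketch in base R.  It is not Alex's tutorial script: the small data frame, its column names (grant_id, award_year, article_year), and every value in it are made up for illustration, and the real Wolbach dataset is laid out differently.

```r
# Hypothetical stand-in for the grant/article data.
# Column names and values are assumptions, not the Wolbach dataset.
grants <- data.frame(
  grant_id     = c("G1", "G1", "G2", "G3", "G3", "G3", "G4"),
  award_year   = c(1996, 1996, 2001, 2005, 2005, 2005, 2010),
  article_year = c(1998, 1998, 2002, 2006, 2007, 2009, 2011),
  stringsAsFactors = FALSE
)

# Filter out duplicates: keep only the first copy of each identical row.
grants <- grants[!duplicated(grants), ]

# Calculation: years between the award and the article's publication.
grants$lag_years <- grants$article_year - grants$award_year

# Subset: articles published within two years of the award.
quick <- subset(grants, lag_years <= 2)

# Table: number of articles per award year.
per_year <- table(grants$award_year)

# aggregate() works much like a pivot table: group rows, then summarize.
articles_per_grant <- aggregate(article_year ~ grant_id, data = grants,
                                FUN = length)
names(articles_per_grant)[2] <- "n_articles"
mean_lag <- aggregate(lag_years ~ award_year, data = grants, FUN = mean)

# Simple histogram of the publication lag.
hist(grants$lag_years,
     main = "Years from award to publication",
     xlab = "Years")

# Basic plot with a few aesthetic tweaks, plus a linear trend line.
plot(mean_lag$award_year, mean_lag$lag_years,
     pch = 19, col = "steelblue",
     xlab = "Award year", ylab = "Mean years to publication",
     main = "Publication lag by award year")
fit <- lm(lag_years ~ award_year, data = mean_lag)
abline(fit, lty = 2)
```

The formula interface to aggregate (value ~ group) reads as "summarize value for each group," which is why the pivot-table comparison fits.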

Between sessions, we were visited on Saturday by Erin Braswell, an Education Specialist at the Harvard-Smithsonian Center for Astrophysics, who taught herself R in order to analyze usage data generated by the traveling museum exhibit Black Holes: Space Warps and Time Twists.  At the exhibit, visitors choose two-word usernames while creating user cards, which they then scan at activity stations throughout the exhibit.  Back at home, visitors can log into the associated website with their user cards to access all the data and videos they collected while there.  Erin’s usernames-d.R script (setup-d.R should be run first) looks at how often the username options are selected and in what combinations, which usernames are more popular with teens versus children or adults, and how long users with different usernames spend on activities.  Erin used ggplot to create a variety of graphs, including several bubble charts designed with the help of a FlowingData tutorial.
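The kinds of questions Erin's script asks can be sketched in a few lines of base R.  This is not her usernames-d.R script: the visitor records, column names, and username words below are invented for illustration, and a base-graphics scatterplot with scaled points stands in for her ggplot bubble charts.

```r
# Hypothetical visitor records; all names and values are invented.
visits <- data.frame(
  first_word  = c("Cosmic", "Cosmic", "Stellar", "Cosmic", "Stellar", "Dark"),
  second_word = c("Voyager", "Panda", "Voyager", "Voyager", "Panda", "Voyager"),
  age_group   = c("child", "teen", "adult", "teen", "child", "teen"),
  minutes     = c(12, 30, 18, 25, 9, 40),
  stringsAsFactors = FALSE
)

# How often is each first word chosen?
word_counts <- table(visits$first_word)

# Which two-word combinations occur, and how often?
combos <- as.data.frame(
  table(first_word = visits$first_word, second_word = visits$second_word),
  stringsAsFactors = FALSE
)
combos <- combos[combos$Freq > 0, ]

# Popularity of each first word by age group.
by_age <- table(visits$first_word, visits$age_group)

# Average time spent on activities for each first word.
time_by_word <- aggregate(minutes ~ first_word, data = visits, FUN = mean)

# Bubble-style chart: one point per combination, sized by its frequency.
first_f  <- factor(combos$first_word)
second_f <- factor(combos$second_word)
plot(as.integer(first_f), as.integer(second_f),
     cex = 2 * sqrt(combos$Freq), pch = 16, col = "darkorange",
     xaxt = "n", yaxt = "n",
     xlim = c(0.5, nlevels(first_f) + 0.5),
     ylim = c(0.5, nlevels(second_f) + 0.5),
     xlab = "First word", ylab = "Second word",
     main = "Username combinations")
axis(1, at = seq_len(nlevels(first_f)), labels = levels(first_f))
axis(2, at = seq_len(nlevels(second_f)), labels = levels(second_f))
```

Scaling the point size by the square root of the count keeps bubble area, rather than diameter, roughly proportional to frequency.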

More on all three sessions is contained in our notes (PDF, 166 KB) and in the presenters’ scripts, which are extensively commented.
