This past semester I was one of about two dozen librarians and library students participating in the Data Scientist Training for Librarians (DST4L) course, hosted out of the Harvard-Smithsonian Center for Astrophysics and taught by Chris Erdmann. I was interested in the course as a science librarian because big data is a growing theme in science research: funding mandates now require sharing datasets, and the sheer scale of raw data to be analyzed presents challenges to research collaboration and information preservation. I thought I’d learn a bit of coding and get a better idea of the issues my researchers face in managing their data. Little did I know that learning ‘a bit of coding’ for use in data analysis is approximately as frustrating as going on a foreign trip with a phrasebook in hand and hoping to conduct interviews.
Let me digress for a moment and give a little background on me. I have an undergraduate degree in biology and worked for several years in research microbiology labs. While obtaining my Master’s in Library Science, I took courses in XML and have experience with other markup languages. I know what the command prompt looks like and what it’s good for. I’ve even been exposed to code before and on occasion used scripts written by others to my specifications, though I had never written any code prior to this class. I thought this experience was enough of a base to build on when learning Python (specifically the Enthought distribution), the chosen language for this course. Some of you will see the flaw in my logic at this point, but more on this later.
Data, particularly public data from the Internet, is currently stored in such diverse formats that a handwritten script is often the only viable method for collecting the information into a form in which it can be analyzed. That handwritten code takes expertise, patience, and perseverance to get right, and I tip my hat to the folks at places like the Sunlight Foundation who scrape data from the web and present it in more comprehensible ways for the greater good. I think my strengths with data lie more in the storytelling aspect, using visualizations to draw attention to a pattern, correlation, or trend. This course introduced me to several programs for data visualization such as Tableau, OpenRefine, and RStudio, which in addition to being powerful tools for research are just plain fun to use.
Less fun, but more challenging, were the group projects that really got us using Python to scrape and analyze information from raw text on the web. Each group needed to extract natural language data from the SAO/NASA Astrophysics Data System full text database of astronomy papers. Fortunately, we were able to work directly through the developers’ API, something I had not done before this course. For my group, Python’s Natural Language Toolkit and the power of regular expressions (regex) were brought to bear on extracting geocoordinates mentioned within papers and plotting these locations, described in dry latitude and longitude digits, on a map.
The eventual goal was to be able to add, to each paper in the database containing a geocoordinate, a link to an accurate map showing the locations mentioned. Aggregate data by observatory location might be of interest, or it might help research conducted in temporary field locations stand out. More poetically, papers – particularly historical papers – that describe voyages of exploration or feats of astronomical deduction might be brought to life as a written description of numerical coordinates turns into a line on a map and becomes a story about the Islamic scholar in 1025 who deduced the Earth’s acceleration from solar data (while nominally calculating the precise direction to face for prayer to Mecca from his then-location of Ghazni, Afghanistan).
Just one big hurdle we needed to deal with first: people have a greater tolerance for sloppy matches than computers do. As people, we can look at the following list of coordinates and identify them as latitudes or longitudes at a glance:
- 37.1971°N and 80.5738°W
- ranging from 0° at the north pole to 180° at the south pole
- lies between 0°45′ and 3°35′ South and 29°15′ and 30°51′ East
- 36°06′ to 37°30′N, 92°18′
But the point of this exercise is precisely so that a person doesn’t need to go through each paper by hand and re-type the coordinates into a mapping program. No, the point is to run the code on all of the articles in the database at once and be presented with a neat list of specific coordinates, nicely tabulated with an article identifier, ready to be uploaded to a handy mapping tool like Google Earth or BatchGeo.
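As a sketch of what that end product might look like (the article identifiers below are invented for illustration, not real ADS bibcodes), here is how extracted results could be tabulated into a CSV that a tool like BatchGeo or Google Earth can ingest:

```python
import csv

# Hypothetical extraction results: (article identifier, latitude, longitude).
# The identifiers are made up for this example.
results = [
    ("2001ApJ...Example..1X", 37.1971, -80.5738),
    ("2003AJ....Example..2Y", -1.5000, 29.9000),
]

with open("coordinates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["article_id", "latitude", "longitude"])  # header row
    writer.writerows(results)
```

BatchGeo and Google Earth both accept this kind of simple spreadsheet, which is why a flat table of identifier-plus-coordinates was the target output.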
Using regex at its most accurate requires a very specific description of what we’re looking for, usually at the character-by-character level. Thus, a pattern that matches 37.1971°N will not match 37°30′N, because one uses a decimal point and the other uses minutes. The match can be made less specific, but it then runs the risk of including things that are not geocoordinates.
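The specificity problem can be sketched in a few lines (these patterns are my own illustration, not our group’s actual code): a pattern written for decimal degrees simply fails to see the degree-minute forms, and vice versa.

```python
import re

# Matches decimal-degree coordinates like "37.1971°N" -- the pattern
# insists on a decimal point, so "37°30′N" slips right past it.
decimal_pattern = re.compile(r"\d{1,3}\.\d+°[NSEW]")

# A separate pattern is needed for degree-minute notation like "37°30′N".
dms_pattern = re.compile(r"\d{1,3}°\d{1,2}′[NSEW]")

text = "37.1971°N and 80.5738°W, plus a range of 36°06′ to 37°30′N"
print(decimal_pattern.findall(text))  # ['37.1971°N', '80.5738°W']
print(dms_pattern.findall(text))      # ['37°30′N']
```

Note that even the degree-minute pattern misses "36°06′" above, because no hemisphere letter follows it: every variation in how authors write coordinates demands another tweak to the pattern.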
None of this is helped by the fact that the full text articles are HTML encoded, meaning that special characters display as entity codes: the degree sign looks like &deg;, the prime notation for minutes displays as &prime;, and text formatting tags are everywhere. So the results from the full text API search actually look more like:
- 37.1971&deg;N <em>latitude</em> and 80.5738&deg;W <em>longitude)</em>
- <em>latitude</em> (ranging from 0&deg; at the north pole to 180&deg; at the south pole)
- lies between 0&deg;45&prime; and 3&deg;35&prime; South <em>latitude</em> and 29&deg;15&prime; and 30&deg;51&prime; East <em>longitude</em>
- <em>(latitude</em> 36&deg;06&prime; to 37&deg;30&prime;N, <em>longitude</em> 92&deg;18&prime;
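One way to tame the entity-code mess (a minimal sketch of the general approach, not necessarily what our group settled on) is to unescape the HTML back to plain characters and strip the formatting tags before the regex ever sees the text:

```python
import html
import re

# A sample line as it comes back from the full text search, entity codes and all.
raw = "lies between 0&deg;45&prime; and 3&deg;35&prime; South <em>latitude</em>"

# Convert entity codes back to plain characters: &deg; -> °, &prime; -> ′
clean = html.unescape(raw)

# Strip the <em> formatting tags that litter the full text.
clean = re.sub(r"</?em>", "", clean)

coord = re.compile(r"\d{1,3}°\d{1,2}′")
print(coord.findall(clean))  # ['0°45′', '3°35′']
```

Normalizing the text first means the coordinate patterns only have to cope with the notation variations, not the encoding ones.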
…Let’s just say that progress was slow.
Although I did learn how to program in Python (it totally worked! It produced the text files I wanted without an error, eventually!), and even a little in R as well, if this course has taught me one thing it is that I am not a programmer. I can give you a detailed step-by-step list of commands the program must run to analyze the data and return the results I’m looking for, but when it comes to writing the program itself, well, more often than not I end up with my phrasebook in one hand, staring at the code on screen and wondering why it just did the equivalent of telling me my accent is incomprehensible.
Perhaps the best take-away from this course has been the discussions with fellow classmates about the future of big data, how data analytics are changing our daily lives already, and the role librarians play in managing research data. The talks have ranged from the many ways to tell a story – or at least get your point across – with visualizations, to how some companies are already using data analysis to determine things like whether or not to approve your loan. And the role of librarians can be whatever we make of it.
Several recent blog posts by librarians have commented on the difference between a data scientist and a data-savvy professional; I am firmly in the second camp. (Both Sally Gore and DST4L’s own Jennifer Prentice have posted blog entries offering insight into this distinction.) I feel no need to run my own statistical analyses or write my own code if there is any alternative. But I do feel it is important for librarians, as traditional managers of information, to have a vocabulary of data science, to be data-savvy professionals. When assisting researchers with creating metadata for their research data sets or helping them find data produced by others, it is important to know what questions to ask and what challenges they may be facing. The NIH recently provided several grants to fund librarians in the role of Informationist, a unique support role in research labs for assisting with data management, and more and more academic libraries are taking on these sorts of roles and providing that sort of service for researchers entirely on their own.
Just as librarians have had to become knowledgeable about Open Access in journal publishing, its impact on the traditional publishing process, copyright, and authors’ rights, and the changes the movement has wrought in the role of grey literature, big data available online challenges us to take up new roles as information organizers, managers, and facilitators for researchers. We won’t necessarily need to become expert programmers or statisticians, but we will need to learn the new vocabulary that comes with discussing research data with researchers, and to define a new facet of our role in the research institution. From digital humanities to terabytes of astronomy observations to the reams of medical information gathered by instruments that didn’t exist five years ago, big data plays a role in every aspect of research. Librarians must be prepared to help organize it, as we have always done. For what use is a library if you can’t find the book you’re looking for? What use is all the data in the world, available on the Internet, if you can’t find the data you need?