Reflections on Data Scientist Training from an LIS Student

Here I’m going to talk in fairly broad strokes about how learning a bit of programming through this course has changed the way I think about data. If you’re more interested in learning about programming and how to do it, I think that this comic from Abstruse Goose describes the process better than I could.

When I started this course, I was accustomed to accessing information through various interfaces that processed my requests through complex methods I didn’t understand. I was a Googler, a searcher, and a user of graphical user interfaces (GUIs). I had been acquainted with command-line interfaces, which communicate with machines in a language closer to their own, but I hadn’t delved into that world. Now, I’m sitting on top of a comma-separated values (CSV) file containing bibliographic information for around 12,000 astronomy and astrophysics records, a functional (though inelegant) script that generated it, a script to remove abbreviations from that file and others structured like it, and a few lists of invalid country codes and language codes that currently exist in the Hathi Trust’s collections (about 220,000 and 160,000, respectively, out of about 10 million records). Here, I’m going to outline what’s changed in my thinking between where I started and where I’ve finished.

Accessing information as a human is usually relatively straightforward. You have a question, and you’re looking for an answer. Sometimes it’s a simple question (“Where can I find this particular text?”) and sometimes it’s more complex (“What is the current research related to eye-tracking?”), but there’s generally just one question, and there’s always a human involved, deciphering bits of ambiguity so fluidly that they don’t always notice. There can be some difficulty in interacting with the systems that facilitate this line of inquiry, but there’s still that human element, addressing the problem with mysterious and complex neural algorithms.

When one is trying to organize information to facilitate this process, the task becomes somewhat more difficult. There is no longer one question with a relevant set of information needed to answer it; there is a whole variety of questions that could be asked, some of which have never been asked before. Librarians have long accepted the challenge of organizing information and creating information about it, and standards and best practices have arisen to guide us. But in this challenge, there is still a human on the other side of the equation, capable of making assumptions and resolving ambiguity.

The challenge of this semester was removing the human element from the interpretation of information. This meant that 1) I had to get the information into a machine-readable format and 2) I had to tell the program what to do in the case of any given response, which can be more difficult than it sounds.
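
To make that second point concrete, here’s a toy sketch (illustrative, not course code) of what “telling the program what to do in the case of any given response” looks like in Python. The specific placeholder values are just examples; the point is that a human reader resolves an odd value without thinking, while a program needs an explicit rule for every branch.

```python
# A toy illustration, not course code: a program must be told what to do
# with every possible input, including the ones nobody expected.
def describe_language(code):
    known = {"eng": "English", "fre": "French", "ger": "German"}
    if code in known:
        return known[code]
    elif code in ("", "und", "|||"):  # illustrative "not recorded" placeholders
        return "unknown"
    else:
        # A human would shrug and guess; a program needs an explicit rule,
        # or it will fail here in one way or another.
        return "invalid code: " + repr(code)

print(describe_language("eng"))  # English
print(describe_language("zzz"))  # invalid code: 'zzz'
```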

The first issue, getting information into a machine-readable format, was a challenge that took most of the course. I was in a project group focused on finding articles that were in both the Astrophysics Data System (ADS) and the Hathi Trust Digital Library. The approach we took consisted of finding astronomy- and astrophysics-related articles in the Hathi Trust, extracting bibliographic information for all of them, and then sending that information to the ADS API in such a way as to retrieve only the articles that were also in the Hathi Trust. The way we wound up going about this involved a convoluted process of parsing the search-results webpages and extracting identifiers, which were then compared against a file containing metadata for the Hathi Trust’s holdings, indexed by those identifiers. We did this because the APIs that the Hathi Trust provides are set up to retrieve information for items with known identifiers in small sets, which didn’t suit our needs. Essentially, we had to interpret the search-results pages in a way that a computer could understand, through the medium of a Python script. This was not ideal, but it was instructive in the difference between human-friendly interfaces and machine-friendly ones.
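
To give a flavor of what that looked like, here’s a heavily simplified sketch of the pattern, not our actual script; the URL, the link pattern, and the metadata file layout below are all hypothetical stand-ins.

```python
# A simplified sketch of the pattern described above, not our actual script.
# The URL, the href pattern, and the metadata file layout are hypothetical.
import csv
import re
import urllib.request

def extract_record_ids(html):
    """Pull record identifiers out of a search-results page, assuming
    (hypothetically) that result links look like href="/Record/012345678"."""
    return re.findall(r'href="/Record/(\d+)"', html)

def load_metadata_index(path):
    """Index a tab-separated metadata dump by record identifier.
    The three-column layout here is illustrative."""
    index = {}
    with open(path, newline="", encoding="utf-8") as f:
        for record_id, title, year in csv.reader(f, delimiter="\t"):
            index[record_id] = (title, year)
    return index

metadata = load_metadata_index("hathi_metadata.tsv")  # hypothetical filename
search_url = "https://catalog.example.org/Search?q=astrophysics"  # stand-in URL
html = urllib.request.urlopen(search_url).read().decode("utf-8")
for record_id in extract_record_ids(html):
    if record_id in metadata:
        print(record_id, *metadata[record_id])
```

Parsing HTML that was designed for human eyes is brittle: if the page layout changes, the script breaks. That fragility is exactly the gap between human-friendly and machine-friendly interfaces.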

We were ultimately not able to produce a method of finding matches between the two systems, but the dataset we retrieved from the Hathi Trust was instructive on its own. It was based on MARC records from multiple contributing libraries, each with somewhat different methods of storing their information. This came out most clearly when I attempted to replace abbreviations in the records with their unabbreviated meanings. In doing so, I found that my script would stop whenever it encountered an abbreviation I hadn’t accounted for in the list I gave it, which was taken from the MARC documentation online. Looking into it further, I found only a few invalid entries in the astronomy and astrophysics metadata set I had created, but when I tried to do the same with the larger dataset it was derived from, the number of invalid codes was far less manageable, so I took a look at what they were. I wrote a script to export all of the invalid codes with their corresponding IDs to a .csv file and used Tableau, a piece of data visualization software, to examine what kinds of values were in this list. Most were simply different ways of indicating that the language or country was not recorded, but a fair number were recognizable country codes that were simply not correct. This happened for around two percent of the items in the Hathi Trust, but because of my script’s inability to deal with ambiguity, they stuck out like a sore thumb.
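
If I were to sketch those two scripts as one, it would look something like the following. The column names and the tiny code table are hypothetical stand-ins; the real script used the full country and language code lists from the MARC documentation. The key change from my first attempt is that unknown codes are collected rather than allowed to halt the run.

```python
import csv

# A tiny excerpt of a code-to-meaning table; the real script loaded the
# full MARC country and language code lists.
COUNTRY_CODES = {"nyu": "New York (State)", "enk": "England", "gw": "Germany"}

def expand_codes(in_path, out_path, invalid_path):
    """Expand known country codes in place; collect unknown ones instead
    of halting. The column names ("id", "country") are hypothetical."""
    invalid_rows = [("id", "code")]
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            code = row["country"].strip()
            if code in COUNTRY_CODES:
                row["country"] = COUNTRY_CODES[code]
            else:
                invalid_rows.append((row["id"], code))  # record it, don't crash
            writer.writerow(row)
    with open(invalid_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(invalid_rows)

expand_codes("astro_records.csv", "astro_expanded.csv", "invalid_codes.csv")
```

The resulting .csv of invalid codes is then trivial to pull into Tableau, or any similar tool, for a quick look at what values actually occur.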

This is the great challenge and the great opportunity in dealing with data programmatically: inconsistencies in the data are made glaringly obvious. The challenge to metadata creators is to do better than 98% validity, to apply coding schemes consistently 100% of the time. This is actually easier when the creators themselves access the data programmatically, since inconsistencies can be spotted and corrected automatically rather than by poring over mostly correct records to turn up the few that aren’t. The opportunity is that by enabling programmatic access to data, both through APIs and other means of access and by ensuring the consistency of the data, we can have the data tell us some interesting stories.

If you’re not yet motivated to learn more about dealing with data programmatically on a large scale, I recommend Hans Rosling’s and Tim Berners-Lee’s TED talks on data, if you haven’t seen them already. There’s a world of data opening up out there, and librarians are well situated to be at its forefront, so long as we’re willing to work with machines that are incapable of understanding anything resembling nuance unless it’s been systematically deconstructed for them.
