On October 1, we had our fifth and final session with our able instructor, Tom Morris, for this section of the DST4L course. Tom gave us an overview of APIs that was succinct but detailed, and we learned how web APIs are used to tease data from web sites. For those who are new to the subject (as I am), “API” stands for “application programming interface,” and there are many types. In this course, we are concerned with web APIs: detailed specifications, each developed for a given web site, that define how outside computers may interact with that site. If you are looking to scrape data, it is important to familiarize yourself beforehand with the API of the site you’re interested in so that you can tailor your queries accordingly. There are thousands of APIs out there. For example, Tom mentioned that the ProgrammableWeb site lists 93 library-related APIs alone. Some APIs are open to anyone; some require registration “keys.” According to Tom, the use of keys is becoming more common as more people engage in web scraping. Web site owners use these keys to identify who is extracting data from their site. You don’t want to overtax their servers by, for example, constructing queries so large that they shut down the server, especially during periods of heavy usage. If you don’t play fair, you may find your key has been revoked in the middle of a search. Not good.
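To make the idea of "playing fair" a little more concrete, here is a minimal sketch of how a keyed, paged query might be built and paced in Python. Everything specific in it is an assumption for illustration: the endpoint `api.example.org`, the parameter names (`q`, `page`, `per_page`, `key`), and the one-second pause are all hypothetical and do not come from Tom's lecture or from any real site's API.

```python
import time
from urllib.parse import urlencode

# Hypothetical endpoint and key, used only for illustration.
BASE_URL = "https://api.example.org/search"
API_KEY = "YOUR-KEY-HERE"


def build_query_url(term, page, per_page=100, key=API_KEY):
    """Return the URL for one page of results for a search term."""
    params = {"q": term, "page": page, "per_page": per_page, "key": key}
    return BASE_URL + "?" + urlencode(params)


def polite_urls(term, pages, delay_seconds=1.0, sleep=time.sleep):
    """Yield one URL per page, pausing between pages to spare the server."""
    for page in range(1, pages + 1):
        if page > 1:
            sleep(delay_seconds)  # wait before asking for the next page
        yield build_query_url(term, page)
```

Asking for many small pages with a pause in between, rather than one enormous result set, is one common way to keep a query from overtaxing a server. The real parameter names and limits always come from the documentation of the site in question.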
Several of Tom’s examples of API queries consisted of OpenLibrary searches. Its API documentation is available here. The documentation provides instructions on how to construct queries. Using the parameters provided, Tom showed how he was able to construct a search, using Python and JSON, for all of the titles in OpenLibrary on the subject of astronomy published before 1900. In order to be kind to the server, he capped the request at 5000 titles. The search retrieved 1763 titles. Here is a screenshot of one of his examples, taken from his IPython Notebook. The file of examples is available from the dst4l-projects Dropbox.
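For readers without the notebook to hand, a query along these lines might be sketched as follows. This is not Tom's code. It assumes the Open Library Subjects API, with its `published_in` and `limit` parameters and a `works` list in the JSON reply, as I understand the documentation. The helper names (`subject_url`, `parse_works`, `fetch_works`) and the start year used to stand in for "before 1900" are my own illustrative choices.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def subject_url(subject, start_year, end_year, limit):
    """Build a Subjects API URL for works on a subject in a year range."""
    params = urlencode({"published_in": f"{start_year}-{end_year}", "limit": limit})
    return f"https://openlibrary.org/subjects/{quote(subject)}.json?{params}"


def parse_works(payload):
    """Turn the JSON text of a reply into (title, [author names]) pairs."""
    data = json.loads(payload)
    return [
        (work.get("title"), [a.get("name") for a in work.get("authors", [])])
        for work in data.get("works", [])
    ]


def fetch_works(subject, start_year, end_year, limit):
    """Send the request and parse the reply (needs network access)."""
    with urlopen(subject_url(subject, start_year, end_year, limit)) as response:
        return parse_works(response.read().decode("utf-8"))
```

A call such as `fetch_works("astronomy", 1000, 1899, 5000)` would mirror the search described above, with 1000 as an assumed lower bound on the publication year. Whether the server honors a limit that large in a single request is something to check against the documentation.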
Building upon that initial search, he constructed more complex searches. The next search extracted the authors of those 1763 titles who had written more than ten works:
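A minimal sketch of that kind of tally, again not Tom's actual code, could use `collections.Counter` from the Python standard library. It assumes the search results have already been reduced to a list of `(title, [author names])` pairs. The function name `prolific_authors` and that input shape are illustrative assumptions.

```python
from collections import Counter


def prolific_authors(works, min_works=10):
    """Return {author: count} for authors with more than min_works titles."""
    # Count one title for each author name attached to each work.
    counts = Counter(name for _title, authors in works for name in authors)
    return {name: n for name, n in counts.items() if n > min_works}
```

Because the comparison is strictly greater than, an author with exactly ten titles is left out, matching "more than ten works."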
It was helpful that he had gone from the relatively simple to the more complex using the same basic search. He also indicated line by line what the code was doing, which was incredibly useful to programming newbies.
Because of the variation among students’ computer setups, technical difficulties, the varying speed at which individuals typed the examples from the overhead screen into their computers, and the complexity of the topic of web APIs, we really didn’t have sufficient time at the end of the lecture to explore in depth the topics of PyMARC and PDF table extraction. Since a number of people taking the course are catalogers, it would have been relevant to our work to explore the types of cataloging applications for which PyMARC is suited. In addition, because so much data is available only in PDF format, data extraction from PDF files is an extremely important topic. With PDFs, one is dealing essentially with images rather than structured data. As a consolation prize, here are a few interesting web sites Tom provided that address PDF data extraction:
- http://tabula.nerdpower.org (Tabula, an open source project)
- https://www.pdftoexcelonline.com/ (Nitro Cloud—converts PDFs to Excel spreadsheets)
- http://captricity.com/ (A crowdsourcing site to convert paper, faxes, and PDFs to “actionable data”)
Because of the brevity of the course and the breadth of information being conveyed, it wasn’t really practical for Tom to teach us everything he wanted to, but the skills and applications he did teach in detail will become valuable additions to our workshop of data management tools. Now, it is a matter of application and practice, practice, practice. Did I mention practice?
The notes for this class are available here.