Bootcamp…for Librarians! — Aug. 23-24, 2013

On August 23rd and 24th, we met at the Harvard-Smithsonian CfA for Session I of Data Science Training for Librarians (DST4L). The beautiful weather notwithstanding, we assembled on Friday and Saturday mornings in the Phillips Auditorium, ready to work up a sweat at this two-day Software Carpentry Bootcamp.

Our instructors from Software Carpentry (Jessica McKellar, R. David Murray, Michael Selik and Erik Bray) led us through a variety of tutorials intended to introduce us to some of the essential skills needed by data scientists. Navigating the UNIX shell came first. It was a bit of a change for those of us grown soft on graphical user interfaces, but we saw how, with a bit of practice, the shell can actually make programming more efficient. Next we dove into the Python programming language. For my part, I found it fun and easy to use in comparison to other languages I’ve used, such as Java, JavaScript and C++. Python is a great choice for data science projects involving databases, data visualization, statistics, etc., but as a full-featured programming language it has broader uses too. No wonder it has been growing in popularity in scientific and academic settings.

We covered key concepts like variables, string and list manipulation, conditional statements, and writing functions. With the able guidance of our instructors we quickly progressed to the point that we were able to create Python scripts to perform useful data-mining tasks. We searched a dictionary file to find words with certain patterns of interest (okay, we learned how to cheat at Scrabble, but what else are you supposed to do when you’re stuck with an ‘S’, a ‘Z’ and three ‘Y’s?). We were able to parse a large file, the text of Hamlet, using regular expressions and other Python tricks we’d learned to format the text, create word-frequency lists, and extract other kinds of useful information. We also saw some examples of more complex, “real world” research projects that applied some of these principles.
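
To give a flavor of those two exercises, here is a minimal sketch in Python. It is not the code we wrote at the bootcamp: the function names, the tiny sample word list, and the sample sentence are all made up for illustration, and a real run would read the dictionary file and the text of Hamlet from disk instead.

```python
import re
from collections import Counter


def words_from_letters(rack, dictionary_words):
    """Return the words that can be spelled using only the letters in `rack`.

    `rack` and `dictionary_words` are hypothetical names for this sketch.
    """
    available = Counter(rack.lower())
    matches = []
    for word in dictionary_words:
        needed = Counter(word.lower())
        # Subtracting Counters drops zero and negative counts, so an empty
        # result means the rack covers every letter the word needs.
        if word and not (needed - available):
            matches.append(word)
    return matches


def word_frequencies(text):
    """Count how often each word occurs, ignoring case and punctuation."""
    # Treat runs of letters (and straight apostrophes) as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample data standing in for the dictionary file and for Hamlet.
sample_words = ["cat", "act", "tact", "dog"]
print(words_from_letters("tca", sample_words))  # ['cat', 'act']

counts = word_frequencies("To be, or not to be")
print(counts["to"], counts["be"], counts["not"])  # 2 2 1
```

Using a Counter for both jobs keeps the sketch short: the same tool that tallies word frequencies also checks whether a handful of Scrabble tiles contains enough of each letter.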

Version control, a vital part of any successful collaborative project, was introduced as well. There are different version control systems out there, but for this class we used the popular Git, along with the hosting site GitHub (http://github.com). Systems like Git support project management tasks such as tracking revision history, submitting proposed changes, and undoing changes when necessary. GitHub also lets users ‘fork’ a copy of a project to their own account so they can experiment with changes without affecting the main project.

This is just a quick summary of what was an information-packed inaugural session. Thank you to the folks from Software Carpentry! Thanks as well to Chris Erdmann for all he’s done to make this learning opportunity available, and for taking the time to show us the beautiful old Great Refractor telescope and amazing SDO solar images at the CfA.

Notes from the bootcamp are available:
2013-08-23-harvard
2013-08-24-harvard

Materials are also available at:
Why Python, Python 1 Slides, Online Python Tutorial, Code Academy Python Questions,
Bonus Python Material, Regular Expressions

Also see blog posts from the instructors:
Matthew Ruttley
Philip Guo
