Exploratory Data Analysis and Statistics using Pandas and Matplotlib

Today we looked at exploratory data analysis using pandas and matplotlib. After hanging out with Tom Morris for the past five weeks, we switched it up and had a lecture from Rahul Dave. Rahul is a computational scientist at Harvard’s Astrophysics Data System. It had been a while since we’d played with Python, and there were a few challenges getting up and running with it again. Some of us will definitely be attending the Boston Python User Group’s Monthly Meet-Up on Monday, October 14, as well as getting together to review concepts or attending the open office hours Chris is kindly setting up for us with more experienced Python users. We also started using IPython Notebook, which is helpful because it lets you save your changes via checkpoints and keep a history of what you did, so you can look back and refresh your memory. You can also type notes in between the cells you’re running. I was able to type notes of Rahul’s explanations in between the exercises in the notebook, so I have a fully integrated document for once!

Rahul reminded us that “Python is a duck, and not a pelican or something of that sort,” meaning that when you call a method on an object, what matters is the methods and properties that object actually has, not the particular class it inherits from. Essentially, it doesn’t matter what type of object you’re using: if the object supports the operation, Python will do it. You can create your own classes with a structure that works for your data, and use regular Python syntax on them, which is helpful with large or unique datasets. You can also create your own dictionaries, basically making a cache of information to refer back to later on in your notebook, rather than having to create it from scratch every time.
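To make the duck idea a bit more concrete, here is a minimal sketch I put together afterwards. It is not from Rahul’s notebook; the class names, the `describe` function, and the `region_name` lookup are all made up for illustration.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    # Not related to Duck in any way, but it also has a speak() method.
    def speak(self):
        return "Beep"


def describe(thing):
    # No type check here: any object with a speak() method will work.
    return thing.speak()


sounds = [describe(t) for t in (Duck(), Robot())]
print(sounds)


class Samples:
    """A hypothetical container class that supports len() and for-loops."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def __iter__(self):
        return iter(self.values)


samples = Samples([3, 5, 8])
total = sum(samples)        # regular Python syntax works on our own class
count = len(samples)

# A dictionary used as a simple cache: compute once, look it up afterwards.
cache = {}


def region_name(code):
    if code not in cache:
        # Stand-in for a slow lookup; the label format is invented.
        cache[code] = f"Region {code}"
    return cache[code]


first = region_name(1)      # computed and stored
second = region_name(1)     # read back from the cache
```

The point of `describe` is that it never asks what class it was handed. The `Samples` class shows the other half of the idea: because it defines `__len__` and `__iter__`, built-ins like `len()` and `sum()` treat it like any other collection.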

One thing that’s been great about the instructors is that they have all exposed us to diverse datasets for our in-class projects. Lynn and Tom brought in datasets on Chicago crimes, hunting accidents in Wisconsin, authors and filmmakers. With Rahul, we used a dataset on olive oil, and looked at sorting the various types of olives and oils by region and type using pandas. Pandas is a Python software library for working with tabular, spreadsheet-like data. It read our .csv file into a dataframe, which we cleaned up by, for instance, renaming columns. We then practiced cross-tabulation and creating new dataframes with subsets of the data, and pulled the means and medians. Pandas has excellent documentation (and explains these concepts far better than I could) here. Probably the best part of class was that Rahul would present a few concepts, then we’d get a chunk of time to try them out on an exercise, and then we’d circle back and review what had worked and what hadn’t. It was helpful to both get a chance to practice and work with my neighbors to solve problems.
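For anyone who wants to try the pandas steps above without the class files, here is a rough sketch of the same sequence: rename a column, cross-tabulate, take a subset, and pull means and medians. The tiny table is invented so the example runs on its own; the column names and numbers are assumptions and are not the real olive oil data.

```python
import pandas as pd

# Made-up stand-in for the class dataset (in class we loaded a .csv file).
df = pd.DataFrame({
    "Unnamed: 0": ["Area A", "Area A", "Area B", "Area C"],
    "region": [1, 1, 1, 3],
    "palmitic": [10, 12, 14, 20],
    "oleic": [70, 72, 77, 75],
})

# Clean up: give the unlabeled first column a readable name.
df = df.rename(columns={"Unnamed: 0": "area_name"})

# Cross-tabulation: count how many rows fall in each region/area pair.
counts = pd.crosstab(df["region"], df["area_name"])

# New dataframe holding only a subset of the rows (region 1).
region_one = df[df["region"] == 1]

# Means and medians of the numeric columns in that subset.
means = region_one[["palmitic", "oleic"]].mean()
medians = region_one[["palmitic", "oleic"]].median()

print(counts)
print(means)
print(medians)
```

With a real file, the first step would be something like `pd.read_csv(...)` with whatever path the file lives at, and the rest would carry over with the real column names swapped in.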

We also looked at NumPy (for the uninitiated, it’s pronounced “Num-pie,” like apple pie, and not “Num-pee,” as you might expect) and matplotlib. NumPy is a Python package for scientific computing which allows you to work with large arrays and includes additional mathematical functions. One cool thing we learned how to do was transpose the data, rearrange it, and look at the shape of the array. Matplotlib was our first intro to visualization, and we used it to turn our olive oil dataset into scatter plots and bar charts (you can see the full, completed notebook with the charts here). We were then asked to work through the section on data munging on our own for homework (which was tricky, but again, helpful).
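Here is a small sketch of the array tricks and the two chart types mentioned above. Again, the numbers and column labels are invented, and this is my own reconstruction rather than the code from the class notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Four made-up samples (rows) with two measurements each (columns).
data = np.array([[10, 70],
                 [12, 72],
                 [14, 77],
                 [20, 75]])

print(data.shape)                # rows, columns

transposed = data.T              # swap rows and columns
reshaped = data.reshape(2, 4)    # same eight values, rearranged into 2 rows

# After transposing, each row is one measurement across all samples.
palmitic, oleic = transposed

fig, (ax_scatter, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot of one measurement against the other.
ax_scatter.scatter(palmitic, oleic)
ax_scatter.set_xlabel("palmitic")
ax_scatter.set_ylabel("oleic")

# Bar chart of the average of each column.
column_means = data.mean(axis=0)
ax_bar.bar(["palmitic", "oleic"], column_means)
ax_bar.set_ylabel("mean value")

fig.tight_layout()
```

In a notebook the figure shows up under the cell; in a plain script you would add `plt.show()` at the end.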

Notes for this class can be found here.
