Pandas: Munging, Stats and Visualization

In this past week’s class on Tuesday, October 15th, we went back to our dataset of olive oils and their constituent fatty acids to learn some more about Pandas, NumPy, and Matplotlib. As an aside, this course is doing strange things to my vocabulary. It takes me a bit to remember that pandas and pythons are actual animals, especially when Googling.

Pandas can take in data from CSV files and store it in a structure that you access much like a table in a relational database or, more familiarly, a spreadsheet. The reason to use a scripting tool like Pandas rather than a tool like Excel is that when you make a graph or do an analysis in Excel, doing it the next time takes the same amount of effort. When you do it in Pandas, it takes a lot less effort to do it the second time, or the third time, or the hundredth time, and you can do similar things to different data sets with some minor changes.
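Here's a minimal sketch of that "less effort the second time" idea. The CSV text, the column names, and the acid_range function are all made up for illustration; they are not the class dataset or the class code.

```python
import io

import pandas as pd

# Made-up sample rows standing in for a real CSV file on disk.
CSV_TEXT = """region,palmitic,oleic
1,10.0,70.0
1,12.0,72.0
2,14.0,75.0
3,11.0,78.0
"""


def acid_range(csv_source, acid):
    """Load a CSV and return the (min, max) of one acid's column."""
    df = pd.read_csv(csv_source)
    return df[acid].min(), df[acid].max()


# Write the analysis once, then reuse it on another column (or another file).
palmitic_range = acid_range(io.StringIO(CSV_TEXT), "palmitic")
oleic_range = acid_range(io.StringIO(CSV_TEXT), "oleic")
print(palmitic_range, oleic_range)
```

Pointing acid_range at a different file should only take a new path, as long as the column names match.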

Pandas is also good for analysis because its plotting support is strong. We were able to create semi-transparent, overlapping histograms like these:

[Figure: acids_by_region]

With code that looks like this (comments follow the # signs):

import matplotlib.pyplot as plt

# acidlist is defined elsewhere, and is a list of the acids we're interested in.
# df (the olive oil data) and rmap (the region labels) are also defined elsewhere.

# Sets up the framework for the graphs: one row per acid.
fig, axes = plt.subplots(figsize=(10, 20), nrows=len(acidlist), ncols=1)

i = 0  # Sets a counter to 0

for ax in axes.flatten():  # Starts a loop to go through our plot and render each row
    acid = acidlist[i]
    # seriesacid is df[acid], a Series holding the percent composition of the current acid in each olive oil.
    seriesacid = df[acid]
    # The minimum and maximum values plotted will be the minimum and maximum percentages that we find in the data.
    minmax = [seriesacid.min(), seriesacid.max()]
    for k, g in df.groupby('region'):  # Starts a loop in the loop to plot the values by region
        style = {'histtype': 'stepfilled', 'alpha': 0.5, 'label': rmap[k], 'ax': ax}
        g[acid].hist(**style)
        ax.set_xlim(minmax)
        ax.set_title(acid)
        ax.grid(False)
    # Construct legend
    ax.legend()
    i = i + 1  # Increments the counter, to move the loop on to the next acid.

fig.tight_layout()
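Since df, acidlist, and rmap come from elsewhere, the code above won't run on its own. Here's a self-contained sketch of the same plot that I believe should run as-is. The data values and region labels are invented placeholders, not the real olive oil numbers, and I've swapped the manual counter for zip, which is a substitution of mine rather than what we did in class.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so no display window is needed

import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-ins for the class's df, acidlist and rmap.
df = pd.DataFrame({
    "region": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "palmitic": [13.0, 14.0, 13.5, 11.0, 11.5, 10.5, 10.0, 11.0, 10.5],
    "oleic": [70.0, 71.0, 72.0, 73.0, 72.5, 74.0, 77.0, 78.0, 79.0],
})
acidlist = ["palmitic", "oleic"]
rmap = {1: "Region 1", 2: "Region 2", 3: "Region 3"}  # placeholder labels

fig, axes = plt.subplots(figsize=(10, 4 * len(acidlist)),
                         nrows=len(acidlist), ncols=1)

# zip pairs each subplot with its acid, so no counter is needed.
for ax, acid in zip(axes.flatten(), acidlist):
    # One semi-transparent histogram per region, all on the same axes.
    for k, g in df.groupby("region"):
        g[acid].hist(ax=ax, histtype="stepfilled", alpha=0.5, label=rmap[k])
    # Per-axes settings only need to happen once, after the region loop.
    ax.set_xlim(df[acid].min(), df[acid].max())
    ax.set_title(acid)
    ax.grid(False)
    ax.legend()

fig.tight_layout()
```

Calling fig.savefig with a filename afterwards should write the figure to disk if you want to look at it.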

The comments are my own, so they reflect my current, imperfect understanding of what’s going on. I need more practice with more data, but this feels like a good start, and I’ve got a bunch of reusable code from class to work from. Also, if you want to avoid palmitic acid, which has “convincing” evidence linking it to heart disease, stay away from olive oil from Southern Italy, apparently.
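A comparison like the palmitic acid one doesn't have to be read off a graph. Here's a rough sketch of checking it with a group-by, assuming a DataFrame with region and palmitic columns. The numbers and the region names A, B, and C are invented, so the output says nothing about real olive oils.

```python
import pandas as pd

# Invented numbers and placeholder region names, for illustration only.
df = pd.DataFrame({
    "region": ["A", "A", "B", "B", "C", "C"],
    "palmitic": [13.0, 14.0, 11.0, 11.5, 10.0, 11.0],
})

# Average palmitic acid percentage per region, highest first.
mean_palmitic = (
    df.groupby("region")["palmitic"].mean().sort_values(ascending=False)
)
print(mean_palmitic)

# The first index label is the region with the highest average.
highest = mean_palmitic.index[0]
```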

Don’t forget to check out the notes, here!
