Starting with Excel: Data Manipulations and Graphing

Lynn Cherny (@arnicas on Twitter) shared her Excel expertise with us at Round 2 of Data Scientist Training for Librarians (#DST4L) on August 28th.  It was just 4 days after our 2-day Software Carpentry Bootcamp where we were drilled on Unix, Python, and Git skills.

The Windows version of Excel is very different than in years past.  It’s new and improved while Mac users complain that Excel on a Mac is a dark place.  With sympathy for our Mac-using colleagues, Excel is a great place to start our data wrangling.  Before getting started with Excel, you may consider using Unix commands like “head” and “tail” to confirm your data is well-formed.

Lynn began by listing some benefits of using Excel:  It’s ubiquitous so it’s a file type that anyone can use.  It’s great for initial data exploration.  You can take a first look at the data and get a sense of patterns and what cleaning or validating your data may need.  You can create reports.  You can upload Excel data to Google docs for visualization sharing or use its data as input for web visualization.

Sorting and filtering features and removing duplicates are basic necessities.  Lynn reviewed those features then moved to more nerdy data manipulations.  A common data issue is dealing with problematic characters such as tabs, commas, and quotes. Lynn showed us a trick to import a file with problem characters into Excel as tab delimited data.  Next the file was saved as a comma-separated csv file.  The csv file now had the problem characters embedded in their respective fields as desired.  I used this tip the day after class to prepare data that had embedded spaces so I could parse it properly.  The class worked enthusiastically through many exercises using the Paris rainfall, crime statistics, and hunting accident data that was provided to reinforce these features.

Fun with formulae

Lynn gave a quick overview of the LEN and SUM functions before detailing the transpose function and outlining concatenate, replace, and conditional functionality.  After adding columns for seasonal and annual rainfall to our Paris_Rainfall.xlsx file, using trend lines becomes possible!  They really jazz up a report and show trends.

Sparklines from Paris Rainfall data

 

Stringing functions together provides powerful data transformation.  We learned how to use the RIGHT, LEN, and FIND commands individually before copying them together into a new column.  This allows you to be sure your functions perform as expected before using them in report or production environment.  Here’s a photo of a label from counterfeit jeans where the counterfeiters had a bug in their Excel formula.  The label shows the Excel formula in red font instead of the counterfeit brand name!

Excel formula appears instead of data

 

What do we want?  Pivot Tables! What do pivot tables want? 

Pivot tables want tall, raw data (i. e. unprocessed data)

Pivot tables are incredibly important. They allow you to bin or group data together.  They help you answer questions about your data.  Using our hunting accidents data, we created pivot tables showing the location of wounds, body parts that were injured, number of fatalities, age of victim, age of shooter, etc.  Information about any Pivot Table Field can be displayed by dragging and dropping it into the pivot table.  In this case, the data was disturbing, but really useful to explore the power of pivot tables.

Final note

Excel is a powerful tool.  There is no need to be shy or embarrassed about using Excel.

Homework, References, and Notes

We went through a ton of material in our 3 hour session.  Lynn included homework in the ExcelClassArchive that we downloaded for class.  It has more exercises to try and includes these references:

 

This class was held at 90 Mount Auburn St. where I saw some of the work that goes into setting up the class for the day.  Thanks to Chris Erdmann, Michelle Durocher, the DST4L volunteers from the previous rounds and of course Lynn Cherny!

Leave a Reply