Web Scraping 101

This week we continued on our Data Scientist journey by exploring the basics of web scraping. Our intrepid leader was again Tom Morris, who demonstrated two of the many ways to scrape web pages. While there are plenty of online tools for this task, the class focused on OpenRefine and Python. Before starting with OpenRefine, we explored the elements of a web page using “Inspect Element” in Chrome (or the Firebug extension in Firefox). This tool made it easy to determine selector expressions. In OpenRefine, we used the “Fetch” command to import the HTML data to be manipulated, and elements were parsed based on their attributes. Altogether, OpenRefine provides a relatively “easy” way to import and manipulate data.
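To make “selector expressions” concrete, here is a rough Python illustration of what one does, using the lxml and cssselect modules we met later in class. The HTML snippet is made up for the example, not a page we actually scraped:

```python
# A small, self-contained example of a CSS selector expression at work.
# The HTML below is invented purely for illustration.
from lxml import html

page = html.fromstring("""
<ul>
  <li class="talk" data-speaker="Tom">Web Scraping 101</li>
  <li class="break">Coffee</li>
</ul>
""")

# "li.talk" matches <li> elements whose class attribute is "talk" --
# the same kind of expression that "Inspect Element" helps you discover.
for item in page.cssselect("li.talk"):
    print(item.get("data-speaker") + ": " + item.text)
```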

From OpenRefine, we then moved on to building a web scraper in Python. We discovered a variety of Python modules, imported them, and explored their functions, including requests, requests_cache, lxml, BeautifulSoup, and cssselect. Tom also showed us how to import part of a module instead of the whole thing with the “from x import y” statement. We began using IPython Notebook as well, a helpful tool for editing and sharing our Python work. I look forward to putting what we learned today into practice, as web scraping will likely be the best method to retrieve the data for our group’s project involving the Belfer Center.
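For reference, here is a minimal sketch of how those modules fit together. The URL is a placeholder (example.com), not something we scraped in class:

```python
# A minimal sketch tying the class's modules together.
import requests
import requests_cache
from bs4 import BeautifulSoup  # "from x import y": one name, not the whole module

# Cache responses on disk so repeated runs don't re-fetch the page.
requests_cache.install_cache("scraping_demo")

response = requests.get("http://example.com/")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pull out every link's address, much like parsing elements
# by attribute in OpenRefine.
for link in soup.find_all("a"):
    print(link.get("href"))
```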

Notes for this class can be found here.
