Naive Bayes and Friends

In our 9th week of #DST4L, Rahul Dave built on the previous week’s crash course in statistics with a deep dive into machine learning, classifiers, and information retrieval.

First, we worked through further applications of scikit-learn, using a common pattern for classification:

  • Create a classifier
  • Fit the classifier with training values
  • Run classifier.predict on a set of sample values
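The three steps above can be sketched as follows. This is a minimal illustration using a built-in scikit-learn toy dataset; the dataset and classifier choice here are mine, not from the class notes:

```python
# A minimal sketch of the create / fit / predict pattern in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)  # 1. create a classifier
clf.fit(X, y)                            # 2. fit it with training values
predictions = clf.predict(X[:5])         # 3. run predict on sample values
print(predictions)
```

The same pattern works for nearly every estimator in scikit-learn, which is what makes swapping classifiers in and out so easy.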

We largely focused on logistic regression, “a probabilistic model that links observed binary data to a set of features”.  In addition to using scikit-learn to do these analyses, we used matplotlib (whose creator sadly passed away last year) to draw classification boundary plots that visually represent how the classifier was working.  We spent some time learning how to automate cross-validation with the Grid Search method, leading to a better understanding of how to tune a model for optimal prediction.  This led into a conversation about bias and variance, and the corresponding problems of underfitting and overfitting a model.
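The Grid Search idea can be sketched like this: cross-validation is run for each candidate value of the regularization parameter, and the best-scoring one is kept. The dataset and parameter grid are illustrative choices of mine, and this uses the current `sklearn.model_selection` module rather than whatever version the class used:

```python
# A sketch of automated cross-validation with GridSearchCV.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularization strengths to try
    cv=5,                                      # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

A very large `C` (weak regularization) tends toward overfitting (high variance), while a very small `C` tends toward underfitting (high bias); the grid search picks the middle ground empirically.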



In addition to the discussion about regression and fitting, we branched into document analysis and reviewed the Vector Space Model (further described in this free book on Information Retrieval), which converts textual statistics into a model for geometric analysis.  From this foundation, we learned about Naive Bayes classifiers, which can be used in sentiment detection, classifying documents into categories, etc.  There is plenty to talk about in this space, but we started with vectorizers like the Count Vectorizer, again provided by the powerful scikit-learn.
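Putting the two ideas together, here is a small sketch of the vector space model feeding a Naive Bayes classifier: `CountVectorizer` turns documents into term-count vectors, and `MultinomialNB` classifies them. The toy documents and sentiment labels are made up for illustration:

```python
# Documents -> count vectors (vector space model) -> Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["great food and great service",
        "terrible food, slow service",
        "loved the food",
        "awful experience, terrible"]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(docs)   # sparse matrix of term counts

clf = MultinomialNB()
clf.fit(X, labels)

# New documents must go through the SAME fitted vectorizer.
test = vec.transform(["great service"])
print(clf.predict(test))      # → [1]
```

Note that `transform` (not `fit_transform`) is used on new text, so the vocabulary learned from the training documents is reused.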



We were sent off with homework: build a Naive Bayes predictor for whether a Yelp review was “fresh” or not (borrowing the Rotten Tomatoes terminology), making use of the Yelp Phoenix Dataset.

Next week promises even more exploration in these areas!

Class notes in PDF
