Modeling 3: Intro to machine learning with scikit-learn

Intro and Objectives

The scikit-learn module (sklearn, for short) is a full-featured Python module for data analysis and predictive modeling. We’ll do a brief overview of this widely used module and get a bit more exposure to statistical learning algorithms. We’ll also explore an unsupervised learning technique, K-means cluster analysis (via R and then via Python using scikit-learn).


We are going to work through a series of tutorials exploring the topic of machine learning with scikit-learn (and a little R).

Intro to machine learning with scikit-learn

Start by opening the notebook ml_sklearn_intro_5470.ipynb, which you can find in the ml_sklearn_intro subfolder of our downloads file. The main topics we’ll be covering include:

  • an introduction to the scikit-learn package

  • review of basic machine learning concepts

  • scikit-learn API details

  • overfitting and underfitting - the “bias-variance tradeoff”

  • more rows, more columns, more model complexity?

You’ll see that we’ll be using several notebooks from the PDSH text. All of the necessary files are included in the ml_sklearn_intro subfolder.
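
To make the API and overfitting topics above a little more concrete, here is a minimal sketch (it is not taken from the course notebooks). It assumes a made-up noisy sine dataset and an arbitrary choice of polynomial degrees. The pattern of creating an estimator, calling fit, and then calling score is the usual scikit-learn workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(60, 1))            # features matrix: 60 rows, 1 column
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=60)

# Hold out half of the rows so we can score on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

scores = {}
for degree in (1, 4, 12):                      # low, moderate, high model complexity
    # Every estimator follows the same steps: instantiate, fit, then predict/score.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)   # R^2 on the data used for fitting
    test_r2 = model.score(X_test, y_test)      # R^2 on the held-out data
    scores[degree] = (train_r2, test_r2)
    print(f"degree={degree:2d}  train R^2={train_r2:.3f}  test R^2={test_r2:.3f}")
```

A degree-1 model is too simple to follow the curve (underfitting). Higher degrees fit the training half better and better, but a gap between the training score and the held-out score typically opens up, which is the sign of overfitting.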

I broke the screencasts up into several chunks.

Unsupervised learning with R and Python

We take a brief look at both cluster analysis and principal components analysis (PCA). Everything can be found in the unsupervised_learning subfolder.
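
As a rough preview of what these two techniques look like in scikit-learn, here is a minimal sketch. It uses the small wine dataset bundled with scikit-learn (load_wine), which is an assumption made for illustration and is not necessarily the same data used in the course materials. The choice of 3 clusters and 2 components is also just an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the features only; unsupervised methods do not use the labels.
X, _ = load_wine(return_X_y=True)

# Put all columns on a comparable scale so no single feature dominates distances.
X_scaled = StandardScaler().fit_transform(X)

# K-means: assign each wine to one of 3 clusters (3 is an assumed value).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# PCA: project the 13 scaled features down to 2 components, e.g. for plotting.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("cluster labels for the first 10 wines:", labels[:10])
print("shape after PCA:", X_2d.shape)
print("share of variance kept by each component:", pca.explained_variance_ratio_)
```

Both estimators follow the same fit-based API as the supervised models. The difference is that no target values are passed in.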

Start by reading through the following notebook.

  • unsupervised_intro.ipynb

Then, let’s bounce back to R and do a little clustering of wines.

  • WineCluster/ - cluster analysis in R

Finally, see how cluster analysis can be used to recolor image files and learn a little bit about image processing.

  • clusterviz/clustercolors.ipynb - cluster analysis in Python for recoloring images

  • WARNING: Don’t try more than about 5 colors for this example; otherwise the cluster analysis can take a long time to run. :)
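
The general idea behind recoloring an image with cluster analysis can be sketched as follows. This is not the code from the notebook; it is a minimal illustration that assumes a small random RGB array as a stand-in for a real photo, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a photo (an assumption): a 40 x 60 RGB image with values in [0, 1].
rng = np.random.RandomState(0)
image = rng.rand(40, 60, 3)

n_colors = 5  # number of colors to keep in the recolored image

# Treat every pixel as one observation with three features (R, G, B).
pixels = image.reshape(-1, 3)

# Cluster the pixels; each cluster center is one color in the reduced palette.
kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with the center of its cluster, then restore the image shape.
recolored = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)

print("original shape:", image.shape)
print("distinct colors after recoloring:",
      len(np.unique(recolored.reshape(-1, 3), axis=0)))
```

Every pixel is a row in the clustering problem, so the work grows with both the image size and the number of colors requested.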

In the clustercolors.ipynb notebook, we’ll take a picture of a Blackburnian Warbler and use cluster analysis to recolor it using only a small number of colors.


Using R in Jupyter Notebooks via rmagic (OPTIONAL)

If you want to see how you can use R from within a Jupyter notebook, check out the exploring_rmagic.ipynb notebook in the rmagic subfolder. You’ll see that you need to pip install the rpy2 module first.

There are two additional notebooks in there that demo uses of rmagic to run R commands from Jupyter notebooks.

