Data analysis and plotting in Python

Intro and Objectives

Pandas, developed by Wes McKinney, is the “go to” library for data manipulation and analysis in Python. It’s not really a statistics library (à la R); for that, StatsModels is currently the Python library of choice. For more advanced work such as machine learning and data mining algorithms, scikit-learn is the go-to Python module.

The de facto standard plotting library for Python is matplotlib, and it’s one of the key reasons that Python has become such a major force in the analytics world. Below, you’ll also find information on Seaborn, a newish plotting package that uses matplotlib under the hood but provides an easier-to-use, high-level interface for common plotting tasks. Another option for visualization is Bokeh, which is designed for creating interactive graphs that are presented in a web browser. Plotly has similar tools for interactive, web-based Python plotting.

Readings

Python Data Science Handbook (PDSH) - Ch 3 is on pandas, Ch 4 is matplotlib

Downloads and other resources

Activities

Start with this intro to the session: SCREENCAST: Overview of session

This is the primary notebook that we’ll use to learn the basics of pandas and matplotlib. We’ll also be looking at a few of JVP’s notebooks on matplotlib.

  • ORSchedLeadTime_Python.ipynb in ORSchedLeadTime_Python/

Clear the output before starting. This notebook covers the basics of pandas and matplotlib.
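
As a rough preview of the kind of workflow the notebook walks through, here is a minimal sketch that builds a small DataFrame, summarizes it with a groupby, and plots the result with matplotlib. The data and the column names (service, lead_time_days) are made up for illustration and are not taken from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: scheduling lead times (in days) for a few services
df = pd.DataFrame({
    "service": ["Ortho", "Ortho", "Cardiac", "Cardiac", "General", "General"],
    "lead_time_days": [12, 20, 5, 9, 14, 10],
})

# Mean lead time by service, sorted from shortest to longest
mean_lead = df.groupby("service")["lead_time_days"].mean().sort_values()

# pandas plotting methods call matplotlib and return a matplotlib Axes object
ax = mean_lead.plot(kind="bar", title="Mean lead time by service")
ax.set_xlabel("Service")
ax.set_ylabel("Days")
plt.tight_layout()
plt.savefig("mean_lead_time.png")
```

In a notebook you would normally skip the backend line and the savefig call and just let the figure display inline.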

Back in April 2020, I developed a Jupyter notebook to automate the process of downloading and processing daily Covid-19 case data. The processing steps included adding new fields, reshaping the data to make plotting easier, and organizing the data in a way that facilitated analysis. Finally, the notebook produced faceted plots of cases by county in the state of Michigan and by state for the entire US. The plotting is done using matplotlib and Seaborn. I’ve adapted that notebook for use in this class by cleaning things up and adding a large amount of explanatory text. It’s a good example of a very real use of Python for data analysis and includes more advanced techniques than the first introductory notebook.
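
To give a feel for the reshaping and "adding new fields" steps described above, here is a small sketch using pandas. It is not the code from the notebook: the case counts are invented and the column names (county, date, cum_cases, new_cases) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical cumulative case counts in "wide" format: one column per date.
# The county names are real Michigan counties, but the numbers are invented.
wide_df = pd.DataFrame({
    "county": ["Oakland", "Wayne"],
    "2020-04-01": [100, 200],
    "2020-04-02": [130, 260],
    "2020-04-03": [150, 300],
})

# Reshape to "long" format: one row per county per date
long_df = wide_df.melt(id_vars="county", var_name="date", value_name="cum_cases")
long_df["date"] = pd.to_datetime(long_df["date"])
long_df = long_df.sort_values(["county", "date"]).reset_index(drop=True)

# Add a new field: daily new cases, computed separately within each county
# (the first date for each county has no prior value, so it is NaN)
long_df["new_cases"] = long_df.groupby("county")["cum_cases"].diff()

print(long_df)
```

Long format like this is generally what Seaborn's faceting tools expect. For example, a call along the lines of sns.relplot(data=long_df, x="date", y="new_cases", col="county", kind="line") would draw one panel per county.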

Note

Before launching JupyterLab, activate the datasci conda environment so that you can pip install the us package. It’s a really useful package for working with things like state abbreviations and FIPS codes.

Install it and then launch JupyterLab with the following commands:

$ conda activate datasci
$ pip install us
$ conda deactivate
$ conda activate jupyter
$ jupyter lab

Now you can go through the c19_data_wrangle_viz.ipynb notebook.

Optional activities

Here are some additional notebooks that you can check out if you are interested in the topic.

  • Visualization_Techniques_Seaborn.ipynb in Final_Project_SeabornPlotting/

    • Seaborn is a newish visualization library built on top of matplotlib

    • this is a student final project

  • pandas_ch2_movielens.ipynb in movielens/

    • based on Ch 2 of Wes McKinney’s pandas book, Python for Data Analysis

    • shows table merging and includes pointers to other pandas tutorials written from a SQL point of view

  • datetime_exploring.ipynb in datetime/

    • Python has terrific libraries for dealing with time series (pandas). However, there be dragons in the confluence of pandas, numpy, and base Python date and time handling. Given the ubiquitous nature of datetime data in business, slaying these dragons is a calling we cannot avoid. I cover this topic in quite a bit of detail in my MIS 4900/6900 - Advanced Analytics with Python course. You can find the full set of datetime-related subtopics and screencasts at http://www.sba.oakland.edu/faculty/isken/courses/mis6900_s21/datetime_occupancy.html

  • templogger_batch.py in temp_logging_pcda/

    • Short focused example of using pandas and matplotlib for automated data processing

    • includes file globbing
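
For readers who have not seen file globbing combined with pandas, the sketch below shows the general pattern: find all files matching a wildcard pattern, read each one, and stack them into a single DataFrame. This is not the code from templogger_batch.py. The file names, the column names (timestamp, temp_c), and the temperature values are all hypothetical, and the files are created in a temporary folder so that the example is self-contained.

```python
import glob
import os
import tempfile

import pandas as pd

# Create a couple of hypothetical log files so the example can run on its own
tmp_dir = tempfile.mkdtemp()
for day, temps in [("01", [20.5, 21.0]), ("02", [19.0, 22.0])]:
    path = os.path.join(tmp_dir, f"temps_2021-03-{day}.csv")
    pd.DataFrame({
        "timestamp": [f"2021-03-{day} 08:00", f"2021-03-{day} 14:00"],
        "temp_c": temps,
    }).to_csv(path, index=False)

# File globbing: collect every CSV file whose name matches the pattern
csv_files = sorted(glob.glob(os.path.join(tmp_dir, "temps_*.csv")))

# Read each file and stack the results into a single DataFrame
frames = [pd.read_csv(f, parse_dates=["timestamp"]) for f in csv_files]
all_temps = pd.concat(frames, ignore_index=True)

# Daily mean temperature across all of the files
daily_mean = all_temps.set_index("timestamp")["temp_c"].resample("D").mean()
print(daily_mean)
```

The same loop works unchanged whether the folder holds two files or two hundred, which is what makes globbing handy for automated batch processing.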

Explore (Optional)

Pandas

Visualization

  • Python Charts

  • Python data viz cookbook - nice little interactive web site for generating common plots in pandas, matplotlib, Seaborn, and plotly.

  • Effectively using matplotlib

  • Python plotting with matplotlib

  • Seaborn: statistical visualization

    Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels. Some of the features that seaborn offers are

    • Several built-in themes that improve on the default matplotlib aesthetics

    • Tools for choosing color palettes to make beautiful plots that reveal patterns in your data

    • Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data

    • Tools that fit and visualize linear regression models for different kinds of independent and dependent variables

    • Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices

    • A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate

    • High-level abstractions for structuring grids of plots that let you easily build complex visualizations

  • Modern Pandas: Visualization - This is actually Part 6 of a series of blog posts on modern use of pandas. It gives a good overview of the landscape of Python plotting with matplotlib, pandas, and Seaborn and focuses on how Seaborn is a great direction for those looking for a plotting package that supports exploratory data analysis.

  • The magic of matplotlib stylesheets
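
As a quick taste of the style sheet idea mentioned in the last item, here is a minimal sketch using matplotlib's built-in "ggplot" style. It is not taken from the linked article. The data and the output file name are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; not needed in Jupyter
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# plt.style.available is a list of the style sheet names matplotlib knows about
print("ggplot" in plt.style.available)

# Apply a style temporarily; the previous settings come back when the block exits
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot(x, y, marker="o")
    ax.set_title("Styled with ggplot")
    fig.savefig("styled_plot.png")
```

Using plt.style.use("ggplot") instead of the with block would apply the style to every plot that follows in the session.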

Applications