Text wrangling with Python

Intro and Objectives

Now that we’ve got some basic Python hacking skills and have learned a little about ingesting data files of various types, we are going to learn some more advanced data cleaning techniques such as regular expressions (regex) and even “fuzzy matching”. The days of analyzing purely numeric data are over, and text mining skills are a valuable addition to your toolkit. These topics will start us on that part of our journey.
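To make those two ideas a bit more concrete, here is a minimal sketch that uses only the Python standard library. The sample names, the canonical list, and the helper functions normalize and best_match are all made up for illustration, and difflib is just one of several ways to do fuzzy matching, so treat this as a rough starting point rather than the approach used in the readings.

```python
import re
from difflib import get_close_matches

# Hypothetical messy values, invented for this example.
raw_names = ["  St. Mary's   Hosp. ", "Mercy  Med Ctr"]

# Hypothetical list of "clean" names we would like to map onto.
canonical = ["saint marys hospital", "mercy medical center"]


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation characters
    text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace
    return text.strip()


def best_match(name, choices, cutoff=0.6):
    """Return the closest string in choices, or None if nothing is similar enough."""
    matches = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


cleaned = [normalize(name) for name in raw_names]
for name in cleaned:
    print(name, "->", best_match(name, canonical))
```

The regex step handles the mechanical cleanup (case, punctuation, spacing), while the fuzzy step deals with abbreviations and misspellings that no simple pattern would catch. The cutoff of 0.6 is an arbitrary choice here and usually needs tuning against real data.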

Readings

Downloads and other resources

Regex

Web based tools and tutorials

  • RegExr - an HTML/JS-based site for creating, testing, and learning about regular expressions.

  • regex101 - Another nice interactive web-based tool for learning regex.

  • RegexOne - Learn regular expressions with simple, interactive examples.

  • Regular-Expressions.info - One of my go-to sites for regex for a long time now. Very complete, with many examples and substantive explanations.

  • Learning to Use Regular Expressions - Gnosis.cx - This is the site from which I first learned regular expressions. It has been around forever, is widely read, and is quite good.

  • Regular Expressions - A Gentle User Guide and Tutorial - This is a good tutorial, cleverly written, at a greater level of detail than some of the others above. It has a browser-based regex testing tool, and the examples are based on matching parts of server logs, which is a relevant application for our class.

  • Regex Cheat Sheet

  • Regular Expressions: Now You Have Two Problems - Classic blog post on regex and a related famous quote about regex. Good links to some resources on the bare minimum that every analyst/coder/hacker should know about the incredible world of regular expressions.

  • https://xkcd.com/208/
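As a small taste of the server log matching mentioned in the list above, here is a hedged sketch using Python's re module with named groups. The log line is invented for illustration and assumes a Common Log Format style layout; real logs vary, so the pattern would likely need adjusting.

```python
import re

# Hypothetical log line in a Common Log Format style, made up for this example.
log_line = '192.168.1.10 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'

LOG_PATTERN = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) '          # client IP address
    r'\S+ \S+ '                                   # identity and user fields (often "-")
    r'\[(?P<timestamp>[^\]]+)\] '                 # timestamp inside square brackets
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '  # request line: method, path, protocol
    r'(?P<status>\d{3}) (?P<size>\d+|-)'          # status code and response size
)

match = LOG_PATTERN.match(log_line)
fields = match.groupdict() if match else {}
print(fields)
```

Named groups keep the pattern readable and return the pieces as a dictionary, which is convenient to load into a table later. All values come back as strings, so numeric fields such as the status code still need converting.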

Books/Chapters

  • Mastering Regular Expressions (Friedl) - The definitive book on regex.

  • Regular Expressions Cookbook (Goyvaerts & Levithan)

  • Chapter 16 in R for Everyone covers string manipulation and shows how regex can be used in R.

  • Chapter 7 in Data Wrangling with Python shows how regex can be used within Python.

  • Whirlwind Tour of Python - pp. 76-91

Activities

Explore

More on learning to program in Python

These are based on the Software Carpentry tutorials on programming with Python and cover slightly more advanced topics. We won’t cover them in class, but they are useful for going beyond basic programming.