Organizing your data

Intro and Objectives

We often work with tabular data in spreadsheets - it’s kind of what spreadsheets were designed for. There are some data organization principles that can help you organize tabular data in a way that:

  • makes it easier to analyze

  • helps avoid errors

  • makes it easier to share with collaborators

  • makes it easier to move between other analysis tools (e.g. Tableau, R, Python)

  • makes you a better Excel user

  • makes you a better analyst

Readings

The following blog post does a terrific job of articulating important spreadsheet organization principles. The post led to a journal article in The American Statistician. You can find a PDF of the journal article at the link below (and in the Downloads file).

In addition, the folks at Data Carpentry have a freely accessible tutorial on spreadsheet organization principles that they aimed at ecologists. Many good tips to be found here.

Downloads

Downloads-Module04-DataOrg-MiscPowerExcel.zip

Activities

In the Downloads file you’ll find a subfolder named DataOrgPrinciples. Within there you’ll find a file named DataOrg.xlsm. Get that file open and this series of screencasts will cover some important data organization principles using this file.

We’ll start with an overview of the data org blog post by Broman and an intro to the file I’ve created to illustrate some of the important data org principles.

Let’s learn about tidy data and also explore doing some computations with that data.

What’s the difference between wide and long data? How do we reshape data to get it into the form we need for analysis?

What to do about missing data? What is #N/A?

Dates - you really need to be careful with date formats, especially when sharing data. We’ve already learned how Excel stores dates as integers (or “serial values”). There’s more chapters to this story.

I highly encourage you to read through the entire blog post by Broman. Much wisdom in there.

Explore (OPTIONAL)

Here are some more links to material on good (and bad) spreadsheet practices.