UCB — Fall 2013 / Week 8

Facing The Abyss

How to Probe Unknown Data

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible.

Here’s a basic checklist to get more familiar with any data set.

We’ll go over this checklist using a test data set about the Chicago teachers' union strike, but you can follow along with your own data set too, if you like.

  1. Set your working directory.

    setwd("~/path-to-your-folder")
    
  2. Get your data loaded into R and make sure text columns come in as strings, not factors.

    # csv
    data <- read.csv("path-to-file", stringsAsFactors=FALSE)
    
    # tsv or txt (tab-delimited)
    data <- read.delim("path-to-file", stringsAsFactors=FALSE)
    
    # excel file (needs the gdata package installed)
    library(gdata)
    data <- read.xls("path-to-file", stringsAsFactors=FALSE)
    
  3. Know how many rows and columns you have. Know what each column is and what each row represents.

    dim(data)
    names(data)
    
  4. Make sure all your columns are the correct data type (numbers, strings, factors, dates). See “Changing data types in R.”
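
Here’s a rough sketch of what checking and changing types can look like. The data frame and its column names (`school`, `students`, `opened`, `type`) are made up for illustration, so swap in your own.

```r
# Toy data frame standing in for your own data (all values are made up)
data <- data.frame(
  school   = c("A", "B", "C"),
  students = c("1,200", "850", "430"),   # numbers that were read in as text
  opened   = c("2001-09-04", "1998-08-31", "2010-09-07"),
  type     = c("charter", "public", "public"),
  stringsAsFactors = FALSE
)

# See the type of every column at once
str(data)
sapply(data, class)

# text -> number (strip the thousands separator first)
data$students_num <- as.numeric(gsub(",", "", data$students))

# text -> Date (the format string must match how your dates are written)
data$opened_date <- as.Date(data$opened, format = "%Y-%m-%d")

# text -> factor, for columns with a small set of categories
data$type_factor <- factor(data$type)

# Check the result
sapply(data, class)
```

The converted values go into new columns here so the originals stay around for comparison.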

  5. If you have dirty data, clean it and put the results in new columns. (We did this with the Chicago guns exercise and have examples in the Data field guide.)
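
As a minimal sketch of cleaning into new columns: the data below is invented, and the `_clean` column names are just one possible convention, not something from the field guide.

```r
# Toy data (made up) with the usual kinds of mess
data <- data.frame(
  district = c(" North", "north ", "SOUTH", "n/a"),
  salary   = c("$71,000", "$68,500", "unknown", "$74,250"),
  stringsAsFactors = FALSE
)

# Standardize text: trim whitespace, lower-case, turn placeholders into NA
data$district_clean <- tolower(trimws(data$district))
data$district_clean[data$district_clean == "n/a"] <- NA

# Strip "$" and "," then convert to numbers.
# Anything that still isn't a number (like "unknown") becomes NA.
data$salary_clean <- suppressWarnings(as.numeric(gsub("[$,]", "", data$salary)))

# Compare the original and cleaned columns side by side
data
table(data$district_clean, useNA = "ifany")
```

Keeping the dirty columns untouched means you can always check what a cleaned value started out as.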

  6. View some summary stats on your numeric columns (min, max, mean).

    summary(data)
    
    summary(data$column)
    
    table(data$column)
    


  7. Chart histograms of your numeric columns (distributions)

    hist(data$column_name)
    
  8. Chart the counts of each value in your categorical columns (gender, color, etc.)

    plot(table(data$column), type="h")
    
  9. If you have any date or time columns, plot your quantitative data against them to see if there’s any relationship over time.
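
Something like this could work, assuming a date column stored as text and one numeric column. The names `date` and `value` and the numbers themselves are invented for the example.

```r
# Toy data (made up): one row per observation
data <- data.frame(
  date  = c("2013-03-15", "2013-01-15", "2013-04-15", "2013-02-15"),
  value = c(9, 12, 21, 18),
  stringsAsFactors = FALSE
)

# Dates need to be real Date objects, not strings, to plot in order
data$date <- as.Date(data$date)

# Sort by date so the line is drawn left to right
data <- data[order(data$date), ]

# Time on the x axis, the quantity on the y axis
plot(data$date, data$value, type = "l", xlab = "Date", ylab = "Value")
```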

  10. Plot a time series, but aggregated up to a different time frame (days to months, months to years)
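
One way to roll daily rows up to months is to build a month label with `format()` and sum with `aggregate()`. This is only a sketch: the data is made up, and summing is an assumption (you might want `mean` or a count instead).

```r
# Toy daily data (made up)
data <- data.frame(
  date  = as.Date(c("2013-01-03", "2013-01-20", "2013-02-11",
                    "2013-02-25", "2013-03-08")),
  value = c(5, 7, 3, 4, 10)
)

# Label each row with its year and month, e.g. "2013-01"
data$month <- format(data$date, "%Y-%m")

# Add up the values within each month
monthly <- aggregate(value ~ month, data = data, FUN = sum)
monthly

# Plot the aggregated series
barplot(monthly$value, names.arg = monthly$month)
```

Use `"%Y"` instead of `"%Y-%m"` to aggregate up to years.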

  11. Look for correlations and outliers

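    # Plots all the pairwise combinations of your numeric columns
    # (pairs() stops with an error if you pass it string columns)
    pairs(data[sapply(data, is.numeric)])
    
    # If you want to plot just one of the combinations for more detail
    plot(data$column1, data$column2)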
    
  12. Look for outliers in these plots.
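
If you want a number to back up what you see in the plots, one common approach (not part of the checklist above) is the 1.5 × IQR rule that boxplots use. A minimal sketch with made-up values:

```r
# Made-up values with one obviously odd point
values <- c(10, 12, 11, 13, 12, 95, 11, 10)

# The boxplot draws outliers as separate points
boxplot(values)

# boxplot.stats() returns those same points in $out
out <- boxplot.stats(values)$out
out

# Which positions in the original vector do they come from?
which(values %in% out)
```

An outlier isn’t automatically an error, so look up those rows before deciding whether to keep them, fix them, or report on them.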