Facing The Abyss
How to Probe Unknown Data
So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?
The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible. For example:
- What month was X highest or lowest?
- How has field Z changed since policy X was implemented?
- What’s the relationship between X and Y?
- What does the distribution of one column of data tell us?
Here’s a basic checklist to get more familiar with any data set.
We’ll go over this checklist with a test data set from a table about the Chicago teachers' union strike, but you can follow along with your own data set too, if you like.
Set your working directory.
setwd("~/path-to-your-folder")
Get your data loaded into R and make sure there are no factors.
# csv
data <- read.csv("path-to-file", stringsAsFactors=FALSE)
# tsv or txt
data <- read.delim("path-to-file", stringsAsFactors=FALSE)
# excel file (needs the gdata package)
library(gdata)
data <- read.xls("path-to-file", stringsAsFactors=FALSE)
Know how many rows and columns you have. Know what each column is and what each row represents.
dim(data)
names(data)
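If you want a bit more than row counts and column names, str() and head() are a quick way to see what each column actually holds. Here is a minimal sketch on a small made-up data frame (the name `toy` and its columns are invented for illustration, not part of the strike data):

```r
# Made-up data frame, for illustration only
toy <- data.frame(
  school     = c("A", "B", "C"),
  enrollment = c(420, 515, 388),
  stringsAsFactors = FALSE
)

str(toy)       # column names, types and the first few values
head(toy, 2)   # the first two rows, so you can see what one row represents

n_rows <- nrow(toy)   # same numbers that dim(toy) reports
n_cols <- ncol(toy)
```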
Make sure all your columns are the correct data type (numbers, strings, factors, dates). See “Changing data types in R”.
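As a rough sketch of what checking and fixing types can look like, here is one way to do it with base R. The data frame `toy` and its columns are made up for this example, and the date format is an assumption you would adjust to match your own file:

```r
# Made-up data; every column arrives as character, as often happens after import
toy <- data.frame(
  amount = c("10", "25", "n/a"),
  day    = c("2020-01-05", "2020-01-06", "2020-01-07"),
  group  = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

sapply(toy, class)   # check the current type of every column

# Values that cannot be parsed as numbers become NA (normally with a warning)
toy$amount <- suppressWarnings(as.numeric(toy$amount))
toy$day    <- as.Date(toy$day, format = "%Y-%m-%d")
toy$group  <- factor(toy$group)

# Count how many values were lost in the conversion
n_missing <- sum(is.na(toy$amount))
```

Counting the NAs right after converting is a cheap way to spot values like "n/a" that did not survive.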
If you have dirty data, clean it and put the results in new columns. (We did this with the Chicago guns exercise and have examples in the Data field guide.)
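Here is a small sketch of the "clean into new columns" idea, so the raw values stay around for comparison. The data frame and the `_clean` column names are made up for illustration:

```r
# Made-up messy data, for illustration only
toy <- data.frame(
  color = c(" red", "Red ", "RED"),
  price = c("$1,200", "$950", "$3,075"),
  stringsAsFactors = FALSE
)

# Keep the raw columns and write the cleaned values to new ones
toy$color_clean <- tolower(trimws(toy$color))            # strip spaces, unify case
toy$price_clean <- as.numeric(gsub("[$,]", "", toy$price))  # drop "$" and ","

table(toy$color_clean)   # all three rows should now fall under one value
```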
View some summary stats on your numeric columns (min, max, mean)
# every column at once
summary(data)
# one column
summary(data$column)
# counts of each distinct value in a column
table(data$column)
Chart histograms of your numeric columns (distributions)
hist(data$column_name)
Chart the counts of each value in your categorical columns (gender, color, etc.)
plot(table(data$column), type="h")
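If the categories have no natural order, sorting the counts first can make the chart easier to read. A minimal sketch with barplot(), using a made-up vector of colors:

```r
# Made-up categorical data, for illustration only
toy <- data.frame(
  color = c("red", "blue", "red", "green", "red", "blue"),
  stringsAsFactors = FALSE
)

# Count each value, then sort so the most common category comes first
counts <- sort(table(toy$color), decreasing = TRUE)

barplot(counts, ylab = "Rows")
```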
If you have any date or time columns, plot your quantitative data against them to see if there’s any relationship.
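A minimal sketch of that kind of plot, assuming the date column has already been converted to the Date type. The data frame `toy` and its values are made up:

```r
# Made-up daily data, for illustration only
toy <- data.frame(
  day   = as.Date("2020-01-01") + 0:9,
  count = c(3, 5, 4, 6, 8, 7, 9, 12, 11, 13)
)

# Sort by date first so the line is drawn in time order
toy <- toy[order(toy$day), ]

plot(toy$day, toy$count, type = "l", xlab = "Day", ylab = "Count")
```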
Plot a time series, but aggregated up to a different time frame (days to months, months to years)
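One way to roll days up to months in base R is to build a month label with format() and then sum within each label using aggregate(). This is only a sketch on made-up data; the column names are invented, and you might want mean() instead of sum() depending on what the numbers represent:

```r
# Made-up daily data, for illustration only
toy <- data.frame(
  day   = as.Date(c("2020-01-05", "2020-01-20", "2020-02-03",
                    "2020-02-17", "2020-03-09")),
  count = c(2, 3, 4, 4, 6)
)

# Label each row with its year and month, e.g. "2020-01"
toy$month <- format(toy$day, "%Y-%m")

# Add up the counts within each month
monthly <- aggregate(count ~ month, data = toy, FUN = sum)
monthly

barplot(monthly$count, names.arg = monthly$month, ylab = "Count per month")
```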
Look for correlations and outliers
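# plots all the combinations of your numeric columns
# (pairs() stops with an error on character columns, so keep only the numeric ones)
pairs(data[sapply(data, is.numeric)])
# if you want to plot just one of the combinations for more detail
plot(data$column1, data$column2)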
Look for outliers in these plots.
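If you want numbers to go with the pictures, cor() gives a correlation coefficient and boxplot.stats() lists the points a boxplot would draw as outliers (by default, anything more than 1.5 times the box height beyond the box). A rough sketch on made-up data, where the last y value is deliberately extreme:

```r
# Made-up data with one deliberately extreme value, for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 40)
)

# Correlation between the two columns (between -1 and 1)
r <- cor(toy$x, toy$y)

# Draw the boxplot, then pull out the values it flags as outliers
boxplot(toy$y)
outliers <- boxplot.stats(toy$y)$out

# Look at the full rows behind those values
toy[toy$y %in% outliers, ]
```

A flagged point is not automatically a mistake. It is a row worth looking up in the original data before you decide what to do with it.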