UCB — Fall 2013 / Week 8

# Facing The Abyss
## How to Probe Unknown Data

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible.

• What month was X highest or lowest?
• How has field Z changed since policy X was implemented?
• What’s the relationship between X and Y?
• What does the distribution of one column of data tell us?

Here’s a basic checklist to get more familiar with any data set.

We’ll go over this checklist with a test data set from a table about the Chicago teachers' union strike, but you can follow along on your own data set too, if you like.

1. Set your working directory.

``````setwd("~/path-to-your-folder")
``````
2. Get your data loaded into R and make sure there are no factors.

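``````# File names below are placeholders; substitute your own.
# csv
data <- read.csv("your-file.csv", stringsAsFactors = FALSE)

# tsv or txt
data <- read.delim("your-file.txt", stringsAsFactors = FALSE)

# excel file
library(gdata)
data <- read.xls("your-file.xls", stringsAsFactors = FALSE)
``````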
3. Know how many rows and columns you have. Know what each column is and what each row represents.

``````dim(data)
names(data)
``````
4. Make sure all your columns are the correct data type (numbers, strings, factors, dates). See “Changing data types in R”.
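As a minimal sketch of checking and converting column types, the example below uses a small made-up data frame. The column names (`school`, `students`, `opened`) and their values are assumptions for illustration only and are not from the Chicago data set.

```r
# Hypothetical toy data; every column starts out as text
data <- data.frame(
  school   = c("A", "B", "C"),
  students = c("1200", "850", "430"),
  opened   = c("2012-09-04", "2012-09-05", "2012-09-10"),
  stringsAsFactors = FALSE
)

# Inspect the type of every column
str(data)
sapply(data, class)

# Text to numbers
data$students_num <- as.numeric(data$students)

# Text to dates; the format string must match how your dates are written
data$opened_date <- as.Date(data$opened, format = "%Y-%m-%d")

# Text to factor, only when you actually need categories
data$school_factor <- factor(data$school)

# Confirm the new columns have the types you expect
sapply(data, class)
```

Writing the converted values to new columns keeps the originals available in case a conversion silently produces `NA`.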

5. If you have dirty data, clean it and put the results in new columns. (We did this with the Chicago guns exercise and have examples in the Data field guide.)
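One way this might look, assuming a hypothetical `salary` column stored as messy text (the values are invented for illustration):

```r
# Hypothetical messy column; values are made up
data <- data.frame(
  salary = c("$71,000", " $68,500 ", "N/A", "$74,250"),
  stringsAsFactors = FALSE
)

# Work on a copy so the raw column stays untouched
cleaned <- trimws(data$salary)          # drop stray spaces
cleaned[cleaned == "N/A"] <- NA         # mark missing values explicitly
cleaned <- gsub("[$,]", "", cleaned)    # remove dollar signs and commas

# Store the result in a new column
data$salary_clean <- as.numeric(cleaned)

data
```

Keeping `salary` alongside `salary_clean` makes it easy to spot-check that the cleaning did what you intended.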

6. View some summary stats on your numeric columns (min, max, mean)

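``````# summary stats for every column
summary(data)

# summary stats for one column
summary(data$column)

# count how often each value appears in a column
table(data$column)
``````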

More detail here

7. Chart histograms of your numeric columns (distributions)

``````hist(data$column_name)
``````
8. Chart histograms of your categorical columns (gender, color, etc.)

``````plot(table(data$column), type="h")
``````
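As an alternative sketch for categorical columns, base R's `barplot()` can draw the same counts as bars, and sorting the table first puts the most common category on the left. The `color` column and its values here are made up for illustration.

```r
# Hypothetical categorical column
data <- data.frame(
  color = c("red", "blue", "red", "green", "red", "blue"),
  stringsAsFactors = FALSE
)

# Count the rows in each category
counts <- table(data$color)

# Sort so the most common category comes first
sorted_counts <- sort(counts, decreasing = TRUE)

# Draw one bar per category
barplot(sorted_counts, main = "Count per category", ylab = "Rows")
```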
9. If you have any date or time columns, plot your quantitative data against them to see if there’s any relationship.
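A minimal sketch of plotting a value against time, assuming a hypothetical data frame with a `date` column already converted to the `Date` type and a numeric `value` column (both invented for illustration):

```r
# Hypothetical data; dates are deliberately out of order
data <- data.frame(
  date  = as.Date(c("2012-09-17", "2012-09-03", "2012-09-24", "2012-09-10")),
  value = c(18, 12, 9, 30)
)

# Sort by date so the line is drawn left to right
data <- data[order(data$date), ]

# Line chart of value over time, with a point at each observation
plot(data$date, data$value, type = "l", xlab = "Date", ylab = "Value")
points(data$date, data$value)
```

If the date column is still text, the x-axis will not be spaced correctly, so convert it with `as.Date()` first.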

10. Plot a time series, but aggregated up to a different time frame (days to months, months to years)
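One possible way to roll daily rows up to months, sketched with made-up data. The `date` and `value` columns are assumptions, and summing is only one choice of aggregate; a mean or a count may suit your data better.

```r
# Hypothetical daily data
data <- data.frame(
  date  = as.Date(c("2012-08-30", "2012-08-31", "2012-09-04",
                    "2012-09-05", "2012-10-01")),
  value = c(5, 7, 10, 2, 4)
)

# Label each row with its year and month
data$month <- format(data$date, "%Y-%m")

# Total the values within each month
monthly <- aggregate(value ~ month, data = data, FUN = sum)

# Turn the month label back into a date (first of the month) for plotting
monthly$month_start <- as.Date(paste0(monthly$month, "-01"))

plot(monthly$month_start, monthly$value, type = "b",
     xlab = "Month", ylab = "Total value")
```

To aggregate months up to years instead, the same pattern should work with `format(data$date, "%Y")`.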

11. Look for correlations and outliers

``````# plots all the combinations of your numeric columns
# (pairs() stops with an error if you pass it character columns)
pairs(data[sapply(data, is.numeric)])

# if you want to plot just one of the combinations for more detail
plot(data$column1, data$column2)
``````
12. Look for outliers in these plots.
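To back up what you see in the plots with numbers, a rough sketch follows. The data is invented, and the 1.5 × IQR rule used by `boxplot.stats()` is just one common convention for flagging outliers, not the only one.

```r
# Hypothetical data: column2 roughly doubles column1, except the last row
data <- data.frame(
  column1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  column2 = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 40)
)

# Correlation between the two columns (between -1 and 1)
cor(data$column1, data$column2)

# Values that fall outside the boxplot whiskers
outlier_values <- boxplot.stats(data$column2)$out

# Show the full rows that contain those values
data[data$column2 %in% outlier_values, ]

# The same outliers appear as separate points on a boxplot
boxplot(data$column2, main = "column2")
```

An outlier is not necessarily an error. Check the flagged rows against the source before removing anything.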