# Facing The Abyss

## How to Probe Unknown Data

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible. For example:

- What month was X highest or lowest?
- How has field Z changed since policy X was implemented?
- What’s the relationship between X and Y?
- What does the distribution of one column of data tell us?

Here’s a basic checklist to get more familiar with any data set.

We’ll go over this checklist with a test data set from a table about the Chicago teachers' union strike, but you can follow along on your own data set too, if you like.

Set your working directory.

`setwd("~/path-to-your-folder")`

Get your data loaded into R and make sure there are no factors.

For a CSV file:

`data <- read.csv("path-to-file", stringsAsFactors=FALSE)`

For a TSV or TXT file:

`data <- read.delim("path-to-file", stringsAsFactors=FALSE)`

For an Excel file:

`library(gdata)`

`data <- read.xls("path-to-file", stringsAsFactors=FALSE)`

Know how many rows and columns you have. Know what each column is and what each row represents.

`dim(data)`

`names(data)`

Make sure all your columns are the correct data type (numbers, strings, factors, dates). See Changing data types in R.
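As a rough sketch of what checking and converting types can look like, here is a small made-up data frame. The column names `school`, `enrollment`, and `start_date` are hypothetical and are not from the strike data.

```r
# Hypothetical toy data; every column starts out as text
data <- data.frame(
  school = c("A", "B", "C"),
  enrollment = c("410", "325", "298"),
  start_date = c("2012-09-04", "2012-09-04", "2012-09-10"),
  stringsAsFactors = FALSE
)

# See the current type of every column
str(data)
sapply(data, class)

# Convert text to numbers and to dates
data$enrollment <- as.numeric(data$enrollment)
data$start_date <- as.Date(data$start_date, format = "%Y-%m-%d")

# Check the types again after converting
sapply(data, class)
```

If your dates are written differently (say, `09/04/2012`), the `format` string needs to change to match.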

If you have dirty data, clean it and put the results in new columns. (We did this with the Chicago guns exercise and have examples in the Data field guide.)
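Here is one possible way to clean a messy number column into a new column while leaving the original alone. The data frame and its `salary` column are invented for illustration, and the regular expression assumes the only things to throw away are symbols, spaces, and letters.

```r
# Hypothetical toy data with messy salary strings
data <- data.frame(
  name = c("Ann", "Bob", "Cy"),
  salary = c("$52,000", " 61,500 ", "N/A"),
  stringsAsFactors = FALSE
)

# Strip everything except digits and the decimal point
cleaned <- gsub("[^0-9.]", "", data$salary)

# Entries with nothing left (like "N/A") become missing values
cleaned[cleaned == ""] <- NA

# Store the result in a NEW column; the original stays untouched
data$salary_clean <- as.numeric(cleaned)

data
```

Keeping the original column around makes it easy to spot-check that the cleaning did what you expected.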

View some summary stats on your number columns (min, max, mean)

`summary(data)`

`summary(data$column)`

`table(data$column)`


Chart histograms of your numeric columns (distributions)

`hist(data$column_name)`

Chart histograms of your categorical columns (gender, color, etc.)

`plot(table(data$column), type="h")`

If you have any time-based data (dates or timestamps), plot your quantitative data against it to see if there’s any relationship.
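A minimal sketch of plotting a number column against a date column. The data is randomly generated, and the column names `date` and `count` are placeholders for whatever your own data uses.

```r
# Hypothetical daily data: 90 days of made-up counts
set.seed(1)
data <- data.frame(
  date = seq(as.Date("2012-07-01"), by = "day", length.out = 90),
  count = rpois(90, lambda = 20)
)

# Dates on the x-axis, the quantitative column on the y-axis
plot(data$date, data$count, type = "l",
     xlab = "Date", ylab = "Count")
```

This only works well if the date column is a real `Date`, not a string, so convert it first if needed.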

Plot a time series, but aggregated up to a different time frame (days to months, months to years)
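One way to roll daily values up to months is sketched below, again on randomly generated data with placeholder column names. It uses `sum` as the aggregation, which is an assumption; `mean` or `median` may suit your data better.

```r
# Hypothetical daily data: 90 days of made-up counts
set.seed(1)
data <- data.frame(
  date = seq(as.Date("2012-07-01"), by = "day", length.out = 90),
  count = rpois(90, lambda = 20)
)

# Label each row with its year and month, e.g. "2012-07"
data$month <- format(data$date, "%Y-%m")

# Sum the daily counts within each month
monthly <- aggregate(count ~ month, data = data, FUN = sum)

# Bar chart of the monthly totals
barplot(monthly$count, names.arg = monthly$month)
```

Swapping the format string for `"%Y"` would aggregate up to years instead.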

Look for correlations and outliers

Plot all the combinations of all your columns:

`pairs(data)`

If you want to plot just one of the combinations for more detail:

`plot(data$column1, data$column2)`


Look for outliers in these plots.
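Plots are the first stop, but you can also put numbers on what you see. The sketch below uses made-up data with one extreme value planted on purpose. It flags outliers with the boxplot rule (points more than 1.5 times the interquartile range beyond the quartiles), which is just one common convention, not the only definition of an outlier.

```r
# Hypothetical toy data: column2 loosely follows column1,
# with one extreme value planted in row 15
set.seed(42)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 3)
y[15] <- 200
data <- data.frame(column1 = x, column2 = y)

# Correlation coefficient between the two columns (between -1 and 1)
cor(data$column1, data$column2)

# Values beyond the boxplot whiskers (1.5 * IQR rule)
outliers <- boxplot.stats(data$column2)$out
outliers

# Which rows hold those values
data[data$column2 %in% outliers, ]
```

A single extreme value can drag the correlation down a lot, so it is worth rerunning `cor()` with the flagged rows set aside and comparing the two numbers.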