Kevin and Shan’s

Field Guide to Working with Data in the Wild

• Summary stats on a vector — r

You should know your data – every column, every row.

To get the min, max or mean of a quantitative vector

df <- data.frame(val1=c(-34, 45, 34, 69, 7, 99), val2=c(-14, 435, 134, 6, -7, 9))

#min of thr first col
min(df\$val1)
mean(df\$val1)
max(df\$val1)

#or do summary on the whole data frame
summary(df)

If your data is qualitative, you can use table

animals <- c("monkeys", "monkeys", "giraffe", "donkey", "blue whale", "giraffe", "giraffe")

table(animals)

# blue whale     donkey    giraffe    monkeys
#         1          1          3          2

• Sorting — r

#sort data frame on a field name – note ascending order by default.
data <- data[order(data\$fieldname),]

#descending sort
data <- data[order(data\$fieldname, decreasing=TRUE),]

• Joining Two Datasets — r

# field names must match for R to know what to join on
# note: this will DROP RECORDS that don't join; returns only successful joins
merged <- merge(dataframe1, dataframe2, by="field_name")

# to keep all the records in dataframe1 even if they dont join perfectly to dataframe2
merged <- merge(dataframe1, dataframe2, by="field_name", all.x = TRUE)

# to keep all the records in dataframe2 even if they dont join perfectly to dataframe1
merged <- merge(dataframe1, dataframe2, by="field_name", all.y = TRUE)

• Aggregation — r

#to get, say, the total strikeouts per year (from our homework)
agg <- aggregate(data\$strikeouts, list(data\$year), sum)

#to get, say, the total strikeouts per team per year (from our homework)
agg <- aggregate(data\$strikeouts, list(data\$year, data\$team), sum)

# the third argument takes any function (including a custom one)
# fewest strikeouts per team per game
agg <- aggregate(data\$strikeouts, list(data\$year, data\$team), min)

• Filtering — r

Return a new dataset that has only the elements that match given criteria.

filtered_data <- subset(data, year > 2000)

# can take many arguments
filtered_data <- subset(data, year <= 2000 & player_name == "Shavin")

# filter matching many conditions
upper_midest = subset(data, state_abb%in%c("Minnesota", "Wisconsin", "North Dakota", "South Dakota"))

• Formatting Data types — r

Sometimes, when we load a data frame into the R console, it’s not in the right format. Dumb R might make it a factor, or the numbers might be strings, or whatever.

To turn a character vector into a number vector:

bad <- c("1", "3", "23", "9238")

# note: if it's not a number, R turns it into NA.
bad <- c("1", "3", "giraffe", "9238")
# returns  1    3   NA 9238

To turn a character vector of dates into a date class R likes isn’t hard, but you do need to specify the format your date is in. Mostly this just takes practice. See ?format.Date in the R console for more details.

• Manipulating Columns — r

#Delete a column
data\$name <- NULL

#Rename one column, in this case the second column
colnames(data) <- "newname"

#Rename all the columns at once
colnames(data) <- c("newname1", "newname2", "newname2")

• R Transforms Cleaning Strsplit —

R’s strsplit lets you break up strings into little chunks (called lists), which you can then pull out with another method called sapply. This is not the most efficient way to do this, but it’s presented here in multiple steps for clarity.

# try this in the console. it returns a list (of lists!)
strsplit(dirty, split = "\\(")

# make a function to fetch out the second element of a list
get_second_element <- function(item) {
return (item)
}

# use sapply to apply this function to each element
dirty_list <- strsplit(dirty, split = "\\(")

#and use!
sapply(dirty_list, get_second_element)

#ps, when you're good and the function is this simple, you can do this all in one line

sapply(strsplit(dirty, split="\\("), function(x) { x} )

• R Transforms Cleaning Regex —

We frequently need to strip out unwanted characters from vectors. gsub is R’s way of doing this.

dirty <- c("VA)", "ID)", "KY)", "MO)", "CO)", "IL)", "IN)", "MS)", "NE)", "OH)")

# get rid of right parentheses, replace them with nothing
gsub("\\)", "", dirty)

# returns "VA" "ID" "KY" "MO" "CO" "IL" "IN" "MS" "NE" "OH"

• Sorting — javascript

//Simple alphabetical sort
data.sort();

//Easy way to reverse the sort order
data.sort().reverse();

//If you want to sort on a field
data.sort(function(a, b) {
if (a.year > b.year) return 1;
if (a.year < b.year) return -1;
return 0;
});

//If you're sorting on numbers, you can use this shortcut
data.sort(function(a, b) { return a - b; });

//Same thing but with field names
data.sort(function(a, b) { return a.year - b.year; });

//Alphabetical sort by last then first name
data.sort(function(a, b){
if (a.lastName < b.lastName) return -1;
if (a.lastName > b.lastName) return 1;
if (a.firstName < b.firstName) return -1;
if (a.firstName > b.firstName) return 1;
return 0;
});

• Filtering — javascript

Return a new dataset that has only the elements that match given criteria.

var filteredData = data.filter(function(d) { return d.year > 2000; });

• Grouping — d3

How you aggregate really depends on what you want. Mister Nester is useful for tinkering.

d3.nest()
.key(function(d) { return d.year; })
.rollup(function(values) {
return {
totalStrikeouts: d3.sum(values, function(d) { return d.strikeouts })
};
})
.entries(data);