Kevin and Shan’s

# Field Guide to Working with Data in the Wild

• ### Summary stats on a vector — r

You should know your data – every column, every row.

To get the min, max or mean of a quantitative vector

``````df <- data.frame(val1=c(-34, 45, 34, 69, 7, 99), val2=c(-14, 435, 134, 6, -7, 9))

#min of thr first col
min(df\$val1)
mean(df\$val1)
max(df\$val1)

#or do summary on the whole data frame
summary(df)
``````

If your data is qualitative, you can use `table`

``````animals <- c("monkeys", "monkeys", "giraffe", "donkey", "blue whale", "giraffe", "giraffe")

table(animals)

# blue whale     donkey    giraffe    monkeys
#         1          1          3          2
``````

• ### Sorting — r

``````#sort data frame on a field name – note ascending order by default.
data <- data[order(data\$fieldname),]

#descending sort
data <- data[order(data\$fieldname, decreasing=TRUE),]
``````

• ### Joining Two Datasets — r

``````# field names must match for R to know what to join on
# note: this will DROP RECORDS that don't join; returns only successful joins
merged <- merge(dataframe1, dataframe2, by="field_name")

# to keep all the records in dataframe1 even if they dont join perfectly to dataframe2
merged <- merge(dataframe1, dataframe2, by="field_name", all.x = TRUE)

# to keep all the records in dataframe2 even if they dont join perfectly to dataframe1
merged <- merge(dataframe1, dataframe2, by="field_name", all.y = TRUE)
``````

• ### Aggregation — r

``````#to get, say, the total strikeouts per year (from our homework)
agg <- aggregate(data\$strikeouts, list(data\$year), sum)

#to get, say, the total strikeouts per team per year (from our homework)
agg <- aggregate(data\$strikeouts, list(data\$year, data\$team), sum)

# the third argument takes any function (including a custom one)
# fewest strikeouts per team per game
agg <- aggregate(data\$strikeouts, list(data\$year, data\$team), min)
``````

• ### Filtering — r

Return a new dataset that has only the elements that match given criteria.

``````filtered_data <- subset(data, year > 2000)

# can take many arguments
filtered_data <- subset(data, year <= 2000 & player_name == "Shavin")

# filter matching many conditions
upper_midest = subset(data, state_abb%in%c("Minnesota", "Wisconsin", "North Dakota", "South Dakota"))
``````

• ### Formatting Data types — r

Sometimes, when we load a data frame into the R console, it’s not in the right format. Dumb R might make it a factor, or the numbers might be strings, or whatever.

To turn a character vector into a number vector:

``````bad <- c("1", "3", "23", "9238")

# note: if it's not a number, R turns it into NA.
bad <- c("1", "3", "giraffe", "9238")
# returns  1    3   NA 9238
``````

To turn a character vector of dates into a date class R likes isn’t hard, but you do need to specify the format your date is in. Mostly this just takes practice. See `?format.Date` in the R console for more details.

``````bad <- c("4-16-81", "9-19-45", "1-3-93")

``````

• ### Manipulating Columns — r

``````#Delete a column
data\$name <- NULL

#Rename one column, in this case the second column
colnames(data) <- "newname"

#Rename all the columns at once
colnames(data) <- c("newname1", "newname2", "newname2")
``````

• ### R Transforms Cleaning Strsplit —

R’s strsplit lets you break up strings into little chunks (called lists), which you can then pull out with another method called `sapply`. This is not the most efficient way to do this, but it’s presented here in multiple steps for clarity.

``````dirty <- c("ACCOMACK(VA)", "ADA(ID)", "ADAIR(KY)", "ADAIR(MO)", "ADAMS(CO)", "ADAMS(IL)", "ADAMS(IN)", "ADAMS(MS)", "ADAMS(NE)", "ADAMS(OH)", "ADAMS(PA)", "ADAMS(WI)", "AIKEN(SC)")

# try this in the console. it returns a list (of lists!)
strsplit(dirty, split = "\\(")

# make a function to fetch out the second element of a list
get_second_element <- function(item) {
return (item)
}

# use sapply to apply this function to each element
dirty_list <- strsplit(dirty, split = "\\(")

#and use!
sapply(dirty_list, get_second_element)

#ps, when you're good and the function is this simple, you can do this all in one line

sapply(strsplit(dirty, split="\\("), function(x) { x} )
``````

• ### R Transforms Cleaning Regex —

We frequently need to strip out unwanted characters from vectors. `gsub` is R’s way of doing this.

``````dirty <- c("VA)", "ID)", "KY)", "MO)", "CO)", "IL)", "IN)", "MS)", "NE)", "OH)")

# get rid of right parentheses, replace them with nothing
gsub("\\)", "", dirty)

# returns "VA" "ID" "KY" "MO" "CO" "IL" "IN" "MS" "NE" "OH"
``````

• ### Sorting — javascript

``````//Simple alphabetical sort
data.sort();

//Easy way to reverse the sort order
data.sort().reverse();

//If you want to sort on a field
data.sort(function(a, b) {
if (a.year > b.year) return 1;
if (a.year < b.year) return -1;
return 0;
});

//If you're sorting on numbers, you can use this shortcut
data.sort(function(a, b) { return a - b; });

//Same thing but with field names
data.sort(function(a, b) { return a.year - b.year; });

//Alphabetical sort by last then first name
data.sort(function(a, b){
if (a.lastName < b.lastName) return -1;
if (a.lastName > b.lastName) return 1;
if (a.firstName < b.firstName) return -1;
if (a.firstName > b.firstName) return 1;
return 0;
});
``````

• ### Filtering — javascript

Return a new dataset that has only the elements that match given criteria.

``````var filteredData = data.filter(function(d) { return d.year > 2000; });
``````

• ### Grouping — d3

How you aggregate really depends on what you want. Mister Nester is useful for tinkering.

``````d3.nest()
.key(function(d) { return d.year; })
.rollup(function(values) {
return {
totalStrikeouts: d3.sum(values, function(d) { return d.strikeouts })
};
})
.entries(data);
``````