You should know your data – every column, every row.
To get the min, max or mean of a quantitative vector
df <- data.frame(val1=c(-34, 45, 34, 69, 7, 99), val2=c(-14, 435, 134, 6, -7, 9))
#min of thr first col
min(df$val1)
mean(df$val1)
max(df$val1)
#or do summary on the whole data frame
summary(df)
If your data is qualitative, you can use table
animals <- c("monkeys", "monkeys", "giraffe", "donkey", "blue whale", "giraffe", "giraffe")
table(animals)
# blue whale donkey giraffe monkeys
# 1 1 3 2
#sort data frame on a field name – note ascending order by default.
data <- data[order(data$fieldname),]
#descending sort
data <- data[order(data$fieldname, decreasing=TRUE),]
# field names must match for R to know what to join on
# note: this will DROP RECORDS that don't join; returns only successful joins
merged <- merge(dataframe1, dataframe2, by="field_name")
# to keep all the records in dataframe1 even if they dont join perfectly to dataframe2
merged <- merge(dataframe1, dataframe2, by="field_name", all.x = TRUE)
# to keep all the records in dataframe2 even if they dont join perfectly to dataframe1
merged <- merge(dataframe1, dataframe2, by="field_name", all.y = TRUE)
#to get, say, the total strikeouts per year (from our homework)
agg <- aggregate(data$strikeouts, list(data$year), sum)
#to get, say, the total strikeouts per team per year (from our homework)
agg <- aggregate(data$strikeouts, list(data$year, data$team), sum)
# the third argument takes any function (including a custom one)
# fewest strikeouts per team per game
agg <- aggregate(data$strikeouts, list(data$year, data$team), min)
Return a new dataset that has only the elements that match given criteria.
filtered_data <- subset(data, year > 2000)
# can take many arguments
filtered_data <- subset(data, year <= 2000 & player_name == "Shavin")
# filter matching many conditions
upper_midest = subset(data, state_abb%in%c("Minnesota", "Wisconsin", "North Dakota", "South Dakota"))
Sometimes, when we load a data frame into the R console, it’s not in the right format. Dumb R might make it a factor, or the numbers might be strings, or whatever.
To turn a character vector into a number vector:
bad <- c("1", "3", "23", "9238")
good <- as.numeric(bad)
# note: if it's not a number, R turns it into NA.
bad <- c("1", "3", "giraffe", "9238")
good <- as.numeric(bad)
# returns 1 3 NA 9238
To turn a character vector of dates into a date class R likes isn’t hard, but you do need to specify the format your date is in. Mostly this just takes practice. See ?format.Date
in the R console for more details.
bad <- c("4-16-81", "9-19-45", "1-3-93")
good <- as.Date(bad, format="%m-%d-%y")
bad <- c("04/16/1981", "09/19/1945", "01/03/1993")
good <- as.Date(bad, format="%m/%d/%Y")
#Delete a column
data$name <- NULL
#Rename one column, in this case the second column
colnames(data)[2] <- "newname"
#Rename all the columns at once
colnames(data) <- c("newname1", "newname2", "newname2")
R’s strsplit lets you break up strings into little chunks (called lists), which you can then pull out with another method called sapply
.
This is not the most efficient way to do this, but it’s presented here in multiple steps for clarity.
dirty <- c("ACCOMACK(VA)", "ADA(ID)", "ADAIR(KY)", "ADAIR(MO)", "ADAMS(CO)", "ADAMS(IL)", "ADAMS(IN)", "ADAMS(MS)", "ADAMS(NE)", "ADAMS(OH)", "ADAMS(PA)", "ADAMS(WI)", "AIKEN(SC)")
# try this in the console. it returns a list (of lists!)
strsplit(dirty, split = "\\(")
# make a function to fetch out the second element of a list
get_second_element <- function(item) {
return (item[2])
}
# use sapply to apply this function to each element
dirty_list <- strsplit(dirty, split = "\\(")
#and use!
sapply(dirty_list, get_second_element)
#ps, when you're good and the function is this simple, you can do this all in one line
sapply(strsplit(dirty, split="\\("), function(x) { x[2]} )
We frequently need to strip out unwanted characters from vectors. gsub
is R’s way of doing this.
dirty <- c("VA)", "ID)", "KY)", "MO)", "CO)", "IL)", "IN)", "MS)", "NE)", "OH)")
# get rid of right parentheses, replace them with nothing
gsub("\\)", "", dirty)
# returns "VA" "ID" "KY" "MO" "CO" "IL" "IN" "MS" "NE" "OH"
//Simple alphabetical sort
data.sort();
//Easy way to reverse the sort order
data.sort().reverse();
//If you want to sort on a field
data.sort(function(a, b) {
if (a.year > b.year) return 1;
if (a.year < b.year) return -1;
return 0;
});
//If you're sorting on numbers, you can use this shortcut
data.sort(function(a, b) { return a - b; });
//Same thing but with field names
data.sort(function(a, b) { return a.year - b.year; });
//Alphabetical sort by last then first name
data.sort(function(a, b){
if (a.lastName < b.lastName) return -1;
if (a.lastName > b.lastName) return 1;
if (a.firstName < b.firstName) return -1;
if (a.firstName > b.firstName) return 1;
return 0;
});
More documentation about Javascript sorting.
Return a new dataset that has only the elements that match given criteria.
var filteredData = data.filter(function(d) { return d.year > 2000; });
How you aggregate really depends on what you want. Mister Nester is useful for tinkering.
d3.nest()
.key(function(d) { return d.year; })
.rollup(function(values) {
return {
totalStrikeouts: d3.sum(values, function(d) { return d.strikeouts })
};
})
.entries(data);