Apply Functions
Abstract
In R, we often need to apply the same operation to several (or all) of the elements in a data frame. In languages like Python or Java, for-loops and while-loops are the standard tools for iteration. R has loops as well, but its apply functions offer a more concise (and sometimes faster) way of applying a function to a data set.
There’s a great online encyclopedia on R that inspired and informed many examples for this page. Check it out!
apply
The basic apply function takes an array or matrix, a MARGIN parameter for the desired dimension (1 for rows, 2 for columns), and the function you want to apply.
Our test matrix will be 5x5, where the numbers 1-25 will populate each column sequentially:
cubed5 <- matrix(c(1:5, 6:10, 11:15, 16:20, 21:25), nrow = 5, ncol = 5)
cubed5
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
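As a minimal sketch of how MARGIN changes the result, the block below rebuilds the same cubed5 matrix and applies sum across rows (MARGIN = 1) and down columns (MARGIN = 2). The names row_sums and col_sums are illustrative and do not come from the original page.

```r
# Rebuild the 5x5 test matrix; matrix() fills column by column by default
cubed5 <- matrix(1:25, nrow = 5, ncol = 5)

# MARGIN = 1 applies the function to each row
row_sums <- apply(cubed5, 1, sum)
row_sums
# [1] 55 60 65 70 75

# MARGIN = 2 applies the function to each column
col_sums <- apply(cubed5, 2, sum)
col_sums
# [1]  15  40  65  90 115

# An anonymous function works too: the spread (max - min) of each row
row_spread <- apply(cubed5, 1, function(x) max(x) - min(x))
row_spread
# [1] 20 20 20 20 20
```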
Ultimately, that’s all there is to apply. One important note is that apply will not work for plain vectors, because a vector has no dim attribute; we discuss the apply functions for vectors in a moment.
squares <- c(1, 4, 9, 16, 25)
apply(squares, 1, function(x) x ^ 0.5)
`Error in apply(squares, 1, function(x) x^0.5) : dim(X) must have a positive length`
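The error can be traced to the missing dim attribute. The sketch below checks that attribute and shows two ways to get the square roots of a plain vector; the name roots is illustrative.

```r
squares <- c(1, 4, 9, 16, 25)

# A plain vector has no dim attribute, which is why apply() rejects it
dim(squares)
# NULL

# sapply() iterates over the elements of a vector directly
roots <- sapply(squares, function(x) x ^ 0.5)
roots
# [1] 1 2 3 4 5

# Many R functions are already vectorized, so no apply function is needed here
sqrt(squares)
# [1] 1 2 3 4 5
```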
lapply & sapply
lapply applies a function to each element of a list, then returns a list of the results, one per element. Since there is only one dimension in a list, the MARGIN parameter does not apply. Let’s use sum on the squares vector from before:
lapply(squares, sum)
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25
This is not what we wanted. The problem is that vectors and lists are not the same thing: lapply treats every element of the vector as its own list element, so sum is called on each number separately. Wrapping squares in a list with list(), so that the whole vector becomes a single element, fixes the issue:
lapply(list(squares), sum)
[[1]]
[1] 55
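The difference between wrapping a vector with list() and converting it with as.list() can be checked directly. This is a small sketch; the name total is illustrative.

```r
squares <- c(1, 4, 9, 16, 25)

# list() wraps the whole vector in a list with a single element
length(list(squares))
# [1] 1

# as.list() makes each value its own element, which is how lapply() sees a vector
length(as.list(squares))
# [1] 5

# Summing over list(squares) therefore gives one total
total <- lapply(list(squares), sum)[[1]]
total
# [1] 55
```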
sapply functions identically to lapply unless the output can be simplified to a vector or matrix, in which case sapply performs that simplification. The following occurs when we run sapply in place of lapply on our squares vector.
sapply(squares, sum)
[1] 1 4 9 16 25
Unless you explicitly need a list, sapply is often the more convenient function because of its output simplification.
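To illustrate what simplification means in practice, here is a short sketch. It also shows vapply, a stricter base R relative of sapply that this page does not otherwise cover; the names roots, both, and roots_checked are illustrative.

```r
squares <- c(1, 4, 9, 16, 25)

# One value per element: sapply() simplifies to a vector
roots <- sapply(squares, sqrt)

# Several values per element: sapply() simplifies to a matrix,
# with one column per input element
both <- sapply(squares, function(x) c(value = x, root = sqrt(x)))
dim(both)
# [1] 2 5

# vapply() requires the expected type and length of each result up front
roots_checked <- vapply(squares, sqrt, numeric(1))
```

Because the shape of the result from sapply depends on what the function returns, vapply can be a safer choice inside functions where the output type must be predictable.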
tapply
The documentation definition for tapply is a bit more specific than the others. The arguments are now (X, INDEX, FUN), where X is an object to which the split function applies (typically a vector), INDEX is a factor by which X is grouped, and FUN is the function to apply, as before.
To simplify this definition, we can say tapply applies FUN to X when X is grouped by INDEX. Consider the following: we have a grades data.frame that contains the grade, year, and sex of several students. We can use tapply to get the average grade by year in a simple way.
grades
   grade      year  sex
1    100    junior    M
2     99 sophomore    F
3     75 sophomore    M
4     74 sophomore    M
5     44    senior    F
6     69    junior    M
7     88    junior    F
8     99    senior <NA>
9     90  freshman    M
10    92    junior    F
The solution begins below.
tapply(grades$grade, grades$year, mean)
 freshman    junior    senior sophomore 
 90.00000  87.25000  71.50000  82.66667 
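To try this yourself, the grades data.frame can be rebuilt from the printed table. The construction below is an assumed reconstruction, since the page does not show how grades was originally created, and avg_by_year is an illustrative name.

```r
# Rebuilt by hand from the printed table (assumed reconstruction)
grades <- data.frame(
  grade = c(100, 99, 75, 74, 44, 69, 88, 99, 90, 92),
  year  = c("junior", "sophomore", "sophomore", "sophomore", "senior",
            "junior", "junior", "senior", "freshman", "junior"),
  sex   = c("M", "F", "M", "M", "F", "M", "F", NA, "M", "F")
)

# Group the grades by year and take the mean of each group
avg_by_year <- tapply(grades$grade, grades$year, mean)
avg_by_year
```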
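Optional arguments placed after FUN are passed along to it. For example, na.rm=TRUE tells mean to drop missing values of grade within each group. Note that this only affects NAs in X itself: the NA in row 8 is in the sex column, not in grade, so the averages are unchanged for this data.
tapply(grades$grade, grades$year, mean, na.rm=TRUE)
 freshman    junior    senior sophomore 
 90.00000  87.25000  71.50000  82.66667 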
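If the goal is instead to drop every row that has a missing value in any column, one option is to filter with complete.cases before calling tapply. This is a sketch using a hand-built copy of grades (an assumed reconstruction of the printed table); clean and avg_clean are illustrative names.

```r
# Rebuilt by hand from the printed table (assumed reconstruction)
grades <- data.frame(
  grade = c(100, 99, 75, 74, 44, 69, 88, 99, 90, 92),
  year  = c("junior", "sophomore", "sophomore", "sophomore", "senior",
            "junior", "junior", "senior", "freshman", "junior"),
  sex   = c("M", "F", "M", "M", "F", "M", "F", NA, "M", "F")
)

# Keep only the rows with no NA in any column
clean <- grades[complete.cases(grades), ]

# Row 8 (grade 99, sex NA) is gone, so the senior average is now 44
avg_clean <- tapply(clean$grade, clean$year, mean)
avg_clean
```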
Examples
How can I find the average of several variables in the flight dataset using 1 line of lapply code?
We can store the data for 2003 flights as follows:
myDF <- read.csv("/depot/datamine/data/flights/subset/2003.csv")
We can categorize the flight distances into groups of <100 miles, 100-200 miles, 200-500 miles, 500-1000 miles, 1000-2000 miles, and 2000+ miles using the cut function, which returns a factor that we can then tabulate:
my_distance_categories <- cut(myDF$Distance, breaks = c(0,100,200,500,1000,2000,Inf), include.lowest=T)
We can get the averages of all applicable flights for 4 variables, broken down by the distance categories we just defined.
tapply(myDF$DepDelay, my_distance_categories, mean, na.rm=T) # the DepDelay in each category
tapply(myDF$ArrDelay, my_distance_categories, mean, na.rm=T) # the ArrDelay in each category
tapply(myDF$TaxiOut, my_distance_categories, mean, na.rm=T) # the time to TaxiOut in each category
tapply(myDF$TaxiIn, my_distance_categories, mean, na.rm=T) # the time to TaxiIn in each category
However, we can condense this to one line using lapply, as the prompt asks. To make it easier to read, we first build a temporary data frame with these 4 variables and split it into 6 data.frames by distance category, storing the result as flights_by_distance. Calling lapply with colMeans on that list then yields the averages for DepDelay, ArrDelay, TaxiOut, and TaxiIn in each category. The results agree exactly with those of the 4 separate tapply calls, but it only takes us 1 call to lapply!
flights_by_distance <- split( data.frame(myDF$DepDelay, myDF$ArrDelay, myDF$TaxiOut, myDF$TaxiIn), my_distance_categories )
lapply( flights_by_distance, colMeans, na.rm=T )
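The agreement between the two approaches can be checked on a small scale without access to the flights file. The sketch below uses a made-up toy data frame; all values and the names toy, toy_categories, dep_by_tapply, and means_by_lapply are hypothetical.

```r
# Toy stand-in for the flights data (all values are made up for illustration)
toy <- data.frame(
  Distance = c(50, 150, 150, 300, 800, 800, 1500, 2500),
  DepDelay = c(5, 10, NA, 7, 0, 20, 4, 15),
  ArrDelay = c(3, 12, 8, 6, NA, 10, 2, 30)
)
toy_categories <- cut(toy$Distance,
                      breaks = c(0, 100, 200, 500, 1000, 2000, Inf),
                      include.lowest = TRUE)

# One tapply() call per variable
dep_by_tapply <- tapply(toy$DepDelay, toy_categories, mean, na.rm = TRUE)

# One lapply() call covering all variables at once
toy_split <- split(toy[, c("DepDelay", "ArrDelay")], toy_categories)
means_by_lapply <- lapply(toy_split, colMeans, na.rm = TRUE)

# Pull the DepDelay means back out of the list to compare with tapply()
dep_by_lapply <- sapply(means_by_lapply, function(m) m[["DepDelay"]])
all.equal(as.vector(dep_by_tapply), unname(dep_by_lapply))
# [1] TRUE
```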
How can I find the average of variables DRUNK_DR, FATALS, and PERSONS in the fars dataset using 1 line of lapply code?
This is a question that was asked in previous STAT19000 classes when the apply functions were introduced. We’ll start by reading in the dataset and adding state names.
There are more efficient ways to add the names, but this code mirrors the solution to the previous version of this question, which we’ll follow from here on out.
dat <- read.csv("/depot/datamine/data/fars/7581.csv")
state_names <- read.csv("/depot/datamine/data/fars/states.csv")
v <- state_names$state
names(v) <- state_names$code
dat$mystates <- v[as.character(dat$STATE)]
If we want the averages for the 3 variables in question, we can call tapply on each one independently:
tapply(dat$DRUNK_DR, dat$mystates, mean)
tapply(dat$FATALS, dat$mystates, mean)
tapply(dat$PERSONS, dat$mystates, mean)
However, there is an easier way that also fits the requirements of the prompt. For readability, we’ll create accidents_by_state, a list of data.frames (one per state) containing only these 3 variables:
accidents_by_state <- split( data.frame(dat$DRUNK_DR, dat$FATALS, dat$PERSONS), dat$mystates )
lapply( accidents_by_state, colMeans )
The split function creates 51 different data.frames based on the values in mystates, and lapply then applies colMeans to each one to get the averages for our 3 variables. Awesome!
Use the provided code to create a new column transformed in the data.frame example_df. transformed should contain TRUE if the value in column pre_transformed is "t", FALSE if it is "f", and NA otherwise.
string_to_bool <- function(value) {
  if (value == "t") {
    return(TRUE)
  } else if (value == "f") {
    return(FALSE)
  } else {
    return(NA)
  }
}
example_df <- data.frame(pre_transformed=c("f", "f", "t", "f", "something", "t", "else", ""), other=c(1,2,3,4,5,6,7,8))
The solution begins below.
example_df$transformed <- sapply(example_df$pre_transformed, string_to_bool)
example_df
  pre_transformed other transformed
1               f     1       FALSE
2               f     2       FALSE
3               t     3        TRUE
4               f     4       FALSE
5       something     5          NA
6               t     6        TRUE
7            else     7          NA
8                     8          NA
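For comparison, the same column can be produced without an apply function by using the vectorized ifelse. This is an alternative sketch and not the solution the exercise asks for.

```r
example_df <- data.frame(
  pre_transformed = c("f", "f", "t", "f", "something", "t", "else", ""),
  other = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Nested ifelse() evaluates the whole column at once instead of one value at a time
example_df$transformed <- ifelse(
  example_df$pre_transformed == "t", TRUE,
  ifelse(example_df$pre_transformed == "f", FALSE, NA)
)
example_df$transformed
# [1] FALSE FALSE  TRUE FALSE    NA  TRUE    NA    NA
```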
Here we have not a question, but a demonstration. We use tapply in various ways on the Amazon Fine Food Reviews dataset.
The goal of our demonstration is to show the most consistently helpful users in this dataset. This is calculated using the HelpfulnessNumerator and HelpfulnessDenominator fields in the dataset. As an example, we first find the user who wrote the most reviews.
myDF <- read.csv("/depot/datamine/data/amazon/amazon_fine_food_reviews.csv")
tail(sort(table(myDF$UserId)))
The user in question is A3OXHLG6DIBRW8, referred to below as A3O. The code below computes two sums: the HelpfulnessDenominator sum is the total number of people who read A3O’s reviews, while the HelpfulnessNumerator sum is the number of people who found those reviews helpful. We call sum on both, then take the quotient to get A3O’s helpfulness proportion.
sum(myDF$HelpfulnessNumerator[myDF$UserId == "A3OXHLG6DIBRW8"])/sum(myDF$HelpfulnessDenominator[myDF$UserId == "A3OXHLG6DIBRW8"])
Instead of computing this for each user individually, we can use tapply to calculate these proportions for all users at once.
tapply(myDF$HelpfulnessNumerator, myDF$UserId, sum)/tapply(myDF$HelpfulnessDenominator, myDF$UserId, sum)
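One thing to watch for is a user whose denominator sums to zero, since 0/0 gives NaN in R. The sketch below shows one way to handle that on a made-up toy data frame; the user IDs, counts, and the names reviews, helpful, rated, and proportion are all hypothetical.

```r
# Toy stand-in for the reviews data (user IDs and counts are made up)
reviews <- data.frame(
  UserId = c("u1", "u1", "u2", "u2", "u3"),
  HelpfulnessNumerator   = c(3, 1, 0, 5, 0),
  HelpfulnessDenominator = c(4, 1, 2, 8, 0)
)

helpful <- tapply(reviews$HelpfulnessNumerator, reviews$UserId, sum)
rated   <- tapply(reviews$HelpfulnessDenominator, reviews$UserId, sum)

# u3 has a denominator of 0, so its proportion would be NaN; drop such users
proportion <- (helpful / rated)[rated > 0]

# Rank the remaining users from most to least helpful
ranked <- sort(proportion, decreasing = TRUE)
ranked
```

Here u1 ends up at 4/5 = 0.8 and u2 at 5/10 = 0.5.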