Loops are a great, relatively straightforward way to repeatedly execute a chunk of code. However, they aren’t especially efficient. Enter the apply family of functions.

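These functions apply a piece of code to every element of an object (or to every row, column, or group) without an explicit loop. This is usually more concise and often more efficient than a hand-written loop, as long as the computation for one element doesn’t depend on the result for another element.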
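
As a rough illustration of the difference, the sketch below computes the same squares twice: once with an explicit loop and once with sapply (covered later in this section). The vector values and the result names are made up for this example.

```r
values <- c(2, 4, 6, 8)

# Explicit loop: pre-allocate the result, then fill it one element at a time
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Apply-style: hand the function to sapply and let it do the iteration
squares_apply <- sapply(values, function(v) v^2)

# Both approaches give the same numbers
all.equal(squares_loop, squares_apply)
```

Each square depends only on its own element, which is exactly the situation where an apply-type function is a natural fit.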

replicate()

We will start with replicate because it’s arguably the easiest of these functions to use. The replicate function evaluates an expression n times and collects the results.

replicate(n=4, "Hello")
## [1] "Hello" "Hello" "Hello" "Hello"
replicate(10, factorial(7))
##  [1] 5040 5040 5040 5040 5040 5040 5040 5040 5040 5040
# histogram of the means from 100 random samples of size n=10 from a standard normal distribution 
hist(replicate(100, mean(rnorm(10))),
     main = "", xlab="Means") 
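
When the expression returns more than one value, replicate collects the results into a matrix with one column per replication. The sketch below shows this, along with the simplify = FALSE option that keeps the results in a list; the seed and the object names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Each evaluation returns 2 values, so 3 replications give a 2 x 3 matrix
draws <- replicate(3, rnorm(2))
dim(draws)

# simplify = FALSE keeps the 3 results as separate list elements instead
draws_list <- replicate(3, rnorm(2), simplify = FALSE)
length(draws_list)
```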

A Similar Function for Simple Value Replication

The function rep replicates the values in the first argument. This is not part of the apply family, but may serve a similar purpose to replicate.

Suppose I want to represent all of the possible combinations when rolling two four-sided dice.

v1 <- rep(1:4, times=4) # replicate the sequence 1:4, four times
v2 <- rep(1:4, each=4) # replicate 1:4, with each number replicated 4 times (in a row)
data.frame(v1, v2)
##    v1 v2
## 1   1  1
## 2   2  1
## 3   3  1
## 4   4  1
## 5   1  2
## 6   2  2
## 7   3  2
## 8   4  2
## 9   1  3
## 10  2  3
## 11  3  3
## 12  4  3
## 13  1  4
## 14  2  4
## 15  3  4
## 16  4  4
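
The practical difference between rep and replicate shows up when the thing being repeated is random. In this sketch (the seed and names are arbitrary), rep copies one value while replicate draws a new value each time.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# rep evaluates rnorm(1) once and copies that single value three times
copied <- rep(rnorm(1), times = 3)

# replicate evaluates rnorm(1) three separate times
redrawn <- replicate(3, rnorm(1))

length(unique(copied))   # one distinct value
length(unique(redrawn))  # three distinct values
```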

apply()

The apply function applies a given function to the rows or columns of matrices (or arrays). It assembles the returned values into a vector, array, or list, which it returns.

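The apply() arguments are X (the matrix or array), MARGIN (1 for rows, 2 for columns), FUN (the function to apply), and ... (optional arguments passed on to FUN):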

data <- matrix(1:9, nrow=3, ncol=3)
# the following is equivalent to the command: colMeans(data)
apply(data, 2, mean) # data, columns, mean --> get column means
## [1] 2 5 8
# the following is equivalent to the command: rowSums(data)
apply(data, 1, sum) # data, rows, sum --> get row sums
## [1] 12 15 18
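
Since apply also accepts arrays with more than two dimensions, here is a small sketch with a three-dimensional array. The array arr and the result names are invented for this example; MARGIN can be a single dimension or a vector of dimensions.

```r
arr <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 slices

# MARGIN = 3: one sum per slice
slice_sums <- apply(arr, 3, sum)
slice_sums
## [1]  21  57  93 129

# MARGIN = c(1, 2): one sum per (row, column) cell, taken across the slices
cell_sums <- apply(arr, c(1, 2), sum)
dim(cell_sums)
## [1] 2 3
```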

We can also use apply with user-defined functions.

# Define the function within the apply statement:
apply(data, 2, function(x){
    y <- sum(x)^2 # square of the sum of the input vector (here a column)
    return(y)
  }
)
## [1]  36 225 576
# Define the function outside of the apply statement:
fn <- function(x){
    y <- sum(x)^2 # square of the sum of the input vector (here a column)
    return(y)
}
apply(data, 2, fn)
## [1]  36 225 576

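What apply() returns depends on the output of the function FUN. If every call to FUN returns a single value, apply returns a vector. If every call returns a vector of the same length n (with n > 1), apply returns an array (a matrix with n rows when MARGIN is a single number). If the calls return vectors of different lengths, apply returns a list.

In short, apply tries to return a vector, then an array (matrix), then a list (in that order).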

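Note: running apply on a data frame causes R to first convert the data frame using as.matrix. If the columns have different types, every value is coerced to one common type (often character). This is often not what we want, so be cautious when doing that.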
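
The effect of that conversion can be seen with the built-in iris data, which has four numeric columns and one factor. This is only a sketch; the names via_apply and via_sapply are made up, and sapply is covered later in this section.

```r
# apply converts iris to a character matrix because of the Species factor,
# so no column looks numeric any more
via_apply <- apply(iris, 2, is.numeric)
via_apply

# sapply works on the columns directly and sees their original types
via_sapply <- sapply(iris, is.numeric)
via_sapply
```

For column-wise work on a data frame, lapply or sapply is usually the safer choice.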

Example: Extra Arguments, Array Output

x <- cbind(x1 = 3, x2 = c(4:1, 2:5))

fun1 <- function(x, c1, c2){
  mean_vec <- c(mean(x[c1]), mean(x[c2]))
  return(mean_vec)
}

apply(x, 1, fun1,  c1 = "x1", c2 = c("x1","x2"))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]  3.0    3  3.0    3  3.0    3  3.0    3
## [2,]  3.5    3  2.5    2  2.5    3  3.5    4
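
Notice that the output above has one column per row of x. When FUN returns a vector, apply stores each result as a column, even when it was applied over rows. The small sketch below (with a made-up matrix m) shows that the result has to be transposed with t() to get back the original orientation.

```r
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns

# Double every row; each row's result becomes a COLUMN of the output
doubled <- apply(m, 1, function(row) row * 2)
dim(doubled)
## [1] 3 2

# Transposing restores the 2 x 3 layout
all.equal(t(doubled), m * 2)
```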

Example: List Output

mat <- matrix(c(-1, 1, 0, 
                2, -2, 20, 
                62,-2, -6), nrow = 3)

CheckPos <- function(Vec){
  # Subset values of Vec that are positive
  PosVec <- Vec[Vec > 0]
  
  # Return only the positive values
  return(PosVec)
}

# Check Positive values by column 
apply(mat, 2, CheckPos)
## [[1]]
## [1] 1
## 
## [[2]]
## [1]  2 20
## 
## [[3]]
## [1] 62

On Your Own

Use an apply function to find the interquartile range (IQR()) of each variable in the ChickWeight data. (This dataset is built into R.)

lapply()

The lapply function is used to apply a function to each element of a list. It collects the returned values into another list, which it returns.

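The lapply() arguments are X (a list or vector), FUN (the function to apply to each element), and ... (optional arguments passed on to FUN):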

data_lst <- list(item1 = 1:5,
                item2 = seq(4,36,8),
                item3 = c(1,3,5,7,9))
data_vector <- 1:8

lapply(data_lst, sum)
## $item1
## [1] 15
## 
## $item2
## [1] 100
## 
## $item3
## [1] 25
lapply(data_vector, sum) # lapply performs an `as.list` command on X if it's not already a list
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 8
x <- list(a = 1:10, 
          beta = exp(-3:3), 
          logic = c(TRUE,FALSE,FALSE,TRUE))

# compute the mean of each list element
lapply(x, mean)
## $a
## [1] 5.5
## 
## $beta
## [1] 4.535125
## 
## $logic
## [1] 0.5
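
Extra arguments for the applied function can be given after FUN; lapply passes them along on every call. The sketch below uses a made-up list, scores, that contains missing values.

```r
scores <- list(a = c(1, NA, 3), b = c(10, 20, NA))

# Without na.rm, each mean is NA because of the missing values
plain <- lapply(scores, mean)

# na.rm = TRUE is forwarded to mean() for every list element
cleaned <- lapply(scores, mean, na.rm = TRUE)
cleaned
## $a
## [1] 2
## 
## $b
## [1] 15
```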

Consider the built-in data set iris. If we use the as.list() function, each column is converted into an element of a list.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(as.list(iris))
## List of 5
##  $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

So if we use lapply() in this case, it will iterate over the columns. We can find all values within a variable that are greater than the variable mean (for columns 1-4, the numeric variables).

lapply(iris[,1:4], function(column){
  big_values <- column[column > mean(column)]
  return(big_values)
})
## $Sepal.Length
##  [1] 7.0 6.4 6.9 6.5 6.3 6.6 5.9 6.0 6.1 6.7 6.2 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7
## [20] 6.0 6.0 6.0 6.7 6.3 6.1 6.2 6.3 7.1 6.3 6.5 7.6 7.3 6.7 7.2 6.5 6.4 6.8 6.4
## [39] 6.5 7.7 7.7 6.0 6.9 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [58] 6.3 6.4 6.0 6.9 6.7 6.9 6.8 6.7 6.7 6.3 6.5 6.2 5.9
## 
## $Sepal.Width
##  [1] 3.5 3.2 3.1 3.6 3.9 3.4 3.4 3.1 3.7 3.4 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6
## [20] 3.3 3.4 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.4 3.5 3.2 3.5 3.8
## [39] 3.8 3.2 3.7 3.3 3.2 3.2 3.1 3.3 3.1 3.2 3.4 3.1 3.3 3.6 3.2 3.2 3.8 3.2 3.3
## [58] 3.2 3.8 3.4 3.1 3.1 3.1 3.1 3.2 3.3 3.4
## 
## $Petal.Length
##  [1] 4.7 4.5 4.9 4.0 4.6 4.5 4.7 4.6 3.9 4.2 4.0 4.7 4.4 4.5 4.1 4.5 3.9 4.8 4.0
## [20] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.8 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0
## [39] 4.2 4.2 4.2 4.3 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0
## [58] 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6
## [77] 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
## 
## $Petal.Width
##  [1] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1.3 1.4 1.5 1.4 1.3 1.4 1.5 1.5 1.8 1.3 1.5 1.2
## [20] 1.3 1.4 1.4 1.7 1.5 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3 1.2 1.4 1.2 1.3 1.2 1.3
## [39] 1.3 1.3 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8
## [58] 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4 2.3
## [77] 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3 2.5 2.3 1.9 2.0 2.3 1.8

On Your Own

Use lapply to find the range for each item in the list data_lst (which should already be in your R environment from an earlier code chunk).

sapply()

The sapply function works basically the same as the lapply function. The primary difference is that sapply attempts to simplify the result into a vector or matrix (instead of a list). This simplification works the same way as in apply.

lapply(data_lst, sum) # returns a list
## $item1
## [1] 15
## 
## $item2
## [1] 100
## 
## $item3
## [1] 25
sapply(data_lst, sum) # returns a vector
## item1 item2 item3 
##    15   100    25
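
What sapply returns depends on what the function returns for each element. The following sketch redefines data_lst so it can run on its own; the anonymous functions and the names stats and big are illustrative only.

```r
data_lst <- list(item1 = 1:5,
                 item2 = seq(4, 36, 8),
                 item3 = c(1, 3, 5, 7, 9))

# Two values per element: simplified to a matrix, one column per element
stats <- sapply(data_lst, function(v) c(mean = mean(v), sd = sd(v)))
dim(stats)
## [1] 2 3

# Results of different lengths cannot be simplified, so a list comes back
big <- sapply(data_lst, function(v) v[v > 4])
is.list(big)
## [1] TRUE
```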

On Your Own

Use sapply to find the range for each item in the list data_lst.

tapply()

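The tapply function splits a vector into groups, defined by one or more factors, and applies a function to each group.

The tapply() arguments are X (the vector of values), INDEX (a factor, or list of factors, the same length as X), FUN (the function to apply to each group), and ... (optional arguments passed on to FUN):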

data <- data.frame(name=c("Amy","Jose","Ray","Kim","Sam","Eve","Bob"), 
                   age=c(24, 22, 21, 23, 20, 24, 21),
                   gender=factor(c("F","M","M","F","M","F","M"))) 
tapply(data$age, data$gender, min) # age, grouped by gender, min for each group
##  F  M 
## 23 20
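
One way to think about tapply is as a split step followed by an apply step. The sketch below, using a made-up data frame called people, compares tapply with splitting the ages by gender and then calling sapply on the pieces.

```r
people <- data.frame(age = c(24, 22, 21, 23, 20, 24, 21),
                     gender = factor(c("F", "M", "M", "F", "M", "F", "M")))

# One step: group the ages by gender and take the minimum of each group
by_tapply <- tapply(people$age, people$gender, min)

# Two steps: split the ages into a list by gender, then apply min to each piece
by_split <- sapply(split(people$age, people$gender), min)

by_split
##  F  M 
## 23 20
```

Both versions give the same group minimums; tapply simply does the splitting for us.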

On Your Own

For the ChickWeight data, use tapply to find the mean weight for each chick.

mapply()

The mapply function is a multivariate version of sapply. It applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.

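The mapply() arguments are FUN (the function to apply), ... (the vectors or lists to iterate over in parallel), MoreArgs (a list of further arguments to FUN that stay fixed), and SIMPLIFY (whether to simplify the result as sapply does):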

mapply(rep, times = 1:4, x = 4:1)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 3 3
## 
## [[3]]
## [1] 2 2 2
## 
## [[4]]
## [1] 1 1 1 1
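
When every call returns a single value, mapply simplifies the result to a vector, just as sapply would. The sketch below also shows MoreArgs, which holds an argument fixed across all calls. The function arguments base, exponent, and offset are names invented for this example.

```r
# Element-wise powers: 2^1, 3^2, 4^3
powers <- mapply(function(base, exponent) base^exponent, 2:4, 1:3)
powers
## [1]  2  9 64

# MoreArgs supplies offset = 100 to every call
shifted <- mapply(function(x, y, offset) x + y + offset,
                  1:3, 4:6, MoreArgs = list(offset = 100))
shifted
## [1] 105 107 109
```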

More information and examples: http://adv-r.had.co.nz/Functionals.html