Programming in R

After learning the basics, programming in R is the next big step towards attaining coding nirvana. In R this freedom is gained by being able to wield and manipulate code to get it to do exactly what you want, rather than contort, bend and restrict yourself to what others have built. It’s worth remembering that many packages are written with the goal of solving a problem the author had. Your problem may be similar, but not exactly the same. In such cases you could contort, bend and restrict your data so that it works with the package, or you could build your own functions that solve your specific problems.

But there are already a vast number of R packages available, surely more than enough to cover everything you could possibly want to do? Why then, would you ever need to use R as a programming language? Why not just stick to the functions from a package? Well, in some cases you’ll want to customise those existing functions to suit your specific needs. Or you may want to implement an approach that’s hot off the press or you’ve come up with an entirely novel idea, which means there won’t be any pre-existing packages that work for you. Both of these are not particularly common early-on when you start using R, but you may well find as you progress, the reliance on existing functions becomes a little limiting. In addition, if you ever start using more advanced statistical approaches (i.e. Bayesian inference coded through Stan or JAGS) then an understanding of basic programming constructs, especially loops, becomes fundamental.

In this Chapter we’ll explore the basics of the programming side of R. Even if you never create your own function, you will at the very least use functions on a daily basis. Pulling back the curtain and showing the nitty-gritty of how these work will hopefully, in the worst case, help your confidence and in the best case, provide a starting point for doing some clever coding.

Looking behind the curtain

A good way to start learning to program in R is to see what others have done. We can start by briefly peeking behind the curtain. In GGPLOT2 we made use of the theme_classic() function when customising our ggplot figures. With many functions in R, if you want to have a quick glance at the machinery behind the scenes, we can simply write the function name but without the (). This is the same trick we used in GGPLOT2 to alter the theme_classic() style of ggplot2 to make theme_rbook().

Note that to view the source code of base R packages (those that come with R) requires some additional steps which we won’t cover here (see this [link][show-code] if you’re interested), but for most other packages that you install yourself, generally entering the function name without () will show the source code of the function.

theme_classic
## function (base_size = 11, base_family = "", base_line_size = base_size/22, 
##     base_rect_size = base_size/22) 
## {
##     theme_bw(base_size = base_size, base_family = base_family, 
##         base_line_size = base_line_size, base_rect_size = base_rect_size) %+replace% 
##         theme(panel.border = element_blank(), panel.grid.major = element_blank(), 
##             panel.grid.minor = element_blank(), axis.line = element_line(colour = "black", 
##                 linewidth = rel(1)), strip.background = element_rect(fill = "white", 
##                 colour = "black", linewidth = rel(2)), complete = TRUE)
## }
## <bytecode: 0x12cde6be0>
## <environment: namespace:ggplot2>

What we see above is the underlying code for this particular function. As we did in GGPLOT2, we could copy and paste this into our own script and make any changes we deemed necessary. This approach isn’t limited to ggplot2 functions and can be used for other functions as well, although tread carefully and test the changes you’ve made.

Don’t worry overly if most of the code contained in functions doesn’t make sense immediately. This will be especially true if you are new to R, in which case it seems incredibly intimidating. To help with that, we’ll begin by making our own functions in R in the next section.

Functions in R

Functions are your loyal servants, waiting patiently to do your bidding to the best of their ability. They’re made with the utmost care and attention … though sometimes may end up being something of a Frankenstein’s monster - with an extra limb or two and a head put on backwards. But no matter how ugly they may be they’re completely faithful to you.

They’re also very stupid.

If we asked you to go to the supermarket to get us some ingredients to make Francesinha, even if you don’t know what the heck that is, you’d be able to guess and bring at least something back. Or you could decide to make something else. Or you could ask a celebrity chef for help. Or you could pull out your phone and search online for what [Francesinha][francesinha] is. The point is, even if we didn’t give you enough information to do the task, you’re intelligent enough to, at the very least, try to find a work around.

If instead, we asked our loyal function to do the same, it would listen intently to our request, stand still for a few milliseconds, compose itself, and then start shouting Error: 'data' must be a data frame, or other object .... It would then repeat this every single time we asked it to do the job. The point here, is that code and functions are not intelligent. They cannot find workarounds. It’s totally reliant on you, to tell it very explicitly what it needs to do step by step.

Remember two things: the intelligence of code comes from the coder, not the computer and functions need exact instructions to work.

To prevent functions from being too stupid you must provide the information the function needs in order for it to function. As with the Francesinha example, if we’d supplied a recipe list to the function, it would have managed just fine. We call this “fulfilling an argument”. The vast majority of functions require the user to fulfill at least one argument.

This can be illustrated in the pseudocode below. When we make a function we can specify what arguments the user must fulfill (e.g. argument1 and argument2), as well as what to do once it has this information (expression):

nameOfFunction <- function(argument1, argument2, ...) {expression}

The first thing to note is that we’ve used the function function() to create a new function called nameOfFunction. To walk through the above code; we’re creating a function called nameOfFunction. Within the round brackets we specify what information (i.e. arguments) the function requires to run (as many or as few as needed). These arguments are then passed to the expression part of the function. The expression can be any valid R command or set of R commands and is usually contained between a pair of braces { } (if a function is only one line long you can omit the braces). Once you run the above code, you can then use your new function by typing:

nameOfFunction(argument1, argument2)

Confused? Let’s work through an example to help clear things up.

First we are going to create a data frame called city, where columns porto, aberdeen, nairobi, and genoa are filled with 100 random values drawn from a bag (using the rnorm() function to draw random values from a Normal distribution with mean 0 and standard deviation of 1). We also include a “problem”, for us to solve later, by including 10 NA values within the nairobi column (using rep(NA, 10)).

city <- data.frame(
  porto = rnorm(100),
  aberdeen = rnorm(100),
  nairobi = c(rep(NA, 10), rnorm(90)),
  genoa = rnorm(100)
)

Let’s say that you want to multiply the values in the variables Porto and Aberdeen and create a new object called porto_aberdeen. We can do this “by hand” using:

porto_aberdeen <- city$porto * city$aberdeen

We’ve now created an object called porto_aberdeen by multiplying the vectors city$porto and city$aberdeen. Simple. If this was all we needed to do, we can stop here. R works with vectors, so doing these kinds of operations in R is actually much simpler than other programming languages, where this type of code might require loops (we say that R is a vectorised language). Something to keep in mind for later is that doing these kinds of operations with loops can be much slower compared to vectorisation.

But what if we want to repeat this multiplication many times? Let’s say we wanted to multiply columns porto and aberdeen, aberdeen and genoa, and nairobi and genoa. In this case we could copy and paste the code, replacing the relevant information.

porto_aberdeen <- city$porto * city$aberdeen
aberdeen_genoa <- city$aberdeen * city$aberdeen
nairobi_genoa <- city$nairobi * city$genoa

While this approach works, it’s easy to make mistakes. In fact, here we’ve “forgotten” to change aberdeen to genoa in the second line of code when copying and pasting. This is where writing a function comes in handy. If we were to write this as a function, there is only one source of potential error (within the function itself) instead of many copy-pasted lines of code (which we also cut down on by using a function).

In this case, we’re using some fairly trivial code where it’s maybe hard to make a genuine mistake. But what if we increased the complexity?

city$porto * city$aberdeen / city$porto + (city$porto * 10^(city$aberdeen)) 
                  - city$aberdeen - (city$porto * sqrt(city$aberdeen + 10))

Now imagine having to copy and paste this three times, and in each case having to change the porto and aberdeen variables (especially if we had to do it more than three times).

What we could do instead is generalise our code for x and y columns instead of naming specific cities. If we did this, we could recycle the x * y code. Whenever we wanted to multiple columns together, we assign a city to either x or y. We’ll assign the multiplication to the objects porto_aberdeen and aberdeen_nairobi so we can come back to them later.

# Assign x and y values
x <- city$porto
y <- city$aberdeen

# Use multiplication code
porto_aberdeen <- x * y

# Assign new x and y values
x <- city$aberdeen
y <- city$nairobi

# Reuse multiplication code
aberdeen_nairobi <- x * y

This is essentially what a function does. OK down to business, let’s call our new function multiply_columns() and define it with two arguments, x and y. In the function code we simply return the value of x * y using the return() function. Using the return() function is not strictly necessary in this example as R will automatically return the value of the last line of code in our function. We include it here to make this explicit.

multiply_columns <- function(x, y) {
  return(x * y)
}

Now that we’ve defined our function we can use it. Let’s use the function to multiple the columns city$porto and city$aberdeen and assign the result to a new object called porto_aberdeen_func.

porto_aberdeen_func <- multiply_columns(x = city$porto, y = city$aberdeen)
porto_aberdeen_func
##   [1]  1.26206657  0.06848963 -0.81066950  0.49847607 -1.22278301  0.39824089
##   [7]  1.74358158  0.92784467 -0.04857687  0.15154000  0.08686293 -0.98977892
##  [13] -0.45808708 -0.52696514  1.22126588 -0.83054572 -0.74378472  1.13471138
##  [19] -0.57196060  1.95092373  1.02897066  0.24426356 -0.08388141  0.86533537
##  [25] -1.49653801  0.26943040 -0.42382601 -0.05253883  1.85408186  0.16887063
##  [31]  0.05497458  1.99559702  0.41805478  0.16107513  0.19709268 -0.41098573
##  [37]  3.33156070  0.26657344  0.29132051  0.20609891  0.21448581 -1.23971263
##  [43]  0.67195281 -0.52559590 -0.20503224 -0.09327091 -0.09545263 -1.91281447
##  [49]  0.39488736 -0.02122213 -0.36636576  0.15481154 -2.55029107  0.15234106
##  [55] -0.36095086 -1.67193339  0.04989543 -0.28113699 -0.11470818  0.35808955
##  [61]  0.01476053 -0.66677439  0.48710655  0.09682939 -1.16828436  0.06146796
##  [67] -0.08393281  0.08220809  0.14054721 -0.57657757  0.13773337  0.16977789
##  [73] -0.50324683  0.60377491 -0.71730738 -1.66971422 -0.07337153  0.97519978
##  [79] -0.55986377  2.37953731 -0.50462258  1.51868905  2.27088731 -0.08645319
##  [85] -0.85074469 -0.41906071  1.29185604  0.02500684 -0.37517160  0.44416784
##  [91]  0.45686704  0.03481269  0.51991068 -0.02688135 -0.13981627 -1.34498132
##  [97]  0.02222702 -1.13085986  0.83135295  1.17235317

If we’re only interested in multiplying city$porto and city$aberdeen, it would be overkill to create a function to do something once. However, the benefit of creating a function is that we now have that function added to our environment which we can use as often as we like. We also have the code to create the function, meaning we can use it in completely new projects, reducing the amount of code that has to be written (and retested) from scratch each time. As a rule of thumb, you should consider writing a function whenever you’ve copied and pasted a block of code more than twice.

To satisfy ourselves that the function has worked properly, we can compare the porto_aberdeen variable with our new variable porto_aberdeen_func using the identical() function. The identical() function tests whether two objects are exactly identical and returns either a TRUE or FALSE value. Use ?identical if you want to know more about this function.

identical(porto_aberdeen, porto_aberdeen_func)
## [1] TRUE

And we confirm that the function has produced the same result as when we do the calculation manually. We recommend getting into a habit of checking that the function you’ve created works the way you think it has.

Now let’s use our multiply_columns() function to multiply columns aberdeen and nairobi. Notice now that argument x is given the value city$aberdeen and y the value city$nairobi.

aberdeen_nairobi_func <- multiply_columns(x = city$aberdeen, y = city$nairobi)
aberdeen_nairobi_func
##   [1]           NA           NA           NA           NA           NA
##   [6]           NA           NA           NA           NA           NA
##  [11] -0.008136705 -0.220566536 -0.142245562  0.235641132 -1.242867097
##  [16] -0.973571475  0.675814695 -1.061042393 -0.459994949 -0.209974194
##  [21] -1.636558287 -0.048052468  0.479714952 -0.014199527 -3.114875993
##  [26] -0.633914752 -0.119910611  0.018562896  2.530876160 -2.791055936
##  [31] -0.136059408 -1.813828792 -0.311939982  0.544955311 -0.372292552
##  [36]  0.101426367  1.735211905  0.264296945 -1.060117797  0.929893435
##  [41]  0.284327783 -0.945869285 -0.886847895  0.315642963  0.413552816
##  [46]  0.528735498  0.044063766  0.213766843 -0.095633878 -0.002140543
##  [51] -2.189619315  0.780518394 -0.706173624 -0.060039320  0.087825998
##  [56] -2.170017162 -0.015087150 -0.203797375 -0.173636277  1.248520202
##  [61] -0.010845075  0.096133655 -0.906481424 -0.081691926  2.074620487
##  [66] -0.350633353 -3.748589791 -0.618963740  0.715851675 -3.352932279
##  [71]  2.750134561 -0.102692506  0.527379797 -0.933245137 -1.924203055
##  [76]  0.455045910  0.603605045 -1.302343772 -0.637190165 -3.609264319
##  [81] -0.018146442 -0.818895571 -1.876569370 -0.170592213 -2.713911727
##  [86] -0.688417979  1.658113182 -0.192627812 -0.017116470 -0.031779879
##  [91] -0.178078074 -0.032550091 -0.930049347  0.008825016 -0.352749371
##  [96]  0.413487506 -3.570955909  0.318120662  0.152980039 -1.769991752

So far so good. All we’ve really done is wrapped the code x * y into a function, where we ask the user to specify what their x and y variables are.

Now let’s add a little complexity. If you look at the output of nairobi_genoa some of the calculations have produced NA values. This is because of those NA values we included in nairobi when we created the city data frame. Despite these NA values, the function appeared to have worked but it gave us no indication that there might be a problem. In such cases we may prefer if it had warned us that something was wrong. How can we get the function to let us know when NA values are produced? Here’s one way.

multiply_columns <- function(x, y) {
  temp_var <- x * y
  if (any(is.na(temp_var))) {
    warning("The function has produced NAs")
    return(temp_var)
  } else {
    return(temp_var)
  }
}

aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
## Warning in multiply_columns(city$aberdeen, city$nairobi): The function has
## produced NAs
porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)

The core of our function is still the same. We still have x * y, but we’ve now got an extra six lines of code. Namely, we’ve included some conditional statements, if and else, to test whether any NAs have been produced and if they have we display a warning message to the user. The next section of this Chapter will explain how these work and how to use them.

Conditional statements

x * y does not apply any logic. It merely takes the value of x and multiplies it by the value of y. Conditional statements are how you inject some logic into your code. The most commonly used conditional statement is if. Whenever you see an if statement, read it as ‘If X is TRUE, do a thing’. Including an else statement simply extends the logic to ‘If X is TRUE, do a thing, or else do something different’.

Both the if and else statements allow you to run sections of code, depending on a condition is either TRUE or FALSE. The pseudocode below shows you the general form.

  if (condition) {
  Code executed when condition is TRUE
  } else {
  Code executed when condition is FALSE
  }

To delve into this a bit more, we can use an old programmer joke to set up a problem.


A programmer’s partner says: ‘Please go to the store and buy a carton of milk and if they have eggs, get six.’

The programmer returned with 6 cartons of milk.

When the partner sees this, and exclaims ‘Why the heck did you buy six cartons of milk?’

The programmer replied ‘They had eggs’.


At the risk of explaining a joke, the conditional statement here is whether or not the store had eggs. If coded as per the original request, the programmer should bring 6 cartons of milk if the store had eggs (condition = TRUE), or else bring 1 carton of milk if there weren’t any eggs (condition = FALSE). In R this is coded as:

eggs <- TRUE # Whether there were eggs in the store

  if (eggs == TRUE) { # If there are eggs
  n.milk <- 6 # Get 6 cartons of milk
    } else { # If there are not eggs
  n.milk <- 1 # Get 1 carton of milk
  }

We can then check n.milk to see how many milk cartons they returned with.

n.milk
## [1] 6

And just like the joke, our R code has missed that the condition was to determine whether or not to buy eggs, not more milk (this is actually a loose example of the [Winograd Scheme][winograd], designed to test the intelligence of artificial intelligence by whether it can reason what the intended referent of a sentence is).

We could code the exact same egg-milk joke conditional statement using an ifelse() function.

eggs <- TRUE
n.milk <- ifelse(eggs == TRUE, yes = 6, no = 1)

This ifelse() function is doing exactly the same as the more fleshed out version from earlier, but is now condensed down into a single line of code. It has the added benefit of working on vectors as opposed to single values (more on this later when we introduce loops). The logic is read in the same way; “If there are eggs, assign a value of 6 to n.milk, if there isn’t any eggs, assign the value 1 to n.milk”.

We can check again to make sure the logic is still returning 6 cartons of milk:

n.milk
## [1] 6

Currently we’d have to copy and paste code if we wanted to change if eggs were in the store or not. We learned above how to avoid lots of copy and pasting by creating a function. Just as with the simple x * y expression in our previous multiply_columns() function, the logical statements above are straightforward to code and well suited to be turned into a function. How about we do just that and wrap this logical statement up in a function?

milk <- function(eggs) {
  if (eggs == TRUE) {
    6
  } else {
    1
  }
}

We’ve now created a function called milk() where the only argument is eggs. The user of the function specifies if eggs is either TRUE or FALSE, and the function will then use a conditional statement to determine how many cartons of milk are returned.

Let’s quickly try:

milk(eggs = TRUE)
## [1] 6

And the joke is maintained. Notice in this case we have actually specified that we are fulfilling the eggs argument (eggs = TRUE). In some functions, as with ours here, when a function only has a single argument we can be lazy and not name which argument we are fulfilling. In reality, it’s generally viewed as better practice to explicitly state which arguments you are fulfilling to avoid potential mistakes.

OK, lets go back to the multiply_columns() function we created above and explain how we’ve used conditional statements to warn the user if NA values are produced when we multiple any two columns together.

multiply_columns <- function(x, y) {
  temp_var <- x * y
  if (any(is.na(temp_var))) {
    warning("The function has produced NAs")
    return(temp_var)
  } else {
    return(temp_var)
  }
}

In this new version of the function we still use x * y as before but this time we’ve assigned the values from this calculation to a temporary vector called temp_var so we can use it in our conditional statements. Note, this temp_var variable is local to our function and will not exist outside of the function due something called [R’s scoping rules][scoping]. We then use an if statement to determine whether our temp_var variable contains any NA values. The way this works is that we first use the is.na() function to test whether each value in our temp_var variable is an NA. The is.na() function returns TRUE if the value is an NA and FALSE if the value isn’t an NA. We then nest the is.na(temp_var) function inside the function any() to test whether any of the values returned by is.na(temp_var) are TRUE. If at least one value is TRUE the any() function will return a TRUE. So, if there are any NA values in our temp_var variable the condition for the if() function will be TRUE whereas if there are no NA values present then the condition will be FALSE. If the condition is TRUE the warning() function generates a warning message for the user and then returns the temp_var variable. If the condition is FALSE the code below the else statement is executed which just returns the temp_var variable.

So if we run our modified multiple_columns() function on the columns city$aberdeen and city$nairobi (which contains NAs) we will receive an warning message.

aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
## Warning in multiply_columns(city$aberdeen, city$nairobi): The function has
## produced NAs

Whereas if we multiple two columns that don’t contain NA values we don’t receive a warning message.

porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)

Combining logical operators

The functions that we’ve created so far have been perfectly suited for what we need, though they have been fairly simplistic. Let’s try creating a function that has a little more complexity to it. We’ll make a function to determine if today is going to be a good day or not based on two criteria. The first criteria will depend on the day of the week (Friday or not) and the second will be whether or not your code is working (TRUE or FALSE). To accomplish this, we’ll be using if and else statements. The complexity will come from if statements immediately following the relevant else statement. We’ll use such conditional statements four times to achieve all combinations of it being a Friday or not, and if your code is working or not.

good.day <- function(code.working, day) {
  if (code.working == TRUE && day == "Friday") {
    "BEST. DAY. EVER. Stop while you are ahead and go to the pub!"
  } else if (code.working == FALSE && day == "Friday") {
    "Oh well, but at least it's Friday! Pub time!"
  } else if (code.working == TRUE && day != "Friday") {
    "So close to a good day... shame it's not a Friday"
  } else if (code.working == FALSE && day != "Friday") {
    "Hello darkness."
  }
}

good.day(code.working = TRUE, day = "Friday")
## [1] "BEST. DAY. EVER. Stop while you are ahead and go to the pub!"

good.day(FALSE, "Tuesday")
## [1] "Hello darkness."

Notice that we never specified what to do if the day was not a Friday? That’s because, for this function, the only thing that matters is whether or not it’s Friday.

We’ve also been using logical operators whenever we’ve used if statements. Logical operators are the final piece of the logical conditions jigsaw. Below is a table which summarises operators. The first two are logical operators and the final six are relational operators. You can use any of these when you make your own functions (or loops).

 

Operator Technical Description What it means Example
&& Logical AND Both conditions must be met if(cond1 == test && cond2 == test)
|| Logical OR Either condition must be met if(cond1 == test || cond2 == test)
< Less than X is less than Y if(X < Y)
> Greater than X is greater than Y if(X > Y)
<= Less than or equal to X is less/equal to Y if(X <= Y)
>= Greater than or equal to X is greater/equal to Y if(X >= Y)
== Equal to X is equal to Y if(X == Y)
!= Not equal to X is not equal to Y if(X != Y)

 

Loops

R is very good at performing repetitive tasks. If we want a set of operations to be repeated several times we use what’s known as a loop. When you create a loop, R will execute the instructions in the loop a specified number of times or until a specified condition is met. There are three main types of loop in R: the for loop, the while loop and the repeat loop.

Loops are one of the staples of all programming languages, not just R, and can be a powerful tool (although in our opinion, used far too frequently when writing R code).

For loop

The most commonly used loop structure when you want to repeat a task a defined number of times is the for loop. The most basic example of a for loop is:

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

But what’s the code actually doing? This is a dynamic bit of code were an index i is iteratively replaced by each value in the vector 1:5. Let’s break it down. Because the first value in our sequence (1:5) is 1, the loop starts by replacing i with 1 and runs everything between the { }. Loops conventionally use i as the counter, short for iteration, but you are free to use whatever you like, even your pet’s name, it really does not matter (except when using nested loops, in which case the counters must be called different things, like SenorWhiskers and HerrFlufferkins.

So, if we were to do the first iteration of the loop manually

i <- 1
print(i)
## [1] 1

Once this first iteration is complete, the for loop loops back to the beginning and replaces i with the next value in our 1:5 sequence (2 in this case):

i <- 2
print(i)
## [1] 2

This process is then repeated until the loop reaches the final value in the sequence (5 in this example) after which point it stops.

To reinforce how for loops work and introduce you to a valuable feature of loops, we’ll alter our counter within the loop. This can be used, for example, if we’re using a loop to iterate through a vector but want to select the next row (or any other value). To show this we’ll simply add 1 to the value of our index every time we iterate our loop.

for (i in 1:5) {
  print(i + 1)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

As in the previous loop, the first value in our sequence is 1. The loop begins by replacing i with 1, but this time we’ve specified that a value of 1 must be added to i in the expression resulting in a value of 1 + 1.

i <- 1
i + 1
## [1] 2

As before, once the iteration is complete, the loop moves onto the next value in the sequence and replaces i with the next value (2 in this case) so that i + 1 becomes 2 + 1.

i <- 2
i + 1
## [1] 3

And so on. We think you get the idea! In essence this is all a for loop is doing and nothing more.

Whilst above we have been using simple addition in the body of the loop, you can also combine loops with functions.

Let’s go back to our data frame city. Previously in the Chapter we created a function to multiply two columns and used it to create our porto_aberdeen, aberdeen_nairobi, and nairobi_genoa objects. We could have used a loop for this. Let’s remind ourselves what our data look like and the code for the multiple_columns() function.

# Recreating our dataset
city <- data.frame(
  porto = rnorm(100),
  aberdeen = rnorm(100),
  nairobi = c(rep(NA, 10), rnorm(90)),
  genoa = rnorm(100)
)

# Our function
multiply_columns <- function(x, y) {
  temp <- x * y
  if (any(is.na(temp))) {
    warning("The function has produced NAs")
    return(temp)
  } else {
    return(temp)
  }
}

To use a list to iterate over these columns we need to first create an empty list (remember lists?) which we call temp (short for temporary) which will be used to store the output of the for loop.

temp <- list()
for (i in 1:(ncol(city) - 1)) {
  temp[[i]] <- multiply_columns(x = city[, i], y = city[, i + 1])
}
## Warning in multiply_columns(x = city[, i], y = city[, i + 1]): The function has
## produced NAs

## Warning in multiply_columns(x = city[, i], y = city[, i + 1]): The function has
## produced NAs

When we specify our for loop notice how we subtracted 1 from ncol(city). The ncol() function returns the number of columns in our city data frame which is 4 and so our loop runs from i = 1 to i = 4 - 1 which is i = 3. We’ll come back to why we need to subtract 1 from this in a minute.

So in the first iteration of the loop i takes on the value 1. The multiply_columns() function multiplies the city[, 1] (porto) and city[, 1 + 1] (aberdeen) columns and stores it in the temp[[1]] which is the first element of the temp list.

The second iteration of the loop i takes on the value 2. The multiply_columns() function multiplies the city[, 2] (aberdeen) and city[, 2 + 1] (nairobi) columns and stores it in the temp[[2]] which is the second element of the temp list.

The third and final iteration of the loop i takes on the value 3. The multiply_columns() function multiplies the city[, 3] (nairobi) and city[, 3 + 1] (genoa) columns and stores it in the temp[[3]] which is the third element of the temp list.

So can you see why we used ncol(city) - 1 when we first set up our loop? As we have four columns in our city data frame if we didn’t use ncol(city) - 1 then eventually we’d try to add the 4th column with the non-existent 5th column.

Again, it’s a good idea to test that we are getting something sensible from our loop (remember, check, check and check again!). To do this we can use the identical() function to compare the variables we created by hand with each iteration of the loop manually.

porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)
i <- 1
identical(multiply_columns(city[, i], city[, i + 1]), porto_aberdeen_func)
## [1] TRUE

aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
## Warning in multiply_columns(city$aberdeen, city$nairobi): The function has
## produced NAs
i <- 2
identical(multiply_columns(city[, i], city[, i + 1]), aberdeen_nairobi_func)
## Warning in multiply_columns(city[, i], city[, i + 1]): The function has
## produced NAs
## [1] TRUE

If you can follow the examples above, you’ll be in a good spot to begin writing some of your own for loops. That said there are other types of loops available to you.

While loop

Another type of loop that you may use (albeit less frequently) is the while loop. The while loop is used when you want to keep looping until a specific logical condition is satisfied (contrast this with the for loop which will always iterate through an entire sequence).

The basic structure of the while loop is:

while(logical_condition){ expression }

A simple example of a while loop is:

i <- 0
while (i <= 4) {
  i <- i + 1
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

Here the loop will only continue to pass values to the main body of the loop (the expression body) when i is less than or equal to 4 (specified using the <= operator in this example). Once i is greater than 4 the loop will stop.

There is another, very rarely used type of loop; the repeat loop. The repeat loop has no conditional check so can keep iterating indefinitely (meaning a break, or “stop here”, has to be coded into it). It’s worthwhile being aware of it’s existence, but for now we don’t think you need to worry about it; the for and while loops will see you through the vast majority of your looping needs.

When to use a loop?

Loops are fairly commonly used, though sometimes a little overused in our opinion. Equivalent tasks can be performed with functions, which are often more efficient than loops. Though this raises the question when should you use a loop?

In general loops are implemented inefficiently in R and should be avoided when better alternatives exist, especially when you’re working with large datasets. However, loop are sometimes the only way to achieve the result we want.

Some examples of when using loops can be appropriate:

  • Some simulations (e.g. the Ricker model can, in part, be built using loops)

  • Recursive relationships (a relationship which depends on the value of the previous relationship [“to understand recursion, you must understand recursion”])

  • More complex problems (e.g., how long since the last badger was seen at site \(j\), given a pine marten was seen at time \(t\), at the same location \(j\) as the badger, where the pine marten was detected in a specific 6 hour period, but exclude badgers seen 30 minutes before the pine marten arrival, repeated for all pine marten detections)

  • While loops (keep jumping until you’ve reached the moon)

If not loops, then what?

In short, use the apply family of functions; apply(), lapply(), tapply(), sapply(), vapply(), and mapply(). The apply functions can often do the tasks of most “home-brewed” loops, sometimes faster (though that won’t really be an issue for most people) but more importantly with a much lower risk of error. A strategy to have in the back of your mind which may be useful is; for every loop you make, try to remake it using an apply function (often lapply or sapply will work). If you can, use the apply version. There’s nothing worse than realising there was a small, tiny, seemingly meaningless mistake in a loop which weeks, months or years down the line has propagated into a huge mess. We strongly recommend trying to use the apply functions whenever possible.

lapply

Your go to apply function will often be lapply() at least in the beginning. The way that lapply() works, and the reason it is often a good alternative to for loops, is that it will go through each element in a list and perform a task (i.e. run a function). It has the added benefit that it will output the results as a list - something you’d have to otherwise code yourself into a loop.

An lapply() has the following structure:

lapply(X, FUN)

Here X is the vector which we want to do something to. FUN stands for how much fun this is (just kidding!). It’s also short for “function”.

Let’s start with a simple demonstration first. Let’s use the lapply() function create a sequence from 1 to 5 and add 1 to each observation (just like we did when we used a for loop):

lapply(0:4, function(a) {a + 1})
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5

Notice that we need to specify our sequence as 0:4 to get the output 1 ,2 ,3 ,4 , 5 as we are adding 1 to each element of the sequence. See what happens if you use 1:5 instead.

Equivalently, we could have defined the function first and then used the function in lapply().

add_fun <- function(a) {a + 1}
lapply(0:4, add_fun)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5

The sapply() function does the same thing as lapply() but instead of storing the results as a list, it stores them as a vector.

sapply(0:4, function(a) {a + 1})
## [1] 1 2 3 4 5

As you can see, in both cases, we get exactly the same results as when we used the for loop.