We’ve seen a number of R’s built-in functions so far, but here comes the best part — you can write your own functions.
Recall a line we had earlier:
> sum(Nile > 1200)
This gave us the count of the elements in the Nile data larger than 1200.
Now, say we want the mean of those elements:
> gt1200Indices <- which(Nile > 1200)
> nileSubsetGT1200 <- Nile[gt1200Indices]
> mean(nileSubsetGT1200)
[1] 1250
As before, we could instead write a more compact version,
> mean(Nile[Nile > 1200])
[1] 1250
But it’s best to do it step by step at first. Here is the step-by-step code again, with line numbers for reference:
1 gt1200Indices <- which(Nile > 1200)
2 nileSubsetGT1200 <- Nile[gt1200Indices]
3 mean(nileSubsetGT1200)
Let’s review how this works:
In line 1, we find the indices in Nile for the elements larger than 1200.
In line 2, we extract the subset of Nile consisting of those elements.
In line 3, we compute the desired mean.
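To watch the three steps produce their intermediate results, here is a small sketch on a toy vector. The vector v, the bound 6 and the names idx and vSubset are made up for illustration; they are not part of the Nile example.

```r
# Hypothetical toy vector, chosen so the intermediate results are easy to read
v <- c(5, 12, 7, 20, 3)

# Step 1: the positions of the elements larger than 6
idx <- which(v > 6)
idx              # 2 3 4

# Step 2: the elements sitting at those positions
vSubset <- v[idx]
vSubset          # 12 7 20

# Step 3: the mean of that subset
mean(vSubset)    # 13
```

The same pattern, with Nile in place of v and 1200 in place of 6, is what the three numbered lines above carry out.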
But we may wish to do this kind of thing often, on many datasets. Then we have:
Tip: If we have an operation we will use a lot, we should consider writing a function for it.
Say we want to do the above again, but with 1350 instead of 1200. Or, with the tg$len vector from our ToothGrowth example, with 10.2 as our lower bound. We could keep typing the same pattern as above, but if we’re going to do this a lot, it’s better to write a function for it. Here is our function:
> mgd <- function(x,d) mean(x[x > d])
Here I’ve used a compact form for convenience. (Otherwise I’d need to use blocks, to be covered in a later lesson.) I named it mgd for “mean of elements greater than d,” but any name is fine.
Let’s try it out, then explain:
> mgd(Nile,1200)
[1] 1250
> mgd(tg$len,10.2)
[1] 21.58125
This saved me typing. In the second call, I would otherwise have had to type mean(tg$len[tg$len > 10.2]), which is considerably longer. But even more importantly, I’d have to think about the operation each time I used it; by making a function out of it, I’ve got it ready to go, all debugged, whenever I need it.
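The 1350 case mentioned earlier now takes a single short call. The sketch below also tries a bound of 2000, which is my own addition, to show what to expect when no element exceeds the bound.

```r
mgd <- function(x,d) mean(x[x > d])

# Only one Nile value, 1370, exceeds 1350, so the mean is that value
mgd(Nile,1350)   # 1370

# Assumed extra case: nothing in Nile exceeds 2000, so the subset is empty,
# and the mean of an empty vector is NaN
mgd(Nile,2000)   # NaN
```

A NaN result from mgd is thus a sign that the bound was set above every element of the vector.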
So, how does all this work? Again, look at the code:
> mgd <- function(x,d) mean(x[x > d])
> class(mgd)
[1] "function"
There is a lot going on here. Bear with me for a moment, as I bring in a little of the “theory” of R:
Odd to say, but there is a built-in function in R itself named ‘function’! We’ve already seen several built-in R functions, e.g. mean(), sum() and plot(). Well, here is another, function(). We’re calling it here. And its job is to build a function. Yes, as I like to say, to my students’ amusement, “The function of the function named function is to build functions! And the class of object returned by function is function!”
So, in the line
> mgd <- function(x,d) mean(x[x > d])
we are telling R, “R, I want to write my own function. I’d like to name it mgd; it will have arguments x and d, and it will do mean(x[x > d]). Please build the function for me. Thanks in advance, R!”
Here we called function to build a function object, and then assigned it to mgd. We can then call the latter, as we saw above, repeated here for convenience:
> mgd(Nile,1200)
[1] 1250
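Since building the function and naming it are two separate actions, the naming can even be skipped. The following is a small sketch of that idea; the name meanAbove is hypothetical, chosen only to show that the same object can be stored under any name.

```r
# Build a function and call it at once, without ever assigning it a name
(function(x,d) mean(x[x > d]))(Nile,1200)   # 1250

# Build the same function, this time assigning it to a hypothetical name
meanAbove <- function(x,d) mean(x[x > d])
class(meanAbove)       # "function"
meanAbove(Nile,1200)   # 1250
```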
In the definition
> mgd <- function(x,d) mean(x[x > d])
x and d are known as formal arguments, as they are just placeholders. For example, in
> mgd(Nile,1200)
we said, “R, please execute mgd with Nile playing the role of x, and 1200 playing the role of d.” Here Nile and 1200 are known as the actual arguments.
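One way to see the pairing of formal and actual arguments is to spell it out in the call. This sketch uses R’s built-in formals() function, which has not been covered in these lessons so far, to list the formal argument names.

```r
mgd <- function(x,d) mean(x[x > d])

# The names of the formal arguments
names(formals(mgd))      # "x" "d"

# Three equivalent calls
mgd(Nile,1200)           # actual arguments matched by position: 1250
mgd(x = Nile,d = 1200)   # matched by name: 1250
mgd(d = 1200,x = Nile)   # with names, the order no longer matters: 1250
```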
As with variables, we can pretty much name functions and their arguments as we please.
As you have seen with R’s built-in functions, a function will typically have a return value. In our case here, we could arrange that by writing
> mgd <- function(x,d) return(mean(x[x > d]))
a bit more complicated than the above version. But the call to return is not needed here, because in any function, R will return the last value computed, in this case the requested mean.
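A quick check that the two forms behave the same. The names mgdImplicit and mgdExplicit are hypothetical, used only so that both versions can exist side by side.

```r
# Version relying on R returning the last value computed
mgdImplicit <- function(x,d) mean(x[x > d])

# Version with an explicit call to return
mgdExplicit <- function(x,d) return(mean(x[x > d]))

# Both give the same result on the Nile data
identical(mgdImplicit(Nile,1200), mgdExplicit(Nile,1200))   # TRUE
```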
And we can save the function for later use. One way to do this is to call R’s save function, which can be used to save any R object:
> save(mgd,file='mean_greater_than_d')
The function has now been saved in the indicated file, which will be in whatever folder R is running in right now. We can leave R and, say, come back tomorrow. If we then start R from that same folder, we run
> load('mean_greater_than_d')
and then mgd will be restored, ready for us to use again.
(Typically this is not the way people save code, but this is the subject of a later lesson.)
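The whole save-and-restore cycle can be tried within a single session. In this sketch I write to a scratch file from R’s built-in tempfile() rather than to 'mean_greater_than_d', an assumption made only to avoid leaving a file in the current folder, and I use rm() to mimic leaving R.

```r
mgd <- function(x,d) mean(x[x > d])

f <- tempfile()      # hypothetical scratch file name
save(mgd,file=f)     # write the function object to the file

rm(mgd)              # mgd is gone, as if we had quit R

load(f)              # bring it back from the file
mgd(Nile,1200)       # 1250
```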
Let’s write another function, this one to find the range of a vector, i.e. the difference between the minimal and maximal values:
> rng <- function(y) max(y) - min(y)
> rng(Nile)
[1] 914
Here we made use of the built-in R functions max and min.
Tip: Build new functions from old ones (which may in turn depend on other old ones, etc.).
Again, the last item computed is the subtraction, so it will be automatically returned, just what we want. As before, I chose to name the argument y, but it could be anything. However, I did not name the function range, as there is already a built-in R function of that name.
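For comparison, here is what the built-in range gives. It returns the minimum and the maximum themselves rather than their difference, so it is a different function from rng. The use of diff() below is my own addition, not something covered in the lessons so far.

```r
rng <- function(y) max(y) - min(y)

range(Nile)         # 456 1370, the smallest and largest values
diff(range(Nile))   # 914, the gap between those two values
rng(Nile)           # 914, the same number from our own function
```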
Your Turn: Try your hand at writing some simple functions along the lines seen here. You might start by writing a function cgd(), like mgd() above, but returning the count of the number of elements in x that are greater than d. Then you may try writing a function n0(x), that returns the number of 0s in the vector x. (Hint: Make use of R’s == and sum.) Another suggestion would be a function hld(x,d), which draws a histogram for those elements in the vector x that are less than d. Write at least 4 or 5 functions; the more you write, the easier it will be in later lessons.
Functions are R objects, just as are vectors, lists and so on. Thus, we can print them by just typing their names!
> mgd <- function(x,d) mean(x[x > d])
> mgd
function(x,d) mean(x[x > d])
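Being ordinary objects, functions can also be copied to another name or handed to other functions. The following is a small sketch of that idea; the name rngCopy and the two toy vectors are hypothetical, and sapply() is a built-in R function not covered in these lessons so far.

```r
rng <- function(y) max(y) - min(y)

# Copy the function object to a second name, just as we would copy a vector
rngCopy <- rng
rngCopy(Nile)    # 914

# Hand the function to sapply, which applies it to each element of a list
sapply(list(a = c(1,5,9), b = c(10,40)), rng)   # a is 8, b is 30
```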