Edit: This post originally appeared on my WordPress blog on September 20, 2009. I present it here in its original form.
The R Function of the Day series describes in plain language how certain R functions work, using simple examples that you can apply to gain insight into your own data.
Today, I will discuss the tapply function.
What situation is tapply useful in?
In statistics, one of the most basic activities we do is computing summaries of variables. These summaries might be as simple as an average, or more complex. Let's look at some simple examples.
When you read the results of a medical trial, you will see things such as "The average age of subjects in this trial was 55 years in the treatment group, and 54 years in the control group."
As another example, let's look at one from the world of baseball.
Batting Leaders per Team
| Team | Player | Batting Average |
|---|---|---|
| Minnesota Twins | Joe Mauer | .374 |
| Seattle Mariners | Ichiro Suzuki | .355 |
| Boston Red Sox | Kevin Youkilis | .309 |
| … | … | … |
These two examples have a lot in common, even if that is not obvious at first reading. In the first example, we have a dataset from a medical trial. We want to break up the dataset into two groups, treatment and control, and then compute the sample average for age within each group.
In the second example, we want to break up the dataset into 30 groups, one for each MLB team, and then compute the maximum batting average within each group.
So what do they have in common?
In each case:
- We have a dataset that can be broken up into groups
- We want to break it up into those groups
- Within each group, we want to apply a function
The following table summarizes the situation.
| Example | Group Variable | Summary Variable | Function |
|---|---|---|---|
| Medical Example | treatment | age | mean |
| Baseball Example | team | batting average | max |
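To make the pattern concrete, here is a minimal sketch that does the two steps by hand with base R's split and sapply. The ages are made up for illustration and do not come from any real trial, and the names ages, group, pieces, and by.hand are my own choices.

```r
## A small hand-typed dataset (made-up values, for illustration only)
ages  <- c(50, 60, 70, 52, 54, 56)
group <- factor(c("Treatment", "Treatment", "Treatment",
                  "Control", "Control", "Control"),
                levels = c("Treatment", "Control"))

## Step 1: break the ages up into one vector per group
pieces <- split(ages, group)

## Step 2: apply the summary function within each group
by.hand <- sapply(pieces, mean)
by.hand
```

Each group is summarized separately: the mean is 60 for Treatment and 54 for Control.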
The tapply function can solve both of these problems for us!
How do I use tapply?
The tapply function is simple to use. First, we will generate some data. Because the values are drawn at random, your numbers will differ from those shown.

    > ## generate data for medical example
    > medical.example <- data.frame(patient = 1:100, age = rnorm(100, mean = 60, sd = 12), treatment = gl(2, 50, labels = c("Treatment", "Control")))
    > summary(medical.example)
        patient            age             treatment
     Min.   :  1.00   Min.   : 29.40   Treatment:50
     1st Qu.: 25.75   1st Qu.: 54.31   Control  :50
     Median : 50.50   Median : 61.24
     Mean   : 50.50   Mean   : 61.29
     3rd Qu.: 75.25   3rd Qu.: 66.22
     Max.   :100.00   Max.   :102.47
    > ## generate data for baseball example
    > ## 5 teams with 5 players per team
    >
    > baseball.example <- data.frame(team = gl(5, 5, labels = paste("Team", LETTERS[1:5])), player = sample(letters, 25), batting.average = runif(25, .200, .400))
    > summary(baseball.example)
         team       player   batting.average
     Team A:5   a      : 1   Min.   :0.2172
     Team B:5   c      : 1   1st Qu.:0.2553
     Team C:5   d      : 1   Median :0.2854
     Team D:5   e      : 1   Mean   :0.2887
     Team E:5   f      : 1   3rd Qu.:0.3013
                g      : 1   Max.   :0.3859
                (Other):19
Now that we have some sample data, using tapply is straightforward. The first comment below shows the general form of the call; the calls after it use the data we defined above.

    > ## Generic Example
    > ## tapply(Summary Variable, Group Variable, Function)
    >
    > ## Medical Example
    > tapply(medical.example$age, medical.example$treatment, mean)
    Treatment   Control
     62.26883  60.30371
    > ## Baseball Example
    > tapply(baseball.example$batting.average, baseball.example$team, max)
       Team A    Team B    Team C    Team D    Team E
    0.3784396 0.3012680 0.3488655 0.2962828 0.3858841
Summary of tapply
The tapply function is useful when we need to break up a vector into groups defined by some classifying factor, compute a function on the subsets, and return the results in a convenient form. You can even specify multiple factors as the grouping variable, for example treatment and sex, or team and handedness.
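As a rough sketch of the multiple-factor case, the grouping factors can be passed together in a list. The data below are typed in by hand so the result is easy to check; every value is hypothetical, as are the sex column and the names trial, by.treatment, and by.both.

```r
## Hand-typed toy data; all values and the sex column are hypothetical
trial <- data.frame(
  age       = c(50, 60, 70, 80, 52, 54, 56, 58),
  treatment = factor(rep(c("Treatment", "Control"), each = 4),
                     levels = c("Treatment", "Control")),
  sex       = factor(rep(c("F", "M"), times = 4))
)

## One grouping factor: one mean per treatment level.
## with() looks the column names up inside the data frame,
## so the trial$ prefix is not needed.
by.treatment <- with(trial, tapply(age, treatment, mean))

## Two grouping factors, passed together in a list: a matrix with
## one row per treatment level and one column per sex level
by.both <- with(trial, tapply(age, list(treatment, sex), mean))
by.both
```

With one factor the means are 65 for Treatment and 55 for Control. With two factors, each cell of the matrix holds the mean for one combination, for example 60 for females in the Treatment group.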

Comments
tapply
"You can even specify multiple factors as the grouping variable, for example treatment and sex, or team and handedness."
But how?
Re: Multiple factors
It's not obvious, is it? Here is an example using the mtcars data set built into R. In the case of a single grouping factor, say number of cylinders ("cyl"):

    > tapply(mtcars$mpg, mtcars$cyl, mean)

which is also equivalent to:

    > tapply(mtcars$mpg, mtcars[c('cyl')], mean)
    cyl
           4        6        8
    26.66364 19.74286 15.10000

but you wouldn't usually employ the second style of notation for one factor (at least I wouldn't). However, this style of notation is what you could use to specify multiple factors:

    > tapply(mtcars$mpg, mtcars[c('cyl','am')], mean)
Well done! You have in one
Well done! You have in one short post cut through the confusion that the R help file firmly implanted! I will most definitely take a look at your other R blog posts! THANK YOU!
Well done!
As Chris said, a readable, understandable description of what tapply does. Thanks for helping me gain a new tool in R.
tapply with multiple FUN?
Can tapply be used to call multiple statistical functions in one line (like mean, sd, etc.)? So far I can only get it to work with a single function at a time.
It is a nice demonstration
It is a nice demonstration of the tapply() function. It would be more straightforward to understand if tables of data were used directly instead of using a program to generate the tables, since the reader may have difficulty figuring out the structure of the tables.