Missing data and data aggregation in R

Faceted barplot with doubled bars

Note the doubled bars in facet 2, 19. This is because of missing rows in the data frame.

I was puzzled why my bar plots (using the R package ggplot2 and geom_bar()) were showing up with doubled bars (see facet 2/19).

When I looked at the data used for the plotting, it turned out that the data frame I was plotting from had no rows at all for combinations of conditions with missing data (e.g. the data frame has no row for subj.new 2 for some experimental conditions):


library(plyr)  # provides ddply()
dat.subj <- ddply(mono, c("is.creak","response","subj.new"), function(d) data.frame(mean.log.rt=mean(d[,"log.rt"])))

> head(dat.subj)
  is.creak response subj.new mean.log.rt
1        0       T4        1  0.20970491
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601

Then I learned about the .drop argument for ddply() from this post. Read the post for more information on how other data aggregation functions behave with respect to missing data.

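By default, ddply() uses .drop = TRUE, which drops combinations of the grouping variables that do not occur in the data. So I set .drop = FALSE, and now the missing row appears with NaN (the mean of an empty vector).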

dat.subj <- ddply(mono, c("is.creak","response","subj.new"), function(d) data.frame(mean.log.rt=mean(d[,"log.rt"])), .drop=FALSE)

> head(dat.subj)
  is.creak response subj.new mean.log.rt
1        0       T4        1  0.20970491
2        0       T4        2         NaN
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601

Here’s the revised plot, which prints correctly.

The plot shows missing data correctly when the data frame indicates missing data explicitly with NaN
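
The effect of .drop can be reproduced on a small made-up data frame. This is only a sketch: the data frame toy, its values, and the helper cell.mean are hypothetical and stand in for the real data, which is not shown here. It assumes the plyr package is installed.

```r
library(plyr)  # provides ddply()

# Hypothetical toy data (not the real experiment): subject 2 has no
# trials in the is.creak = 0 condition.
toy <- data.frame(
  is.creak = factor(c(0, 0, 1, 1, 1)),
  subj.new = factor(c(1, 3, 1, 2, 3)),
  log.rt   = c(0.2, -0.3, 0.1, 0.4, -0.1)
)

# Per-cell mean, written the same way as in the ddply() calls above
cell.mean <- function(d) data.frame(mean.log.rt = mean(d[, "log.rt"]))

# Default (.drop = TRUE): the empty (is.creak = 0, subj.new = 2) cell gets no row
dropped <- ddply(toy, c("is.creak", "subj.new"), cell.mean)

# .drop = FALSE: the empty cell is kept, and mean(numeric(0)) gives NaN
kept <- ddply(toy, c("is.creak", "subj.new"), cell.mean, .drop = FALSE)

print(dropped)
print(kept)
```

Comparing the two printed data frames shows the one extra row that the plot needs in order to leave a gap rather than widen the neighbouring bar.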

Postscript: the last thing I needed to fix concerned the standard error over subjects, which I calculated in further data analysis using the sd() function in aggregation. Because the data frame for subjects, dat.subj, now included rows with missing data, I needed to tell sd() to ignore missing values, like this:

sd(d[,"mean.log.rt"], na.rm = TRUE)
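
When turning this into a standard error, the number of subjects in the denominator should probably count only the non-missing values as well. A minimal sketch, assuming a hypothetical vector subj.means of per-subject means (the numbers are made up):

```r
# Hypothetical per-subject means; subject 2 is missing (NaN)
subj.means <- c(0.21, NaN, -0.35, -0.02, -0.21, 0.72)

# is.na() is TRUE for NaN as well as NA, so this counts usable subjects
n <- sum(!is.na(subj.means))

# Standard error of the mean over the subjects that have data
se <- sd(subj.means, na.rm = TRUE) / sqrt(n)
print(se)
```

Using length(subj.means) instead of n would count the missing subject and make the standard error too small.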

Units for log reaction times

In psychological experiments, it is common to measure how long a subject (an experimental participant) takes to respond, i.e. the reaction time. In inferential statistics, reaction times are typically log-transformed because raw reaction times are right-skewed, with a long tail toward high values. This is because there is a lower physical limit on how fast participants can respond, but no upper one. Many statistical tests assume a normal distribution of the dependent variable, and thus reaction times are log-transformed to reduce skew.

What happens to the units when we take the logarithm of a reaction time? A reaction time is measured in some unit of time, [T], e.g. seconds. But we cannot take the logarithm of a dimensioned quantity! The logarithm is defined as the inverse of exponentiation:

y = \log_{b}x \,\,\,\mbox{if}\,\,\, x = b^{y}

where x, y, b \in \mathbb{R}, with b > 0, b \neq 1 and x > 0.

Dimensional homogeneity must be preserved under equality; that is, the units of x must be the same as the units of b^{y} and the units of y must be the same as the units of \log_{b}x. Thus, x, y, b must all be unitless: in particular, the logarithmic function does not admit a dimensioned quantity as an argument, and a log-transformed quantity is unitless. Note that dimensional analysis is essentially type checking, where the types are physical units.

So when we log transform a reaction time, e.g. \log(0.024 s), what we actually mean is \log(\frac{0.024 s}{1 s}) which we can also write as \log(0.024 s/s). Since the logarithmic function admits only unitless arguments, we must take the logarithm of a ratio of reaction times. In physical systems, there is often a natural standard reference value to take the ratio to, as for pressure (standard atmospheric pressure), but there isn’t such a natural standard that I know of for reaction times. So one can take the ratio with respect to a unit quantity in the units the reaction time was measured in, as shown above.

Thus, in labeling a plot or table of log-transformed reaction times, it is incorrect to write log RT (s). Instead, one should write log (RT/s) or log (RT/[s]) or maybe log RT (RT in s). We still want to know what units the raw reaction times were measured in, since the choice of unit shifts every log-transformed value by an additive constant!
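
To see how the unit of the raw reaction times enters the log-transformed values, here is a small sketch with made-up reaction times (the vectors rt.s and rt.ms are hypothetical). Dividing by 1 s or by 1 ms leaves the numbers unchanged, so log() is applied to the numeric values directly.

```r
# Hypothetical reaction times, not taken from any experiment
rt.s  <- c(0.024, 0.350, 1.200)  # measured in seconds
rt.ms <- rt.s * 1000             # the same times in milliseconds

log.rt.s  <- log(rt.s)   # numerically log(RT / 1 s)
log.rt.ms <- log(rt.ms)  # numerically log(RT / 1 ms)

# The two versions differ by the same constant, log(1000), for every observation
shift <- log.rt.ms - log.rt.s
print(shift)
print(log(1000))
```

Differences between log reaction times are therefore the same in either unit, but the absolute values are not, which is why the label should say which unit was used.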

For an expanded discussion of these topics, see “Can One Take the Logarithm or the Sine of a Dimensioned Quantity or a Unit? Dimensional Analysis Involving Transcendental Functions” by Chérif F. Matta, Lou Massa, Anna V. Gubskaya, and Eva Knoll, to appear in the Journal of Chemical Education.


R: do not have the nlme and lme4 packages simultaneously loaded

I noticed that when I tried to display the summary of an lmer object using display() from the arm package, I got this error:

Error in UseMethod("fixef") :
no applicable method for 'fixef' applied to an object of class "mer"

I found from this post that you should not have the nlme and lme4 packages (lme4 is the package that provides lmer()) loaded at the same time. To detach a package, for instance nlme, you can do this (see R FAQ 5.2):

 detach("package:nlme")
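
Since detach() signals an error when the named package is not attached, a script may want to guard the call. A minimal sketch:

```r
# Detach nlme only if it is currently attached to the search path
if ("package:nlme" %in% search()) {
  detach("package:nlme")
}
```

After this runs, search() should no longer list "package:nlme", whether or not it was attached beforehand.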