Some resources for R help (especially for GLMMs)

A quick pointer to some on-line resources for help on stats, especially on GLMMs (from the R user community):

  • The GLMM Wiki is a resource especially for researchers working with GLMMs and includes a FAQ from R-sig-ME.
  • RSeek allows searches by different categories, including support lists, functions, code, and blogs, includes R-sig-ME, and appears to be up-to-date.
  • R Site Search also includes R-sig-ME through 2010.
  • The R-lang archives are searchable, too.

Formatting numbers for printing in R: rounding and trailing zeroes

In journal articles, descriptive statistics results are often presented as a table of means, with standard deviation/standard error given in parentheses following the mean. Here is how you can prepare the print formatting when working in R: it’s rather trivial, but I wasted a sufficient amount of time on it that I thought it was worth mentioning. (I follow this by using latex() from the Hmisc package to generate LaTeX output of such a table.)

Suppose I have a vector of means for percent correct from 5 different experimental conditions, and a vector of the corresponding standard errors, and both are sorted in the same order:

all.res.corr$per.correct # Vector of mean percent correct, comes from a data frame called all.res.corr
#[1] 52.53561 60.51282 64.13105 66.38177 67.46439
all.res.corr$se.per.corr # Vector of standard errors, from same data frame
#[1] 2.409955 2.763525 2.831450 2.909737 2.902924
  1. First, use round() to round the numbers to two decimal places.
    #[1] 52.54 60.51 64.13 66.38 67.46
    #[1] 2.41 2.76 2.83 2.91 2.90
  2. Then use formatC() to access C-like formatting options to format the two vectors to print up to 2 digits after the decimal point. The argument specification format = "f" allows you to set the number of digits after the decimal point to print. This is a crucial step! If you don’t do it, you won’t get trailing zeroes to print, even though it looks like in the preceding code block, for instance, that round() preserves the trailing zero. It gets lost in a numeric to character data type conversion. See this post for an alternative using sprintf() (which may be familiar from many other programming languages).
    formatC(round(all.res.corr$per.correct,2), 2, format = "f"
    #[1] "52.54" "60.51" "64.13" "66.38" "67.46" # note conversion to character type
    formatC(round(all.res.corr$se.per.corr,2), 2, format = "f")
    [1] "2.41" "2.76" "2.83" "2.91" "2.90"
  3. Finally, use paste() to concatenate the vectors for publication, together with parentheses in the appropriate place. The code below concatenates the arguments given: the formatted mean vector all.res.corr$per.correct, a space and opening parentheses (, the formatted SE vector all.res.corr$se.per.corr, and a closing parentheses ). The separator sep is specified to be an empty string, for there to be no separator between the arguments concatenated.
  4. (per.all.res.corr.print <- paste(formatC(round(all.res.corr$per.correct,2), 2, format = "f"), " (", formatC(round(all.res.corr$se.per.corr,2), 2, format = "f"), ")", sep="") # n.b. Parentheses around an assignment command print the assigned value
    #[1] "52.54 (2.41)" "60.51 (2.76)" "64.13 (2.83)" "66.38 (2.91)" "67.46 (2.90)"

Missing data and data aggregation in R

Faceted barplot with doubled bars

Note the doubled bars in facet 2, 19. This is because of missing rows in the data frame.

I was puzzled why my bar plots (using the R package ggplot2 and geom_bar()) were showing up with doubled bars (see facet 2/19).

When I looked at the data used for the plotting, it turned out that the data frame I was plotting data from had suppressed rows with missing data (e.g. the data frame has no row for 2 for some experimental conditions):

dat.subj <- ddply(mono, c("is.creak","response",""), function(d) data.frame(mean.log.rt=mean(d[,"log.rt"])))

> head(dat.subj)
  is.creak response mean.log.rt
1        0       T4        1  0.20970491
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601

Then I learned about the .drop argument for ddply() from this post. Read the post for more information on how other data aggregation functions behave with respect to missing data.

By default, ddply() assigns .drop = TRUE. So I assigned .drop = FALSE and now the missing row appears with NaN.

dat.subj <- ddply(mono, c("is.creak","response",""), function(d) data.frame(mean.log.rt=mean(d[,"log.rt"])), .drop=FALSE)

> head(dat.subj)
  is.creak response mean.log.rt
1        0       T4        1  0.20970491
2        0       T4        2         NaN
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601

Here’s the revised plot, which prints correctly.

The plot shows missing data correctly when the data frame indicates missing data explicitly with NaN

Postscript: the last thing I needed to fix was that for calculating standard error over the subjects in further data analysis, I used the sd() function in aggregation. Because the data frame for subjects, dat.subj, now included rows with missing data, I needed to call sd() to ignore missing values, like this:

sd(d[,"mean.log.rt"], na.rm = TRUE)

Units for log reaction times

In psychological experiments, it is common to measure reaction time in the response of a subject (an experimental participant). In inferential statistics, reaction times are typically log-transformed because raw reaction times are skewed to the high. This is because there is a lower physical limit to how fast participants can respond, but not an upper one. Many statistical tests assume a normal distribution of the dependent variable, and thus, reaction times are log-transformed to reduce skew.

What happens to the units when we take the logarithm of a reaction time? A reaction time is measured in some unit of time, [T], e.g. seconds. But we cannot take the logarithm of a dimensioned quantity! The logarithm is defined as the inverse of exponentiation:

y = \log_{b}x \,\,\,\mbox{if}\,\,\, x = b^{y}

where x, y, b \in \mathbb{R}.

Dimensional homogeneity must be preserved under equality; that is, the units of x must be the same as the units of b^{y} and the units of y must be the same as the units of \log_{b}x. Thus, x, y, b must all be unitless: in particular, the logarithmic function does not admit a dimensioned quantity as an argument, and a log-transformed quantity is unitless. Note that dimensional analysis is essentially type checking, where the types are physical units.

So when we log transform a reaction time, e.g. \log(0.024 s), what we actually mean is \log(\frac{0.024 s}{1 s}) which we can also write as \log(0.024 s/s). Since the logarithmic function admits only unitless arguments, we must take the logarithm of a ratio of reaction times. In physical systems, there is often a natural standard reference value to take the ratio to, as for pressure (standard atmospheric pressure), but there isn’t such a natural standard that I know of for reaction times. So one can take the ratio with respect to a unit quantity in the units the reaction time was measured in, as shown above.

Thus, in labeling a plot or table of log-transformed reaction times, it is incorrect to write log RT (s). Instead, one should write log (RT/s) or log (RT/[s]) or maybe log RT (RT in s). We still want to know what units the raw reaction times were measured in, since they scale the log-transformed values!

For an expanded discussion of these topics, see Can One Take the Logarithm or the Sine of a Dimensioned Quantity or a Unit? Dimensional Analysis Involving Transcendental Functions by Chérif F. Matta and Lou Massa and Anna V. Gubskaya and Eva Knoll, to appear in the Journal of Chemical Education.