# Some resources for R help (especially for GLMMs)

**Posted:** August 8, 2011

**Filed under:** data analysis, R, statistics

A quick pointer to some on-line resources for help on stats, especially on GLMMs (from the R user community):

- The GLMM Wiki is a resource especially for researchers working with GLMMs and includes a FAQ from R-sig-ME.
- RSeek allows searches by category, including support lists, functions, code, and blogs; it covers R-sig-ME and appears to be up to date.
- R Site Search also includes R-sig-ME through 2010.
- The R-lang archives are searchable, too.

# Formatting numbers for printing in R: rounding and trailing zeroes

**Posted:** January 11, 2011

**Filed under:** code, data analysis, R

In journal articles, descriptive statistics are often presented as a table of means, with the standard deviation or standard error given in parentheses following each mean. Here is how you can prepare the print formatting when working in `R`: it’s rather trivial, but I wasted enough time on it that I thought it was worth mentioning. (I follow this by using `latex()` from the `Hmisc` package to generate `LaTeX` output of such a table.)

Suppose I have a vector of means for percent correct from 5 different experimental conditions, and a vector of the corresponding standard errors, and both are sorted in the same order:

    all.res.corr$per.correct  # vector of mean percent correct, from a data frame called all.res.corr
    #[1] 52.53561 60.51282 64.13105 66.38177 67.46439
    all.res.corr$se.per.corr  # vector of standard errors, from the same data frame
    #[1] 2.409955 2.763525 2.831450 2.909737 2.902924

- First, use `round()` to round the numbers to two decimal places.

      round(all.res.corr$per.correct, 2)
      #[1] 52.54 60.51 64.13 66.38 67.46
      round(all.res.corr$se.per.corr, 2)
      #[1] 2.41 2.76 2.83 2.91 2.90

- Then use `formatC()` to access C-like formatting options and print the two vectors with two digits after the decimal point. With `format = "f"`, the `digits` argument sets the number of digits printed after the decimal point. This is a crucial step! If you don’t do it, trailing zeroes won’t print, even though it looks like `round()` preserves the trailing zero in the preceding code block: it gets lost in a numeric-to-character conversion. See this post for an alternative using `sprintf()` (which may be familiar from many other programming languages).

      formatC(round(all.res.corr$per.correct, 2), digits = 2, format = "f")
      #[1] "52.54" "60.51" "64.13" "66.38" "67.46"  # note conversion to character type
      formatC(round(all.res.corr$se.per.corr, 2), digits = 2, format = "f")
      #[1] "2.41" "2.76" "2.83" "2.91" "2.90"

- Finally, use `paste()` to concatenate the vectors for publication, together with parentheses in the appropriate places. The code below concatenates the arguments given: the formatted mean vector `all.res.corr$per.correct`, a space and an opening parenthesis `(`, the formatted SE vector `all.res.corr$se.per.corr`, and a closing parenthesis `)`. The separator `sep` is specified as the empty string, so that no separator appears between the concatenated arguments.

      (per.all.res.corr.print <- paste(formatC(round(all.res.corr$per.correct, 2), digits = 2, format = "f"),
                                       " (",
                                       formatC(round(all.res.corr$se.per.corr, 2), digits = 2, format = "f"),
                                       ")", sep = ""))
      # n.b. Parentheses around an assignment command print the assigned value
      #[1] "52.54 (2.41)" "60.51 (2.76)" "64.13 (2.83)" "66.38 (2.91)" "67.46 (2.90)"
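The `sprintf()` alternative mentioned above can do the rounding, zero-padding, and concatenation in a single call. A minimal sketch with stand-in vectors (the numbers are copied from the example above):

```r
# Stand-in vectors for all.res.corr$per.correct and all.res.corr$se.per.corr
means <- c(52.53561, 60.51282, 64.13105, 66.38177, 67.46439)
ses   <- c(2.409955, 2.763525, 2.831450, 2.909737, 2.902924)

# "%.2f" rounds to two decimals and keeps trailing zeroes;
# sprintf() recycles its vector arguments elementwise
sprintf("%.2f (%.2f)", means, ses)
#[1] "52.54 (2.41)" "60.51 (2.76)" "64.13 (2.83)" "66.38 (2.91)" "67.46 (2.90)"
```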

# Missing data and data aggregation in R

**Posted:** December 1, 2010

**Filed under:** code, data analysis, R, statistics

I was puzzled why my bar plots (using the R package `ggplot2` and `geom_bar()`) were showing up with doubled bars (see facet 2/19).

When I looked at the data used for the plotting, it turned out that the data frame I was plotting from had dropped rows with missing data (e.g. the data frame has no row for subj.new 2 in some experimental conditions):

    dat.subj <- ddply(mono, c("is.creak", "response", "subj.new"),
                      function(d) data.frame(mean.log.rt = mean(d[, "log.rt"])))
    > head(dat.subj)
      is.creak response subj.new mean.log.rt
    1        0       T4        1  0.20970491
    3        0       T4        3 -0.35065706
    4        0       T4        4 -0.02450301
    5        0       T4        5 -0.20948722
    6        0       T4        6  0.72335601

Then I learned about the `.drop` argument for `ddply()` from this post. Read the post for more information on how other data aggregation functions behave with respect to missing data.

By default, `ddply()` uses `.drop = TRUE`. So I assigned `.drop = FALSE`, and now the missing row appears with `NaN`.

    dat.subj <- ddply(mono, c("is.creak", "response", "subj.new"),
                      function(d) data.frame(mean.log.rt = mean(d[, "log.rt"])),
                      .drop = FALSE)
    > head(dat.subj)
      is.creak response subj.new mean.log.rt
    1        0       T4        1  0.20970491
    2        0       T4        2         NaN
    3        0       T4        3 -0.35065706
    4        0       T4        4 -0.02450301
    5        0       T4        5 -0.20948722
    6        0       T4        6  0.72335601

Here’s the revised plot, which prints correctly.

Postscript: the last thing I needed to fix was the calculation of the standard error over subjects in further data analysis, for which I used the `sd()` function in aggregation. Because the data frame for subjects, `dat.subj`, now included rows with missing data, I needed to call `sd()` so that it ignores missing values, like this:

    sd(d[, "mean.log.rt"], na.rm = TRUE)
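For the standard error itself, the count in the denominator also has to exclude the missing values. A minimal sketch with a made-up vector (not data from the post):

```r
# Made-up per-subject means with one missing cell, as in dat.subj above
x <- c(0.21, NaN, -0.35, -0.02, -0.21, 0.72)

# Standard error of the mean over the non-missing subjects:
# divide by the number of observed values, not length(x)
n.obs <- sum(!is.na(x))
se <- sd(x, na.rm = TRUE) / sqrt(n.obs)
```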

# Units for log reaction times

**Posted:** November 28, 2010

**Filed under:** data analysis, mathematics

In psychological experiments, it is common to measure reaction time in the response of a subject (an experimental participant). In inferential statistics, reaction times are typically log-transformed because raw reaction times are skewed to the right: there is a lower physical limit on how fast participants can respond, but no upper one. Many statistical tests assume a normal distribution of the dependent variable, and thus reaction times are log-transformed to reduce skew.
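To see the effect, here is a minimal sketch with made-up reaction times in seconds (the numbers are illustrative, not data from any experiment); the sample skewness is computed by hand rather than with a package:

```r
# Made-up reaction times (seconds) with a long right tail
rt <- c(0.42, 0.48, 0.51, 0.55, 0.63, 0.70, 0.95, 1.80)

# Sample skewness: standardized third central moment (sd() uses n - 1)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(rt)       # clearly positive: right-skewed
skew(log(rt))  # smaller: the log transform reduces the skew
```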

What happens to the units when we take the logarithm of a reaction time? A reaction time is measured in some unit of time, [T], e.g. seconds. But we cannot take the logarithm of a dimensioned quantity! The logarithm is defined as the inverse of exponentiation:

$y = \log_b x \iff x = b^y$, where $b > 0$ and $b \neq 1$.

Dimensional homogeneity must be preserved under equality; that is, the units of $y$ must be the same as the units of $\log_b x$, and the units of $x$ must be the same as the units of $b^y$. Since the exponent $y$ varies, $b^y$ can only have consistent dimensions if $b$ is dimensionless, and then $x$ and $y$ are dimensionless too. Thus, $x$, $y$, and $b$ must all be unitless: in particular, **the logarithmic function does not admit a dimensioned quantity as an argument, and a log-transformed quantity is unitless**. Note that dimensional analysis is essentially type checking, where the types are physical units.

So when we log-transform a reaction time, e.g. $\log(\mathrm{RT})$, what we actually mean is $\log(\mathrm{RT}/(1\,\mathrm{s}))$, which we can also write as $\log(\mathrm{RT}/\mathrm{s})$. **Since the logarithmic function admits only unitless arguments, we must take the logarithm of a ratio of the reaction time to a reference time**. In physical systems, there is often a natural standard reference value to take the ratio to, as for pressure (standard atmospheric pressure), but there isn’t such a natural standard that I know of for reaction times. So one can take the ratio with respect to a unit quantity in the units the reaction time was measured in, as shown above.

Thus, in labeling a plot or table of log-transformed reaction times, it is incorrect to write *log RT (s)*. Instead, one should write *log (RT/s)* or *log (RT/[s])* or maybe *log RT (RT in s)*. We still want to know what units the raw reaction times were measured in, since changing the unit shifts all the log-transformed values by an additive constant!
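In R, such an axis label can be typeset with `plotmath` via `expression()`; a minimal sketch using base graphics (the data are made up):

```r
# Made-up log-transformed reaction times (RTs measured in seconds, then
# divided by 1 s before taking the log, so the values are unitless)
log.rt <- log(c(0.42, 0.55, 0.63, 0.70, 0.95, 1.10))

# expression(log(RT/s)) renders as the label "log(RT/s)" on the axis
hist(log.rt, xlab = expression(log(RT/s)), main = "")
```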

For an expanded discussion of these topics, see Can One Take the Logarithm or the Sine of a Dimensioned Quantity or a Unit? Dimensional Analysis Involving Transcendental Functions by Chérif F. Matta, Lou Massa, Anna V. Gubskaya, and Eva Knoll, to appear in the *Journal of Chemical Education*.