# Missing data and data aggregation in R

**Posted:** December 1, 2010

**Filed under:** code, data analysis, R, statistics

I was puzzled why my bar plots (made with the R package `ggplot2` and `geom_bar()`) were showing up with doubled bars (see facet 2/19).

When I looked at the data used for the plotting, it turned out that the data frame I was plotting from had dropped rows for empty groups (e.g. there is no row for subj.new 2 in some experimental conditions):

```r
dat.subj <- ddply(mono, c("is.creak", "response", "subj.new"),
                  function(d) data.frame(mean.log.rt = mean(d[, "log.rt"])))
```

```
> head(dat.subj)
  is.creak response subj.new mean.log.rt
1        0       T4        1  0.20970491
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601
```

Then I learned about the `.drop` argument for `ddply()` from this post. Read that post for more information on how other data aggregation functions behave with respect to missing data.

By default, `ddply()` uses `.drop = TRUE`. So I set `.drop = FALSE`, and now the missing row appears with `NaN`.

```r
dat.subj <- ddply(mono, c("is.creak", "response", "subj.new"),
                  function(d) data.frame(mean.log.rt = mean(d[, "log.rt"])),
                  .drop = FALSE)
```

```
> head(dat.subj)
  is.creak response subj.new mean.log.rt
1        0       T4        1  0.20970491
2        0       T4        2         NaN
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601
```
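As a side note, the `NaN` (rather than `NA`) in the filled-in rows is simply what `mean()` returns for an empty group; a quick base-R check (my own illustration, not from the original post):

```r
# mean() of a zero-length vector yields NaN, which is why rows
# added for empty groups show NaN rather than NA
mean(numeric(0))
```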

Here’s the revised plot, which prints correctly.

Postscript: the last thing I needed to fix was the calculation of standard error over subjects in further data analysis, where I used the `sd()` function in aggregation. Because the data frame for subjects, `dat.subj`, now included rows with missing data, I needed to call `sd()` with `na.rm = TRUE` so that it ignores missing values, like this:

```r
sd(d[, "mean.log.rt"], na.rm = TRUE)
```
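For completeness, here is a hedged sketch of a standard-error helper along these lines (the `se` name and the sample values are mine, not from the original analysis):

```r
# Hypothetical helper: standard error of the mean, ignoring missing values.
# is.na() is TRUE for both NA and NaN, so both kinds of missing rows are dropped.
se <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

se(c(0.21, NaN, -0.35, -0.02))  # computed over the three non-missing values
```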

# Units for log reaction times

**Posted:** November 28, 2010

**Filed under:** data analysis, mathematics

In psychological experiments, it is common to measure the reaction time of a subject's (an experimental participant's) response. In inferential statistics, reaction times are typically log-transformed because raw reaction times are right-skewed: there is a lower physical limit on how fast participants can respond, but no upper limit. Many statistical tests assume a normal distribution of the dependent variable, and thus reaction times are log-transformed to reduce skew.
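A quick simulated illustration (my own sketch, not from the post): lognormally distributed "reaction times" are right-skewed, with the mean pulled above the median, while their logarithms are roughly symmetric.

```r
set.seed(1)
rt <- rlnorm(1e4, meanlog = -0.5, sdlog = 0.4)  # skewed "raw RTs" in seconds

mean(rt) > median(rt)            # right skew pulls the mean above the median
mean(log(rt)) - median(log(rt))  # near zero: the log scale is roughly symmetric
```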

What happens to the units when we take the logarithm of a reaction time? A reaction time is measured in some unit of time, [T], e.g. seconds. But we cannot take the logarithm of a dimensioned quantity! The logarithm is defined as the inverse of exponentiation:

$y = \log_b x$, where $b^y = x$.

Dimensional homogeneity must be preserved under equality; that is, the units of $b^y$ must be the same as the units of $x$, and the units of $y$ must be the same as the units of $\log_b x$. Thus, $b$, $x$, and $y$ must all be unitless: in particular, **the logarithmic function does not admit a dimensioned quantity as an argument, and a log-transformed quantity is unitless**. Note that dimensional analysis is essentially type checking, where the types are physical units.

So when we log transform a reaction time, e.g. $\log(\mathrm{RT})$ with RT measured in seconds, what we actually mean is $\log(\mathrm{RT}/(1\,\mathrm{s}))$, which we can also write as $\log(\mathrm{RT}/\mathrm{s})$. **Since the logarithmic function admits only unitless arguments, we must take the logarithm of a ratio of reaction times**. In physical systems, there is often a natural standard reference value to take the ratio to, as for pressure (standard atmospheric pressure), but there isn't such a natural standard that I know of for reaction times. So one can take the ratio with respect to a unit quantity in the units the reaction time was measured in, as shown above.

Thus, in labeling a plot or table of log-transformed reaction times, it is incorrect to write *log RT (s)*. Instead, one should write *log (RT/s)* or *log (RT/[s])* or maybe *log RT (RT in s)*. We still want to know what units the raw reaction times were measured in, since a change of unit shifts every log-transformed value by a constant!
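To see that the choice of reference unit only shifts log values by a constant, here is a small sketch of my own (the reaction-time values are made up):

```r
rt_s  <- c(0.35, 0.52, 0.81)  # hypothetical reaction times in seconds
rt_ms <- rt_s * 1000          # the same times in milliseconds

# log(RT/ms) differs from log(RT/s) by the additive constant log(1000):
all.equal(log(rt_ms), log(rt_s) + log(1000))
```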

For an expanded discussion of these topics, see *Can One Take the Logarithm or the Sine of a Dimensioned Quantity or a Unit? Dimensional Analysis Involving Transcendental Functions* by Chérif F. Matta, Lou Massa, Anna V. Gubskaya, and Eva Knoll, to appear in the *Journal of Chemical Education*.

# R: do not have the nlme and lme4 packages simultaneously loaded

**Posted:** November 10, 2010

**Filed under:** code, R, statistics

I noticed that when I was trying to display the summary of an `lmer` object using `display()` from the `arm` package, I was getting this error:

```
Error in UseMethod("fixef") :
  no applicable method for 'fixef' applied to an object of class "mer"
```

And I found from this post that you should not have the `nlme` and `lme4` packages loaded simultaneously (`lmer()` comes from `lme4`; both packages provide a `fixef()` generic, and nlme's can mask lme4's). To detach a package, for instance `nlme`, you can do this (see R FAQ 5.2):

```r
detach("package:nlme")
```
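A minimal, self-contained sketch of this detach pattern, using `stats4` (a package shipped with base R) as a stand-in for `nlme` so it runs anywhere:

```r
# Attach a package, confirm it is on the search path, then detach it
library(stats4)
"package:stats4" %in% search()  # TRUE while the package is attached

detach("package:stats4")
"package:stats4" %in% search()  # FALSE once detached
```

Checking `search()` this way is a quick test for whether a conflicting package is attached before calling `display()`.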