Missing data and data aggregation in R

Faceted barplot with doubled bars

Note the doubled bars in facet 2, 19. This is because of missing rows in the data frame.

I was puzzled why my bar plots (made with geom_bar() from the R package ggplot2) were showing up with doubled bars (see facet 2/19).

When I looked at the data behind the plot, it turned out that the aggregated data frame simply had no rows for combinations of the grouping variables that never occur in the data (e.g. there is no row for subj.new 2 in some experimental conditions):


library(plyr)  # provides ddply()

dat.subj <- ddply(mono, c("is.creak", "response", "subj.new"),
                  function(d) data.frame(mean.log.rt = mean(d[, "log.rt"])))

> head(dat.subj)
  is.creak response subj.new mean.log.rt
1        0       T4        1  0.20970491
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601
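
A gap like the missing row 2 above is easy to overlook in a long data frame. Here is a minimal base R sketch for listing the absent cells. The data frame agg and its values are made up for illustration; they are not taken from mono.

```r
# Hypothetical aggregated data: the (T2, subject 2) cell has no row.
agg <- data.frame(
  response    = c("T4", "T4", "T4", "T2", "T2"),
  subj.new    = c(1, 2, 3, 1, 3),
  mean.log.rt = c(0.21, 0.05, -0.35, 0.10, -0.02),
  stringsAsFactors = FALSE
)

# Build every response x subject combination that could occur.
full <- expand.grid(
  response = unique(agg$response),
  subj.new = unique(agg$subj.new),
  stringsAsFactors = FALSE
)

# Left join: combinations without a match get NA in mean.log.rt.
merged <- merge(full, agg, all.x = TRUE)

# The cells that are absent from the aggregated data.
missing_cells <- merged[is.na(merged$mean.log.rt), c("response", "subj.new")]
print(missing_cells)
```

If missing_cells has any rows, the plot is being drawn from an incomplete grid.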

Then I learned about the .drop argument for ddply() from this post. Read the post for more information on how other data aggregation functions behave with respect to missing data.

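By default, ddply() uses .drop = TRUE, which drops combinations of the grouping variables that do not occur in the data. So I set .drop = FALSE, and now the missing combination gets its own row. Its value is NaN because mean() of an empty vector is NaN.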

dat.subj <- ddply(mono, c("is.creak", "response", "subj.new"),
                  function(d) data.frame(mean.log.rt = mean(d[, "log.rt"])),
                  .drop = FALSE)

> head(dat.subj)
  is.creak response subj.new mean.log.rt
1        0       T4        1  0.20970491
2        0       T4        2         NaN
3        0       T4        3 -0.35065706
4        0       T4        4 -0.02450301
5        0       T4        5 -0.20948722
6        0       T4        6  0.72335601
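
Since mono is not included here, the following is a small self-contained sketch of the same behaviour. The data frame toy and the helper cell_mean are hypothetical stand-ins that only mirror the column names used above.

```r
library(plyr)

# Hypothetical toy data: subject 2 has no trials in the is.creak = 0 condition.
toy <- data.frame(
  is.creak = factor(c(0, 0, 1, 1, 1)),
  response = factor(c("T4", "T4", "T4", "T4", "T4")),
  subj.new = factor(c(1, 3, 1, 2, 3)),
  log.rt   = c(0.2, -0.3, 0.1, 0.4, -0.1)
)

# Same aggregation as in the post: one mean per cell.
cell_mean <- function(d) data.frame(mean.log.rt = mean(d[, "log.rt"]))

groups <- c("is.creak", "response", "subj.new")

# Default (.drop = TRUE): the empty cell produces no row at all.
dropped <- ddply(toy, groups, cell_mean)

# .drop = FALSE: the empty cell is kept, and mean() of an empty vector is NaN.
kept <- ddply(toy, groups, cell_mean, .drop = FALSE)

print(dropped)
print(kept)
```

Comparing nrow(dropped) with nrow(kept) is a quick way to see how many cells the default setting silently leaves out.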

Here’s the revised plot, which prints correctly.

The plot shows missing data correctly when the data frame indicates missing data explicitly with NaN

Postscript: one last thing needed fixing. In further data analysis I calculate the standard error over subjects, using the sd() function in the aggregation. Because the data frame for subjects, dat.subj, now includes rows with missing values, sd() returns NA unless it is told to ignore them, so I call it with na.rm = TRUE, like this:

sd(d[,"mean.log.rt"], na.rm = TRUE)
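
One related detail, sketched here as an assumption about how the standard error might be computed (the post only shows the sd() call): the subject count in the denominator should also leave out the missing rows. The vector x is made-up example data.

```r
# Hypothetical per-subject means, with two missing values.
x <- c(0.21, NaN, -0.35, -0.02, NA)

# Without na.rm, the result is missing.
sd(x)

# Count only the subjects that actually contributed a value.
# is.na() is TRUE for NaN as well as NA.
n <- sum(!is.na(x))

# Standard error of the mean over the non-missing subjects.
se <- sd(x, na.rm = TRUE) / sqrt(n)
print(se)
```

Using length(x) instead of n would divide by the number of rows, including the empty cells, and understate the standard error.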