Hadley Wickham « Jared Lander

An often requested feature for Hadley Wickham's ggplot2 package is the ability to vertically dodge points, lines and bars. There has long been a function to shift geoms to the side when the x-axis is categorical: position_dodge. However, no such function exists for vertical shifts when the y-axis is categorical. Hadley usually responds by saying it should be easy to build, so here is a hacky patch.

All I did was copy the old functions (geom_dodge, collide, pos_dodge and PositionDodge) and make them vertical by swapping y's with x's, height with width and vice versa. It's hacky and not tested but seems to work as I'll show below.

First the new functions:

require(proto)

## Loading required package: proto

collidev <- function(data, height = NULL, name, strategy, check.height = TRUE) {
    if (!is.null(height)) {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data <- within(data, {
                ymin <- y - height/2
                ymax <- y + height/2
            })
        }
    } else {
        if (!(all(c("ymin", "ymax") %in% names(data)))) {
            data$ymin <- data$y
            data$ymax <- data$y
        }
        heights <- unique(with(data, ymax - ymin))
        heights <- heights[!is.na(heights)]
        if (!zero_range(range(heights))) {
            warning(name, " requires constant height: output may be incorrect", 
                call. = FALSE)
        }
        height <- heights[1]
    }
    data <- data[order(data$ymin), ]
    intervals <- as.numeric(t(unique(data[c("ymin", "ymax")])))
    intervals <- intervals[!is.na(intervals)]
    if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < -1e-06)) {
        warning(name, " requires non-overlapping y intervals", call. = FALSE)
    }
    if (!is.null(data$xmax)) {
        ddply(data, .(ymin), strategy, height = height)
    } else if (!is.null(data$x)) {
        message("xmax not defined: adjusting position using x instead")
        transform(ddply(transform(data, xmax = x), .(ymin), strategy, height = height), 
            x = xmax)
    } else {
        stop("Neither x nor xmax defined")
    }
}

pos_dodgev <- function(df, height) {
    n <- length(unique(df$group))
    if (n == 1) 
        return(df)
    if (!all(c("ymin", "ymax") %in% names(df))) {
        df$ymin <- df$y
        df$ymax <- df$y
    }
    d_width <- max(df$ymax - df$ymin)
    diff <- height - d_width
    groupidx <- match(df$group, sort(unique(df$group)))
    df$y <- df$y + height * ((groupidx - 0.5)/n - 0.5)
    df$ymin <- df$y - d_width/n/2
    df$ymax <- df$y + d_width/n/2
    df
}

position_dodgev <- function(width = NULL, height = NULL) {
    PositionDodgev$new(width = width, height = height)
}

PositionDodgev <- proto(ggplot2:::Position, {
    objname <- "dodgev"

    adjust <- function(., data) {
        if (empty(data)) 
            return(data.frame())
        check_required_aesthetics("y", names(data), "position_dodgev")

        collidev(data, .$height, .$my_name(), pos_dodgev, check.height = TRUE)
    }

})

Now that they are built we can whip up some example data to show them off. Since this was inspired by a refactoring of my coefplot package I will use a deconstructed sample.

# get tips data
data(tips, package = "reshape2")

# fit some models
mod1 <- lm(tip ~ day + sex, data = tips)
mod2 <- lm(tip ~ day * sex, data = tips)

# build data/frame with coefficients and confidence intervals and combine
# them into one data.frame
require(coefplot)

## Loading required package: coefplot

## Loading required package: ggplot2

df1 <- coefplot(mod1, plot = FALSE, name = "Base", shorten = FALSE)
df2 <- coefplot(model = mod2, plot = FALSE, name = "Interaction", shorten = FALSE)
theDF <- rbind(df1, df2)
theDF

##    LowOuter HighOuter LowInner HighInner     Coef            Name Checkers
## 1    1.9803    3.3065  2.31183    2.9750  2.64340     (Intercept)  Numeric
## 2   -0.4685    0.9325 -0.11822    0.5822  0.23202          daySat      day
## 3   -0.2335    1.1921  0.12291    0.8357  0.47929          daySun      day
## 4   -0.6790    0.7672 -0.31745    0.4056  0.04408         dayThur      day
## 5   -0.2053    0.5524 -0.01589    0.3630  0.17354         sexMale      sex
## 6    1.8592    3.7030  2.32016    3.2421  2.78111     (Intercept)  Numeric
## 7   -1.0391    1.0804 -0.50921    0.5506  0.02067          daySat      day
## 8   -0.5430    1.7152  0.02156    1.1507  0.58611          daySun      day
## 9   -1.2490    0.8380 -0.72725    0.3163 -0.20549         dayThur      day
## 10  -1.3589    1.1827 -0.72349    0.5473 -0.08811         sexMale      sex
## 11  -1.0502    1.7907 -0.34000    1.0804  0.37022  daySat:sexMale  day:sex
## 12  -1.5324    1.4149 -0.79560    0.6781 -0.05877  daySun:sexMale  day:sex
## 13  -0.9594    1.9450 -0.23328    1.2189  0.49282 dayThur:sexMale  day:sex
##          CoefShort       Model
## 1      (Intercept)        Base
## 2           daySat        Base
## 3           daySun        Base
## 4          dayThur        Base
## 5          sexMale        Base
## 6      (Intercept) Interaction
## 7           daySat Interaction
## 8           daySun Interaction
## 9          dayThur Interaction
## 10         sexMale Interaction
## 11  daySat:sexMale Interaction
## 12  daySun:sexMale Interaction
## 13 dayThur:sexMale Interaction

# build the plot
require(ggplot2)
require(plyr)

## Loading required package: plyr

ggplot(theDF, aes(y = Name, x = Coef, color = Model)) + geom_vline(xintercept = 0, 
    linetype = 2, color = "grey") + geom_errorbarh(aes(xmin = LowOuter, xmax = HighOuter), 
    height = 0, lwd = 0, position = position_dodgev(height = 1)) + geom_errorbarh(aes(xmin = LowInner, 
    xmax = HighInner), height = 0, lwd = 1, position = position_dodgev(height = 1)) + 
    geom_point(position = position_dodgev(height = 1), aes(xmax = Coef))

plot of chunk make-Plot

Compare that to the multiplot function in coefplot that was built using geom_dodge and coord_flip.

multiplot(mod1, mod2, shorten = F, names = c("Base", "Interaction"))

plot of chunk multiplot

With the exception of the ordering and plot labels, these charts are the same. The main benefit here is that avoiding coord_flip still allows the plot to be faceted, which was not possible with coord_flip.

Hopefully Hadley will be able to take these functions and incorporate them into ggplot2.

Jared Lander is the Chief Data Scientist of Lander Analytics a New York data science and AI firm, Adjunct Professor at Columbia University, Organizer of the New York Open Statistical Programming meetup and the New York and Government Data Science and AI Conferences and author of R for Everyone.

A great way to visualize the results of a regression is to use a Coefficient Plot like the one to the right. I’ve seen people on Twitter asking how to build this and there has been an option available using Andy Gelman’s coefplot() in the arm package. Not knowing this I built my own (as seen in this post about taste testing tomatoes) and they both suffered the same problems:. Long coefficient names often got cut off by the left margin of the graph and the name of the variable was appended to all the levels of a factor. One big difference between his and mine is that his does not include the Intercept by default. Mine includes the intercept with the option of excluding it.

I managed to solve the latter problem pretty quickly using some regular expressions. Now the levels of factors are displayed alone, without being prepended by the factor name. As for the former, I fixed that yesterday by taking advantage of ggplot by Hadley Wickham which deals with the margins better than I do.

Both of these changes made for a vast improvement over what I had avialable before. Future improvements will address the sorting of the coefficients displayed and allow users to choose their own display names for the coefficients.

The function is in this file and is called plotCoef() and is very customizable, down to the color and line thickness. I kept my old version, plotCoefBase(), in the file in case some people are adverse to using ggplot, though no one should be. I sent the code to Dr. Gelman to hopefully be incorporated into his function which I’m sure gets used by a lot more people than mine will. Examples of my old version and of Dr. Gelman’s are after the break.

As always, any comments or questions are welcomed. Go to the Contact page or send an email to contact -at- jaredlander -dot- com or find me on Twitter @jaredlander. Continue reading →

Last week Slice ran a post about a tomato taste test they conducted with Scott Wiener (of Scott’s NYC Pizza Tours), Brooks Jones, Jason Feirman, Nick Sherman and Roberto Caporuscio from Keste. While the methods used may not be rigorous enough for definitive results, I took the summary data that was in the post and performed some simple analyses.

The first thing to note is that there are only 16 data points, so multiple regression is not an option. We can all thank the Curse of Dimensionality for that. So I stuck to simpler methods and visualizations. If I can get the raw data from Slice, I can get a little more advanced.

For the sake of simplicity I removed the tomatoes from Eataly because their price was such an outlier that it made visualizing the data difficult. As usual, most of the graphics were made using ggplot2 by Hadley Wickham. The coefficient plots were made using a little function I wrote. Here is the code. Any suggestions for improvement are greatly appreciated, especially if you can help with increasing the left hand margin of the plot. And as always, all the work was done in R.

The most obvious relationship we want to test is Overall Quality vs. Price. As can be seen from the scatterplot below with a fitted loess curve, there is not a linear relationship between price and quality.

More after the break. Continue reading →

Drew Conway has a piece on his Zero Intelligence Agents blog about how well informed Tea Party protesters are about tax policy. His analysis is pretty technical and he even offers up the R code he used to analyze the data and build the graphs which were made with a package called ggplot2 by Hadley Wickham at Rice University.

More after the break. Continue reading →

Jared Lander

Tag Archives: Hadley Wickham

Vertical Dodging in ggplot2

Related Posts

Coefficient Plot

Related Posts

Tomato Taste Test

Related Posts

Overestimating Taxes

Related Posts