Stock, W. A., & Behrens, J. T. (1991). Box, line, and midgap plots: Effects of display characteristics on the accuracy and bias of estimates of whisker length. Journal of Educational Statistics, 16(1), 1–20. https://doi.org/10.2307/1165096
Examined the accuracy and bias of estimates of whisker length on box-and-whisker plots based on box, line, and midgap plots. For each type of graph, a different sample of undergraduates (58 Ss total) viewed 48 single-plot graphs. For each plot, Ss were given the length of an interquartile spread and asked to estimate the length of a whisker. Plots varied in spatial orientation (horizontal or vertical), interquartile spread, the ratio of whisker length to interquartile spread, and whisker judged. Estimates of whisker length for box and line plots were more accurate and less biased than those for midgap plots. Interquartile spread, the ratio of whisker length to interquartile spread, and the interaction of these 2 factors significantly influenced both accuracy and bias. Boxplots displayed a predicted pattern of over- and underestimation. Midgap plots are judged to be less optimal displays than box and line plots.
That paper is fairly widely accessible. People with access to JSTOR will find it easily. You can tell from the abstract that the authors aren't positive about the design, but to me the question being focused on is of little or no interest: the point about a box plot is not whether I can estimate distances by eye exactly; if I want to know those distances I look at the numbers! The point about a box plot is whether it helps me get a good idea of the main features of a distribution. Also, with respect, whether a captive audience of undergraduates is good with a design they may not have seen before is of interest and importance, but not my main concern, which is communicating results to myself and to other researchers.
But what is a midgap plot? It appears to be the authors' own term for a design mentioned, and not even very enthusiastically, by Edward Tufte in what I still think is his best single book on visualization (https://www.edwardtufte.com/tufte/books_vdqi). Tufte's name is quartile plot.
It's a box plot -- without the box -- but with a marker symbol for the median, and whiskers between each quartile and the extreme beyond. This is minimalism so minimal that minimalism is too long a word for it. But the information the box gives is implied by the other information.
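As a quick illustration of what such a plot encodes, here are the five numbers per group that the median marker and the two whiskers stand for. This is only a sketch using the auto dataset shipped with Stata; the choice of mpg and foreign is for illustration.

```stata
sysuse auto, clear
* minimum, lower quartile, median, upper quartile, maximum of mpg by group
tabstat mpg, by(foreign) statistics(min p25 p50 p75 max)
```

Each whisker runs from an extreme to the nearer quartile, and the gap between the two whiskers is where the box would have been.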
I think it's a much better idea than any of these authors imply, especially if the data are also shown.
* Boxes in a box plot show emphatically where the median and quartiles are, but sometimes the emphasis is too strong. The quartiles are not magic thresholds at which anything happens beyond the cumulative probability passing 25% and 75%. This can bite very hard, as with a U-shaped distribution in which the top 25% and the bottom 25% are shown only by short whiskers. Even experienced statisticians have misread such box plots. (To be fair, John Tukey in Exploratory Data Analysis has a salutary example showing the superiority of dot plots over box plots where the data are basically two groups.)
* Boxes take up space, but you can control that by making them thin. The ultimate in control of box width is to make them invisible.
* What is going on in the middle of a distribution is not necessarily the feature that needs most emphasis. The tails are as or more important for many problems.
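The U-shaped case in the first point is easy to try out. The sketch below uses simulated data, not anything from the paper: the beta(0.3, 0.3) distribution, the sample size and the seed are all arbitrary assumptions chosen just to get values piling up near both ends.

```stata
* Simulated U-shaped data: beta(0.3, 0.3) piles values up near 0 and 1
clear
set seed 2803
set obs 500
generate y = rbeta(0.3, 0.3)

* The box spans most of the range; each short whisker hides a dense tail
summarize y, detail
graph box y, name(ubox, replace)

* A histogram of the same data shows the concentration at both ends
histogram y, bin(20) name(uhist, replace)
```

Comparing the two graphs side by side shows how a long box and short whiskers can suggest a concentrated middle that is not there.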
For box plots I sometimes use graph box or graph hbox but more often I reach for stripplot from SSC. I thought about hitting the code to add a distinct new option, but the syntax is complicated enough already and it's possible to get there without too much extra work. I like to show the data and summaries too, and I don't mind if that design is accused of being repetitive or redundant.
Code:
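* stripplot is user-written; install once with: ssc install stripplot
sysuse auto, clear
set scheme s1color
* median of mpg within each group, to be plotted as a separate marker
egen median = median(mpg), by(foreign)
* x position for the median marker, matching boffset(-0.07) below
gen where = foreign - 0.07
* barw(0) makes the box invisible; pctile(0) runs whiskers to the extremes
stripplot mpg , over(foreign) stack box(barw(0)) pctile(0) boffset(-0.07) vertical addplot(scatter median where, ms(Dh) mc(black)) ms(Sh) height(0.2)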
I used diamonds partly for fun, whereas all the authors mentioned above used circles, but I think that really is detail at a designer's discretion. It's helpful, however, to be able to say in a caption: medians are shown by diamonds, or whatever else you choose, and whiskers join quartiles and extremes. (Don't use the same marker symbol for data and medians.)
Next time I will rotate the y axis labels to horizontal and lose the tick marks on the horizontal axis.
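For anyone who prefers to stay with official commands, here is a rough sketch of the same design built with twoway, with those two axis tweaks included. It is not how stripplot does it internally; the variable names lo, q1, med, q3, hi and tag are made up for illustration, and the data points are not stacked, so ties overplot.

```stata
* Midgap (quartile) plot from official commands only; a sketch
sysuse auto, clear

* Five summary values per group (variable names are illustrative)
egen lo  = min(mpg), by(foreign)
egen q1  = pctile(mpg), p(25) by(foreign)
egen med = median(mpg), by(foreign)
egen q3  = pctile(mpg), p(75) by(foreign)
egen hi  = max(mpg), by(foreign)

* Summaries sit just left of the data, in the spirit of boffset(-0.07)
generate where = foreign - 0.07

* Tag one observation per group so each summary is drawn once
egen tag = tag(foreign)

* Whiskers join extremes and quartiles; diamonds mark medians.
* angle(horizontal) rotates the y axis labels; noticks drops x axis ticks.
twoway rspike lo q1 where if tag                        ///
    || rspike q3 hi where if tag                        ///
    || scatter med where if tag, ms(Dh) mc(black)       ///
    || scatter mpg foreign, ms(Sh)                      ///
    xlabel(0 "Domestic" 1 "Foreign", noticks)           ///
    ylabel(, angle(horizontal))                         ///
    xscale(range(-0.5 1.5))                             ///
    ytitle("Mileage (mpg)") xtitle("") legend(off)
```

The /// line continuations need a do-file or the do-file editor; they will not work if the command is pasted into the Command window.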