Meet the little-known cousins of the histogram!
Most folks have a good understanding of basic histograms: they are just bar charts with a continuous numeric axis. For example, here is a simple histogram showing how 4-year universities (excluding for-profit schools) in the United States were distributed by tuition & fees for the 2014-15 academic year:
The X-axis represents total tuition & fees, rounded to the nearest $1,000. The Y-axis represents the number of institutions. For those interested, this data is freely available from the government (which is why it's a bit stale). A common thing to do with histograms is to overlay a normal distribution curve on top, to see the overall expected distribution pattern and make deviations easier to spot:
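For readers who want to reproduce the overlay numerically, here is a minimal sketch in Python. It assumes the raw tuition values are available as a plain list and that the normal curve is fitted using the sample mean and standard deviation. The function names (`bin_counts`, `normal_expected`) are hypothetical and not part of the downloadable workbook.

```python
import math
from statistics import NormalDist, mean, stdev

def bin_counts(values, bin_width=1000):
    """Count values per bin, labelling each bin by its center
    (each value is rounded to the nearest multiple of bin_width)."""
    counts = {}
    for v in values:
        center = math.floor(v / bin_width + 0.5) * bin_width
        counts[center] = counts.get(center, 0) + 1
    return dict(sorted(counts.items()))

def normal_expected(values, centers, bin_width=1000):
    """Expected count per bin if the data followed a normal distribution
    with the same mean and standard deviation as the sample (an assumption)."""
    dist = NormalDist(mean(values), stdev(values))
    n = len(values)
    half = bin_width / 2
    return {c: n * (dist.cdf(c + half) - dist.cdf(c - half)) for c in centers}
```

Comparing `bin_counts(...)` against `normal_expected(...)` bin by bin gives the same information the overlaid curve conveys visually.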
Often the entire point of overlaying the distribution curve is to see how an actual histogram differs from a distribution estimate. But in 1971-1972, mathematician John Tukey (best known for his 1977 invention of the box plot) noted that the difference between two values, when one is represented as a bar and the other as a curved line, is hard to measure accurately. Tukey was referring to this:
When a normal distribution curve is overlaid on top of a histogram, it is virtually guaranteed that the portion of the line graph that overlaps any particular bar will not be flat. Since it is hard to estimate exactly where the horizontal midpoint of the bar is, it is also hard to estimate the difference between the line and bar with the naked eye. To solve this visualization challenge, Tukey proposed "hanging" the bars from the distribution curve line and using the X-axis as a flat line to measure the difference between the histogram and the distribution curve:
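The "hanging" idea boils down to simple arithmetic: each bar starts at the curve's value for its bin and extends downward by the observed count. A minimal sketch, assuming the observed and expected counts are already on the same scale (the function name `hanging_bars` is hypothetical):

```python
def hanging_bars(observed, expected):
    """Return (top, bottom) for each bin.

    The bar hangs from the expected value (top) and drops by the observed
    count, so bottom = expected - observed. A bottom above zero means the
    bin has fewer observations than expected; below zero means more.
    """
    return [(e, e - o) for o, e in zip(observed, expected)]
```

The distance between each bar's bottom and the X-axis is exactly the gap between the curve and the histogram for that bin.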
This object is known as a hanging rootogram*. The differences become much easier to estimate, since you can now use the X-axis itself as a flat line for your comparisons:
* Actually, it's not. The "hanging" part is true, but a "rootogram" uses a square root scale for both the bar and line, in order to show departures from expectations even at small frequencies. But I'm trying to keep things simple here and (hopefully) understandable.
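For the curious, the square root version mentioned in the footnote could be sketched as follows. This is an illustration under the assumption that both series are non-negative counts; `hanging_root_bars` is a hypothetical name.

```python
import math

def hanging_root_bars(observed, expected):
    """Return (top, bottom) per bin on a square root scale.

    Both the curve value and the bar length are square-rooted, so
    bottom = sqrt(expected) - sqrt(observed).
    """
    return [(math.sqrt(e), math.sqrt(e) - math.sqrt(o))
            for o, e in zip(observed, expected)]
```

With observed counts of 4 and 1 against expected counts of 9 and 4, the raw gaps are 5 and 3, but on the root scale both gaps come out to 1. That is how a rootogram keeps departures in low-frequency bins from being visually dwarfed.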
I created a little twist on the hanging rootogram that I'm dubbing a chandelier plot:
This borrows a concept we discussed before to clean up the visualization a little. Since all bins in this histogram are the same width, showing real bars doesn't add much value. On the other hand, I find that converting bars to "lollipops", with lines plotted at the horizontal middle of each bin, results in a somewhat cleaner look with a pleasant "chandelier" effect. This is just a matter of preference, of course, and some people (Stephen Few among them) would vastly prefer the bar view to this lollipop plot. I can see both sides of the argument. Note that I intentionally use open circles as my symbol of choice, so the viewer can see the exact endpoint of each line without losing precision.
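The geometry of a chandelier plot can be sketched the same way as the hanging bars, except that each bin collapses to a single vertical line at its midpoint with a marker at the lower end. The sketch below assumes bins are described by their left edges and a common width; the function name and the dictionary keys are hypothetical.

```python
def chandelier_segments(bin_lefts, bin_width, observed, expected):
    """Return one lollipop per bin: a vertical line at the bin midpoint,
    running from the expected value down to expected - observed, where
    the open-circle marker is drawn."""
    segments = []
    for left, o, e in zip(bin_lefts, observed, expected):
        segments.append({
            "x": left + bin_width / 2,  # horizontal middle of the bin
            "y_top": e,                 # attachment point on the curve
            "y_marker": e - o,          # where the open circle sits
        })
    return segments
```

Any plotting tool that can draw vertical line segments and unfilled markers should be able to render these coordinates.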
All of the above visualizations are available for download via the link below. There is nothing particularly difficult here from a technical point of view, except for the matter of axis scaling. In the first object, the histogram, the distribution curve is overlaid and plotted on the right axis, while the bars are plotted on the left axis. That's because the numbers plotted by the curve are many, many times smaller than the numbers plotted by the bars. However, in order to actually hang the bars from the curve, the curve has to be brought onto the same axis as the bars. This is achieved through a scaling factor derived from three variables; see the Variable Overview for more details.
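The three variables are not spelled out here, so the following is only one common way to put a density curve on a count axis, not necessarily the formula used in the download. It assumes the curve is a normal probability density and scales it by the number of observations times the bin width. The name `curve_on_count_axis` is hypothetical.

```python
from statistics import NormalDist

def curve_on_count_axis(dist, xs, n_obs, bin_width):
    """Rescale a probability density so it is comparable to bin counts.

    Assumption: density * n_obs * bin_width approximates the expected
    number of observations in a bin centered at x.
    """
    scale = n_obs * bin_width
    return [dist.pdf(x) * scale for x in xs]
```

For example, with `NormalDist(0, 1)`, 100 observations, and a bin width of 1, the density at 0 (about 0.399) becomes roughly 39.9 expected observations, which sits on the same scale as the bars.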