Violin Plots: What They Are and Why You Should Care

Continuing our series on little-known charts and how to implement them in QlikView!

 
To put it simply, you should care about violin plots because they can effectively convey a fantastically huge amount of information in a small amount of space. The downside to violin plots is that they are not very well understood, which is something that this article can hopefully help with. At its core, a violin plot combines two different types of charts into one: (1) a box plot, and (2) a density plot.

 
Let's explore each of these components. But, before we do, let's talk about a "hidden" component that will affect everything we do with violin plots: the confidence interval.
 
Confidence Interval
Let's say you have a range of values that you want to plot. For our purposes, we're going to use the average daily temperature readings in Mexico City from 1985 - 2015, which represents over 11,000 data points. To get an apples-to-apples comparison, we want to plot these by calendar month. With a sample this large (and 31 years is quite large), we would expect each month's readings to cluster around a typical value in something close to a normal distribution, AKA a bell curve. A confidence interval, as the term is used in this article, is simply a way to shave off the ends of the bell curve so outliers do not skew the final results. (Statisticians would call this a percentile trim; a confidence interval in the strict sense describes the uncertainty of an estimate.) Take the difference between 100% and the confidence interval, divide it by 2, and take that much off of each end of the bell curve. For instance, if you are using the standard 95% confidence interval, you would shave 2.5 percentile points off each end of the bell curve:

 
Note that what you are shaving off are percentiles, so often there will not be the same number of outliers on each side, and sometimes there may be no outliers at all. Practically speaking, if you are dealing with summer temperatures, for instance, and most of your temperatures fall in the 60 - 74 degree Fahrenheit range, this could lead to removal of a 52 degree outlier, a 76 degree outlier, etc., leaving the rest of your data points cleaner so that you can see real patterns more easily.

 
For the rest of the analysis, we are simply going to say that we do not care about the 52 - 59 and 75 - 76 degree outliers, and pretend they never existed.
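
For anyone who wants to try the trimming step outside of QlikView, here is a minimal Python sketch of the idea. The function names, the linear-interpolation percentile rule, and the sample readings are all my own assumptions for illustration; they are not the actual Mexico City data or the expressions used in the application.

```python
def percentile(ordered, fraction):
    """Linear-interpolation percentile of an already sorted list (fraction is 0..1)."""
    if not ordered:
        raise ValueError("need at least one value")
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def trim_to_interval(values, interval=0.95):
    """Keep only the values inside the central `interval` share of the data."""
    ordered = sorted(values)
    tail = (1.0 - interval) / 2.0  # 0.025 per side for a 95% interval
    low = percentile(ordered, tail)
    high = percentile(ordered, 1.0 - tail)
    return [v for v in values if low <= v <= high]


# Made-up sample: 38 "ordinary" readings between 60 and 74, plus two strays.
readings = [52] + [60 + (i % 15) for i in range(38)] + [76]
kept = trim_to_interval(readings)
print(len(readings), "->", len(kept), "readings, from", min(kept), "to", max(kept))
```

With this made-up sample, the 52 and the 76 fall outside the 2.5th and 97.5th percentiles and are dropped, while everything between 60 and 74 survives. Other percentile conventions exist and may cut slightly differently on small samples.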
 
Box Plot
Folks have a tendency to be scared of box plots—I admit that I felt the same way at one point. Because they have up to 5 components and are rarely used in real-life scenarios, there is a misconception that they are somehow difficult or involve complex math. That's really not the case. The best way to understand box plots is to lay out a bunch of numbers in a row and then chop them up the same way a box plot would. A box plot has the following components: (1) a median line, (2a) an upper box limit, (2b) a lower box limit, (3a) an upper whisker, and (3b) a lower whisker. Using our example, let's lay out some temperatures in a row, ordered from smallest to largest. I'm going to use real data from June 2015 as an example. Remember though, that we have already excluded outliers with our confidence interval; in our case, this excluded a single day, June 1st, with a temperature of 62 degrees. Here are the remaining 29 days of June, ordered by temperature:

 
As you can see, box plots are all about finding medians. First, you find the true median: the middle number (or the average of the two middle numbers, if there is an even number of data points). Then you take whatever is to the left of the median, and find the median of that. This becomes your lower box limit, also known as Q1. Then you do the same thing with the numbers to the right of the true median, and that becomes your upper box limit, also known as Q3. The whiskers simply represent the highest and lowest numbers that our confidence interval has left us. That's it. We end up with a picture like this:

 
So what's the point of a box plot? It shows us the shape of a distribution of data points. If that sounds familiar, that might be because it's actually another view of the bell curve we discussed above! Let's see what happens when we lay the box down on its side and overlay it with a normal distribution curve:

 
The box shows us the "meaty" part of the curve (the middle 50% of the data points), and the whiskers define the full range of the curve (the top and bottom 25% of the data points) limited, as always, by our confidence interval. So now that we have the first piece of our violin plot, let's move on to the second.
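
The "median of each half" recipe for the box is easy to sketch in Python. This is only an illustration under a few assumptions: the function name and sample temperatures are made up, the values are assumed to be already trimmed, and for an odd number of points the median itself is left out of both halves (which is how I read "whatever is to the left of the median").

```python
from statistics import median


def box_plot_summary(values):
    """Return the five box plot components using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower_half = ordered[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "lower_whisker": ordered[0],   # lowest value left after trimming
        "q1": median(lower_half),      # lower box limit
        "median": median(ordered),     # median line
        "q3": median(upper_half),      # upper box limit
        "upper_whisker": ordered[-1],  # highest value left after trimming
    }


# Made-up temperatures, not the June 2015 data from the article.
sample = [63, 64, 64, 65, 66, 66, 67, 68, 68]
summary = box_plot_summary(sample)
for name, value in summary.items():
    print(name, value)
```

Keep in mind that charting tools do not all compute quartiles the same way, so Q1 and Q3 from another tool may differ slightly from this sketch on small samples.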
 
Density Plot
The density plot is the purple part of the violin in the picture above, and actually shows something quite simple: how many total data points there are for each unique data point value. In our example, that means the number of unique dates that had a particular average temperature, represented as a line chart. In order to create the symmetrical shape of the "violin," the density metric is produced twice, once normally, and once as a mirror image (most easily achieved by multiplying the normal line by -1). At any one temperature reading, this will be represented as two dots, one on either side of the Y-axis:

 
When there are many unique temperatures and you connect the dots and hide the axes, this results in a double line graph that appears to resemble the body of a violin (or, more likely, some sort of weird squiggly shape that has no easy name):
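
The counting and mirroring described in this section can be sketched in a few lines of Python. The function name and the sample readings are hypothetical. Note that this follows the article's count-per-value approach; many charting libraries instead smooth the shape with a kernel density estimate, which this sketch does not attempt.

```python
from collections import Counter


def mirrored_density(values):
    """For each unique value, return (value, count, -count).

    The positive count is the "normal" line and the negative count is its
    mirror image, which together trace the two sides of the violin.
    """
    counts = Counter(values)
    return [(value, counts[value], -counts[value]) for value in sorted(counts)]


# Made-up readings, not the Mexico City data.
temps = [64, 65, 65, 66, 66, 66, 67]
outline = mirrored_density(temps)

# Crude text rendering: one row per temperature, widest where counts are highest.
for value, right, left in outline:
    print(value, ("#" * -left).rjust(5) + "|" + "#" * right)
```

Each row in the text rendering is wider where more days share that temperature, which is exactly what the violin shape encodes.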

 
QlikView-Specific Caveats
Although it supports all the underlying math, QlikView is not ideally suited for creating this type of chart from a pure visualization perspective, unfortunately. There are several challenges:
  • You cannot control the width of box plots, so you cannot fit the box plots within the density plots to make it truly look violin-like
  • You cannot trellis charts that have box plot expressions
  • You cannot have a line graph be both smooth and have an area that is filled in with a solid color
  • Line graphs with filled-in areas do not work very well, in general, in combination with trellis charts
  • Trellis charts have their own limitations, such as the inability to hide column/row separators and trellised dimension labels
 
As a result, the final chart below was created by actually overlapping two separate charts. Some of the alignment is not pixel-perfect and some of the elements don't look exactly the way I would prefer. I personally don't like to mess around with extensions (primarily because of supportability concerns), so I wanted to create this solution in native QlikView. But the concepts described in this article can almost certainly be accomplished more neatly with the help of extensions in Qlik Sense.
 
Change Over Time
It is very important to note that this chart tells only a part of the story, not the full story. Notably, the part we are missing is an indication of how the data changes over time. The only way to incorporate time as a component into violin plots is by selecting one year at a time and iterating the selection:

 
This may be pretty, but is not a very effective data visualization. Karl Pover has recently put together a view of the same data as a cycle plot. I recommend that everyone check out his post here. Our two views together paint a much more complete and powerful picture of average temperatures in Mexico City over the past 31 years than either one can on its own.
 
Putting It All Together

 
 
The main advantage of a violin plot is that it shows you concentrations of data. Box plots are powerful visualizations in their own right, but simply knowing the median and Q1/Q3 values leaves a lot unsaid. There are many ways to arrive at the same median. For instance, if you have 7 data points {67,68,69,70,71,72,73} then the median is 70. But if your data points are {60,60,60,70,80,80,80} the median is also 70, but the picture is very different. When the violin density plot tapers, it means that the results are less dense: in plain English, that there are fewer of them. When it gets wide, the density is higher. As a rule of thumb, the more curvaceous the density plot appears, especially in/around the interquartile range (i.e. the "box" from our box plot AKA the IQR), the more variance there is in the data. A stocky density plot, by contrast, indicates that the results are more evenly distributed.
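
To make the "same median, different picture" point concrete, here is a small Python check using the two 7-point sets from the paragraph above. The `describe` helper is just a name I made up for this sketch.

```python
from collections import Counter
from statistics import median


def describe(values):
    """Return the median and the count of each unique value, lowest value first."""
    counts = Counter(values)
    return median(values), [(value, counts[value]) for value in sorted(counts)]


tight = [67, 68, 69, 70, 71, 72, 73]
lumpy = [60, 60, 60, 70, 80, 80, 80]

tight_median, tight_counts = describe(tight)
lumpy_median, lumpy_counts = describe(lumpy)

print("tight:", tight_median, tight_counts)
print("lumpy:", lumpy_median, lumpy_counts)
```

Both sets report a median of 70, but the per-value counts differ: the first has one reading at each temperature, while the second piles three readings each onto 60 and 80. Those counts are what the density part of the violin draws.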
 
So what does the above picture tell us about weather in Mexico? Quite a lot. For one thing, July - September look pretty evenly distributed in the part of the density plot that overlaps the IQR, but the IQRs themselves are short and stumpy. That means you were likely to experience a smaller range of average temperatures during those months, but were about as likely to experience any particular one of them.
 

In July, for instance, you had about an equal chance of getting 64, 65, or 66 degrees, with a slightly higher chance of 66. The total likelihood of getting one of these three average temperatures on a July day was 50% (one shot in two). However, notice that the density plot doesn't really taper above the IQR; the upper whisker tells us that the top 25% of days had temperatures of 67 or 68 degrees and the density plot tells us that either one of those was about as likely as the other. The lower whisker tells us that the bottom 25% of days had temperatures of 60 - 63 degrees; however, the tapered density plot tells us that 62 and 63 occurred much more often than 60 and 61. Anything higher than 68 degrees or lower than 60 degrees was an outlier excluded by our confidence interval; getting weather like that was a fluke.

 

Now let's look at November and we'll see a very different story. Obviously it was colder, because it was winter: the median temperature was 60 degrees. Notice also that the IQR is taller now, which tells us that we experienced a greater range of temperatures (between 58 and 62). However, not all of these temperatures were equally likely. We had 58, 60, and 62 degree weather much more often than 59 or 61 degrees. Without a density plot, we would never have known that strange factoid. So while we still had a 50% shot of landing in the IQR range, it was not evenly distributed. We also see that 25% of the days were colder than 58 degrees, most often 56 degrees, to be specific. Again, without a density plot all we would have known is that this bottom 25% of days was somewhere between 50 and 58 degrees, a pretty big difference compared to the picture that we see above! The 25% of days that were warmer than 62 degrees were likely going to be 64 degrees, but could also very well have been 63 or 65. It was very unusual to experience weather colder than 55 degrees or warmer than 65 degrees.

 
But all this past tense talk raises the question: does this dataset, which deals purely with past weather, actually indicate that any particular weather is more likely in the future? The answer is: possibly. But to get a better idea of whether we're seeing a repeating pattern or simply an interesting historic distribution, we need to be mindful of the types of trends that Karl has charted. It's always important to keep one eye on a trend chart when looking at distributions; if you don't, you could draw wrong conclusions from a historic metric averaging out a certain way. If we take a look at Karl's chart that we discussed above, we see, for example, that while November has remained about steady over time, July seems to be getting hotter each year. This might alert us that the past might not be a good indication of the future during Mexico's summers. You might, however, be justified in assuming that the November results will continue to be more or less the same in future years and, if you're planning a visit (and feeling lucky), to not pack for weather colder than 55 degrees.
 
That's about it. Hopefully this helped demystify violin plots and their various components. What other kinds of data sets do you all think might work well with this type of visualization? Let me know in the comments below!
 
A Complete Tangent
You may notice that the application defaults to Fahrenheit instead of Celsius, even though it plots temperatures in Mexico, a country that uses Celsius as standard. I know the temptation is to write this off as typical American ethnocentrism, but there's a little more to it than that. To put it simply, I prefer Fahrenheit to Celsius when dealing with Earth-normal temperature measurements because you don't need to resort to decimals to derive valuable insights. Celsius is great for describing the temperature water experiences in its liquid state (0 being the lowest and 100 the highest, of course). But Fahrenheit is superior in that it analogously describes the temperature extremes that we as humans are likely to experience living on this planet (the range of 0 to 100 covering the vast majority of populated climates year-round). It makes more sense to me to use a scale made for Earth surface temperatures when measuring, say...Earth surface temperatures. In my opinion, it's easier on both the eyes and the brain to talk about 70 and 71 degrees Fahrenheit instead of 21.1 and 21.7 degrees Celsius.
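
For the curious, the decimals in that last comparison come straight out of the standard conversion formula. A quick Python sketch (the function name is my own):

```python
def fahrenheit_to_celsius(degrees_f):
    """Convert a Fahrenheit temperature to Celsius: C = (F - 32) * 5 / 9."""
    return (degrees_f - 32) * 5 / 9


for degrees_f in (70, 71):
    degrees_c = fahrenheit_to_celsius(degrees_f)
    print(degrees_f, "F is about", round(degrees_c, 1), "C")
```

A one-degree step in Fahrenheit is only 5/9 of a degree in Celsius, which is why whole Fahrenheit readings turn into decimals on the Celsius side.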

2 Responses to Violin Plots: What They Are and Why You Should Care

  1. Karl Pover says:

    Thanks Vlad for introducing me to violin plots. I especially like the detail that is revealed with the density plot that I’ve never noticed before. For example, I can see that the plots from June to September are fatter. This time period corresponds with the rainy season when it rains almost every day in the afternoon. Temperatures are more stable and vary slightly from the median. All the other months are marked by weather fronts that pass over and more dramatically change the temperature for several days before another front comes in. Very cool.
