Waves vs. Streams
In the world of data visualization, it seems that there are a myriad of effective ways to visualize just about anything. Distribution? You've got histograms, dot plots, box plots, violin plots, frequency polygons, etc. Magnitude? You've got horizontal bars, vertical bars, Marimekko charts, proportional symbols, radar charts, etc. But the exception seems to be flow—the journey that something takes to get from an initial state to one or more final states. As far as I know, there are exactly two ways you can visualize flow that include volume of movement: Sankey diagrams and, more rarely, sunburst diagrams.
You might not know their name, but almost everyone has seen a Sankey diagram at one point or another. They are the traditional way to show flow and are reminiscent of rivers that are typically (though not always) widest at the left and "empty out" into one or more final buckets.
When used correctly, Sankeys are a very elegant and powerful visualization. But there's the rub: they have to be used correctly. The above image is a great example of appropriate Sankey usage. The number of nodes is manageable, the overall effect is clean, and we don't necessarily need to know the precise width of each streamlet, just a ballpark estimate—about half of the volume is flowing this way, about a third that way, and so on.
Unfortunately, the types of Sankeys that actually get created, particularly in a business context, more closely resemble this:
This is by no means an exaggeration on my part; in fact, I would say this is a mild example. There are so many streamlets that they overlap constantly and are virtually impossible to trace. The overall effect is that one is looking at a rat's nest; it's messy, it's ugly, and you immediately want to look away. Just the thought of trying to make sense of this diagram is enough to give me a headache. If you are a visualization designer and are trying to draw in your users, to make them want to interact with your viz, this is most certainly not the way to do that.
I think there's a disconnect between how Sankeys were originally meant to be used versus how they are actually used in business settings today. Namely, I believe Sankeys were originally invented by finger-painting toddlers—and we should all aspire to their wisdom! Paint in broad strokes and don't worry overmuch about precision. What ends up happening in a business setting, however, is: (a) there are often more than just a handful of flow channels emerging from any given node, and (b) the precision that's demanded isn't ballpark at all, but out to the first or even second decimal place. Both of these uses are fundamentally incongruous with Sankeys as visualization tools.
There's one additional drawback to all multi-level Sankey diagrams, even the good ones, and this one is more subtle. By the time you get to the end of a Sankey, when all the volume has flowed into its final destination buckets, you have lost the ability to trace the "ancestry" of the volume back to its origins. Let's look at the first Sankey example above. "Supplier" starts off with a volume of 160 and "Total Labor" ends up with a volume of 92. How much of Total Labor's volume came from Supplier? We have absolutely no idea, at least not by looking at the diagram. The diagram simply doesn't record it, so any answer requires an assumption. I would argue that, absent a more sophisticated business-specific algorithm (more on that below), distributing the flows into final targets proportionally to all origins in the chain is the most meaningful way to answer the following question: so where did this volume come from?
Sunburst diagrams rely on colors to identify nodes, and it's fairly rare to see them in the wild, at least used correctly. They are best used when a flow isn't strictly linear but partially circular. For example, when you visualize a user's browsing behavior on a website, the user can repeat the same action several times during her session (e.g. go to the Products page, go to Services, go back to Products). A sunburst can act like a more powerful version of a workflow diagram in these cases.
Their biggest weakness, in my opinion, is that they fail to effectively convey that all segments that are colored the same are really the same node. In the example above, you would have to manually go around the outside edge of the diagram and add up the blue segment widths to figure out how many people finished their session on the home page of your website. For this reason, they are a very poor choice for mapping flow towards a final destination. They are also unwieldy and difficult to read beyond a fairly small number of distinct colors.
A New Approach
Let me start off by saying that I am not looking to replace Sankey or sunburst diagrams when they are used optimally. For circular or looping flows, sunburst diagrams might well be your best bet, assuming that what you are looking to emphasize is the journey rather than the destination. For simple linear flows (think broad toddler strokes!), Sankeys are usually the right choice. But I'm keen to eliminate the shortcomings of the noble Sankey diagram that I have highlighted above. And to do that, I propose a new way of visualizing flow, particularly when designing visualizations for business:
If you can't decipher this image at first glance, don't worry, I'm going to explain: it's all about the pixels. Every cell in the visualization is of the exact same width and height. All the pixels will make their way from dark blue (left column) to green (top row). They do that by way of the "flow", which is what is visualized by the middle of the viz. The number of pixels of every color is identical.
When you first come into the visualization, the dark blue left column is going to represent the original source of volume (analogous to the left-most part of a Sankey) and the green row on top is going to represent the final volume target (the right-most part of a Sankey). So without ever touching your mouse, you can already answer two questions:
- Where is all the volume from Source X eventually flowing to? You can answer this question by looking horizontally at any row.
- Where did all the volume that flows into Destination Y originally come from? You can answer this question by looking vertically at any column.
These really are the two most important questions when you look at flow and, when you are dealing with simple two-level flows, they are also the end of the inquiry. But when you have multi-level flows, these two questions are only the beginning. When you click on any row header to select an initial source, the visualization drills in to the next flow level, answering the third question: But how did it get there? And it does so on and on, for as many flow levels as you want to drill into. Each click filters for the flow volume associated with only that specific branch, letting you focus on a single slice for the next drill level.
How is this possible? By proportionally distributing source flow through to subsequent channels. Let's take a look at a simple example.
In a traditional Sankey, we don't know how much "bituminous coal" makes its way to "exports." In our new way of visualizing flow, however, we assume that it's a proportional amount, based on the input, output, and middle steps. In this case, the formula is: 97 * (466 / 995) ≈ 45.43
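To make the arithmetic concrete, here is a minimal Python sketch of that proportional split. It assumes that 97 is the volume arriving at "exports", 466 is the volume of bituminous coal, and 995 is the total volume entering the middle step. The function name proportional_share is made up for illustration and is not taken from the QlikView application.

```python
def proportional_share(target_volume, source_volume, total_inflow):
    """Attribute part of a target's volume to one source, pro rata.

    Assumption: the source's share of the target equals the source's
    share of everything flowing into the middle step.
    """
    return target_volume * (source_volume / total_inflow)


# Figures from the coal example above (their roles are assumed, see lead-in).
bituminous_to_exports = proportional_share(97, 466, 995)
print(round(bituminous_to_exports, 2))  # 45.43
```

If you know the real split for your data, this is the one function you would swap out for your own business rule.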
Just a couple quick notes of caution before you dive headfirst into this viz. Our new flow visualization has some considerable advantages over Sankey diagrams. But that's not to imply that it's without weaknesses. For example, one of its strengths—the proportional division of volume we just discussed—may actually be a huge weakness if that sort of divvying does not fit the realities of what you are trying to visualize. In the example above, what if we happened to know that 100% of exported coal was bituminous? Our 45.43 answer would be dead wrong in that case; the correct answer would be 97 for bituminous and 0 for every other type of coal. If you have those types of business insights about your data, then I would encourage you to still use this visualization, but take the division formulas from automatic to manual. All the code is freely available in the application below for you to make what modifications are necessary.
The only other thing I'll note is that, while this chart certainly visualizes flow, it may not appear to do so to your viewers. When you see a Sankey diagram, the feeling of flow is almost tangible—after all, it looks like a bunch of rivers, streams, and islands. The way we convey flow with this chart, by contrast, is much more subtle. It is more akin to waves than streams, which is an analogy I tried to subconsciously reinforce by making the flow pixels blue. What I'm trying to tactfully say is that there may be a slight learning curve as you train people how to read this viz. I hope that won't deter you, but it's something you should be aware of going in.
I hope I didn't lose too many of you in the last section. For those who made it all the way through to the end, we have some fabulous parting gifts. First, here is the QlikView code behind this visualization. I strongly recommend running this in QlikView 12 with the Roboto font installed.
The input for the application is any Sankey input you like. You just have to feed it a file that contains three fields: source, target, and value. Set the location of this file in the Sankey Input tab of the script, reload, and the application will do the rest. Below is an example file to get you started; you can paste the contents of this same file into most Sankey diagram generators, such as this one, to compare and contrast between the traditional way of showing linear flow and the new method.
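If you want to see the idea outside of QlikView, the following is a rough Python sketch, not the application's actual script. It reads source, target, value rows and pushes each original source's volume through to the final destinations pro rata. Everything here is an assumption for illustration: the comma-separated layout with a header row, the tiny SAMPLE data, the function names, and the requirement that the flow has no loops.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample input: two sources, one middle step, two destinations.
SAMPLE = """source,target,value
A,M,60
B,M,40
M,X,70
M,Y,30
"""


def read_links(text):
    """Parse source,target,value rows into (source, target, value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(r["source"], r["target"], float(r["value"])) for r in reader]


def origin_to_destination(links):
    """Split each origin's volume across final destinations pro rata.

    Assumes an acyclic flow; a loop in the data would recurse forever.
    """
    outflow = defaultdict(dict)
    has_inflow = set()
    for source, target, value in links:
        outflow[source][target] = outflow[source].get(target, 0.0) + value
        has_inflow.add(target)

    # Origins are nodes that send volume but never receive any.
    origins = [node for node in outflow if node not in has_inflow]
    result = defaultdict(float)

    def push(origin, node, volume):
        if node not in outflow:  # nothing flows out: a final destination
            result[(origin, node)] += volume
            return
        total = sum(outflow[node].values())
        for nxt, value in outflow[node].items():
            # Each outgoing channel gets its share of the arriving volume.
            push(origin, nxt, volume * value / total)

    for origin in origins:
        for nxt, value in outflow[origin].items():
            push(origin, nxt, value)
    return dict(result)


matrix = origin_to_destination(read_links(SAMPLE))
for (origin, destination), volume in sorted(matrix.items()):
    print(origin, destination, round(volume, 2))
```

Each (origin, destination) pair corresponds to one cell of the grid: read along a row to see where a source ends up, or down a column to see where a destination's volume came from. When what flows into a middle step equals what flows out of it, this split matches the formula in the coal example.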
Finally, for those unfortunate souls who are not using QlikView: join us! And in the meantime, here is a hi-res image of the above "rat's nest" Sankey as visualized the new way. A static image is a poor substitute for an interactive viz, though, and I hope you get a chance to explore the power of QlikView for yourself soon. A word of warning though: it's very addictive!