Hello and welcome to SuperDataSciences Custom Chart Tutorial series. Today we're going to be teaching you how to make the Sankey Diagram. Sankey Diagram is a fantastic chart, it shows the flow and relationships between two different elements, this could also be expanded out to show the flow through an entire process, but for the purposes of our tutorial we're going to keep it straightforward with the flow between two dimensions. Our data set that we're using today is available in SuperDataScience.com. It is African mobile data. African mobile is a fictional mobile provider that is expanded across the entire African continent. They provided us with their sales data, showing the date of their sales, what country the sale happened in, the city the region, based off the UN regions, then we have what segments the sales was in, either commercial, public or residential. And then it also provides sales and profit numbers for those values. For a Sankey Chart today, going back you can see you need flow between two dimensions. What we're going to model is the flow from each segment over to each region and we're going to be looking at sales as we go through. It's also important at this point to recognize and acknowledge the fact that the only reason we know how to do Sankey Charts in Tableau is because of amazing work by people in the Tableau community, people like Jeffrey Shafer who first piloted the techniques that we'll see today and also Olivia Katherine who expanded upon his work. The method we're specifically going to use was actually perfected by Chris Love building off of work by Noah Salvaterra and Joe Mako. OK, with that quick introduction, let's start going and building our Sankey Diagram. But before we go over to Tableau and start connecting the data, it's important to talk about how we need to connect to our data. In this case a straight connection to our excel file is not going to give us what we need. When we look at this view we can see there's the data you need vertically in this column in this column, so data for our segment, data for our region and then we'll also need a horizontal data built across the view that Tableau can draw marks across. It's just kind of a way to conceptualize how the chart will be built. Because of that, we can't use a standard connection because we're actually going to need to create two copies of our data. One for this column, one for this column. So we'll come over here to Tableau. We'll connect the data, connect to our Excel file and we'll connect to our African mobile data Excel file. However instead of clicking on open we'll use this drop-down and choose open with legacy connection. That uses the old Microsoft jet engine that Tableau used to connect Excel files with. We bring in our sales data and you'll see that we have the full 1056 rows that we're looking for. Our dates, country, cities, regions all the data that we're looking for. Specifically we're gonna need segment, region and sales. Now, the reason we use the legacy connection is so it can come up to this data drop-down right here and choose convert to custom sequel. So those of you familiar with sequel, this will be a very basic statement those where you aren't, sequel is a way to access databases. You have your SELECT statement, then you tell them what columns to bring in. And then you have your FROM statement that says, what database you want it from. So in this case from the Sale data tab we want city, country, date, profit, region, sales and segment. So what we're going to do, is we're going to take this; we're going to Union all, with the exact same data set. If we click OK you can see that we now have 2112 rows, effectively having doubled our data set. Looking back at our example we're going to want to use the first copy of the data for this segment column, the second copy of the data for this region column. Now, we gonna go actually back into our custom sequel connection and put in a flag to let us know which one we're dealing with. So we'll put it in "Segment" right here, we'll call this "VizSide" because it's letting us know which side these viz were on. Then over here, we're going to call this "Region" as VizSide. For those of you who do not know sequel when you're unioning things, the columns have to be named the same thing and you need to make sure to have a comma in between each of your columns. Now we done that you'll see our new column that we've created so we have VizSid for "Segment" for the first 1056 rows and then "Region" for the rest of them. Now that we have our columns taken care of, let's go back to our example. We have our data for segment, we have our data for region, we need to start worrying about the horizontal data that exists in our view. The thing is these horizontal values, they don't have any access to they're tied to. So what we actually need to do, is somehow create these marks so Tableau will be able to draw the lines across them. There are multiple ways to do this. One way is to just create multiple multiple multiple copies of your data, kind of like as shown here, you can see all these different lines. What's happening is, each line is a new row of that same data, that's being written over. But what ends up happening with this is that you end up duplicating a ton of data and your dataset gets very large and very unwieldy. Another option that you have is to take advantage of technique in Tableau called data densification. So we're gonna come over here. I'm going to walk you through the first steps of that. Data densification is also known as padding. So create calculated field that says ToPad. Using a column we made, IF VizSide is equal to segment, THEN 1, ELSE 49, END. What we've done in this situation is created a value of "1" for Segment column, and value of 49 for a Region's column. And those effectively act as bookends, where we'll then fill in data in between them. 49 seems like a bit of an arbitrary number, it's the number chosen so that this curve has enough marks to be drawn smoothly, if you use fewer, this tends to get a bit jagged, if you use more you end up causing performance issues. So 49 is a nice kind of compromiser goes in the middle. So now we have ToPad, the thing is if you bring ToPad out to columns. Just bring out its value right there and then, you can see that right now we actually only have a value in 1 and a value in 49. But if you come down, you click on it, you say create bins. Will call this Padded, call that one for the size of the bin. Now instead of having just two values 1 and 49, we have values from 1 all the way through to 49. This happens because bins in Tableau are what are called range aware. When you create bins, it looks at the data that you created them off of, and says - ok we have a value 1 and we have a value 49, now in order for there to be a proper size bin for everything. I'm going to pad the data by adding in 2, 3, 4, 5, all the way through 49 to give you the bins that you need. So we've basically done is used Tableaus in-memory engine, to create these extra marks, even though we don't actually have them existing in our data set. This is very important and something that will keep coming back to you as we go through this tutorial. Next, we're going to create a calculated field called T. You'll see why it's named T in a little bit. And with this, we're just going to throw in the index calculation minus 25 divide that by 4. What this does is a little bit of an offset, that evenly spaces all of our marks across the view. Now see that in action, let's pull T up and then add Padded to our level of detail. And we'll make T calculate, compute not using table across, because you see, it only has two values, but if we compute it using Padded, change this to circle so you can see a little bit easier, and you'll notice all of a sudden, we have a whole bunch of data that wasn't there before. We have a value of 1 we have our value of 49, but we also have a mark for every single individual bin. This is the essence of data densification and data padding. Now that we have data densification taken care of that gives us the horizontal mark so it needed in our view. So, now we have our vertical marks, we have vertical data sets, our horizontal marks. Now we need one more calculated field that helps inform where these lines start and where they end. So for instance this goes from first position to second position from the top over here. And we need to create calculated fields that tell Tableau how to do that. Come back, we'll create something called rank one. This is going to be a running sum of the sum of sales, divided by the total of the sum of sales. For those of you familiar with table calculations this is just a percent of total across the entire data set. Now we're going to come in and we're going to duplicate that, and we will edit, call this Rank 2 - I'm actually going to leave it entirely the same. So now we have a ranking function available for segment and a rating function available for region. The next calculated field we need to make is our sigmoid function. The sigmoid function right here off Wikipedia mathematical function with an s-shaped curve. This right here is a special case and logistic function which is 1 over 1 plus e to the negative T. So we're going to create a sigmoid. It is going to be 1 over 1 plus; now we need to use e; Tableau doesn't have e as a function, but it does have an exponent function. An exponential function that returns e raised to the power of the given number. So if we do exponent 1 that's the same thing as typing e. Then we're going to raise it to the power of negative T. So if you probably already saw ahead and said, oh that's why we named it T. Doing this the sigmoid function can understand what position were at as we cross our view. So, we'll be able to draw a curve using that data densification that happened. Now that we have the function that will give us the curve that we need, which going back to our example gives us this kind of curve on the line, we're going to need to write one more calculation that actually builds out the full curve. So, this is done by taking Rank 1 plus Rank 2 minus Rank 1 multiplied by our sigmoid function. So, what you're doing is you have Rank 1 which is your starting point on this left side for segment. Then you have Rank 2 minus Rank 1 that effectively flips it on its head. If we kept just Rank 2 then we would just get a whole bunch of straight lines and this would be 1 and this would also be 1, but if we do Rank 2 minus Rank 1 then we end up on the other side. So it goes from 1 to N and then one to N back the other direction. And then multiplying it by the sigmoid makes it so that we have this nice curve between our two values. So, your Rank 1 Rank 2 minus Rank 1 we multiplied by the sigmoid. Ok those a lot of different calculations and now we get to the exciting part, where we're actually going to build out our view for a little bit. So, if you drag curve up you'll notice that nothing's actually happened yet. We have two values: one on the extreme edge to the left and one of the extreme edge to the right. This is because curve doesn't yet understand how we want to calculate. But in order to have it calculate correctly we're going to need to use something called nested table calculations. Nested table calculations are the idea that for different calculations that make up your table calculation you can have them calculated differently. However at this point we only have Padded in our view because that's the only field that's on our level of detail. At this point we'll need to add in the two sides of our table or the two sides of our visualization that we're building. So, we have Segment and then we'll add in Region up there. And now we have access to those in our table calculation. So, if a Rank 1 we're going to be telling Tableau: "Hey this is how its laid out on our left side dealing with Segment". So we want it to include Segment, we want to include Region and then we want to include Padded because it tells you go from Segment to go to Region a long Padded. Now we can also drag these up make sure they're in the right order so it processes correctly. Coming up to Rank 2 we're going to use all three again, but in this case, we want it to go from Padded, sorry from Region to Segment to Padded. And then for T we're going to make sure that it's using Padded so that our sigmoid function is able to comprehend every single mark. At this point we're going to take a break from the complex calculations and do a little bit cosmetic work. Going to come in here we will remove the gridlines and then I'm going to move the zero lines, so we have a nice white clean background. And then we have this little extra blip happening over here, because of the way the data densification is being handled by Tableau. So we'll come down here we'll edit our axis and because the Sankey Chart is intentionally designed to fill the entire view we can cut off that extra data and just make it from negative 5 to 5 fix that axis. In the same spirit we will come to our curve and make it go from 0 to 1. Now another thing, if you hover the values at the top right, now we have the residential segment than the public segment then the commercial segment this is actually in reverse alphabetical order which has to do with how the ranking functions work. So, if we come here edit the axis and then reverse that will then get our standard commercial public and then residential like we were going for. Then we can come in we can hide our headers and we have very good makings of our Sankey Chart. The next thing we're going to do is change it from circle to line. And you see that that broke our view because now it's trying to draw the marks independently on each line without understanding actually what path we want to go in. Remember we have values at 1 we have values at 49 and these other ones only exists because of the data densification. So, if we drag Padded, bring that out to path, it tells Tableau to draw the line across this padding that's taken place. So, now we need to size our different lines and we want to size them by sales. But we're going to run into the same issue we had with the line. So, if we try and draw the size you'll notice that if you look in a tooltip sales is blank for every single one of these. That's because our data set actually has no value for 17 on Padded. That's something that's just been created through Tableaus densification engine. So, what we need to do rather than using sum sales itself, we're going to create another calculation. We call this a Sales Sizing and we'll use a table calculation in order to trick Tableau into giving it a sales value across every value even those values don't exist in our data set but they do exist in our view which is why the table calculation will give us what we need. So we do the window average of sum of sales. Make sure you grab the right one here. Take it, drag it up to size we have the same issue that we've had before for computing using table across rather than Padded and then when we drag it to Padded you'll see that we have the varying sizes of our lines just like we wanted to show the relative differences. So now we have our great Sankey Chart emerging we can add segment to color, we can add region to color and we have a beautiful flow going from segment on this side to region on this side. Now that we have that flow let's go back and look at our example. Turns that they actually used a column chart as well to kind of show the percent of totals and really show where these flows start and end. We're going to add in another column chart on this side so we can have the columns showing both segments and region and their percent of totals. So, come in and we will add segments to color and then we will add our sales to our columns. Actually want that on rows and then we'll make this into a table calculation really quick make a percent of total, compute using table down, we'll bring that to our label also bring segment to our label, switch those two, make a little bit wider and we'll come in and fix the formatting here. We don't need three zeros on our percentages one, will probably be just fine. Then we have this first segment. We wanna come into our color scheme as well and rather than being blue orange red will make it blue orange green just so it matches our Sankey Chart. Then using that will come down and we'll change it from segment to region. And then calculate our percent of total for region. Now for region we're actually going to come in and we're going to use a different color palette with grays. And you'll see in a little bit when we get to our dashboard why we did that. So we're coming through just for ease we're going to rename these a little bit so we have our Sankey Chart of sales from Segment to Region. And that's just going to our Segment and this will be our Region. Now we're going to create a dashboard that has all three of them. Personally in terms of sizing I like these automatic that automatically fits your view as well as adjust if someone opens it on a different sized screen. So first we're going to bring in our Sankey Chart. Then we'll add in Segment to the side and we'll add in Region to the other side. Again get rid of our color legends because they're already labeled in our view. Then we'll do some more cosmetic stuff where we'll hide the header, hide the other header and we actually don't need these titles either. As we come through we can format we can get rid of our gridlines we can get rid of our 0 lines clean that up a little bit. Now we've manually go through and change these values a little bit. Change the size of your bar charts and we have the flow of sales from commercial to Eastern, Middle, North and Southern or Western. We have it from Eastern back to Commerce Republican residential and you have your Sankey Chart that's flowing back and forth. A couple more cosmetic things we're going to do real fast. Our tool tips are really distracting at this point. So I'm going to go, I'm going to disable them on all of our charts. We have all the information we need in the view so any additional information offered by the tooltips really doesn't help at this point. So, those of you who really want to make things match perfectly this technique is not perfect in that you need to go and manually size both of your columns at the edges. Now just eyeballing it, they're reasonably close, increase the size of that one just a little bit I guess that's approximately close enough you can spend a long time doing this. Another option is to change these two floating elements of your dashboard and then you can have them overlap so that it looks like the lines are coming straight out of those. Personally I usually just make it with a little bit of white space in there. To take advantage of Tableaus nice blocking that it does on dashboards. But that was completely up to you. The last thing we want to add is actually some dashboard actions. This is going to be you use that as a filter but it's a highlight to show off the different paths that are being made. So we're go to call this the Segment highlight. I'm going to do it on hover it's going to go from Segment and we're going to have a highlight the Sankey Chart, highlight the Segment, but we're not going to have a highlight the Region because it can go from Segment to multiple Regions. So if we highlight that it's not going to really deal what we need. Coming in next going to add a second one we'll call this the Region highlight going for Region on hover on Region and Sankey but not on Segment. So we've just done is created another way to interact with a visualization and then when you have a rover you can see exactly where those flows are going creating a really fantastic visual effect that is very exciting and really fills out that Sankey Chart that you have. So that is the end our Sankey Chart tutorial you come down we can rename this dashboard as our Sankey Chart. Hope you enjoyed it hope you learn something new these are a lot of very advanced techniques so if you need to go back and re-watch portions of the video that's totally fine. As always you could reach out to us on SuperDataScience.com with any questions or with any potential suggestions on different charts that we can do after this. Thank you so much!