Data Sketch|es: A Visualization A Month - Shirley Wu and Nadieh Bremer

Video Statistics and Information

Captions
>> Thank you very much, Irene.
>> Yes!
>> Yes, cool. Good morning, everyone. My name is Shirley and this is...
>> Nadieh.
>> And we are super excited to be back at OpenVis Conf this year to talk about our project, Data Sketches. We met online through data visualization in 2015 and didn't meet each other in person until OpenVis Conf in Boston last year. Those are our happy faces. And at OpenVis Conf we had the pleasure of giving talks and hanging out for three whole days, where we hit it off super well, so that when, two months later, Nadieh put up her SVG tutorials from her OpenVis talk, I dived in with vigor. And I started chatting with Nadieh about the tutorials, and that conversation led into us lamenting the fact that we hadn't finished as many full-blown visualization projects as we would like. And then I had an idea. And I was like, hey, Nadieh (fully expecting to be rejected), do you want to collaborate on something? And she's like, yes! Yay! And that's how Data Sketches was born.
>> In the following week, we both liked the idea that we would each create a visualization around the same topic and do that for a year. See how two people create two visualizations starting from the same seed, the topic, but diverging onto different paths based on our own interests and history. We wanted to share the histories and write about the process. There are three pillars that are most important: data, sketching, and coding. And initially we thought we could pull Data Sketches off in five to six hours a week. But real life doesn't agree with plans. Especially coding plans. Starting in July of 2016, we have clocked many, many hours into creating a visualization each month. And during this talk we'd like to talk about the lessons learned, the challenges, and the insights we gathered along the way.
>> So let's start with the data. We often get asked this question: do you get the data first and then come up with your ideas, or get the idea first and go and find the data from that?
And for us, the answer is always, always idea first. So, for example, for my November Data Sketches, I wanted to look at every line in the musical Hamilton. I could filter by any set of characters as well as their conversations and any themes, and then be able to dig into the set of lines from the songs that were left over. And the idea came from a question of how do the relationships between the characters change throughout the whole musical? And then what are the recurring phrases and who are they associated with? Now, as you can imagine, this data set is not available anywhere online, save for the lyrics themselves. So I had to go through all of the lyrics and note all of the recurring phrases that appear across more than one song. Group them into broader themes. Go through the lyrics manually again so that I could enter them into the computer, associating them with the right song and line numbers. And also do the same thing for the characters and conversations. Write the script to aggregate all of that information together to get the final data set. And for a more extreme example, for October Data Sketches I wanted to put emojis on the former... Mr. and Mrs. Obama's faces, and built this tool for exploration where you can go through any of the videos I found of late night talk show interviews and then go through the whole entire YouTube video to look at the emojis that I put on their faces. And this idea came from a conversation with Eric Cunningham, right there, where he was like, wouldn't it be cool if you could just run facial detection on the videos and correlate the emotions with what they are saying? Hey, Eric, you realize I only have one month for this project. And then I was like, challenge accepted. And so, I started with, first, manually gathering all of the late night talk show appearances off of IMDb. I then went and found all of the videos correlating with those talk show interviews from the hosts' channels.
Used a Node package to download all the videos and the captions and get the timestamps from the captions so I could take a screenshot every time somebody talked. Uploaded that to the Google Vision API, because they give me information about the faces and boundaries and emotions and how happy or angry they are and if they're wearing a hat. And I took that data and aggregated it with the caption data to get my final data set. So what these two months taught me was, if I'm just curious, if I have a curiosity, there is some way I'm going to be able to get my hands on the data set. Whether it's to manually go through and enter them, or write a script to automate them. I think there's a Node package for literally anything I can imagine under the sun. As long as I do all of this responsibly and legally. [ Laughter ]
>> Well, thankfully, not every month is as data intensive as that. So for August the obvious theme was the Olympics. Especially since we're both big fans. I decided to visualize all 5,000 medal winners since the first Olympics. Each group is a sport, like water sports or ball sports. And each slice within a circle represents one. The reddish backgrounds are female events, and the bluish are male events. And each medal is colored by the continent of the country that won it. The Americas are red, Europe is blue, and so on. And you can see here there were no female events in the first editions of the Olympics. But they have been catching up since then. So I actually found the data for this piece from two articles published for the 2012 Games in London. I noticed obvious medals were missing, like hockey from 2012. So my confidence in the data set dropped, even coming from a respectable source like the Guardian. I had to get a sense of the accuracy of the data set. But I didn't want to go through all 5,000 medals manually; maybe Shirley would have. So I found a proxy instead. On Wikipedia, I could find the number of events that happened. And I compared that to the number of gold medals.
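The proxy check just described (events per edition from one source against gold medals counted in the data set) can be sketched in a few lines. This is only an illustration: the talk says the preparation was done in R, and the field names, the function name `find_suspect_years`, and the sample rows below are all made up here. It also assumes one row per medal per event, which team events or ties can break.

```python
from collections import Counter


def find_suspect_years(medals, events_per_year):
    """Return {year: (gold_medals_found, events_expected)} for years that disagree.

    medals: list of dicts with hypothetical keys "year" and "medal".
    events_per_year: dict mapping year to the event count from a second source.
    """
    gold_counts = Counter(m["year"] for m in medals if m["medal"] == "gold")
    suspects = {}
    for year, expected in sorted(events_per_year.items()):
        found = gold_counts.get(year, 0)
        if found != expected:
            suspects[year] = (found, expected)
    return suspects


# Tiny invented sample, only to show the shape of the inputs.
medals = [
    {"year": 2008, "event": "event A", "medal": "gold"},
    {"year": 2008, "event": "event A", "medal": "silver"},
    {"year": 2008, "event": "event B", "medal": "gold"},
    {"year": 2012, "event": "event A", "medal": "gold"},
]
events_per_year = {2008: 2, 2012: 2}

print(find_suspect_years(medals, events_per_year))
# {2012: (1, 2)}
```

A mismatch is a prompt to look closer at that year, not proof of an error: it can mean missing medals, but also extra rows that do not belong.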
For some of the years the horses were in the data set winning golds. Which makes for an interesting read: Princess, Sissy and Lady winning gold at the Olympics. I figured out each of the adjustments needed to get it to the point where I trusted it again. So my lesson here was, even if I have data from a respectable source, I need to get a sense of its accuracy and completeness. Missing data can be harder to find than wrong data. You don't have to check every value, but think about sums and counts and averages, and compare those to known totals. Or even better, a different data source. So many people dive straight from data to final visual. But take some time and actually sit back and sketch out your ideas on paper. We filled many pages of our notebooks when starting, because it helps us think and lay out ideas beforehand. But my sketches are often very simple. Only focusing on the main abstract shape that I want to fit my data into. Colors and layout and design, these are things I only vaguely think about, but don't act on until I have the data on my screen. There's just no use for me to think about these things until I have figured out that the data works after I have morphed it into a shape. For the Olympics, I had the idea of feathers. Placing emphasis on the more recent editions. But I had no idea if that would look all right when I placed all 5,000 together. I had to see how it would look. It took a few steps before I saw that luckily it worked with the data. But sometimes there's no use in even starting to sketch on paper. Although I will say that's very rare for me. But networks are an exception. And for October I decided to dive into royalty. I have been intrigued by how intermarried the royals are. Are they all cousins twice removed? I found a genealogy of the royal houses. It was from 1992, so I had to add one or two more into the line of succession. Which was a fun night on Wikipedia. Not. So here are all 3,000 people in the family tree. The biggest circles are the current rulers.
And everybody is connected to their parents or partners or children. And you can hover over anybody to see the six degrees of separation and how they reach into the web. But you can also click on any person. Let's see if I can get it to work here. And any other person, to see the shortest path between these two people. Because the entire web is connected. They're all family in one way or another. But when I started out with this data set, I had no idea what it contained. So I just sort of plotted everybody using the most basic network settings. And then this happened. An explosion of points and lines going out of my screen. So I pulled in gravity a bit (could have used D3 express) and then I ended up with a useless hairball. Colored the points by year of birth. Still not helping. You can have gravity depend on variables. So I pulled the graph apart by year of birth as well. Which was better, but still a rather uninsightful bundle. And at this point I had invested several hours playing with the network settings, adjusting the data. And I was really at a point where I thought about giving up. Maybe a different angle, how much they're spending each year. But I gave it one last shot and decided to focus on the current royal leaders. Placed them in a line, and let the vertical gravity depend on the ones you were closely related to. Insights: the queen of Denmark is central, but the Prince of Monaco's line separated from the rest 200 years ago. And it was around this time that I started thinking about the general design aspects. And networks often remind me of constellations. And with my astronomy background, I have a bias for all things space. So I turned it into a starry night. But I could never have designed this visualization beforehand or sketched it. I had to go hand in hand with the actual data and apply the design choices to all the data simultaneously so that I could see if the results were both interesting and engaging.
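The click-two-people interaction above amounts to a shortest-path query on an unweighted network. The talk does not say which algorithm the piece uses; a breadth-first search is one common choice, sketched here under that assumption. The function name `shortest_path` and the placeholder people "A" to "E" are invented for illustration.

```python
from collections import deque


def shortest_path(links, start, goal):
    """Breadth-first search over undirected links; returns a list of names or None."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if start not in neighbors or goal not in neighbors:
        return None

    previous = {start: None}  # who we reached each person from
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            # Walk back through `previous` to rebuild the path.
            path = []
            while person is not None:
                path.append(person)
                person = previous[person]
            return path[::-1]
        for relative in neighbors[person]:
            if relative not in previous:
                previous[relative] = person
                queue.append(relative)
    return None


# Placeholder links (parent, partner, or child relations), not real genealogy.
links = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]

print(shortest_path(links, "A", "D"))
# ['A', 'E', 'D']
```

Because every link counts as one step, the first time the search reaches the second person it has found a path with the fewest intermediate relatives.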
>> So Nadieh gave a really great example of when you land on the right visualization early on. But that's not necessarily always the case. For our March Data Sketches we had the opportunity to work with Google News Lab and their data going back to 2004. Which, by the way, launched this morning. Please check it out. So with access to all of that data, I wanted to look at what people were searching for, and specifically what people search for in a country around the world. So each of these blocks is a topic that the U.S. searched for in spring. And I can toggle between all the different weather seasons and see the topics, as well as dig into a specific country and look at top places there. And I can also expand on the topics so I can see the search interest and seasonality for that particular topic. And the question I really started out with was, what are the top searched countries? Which turned out to be Brazil. And then I was like, who is searching for Brazil? Can I actually see what kind of topics are being searched for around the world in each country, including Brazil? And can I see the distribution of those searches across the years? And then let me actually get a little bit greedy, because I want to see the search interest also. Okay, maybe not. Okay. So let's step back and let's try circles for each of the topics and size them according to the search interest, and maybe I can show the countries searching for each of those topics by overlapping circles. And this is kind of pretty and bubbly. So I kind of like it. But does geography play a part in who searches for a country's topics? So let's maybe try sorting all of the topics by distance for a year. And that doesn't look really great. So maybe I can just concentrate on one topic over all of the years. Maybe that lends itself to a heat map. Maybe not. This is not going anywhere. I need to step back. But I did notice in an earlier exploration that seasonality is quite common in a lot of the topics.
So maybe I can keep sorting by distance. That geography is quite interesting. Maybe this time around I can start to filter and group by the seasons. And that sounds quite promising, right? Nope. So it turns out that all the topics are searched for across all of the seasons, making all the bar graphs look exactly the same. That's pretty sad. But wait! Then I realized that because there is seasonality, the search interest is actually different across all of the seasons, so if I could just size the heights of the blocks by the search interest, I actually start to get very interesting insights like this one, that the U.S. searches for travel more often in the spring than in the fall. And finally, finally, tada. My final visualization form, which I'm quite happy about, where each of the topics is grouped by the country that it belongs to. And there are interesting insights like, for example, if we just look at the topics for summer, Mexico is actually not searched for that often, but Canada is. And the hotter countries around the world like Thailand and the Philippines aren't searched for that often. But if we go instead to winter, the opposite is true. Where Mexico peaks and Canada drops. And Belize and Thailand and the Philippines, the hotter countries, actually go up in search interest. So the lesson here was, sometimes we don't get the right visual from our first sketch or second or even third try. But be patient and go back and forth between the sketch and the code. That will help figure out what works, and more importantly, what doesn't work, so that we can go on to our next step.
>> So as expected, most of our hours are actually spent on getting the data on the screen. And here are some of our maybe less obvious coding lessons. So in the very first month, the topic was movies. And therefore, it was immediately clear to me I wanted to do something with the Lord of the Rings.
And I found a super interesting data set that had the number of words spoken by each character in each scene in all three extended editions of Lord of the Rings. Amazing. So I decided to focus on the members of the fellowship and see how much they spoke at each location in the circle. Not surprising, Gandalf speaks the most. But Boromir, who is only alive in the first movie, managed to speak more than Legolas does in three. But anyway, when I looked at the sketch for this project, I found it was very similar to a chord diagram. I could start there and slowly transition the chord diagram into the sketch. And I wanted the chords to flow inward, and it took less time than anticipated, which is very rare for me in coding. But it worked. Getting rid of the excess space. And now it's ready to handle the Lord of the Rings data. And more appropriate colors. So we have nine members of the fellowship. Making sure the centers end up at the right vertical location. This is looking squished. So I used the same information and pulled the two halves apart. And now the chords are looking rather unnatural. I decided to dive into learning SVG paths. That took the longest. Sort of figuring out how to make the paths look more natural. That's how the new D3 diagram came into existence, mutated from the D3 chord diagram. And many people have done wonderful work that you can use. Even if you think you are creating something new, you don't always have to start from scratch. Pick the thing that lies closest to your design or idea and start with that. It's out there already, maybe.
>> But sometimes we dream up visuals that are unique enough that there's no basis to move off of. For that same movie month, I wanted to look at top summer blockbusters in the last two decades and reimagine them as flowers. So each of the colors is associated with a genre. And the size and number of the flowers are their IMDb ratings. There are some really, really beautiful flowers in here, I think, like "The Dark Knight Rises" and "Slumdog Millionaire."
But my absolute favorite is the 1997 "Batman and Robin," this tiny thing which I think is super cute. And I have gotten questions: how did you make this? How long did it take? It's really quite simple. It just takes a good grasp of SVG paths and the cubic Bezier curve command. We start out with the starting point, in my case, zero, zero. And draw a line between the starting point and the end point, the purple one. And then we take the two anchor points, the blue and the green, and nudge them out until we get the curve that we want. And then I drew some of the lines, made the curve on the other side, and rotated the petals out. And added the colors with some motion blur, and that's it. That's all that it took. So the lesson here really was, when we're creating things, really understand the tools that we're using. That's how we can go beyond the prescribed examples. In particular, our favorite tool is SVG paths, because with that under our belt, we can make any shape that we imagine, for truly unique results.
>> Now that you have seen two examples of adjusting paths, what about their positions? Well, going back to the Olympic feathers, all of the circles and slices depended on each other. But they were very structured. They all followed the same concept. And at first I tried to calculate all of the rotations of the circles and slices in JavaScript. But after having written like 30 lines of code and still not achieving something I knew I could do in R in two lines, I just pulled all of these preparations into R as well. So even if they were visual variables, which have nothing to do with the data, only with how things are laid out on the screen. So, for example, I looked at the rotation each of the circles would need to have so that eventually the center would end up at the bottom. I precalculated the offsets the slices would need to have based on their predecessors. The only variable in JavaScript, to keep it economic, was the scale from the center outward. I could scale based on the screen size.
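As a rough illustration of precalculating layout values outside the browser, here is a minimal sketch. The talk did this in R and does not give the formulas, so everything here is an assumption: the function name `precalculate_slices`, the `weight` field, and the rule that each slice starts where its predecessor ends are only meant to show the idea of attaching screen-independent angles to the data and leaving the scale to the front end.

```python
import json
import math


def precalculate_slices(events):
    """Attach start and end angles (radians) to each slice of one circle.

    Each slice starts where its predecessor ended, so the angles depend on
    the order of the rows but not on the screen size.
    """
    total = sum(e["weight"] for e in events)
    angle = 0.0
    rows = []
    for e in events:
        width = 2 * math.pi * e["weight"] / total
        rows.append({**e, "start_angle": angle, "end_angle": angle + width})
        angle += width
    return rows


# Invented rows for one circle; "weight" is a hypothetical sizing field.
events = [
    {"event": "event A", "weight": 1},
    {"event": "event B", "weight": 1},
    {"event": "event C", "weight": 2},
]

rows = precalculate_slices(events)
# The exported file is what the browser would load and only need to scale.
print(json.dumps(rows, indent=2))
```

The front-end code then multiplies a radius by a single screen-dependent scale and reads the angles as given, instead of recomputing them on every page load.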
And even the medal offset is something I precalculated in R. And you can do it for networks as well, ones that are static and fixed. Download the final X and Y locations and the next time place them immediately, saving your viewers from having to run and wait on a heavy force algorithm. Even though they have nothing to do with the data, it's perfectly fine to precalculate visual variables and attach these to your data set. That's more often the case for fixed data sets than you may think. Sometimes it's way easier to calculate outside of JavaScript. Or it can save you a lot of browser calculations, making the page quicker to load. And as a bonus, it will make your JavaScript file a lot simpler as well.
>> So far, we have talked about the initial 80%: the data, the ideation, the visualization itself. But I like to think the last 20% is important as well. So when I started thinking about the story for Hamilton, I wanted to reach a wider audience than usual. And that meant that I wanted to make sure they were engaged enough that they would keep scrolling down the screen. So the first thing I did was have the dots fly into the center to form the Hamilton logo. And as the user scrolls down, the dots fly apart and dance in the background and come back together so I can tell the reader, hey, each of these dots is actually a line of the lyrics. And as they go into the first section of analysis, each of these sections actually corresponds with a song. So then I highlight the correct song to tell them what the song is for each section. One of the other small things I do is if the user decides they want to click on a song. That worked. Cool. I didn't expect the sound to work. You can see I put a progress bar on the right-hand visualization. And you can't see much else. Okay. But it gives users the context of where the music is relative to the music itself.
Another small example is from our March Data Sketches, where I used animations to explain how to read parts of the project. But I don't actually trigger these animations until the user has scrolled into that section. So that they can always start from the beginning of the explanation, no matter how far or how fast they're navigating. And it's this small attention to detail and these attempts at delight that really make a piece for me, because it tells the reader that we really care about their experience.
>> And some more examples of what you can do with delight: while on a flight back to Amsterdam I was without WiFi. So I couldn't do anything essential. And therefore, I decided to animate the legend I had for my visualization about fantasy books, just for fun. And other things, like adding animated GIFs of the most memorable moments of Dragon Ball Z. That took like two hours, adding the GIFs. Or adding the hover for the music visual. Or turning the top ten songs into tiny vinyls. Or having annotations about weird events in the history of the Olympic Games. Like Henry Pearce having to stop for ducks in the rowing event, but still managing to win gold. So besides getting the data on the screen and making it insightful, it is the other things, animations, annotations, weird legends, GIFs and more, that can make it truly unique and special and more of a delight to investigate. So take some time to think about these aspects as well.
>> And now we get to what I call the soft stuff, or what I call the best stuff. When we started talking about Data Sketches, we didn't expect the reception we have had. We thought if we had fun, maybe learned some things, and if our friends enjoyed the project, that would be really cool. But we have gotten the most amazing responses on both our visualizations and the value of our write-ups. And we have gotten to meet and talk to incredible people that we would never have had the opportunity to meet otherwise.
And we have gained an amazing friendship with each other that we didn't expect at the beginning.
>> We just wanted to have fun.
>> But when we stepped back and really thought about the transferable lessons, we agreed that the most important was, if you're about to take on an ambitious project, make sure to bring somebody along for the ride with you. And make sure that, if you're not too responsible, like me, the person that you bring along is very responsible, so that you can keep each other accountable and relatively on track for your project. Make sure that it's someone that you really respect. And hopefully that respect is mutual. And most importantly, that it's someone that you trust or can grow to trust, because that's absolutely crucial to receiving and giving feedback. And finally, if you're about to do something ambitious, like make a visualization from scratch every single month, know that it will be hard. There have been months where we have been creatively drained and didn't know how to go on. But remember, you learn as you struggle. And it's absolutely amazing, the amount that we've learned both technically and personally. And it's been absolutely worth the time.
>> So over the last ten months we have learned to find data in the weirdest places. That it's not blasphemy to precalculate visual variables. And sketching can help weed out errors, but you can sketch with code. And SVG paths are amazing, and math too. But we knew that. And small things can add a sense of delight for your audience. And we didn't set out to learn or... but we set out to have fun. We have succeeded. There were times where we were coding into the night and would rather have been watching a TV show. But it's opened up opportunities that we weren't looking for but have gotten. Two or three more months? I can assure you we cannot keep on creating visualizations at the same breakneck pace in our own time. But we want to share Data Sketches. And we have had great reactions, especially about the write-ups.
And we will make a publication on Medium. And anyone who wants to share his or her write-ups can do that there. You can do one a month, or collaborate with others on a topic, but it's fine if it's a standalone project. The main point is to show how your final visualization is a product of iterations, mistakes, and improvements. So please let us know if you ever have anything to contribute. Meanwhile, we hope you will join us in our final two months of data mining, sketching, and finding weird, fun, and overly elaborate visualizations. Thank you.
>> Thank you. [ Applause ]
Info
Channel: BocoupLLC
Views: 5,951
Keywords: Open Web, JavaScript, Programming, Open Source, Bocoup
Id: 4EOG7KwFspk
Length: 29min 58sec (1798 seconds)
Published: Mon May 15 2017