AI and Creativity: Using Generative Models To Make New Things

Captions
Hi everybody, my name is Doug. I'm a researcher at Google on the Google Brain team, and I'm going to talk to you about some work we're doing on using generative models to make new things. The context for this connects, I think, very nicely with Anima's talk, which I thought was really fantastic, and I want to put in context the amazing work that Amazon is doing with MXNet, the Facebook folks with PyTorch, and some of the work that's happening in TensorFlow at Google.

To do that, let's rewind back to the year 2000. Some of you maybe weren't even out of elementary school. I was a postdoc in Switzerland with a guy named Jürgen Schmidhuber, and I was working on this crazy model that no one had heard of called LSTM, long short-term memory. I was actually trying to get it to compose music. If you put yourself in 2000, seventeen long years ago, the way we worked was by hacking lots of C++ code. Felix Gers, the person who invented the forget gate in LSTM, had his implementation in C++, and I had to have my implementation in C++, and then this other crazy guy, a young grad student named Alex Graves, came along, and he had to have his version in C++, because he didn't want to use mine and didn't want to use Felix's. So you have this idea, "okay, I want to try something new," and it's a lot of work, because you have to write a bunch of new code.

Then I saw something happen about ten years ago at the University of Montreal, where I was a professor working in deep learning and music. Some of the guys in the lab thought it would be cool to automate some of this, so they built out something called Theano. I thought Theano was crazy, partially because it made me jealous, and I genuinely mean that. Watching them work was like watching my daughter not want to learn how to drive a stick shift, a manual transmission: you should have to do that, right? I felt like these students should have to write 50,000 lines of code. But what I saw happening, even ten years ago, was that the key was automatic differentiation. It wasn't just linear algebra; you had to have linear algebra, but you also needed automatic differentiation. Now it's almost like jazz improv (that was the term that came to mind for me, thinking about music at the time): they could just decide to swap out a convnet for an RNN, or decide to change the architecture, and build these huge graphs that Anima was talking about.
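As a concrete illustration of that point (my sketch, not code from the talk): with an automatic differentiation framework, swapping one architecture for another is a one-line change, because gradients are derived from the forward computation rather than hand-written. PyTorch is used here purely as a stand-in for Theano, TensorFlow, or MXNet.

```python
import torch
import torch.nn as nn

class SequenceModel(nn.Module):
    def __init__(self, use_lstm=True):
        super().__init__()
        # Swapping the body is one line; no gradient code needs rewriting,
        # which is the "jazz improv" freedom described above.
        if use_lstm:
            self.body = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
        else:
            self.body = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        out, _ = self.body(x)
        return self.head(out[:, -1])

model = SequenceModel(use_lstm=True)
x = torch.randn(4, 10, 8)        # batch of 4 sequences, length 10, 8 features
loss = model(x).pow(2).mean()    # a dummy objective, just to have a scalar
loss.backward()                  # gradients for every parameter, automatically
```

Compare that to 2000, where each of those two model bodies would have meant its own hand-derived C++ backward pass.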
This project that I'm going to talk about now, which is called Magenta, actually stems from asking: what can we do next, now that we have these wonderful tools? What are some new things we can try that we haven't been able to do before? My personal viewpoint is that supervised learning over high-dimensional data has been explored, and I'm not going to add much to the table by exploring it more. But we have other directions to go in, including reinforcement learning, unsupervised learning, other forms of density estimation, and of course things that none of us have dreamt of yet. So for me, I started talking about moving on to caring about generative models: models that can generate new instances of the data on which they're trained. I think this is one direction to go in, and the rest of this talk is going to be about that. It's going to be about the thought process that might lead you to caring about getting deep learning models and reinforcement learning models to do things like generate new text, make music, and maybe make art. I talk about it in the context of trying to connect with artists and musicians, because I think connecting with the creative community, especially the creative coding community, is going to be crucial if we're going to use generative models for something more than what we've been using them for until now.

So let's start with some models that are, at least for me, important. One of them is this idea that we can translate language (that's all I wanted to say). I also thought it was pretty cute that we could do things like translate into multiple languages by just prepending a token that says what language to produce, and learn a kind of interlingua, learning some things about what's general across languages. That's all fine, but it also gives rise to some new products, and here is where I'm starting to think about generative models. Not only can we improve translation in the server room, but on your device you now have much better translation, giving rise to a whole family of products that maybe won't make sense to me but might make sense to my kids.

Just as an aside, I want to point out that this isn't just Google (this is a Google example), but using deep learning for translation has really been a game changer in terms of expressivity. If we look at the old phrase-based machine translation, we see the last sentence of Hemingway's "The Snows of Kilimanjaro" being translated to Japanese and back: "Whether the leopard had what the demanded at that altitude, there is no that nobody explained." The neural system instead moves toward "No one can explain what leopard was seeking at that altitude"; it just dropped the "the." Backing up a slide: the previous systems are in blue, the green is the gain we saw from using neural networks for this problem, and it starts to come very close to what humans can do, in yellow, on these language pairs. You'll note that not all language pairs are equally easy; it's harder to translate back and forth between Chinese and English than between Spanish and English, which I think is intuitive. But even these gains, the green, moved us a long, long way in terms of expressivity, and it's that expressivity we need if we're going to care about using these models for generative purposes.
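To make the prepend-a-token trick mentioned above concrete, here is a minimal sketch; the token strings are invented for illustration and are not the actual vocabulary of Google's system.

```python
# Hypothetical sketch of multilingual translation via a target-language token:
# tag each source sentence with the desired output language and train one
# encoder-decoder on all pairs, letting it learn a shared "interlingua".
def make_training_example(source_sentence, target_language):
    """Prepend a token naming the target language."""
    return f"<2{target_language}> {source_sentence}"

print(make_training_example("How are you?", "es"))  # <2es> How are you?
print(make_training_example("How are you?", "ja"))  # <2ja> How are you?
```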
The other example, also a Google effort that came out of Google Brain, is Smart Reply. How many of you have seen Smart Reply? How many of you are repelled by Smart Reply, think it's the worst thing that could happen? Okay, that's okay. How many of you actually think it's cool? Okay, more of us. This is a techno-happy crowd; I've given this talk in other places, and many, many more people were repelled by the idea.

So let's talk about generative models. Smart Reply is actually not a generative model; it's picking phrases that were pre-composed. But let's move, just as a thought experiment (this is just an aside, for fun), to a world where we can actually use deep nets to compose a whole paragraph, and we give you an interface that maybe gives you a few suggestions, but each suggestion is actually multiple sentences, and then we train and we get better. And by "we" I don't mean Google, I mean we as a research community; I have no idea where this is going to come from. I'm already thinking of some of the side effects. One of them is email, the text in an email. And it won't be email, because I have kids, and no one uses email anymore except us, so it'll be some other form. But remember how, at some point after word processors came along, to be polite to your grandmother you'd handwrite the thank-you note? I think what will happen is we'll handwrite the email, because that will show respect for someone: I'm not going to let software generate my response to you, because this is a love letter, or this is a thank-you note, so this one I'm going to write by myself. And then you have the game-theoretical version of that, which is the boy or girl you're trying to court saying, "you didn't really write that," and you saying, "no, I did, really, I wrote the whole thing by myself." So there are a lot of directions this can go in terms of generative models.

This was the thinking that led to creating this project about two years ago: can we use deep learning to create some new things that mean something? It says right there that we're going to make music and art using machine learning, though in some sense that's a bit of a falsehood, because as you'll see from the rest of the talk, it's not that interesting to just push a button and watch machines make art. That sort of skips the whole point of what art is for, which is, I think, at some level communication. So we'll put that aside; that's the title, and I'm not going to change the website. What really underlies this, getting back to Smart Reply, is watching these kids. These are my kids, Sam and Olivia, and they're in Japan. They're very adventurous eaters; they basically ate everything we could find. It was a blast; this was last Christmas. But you'll note, and these were not staged: Sam's phone is right there on the counter with him, and while I was picking up our train tickets in Tokyo, they're laughing at something on their phone. What I'm watching these kids do is really live their lives through these mobile devices, and they're living them in a way that is not, I think, just recapitulating the way we lived our lives, whatever your generation is, with word processors or email or nothing.

So if you put yourself in the mindset of what we're doing with machine learning and how far we've come, I thought it was very interesting to put myself in the shoes of the grad students at the University of Montreal who were developing Theano, which I thought was silly, and to say: okay, wait a minute, these kids are using things like Snapchat. How many people here use Snapchat? There's a linear correlation between how old you are and whether or not you use Snapchat. If you're not familiar with it, Snapchat actually helps you tell a story. It does it in very simple ways: it might do something as simple as put on some whiskers and give you a little cat face, or it might actually let you frame a narrative in a small way. And what I noticed these kids doing, absolutely seamlessly, without any assumptions about how this really should or does work, is just using Snapchat to communicate, using it to tell their stories.
So I think in my mind there's a very possible future where we're using real machine learning and real generative models to generate content, but we're allowing users to shape that content. We're basically building tools that are themselves generative and intelligent, but they're helping us communicate. Okay, that's the Philosophy 101 of Magenta. Now I'm going to talk about two or three projects, and I'm going to focus on a couple of machine learning techniques. This isn't a terribly technical talk, but it is actually quite technical work, despite the pictures of bears and things like that.

The first project I'm going to talk about takes advantage of a game that came out of Creative Lab at Google New York called Quick, Draw! Who's familiar with Quick, Draw!? If you're not, it's basically playing Pictionary against a machine learning algorithm. You're told what to draw ("draw a bear"), and as you draw, your laptop is actually shouting at you. It's really horrible: "oh, it's a bear; no, it's a truck; no, it's..." It's really very distracting, and you have twenty seconds. So what we have, it turns out, is almost a billion drawings (it's a big number; we don't need that many, as it turns out), drawn in less than twenty seconds while being shouted at by a computer. These are the raw materials of art, right here.

What we decided to do (the primary research around this was by David Ha, who did some similar work in the past with kanji, with Japanese and Chinese characters) is try to learn to draw. I focus on drawing here, not pixels, because it gives us a very different space in which to work. We stored the actual strokes that users drew as they were trying to draw the cat, and we're going to try to encode that information into some sort of autoregressive model or autoencoder, and then decode it and get back a cat. That's nice, and because we're smart machine learning researchers, a bottleneck is going to be added so that z, where we're storing our information in the latent space, is actually constrained; otherwise the model just memorizes cats, and we all know it's boring to memorize.

So we're going to talk about a couple of aspects of z. One of them is that it's compressed. How many of you noticed that the drawing on the left and the drawing on the right are different? This is real data: you give the trained model a cat with five whiskers and it gives you back a cat with six whiskers, and that's already, in my mind, kind of interesting. Obviously there's an easy explanation for that: there's some mode in the data around faces of cats. If you look at the strokes, which are actually just delta-x, delta-y (that's the way the data is represented: where did the pen move, was the pen picked up, did the drawing stop), it's a very simple data representation, and somehow there's a mode in this data where you tend to draw that sixth whisker, so the model does that. It's a form of underfitting, but it's also, I think, kind of interesting.
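For reference, here is roughly what that data looks like; this is my sketch of the stroke format described in the SketchRNN work (pen offsets plus pen-state flags), with made-up values.

```python
import numpy as np

# Each row: (dx, dy, pen_down, pen_lifted, drawing_done); pen state is one-hot.
cat_whisker = np.array([
    [ 5.0,  0.0, 1, 0, 0],   # pen moves right while touching the paper
    [ 4.0, -1.0, 1, 0, 0],   # the whisker curves up slightly
    [ 0.0,  0.0, 0, 1, 0],   # pen lifts: this stroke ends
    [ 0.0,  0.0, 0, 0, 1],   # the whole drawing stops
], dtype=np.float32)

# A full Quick, Draw! sketch is a variable-length sequence of such rows,
# which is what the encoder RNN consumes and the decoder RNN emits.
```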
If we move forward and start to sample from a model like this, what we see is that the model does a pretty good job of storing what you would consider reasonable drawings. This one is trained on the yoga class, and these are unconditional samples from yoga. We can turn up the temperature, so that we sort of flatten out the softmax probability density and become more likely to sample from less probable parts of the model, and it turns out hot yoga is more dangerous than cold yoga; it really is. As I said, they're kind of funny: you could do the poses on the top (we all could probably take a stretch break and do that), but we're not going to do the ones on the bottom.

One thing I can point out (this is one of the technical bits in the talk, and we'll come back to it a second time) is that these drawings are unconditionally sampled from the model, and they all make sense. We can generate a million of these, grab one, and it will make sense. One reason it makes sense is that the z we build is trained so that when you roll the dice, you get something meaningful. Instead of having one die, let's say we have a hundred-dimensional z, so we're going to roll a hundred dice, one for each dimension of z, and those dice are not going to be uniform; they're going to be Gaussian. What's going to happen is that we treat each dimension of our embedding space as being spread out over a Gaussian, over a bell curve, so that when we roll the dice we get something meaningful.

The way we're going to achieve that (some of you have probably already jumped ahead, and some of you may not know this field as well) is that we're not just going to train z to reproduce cat drawings; we're also going to add an additional cost that trains z to act like a Gaussian random number generator. It's kind of weird, because if you ask z to be a random number generator, it doesn't get to store any information, so you don't want to train completely on that; yet if you ask the model to just reproduce cat drawings, it might give you an embedding space where certain areas of the sampled space don't make any sense. So this is called a variational model, and it involves training on two costs: one of them is reconstruction, an L2 cost on the drawing, and the other is a KL divergence against Gaussian noise. You basically have this mixture of saying, "I'm going to let you store information, but I'm going to make you store that information in such a way that I can generate from it."

There's a whole family of techniques that I group together as making models able to generate in a way that makes sense. I'm not going to have time to touch on all of them, but if you want a laundry list, other things would involve having a counterfeit detector that forces the model to generate really realistic versions of what it's generating (that's called a generative adversarial network), or using reinforcement learning to add some sort of structural cost that forces the network to do something it wouldn't otherwise do. These are all ways to force the model to do something more interesting than just minimizing an L2, Euclidean distance over strokes or over pixels or over whatever, and that's what gives us this property of being able to sample from the model.
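Here is a minimal sketch of that two-cost objective, under my assumptions about shapes and weighting; real models tune, and often anneal, the KL weight.

```python
import torch

def vae_loss(x, x_reconstructed, z_mean, z_logvar, kl_weight=0.5):
    # Cost 1: reconstruction, an L2 penalty on the reproduced strokes.
    reconstruction = torch.mean((x - x_reconstructed) ** 2)
    # Cost 2: KL divergence between N(z_mean, exp(z_logvar)) and N(0, I),
    # in closed form; this is what pushes z toward "Gaussian dice".
    kl = -0.5 * torch.mean(1 + z_logvar - z_mean ** 2 - torch.exp(z_logvar))
    return reconstruction + kl_weight * kl

def sample_z(z_mean, z_logvar):
    # The "roll a hundred Gaussian dice" step (the reparameterization trick),
    # written so that gradients can flow through the random draw.
    return z_mean + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mean)
```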
Now, why do this in the first place? We're going to switch gears for a second, because this model lives in the space of strokes, and because it's trained on a variational loss, it can kind of pick up where you left off. What David is doing here is drawing a mosquito, and as he's drawing it, we have nine versions: we sampled from the model nine times to complete that mosquito drawing. And remember, we also have this network predicting end-of-drawing, so it will stop when it thinks it's ready. If you're curious, the encoders and decoders are both recurrent neural networks, and the decoder also uses a mixture of Gaussians to predict the x-y dimensions, which is kind of cool.

The other thing I would point out is that because we're living in such a space, we're getting this kind of regularization, and you can perhaps use this kind of regularization for artistic means. If a person puts in rather normal things (I think the first column looks like pretty straightforward drawings of cats), you're mostly going to get them back. But if you put in an eight-legged pig, you're going to get back a four-legged pig. And my favorite, and I think this is suggestive: if you have a model trained just on pigs and you give it a truck, it gives you a truck-pig. I like it that you're laughing; I think it's great that it's funny. Remember, these were just drawings collected in a game where you had twenty seconds, but even with that kind of data, if you take some care, you can build out these really cool effects. So I want you to think about what would happen if you had artists actually trying really hard: an artist who's trying to put her thoughts into a model like this, giving it a hundred really cool drawings, and then starting to sample from the model and looking at the possible futures that could come from being able to expand her work. The nice thing about that is that if it's the artist's art, the expansions are just another extension of that art, so it's not really the machine doing the hard work.

Incidentally, it's not always interesting: if you give the cat class a toothbrush, it doesn't really give you back a cat or a toothbrush, as you can see at the bottom there. And this is my last one; I love this. This is the rain class. It turns out that if you draw a cloud, people always drew rain after the cloud, so you can just kind of make it rain, and doing this live is fun: as soon as you start the cloud drawing, it does the rain for you, and you can change the shape of the drops. I usually do these live, but I wanted to make sure to save time for something else.

So the closing thought on this art stuff is that we're just scratching the surface. It's a variational model using fairly well-known encoders and decoders, with some nice tricks to keep these models from overfitting, but mostly, I think, it's just trying out new things in the space of generative models. This is all open source, and I don't expect anybody to go home, play with SketchRNN, put on your beret and your smock, set up your easel, and say "this is art." But I do want you to leave thinking: wait, this is actually a very interesting direction for us to go in.
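Going back to two details mentioned above, the mixture-of-Gaussians output and the temperature knob: here is a rough sketch of how sampling one pen offset from such a mixture might look. The parameters are made up, and the exact temperature scheme in the real model may differ; the idea is that heat both flattens the mixture weights and widens each Gaussian.

```python
import numpy as np

def sample_offset(weights, means, stds, temperature=1.0, rng=np.random):
    # Flatten (or sharpen) the mixture weights with temperature, renormalize.
    w = np.log(weights) / temperature
    w = np.exp(w - w.max())
    w /= w.sum()
    k = rng.choice(len(w), p=w)                # pick a mixture component
    scale = stds[k] * np.sqrt(temperature)     # hotter: wider spread
    return rng.normal(means[k], scale)         # sample a (dx, dy) offset

weights = np.array([0.7, 0.3])
means = np.array([[2.0, 0.5], [-1.0, 3.0]])    # per-component (dx, dy) means
stds = np.array([[0.3, 0.3], [0.5, 0.5]])
print(sample_offset(weights, means, stds, temperature=0.5))  # "cold": conservative
```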
Now I'm going to switch to music and some similar work that I hope you'll find evocative. The work I was doing with Jürgen Schmidhuber back in the day was focused on taking autoregressive models, specifically LSTM, and seeing if we could write music with LSTM by predicting the next note. There was actually a very serious question behind this, which is: can LSTM, which is, you know, a very complicated dynamical system that in principle is able to capture long-term structure, actually capture the kind of structure we see in music, which repeats and has strong temporal constraints? It's actually quite an interesting time series to work with if what you're looking at are models that can learn long-term structure. And it turns out that everybody else's work except mine wasn't very good, but my work in 2000 was really, really good; read the 2002 paper, it's great. No, it was actually kind of boring, wandering blues.

Recently, on the Magenta team (this work is also up on our blog, at g.co/magenta), I think we've moved toward much, much more interesting music using roughly the same approach. We didn't really change the machine learning; what we did was think more carefully about how to match data to process. I think if you start caring about generative models, you have to care as much about where the data is coming from and what you want to do with the data as you care about "did I use a ReLU?" or "how many layers did I put in my model?" So what we did (I'll back this up) was move from caring about musical scores to caring about musical performances. We were lucky enough to get a number of expert pianists playing in a piano competition; Yamaha runs these piano competitions and captures the data on Disklavier reproducing pianos, and they put that data out for everyone. It's all out there on the web to use, with a great license to play around with and do research on. So now what we have is something that looks like a piano roll, like this, but it's not aligned to a score anymore. There is a score (the score was being played), but all the slowing down and speeding up that you see the expert pianists doing is in the data as well. And it turns out that's crucial, because it gives us a lot more variance to work with, keeps us from overfitting, and makes these models much less brittle. We can generate much more interesting music by generating from performances of these scores than we can from the scores themselves.

As for the model itself, there's more to it than is in this graph (we're using attention mechanisms and a lot of other modern machine learning), but fundamentally it's an autoregressor. It's LSTM, and it's predicting a single softmax that says: should I turn a note on, should I change note velocity, should I turn a note off, or should I advance the tape, advancing in time by as little as a millisecond? So what we get are performances, with velocities and with the contraction and dilation of time.
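To make that single-softmax idea concrete, here is a sketch of an event vocabulary along these lines; the bin sizes are my illustration of the idea, not necessarily the exact Magenta configuration.

```python
# Every output of the model is one event drawn from a single vocabulary:
NOTE_ON  = [("NOTE_ON", pitch) for pitch in range(128)]          # key down
NOTE_OFF = [("NOTE_OFF", pitch) for pitch in range(128)]         # key up
VELOCITY = [("SET_VELOCITY", v) for v in range(32)]              # loudness bins
TIME_SHIFT = [("TIME_SHIFT", ms) for ms in range(10, 1010, 10)]  # advance the "tape"

VOCAB = NOTE_ON + NOTE_OFF + VELOCITY + TIME_SHIFT

# A tiny performance fragment as a token sequence:
fragment = [
    ("SET_VELOCITY", 20),
    ("NOTE_ON", 60),       # middle C goes down
    ("TIME_SHIFT", 500),   # half a second of expressive timing passes
    ("NOTE_OFF", 60),      # middle C comes up
]
# The LSTM is trained autoregressively: given the events so far, predict
# the next one with a single softmax over VOCAB.
```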
What you'll hear next are some unconditional samples from this model. This is work in progress; it isn't conditioned on any performance. This is just rolling the dice again, starting from some starting point and playing. I hope you'll agree that they're rather evocative pieces of music. [Music] And it would have gone on forever; I stopped that one. To give you an idea of the variance, let's hear a second one. [Music] And here's a third one. [Music]

Okay, honest vote: who's thinking, "these are kind of interesting"? Okay, good, that's great; I think they are too. But I think there's a lot missing, and let's be humble here. Let's just look from a bird's-eye view: this is a Chopin étude in E major up above, and these below are a few of the neural network performances visualized the same way, as a piano roll where time runs horizontally. You can see there's a massive amount of structure happening in a piece of music composed by a music expert that we're not getting to. We're not even close. But I feel like we're getting these evocative short-term bursts. There's a really fun demo you can play with on our blog, written in deeplearn.js, so your browser will run this model and you can play with a few conditioning variables. I think of it like wind chimes more than music, but it's fun to play with.

So what does this model do? Well, let's compare it to the SketchRNN model with the cats. We don't have cats; we have music. Here's the proposed encoder paired with the decoder and some variational embedding. It turns out this model is really just the LSTM network, so if you want to think about it in terms of autoencoders, it's really just taking in MIDI and pushing out MIDI, decoding it through an LSTM. So there's a lot of work to be done. Where we want to be is actually here: we really do want to have some encoder of music. The problem is that music is variable length, and we can't hope to fit an entire piece of music into our z; that doesn't make sense. So we need to think of ways to chunk this and move toward more hierarchical models.

We've got some preliminary work that I'm going to show you; this hasn't been published, even on our blog, yet. The basic idea is to have a z, but that z's job is only to store a measure of music, only a little bit, a bar. Then we'll need two decoders: we'll need to decode a measure of music, but we also need to chain that together and be able to decode multiple measures of music, so that we get something coherent. If we can train these models right, what we should be able to do is actually generate some coherent long-term structure. This is what the whole graph looks like up at the top. It's a little bit complicated, but the green boxes are measures, and they're individually, sequentially encoded into our autoencoder and then decoded as measures, which are then pushed back up to be decoded as phrases, which are then pushed back down as measures. I wish I had time to dig into that. Wow, the time came fast; sorry.

So Adam Roberts trained on music where he had a bass line, a lead, and drums, and they're all being pushed through the encoder and stored in this latent space, and then there are three separate decoders: one decoder for drums, one for bass, and one for lead. Now, this is going to sound like pop music; what we should be listening for is whether it actually sounds coherent, because no one has really been able to make generated music like this coherent before. Maybe you could turn the volume down a little bit from the back, in case it's too loud. [Music] There's a little bridge at the end, which I know is kind of interesting, and I'd come back to that, but we're running out of time.

The last thing I'll have time for is one other benefit of living in this embedding space: in the same way that we can move around the embedding space for drawings, we can move around the space of simple measures of music. So we're going to hear a starting measure and an ending measure, and then we're going to move our way through the space, spherically linearly interpolating, to try to get from one to the other.
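For the curious, here is a minimal sketch of spherical linear interpolation (slerp) between two latent vectors, which is one common way to do this kind of walk; the hundred-dimensional z values here are random stand-ins.

```python
import numpy as np

def slerp(z_a, z_b, t):
    """Move a fraction t (from 0 to 1) of the way from z_a to z_b along an arc."""
    dot = np.dot(z_a / np.linalg.norm(z_a), z_b / np.linalg.norm(z_b))
    omega = np.arccos(np.clip(dot, -1.0, 1.0))   # angle between the vectors
    if np.isclose(omega, 0.0):
        return z_a                               # nearly parallel: nothing to rotate
    return (np.sin((1 - t) * omega) * z_a + np.sin(t * omega) * z_b) / np.sin(omega)

z_a, z_b = np.random.randn(100), np.random.randn(100)  # two latent codes
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 8)]
# Decoding each point along `path` yields measures that morph from A to B.
```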
So that's point A. [Music] That's point B, the two embeddings. [Music] And here's moving between them. [Music] [Applause]

Yeah, that was pretty cool; shout out to Adam Roberts for that. It's not easy to get from point A to point B in an embedding space and have it make sense, and I think all of those made some musical sense. I'm out of time. We're not there yet: successful musical instruments, even musical tools as simple as drum machines and guitar pedals, are super hard to build. We've built a few tools, but we haven't really managed to let musicians express themselves with these, nor artists, so we've got a ton of work to do. Please check out our blog; this is also on GitHub, there's a bunch of open source code out there, and we have a lot more coming at NIPS as part of a creativity and design workshop. Apologies for going over a couple of minutes; thank you all very much for your attention. [Applause]
Info
Channel: Coding Tech
Views: 19,190
Rating: 4.8986177 out of 5
Keywords: machine learning, artificial intelligence, generative models, neural networks, reinforcement learning
Id: WaqlmeRfPFE
Length: 30min 42sec (1842 seconds)
Published: Sat Dec 23 2017