ElixirDaze 2016 - Processing 2.7 million images with Elixir (vs Ruby) by David Padilla

Captions
Well, yeah, like Peter said, my name is David and I come from Mexico. You can find me on the internet as dabit; I'm like that on Twitter and GitHub and almost all social networks. I'm here to talk to you about how to process millions of images with Elixir versus Ruby. The title was shortened, but I do have to make a warning before starting: this might not be the best way to do it. This is just a project that I decided to use to learn Elixir, and I learned a few things along the way. So this is for you if you're starting out or you're undecided whether or not to move to Elixir. There's going to be a good story, so let's start.

I work on an application for the real estate market, and it looks something like this. People can go there and put a property up for rent or for sale, and you can upload pictures for the properties, like any regular website does. If you're familiar with the Ruby on Rails world, I'm using the gem called CarrierWave, which creates different versions of the images, like thumbnails and maybe a big size or smaller sizes, and uploads them to S3, to a bucket in Amazon Web Services, and that's where the images live.

Everything was going great, except one day I get a call from the designer and he tells me, "Hey, we have this weird layout here on the home page. I want to change it. I want to have a map on the right side of the screen." And I'm like, "Yeah, that's not going to fit." "No, of course it's not going to fit. We need to make these images a little bit smaller." "Oh, OK, sure." So I thought to myself: CarrierWave, it's easy. I can just call a method in CarrierWave called recreate_versions! with the new sizes and it will all be magic. I'll just create a rake task that goes through all the images one by one and calls that method, right? Yep, that
sounded like a good idea, except for one thing: since you have to download each image from S3, process it, and then re-upload it, it took around one second to process each one. And when I made a query to the database, I had two million seven hundred thousand images there. If you do the math, at one second per image, divided by sixty, that will take forty-five thousand minutes, seven hundred and fifty hours, and that totals 31.25 days. My boss said, "Yeah, no, that's not going to work. We need it, like, tomorrow." And, yeah, that's not going to work either, but I'll do my best.

So I was like, OK, let's just use threads, right? Because Ruby is so good at threading. I'll just do it, and this is sort of what I did. What are you laughing at? It's very good. So this is what I did: there's a Queue class in Ruby that's used exactly for this. I pulled the images in batches, created that queue object, pushed every one of the images into the queue, and then just capped the workers, in this case at 20, because that's the max number of CPUs that you can get at DigitalOcean. So I said, hey, a worker per CPU sounds great, and just started processing images. And, you know, if you know how... yeah, no, that's MRI, sorry. It's weird that you mention it, because if you know how Ruby works, it's something like this. This is an old slide, but it sort of gives you a picture of how it works. In Ruby 1.8.7 there was no parallelism at all, and then in later versions there came some parallelism, but it's not real, because it's not really taking care of using all the CPUs. And then there are implementations like JRuby and Rubinius that do that, but I didn't want to use another version.

So the problem with Ruby is that it has a global lock, and, you know, it sounds like something where you ask, why would they put a lock
in place? But then you come to code like this. I did some research, and there are times when you have code something like this, where you have an array and you populate that array with objects in different threads. Ruby, with its global interpreter lock, guarantees that you get the results that you're expecting. So with this code running on different versions of Ruby, in MRI you get the 5,000 objects that you'd expect, because it's locking every time you're adding to the array and it's not letting any other code modify it before it's done with it, whereas with JRuby or Rubinius, which don't lock, you get unexpected results. So the philosophy for Matz is that he's taking care of you. He's making sure that you don't go ahead and do stupid stuff like tangling the code or the data in it, whereas the other implementations said, no, we're going to give you performance and you are responsible for doing the right thing, like implementing a mutex or whatnot.

So this is one of the examples where the global interpreter lock is in place in Ruby, but there are many more. There are many more functions or methods in the language that lock you out, so you are not really using parallelism, or maybe you are, but at some point it will stop you and then continue. That's why it's not that efficient. But even so, the new version of the code that I did with threads reduced the average processing time to 0.6 seconds per image, which is a little bit better but, you know, still 18 days, so my boss was still not happy at all.

All right, so I decided to give Elixir a try, because I did not get on the Go train for some reason. I never listened to the people that kept telling me Go is so great at concurrency stuff, and I didn't get on that train. But Elixir, on the other hand, caught my attention because it has a better syntax. I'm a fan of that;
since I became a Ruby developer I've been like a syntax snob. The code needs to look good, and I hate semicolons and all of that stuff that you have in other languages. When I looked at some sample Elixir code, it looked better, so I decided to go with Elixir.

The biggest difference between Ruby and Elixir is that if you have code in Ruby, your code runs in a single process. If you get an exception for some reason, let's say I'm running my rake task and for some reason it can't upload to Amazon and there's an exception, then the whole thing dies, and I may not notice it. Whereas in Elixir you have a model where your process can spawn other, smaller processes. It distributes the work between processes, and then those processes can make other processes, and if one of them dies then you just replace it with another one and the application continues working perfectly. So it's better in that aspect.

So I decided to give it a try. Let's see what I needed to do. I needed to create an app, an OTP application. Then my app would retrieve records from the database, download the original image from Amazon, create the new image sizes with ImageMagick, and then upload them back to Amazon S3, just like CarrierWave does for me, except there's no CarrierWave in Elixir.

The first part, creating the app, was pretty simple. You have probably all already created a new app: it's just mix new and the name of the app, and it creates the whole tree with the required files and you're done, you're ready to start coding. So that's unimportant.

Then here comes the real code. You need to retrieve records from the database. The best way to do it right now that I found is by using Ecto, which is sort of like the Active Record of the Elixir world, except it's not. All I needed to do was configure it: add the adapter and the database name, the
username and the password, which of course as the root user you don't need, and create a model. The model is just a module: you use the Ecto.Model module and then you define the schema, the columns that are supposed to be in your database. It doesn't do it for you, except for the ID, I think. Since I only needed the file name from the database to create the URL for S3, that's all I added there.

Once you have a model you can start creating queries. The queries are created sort of like functions, and then you chain those functions to get your results. Let's say you have a main query, which is just "select everything from this table", and then you can do other things, like maybe find only one, so you chain the main query into this other query. Then maybe you need it paged, so you can add the page thing, limit and offset, and in your code you just use a pipe to chain all those functions and get the data that you require. In my case I only needed to get everything, because I was not going to page the information, so I just created an all method and brought them all, because I needed everything.

Then I needed to download the original image from Amazon S3. That was pretty simple: I just needed something to download via HTTP, like curl or wget does. I found this library called HTTPotion in Elixir which can do that for you. It's just a wrapper around another Erlang library, but it works, and I needed that. Here's an example of something that I love in Elixir, which is the pipe. Look at how it makes your code look cleaner to me, and more descriptive. What the pipe does is take the result from the first function and pass it to the second function in your pipe list, as the first parameter of that other function. So I could have written this code like
this, you know, where this is the first parameter, but it looks so much nicer when you use the pipe. Once you have four or five functions chained, you really start to get the benefits of using the pipe, and it just looks so good. And that, like I said, will download the image from Amazon into my local storage.

The next problem I had was that I needed to create the new image sizes. So I decided to figure out what I could use for ImageMagick manipulation, and I found this hex package called Mogrify, which is, like it says right there, an Elixir wrapper for ImageMagick on the command line. It did what I wanted to do, but it didn't have all the tools. Here's another example of how the pipe looks; this is so awesome. It had methods to resize the images, but if you have used CarrierWave there are other methods, like resize to fill, resize to fit, resize to... I don't remember the other ones, that give you different behavior in how it does the resizing. I didn't want to put code in there without knowing exactly what it was doing, so I decided to port those methods from CarrierWave into Mogrify, sent a pull request, and got it accepted. I love doing open source stuff like that. So now we had these methods for me to use in my own application.

The next problem was uploading to Amazon S3, and this is where it got a little bit tricky for me. The first thing you do when you don't know the language is Google "how do I upload files to S3 with Elixir", and I got no results. I was like, what's going on, do Elixir programmers not use Amazon at all? This is a dramatization, but yeah, I found no results about uploading files to S3 with Elixir. So I was like, all right, that's not going to stop me, because I have the command line and I can use Amazon's own tools like s3cmd. I'll just make a system call and upload the files through the system, something like this. So I was done and ran the script, and it was taking 1.6 seconds per image. At that point I was like, what's going on? Someone lied to me, because this is not working. This was me at that moment, like, am I wrong?

Fortunately I know people with Elixir experience, so I called them and said, "Hey, your Elixir thing's not working. What's happening? I'm doing this and doing that and it's taking longer. What's going on?" As I explained to my friend what I was doing, his face went like this, basically like, dude. He told me, "Look at your code. What you're doing right there is opening an operating system process, doing your thing, closing it, and then opening and closing 2.7 million times. That's going to take a lot of time. What you're doing is totally wrong." And I was like, OK, yeah, I understand what's going on. Of course it makes sense: open the process, close the process, open, close, that's a lot of work. He told me, "Let's talk about the library you're using. s3cmd is a Python library, so why don't you use ErlPort, which is used to connect Erlang to other languages, open a Python process, load Amazon's code, and use that process for the 2.7 million images? It's only going to be opened once, and then you'll be processing the images way, way faster." That sounded like a good idea, except I didn't have the time to do that.

At some point while we were talking about this I was like, "Wait, so you're telling me that I can call any Erlang library from Elixir?" He's like, "Yeah, you can just use the syntax like this, and whatever code is written in Erlang, a library, whatever, you can just call it from Elixir." And I was like, oh, that's interesting. So what I did was
went back to Google: Google, please tell me a way to upload files to S3 using Erlang. And that's what I found: there's a library to do that in Erlang. Not in Elixir yet, but there's something for Erlang. So I was like, oh yeah, I should have started there. Once I found this it was pretty simple. You just need to add the library to your dependencies and then call it like this, and you're basically on your way.

Except for one small thing that I forgot to make more clear: when I found this, I started getting a lot of errors from the call to the library, because there was no match for the arguments that I was sending to the function, because there was something missing. The problem was that I was using strings in Elixir, and when you call Erlang libraries, most of the time you're going to want to send character lists. So I needed to convert all of the arguments into character lists before sending them to Erlang, this way. I also learned that there's a difference between double quotes and single quotes in Elixir, and I learned it the hard way, basically, because it was just blowing up and I didn't know what was going on. It's right there, it's a string, it has what it's looking for, and it still said there was no match, because I was sending a string and it was expecting a character list. So if you ever use Erlang libraries, you're probably going to need this advice. Obviously, when I told my friend, he told me, "It was right there in the example I sent you," and I was like, "I didn't read your example. What are you talking about?"

So anyway, now I can upload files to Amazon, and I can do it linearly. So what about concurrency? Because that was the whole point of this. If you think about what I need to do to process every image, it's just retrieve the records from the database, then download the image from S3, then create
your new image sizes, and then upload the result to S3. Those things are sort of unrelated, right? I can retrieve the records from the database, and as soon as I get them I can start downloading images, and then as they are being downloaded another process can start processing them, and then another process can start uploading them. They are separate things that you can split into different processes, and that would be the optimal thing. I didn't have the time to do that, but at least what I did was separate all the processing per image into different Erlang processes. So the only thing that happens is that I retrieve the records from the database and then I create several processes for the whole thing of downloading, processing, and uploading.

For that I found a tool called poolboy, which is basically just a worker pool factory, and you use it for cases like this. I was actually trying to add it to my dependencies, and it was nice that it was already there, because apparently Ecto uses it to handle the connections or whatever. So I already had it, and that was double cool.

What you do with poolboy is create a worker module, which basically just starts the GenServer. Then you put in some code that you want to initialize every worker with; this is where you put the code that you don't want to be repeated every time. In this case I just initialize the connection to Amazon when the worker starts. Then you have the code that actually processes whatever you want to process. In this case I'm just calling the process method that will do everything from downloading the images all the way to uploading them again, and then return a reply and the result of that. And you keep the state, because there's no state per se in Elixir classes, so you need to pass it through the whole life of the worker. And then you need a
supervisor, which is the one that's going to know when everything's up and working. To initialize it, basically you just need to tell it its name, which I put in a function in case I needed it for future reference, and the module that it's going to hand work over to, which is the worker that we just saw. Then you can handle the size of the queue, which is very flexible. I found it very interesting how the pool works, because you can set a size first, let's say 20, and then, if you start sending work to the queue, you can set another parameter, max overflow, and you can say, hey, if you get a lot of work then you can grow maybe up to 50 or 100 or whatever. So the 20 processes that I specify there are always going to be up, but the other ones will only exist if the 20 are not enough. So if your pool gets a lot of work it will grow, and as soon as it's done it shrinks back. I could also have said, hey, the size is zero, I want the processes to be off, and then max overflow 20, so it will grow as the work comes in. This is relevant because maybe you don't want the resources on your server to be spent if they're not being used. Maybe you want 200 queues, but you don't want them to be up if no one is using them, because that will waste CPU, memory, and whatnot. So that's very powerful in terms of flexibility: you can grow and shrink the queue pool and it will just do it for you. You don't have to do anything but state it there.

So the main module, the one that starts everything, looks like this. I just start the program and start that supervisor, which is going to handle the queue, then call this enqueue method, which is going to just send all the records from the database to the queue, and then just return. The supervisor process stays
up. Enqueue, what it does, like I said, is just pull all the records from the database and then go one by one. This piece of code is what creates the actual process, the subprocess, that it is sending to the queue, and then the queue will manage when to run the process. It will automatically say, hey, I have a process available for you, I will handle it, and then just discard it as soon as it's done. So it just works, basically like magic.

So my server just started working like insane. Look at that: it's using all its CPU power to do all the processing of images, with so little memory used. This is amazing. It's great how Elixir just takes care of it. It's really the Erlang virtual machine, but still, to me it's Elixir that's doing all the magic.

So, in conclusion, it took about four days to process the 2.7 million images. It still took some time, but it was way better than the month that I had forecast with Ruby. If you do the math to split it, it took like 96 hours, five thousand seven hundred and sixty minutes, and all those seconds, and if you average it, it took around 0.128 seconds per image, which is insanely fast. The only problem is, well, it sort of solved my problem, because it took me like 12 days to figure it out. So in total it was like 16, 18 days, so my boss was still not happy, but it was quite the learning experience.

The second conclusion would be that Elixir is so, so great, and not just because of Elixir but because you can use technology that's been there for years in Erlang. Erlang has existed for 25-something years, and at this point you'd expect Erlang developers to have solved all the problems in the world. So you're not reinventing the wheel, you're just making it a little bit
better with the syntax, and that's cool. And I know, because I know people that do Erlang for a living, and now that Elixir is getting hyped, when I talk to them about all the amazing stuff that I can do with it, they're like, "Yeah, I've been telling you for years you're wrong, Erlang is the right path." So yeah, they were right; they have been right for 25 years. All these problems that we are still solving in other languages, like threading and concurrency, it's already there. It has been there for years and years, and it was actually designed for it. You're not patching a language that already existed to handle threads. No, Erlang was designed to handle multiple processes at the same time. So when you use Elixir you're using all that experience in your code, and that's great, because it's harder to find weird bugs or unexpected behavior, and if you do find it, it's probably because you caused it, like the whole strings and char lists difference, like I did.

So this is great. There's also a lot to learn, obviously. And another thing that makes me excited is that there's a lot to give: all the libraries in Elixir right now are still looking for help in terms of code, and I like to do that. I like to find gems or hex packages that can be improved or can be better, and just code something and make a pull request. So for me the current state of Elixir is great, and to me there are a lot of opportunities to give back to the community by patching stuff.

The other thing is that the syntax is very, very beautiful. Like I said, I'm a syntax guy, and the whole pipeline thing and everything makes it look very elegant, and I really like that. And I guess the other part, which
I haven't yet explored but I will at some point, is that whole ErlPort thing where you can open processes in other languages. I have it on my bucket list: what if I could create a web server that handles all the connections via Erlang, but then opens a Ruby process and just sends a request to Rack? The web server would be in Erlang but handling Rails applications. That sounds like something that I'll definitely want to explore, even if just as a hobby.

And that's it. I hope you guys are really enjoying coding with Elixir like I am. Like I said at the beginning, there were probably ten different ways of doing what I did. It would probably have been faster if I had just refactored the code to use JRuby or Rubinius, but that was not the point. Well, maybe for my boss it was the point, but not for me. I decided to do it with Elixir because I wanted to try it out in a real-case scenario and figure things out, and I did. So that's how I ended up handling all those images and learning a lot about Ecto, about threads, concurrency, and all that. So I guess my last piece of advice is: if you have a project that you think you can do with Elixir, just go ahead and do it and learn you some stuff. And that's it, thanks.
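
Editor's sketch: the slides are not part of these captions, so the code shown during the talk is not available here. As a rough illustration of the Ruby approach the speaker describes (fill a Queue with work, then drain it with a capped number of worker threads), a minimal version might look like the following. The names `process_image`, `run_pool`, and `WORKER_COUNT` are hypothetical, and the download, resize, and upload steps are replaced by a stub, so this shows only the queue-and-workers shape, not the speaker's actual rake task.

```ruby
# Hypothetical stand-in for "download from S3, resize, re-upload" of one image.
def process_image(filename)
  "processed-#{filename}"
end

# Drain a list of jobs with a fixed number of worker threads.
def run_pool(filenames, worker_count)
  queue = Queue.new
  filenames.each { |name| queue << name }

  # Queue is thread-safe, so results are collected in one as well
  # instead of relying on the interpreter lock to protect a shared Array.
  results = Queue.new

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        name = begin
          queue.pop(true) # non-blocking pop; raises ThreadError once empty
        rescue ThreadError
          break           # no work left, let this worker finish
        end
        results << process_image(name)
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

WORKER_COUNT = 20 # the talk capped the workers at 20
processed = run_pool((1..100).map { |i| "image-#{i}.jpg" }, WORKER_COUNT)
puts processed.size # prints 100
```

On MRI the threads in a pool like this take turns running Ruby code because of the global interpreter lock, which fits the modest speedup reported in the talk (from about 1 second to about 0.6 seconds per image).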
Info
Channel: Confreaks
Views: 32,694
Rating: 4.9136691 out of 5
Id: xoNRtWl4fZU
Length: 32min 39sec (1959 seconds)
Published: Wed Mar 16 2016