Amazon Sagemaker Tutorial | AWS SageMaker Tutorial | How to Use Amazon SageMaker | Machine Learning

Captions
Machine learning and AI have changed the face of the industry in recent years. Most business and IT researchers employ multiple machine learning techniques to analyse the data they gather, and ML has become such an essential part of the industry that many cloud providers now offer ML capabilities as part of their managed services. Amazon SageMaker is one such managed service from AWS: it allows developers to build, train and deploy machine learning models in a production-ready environment. So if you feel the need to dig deeper and know more about it, stay right where you are, because we are going to bring you all the information you need.

If you haven't subscribed to our channel yet, hit the subscribe button and turn on the notification bell so that you don't miss any new videos from Great Learning. If you enjoyed this video, like it, share it with your friends and colleagues, and leave any queries or suggestions in the comments; I will respond to them.

Now, very quickly, the agenda for today's course: an introduction to Amazon SageMaker, a review of SageMaker's architecture, and then a hands-on session on Amazon SageMaker.

We start with what exactly SageMaker is. Visualize an environment where you, as a data scientist, have been writing your code in Python, using existing libraries such as TensorFlow, and doing your training, testing and validation on your own infrastructure: your laptop or your on-premises environment. Now you have terabytes, or even petabytes, of data that need to be churned. How are you going to do that in your existing environment? It might take you weeks before your data is fully trained and you get your results back. How do you sort out that issue? That is where Amazon SageMaker comes in.

So what exactly does this product give you? The moment you launch SageMaker it provides you with a Jupyter-based notebook and says, start writing your code here. Most of the popular frameworks in the market, close to 15 or 20 of them, are supported, and there are existing sample notebooks among the Amazon examples that you can pick up, tailor and use to create your own training models. Pricing is per second: the moment you invoke SageMaker you are charged by the second, and the moment you stop using it, the billing stops. It supports MXNet, Gluon, TensorFlow and all the other major frameworks.

Moving on: this is what we just discussed, the typical machine learning process in your current on-premises environment.
First comes data collection: you collect your data from some source. Then you do data integration and data preparation. Once the preparation is done, you look at inferential statistics: you want to see what the data looks like, whether it is the right data for the algorithm you have in mind, and what sort of algorithm you want to run, so first you visualize the data. After that come feature engineering, model training and so on. If you are not using a platform as a service, you will be relying on any number of AWS services to carry out these tasks. Instead, you can rely on a single platform as a service, and that platform is Amazon SageMaker.

The moment you launch SageMaker, the first thing you get is a Jupyter notebook: a reliable, optionally GPU-powered, production-ready workspace for a data scientist or developer. Next are the SageMaker algorithms; as I said, a number of built-in algorithms are available for you. Then you get to the training service, and what you need to understand here is that the machine you launched for the notebook is not the machine that does the training. I want to repeat that point, because it matters. Component number one is one small instance, maybe an m4.xlarge, a CPU-based machine that comes up for you, where you write your Python code and do your data engineering. Then you select your algorithm, and when you run the training job, a bigger, GPU-based machine comes up just for that training process. That massive machine picks up your data from somewhere, and that somewhere is an S3 bucket; it runs the training, and the moment training completes it terminates by itself, and your endpoint gets deployed inside SageMaker as well. So the notebook machine and the training machine are two different machines. Any questions here before we deep-dive into these components? No worries if everything has not clicked yet: the remaining slides go through it slowly, component by component, and there is a hands-on demo as well, so I expect a lot more interaction during that part. So far this has been the hawk's-eye view of what SageMaker gives you as a product.
Now comes the complete architectural overview, and this is where I want your complete attention; this is where we actually start the session. Everything so far was the story, but this is the meat of the program. You can see multiple steps in the diagram: start the training job, the training algorithms, and so on, and we are going to look at what each of those steps means.

First and foremost, as I said, Amazon SageMaker has built-in algorithms available for you. That means XGBoost, PCA, K-Means and the rest are all available as Docker containers, and those containers reside inside the Elastic Container Registry, ECR. There is an ECR registry where each algorithm is kept inside a Docker image; that is how I want you to visualize it.

Now let's see how the process works. The moment I kick off my SageMaker training process, the first step is to bring in the container. Think of it as the XGBoost algorithm sitting inside a Docker image; I bring that into my SageMaker session. That is what the diagram means by the training algorithm being packaged into a Docker image and published into Amazon ECR: SageMaker pulls the algorithm into the session, and you can visualize the big box in the diagram as that session.

What more does this need? Actual data, which you need to supply, and you push that data into an S3 bucket. The next task is to pull that data from the S3 bucket into the SageMaker session. Are we all in sync till here? Great.

So my container is sitting inside my SageMaker session and my data is also inside the session. How big do you want the machine running this container to be? That is up to you: you can ask for something as big as a p3.8xlarge, or an even bigger machine, and you specify that inside your code; I will show you where. You can also bring in parallelism: instead of one Docker container you can ask for, say, five containers coming up in parallel, and all of that is supplied as a parameter; the moment you do, those containers come up in parallel for you. There are also multiple ways of bringing in the data, which we will see: you can pull the entire dataset in one shot, or use a different format altogether where the data is streamed, with smaller chunks pushed into each container.
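As a concrete illustration of the algorithm living as a Docker image in ECR: with the SageMaker Python SDK you never type the registry path by hand, you resolve it. A minimal, hedged sketch using the v2-style `image_uris` module (the region and version are my assumptions; older notebooks used `get_image_uri` instead):

```python
from sagemaker import image_uris

# Built-in algorithms such as XGBoost, PCA and K-Means are published as Docker
# images in region-specific ECR repositories; the SDK resolves the URI for you.
region = "us-east-1"  # assumption: use the region your notebook runs in
xgboost_image = image_uris.retrieve(framework="xgboost", region=region, version="1.5-1")

print(xgboost_image)
# Prints something like <account>.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1
```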
Now the training process starts, because the Docker container has come up inside the session and begun training. Once the training process completes, it produces something known as a model artifact, which is stored in an S3 bucket, and an inference code image is also available inside a registry, the Elastic Container Registry. So you end up with a model artifact as well as an inference image.

Now that you have created a model, you need to deploy it somewhere. When you deploy it, that is the next step: the inference code image and the model artifact from the S3 bucket are pulled together, packaged, and used to create an endpoint. That is exactly what you see in the diagram: I create an endpoint. When you create this endpoint you may again have bigger machines running behind it, and you can specify how you want the endpoint served, whether with one single smaller machine or with more. And understand, folks, this is the endpoint you will be calling from an external application: you call it and use the result in your own program. That is what the endpoint is.

The final box is Ground Truth. Ground Truth is an additional service available in SageMaker that gives you data-labelling functionality; we will talk about it slightly later if time permits. So that is the end-to-end flow of building a model with SageMaker in our environment. Any questions before we go through the complete hands-on? All of this is taken from the Amazon documentation, where every single step is described.

To recap the process: pick up a Docker container and bring it into a session; take your data and bring it into the SageMaker session; when you run your training, the training artifact gets stored in an S3 bucket and an inference code image is also created; when you run inference, both are combined to create an endpoint; and from an external application you call that endpoint to get your predictions back. I call the endpoint and it gives my predictions back. That is how the entire life cycle runs.
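To tie the boxes of that diagram to concrete API calls, here is a hedged boto3 sketch of how a model artifact plus an inference image become an endpoint. The SageMaker Python SDK's `deploy()` call, which the hands-on uses later, issues these three calls for you; every name, image URI and S3 path below is a placeholder, not a value from this session.

```python
import boto3

sm = boto3.client("sagemaker")

# Assumptions: training already produced model.tar.gz in S3, and we know the
# inference image URI in ECR and a SageMaker execution role ARN.
role_arn = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder
image_uri = "<inference-image-uri-from-ecr>"                         # placeholder
model_data = "s3://my-bucket/bike_train/output/model.tar.gz"         # placeholder

# 1. Package the inference image + model artifact as a SageMaker "model".
sm.create_model(
    ModelName="biketrain-xgboost",
    PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data},
    ExecutionRoleArn=role_arn,
)

# 2. Describe what hardware should serve it.
sm.create_endpoint_config(
    EndpointConfigName="biketrain-xgboost-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "biketrain-xgboost",
        "InstanceType": "ml.m4.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Create the HTTPS endpoint that external applications will call.
sm.create_endpoint(
    EndpointName="xgboost-biketrain-v1",
    EndpointConfigName="biketrain-xgboost-config",
)
```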
We may not have enough time to cover Ground Truth, so that being said, we are going to jump straight into our SageMaker session. I have also created a backup environment, because you know how demonstrations go: things may not work exactly when you need them to, so I have kept some backups ready.

First and foremost, let me find a region where I can start everything together. I am assuming everyone knows how to get into the AWS console: provide your user ID and password and sign in; you are experts at that. I have selected the North California region; nothing specific about it, I just want to start the demo here, and I will switch back to an environment I have already created and prepared so that we don't waste time. So I am in North California, and I type "SageMaker" into the services search and click on Amazon SageMaker on the left-hand side of the screen. It says: Amazon SageMaker, build, train and deploy machine learning models at scale. You can see there is a build step, a train step, a tuning step (hyperparameter tuning, which we will also discuss) and then model deployment. Keep visualizing the image we spoke about a little while back.

So where do I start my journey? Click on Notebook instances; this is the place where you create your Jupyter notebook. Click Create notebook instance, and I will name mine great-learning-notebook-nc, since it is in North California. The next thing the form asks is how big a machine you need. Is your data crunching heavy? Do you have a humongous amount of data, so that your data engineering process needs a massive machine, or are you fine with a smaller one? That is the question it is asking. And understand one thing, folks: this is not the machine responsible for training. Let me go back to that slide once again: this part, not that part; a different machine altogether comes up for the training. So the question here is only which machine you need to write your code on. You might ask why you would need a big machine just for writing code, but your data engineering, feature engineering, data massaging and data crunching all happen on this machine, so you have to choose one that can actually do that work.

I am going to pick ml.t2.medium. I would request you to look at the AWS free tier at aws.amazon.com/free, which for the first two months gives you free hours on one notebook instance type; I am assuming that is t2.medium (there is no t2.micro option here), so you can do this exact exercise yourself free of cost.

Now the permissions part. You may have to create an Amazon SageMaker role in case you don't have one: choose the "create a new role" option and select which S3 bucket it should have access to, either the specific bucket where your data resides or any S3 bucket, which means access to every bucket. I already have a role in place, so I don't need to worry about it, and my assumption is you are already familiar with what roles are. I am going to stick with the defaults for everything else.
Then I click Create notebook instance. That means an ml.t2.medium machine is going to come up for me; the process takes a little time, and that is our first step under way.

[Participant] We selected t2.medium here. On what basis do I decide whether to select medium, large, extra large and so on?

That visibility you should have yourself. If your data is going to be pretty heavy or huge, you know your data crunching may need a bigger machine. There are medium, xlarge, 4xlarge, 8xlarge and similar sizes, and you need to judge whether the CPU and memory on offer are enough to churn your data. Go and look at the configuration of a t2.medium: two vCPUs and 4 GB of memory. Now imagine you have around 500 GB of data to play with in this environment; that machine is going to tell you it is not happy. That is where you decide you cannot survive on a t2.medium with 500 GB of data and go for a bigger machine, so that your data engineering and feature engineering get processed much faster; otherwise the data crunching could take four or five hours, and if the whole dataset has to sit in memory the job can even fail. That decision you make based on your own cloud expertise.

[Participant] But this is only during development time, right?

Absolutely, yes, this is your development time. Any other questions, or can we move forward?

[Participant] I have one question. This instance says ml.t2; normally we don't say "ml". Is this specific to machine learning?

That's correct. You will only be allowed to pick these machine-learning instance types here; they are customized for machine learning, so the architecture behind them is built for this kind of engineering workload. We would not recommend picking a plain t2.medium or a c4/c5 machine to run this; pick only the machines that start with "ml.".

Let me also quickly show you the Git option. You can bring your code in from a Git repository as well: if you have a Git repository, you mention it here, supply your credentials, and your code can be pulled from and pushed to that repo, which lets you establish a CI/CD-style pipeline around the notebook. I have not done much with it beyond a proof of concept, but that is what the Git integration is for.
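For reference, everything just clicked through in the console (name, instance type, role, optional Git repository) maps to a single API call. A hedged boto3 sketch, with the role ARN and repository URL as placeholders:

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-west-1")  # N. California, as in the demo

# Assumption: the execution role already exists (the "create a new role" step in the console).
sm.create_notebook_instance(
    NotebookInstanceName="great-learning-notebook-nc",
    InstanceType="ml.t2.medium",   # the authoring box only, not the training machine
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    VolumeSizeInGB=5,
    # DefaultCodeRepository="https://github.com/<your-org>/<your-repo>.git",  # optional Git integration
)

# Poll until the status is "InService", then open Jupyter from the console.
status = sm.describe_notebook_instance(
    NotebookInstanceName="great-learning-notebook-nc")["NotebookInstanceStatus"]
print(status)
```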
[Participant] One thing you mentioned earlier, elastic inference, I did not see it here. Is it an additional configuration?

No, it is not an additional configuration; Elastic Inference is simply not available in North California right now, and that is the reason you don't see it. Let me show you: I will cancel out of this form. Our Jupyter notebook is getting ready anyway, but we are going to run the actual process in another region, so let me quickly show you Elastic Inference there. If I go to create a notebook in that region, you can see the Elastic Inference option.

I am assuming you know that when AWS deploys a new feature, they do not make it generally available everywhere at once. AWS now has around 21 or 22 regions, and a new feature is not rolled out to all of them immediately; they first run a pilot or beta in North Virginia and maybe a couple of other regions. Once they are completely confident about the service, the customer traction, the bug fixing, and whatever additional features customers ask for, they push it to the remaining regions. That is the approach AWS takes, which is why you would not have seen this feature in North California but can see it here.

So what is Elastic Inference? Let me go back to the slide. Remember the endpoint, the last thing we spoke about? When I deploy this endpoint, think about where you call it from. Maybe you have the greatlearning.com web page, and when a user clicks a button, that API gets called, and along with the click you have already supplied the input data (your actual data, not the training data). The call hits this endpoint, the model runs on top of that data, and it immediately gives you back the result, which is displayed on the next line of the web page. Now think of supplying a massive CSV file with around a million records: the moment you hit the endpoint with that, it is going to take a huge amount of time before it can process everything and send the result back. That is where Elastic Inference comes in handy: if your data is that big, multiple similar instances keep coming up to process it for you and send the results back, and the moment the work is done they die down by themselves. That is the elasticity we are talking about here.
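Since the question of how greatlearning.com would actually call that endpoint comes up again later, here is a hedged sketch of the server-side call using the SageMaker runtime API. The endpoint name matches the one created later in the hands-on; the CSV payload is purely illustrative and assumes the feature order used at training time.

```python
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Assumption: a deployed endpoint named "xgboost-biketrain-v1" that accepts one
# CSV row of features in the same order as the training data.
payload = "3,0,1,1,14.76,31.06,80,19.0012,2012,7,15,0,15"  # illustrative feature row

response = runtime.invoke_endpoint(
    EndpointName="xgboost-biketrain-v1",
    ContentType="text/csv",
    Body=payload,
)
prediction = response["Body"].read().decode("utf-8")
print(prediction)  # the predicted rental count for that hour
```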
So I will cancel out of this, and our notebook is available now; you can see the Open Jupyter link against it in the SageMaker console.

All right, so what you have done is create a Jupyter notebook. Now suppose you have some machine-learning data to crunch and the work takes you almost five hours: you don't need to keep this t2.medium machine running throughout, because it will keep incurring cost. Instead, select the instance and stop it, and no compute charge is incurred while it is stopped. That is how you should operate: if you don't want to terminate it, because you want to experiment again tomorrow, just stop the machine and no compute charge accrues; otherwise your clock is ticking.
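That stop-when-idle habit can also be scripted, which helps if you tend to forget; a hedged boto3 sketch using the same placeholder notebook name as above:

```python
import boto3

sm = boto3.client("sagemaker")

# Stop (not delete) the notebook instance: the volume and your notebooks are kept,
# but compute billing stops until you start it again.
sm.stop_notebook_instance(NotebookInstanceName="great-learning-notebook-nc")

# Later, to resume work:
# sm.start_notebook_instance(NotebookInstanceName="great-learning-notebook-nc")
```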
Now I am going to open the Jupyter notebook and work with the bike sharing demand data; I will send this repository to you folks. This is exactly the data we are going to play around with, from Kaggle. It is rental-bike data for 2011 and 2012: how many bikes were taken out, at what hour, from which locations. Think of a real-time scenario: if you are the CEO of a bike-rental company operating in Bangalore, you could use this kind of data to figure out how many bikes you need in December 2019 or 2020 to meet customer demand and run a more profitable business, and which locations you should be placing your bikes in.

What Kaggle has given you is a live dataset from one such company. The training data covers the first 19 days of each month: January 1st to the 19th, February 1st to the 19th, and so on; the days from the 20th to the end of each month are held out as the test data. What you have to do is use your model to predict how many bikes will go out for rental from the 20th to the end of each month. If you submit that to Kaggle you can see your prediction score, and there are a lot more details available on the competition page if you want to try it on your own.

This is how the data looks: each record has the datetime, season, holiday, working day, weather and so on, a mix of categorical and numerical fields. I have downloaded all this data already, taken it and done some customization here and there; it is not working perfectly for me at the moment, but I will share the IPython notebook with you once I get my entire setup sorted out, because something got goofed up in this environment. So let me open a fresh Jupyter notebook, where I should have nothing available right now, and show you how you get your code and data brought in.

First and foremost, I have three sets of data that I want to bring into my environment: the test and train files you would have already seen in the Kaggle dataset. Let me upload them: pick the remaining files as well, Open and Upload, and all three files go in. Once I have sorted out the issue that is giving me haywire results, I will share this Jupyter notebook with you so you can play around with it from your end.

The first thing I want to do is data preparation; that is my SageMaker data-preparation notebook. One other thing I want to tell you: in case you are pretty new to this SageMaker environment and want to kick-start without being familiar with writing models and creating all those things, Amazon itself has provided a SageMaker Examples tab. You can see deep learning and Amazon algorithm examples there: LDA, k-nearest neighbours and many other ready-made notebooks. Click on one, put in your data and start your innovation from there; many of those notebooks even have the data built in, retrieving it from an S3 bucket and doing the processing for you, so it is a much easier environment to start your journey in. What we are trying to do here, though, is play around with a built-in algorithm using our own data.

Under Files I now have everything uploaded inside my SageMaker notebook. First and foremost comes data preparation: the data re-engineering, feature engineering, data massaging part. I have already run this once, and I am assuming you are familiar with Jupyter notebooks: you work cell by cell, you can write comments, mark cells as Markdown and so on; that is slightly beyond the scope here. Let me do a Restart and Clear Output so we can run it all from the beginning.

In SageMaker, pandas, NumPy, matplotlib and the rest come pre-installed, so you don't need to do a separate installation of any of these libraries. So these are my first entries: I import matplotlib so that I can plot and see what the visualization looks like, and NumPy and pandas for the data processing. I run the cell (Ctrl+Enter on both Windows and Mac), the imports complete, and the next cell just sets up the column names so they match the Kaggle dataset.
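A minimal version of those first cells might look like the following. The column list is the one from the Kaggle bike-sharing files; everything else is a hedged reconstruction rather than the instructor's exact notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Column layout of the Kaggle "Bike Sharing Demand" train.csv; test.csv stops at
# "windspeed" (no casual/registered/count columns).
columns = ["datetime", "season", "holiday", "workingday", "weather",
           "temp", "atemp", "humidity", "windspeed",
           "casual", "registered", "count"]

# Parse the datetime column up front so the hour/day features can be derived later.
df = pd.read_csv("train.csv", parse_dates=["datetime"])
df_test = pd.read_csv("test.csv", parse_dates=["datetime"])

assert list(df.columns) == columns  # sanity check against the Kaggle layout
print(df.shape, df_test.shape)
```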
That column setup is not really central here. Now what I am doing is reading train.csv and test.csv, the files from the Kaggle dataset, and loading them into the notebook. It did not like something in this fresh environment; I had another helper file where I had done the datetime parsing, and since I don't have time to repair it now, I will fall back to the version of this notebook I have already run and correct this module before I send it across to you.

So let me show you what the already-executed run looks like. Here the train and test data have been loaded: train.csv and test.csv are read into data frames named df and df_test. Remember what I said about the split; let's check whether it is the 19th or the 20th. It is actually the 19th: January 1st to January 19th, February 1st to February 19th, and so on. You can see it in the listing: the records run up to the 19th, and the next record belongs to February 1st; that is the training data provided in Kaggle's train.csv. Then I display the head of the data frame so you can see exactly how the data looks.

Now what I want to do is split the datetime column into year, month, day, day of week and hour. That is the "add features" step: I extract those parts from the date and append them to the dataset so that I have hourly-level data available, and that is the whole reason behind it. If you check the data frame's types you can see that year, month, day, day of week and hour have been added at the end of the dataset. So I am cleansing the data and doing the feature-engineering part.

I am not sure I executed the next part in this copy, so I may have trouble showing it live, but what it does is compute the correlation, and what I see is that the count has its maximum correlation with the hour. I goofed that cell up just now, but you can see it earlier in this recording, and when I send this file across you can try it out yourself.
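The add-features step can be sketched like this; the helper name add_datetime_features is mine, not necessarily the notebook's, and the reload of the CSVs is just to keep the snippet self-contained.

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append year/month/day/dayofweek/hour columns derived from the datetime column."""
    df = df.copy()
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["dayofweek"] = df["datetime"].dt.dayofweek
    df["hour"] = df["datetime"].dt.hour
    return df

df = add_datetime_features(pd.read_csv("train.csv", parse_dates=["datetime"]))
df_test = add_datetime_features(pd.read_csv("test.csv", parse_dates=["datetime"]))

# Hour turns out to be one of the strongest signals for the rental count.
print(df[["hour", "count"]].corr())
```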
Next we take the mean count per hour. The way to visualize this: I have the entire dataset, I have appended hour, day and the rest to the existing data, and I am simply plotting it, as simple as that. What the plot shows is the hourly pattern: at hour zero, twelve midnight, the number of bikes taken from a location is around 60; at one, two, three o'clock it dips slightly, then rises as the day proceeds. At eight in the morning you can see a huge number of people using the rental bikes, around 350; at nine it starts dipping again, dips further around ten, and in the evening you see the maximum usage, 450-plus, at around five o'clock. That is exactly what this plot is showing, and it is quite obvious: this is US-based data, and on a normal day many people ride a rental bike to the office in the morning and back home in the evening, hence the two peaks. So we group by hour and take the mean of the count; that is what we did.

Now we want to look at the same data by year, since it covers both 2011 and 2012. Using the same mean-per-hour code split by year, I can see 2011 as one line and 2012 following exactly the same trend, only slightly higher in number. So if you are that rental company and want to make a prediction for 2013, it becomes much easier to figure out how many bikes you should keep at a location; and if you had a separate dataset with locations, you could say that in 2013 you should have somewhere close to 600-plus bikes in a particular city. That is the kind of insight you can derive from this plot.

Then we split the data into working and non-working days. In the Kaggle dataset, workingday is 1 if it is a working day and 0 if it is a weekend or holiday, and that is exactly the field we use here. I do another plot: when workingday is 1 the chart looks exactly the same as before, with the peak at eight in the morning and again at five in the evening; whereas on a non-working day or holiday the peak comes later, close to twelve o'clock. People might be lazing around, sleeping until ten in the morning, and getting ready to go out around noon, so on a non-working day that is when you see the maximum number of riders. That is the kind of inference we have drawn here.
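Those two charts (hourly demand by year, and working versus non-working days) can be reproduced with a couple of groupby calls; a hedged sketch that recomputes the needed columns so it stands on its own:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv", parse_dates=["datetime"])
df["year"] = df["datetime"].dt.year
df["hour"] = df["datetime"].dt.hour

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Mean rentals per hour, one line per year: 2012 follows the same shape as 2011, just higher.
df.groupby(["hour", "year"])["count"].mean().unstack().plot(ax=ax1, title="Mean rentals by hour and year")

# Working days (1) peak around 08:00 and 17:00; non-working days (0) peak around midday.
df.groupby(["hour", "workingday"])["count"].mean().unstack().plot(ax=ax2, title="Working vs non-working days")

plt.tight_layout()
plt.show()
```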
Next we save all the prepared data into bike_all.csv; that is what the df.to_csv call is doing. Then we split the existing data in a 70/30 ratio; I am assuming everyone is aware of this concept: seventy percent of the data becomes my training data and the remaining thirty percent I keep for validation. That is the split the remaining code does. When I run it you can see the counts: altogether there were 10,886 records, the training set comes to around 7,000-plus and the validation set to around 3,000-plus, and I write them out as bike_train and bike_validation. And then there is the data from the 20th to the end of every month, January 20th to the 31st, February 20th to the 28th or 29th, and so on; all of that combined is available for your testing purpose, and I write that out in the same format.
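Here is a hedged sketch of that 70/30 splitting cell. The feature list and the random seed are illustrative, but one detail is worth calling out because it is easy to trip over: SageMaker's built-in XGBoost expects CSV input with the target in the first column and no header row.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("bike_all.csv", parse_dates=["datetime"])  # the combined, feature-engineered file

feature_cols = ["season", "holiday", "workingday", "weather", "temp", "atemp",
                "humidity", "windspeed", "year", "month", "day", "dayofweek", "hour"]

# Shuffle, then take roughly 70% for training and 30% for validation.
np.random.seed(5)                          # illustrative seed, not the notebook's
mask = np.random.rand(len(df)) < 0.7
train, validation = df[mask], df[~mask]
print(len(df), len(train), len(validation))   # ~10886 total, ~7600 train, ~3200 validation

# Built-in XGBoost CSV format: target first, then features, no header, no index.
train[["count"] + feature_cols].to_csv("bike_train.csv", index=False, header=False)
validation[["count"] + feature_cols].to_csv("bike_validation.csv", index=False, header=False)
```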
Moving on: I am assuming there are no questions here, because you would have done this kind of preparation any number of times. The key part now is the training process; that is the meat of this program. We may not have enough time for the prediction part, so I will be sharing these files with you; there is a small prediction exercise done with the bike_train_prediction notebook, and the task for you is to run predictions through the deployment we create today.

Now, the bike training notebook. I have imported all my modules, and the first thing you want to do is create an S3 bucket. I have created one called spk-sagemaker-edu. Let me show you that part: go to Services, open S3, and there is my bucket, spk-sagemaker. Inside it there is a bike_train folder; I did not actually create that folder by hand, the process here creates it automatically, so the training file lands under bike_train/. You only need to create the bucket itself, spk-sagemaker-edu or whatever S3 bucket name you want, and the moment you execute the cell, the keys get set up underneath it: bike_train/bike_train, bike_train/bike_validation and so on; all the data goes and resides there once you do the execution. This is something you should already be familiar with, so I am not going to go deep into it.

The next part is where you actually upload your data. You can see a write_to_s3 function, taking a file name, a bucket and a key, and it is called three times: bike_train.csv, bike_validation.csv and bike_test.csv, so those three datasets get uploaded when you execute the command. And remember we spoke about the model artifacts being stored in S3? That is the s3_model_output_location you see here: once the training processes your data, this is where you want the artifact to be stored, and you can see later where exactly that variable gets used. Again, this is all routine; no special clarification should be needed.
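The write_to_s3 helper is a thin wrapper over boto3; a hedged reconstruction, with the bucket name as a placeholder in the spirit of the demo:

```python
import boto3

bucket = "spk-sagemaker-edu"   # placeholder: use your own bucket name
prefix = "bike_train"

def write_to_s3(filename: str, bucket: str, key: str) -> None:
    """Upload a local file to s3://bucket/key."""
    boto3.Session().resource("s3").Bucket(bucket).Object(key).upload_file(filename)

# The three datasets the training job and the prediction exercise will read from S3.
write_to_s3("bike_train.csv", bucket, f"{prefix}/train/bike_train.csv")
write_to_s3("bike_validation.csv", bucket, f"{prefix}/validation/bike_validation.csv")
write_to_s3("bike_test.csv", bucket, f"{prefix}/test/bike_test.csv")

# Where the training job should drop its model artifact (model.tar.gz).
s3_model_output_location = f"s3://{bucket}/{prefix}/model"
print(s3_model_output_location)
```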
Now, the next cell is the important one. What we are going to do here is run the XGBoost algorithm on top of the bike train dataset. My assumption is that you are pretty familiar with what XGBoost is, so we are not going to get into that topic; in layman's terms, it is multiple trees, multiple models combined together to form a pretty good estimator. That is the high-level view to take away, or just search for XGBoost; there are good explanations out there.

The core thing you need to understand here is: where do I get this XGBoost algorithm from? Going back to the slide, remember there is a container: I need to bring a container with the XGBoost algorithm written inside it into my SageMaker session. So where do I get that container? That is the question to ask, and the answer is in this AWS documentation page, which I will also send across to you. What the documentation says is: for whatever algorithm you are going to use, BlazingText for example, this is the registry path you need, and the :tag part indicates whether you want a previous version or the latest; by default, if you don't specify a tag, it is treated as the latest version. You can also see whether the instance class for the algorithm should be GPU-based or CPU-based.

I am going to search the page for the algorithm we are using, XGBoost. What I need is the ECR path followed by /xgboost, and that is what appears in the cell. Where did I get that path? Scroll down in the documentation: for whichever region you are going to run in, us-west-1, us-west-2, us-east-1 and so on, there is a corresponding registry path listed, and that registry is exactly the ECR, Elastic Container Registry, entry I have used here: the registry, slash xgboost, colon latest, because we want to pick up the latest one. You can run this in us-west-2, us-east-1, us-east-2, eu-west-1 and so on; if you want to go for any other region, just pick up the corresponding registry entry of your choice and fill in that container registry information. Fair enough; so this is where your container gets pulled from.

You also need the corresponding role: remember the SageMaker role we created when we set up the notebook? That is exactly the role used here. And now you can see how I build my model. For training, I say I need an instance that is ml.m4.xlarge; remember, for the notebook we spun up only a t2.medium machine, but for the training purpose I define here that I need an ml.m4.xlarge (it is ml dot, for machine learning, not m1). And I say my instance count needs to be just one. Now, if you have a humongous amount of data and want parallel processing in place, you may want to increase this count. You may ask how you know whether you need a count of one or of four; that is sheer experience. At times we get a better result with one, at times it has turned out that three or four machines gave a better result. It is about how you bring in the parallelism, and you may have to play around with it, because if you just ask for parallelism but your data cannot be sliced or streamed accordingly, the parallelism is not going to do you any good; you need to understand all of those parameters.
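Putting the container image, the role and the training hardware together, the estimator cell looks roughly like this. It is a hedged sketch in current SDK v2 names (older notebooks used train_instance_type/train_instance_count and an explicit registry path); the output path reuses the placeholder bucket from above.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()   # works inside a SageMaker notebook; else pass a role ARN

# Region-appropriate ECR image for the built-in XGBoost algorithm.
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,                # raise only if your data/input mode can actually be sharded
    instance_type="ml.m4.xlarge",    # the training machine, separate from the notebook box
    output_path="s3://spk-sagemaker-edu/bike_train/model",   # placeholder bucket, as above
    sagemaker_session=session,
    base_job_name="xgboost-biketrain",
)
```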
Then I give the job a name, and those are all the parameters you require. Any questions here? Let me take a pause before the next part, because this is where you may have logical questions. Are we all good till here?

[Participant] Sorry, a question: could you execute that df cell?

No, it is not going to work here, because I did some data processing prior to this with a different version of my preparation notebook; even if I execute that df cell it will not work for me, because I did some tweaking with a different IPython module, which I slightly goofed up. If you execute it exactly the same way, though, you will get exactly the same results.

Any other questions? Make sure you are in sync with me: this cell is nothing but the Docker container being brought in from ECR, as in the diagram, and the instance count is exactly the question of whether I bring up one Docker container or four different containers, which we just covered. My assumption is that silence is a little deadly: either you understood it thoroughly or you did not understand it at all; if it were the in-between stage there would be a lot of questions. I will take it as your OK till here.

Now, this is one of the key parts you need to understand: hyperparameters. A beauty of SageMaker is that you can supply hyperparameters to your algorithm. For XGBoost these are my hyperparameters: max_depth, which is the depth of the tree, how many hierarchical levels it can go; the objective function, for which I am using linear regression; and num_round, which is the number of boosting rounds, the number of iterations the training runs through. I have specified 150 here, so the training runs 150 rounds before it finalizes and gives you the model. That is what estimator.set_hyperparameters with these values means.

The key point to understand, though, is that there is also a hyperparameter tuning facility; I will send you the link for it. I can declare num_round as a parameter that can range from 1 to 200, and my eta, which is the learning rate, instead of being fixed at 0.1 as it is here, can be given a range from 0.1 to 0.9. What such a range means is that SageMaker, behind the scenes, plays around with these hyperparameter values, finds out which combination results in the best model, and flashes out those particular hyperparameter values for you, which is a very useful insight.
For example, I can finalize that if I use a max_depth of 5, a linear objective, an eta of 0.1 and a subsample of 0.7, that is going to be my best model. But where did I get that information from? I only specified ranges: alpha should range from 0.01 to 10, lambda from 0.01 to 10, and num_round from 1 to 200, meaning the number of iterations can be anywhere in that range. What SageMaker runs behind the scenes is this: it picks values from those ranges and keeps trying them; every parameter with a range specified gets picked and executed. If you go to the SageMaker console, under Hyperparameter tuning jobs (you will not see any data there right now in this account), you can see the executions, and on the extreme right the value of your objective metric, whichever metric you defined, and which parameter combination produced the minimum. In my case, those were the best parameters I could supply for this execution, so I picked them up from there and started using them; if you want to use this model at a later point in time, you know the values you should be using for each of these parameters to get a much better result.

This particular range specification is not something I used in my run; I picked it up from the sample in the Amazon SageMaker documentation so that you know how to specify a hyperparameter range: you state, for instance, alpha as a ContinuousParameter with the range given, then you create the HyperparameterTuner, tuner_linear, passing in that hyperparameter range, and then you call its fit function. The moment the fit function is called, tuner_linear.fit for the tuning path or estimator.fit with the XGBoost inputs for the plain training path, your training actually happens.
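A hedged sketch of both paths, continuing with the estimator object from the previous snippet: fixed hyperparameters with a plain fit(), and a tuning job over ranges like the ones discussed. validation:rmse is the objective metric the built-in XGBoost emits; the channel S3 paths reuse the placeholder bucket, and the job budget is illustrative.

```python
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

# Fixed hyperparameters for a single training job.
estimator.set_hyperparameters(
    max_depth=5,
    eta=0.1,                       # learning rate
    subsample=0.7,
    objective="reg:squarederror",  # "reg:linear" on older XGBoost versions
    num_round=150,                 # number of boosting rounds
)

train_input = TrainingInput("s3://spk-sagemaker-edu/bike_train/train/", content_type="csv")
validation_input = TrainingInput("s3://spk-sagemaker-edu/bike_train/validation/", content_type="csv")

# Path 1: one training job with the fixed values above.
# estimator.fit({"train": train_input, "validation": validation_input})

# Path 2: let SageMaker search the ranges and report the best combination found.
tuner_linear = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.1, 0.9),
        "alpha": ContinuousParameter(0.01, 10),
        "lambda": ContinuousParameter(0.01, 10),
        "num_round": IntegerParameter(1, 200),
    },
    max_jobs=20,              # illustrative budget, not from the video
    max_parallel_jobs=2,
)
# tuner_linear.fit({"train": train_input, "validation": validation_input})
```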
You can see the training job getting kicked off here. We specified num_round as 150, and you can see the job has run through 150 rounds; to start with, the training RMSE is around 242 and the validation error around 240, and with each iteration you can see that value getting reduced. By the time I reach round 100 it is somewhere around 50, and close to the 150th round I am at my best root mean square error; I am assuming you all know what RMSE is. You can see it ends at roughly 41.45, slightly less than the previous round. So that is where the learning rate and the RMSE come in.

You can also see the billable seconds for this execution: 64 seconds. You will be charged for the ml.m4.xlarge machine you spun up for just 64 seconds; apart from data transfer and related costs, your training instance charge is for 64 seconds only. Compare that with the traditional way of working: you might have taken an ml.p3 machine, a massive GPU-based machine, and used that same machine both for writing your code and for training, so the time you spend as a data scientist preparing data and writing your model would all be burned on a GPU machine. Here, the GPU-capable training machine only comes up during training, which in this case is 64 seconds, and that is all you are billed for it.

Once the execution is done, my model is created. The next piece of code is the deployment: the instance type for inference, again ml.m4.xlarge, and I name my endpoint xgboost-biketrain-v1. Let me show you the one I have already executed: I go to my North Virginia region, where I believe I still have the endpoint deployed, and yes, there it is. So once you are done with the execution, an endpoint is created automatically, and if I click on this endpoint you can see an HTTPS URL; from your external application you make an API call to this endpoint and get your results back.

We have also done a small prediction here: applying one row of data to this predictor gives me a result back, and for that particular input the number of bikes required would be 39, at three in the evening; that is the count it resulted in. So that is how you do your end-to-end deployment and prediction.
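And the deploy-and-predict cells, again as a hedged sketch that continues from the estimator above (the feature row is purely illustrative and must match the column order of the training CSVs):

```python
from sagemaker.serializers import CSVSerializer

# Create the HTTPS endpoint. Hosting is billed for as long as the endpoint exists,
# so delete it when you are done experimenting.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name="xgboost-biketrain-v1",
)
predictor.serializer = CSVSerializer()

# One feature row in the same order as bike_train.csv (minus the target column).
sample = "3,0,1,1,14.76,31.06,80,19.0012,2012,7,15,0,15"
print(predictor.predict(sample))   # returns the predicted rental count for that hour

# Clean-up when finished:
# predictor.delete_endpoint()
```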
Any quick questions? Let me open the floor for questions for the remaining ten minutes. I will make whatever corrections are required here and post the notebook to a repository from where you can pick it up and run it in your own environment.

(The question, as I understand it, is whether a custom model can be brought in from somewhere other than ECR.) That one did not come to my mind either, so let me note it down; frankly and openly I do not have a direct answer, so instead let me answer a closely related question. Can we go with anything other than ECR? I doubt that will be possible, because it would likely require a lot of configuration changes, so my gut feeling is no. But you can bring your own model: create a Docker container out of it, push it into ECR, and make a call to it from there; that is definitely possible (a rough sketch of that path follows below). Anything other than ECR, I doubt it. That is actually a good question, and I should have given you that insight earlier: if you want to bring in your own custom model, you create it, push the container into the ECR registry and call it from there. I will double-check, but with about ninety percent confidence (since we were just talking about confidence intervals) I would say anything else is not possible, because it would require a huge amount of changes.

Any more questions? Also, a quick heads-up: the prediction notebook I have here may need some tweaking. I will send it across; it is a pretty small file and you can execute it on your own with the test data. Remember we created bike_test.csv; the moment you run the notebook against it you will get the prediction results. The reason I say tweaking is required is that I am getting slightly odd results: the predictions are coming out as very large numbers, so I have goofed up somewhere and have not yet found where. You can look into that yourself, since I am going to share this notebook with you and bike_test.csv will be available to you, and once you execute those commands you should be able to link everything back to the concepts.

My assumption, which I did not clarify up front, was that even if you are not a data scientist, this was not a session where we could spend time building models from scratch. The prerequisite was that you have carried out some sort of model-building exercise, you know what training, validation and testing are, and you have a fair idea of how a cloud environment works. With those two things in place we should all be in sync. Any questions, please feel free to shoot; we have enough time. What do you think was the missing link, apart from that one execution part?
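For reference, here is a minimal, hypothetical sketch of that bring-your-own-container path with the SageMaker Python SDK: the ECR image URI, the model artifact location and the endpoint name are placeholders, and the container is assumed to already implement the SageMaker inference contract.

```python
# A minimal sketch of the "bring your own container" path mentioned above:
# package your inference code as a Docker image, push it to ECR, then register
# it as a SageMaker Model. Image URI, artifact path and names are placeholders.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()

byo_model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-custom-model:latest",  # assumed ECR repo
    model_data="s3://my-bucket/custom/model.tar.gz",   # assumed trained artifact
    role=role,
    sagemaker_session=session,
)

# Deploy the custom container behind a real-time endpoint, same as a built-in algorithm
predictor = byo_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name="byo-model-endpoint",   # assumed name
)
```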
On that execution part: I just wanted to have a fresh environment created, and that is the piece that got slightly goofed up; otherwise we could have done the execution live here. If you look at my North Virginia region, my notebook instance is up and my Jupyter notebooks are running, so I have multiple data-processing notebooks available. There were multiple versions of this notebook; I tweaked one, kept it alongside the SageMaker one, and built this on top of it, and somewhere in between I missed a couple of steps, so repairing that now would take a lot of time. It is exactly the execution I did some time back, though, so it should be fine: there is nothing different you would see in the output, other than watching it run step by step.

I would also like your honest feedback: do you think anything could be done to improve this program? It is mutual help; if you feel that doing something in a slightly different manner would have been more beneficial for you as a participant, please share your comments so we can improve the program accordingly. If you think a particular module should have been elaborated a bit more, or that some portion was not very clear, that is exactly the kind of input we would like to have. If you do not have questions, that is fine too; we are about seven minutes ahead of time. So, the moment I give you this file, do you think you will be able to set it up on your own and run the whole process without any hassle? Then you are becoming a SageMaker-enabled data scientist, Julian.

Question: can we integrate any data visualization tools with SageMaker, for example Power BI? For visualization itself you can do any of the plotting inside the notebook (a small example follows below). As for connecting additional services, I may have to check; my gut feeling is that an API call to QuickSight could probably be enabled, but I am not sure, because I have never done it. I have always been happy with matplotlib and the existing Python visualization APIs, so I have not tried connecting any additional services, but I can check and come back to you on that.
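As mentioned above, the usual Python plotting workflow carries over unchanged to a SageMaker notebook instance. A tiny illustrative sketch, where the file name and the column names (count, temp) are assumptions about the bike-sharing data rather than values shown in the video:

```python
# In-notebook visualisation: the same matplotlib/seaborn workflow you would use in a
# local Jupyter notebook works unchanged inside a SageMaker notebook instance.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("bike_train.csv")   # assumed local copy of the training data

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df["count"], ax=axes[0])                       # distribution of the target (bike count)
sns.scatterplot(x="temp", y="count", data=df, ax=axes[1])   # bike count vs. temperature
plt.tight_layout()
plt.show()
```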
Coming back to your current workflow: when you run your Jupyter notebook today, do you rely on some other visualization tool as well? Not really; we have a different team, and what they do is pull the data manually and use it in their own tool. Got it, so separately you want a much more enriched kind of visualization, and you rely on Tableau or another such tool before you actually bring in your data. In that case you might have to do this exact same exercise over here as well, I feel, because this is a container-based setup, so I doubt whether connecting such a tool directly will be possible; to be frank, I do not have a definitive answer. What you can use are the matplotlib and seaborn packages, which are quite good, exactly the way you use them in your Jupyter notebook today. Connecting to an external visualization source is not something a data scientist normally does, although you never know, there can be the odd scenario where you end up needing it, and that is something even I would have to find out about. It is true that this is mostly a business analyst's need: when I am doing my data engineering and running my training, testing and validation, I have never seen a practical scenario where a business analyst also had to be inside the notebook; usually the data is processed and sent across, and then they work on it, which is a different arena altogether and not really part of the model-building workflow. The gentleman who asked this question is absolutely right that wanting a good amount of visualization before getting into this part is a fair ask, but inside a Jupyter notebook I have not seen many people connect external tools for it.

Now, very quickly, let us summarize today's video. In today's video we discussed the functions that Amazon SageMaker helps in performing, elaborated on Amazon SageMaker's architecture, and then demonstrated a practical application using Amazon SageMaker.
Info
Channel: Great Learning
Views: 19,751
Keywords: Machine learning, Great Learning Academy, free courses, Amazon SageMaker, Amazon SageMaker Tutorial, What is Amazon SageMaker, Machine Learning Process Problems, Amazon Sagemaker Architecture, How to Use Amazon SageMaker, Aws sagemaker Sagemaker, sagemaker tutorial, Sagemaker tutorial for beginners, Jupyter notebook, How to Set up amazon sagemaker, AWS, Machine Learning problems, amazon sagemaker example, aws machine learning, amazon sagemaker studio, amazon sagemaker setup
Id: 1eQC259cVcI
Length: 77min 18sec (4638 seconds)
Published: Tue Aug 02 2022