Beginners Guide To AWS Glue ML Transformations

Captions
Hi guys, I'm Johnny Chivers. Welcome back to my YouTube channel. I'm a data engineer with over 10 years' experience, working primarily Monday to Friday in the financial services sector. I'm five times AWS certified, and I like nothing more in my free time than making videos for this very YouTube channel.

So in today's beginner's guide series, we're going to take a look at using Glue for machine learning. We've already covered Glue in other lessons, so I'll put a link at the top now to the 101 series where we go over Glue in depth. Today is purely looking at the machine learning aspect of AWS Glue. We're going to take an example that AWS give us as a tutorial, where we're going to data match a series of records that are publication lists. This is just a big file that AWS give us that contains information about publications, such as title, author and year. We're then going to run an ML algorithm over it to say that this record looks like this record, and we think it's the same book, or the same scientific paper, i.e. we think it's the same publication.

I'm going to take you through all the steps on the console and we'll run through it from start to finish. It takes about 15 minutes of video lesson, but the actual processing time will bring it up to 45 minutes to an hour for yourself. So set aside an hour, and then let's get going. Join me on the console.

Okay guys, that's me logged into the AWS console. There's a link in the description where I've put the commands and a code file of what I'm going to use during this. You can just follow along with me on screen, or you can copy and paste from there, whatever works out easiest for you.

So the first thing to do is go to AWS Glue. I am working in the North Virginia region. If you are following along with me, I am doing everything in North Virginia, and that includes the commands, so if you're working in a different region you'll have to alter them. I'll show you where to alter them, but it is easier just for us all to
be in the same region, so I am in North Virginia.

So the first thing I want to do is add a crawler. If we go down the left-hand side, we go to Crawlers and then Add crawler. Give your crawler a name: I'm going to go over to the text file, and I want to call it this name here, and we hit enter. We're going to use data stores and we're going to crawl all folders. It's an S3 connection, it's in another account, and we want to take this S3 link. This is an AWS tutorial, so all the data is provided for us at that S3 link. I'm in us-east-1, so this remains the same. If you're in a different region, you need to go in and change that us-east-1 to the region you're in. Not all regions have this data set, so be careful; again, it is easier if we're all just in North Virginia. Hit next. Add another data store: hit next.

IAM role: either choose an existing role, which I have plenty of, to give it access, or alternatively you can go and create a new IAM role, in which case you will have to assign the permissions yourself. For me, and if you've done plenty of tutorials with me before you'll know this, I have a Glue full access role, so I'm just going to use that IAM role. If not, go and create your own role, come back and add it in. Run on demand, and choose a database. We're going to add a database, and the database name is going to be the same as the crawler name, just for simplicity. There we go; hit create, then hit next, accept everything, and finish.

The crawler is created. Click on it, then Run crawler. I'm just going to pause the video here. This usually takes about two minutes to run. Just keep clicking the refresh button on the top right-hand side, as always, and once it's completed it will tell you that a table has been added. So I'll pause the video here and we can pick it up once that crawler is finished.

Okay guys, as you can see, that took about three minutes in total. I just kept hitting that refresh button until my one table was added. Nothing really happens for two minutes on the UI sometimes, and then
it just all clicks together at the last minute. So that's our crawler added. If we go into Databases, hit the refresh button again and click on the demo crawler, then go to Tables and demo crawler, our table is there. If you click on the table, you can see that we have id, title, authors, venue, year and source. As explained at the start of the video, if you watched that part, what we're going to do is just match records: the ML algorithm is going to go over title, authors, venue, year and source, and it's going to say, okay, what other rows or records in this data set look similar, and then match them together based on a pre-formed algorithm.

So the next thing we need to do is set up that transformation. Down the left-hand side we have ML Transforms. Let's click on that and go to Add transform. Transform name: again I'm going to use the one that is over here, without that full stop; it's copy and paste. Then the role: as I said, I always have a full access role on hand, so I'm just going to use that one again. You can use the one that you set up for the crawler, or you can configure a new one for the purposes of this ML transformation. I'll leave the IAM roles completely up to you. Click next. Then in terms of the source data, well, I've only got one source in this region, so it's going to be the one we just scanned in. If you have more data than what you have from the crawler, just make sure you pick the right one. Primary key is the id column. Next. Tune: we don't need to do anything here. And then we say finish.

Excellent, it's set up. As you can see, it needs training. So if we just click onto it and say Actions, Teach transformation, you will click I have labels, and you will want to upload labels from S3. Click that, and over here I've put the S3 location of the labels. Again, I am working out of us-east-1, so that's why it says us-east-1 here. AWS provide all these labels for us. Change that little section there to the region you're working out of if you're not in
North Virginia. But again, be warned: not every region has this data set. Then click upload. That's uploading; give it about 10 seconds and we should see that complete. There you go. Next we want to estimate quality, so we say next and we hit Estimate transform quality. This will take a few seconds, so I'll pause the video here and we can pick it up once we're ready to go.

Okay guys, this estimate transform quality has been going for about five minutes now. What I'm actually going to do is click finish and then watch it from its actual tab. So click finish, and then hopefully when I click onto it and go to Estimated quality, it's actually finished. Oh, it finished right now. So there's a little tip for you: just click finish, then click into the Estimated quality tab, and you can see that it's ready to go. And because this is predefined data, you can see everything's 100%. Great.

So the next thing we need to do is actually set up the job itself. Over on the left-hand side we go to Jobs and add a job. Name-wise, I'm just going to call this demo ETL. Role: as I said, I always have a trusty full access role on hand for these demos. Glue version: we need to choose Glue 1.0 with Scala; Glue 2.0 is not supported at the moment. Script file name: we'll just call it the same as the job name. I already have buckets that it's going to write to, and we click next. Data source: well, we only have one, so let's click next. Then we want Find matching records. Leave everything else. Do not click Remove duplicate records, because we're trying to data match the records, so we want to leave them in. Click next. Then for Select a transform, we hit this one and go to the transform name that we set up previously. You can see that the primary key is id, and we click next. Then we just want to create a table in our data target. We'll use S3, and we'll save it in CSV format with no compression. Target path: we can use one of the buckets that you
may have already set up, depending on your account and what you're doing. If not, go find somewhere. I'm just going to use a demo bucket that I have lying around, a Glue demo bucket. Again, you can put it anywhere you want; it's completely up to you. I'm just going to put it in this input bucket, and then Save job and edit script.

Okay, and that's our code. I'm just going to add a little bit of custom code to make this quicker. Where it has findmatches2, what we want to do is transform this into a single partition. Again, you can follow along. I'm just going to put in val single partition equals, then it's findmatches2, which is this, dot repartition (double-check it's spelled correctly: r-e-p-a-r-t-i-t-i-o-n), and inside the brackets we put a 1, so we make this a single partition. Then where we're actually writing out, down here, we write out that single partition: rather than writing out the data frame, we're writing out that single partition. Just make sure we haven't deleted anything else here; we can check that job commit is still there. Once that's done we want to save, and then click Run job. So Run job, Run job, and the job is off and running. We can just watch it down here in the execution logs. Again, the important part was that single partition set to one, and then writing out that partition here. I'll pause the video; this can take about 10 minutes sometimes, and then we can pick it up.

Okay guys, again this has been running for about five minutes, so what I'm going to do is hit this X button and monitor it from the job page, because it's easier. You can see it's currently running. Don't hit that stop button. I'm just going to keep hitting refresh until we're done.

Okay guys, that's it complete. As you can see, it took nine minutes to actually start the ML transformation, but it finished in three, and it has succeeded. So let's take a look at the results. It's over on
S3, and obviously you'll have a different bucket than me, so go to the bucket that makes sense for you, where you put those results. I did it in my demo Glue bucket that I always have to hand. Then go into the part file that you can see here. This is the one that I did, and I'm just going to open that, which will download it. Then I want to show it in Finder. I'm just going to open this with TextEdit and drag it across on the screen.

So as you can see, we have all this data, and what has happened here is that anything that has the same match ID in this last column has been matched as the same record. But as you can see, it's kind of hard to see anything, so I'm going to bring this across and change the file type. If I go to Get Info (that's the same as doing a right-click on your Windows PC, if that's what you have), then Name and Extension, I'm going to change this to .csv and confirm that. Then I have a CSV file and I want to open it with Numbers; you can open it with Excel. I'm going to sort my entire table, the same as doing a sort in Excel, and I want to sort it on match ID, which is here.

So as you can see, anything that has the same match ID has been decided to be the same record. So match ID 1 and 1: it's decided that these two records are the same, and as you can see, that makes sense. You can see here that it's got match ID 2, and nope, it doesn't have any matches. Then match ID 3 and 3, and again you can see that, yeah, it's probably the same record, although the authors fields are very much different. But then Very Large Data Bases and VLDB, it's got those together; same title; the authors, as you can see, are the wrong way round, but it's picked it up; same year. And this goes on for the entire data set if you download it. As you can see, then, it's decided, you know,
on this side here, these two are the same. If you look at it, the names are very messed up here because it's in a foreign language, but it's said these are actually the same people. And then if we look at the book title, it's pretty much the same, and it's decided that that is the same title.

So that's kind of it for today, guys. It's a pretty in-depth one where you just need to keep doing things until you get it right. Just practice and practice and practice setting everything up, and you'll get there eventually. So as usual, I've been Johnny Chivers. I make all this information available for free on my website, and until next time, guys, thanks for watching.
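
The final step in the video, sorting the output on match ID to see which rows were grouped together, can also be done in a few lines of code instead of a spreadsheet. The following is a minimal sketch, not part of the video: it assumes the job's CSV output has a header row and that the last column is named match_id (the transcript only says "match ID", so the exact column name is an assumption), and the sample rows are invented purely for illustration.

```python
import csv
import io
from collections import defaultdict


def group_matches(csv_text, match_column="match_id"):
    """Group rows by match ID and keep only IDs shared by two or more rows.

    match_column is an assumed column name; adjust it to whatever header
    the job's output file actually uses.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[match_column]].append(row)
    # A match ID that appears only once means the row matched nothing else.
    return {mid: rows for mid, rows in groups.items() if len(rows) > 1}


# Invented sample rows for illustration only; not taken from the AWS data set.
SAMPLE = """id,title,authors,venue,year,source,match_id
a1,Example Paper One,"A. Author, B. Writer",VLDB,1999,SRC1,1
b1,Example paper one,"B. Writer, A. Author",Very Large Data Bases,1999,SRC2,1
a2,Another Unrelated Title,C. Person,SIGMOD,2001,SRC1,2
"""

matched = group_matches(SAMPLE)
for match_id, rows in sorted(matched.items()):
    print(match_id, [r["id"] for r in rows])
```

To run this against a real output file, read the downloaded part file into a string and pass it to group_matches in place of SAMPLE.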
Info
Channel: Johnny Chivers
Views: 420
Id: NFaSA0WMknE
Length: 13min 23sec (803 seconds)
Published: Tue May 25 2021