Importing CSV files from S3 into Redshift with AWS Glue

Captions
Hi, and welcome to this Majestic.cloud video. My name is Laszlo, and today I will show you how to import a CSV file from Amazon S3 into Amazon Redshift using a service called AWS Glue, also from Amazon.

We have a CSV file in S3 called taxable.csv. It's a simple CSV file, and I kept it short so we can work with it easily; it only has a few lines. The file is already uploaded to S3, as you can see here, and the goal is to import it into Redshift with AWS Glue. I have already launched a Redshift cluster; I won't show that part because it's basic stuff and you can probably do it on your own. So let's head into Glue and start the process.

I'm in the AWS Glue console. The first thing we need to do is create a classifier, so I click on Classifiers, then Add classifier. This will be a CSV classifier: we choose the type CSV, and here we can define options like the delimiter, which in my case is a comma, and the quote symbol, which is a double quote; those are the defaults. I will change the heading option to indicate that the CSV file has headings, and then we just click Create.

The next thing we need to do is create a crawler for the CSV file that's in S3, so click Add crawler. This will be the "CSV crawler from S3". We go with the defaults, but we need to add the classifier we created in the previous step, so it's added here; then click Next. Now we need to choose what the data store will be: we select S3, browse to my S3 data bucket, and select taxable.csv, so that's the file. We don't put anything in the exclude patterns for now, click Next, and when asked to add another data store we select No. Then we need to choose an IAM role or create a new one; here we create a new role, which will be created automatically for us, and click Next. The frequency of the crawler is how often it should run: you can put it on a schedule or run it on demand. I will choose Run on demand, and click Next. Now we need to define where the crawler will put the output of the process, so we add a database that will be called "CSV data from S3"; we don't enter a location, so it will be determined automatically. Click Create, leave all the default options, click Next, review the confirmation screen, and click Finish.

So this is the crawler for the CSV data. It will basically go to S3, identify the structure of our CSV file, and create the database and the table with that metadata. But in order to do this we need to run the crawler, so I'm running it now.

While it runs, I will go to Connections, because we need to set up the Redshift connection where we will put the data. Click Add connection: the connection name will be "redshift cluster" and the database engine will be Amazon Redshift. Click Next and choose the cluster; this is the cluster I have defined. It automatically selects the default database and the user, so I only need to add my password here, then click Next and Finish. I will click Test connection, choose the role that I created, and it will test the connection. From my experience this can take a bit of time, so let's not wait for it; it will notify us when it's finished.

Meanwhile, let's look at our crawler: it's running. While it works, let's go into SQL Workbench, which I use to connect to the Redshift cluster, and look at our table.
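Everything in the video is done through the console, but for reference, a rough boto3 equivalent of the classifier and S3 crawler steps above might look like the sketch below. The classifier, crawler, database, and bucket names and the IAM role ARN are placeholders I have assumed, not exact values from the video.

```python
# Sketch: create the CSV classifier and the S3 crawler with boto3.
# All names, paths, and the IAM role below are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# CSV classifier: comma delimiter, double-quote quote symbol, header row present
glue.create_classifier(
    CsvClassifier={
        "Name": "csv-classifier",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
    }
)

# Crawler that scans the CSV file in S3 and writes its schema
# into a new Glue Data Catalog database
glue.create_crawler(
    Name="csv-crawler-from-s3",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole-demo",  # placeholder
    DatabaseName="csv_data_from_s3",
    Classifiers=["csv-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my-s3-data/taxable.csv"}]},
)

glue.start_crawler(Name="csv-crawler-from-s3")
```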
Here is the table in Redshift. I created it beforehand, and you can see it has only the four columns that I need. If we look inside, it has no data whatsoever; we received an error just now, not sure why, probably a temporary blip, but the point is there are no rows in this table. This is where the data imported by AWS Glue will end up. Let's go back to Glue.

OK, the connection test is still running. Our crawler has now stopped: the CSV crawler ran and first created the database "CSV data from S3", and it should also have created a table with the metadata. Here it is, taxable_csv: it identified our columns, and you can see item, cost, tax, total, plus all kinds of properties for this table. Let's check our connection again; the test is still in progress, so I will pause the video until it's done. OK, now the connection was tested successfully; it says the Redshift cluster connected successfully.

What we need to do next is define another crawler. This crawler will connect to the Redshift database and identify the schema of our table. So: Add crawler, and this one will be the "redshift table crawler". Click Next, leave everything at the defaults, and at the data store step choose a JDBC type of connection and select the connection we defined in the earlier step, the Redshift cluster. Here we need to put in the include path: basically this is the name of the database, which in my case is products, then the schema, which is public, and then the name of the table, which is taxable_csv. Let's confirm: the database is products, the schema is public, and the table name is taxable_csv, so this path is right. You could substitute the table name with a percent character (%), and it would then detect all tables within that schema, which is useful to know. In the next steps we don't add another data store, we choose the role we created, we set the frequency to on demand again, and the output will go to another database, which will be called "redshift", for example. Next step, and Finish.

So now we have another crawler, the redshift table crawler, and we need to run this one as well. It runs and detects the table configuration, schema, and so on. Our Redshift crawler ran successfully and created a table with the information about the Redshift schema, so now we have the two crawlers: the redshift crawler and the CSV crawler.

Now we need to go to Jobs and create a new job. This will be "transfer from S3". We choose the IAM role we created and leave all the default options. What's useful to know here, under the advanced properties, is the job bookmark: if you enable it and run the job multiple times, it will remember which records have already been transferred, so you don't transfer them twice. Click Next, then choose the data source, which in this case is the S3 table, and the destination, which is the Redshift table. You can see it shows the schemas of the two tables, the source and the target; everything is fine here, but you could change the types if you wanted to, or add columns. We click Save job and edit script, which takes us into the script where you could add advanced things like transformations you want to apply to the data. In this case we don't write anything; we just save the job and close it.
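As with the earlier steps, the Redshift-side crawler and the Glue job could also be scripted with boto3. This is only a sketch under assumed names: the crawler, connection, database, job, and script-location values are placeholders, not the exact ones used in the video.

```python
# Sketch: JDBC crawler for the Redshift table, plus the Glue job with
# job bookmarks enabled. Names and locations are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# The JDBC include path is database/schema/table;
# "products/public/%" would pick up every table in the public schema.
glue.create_crawler(
    Name="redshift-table-crawler",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole-demo",  # placeholder
    DatabaseName="redshift_data",
    Targets={
        "JdbcTargets": [
            {"ConnectionName": "redshift-cluster", "Path": "products/public/taxable_csv"}
        ]
    },
)
glue.start_crawler(Name="redshift-table-crawler")

# Glue job from the S3 table to the Redshift table, with bookmarks
# so reruns don't load the same records twice.
glue.create_job(
    Name="transfer-from-s3",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole-demo",  # placeholder
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/transfer-from-s3.py",  # placeholder
    },
    Connections={"Connections": ["redshift-cluster"]},
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)
glue.start_job_run(JobName="transfer-from-s3")
```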
Now we will run it: select the job, click Run job, leave the parameters at their defaults, and run. Our job is now preparing to run. From my experience, with a cold start like this one, where everything is new and the job has never run before, it takes about 10 minutes until AWS spins up the necessary infrastructure to load our data into Redshift. Even though our table is pretty small and the data is small, it still takes about 10 minutes to spin up the cluster that runs this job, so I will pause the video and resume when everything is ready and our job starts to run.

OK, we're back. It took about ten minutes for the import cluster to spin up, but the execution time itself was just one minute, because we have a very small table. If you had a very large CSV file, the run itself would take a lot longer than one minute, and then the ten minutes of spin-up time would be easier to justify; for such a small table it really is a long time. But let's see if our data is really here. We're in the SQL Workbench tool, and I will do a refresh on the table: you can see that the data we had in the CSV file is all here, imported by AWS Glue. And here is the data in the CSV itself, which I have opened in spreadsheet software, so yes, it was imported and everything is fine.

So this is how you import a CSV file with AWS Glue. As I said earlier, for very small CSV files it's overkill to use AWS Glue; you could just use the COPY command in Redshift, and I will do a video on that sometime in the next couple of weeks. Glue really pays off for very large CSV files, and also in cases where you need to do some ETL as well, for example transforming the data: you have the generated script, which you could actually edit to do transformations on the data or add other things to it. So there are lots of things you could do here with the data, not just a simple import, but I just wanted to show you the basics of how to import a CSV, and maybe in a future video we will dive deeper into how you can do transformations in that script.

That's it for the moment. I hope you liked this video; if you did, please subscribe to the channel, and I will see you soon in another video about AWS. Thank you again and have a great day.
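The script that "Save job and edit script" opens is a generated PySpark script; a simplified sketch of roughly what such a script contains is shown below. The Data Catalog database and table names and the column types are assumptions based on what appears in the video, and the ApplyMapping step is where custom transformations would be hooked in.

```python
# Sketch of a Glue ETL script for this kind of S3-to-Redshift job.
# Catalog database/table names and column types are assumptions.
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the table the S3 crawler created in the Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="csv_data_from_s3", table_name="taxable_csv"  # assumed names
)

# This is where custom transformations would go; here the CSV columns
# are only mapped onto the Redshift columns one to one.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("item", "string", "item", "string"),
        ("cost", "double", "cost", "double"),
        ("tax", "double", "tax", "double"),
        ("total", "double", "total", "double"),
    ],
)

# Target: the table the Redshift crawler created in the Data Catalog
glue_context.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="redshift_data",               # assumed name
    table_name="products_public_taxable_csv",  # assumed name
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```

For the small-file case mentioned at the end, a single Redshift COPY statement pointing at the S3 object (the topic of the promised follow-up video) would achieve the same load without any of the Glue setup.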
Info
Channel: Majestic.cloud
Views: 53,627
Rating: 4.8957653 out of 5
Keywords: AWS, Redshift, Amazon, S3, AWS Glue, CSV, import
Id: c5hWNEYp4g0
Length: 17min 3sec (1023 seconds)
Published: Wed Jul 03 2019