How to Use SQL Joins when Using Python on Snowflake to Get Pandas Dataframes

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
i'm your host john mckenzie thanks for joining me once again on my channel on data engineering in this episode we go back to our python on snowflake playlist and i want to talk about how to do sql joins and get your pandas data frames as a result now this is a very very handy way of approaching getting data from snowflake and most python programmers are tempted to to use merges and things like that when actually you could be doing the joins on the back end and leveraging all that power that's back there to speed things up so without further ado let's get to our sql joins on pandas on snowflake need to hire additional programmers for your project make sure to check out the links in the description okay so i'm using the idle shell here and i'll start a new file and then i've created some variables um when i save it here i've created some variables which i've scrolled off the top of the screen so that you can't see them but uh they're called the snow user and snow pass which we'll use later in our procedure here and what i'll do is i'll import snowflake.connector and i'm going to import pandas as well and i'll give some feedback here just saying you know into the console just saying uh you know opening snowflake connection and uh and then we'll create our connection for snowflake so we'll use cnn equals snow flake.connector.connect and then we're going to throw in our username and password and as well for this connection i'm going to put in the the also the database name and the schema name which is kind of a nice way of doing it when you're connecting so that you immediately go directly into the the database and schema that you need now you can make a connection without specifying the uh you know the database and the schema but at that point then you need to you know send the you know the sql command for using so we'll use the use command so you'd say use database use schema and those will get you to where you want to go but you can include the warehouse the database and the schema right in your connection which is really nice because it sends you right where you want to go to do the work that you want to do and so in this case i've got a warehouse that i created called project warehouse i've got a database called project database and a schema called project schema and this is a project related database and schema and so what i did was i created some tables that have a few records in them that can be joined um now you could you know think of these as having a ton of records in them uh something uh that you might you know it might take a long time to merge and and that's sort of how we're going to move forward with this so i'm going to create a cursor the cs equals cnn.cursor and then i'll do sql is equal to select star from projects just to show you what's in our database here and what i'm going to do first is i'll do a cs dot execute on my sql there and then i'll say my data frame my pandas data frame is equal to cs dot fetch pandas all which is a very handy command that comes with you know comes with the snowflake connector and that's going to create that data frame in pandas that we can use um and it's gonna you know have all the you know characteristics of a data frame pandas data frame and that we've sort of gotten from snowflake and so if i go ahead i can uh give some feedback to the user just saying we're going to close that connection because it's good to close the connection before you start doing things that might you know throw an error or whatever you should actually put this in a in a you know try accept block um but just generally speaking it's good to close your connections before you move on once you've got your data out of your data source and so um i'll print you know data results and then uh we'll print off the the data frame head 10. i don't even think there's 10 records in there but i'll just say dot head there and uh and then we'll hit f5 and that's going to execute our sql and as you can see it grabbed you know an id project name and some details about the project i'll do the same thing for our our staff so that we can see the staff entries so i'll say sql is equal to select star from staff and then we'll grab that one in much a similar way that we grabbed the earlier one just to show you what the data looks like i'll call it df s here in order to to designate that its first staff but first we'll uh execute that sql so that we can get it out uh using fetchpand as all uh cs.fetchpandasall [Music] and we'll do that now so now we've got a staff table and we'll take a look at that one as well and i'm also going to grab the the table now i'm just going to copy and paste it this time i'm going to copy the junction table which is called project underscore staff so that has that table creates the many-to-many relationship that we're sort of looking at uh here so i'll call that one dfps and uh and so what i'll do is i'll print both of those tables and so that you can sort of get an idea of what the data looks like and how you might go ahead and you know either just directly export or import this into pandas or whether you might you know use a pandas join or whether you might do sql join on the back end but this is what the data looks like and so you can see there's the id and project name project description and then there's a staff you know with the staff id id first name last name and there's uh you know there's a project staff table where it's got you know the project id and the staff id and that creates a many-to-many relationship uh between the tables and so you could you know if you needed to put these into one table you could merge this using uh you know panda's data frames using a join or something like that but generally speaking i like using sql for these in relational databases and actually you can use sql quite well in snowflake as well it's very very handy to be able to use the sql for those statements and and get back the result sets that you need and that allows you to sort of um use the power of the back end that's on snowflake and uh to do these you know massive joints now in this case it's just a few records but if these had a lot of rows in them you'd probably start to see performance decline if you you know brought them all over and we're using them uh in in pandas joins as opposed to just you know doing a select statement with your joins in sql against the snowflake database so in my first sql statement here that where i'm going to do a join i'm going to take the last name first name and you know the project name and the date they started on that project and i'm going to grab that from the project staff table so that's the the junction table so i'm going to use that to start with and then i'm going to do an inner join with the project table and an interjoin with the staff table so i'll use the short form i'll just say join project is join project prj on prga.i is equal to you know psf.project id and then i can that's going to give me a join there and then i'll also join staff as stf on you know stf.id is equal to psf dot staff id and this is uh sort of what you might see in typical real-world situations with uh you know having to join these together like this and uh as you can see i've missed a i missed a quote there double quote so now if i close that off looks like it checks out okay for that statement and we can go ahead and we can grab that result from our query without having to join anything we can just grab it right out of snowflake using joins and that's gonna go a lot quicker than if we you know did this you know locally so i'll create this additional data frame here i'll call it uh data frame all assigned and we'll use our our trusty cs dot fetch pandas all to give us a data frame back and then i'll just get rid of these print statements here because you've seen what the data looks like already and i'll go ahead and i'll just print that data frame off so that we can see you know the results of that um using the sql against the snowflake back end so there we go i'll print the df all assigned and i'll grab the head of that as well even though in this case there's there there's only a few rows but if you have 10 million records you probably only want to grab the head to see your some result but there we go so this is all the assignments so you can see wanda cooper's on the blueberry farm and you know you can see uh mcduff there archie mcduff he's working on the skyscraper he started on april 14th and you know james duncan he's on a marathon but you can see there's two people on the marathon project and and so we've got some assignments there and one of the nice things that we can do you know using the sql is we can do similar kinds of um sql operations so we could do a mismatch query where we show you know all of show me all of the projects that don't have any staff or all of the prod staff that don't that aren't on any project those kind of queries we can run those in just the same way which is really great and and so that can save a lot of time so like if i did this you know a similar query and i said you know select the project name uh from project and if i did a left join and you know i i did project staff at you know on our id field just like we did before but now it's a left join but we'll say where you know the id in the project staff table is null and that's going to give us all of the projects that don't have any people on them yet so you might have a similar situation to this in your snowflake database where you need to tell you know who's in who's not assigned or which ones don't have assignments or you know those kind of those kind of situations so we'll do the same thing we'll do our cs.execute sql and then i'll say df project no staff give a nice big long name there and and then we'll use our cs.fetch pandas all and and then we'll we'll go from there and now i'll print that one off instead of the df all assigned so i'll just copy that name there and i'll go down and i'd like to show show that in our output here and our data results which i can do and i'll just paste that over that one and save it and then hit f5 and i broke it already there we go missing a double quote so there we go so now our statement looks better so i'll hit f5 and let's see what we get here so we're opening our snowflake connection and there you go so the data warehouse project which is the last project in our list up top there it has no staff on it so the you know so our query was able to tell us that and we could do the reverse of of that by uh you know finding all the staff that aren't assigned on any projects yet which we can we can do in a similar fashion in that case i could do you know select last name first name from staff and and then do the sort of the opposite of what we did before where i can say you know join uh you know project staff which is our junction table on you know on the id fields just like we did before and then we'll we'll also use the um we'll use where the where where clause where our id in the project staff table is null and so we can do that here so i'll go ahead and where the psf dot id is null and that's going to give us all of the staff that are you know that don't have any assignments yet we'll do our cs.execute sql and uh and then i'll call this one df staff no not assigned and and then that'll give us give us a data frame that we can use to to get that and use our fetchpan as all again and i'll copy that one down there and just uh take a look and and see here we'll scroll down and put that in our our print statement just so that we can get that one statement out just for demonstration here and then we'll hit f5 and and see what we get from that so let's go ahead and hit that one and uh oh that one gave that gave no results i also did something wrong uh looks like a left joint yeah i didn't put left in front of my joint there so you need to put left join it's very important if you're trying to do a mismatch query so now if i hit f5 there we go okay so we've got mary smith and she is not on any project so we can assign her somewhere else and that's how you can do sql joins on snowflake need help for coaching on your project make sure to check out my patreon the link is in the description hope you enjoyed today's discussion on how to get data into your pandas data frames by using sql joins on snowflake if you like what you saw today please give the video a thumbs up make sure to subscribe to the channel if you have not subscribed yet click the bell when you see the bell and if you have any questions or comments please put those in the comment section below have a great day have a safe day and i'll catch you next time
Info
Channel: Sean MacKenzie Data Engineering
Views: 140
Rating: undefined out of 5
Keywords: how to use sql joins when using python on snowflake to get pandas dataframes, snowflake inner join, snowflake left join, snowflake python, python snowflake join, python snowflake sql, sql in snowflake using python, python snowflake dataframe, snowflake to pandas, snowflake connector for python, snowflake database, snowflake data warehouse, snowflake schema, python code to connect with snowflake, connect snowflake with python, sean mackenzie data engineering
Id: n3MmuEjLuzM
Channel Id: undefined
Length: 16min 34sec (994 seconds)
Published: Fri Nov 26 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.