SQL Summer Camp: Getting started with SQL | Kaggle

Captions
Good morning! Just doing some last-minute checks; it should start on YouTube pretty quickly. I always stall for a little bit, so for those of you who are joining for the first livestream: there's usually a bit of a gap between where I am in time and where you are in time, especially because I'm trying to stream in a little higher quality. Welcome!

Quick rundown of what we're going to do today. We're going to start by talking about what relational databases are and why you might want to use them (I made some slides), and then we'll go through the lesson. If you didn't get the emails, the links to the courses are all on Kaggle Learn: if you go to kaggle.com/learn, it'll take you to the overview page, and the course that we're doing is the Intro to SQL course, with some additional material from me. If you did get the email that went out this morning (it looks like everybody should have gotten it), the links you got are the links to this first lesson and then this exercise. If you fork this exercise — clicking this will start a new fork for me — and you are not signed in, your work will disappear after 15 minutes, so I'd recommend signing in to your Kaggle account so you have a permanent copy. I don't actually want this fork right now, so we're just going to leave it.

Let's get started. I'm not going to look at the chat, because I know in the past I get really distracted answering questions, and I'm going to save it until we're done talking about the lesson.

What are relational databases, and why would you want to use them? A database is just a way to store data on disk, and a relational database specifically is a collection of tables. Tables are basically spreadsheets — they have rows and columns — and what makes it relational is that there's a relationship between what's in the rows and what's in the columns. We have a little example here from the Wikipedia page on relational databases (it's in the public domain), and you can see we have three different tables, or spreadsheets, in this particular database. One table tells us the activity name and the code for that activity — it looks like this is some sort of road work. Then we have another table in the database that tells us the activity code, the date, and the route number: the road where the particular road work was being done, the date it was done, and the activity code. We can see that on January 12th (because this is American data) there was an overlay done on I-95. And then we have another table with similar information, slightly rearranged. All of these together are part of one database.

Generally, relational databases are used for what's called tidy data. This is a term I'm taking from Hadley Wickham; he has a 2014 paper that I'd really recommend if you're interested in that particular topic. (Hmm, okay, that's not centered, one sec — I was saying all those things and you couldn't see any of it. There you go: there's the database I was talking about.) In tidy data, each variable forms a column — things like dates or route numbers or activity codes — and there's one variable per column: we wouldn't have a column where sometimes there are activity codes and sometimes there are activity names; we'd want it to always be codes. Each observation forms a row, so these are specific instances of road work that was done. And each type of observational unit forms a table: here we have our sort of dictionary of activity codes, and that's one type of unit, and here we have a list of all the specific instances of patching
(sorry, overlaying) as its own table. That's the general idea.

If you're familiar with spreadsheets, which I think a lot of people are, there are some words we use when talking about databases that have spreadsheet equivalents. A table in a database — one of these guys — is what we'd call, well, a table. Rows are sometimes called records, so these are all records instead of rows. And columns are sometimes called attributes, so this table would have the attributes date, activity code, and route number. That's the general idea of databases — relational databases specifically — and how they work.

So when would you want to use a relational database? There are a couple of situations. The biggest one, especially for data scientists, is that the data you need is already in a database and you need to get it out. They're also very helpful if you have a lot of data. I get a lot of questions like, "hey, I have too much data to load into Python locally, what should I do with it?" The best answer is almost always to put it in a database, and either pull just the specific data you need or even work inside the database. Later on in the Summer of SQL we'll actually talk about how to do machine learning in BigQuery, which is the specific database platform we're using for these courses.

Another time you'd want to use a database is if a lot of people are trying to access and change the data. Just as an example: you have an e-commerce website, and you have a table that tells you about customers, a table that tells you about what's for sale, and another table about sales. Every time somebody makes a purchase, a new row in the sales table is created. Say you restock something: someone has to go in and edit the table that has information about what's on the website. Say some customers want to be removed for GDPR reasons: someone has to go in and update the customers table. And as a data scientist you're probably also going to want to do analytics — "hey, when should we reorder this? is there a pattern in sales?" — so you need to get data out while multiple people are changing it. If you have a group situation like that, a database is really helpful; it's pretty much the only way to do it. Or use one if you're learning to use relational databases: I used my first relational database when I was learning to use relational databases, because it's a handy skill, and it has served me very well.

Sometimes a relational database is overkill, and you probably don't want one. If you have a very small amount of data and you don't expect to get more, a database might not be the best place for it — especially if you don't already have a database set up, and you'd need to make an entire database for your teeny data file; probably not the best use of your time. And relational databases specifically are for things you can put in a spreadsheet. For things you can't put in a spreadsheet — images, documents, or data with a more hierarchical structure — you're probably going to want a different tool. NoSQL is "not SQL": those databases aren't relational, they don't work with tables, and they deal with different sorts of data structures. Blobs are binary large objects — any file you'd want to store, whether image files, text, or a bunch of other things — and for those you'll probably want something that stores blobs. "Blob store" is one name, "blob storage" is another; they all have pretty much the same name, and pretty much all the big cloud providers will have an option for you.
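The road-work example above can be made concrete. Here's a minimal local sketch using Python's built-in sqlite3 module (not BigQuery — just the smallest relational database at hand); the table names, column names, and rows are invented to mirror the example:

```python
import sqlite3

# An in-memory relational database: two tidy tables, one per type of
# observational unit, related by a shared activity code.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE activity (code INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE work_log (activity_code INTEGER, date TEXT, route TEXT)")

cur.executemany("INSERT INTO activity VALUES (?, ?)",
                [(22, "patching"), (23, "overlay")])
cur.executemany("INSERT INTO work_log VALUES (?, ?, ?)",
                [(23, "2019-01-12", "I-95"), (22, "2019-01-15", "I-495")])

# The "relational" part: joining the two tables on the shared code.
cur.execute("""
    SELECT work_log.date, work_log.route, activity.name
    FROM work_log
    JOIN activity ON work_log.activity_code = activity.code
    WHERE activity.name = 'overlay'
""")
rows = cur.fetchall()
print(rows)  # [('2019-01-12', 'I-95', 'overlay')]
conn.close()
```

The same SQL syntax (SELECT, FROM, JOIN, WHERE) carries over to BigQuery, which is the whole point of learning it once.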
Okay, so that's when you should use a relational database. The specific relational database service we're going to be using is called BigQuery. BigQuery is just where the data we want to query lives; that's what the service is called. The SQL syntax we use is going to be more or less the same no matter what service you use — maybe there are some slight differences. BigQuery is one service, and there are a bunch of others; maybe you have somebody hosting a MySQL database for you, and the syntax might be slightly different, but again, pretty much all the big cloud service providers have one. BigQuery is from Google, which also owns Kaggle, so it's been pretty easy for us to set up the ability to query it from within Kaggle. One really nice thing is that you don't actually have to set up any authentication to use BigQuery data on Kaggle, which is nice. If you're going to use a BigQuery dataset in your kernel — which is what we call our hosted notebooks — you need to add it as a data source, and I'll show you what I mean a little later. If you get an error like "have you checked your BigQuery settings?", it's probably because you didn't add the dataset to your kernel.

Also, and this probably won't come up: each Kaggle user gets 5 terabytes of scanning every 30 days for free, because that's what we have for our BigQuery access. If you go over that limit, you'll need to wait for it to reset. When we get to things where you might start bumping up against that limit, I'll give you some caveats, and we'll talk a little about query optimization and how to make sure you're not scanning huge amounts of data all the time. If you do go over the limit, it resets on a rolling basis: you'll need to wait 30 days from when you first started using up your query quota. That's the general idea.

I've had some questions already about what to do if you want to work locally on your own computer. If you want to use the same syntax we're using: the SQL will be the same; if you want the Python to be the same, you'll need to install the client library (I just googled "BigQuery client libraries" — oh no, my picture went away; sorry, my beautiful background of trees; let's bring that back, there we go). It is a bit of futzing: you'll need to install it, set up authentication, have a Google Cloud Platform account, and probably set up billing. There are more steps if you want to set it up locally, but you're of course perfectly welcome to do that.

So that's where we are and why you might want to use relational databases. Let's talk about how we're actually going to interact with BigQuery from inside Kaggle, which is where we'll get into the Python syntax. The Python syntax will be the same every time, so you can copy and paste — it's okay, I copy and paste a lot. These lessons are all in a slightly different viewer from our normal kernels viewer, just to make things a little more streamlined, so if you're used to our normal viewer this may look a little different. You'll also notice that I can't edit here: this is just the compiled notebook, and if I start typing, nothing happens. So if I want to actually run any of these commands to try them out, I need to start a new kernel; let's do that by going to Kernels. When you click on the link for the lesson with an exercise, it will automatically start a kernel for you; I just want a little scratchpad to play around in, so let's call this "BigQuery scratchpad". There we go. And this is
what I'll be using to interact with my datasets, just to show you how stuff works.

All right. SQL (or "sequel") is a programming language for use with databases — I don't think I mentioned that. SQL is how you interact with a relational database. There are other languages, but SQL is by far the most common; it's what we use at Kaggle. If I need to get data for work, I need to write a SQL query (and then usually get somebody to look at my query to make sure those fields mean what I think they mean), and it's extremely, extremely common in industry.

We're going to interact with BigQuery datasets using SQL from within Python, so we need a way to write Python commands that tell the dataset on BigQuery what we're looking for, who we are, and all that sort of stuff. If you're really interested in working directly with APIs, you can work directly with the BigQuery API, but you'll have to do a lot more manual futzing to get it to work; it's recommended that you use the BigQuery client. So I'm going to import bigquery from google.cloud (I'm just going to copy and paste). This is the client library, and I think of the client as a little medieval messenger: I give the client my command, and I give it my signet ring, I guess, so that it knows who I am and can tell the database who I am. That's the authentication, and I don't have to do it every time — and actually, if you're querying public BigQuery datasets from inside Kaggle, we do all the authentication, so you don't even have to touch it, which is nice. The client takes your command over to the database and hands it over. The database has a pretty limited vocabulary: it's looking for SQL, and if it doesn't see good SQL, or sees SQL that doesn't make sense, it hands an error back to the client, and the client carries it back to you. It's a go-between that takes care of some of the frustrating stuff.

So I need to set up an instance of the client. This is extremely object-oriented: bigquery.Client is the class, which means it's sort of a blueprint of what a client should be, and we're creating an object that is a specific copy of that class. We don't need to pass anything here; it just sets itself up and tells us, okay, you're on Kaggle, you're using the public integration.

From here I can say: hey, I want you to create a relationship with a specific BigQuery dataset. You have your client, your little messenger, and you say, "here's a set of directions to get to the castle of King Olaf" — which in this case is the hacker_news dataset. Let's make a new cell, and when I run this I'm going to get an error. I'm saying: we're going to look for the dataset hacker_news in the project bigquery-public-data, which is where all the BigQuery public data lives (if you have your own dataset you're working with, the project and the name will probably be different). Then I tell my client, "hey, know about this dataset", and we get an error: permission denied. If you're working with your own BigQuery datasets — which you can do down here; you can link them with your Kaggle account — that's going to be for you to troubleshoot. But for us the fix is very easy: all we need to do is Add Data. I know it's the hacker_news dataset, and I see it here tagged "bigquery", so I know it's a BigQuery dataset, and I can add it to my kernel. Give it a moment to refresh and think, and this time when I say "here are the directions to Count Olaf's castle", he'll be like, "oh yes, Count Olaf, I have a map to Count Olaf's castle". I don't know where Count Olaf came from — that's not very camp-themed. I guess it's a map to the rival camp; they keep
beating us at dodgeball, and we have to go — I don't know, what's a harmless prank you could do related to dodgeball? Paint their dodgeballs a different color, and then they won't be able to find them as easily. Anyway: we've run this cell, and you can see there's no error, so the map was successfully delivered.

From here, once we have our messenger going back and forth, we might want to know what's in Count Olaf's castle — it's a big castle, maybe it has a lot of rooms. (Sorry, I'm back on the castles.) In other words, you want to know what's in the dataset, and we can do that using these commands. (Oh no, that's the same command we used last time... nope... okay, well, I failed to copy and then I failed to paste, so this is going super well. There we go.) What this does: it uses the dataset reference, which points specifically to hacker_news (if you were working with a different dataset, you'd use a different dataset name here), and we're telling our little messenger, "hey, list all the tables in that dataset". Remember that a table is like a spreadsheet, so: tell me all of the spreadsheets in this specific dataset. That gives us a list, and then we just step through the list and print out every item in it; we're calling each item "table", just assigning it a name in the loop. If we do this, we can see there are four tables: one is comments, one is full, one is full_201510 (so probably a version from October 2015), and one is stories. These are all the tables we could run queries against.

The next thing we want to do is create a pointer to a specific table, so that we don't have to say "hey, go to this specific table, and here's the complete list of directions" every single time. Instead I can say, "oh hey, remember that one table I sent you to that one time? I want you to keep going back there." So here I'm going to create a reference to the table called full — one of the tables in the dataset — and I'm just going to call it the table reference. If I wanted it to be a reference to a different table, I could change this text here, and it would change which table it's looking at. And it has to be one of these actual tables: if I try something that's not a table, I get the error "requested resource doesn't exist", and if I change it back to something that's actually in the dataset, it works fine, and my client, my little messenger, is like, "oh yeah, I remember that table you told me about; do you need anything from it?" (They don't actually ask you that, but if you did need something from it, you could go get it.)

So we have a little nested hierarchy here (I can't make that any bigger — I think that's the biggest it can be): our client, the little messenger; then bigquery-public-data, which is the project; then within that, the specific dataset, hacker_news; and within that, the tables. This is the full set of directions for our little medieval messenger to trot along and get all the way to the specific cell or row or column we're looking for and want data from. That's the general idea.

So we know where the tables are now, but we don't necessarily know what's in them. Let's check out the information about one of the specific tables we've created a reference to. This will print out information about the table full (I'll just make that smaller so I have a bit more room), and you can see we have a bunch of SchemaFields. Basically, each SchemaField is a column, and it tells me what's in the column and what I can expect from it. One of them, by, is a string, so it's going to be some sort of text, and there's a helpful piece of information here: it's the user
name of the item's author. (Hacker News is a forum site where people can make posts and comment on them.) Then we have score, which is an integer — a number — and it's the score of the story: I've posted a story, people are maybe upvoting or downvoting it, and the score is how people felt about it. timestamp is a timestamp, and time is an integer; the only difference between these, it looks like, is that one is just a number and one is a timestamp. That number will be the seconds from the start of the Unix epoch, which was in the seventies, I think — Unix time is: there was a point in time, and they said "this is zero", and we count up from there. I'm pretty sure it was in the 1970s. You can go through this field by field and look at all of the columns and what's in them. And if we wanted to look at a different table, we could change the text here: instead of pointing to full, I'm pointing to comments (I could also create a second variable if I liked), and now when we ask for the schema we get something different — the unique comment ID, the username of the commenter, the username of the author the comment is replying to, the Unix time, whether it's deleted, and so on. We get different information from a different table.

If we'd like, we can also use the list_rows method to look at the first five rows. We've looked at all the columns and know what's in them; now let's see what the data actually looks like. This is similar to using the head command in pandas or R, and it's in the lesson. The lesson does this for the full table; because we've said that table now points to comments, we'll instead get the first five rows of the comments table. We index from zero, because we're in Python.
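The whole workflow so far — create a client, reference the dataset, list its tables, point at one table, read its schema, and preview rows — can be collected into one sketch. It's wrapped in a function so nothing runs until you call it from a Kaggle kernel with the hacker_news dataset attached (or a local environment with credentials set up). The calls follow the course-era google-cloud-bigquery client API, where `client.dataset()` was the usual way to build a dataset reference; newer releases prefer `bigquery.DatasetReference`:

```python
def explore_hacker_news():
    # Imported inside the function so the sketch can be read without
    # the google-cloud-bigquery package installed.
    from google.cloud import bigquery

    # The client is the "little medieval messenger" that carries our
    # requests to the database and results (or errors) back to us.
    client = bigquery.Client()

    # Directions to the castle: project + dataset name.
    dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
    dataset = client.get_dataset(dataset_ref)

    # List every table ("spreadsheet") in the dataset.
    for table_item in client.list_tables(dataset):
        print(table_item.table_id)  # comments, full, full_201510, stories

    # Point at one specific table and fetch its metadata.
    table_ref = dataset_ref.table("full")
    table = client.get_table(table_ref)
    print(table.schema)  # one SchemaField per column

    # Peek at the first five rows, like .head() in pandas.
    return client.list_rows(table, max_results=5).to_dataframe()
```

On Kaggle the authentication is handled for you; locally you'd need Google Cloud credentials configured before `bigquery.Client()` will work.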
You can see here are five comments: the ID; who the comment is by; the author the comment is replying to (this person is replying to their own post); the text, which is what's in the comment; the post the comment is a reply to; and various other information. So that's how you get from setting up BigQuery — "hey, I want to create a little messenger, and I want him to go look in this particular project (in our case bigquery-public-data), in this particular dataset (in this case hacker_news), in this particular table" — and from there you can more or less treat it like a spreadsheet. I'm going to talk more in the coming days about how to get rows, how to get columns, and how to aggregate — how to get things like averages over columns. But this is the general idea of how you interact with BigQuery datasets on Kaggle, and again, feel free to copy and paste; in fact, I encourage it.

Okay, it's 9:26, and I'm going to take questions. There's a little more in the lesson that you're free to read through in your own time, and the recording of this video will be available immediately after the livestream ends, so you'll be able to go back and watch it — I know I talk a little fast, especially when I've had coffee — but I want to make sure I have a chance to answer your questions. Let me actually make this kernel public as well, so you can see it: to make it public, I go under Sharing settings, set it to Public, and save. It's called "BigQuery scratchpad", and the account I use for demos is under the name Rebecca Turner — super secret; it's not secret. Let me commit so you can actually see it, and you can go look at it if you'd like. From here, let's head to the Learn forums, and I'm going to see if we have questions
there, and then I'll go to the chat — I can see it in my peripheral vision, and I'm trying really hard not to look at it. (No, sorry, this isn't the Learn forums... Discussion, Learn... all right.)

It looks like we have a question here: "Permission denied using Kaggle's BigQuery integration. I'm new here and I'm seeing the following SQL error: I cannot connect to the database because I see the error 'permission denied' using Kaggle's public BigQuery interface." This is the error you'll get (sorry, it's a little zoomed out) if you don't add your dataset on Kaggle — like how I needed to add the hacker_news dataset to my kernel — or, if you're working locally, if you haven't set up your authentication correctly. If you have trouble setting up authentication locally, I'd send you to Stack Overflow, because they'll be better able to answer your questions; the BigQuery team monitors Stack Overflow pretty closely. So that's how you solve that particular error. Someone posted about this, and I know someone has already responded to that.

It looks like the only other question is this one: Dixie asks, is the GCP free trial only applicable to new users? She set it up in 2015, and it looks like it's no longer available. I don't know the answer to that; I think it's only for new users. If you're using BigQuery on Kaggle, I believe the 5-terabyte query quota we have is separate from your GCP quota; I don't know about running queries against your private datasets, and I'll reach out to the GCP team and see if they can answer that.

All right, we should be able to see this now. Here are all the commands we just ran, and this view (you can see the title up here) is a Jupyter notebook, but it's rendered, so I can't actually interact with it; it's static HTML. I know there were some questions about that a little bit earlier.

Let's go to the chat. One viewer asks, how can I review this video later? The link that I sent you will continue to work in the future, and it will also be on the Kaggle YouTube channel, youtube.com/kaggle. I'll start a SQL Summer Camp playlist — I haven't done that yet because I've been real busy, but I will — so that's probably going to be the easiest way.

More questions. Yes, the tutorial does use pandas methods to work with data tables. If you'd prefer to use R, probably the easiest way would be to run your query using Python first. The reason we're using Python — this is actually a good question — is that if you look at all the languages you can use with the BigQuery client, R is not one of them, but Python is. That's why you can only query BigQuery from within Kaggle using Python: there just is no R client library. Maybe there will be one day; that would be nice. But if you want to use R, what you can do is run your queries with BigQuery and then save out your data, and I can actually show you how to do that right now. Let me zoom in a little. Let's run all of these from the top, and we should get out a little data frame here — let's actually do 25 rows so it's a bit bigger. If I wanted to work with this data in R, the easiest way would be to write it to a file. Let's call this bq_file. Now, I'm trying to remember the pandas syntax... to_csv? Is it just .to_csv? No... save? Nope... save here, maybe? Let's find out. Nope, that's not it. Oh no — I didn't import pandas. Okay, so this is directly in raw Python. In raw Python it would be with... sorry, I'm trying to remember the syntax for saving files out. I'm just going to look up Python's file syntax
— I can't remember whether the file name comes before open... let's see... okay. I'm pretty sure it's with open("my_file.csv", "w") as file, and then inside the indented block, file.write(bq_file). I think that will work... no: "write argument must be a string, not a DataFrame". Let me look up "Python save data frame to CSV"... nope, I don't want to pickle it... okay, it is in pandas after all. So we need to import pandas as pd, and then we should be able to say... what's the syntax for this? You can tell I do this really often. Oh, I see, okay: to_csv is a method on the data frame that creates a text file. So we write that file, and... yes, excellent. I'm going to add the import up at the top so it runs beforehand. Sorry, that was a bit of a tangent, but it's a good question, and it shows you how to save files so you can use them later.

Once it's run — you can see it's still thinking about itself — it will create a new file, and when I open that version... you're seeing JSON because the kernel is still running. Once it's done, I can go down to my .csv file here: any files you save in a kernel will show up at the bottom of your compiled kernel, and I can either download it and work with it locally, re-upload it, or create a new dataset from the kernel output. If I search for — what did I call it — "bigquery sc"... let's say "my work"... actually, I misspelled "bigquery", that's very possible... there we go. So I can create a new dataset from the output of a kernel and call it "bigquery sample file", and once I've created the dataset, I can add it to an R kernel and analyze it using R rather than Python.
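The file-saving detour above, cleaned up: to_csv is a DataFrame method, so no open()/write() dance is needed at all. This is a sketch with an invented stand-in data frame — in the stream, bq_file came from a real BigQuery query:

```python
import pandas as pd

# A stand-in for the data frame that list_rows(...).to_dataframe() returns;
# these rows and columns are made up for illustration.
bq_file = pd.DataFrame({
    "id": [101, 102],
    "by": ["alice", "bob"],
    "text": ["first comment", "second comment"],
})

# What I was fumbling for on stream: to_csv writes the file directly.
bq_file.to_csv("bq_file.csv", index=False)

# Round-trip it to confirm the file is readable (e.g. from R with read.csv).
check = pd.read_csv("bq_file.csv")
print(check.shape)  # (2, 3)
```

index=False keeps pandas from writing its row index as an extra unnamed column, which makes the CSV cleaner to read back into R.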
Some questions. Someone points out that you don't technically need the with — that's true. I just like it, because if you don't use with, you need to remember to open and close the file, and I tend to forget to close my files, which is not good for memory-usage reasons. And the reason you'd want to do this in R: I tend to prefer working with tabular data in R — there's just a lot of really nice tooling around it — and I prefer ggplot2 to pretty much any other visualization library that exists. So if I want to use ggplot2 to make, let's say, some beautiful visualizations for a report based on data that was in BigQuery, I could use BigQuery to pull down the data — say, about the last quarter — create a new dataset from that, and then make beautiful visualizations for my stakeholders.

Okay, other questions. Sylvia asks, can we use the same BigQuery data using Spyder via Anaconda, or do we have to be on Kaggle? You can absolutely do the same thing without being on Kaggle; your syntax is going to be a little different, and I'd point you towards the Python reference for the BigQuery client — that's going to be the simplest way to work locally. That's a good question.

Colin asks, what problem does BigQuery solve? The biggest one is that if you have a very large database, it can take a long time to run your queries. Let's say my query is "I want information on all my customers whose last name is Anderson". Generally databases aren't sorted, so I can't just scan through until I find the A's and grab all the Andersons; usually they're in a different order. The thing BigQuery is optimized for is doing things like that very, very quickly, even when your dataset is very large.

People are asking about ggplot in Python: there are a couple of ports, but it's not fully supported — they're mostly people's pet projects — and unfortunately some of my
favorite ggplot ports have quietly rolled to a standstill, shall we say. Questions about the certificate: I believe the certificate will be emailed to people who fork every lesson. If we go to Learn, all of the exercises have the little code symbol here; if you've forked all of the exercises, I believe that's when we email you the completion certificate at the end of all the events. Predicta asks: is it better than Redshift? I have no idea. My attitude toward SQL is somewhere between utilitarian and extremely utilitarian — I'll use whatever tool is closest to hand and easiest to use, and right now that's BigQuery. If you have input on other SQL platforms, please feel free to share it; I think that would be very useful. Some questions about credentials: when you run this code on Kaggle, we handle the credentials and the authentication. If you're running it locally, you need to handle the authentication and credentials yourself. If you're having trouble, I'd refer to the client library documentation, and if you're still running into problems, ask on Stack Overflow and tag your post "bigquery" — the BigQuery team is pretty good about keeping on top of Stack Overflow posts, and I believe that's the main place they look to offer support. Good question. Leslie asks: what's the difference between SQL and pandas? That's a good question. SQL is a language that interacts directly with databases; pandas is a library within Python. pandas is actually pretty much a reimplementation of R — or I should say of some of the features of R; it's not a one-to-one clone of R in Python. I'd say the biggest difference is that the fundamental unit of SQL is the table inside a database, and the fundamental unit of pandas is the DataFrame, and there are some
slightly different assumptions, even though they look similar visually. One of the big things about data frames in R is that every row has an index — I actually don't know off the top of my head whether that's true in pandas or not, but that has implications for performance and for the way you write code. So they're very similar, and you can pass things back and forth between SQL and pandas pretty easily — that's the difference. The syntax is also fairly different, I guess, which is another big difference. Again, if you're familiar with R, the tidyverse collection of packages actually uses SQL-like syntax, more or less, to interact with data frames; pandas DataFrames generally don't have SQL-like syntax for interacting with them. Shekar asks: are there any mobile or web applications that use BigQuery — any practical applications? Yes, a lot of them. I don't actually know what's public in terms of who uses BigQuery, but many enterprise applications do — BigQuery is specifically designed for enterprise-level applications and really big stuff. One sec, I'm just going to do a sneaky little query down here in my other hidden window to make sure this is public... yeah, okay, good, it's in there. A good example of a company that uses BigQuery would be Spotify. So yes, absolutely, there are practical applications of BigQuery specifically — but it doesn't have to be BigQuery, right? SQL syntax is mostly the same across different database hosting structures, so I'd say that SQL as a skill is extremely practical and transferable; BigQuery as a tool is useful and people use it, but it's not the only tool out there. Someone asks: are we going to learn NoSQL or PostgreSQL, and what do you think about graph databases? I think they're cool, and also out of scope of this course.
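To make the table-versus-DataFrame comparison concrete, here's a small sketch using Python's built-in sqlite3 module rather than BigQuery (BigQuery needs credentials, and the SQL dialect is close enough for this point); the customers table and its rows are invented for illustration. The same filter is expressed once in SQL and once in pandas:

```python
import sqlite3
import pandas as pd

rows = [("Anderson", "Oslo"), ("Lee", "Seoul"), ("Anderson", "Bergen")]

# SQL side: the fundamental unit is a table inside a database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
sql_result = pd.read_sql_query(
    "SELECT name, city FROM customers WHERE name = 'Anderson'", conn)

# pandas side: the fundamental unit is the DataFrame
df = pd.DataFrame(rows, columns=["name", "city"])
pandas_result = df[df["name"] == "Anderson"].reset_index(drop=True)

print(sql_result.equals(pandas_result))  # True
```

This also shows how easily things pass back and forth: pd.read_sql_query runs a SQL query and hands you back a DataFrame in one step.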
The summer event is specifically about SQL; if there's a lot of demand for NoSQL and graph databases, let us know — I'd be happy to talk about them, I think they're really cool. Let's see — Priyanka asks: Python is just being used as an interface to access the BigQuery dataset, right? Correct. The client — the little medieval messenger I kept talking about — is written in Python, and it's what you're using to send SQL back and forth. Oh, Disney uses BigQuery? That's cool, I didn't know that. Enric asks: what's the difference between BigQuery and other databases like Oracle, Postgres, and so on? I don't know. Again, I'll use the database that's handed to me by my engineering colleagues, and I haven't looked that closely into the different platforms. I'd consider that to be more in the data engineering vertical, and there's so much to keep up with in data science and machine learning that I just haven't spent much time on it. Preet asks: do you have a document we can refer to for all the commands, like client.list_rows or client.dataset? I believe all of that is in the API reference for the client. Here is the list of all the commands. We had a couple of different objects we interacted with: we created a client object, and these are all the commands you can use with that client object; we had a dataset object, and here are all the commands you can use with the dataset object — I believe we only used the dataset reference; and we had the table object, and we used quite a few commands from there — we used table to create the item; I don't think we used the row iterator; I think we used something like list_rows, and the schema — we created a schema item and then looked at it — and there's a bunch of other stuff. So I think this is probably your best bet for reference, and I guess I can put it in here. I'm trying to think of the
best place for it — a link to all the BigQuery commands. I don't want to just put it in the chat, because people looking at the notebook later won't have access to it, and it's going to give an error if I try to treat it like Python, so this will be in the latest version of the BigQuery scratchpad kernel. Varun asks: can we use BigQuery with R? Not on Kaggle, but there are — I think there are two libraries, actually: I think there's bigrquery and bigQueryR. So again, we don't have it set up on Kaggle, but there is an R library for BigQuery specifically, written by Jenny Bryan, who's amazing. If you don't know Jenny, she's fantastic — she's written the absolute best guide to Git version control for data scientists, which I highly, highly recommend. Anyway, I love Jenny, she does such good work; that was me being a little off topic. Radar asks: how many columns in the crime table have timestamp data? To find that — and this is in the exercise — you're going to want to use the schema here. You'll see in the schema we have the name of each column (these are all columns) and then the data type of the column. SQL has a number of data types, and one of them is TIMESTAMP, so we're asking you to count the number of columns that have the TIMESTAMP data type. Mockingbird asks: can we finish all the exercises in one day? Feel free — go nuts. We're just releasing one at a time so that it's not overwhelming and I can walk you through it. You don't have to watch the livestreams; they're just an additional reference for you, so feel free to work at your own pace. General question: is it difficult to remember all the syntax from Python libraries — is it okay to copy and paste sometimes? Yes, please, you can always copy and paste. Professional engineers copy and paste, and the more you use a specific library, the more you'll remember the commands.
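The schema-counting step above looks roughly like this. The schema here is an invented sample (a stand-in for the list of bigquery.SchemaField objects you'd get from the table), NOT the real Chicago crime table, so the count below is not the exercise answer:

```python
from collections import namedtuple

# Stand-in for bigquery.SchemaField entries: each column has a name and a
# SQL data type. This is a made-up sample schema for illustration only.
Field = namedtuple("Field", ["name", "field_type"])
schema = [
    Field("unique_key", "INTEGER"),
    Field("date", "TIMESTAMP"),
    Field("updated_on", "TIMESTAMP"),
    Field("description", "STRING"),
]

# Count the columns whose data type is TIMESTAMP
n_timestamp = sum(1 for col in schema if col.field_type == "TIMESTAMP")
print(n_timestamp)  # 2
```

With the real table you'd do the same loop over table.schema after fetching the table through the client.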
Another thing that can be helpful, which I know might be new to some people, is tab completion. If I have my cursor in a cell and I start typing — say I type "t" and then hit Tab — it'll show me all the possible completions of "t" that refer to something already in my environment: here, table and table_ref, and you can see these are objects we created earlier. For a specific library — say pd, so I'm looking in the pandas library — if I hit Tab here... I'm not seeing anything. I should see all of the — did I not import it as pd? I did import it as pd; why can't I see my commands? Interesting... oh, it just took a minute — it did eventually come up, I just wasn't patient enough. So here I can see all the things I can do with this table object — the different methods that have been implemented for this particular class. Again, very object-oriented: the client library was built by software engineers, which is good, because that's who you'd want building something like that. This time I'll wait for it to finish — it's very software-engineer-y. You can see in pandas here are all my different methods. In Python, functions are lowercase, and classes — things you'd use to create an object — are usually uppercase. Yeah, I know that's invalid syntax. So that's one tip that can be helpful if you only remember part of a command: use tab completion. All right — Munira asks: can someone join who didn't register? Yeah, feel free. All of the courses are on the Intro to SQL course page on Kaggle; all of the notebooks and livestreams will be there every day. Also, if you register now, we'll send you emails for the next three events — you just won't get emails for this one. All right: can we easily connect Python to SQL? For BigQuery specifically I'd use the client library, but
there's also a specific set of libraries that make it a little more streamlined — there are lots of options, basically, for writing SQL from Python. When is the next course? Good question. We wanted to give you a little time to finish up each of the courses in turn, because I know people are busy. The next one will be July 16th through 18th — again Tuesday through Thursday — and the one after that is August 7th: BigQuery machine learning. We're going to be writing code to run machine learning inside the SQL database. So instead of reading in your data, doing stuff in pandas or scikit-learn or whatever you're using — pulling data from the database to run through your pipeline — you actually build your pipeline in the database, which I think is super cool. And then we have an advanced SQL course that's not on my calendar, apparently, but it's definitely going to happen. Someone shares some examples of Python SQL libraries: sqlite3, SQLAlchemy, and pyodbc. Again, SQL is a very commonly used tool, so there are lots and lots of things people have built around it to make it a little easier to use. Are there more summer courses from Kaggle? This summer we're doing specifically SQL-based stuff; if people enjoy it, we'll do more in the future. Lots of questions about how the exercises are evaluated: there's a little exercise-checking system, so you can see whether your answers are correct or not. It is, I'll tell you, a little bit finicky — it's looking for pretty specific answers — so if you're having trouble with it, I'd post on the Learn forums: if you go to Kaggle, they're under Discussion and then Learn (the orange one, for those of you who can see color). We'll be monitoring those throughout the week and making sure to answer your questions there. Suman asks: how can SQL do ML jobs? We'll talk about that on August 7th — stay tuned.
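Of the Python SQL libraries mentioned above, sqlite3 ships with the standard library, so here's a minimal sketch of connecting Python to SQL with no external service at all. The visits table and its contents are invented for illustration:

```python
import sqlite3

# sqlite3 is in the Python standard library, so this runs with no
# external database, no credentials, and no network
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE visits (day TEXT, n INTEGER)")
cur.executemany("INSERT INTO visits VALUES (?, ?)",
                [("mon", 10), ("tue", 20), ("wed", 5)])

# A query is just a command (or series of commands) that returns data
total = cur.execute("SELECT SUM(n) FROM visits").fetchone()[0]
print(total)  # 35
conn.close()
```

SQLAlchemy and pyodbc follow the same connect-then-execute shape but can talk to many different database backends instead of just SQLite files.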
Manu asks: can you make it public? Oh, is it private? Oh no, I'm sorry — I failed to keep making it public, that's my bad. Okay, now it should be public; sorry about that. Smith asks: is the course finished? Yep, pretty much. That's what I wanted to cover — just getting you started on how to get to the table so you can run queries against it. And queries — I don't think I've actually defined this, I've just been saying the word "query": a SQL command, or series of commands, that returns a piece of data is called a query. That's all it is — like a little script, or like a cell of a notebook: a bunch of code that you run together, and it sends you something back. How do you access the contents of a table? That's what we'll talk about for the rest of the week. All right, I think I've got time for maybe one or two more questions. Salim asks: what can you not do with pandas that you can do with SQL, once you've imported the dataset into a DataFrame? Not much — pandas replicates most of the functionality of SQL. The benefit of SQL is that you don't have to have your data locally, so if your dataset is very big, you're probably going to want it in a database. The functionality is pretty similar, the syntax is a little different, and they're just designed to work at different scales. Also, SQL tends to be a little faster at things like scanning through databases and grabbing specific pieces of information than Python would be for the same tasks on the same-size data — so again, it's good for enterprise stuff because it's a little faster. Is BigQuery more than enough to handle real-life data that can be used to make an ML model? Oh yeah. Again, I'm not super familiar with other database platforms, but I don't know how big a database would have to be to be too big for BigQuery —
it would have to be extremely large, like petabytes. I don't know the theoretical upper bound of BigQuery, but I certainly have never — and probably will never — work with a dataset that's too big for it. Brandon asks whether it's possible to offload the processing in the query — though I think that's a comment on, or a reply to, another comment. Someone asks: should you use Redshift or BigQuery? If your company uses Redshift, use Redshift; if it uses BigQuery, use BigQuery. Do not set up a new database, because then it's going to be your problem, and that's nobody's idea of a fun time. That's what a DBA is for — sorry, a database administrator — somebody whose full-time job is working with databases. If you're a data scientist, you don't need to do that to yourself; that's what friends are for, that's what your colleagues are for — you're too busy being frustrated by math. And this should be public now, hopefully, so you can see it. Fran asks: how do we add data? You're not actually going to be able to add data to the public BigQuery datasets, because they're write-protected. If you're adding data to your own dataset and you're using BigQuery, I'd check out the BigQuery documentation — there's a lot of documentation online — and if you're having trouble working locally, I'd recommend Stack Overflow, because you may not get an answer on Kaggle. Thank you so much, everybody. Is this video recorded? Yes, and it should be available immediately after we stop recording, so you can watch it back whenever you like. All right, someone asks: how are we accessing the database through Kaggle if we're essentially adding the dataset to the notebook itself? That's actually a really good question. Usually when you add a dataset on Kaggle, you're loading the whole dataset into your local file system. When you add a BigQuery dataset to your Kaggle kernel, you're adding a connection to the
BigQuery dataset; you're not actually reading the whole dataset into your file system. These datasets are really, really big — many terabytes — and you wouldn't want to load that into a virtual machine just to query it. You want to make sure you have that connection, and again, we take care of it for you, because if you're learning to write SQL queries, I personally don't think you should also have to learn how to build and design databases — but that's just me. All right, thank you so much, everybody. We'll be back tomorrow at the same time, 9:00 a.m. Pacific, which I think is 10:30 p.m. IST and 4:00 p.m. GMT. This will be recorded, and at the end of the week we'll send you a link to all the videos as well, if you're getting the emails. And something I'll mention in the email tomorrow: we're having superlatives — little camp awards for the best question and the best answer — and if you want to be in consideration for those, just post questions or answers on the Learn forums and we'll pick some of our favorites. All right, thank you so much for joining, everybody. I hope you found this helpful — I know a lot of database content is written as if you already knew what databases are and what they're for, and I hope this helped bridge the gap for some of you. I'll see you tomorrow. All right, I'll see you guys then — bye!
Info
Channel: Kaggle
Views: 31,443
Rating: 4.9654746 out of 5
Keywords: data science, deep learning, nlp, neural networks, nlu, natural language, python, programming, coding, machine learning, ai, artificial intelligence, kaggle, research, technology, reading group
Id: jYQoQfFzJRw
Channel Id: undefined
Length: 62min 22sec (3742 seconds)
Published: Tue Jun 25 2019