CS50 2020 - Lecture 7 - SQL

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] so [Music] [Music] all right this is cs50 and this is week seven and today's focus is going to be entirely on data the process of collecting it the process of storing it the process of searching it and so much more you'll recall that last week we started off by playing around with a relatively small data set we asked everyone for what their preferred house at hogwarts might be and then we proceeded to analyze that data a little bit using some python and counting up how many people wanted gryffindor slytherin or the others as well and we ultimately did that by using a google form to collect it and we stored all of the data in a google spreadsheet which we then exported of course as a csv file so this week we thought we'd collect a little more data and see what kinds of problems arise when we start using only a spreadsheet or in turn a csv file to store the data that we care about so in fact if you could go ahead and go to this url here that you see you should see another google form this one asking you some different questions all of us probably have some preferred tv shows now more than ever perhaps and what we'd like to do is ask everyone to input into that form their favorite tv show followed by the genre or genres into which that particular tv show falls so go ahead and take a moment to do that and if you're unable to follow along at home what folks are looking at is a form quite like this one here whereby we're just asking them for the title of their preferred tv show and the genre or genres of that specific tv show all right so let's go ahead and start to look at some of this data that's come in here is the resulting google spreadsheet that google forms has created for us and you'll notice that by default google forms this particular tool has three different columns at least for this form one is a time stamp and google automatically gives us that based on what day and time everyone was buzzing in with their responses then they have a header row beyond that for title and genres i've manually bold faced it in advance just to make it stand out but you'll notice that the headings here title and genres perfectly matches the question that we asked in the google form that allows us to therefore line up your responses with our questions and you can see here punisher was the first favorite tv show to be inputted followed by the office breaking bad new girl archer another office and so forth and in the third column under genres you'll see that there's something curious here while some of the cells that is the little boxes of text have just single words like comedy or drama you'll notice that some of them have a comma separated list and that comma separated list is because some of you checked as you could multiple check boxes to indicate that uh breaking bad is a crime genre drama and also thriller and so the way google forms handles this is a bit lazily in the sense that they just drop all of those values as a comma separated list inside of the spreadsheet itself and that's potentially a problem if we ultimately download this as a csv file comma separated values because if now you have commas inside in between the commas fortunately there's a solution to that that we'll ultimately see so we've got a good amount of data here in fact if i keep scrolling down we'll see a few hundred responses now and would be nice to analyze this data in some way and figure out what the most popular tv show is uh maybe search for new shows i might like via their genre so you can imagine some number of queries that could be answered by way of this data set but let's first consider the limitations of leaving this data in just a spreadsheet like this all of us are probably in the habit of using occasionally google spreadsheets apple numbers microsoft excel or some other tool so let's consider what spreadsheets are good at and what they are bad at would anyone like to volunteer an answer to the first of those what is a spreadsheet good at or good for yeah andrew what's your thinking on spreadsheets they're very good for quickly sorting okay very good for quickly sorting i like that i could click on the top of the title column for instance and immediately sort all of those titles by alphabetically i like that other reasons to use a spreadsheet what problems do they solve what are they good at other thoughts on spreadsheets yeah how about peter storing large amounts of data that you can later analyze okay so storing large amounts of data that you can later analyze it's kind of a nice model for storing lots of rows of data so to speak i will say that there actually is a limit and in fact back in the day i learned what this limit is long story short in graduate school i was using a spreadsheet to analyze some research data and at one point i had more data than excel supported rows 4. specifically i had some 65 536 rows which was too many at that point for excel at the time because long story short if you recall from a spreadsheet program like google spreadsheets every row is numbered from one on up well unfortunately at the time microsoft had used a 16-bit integer 16 bits or two bytes to represent each of those numbers and it turns out the 2 to the 16th power is roughly 65 000 so at that point i maxed out the total number of rows now to peter's point they've increased that in recent years and you can actually store a lot more data so spreadsheets are indeed good at that but they're not necessarily good at everything because at some point you're going to have more data potentially in a spreadsheet then your mac or pc can handle in fact if you're actually trying to build an application whether it's twitter or instagram or facebook or anything of that scale those companies are certainly not storing their data suffice it to say in a spreadsheet because there would just be way too much data to use and no one could literally open it on their computer so we'll need a solution to that problem of scale but i don't think we need to throw out what works well about spreadsheets so you can store indeed a lot of data in row form but it would seem that you can also store a lot of data in column form and even though i'm only showing columns a b and c of course you've probably used spreadsheets where you add more columns d e f and so forth so what's the right mental model for how to think about rows versus columns in a spreadsheet i feel like we probably use them in a somewhat different way conceptually we might think about them a little differently what's the difference between rows and columns in a spreadsheet sophia um adding more entries like adding more data is those are within the rows but then like the actual attributes or characteristics of the data should be in columns exactly when you add more data to the spreadsheet you should really be adding to the bottom of it adding more and more rows so these things sort of grow vertically even though of course that's just a human's perception of it they grow from top to bottom by adding more and more rows but to sophia's point your columns represent what we might called attributes or fields uh or any other such characteristic that kind of is a type of data that you're storing so in this case of our form timestamp is the first column title is the second column genres is the third column and those columns can indeed be thought of as fields or attributes properties of your data and those are properties that you should really decide on in advance when you're first creating the form in our case or when you're manually creating the spreadsheet in your in another case you should not really be in the habit when using spreadsheets be in the habit of adding data from left to right adding more and more columns unless you decide to collect more types of data so just because someone adds a new favorite tv show to your data set you shouldn't be adding that from left to right in a new column you should indeed be adding it from top to bottom but suppose that we actually decided to collect more information from everyone maybe that form had instead asked you for your name or your email address or any other questions those properties or attributes or fields would belong as new columns so this is to say we generally decide on the the layout of our data the schema of our data in advance and then from there on out we proceed to add add add more rows not columns unless we change our mind and need to change the schema of our particular data so it turns out that spreadsheets are indeed wonderfully useful to peter's point for you know large or reasonably large data sets that we might collect and we can of course per last week export those data sets as csv files and so we can go from a spreadsheet to a simple text file rep stored in ascii or unicode more generally on your own hard drive or somewhere in the cloud and you can actually think of that file that dot csv file is what we might call a flat file database a database is generally speaking a file that stores data or it's a program that stores data for you and all of us have probably thought about or use databases in some sense you're probably uh familiar with the fact that all of those same big websites google and twitter and facebook and others use databases to store our data well those databases are either just really big files containing lots of data or special programs that are storing our data for us and a flat file is just referring to the fact that it really is a very simple design in fact years ago decades ago humans decided when storing data in simple text files that if you want to store different types of data like to sophia's point different properties or attributes well let's keep it simple let's just separate those columns with commas in our flat file database aka a csv you can use other things you can use tabs there's things called tsvs for tab separated values and frankly you can use anything you want but there is a corner case and we've already seen a preview of it what if your actual data has a comma in it what if the title of your favorite tv show has a comma what if google is presuming to store genres as a comma separated list bad things can happen if using a csv as your flat file database but there's solutions to that and in fact what the world typically does is whenever you have commas inside of your csv file you just make sure that the whole string is double quoted on the far left and far right and anything inside of double quotes is not mistaken thereafter as delineating a column as the other commas in the file might so that's all that's meant by a flat file database and csv is perhaps one of the most common the most common formats thereof if only because all of these programs like google spreadsheets and excel and numbers allow you to save your files as csvs now long story short those of you who have used fancier features of spreadsheets like built-in functions and formulas and those kinds of things those are built-in and proprietary to google spreadsheets and excel and numbers you cannot use formulas in a csv file or a tsv file or in a flat file database more generally you can only store static that is unchanging values so when you export the data what you see is what you get and that's why people use fancier programs like excel and numbers in google spreadsheets because you get more functionality but if you want to export the data you can only get indeed the raw textual data out of it but i dare say that's going to be okay in fact brian do you mind if i go ahead and download this spreadsheet as a csv file now yep go ahead all right i'm going to go ahead and google spreadsheets and go to file download and you can see a whole bunch of options pdf web page comma separated values which is the one i want so i'm going to indeed go ahead and choose csv from this drop down in spreadsheets that of course downloaded that file for me and now i'm going to go ahead and go into our familiar cs50 ide you'll recall that last week i was able to upload a file into the ide and i'm going to go ahead and do the same here this week as well i'm going to go ahead and grab my file which ended up in my downloads folder on my particular computer here and i'm going to go ahead and drag and drop this into the ide such that it ends up in my home directory so to speak so now i have this file favorite tv shows forms and in fact if i double click this within the ide you'll see familiar data now timestamp comma title comma genres is our header row that contains the names of the properties or attributes in this file then we've got our timestamps comma favorite title comma and then a comma separated list of genres and here indeed notice that google took care to use double quotes around any values that themselves had commas so it's a relatively simple file format and i could certainly just kind of skim through this figuring out who likes the office who likes breaking bad or other shows but per last week we now have a pretty useful programming language at our disposal python that could allow us to start manipulating and analyzing this data more readily and here to my point last week about using the right tool for the job you could absolutely do everything we're about to do in all weeks prior of cs50 we could have used c for what we're about to do but as you can probably glean c tends to be painful for certain things like anything involving string manipulation changing strings analyzing strings is just a real pain right god forbid you had to take the csv file and load it all into memory not unlike your spell checker you would have to be using malloc all over the place or reallock or the like like there's just a lot of heavy lifting involved in just analyzing a text file so python does all of that for us by just giving us more functions at our disposal with which to start analyzing and opening data so let me go ahead and close this file let me go ahead and create a new one called favorites.pi wherein i'm going to start playing with this data set and see if we can't start answering some questions about it and frankly to this day 20 plus years after learning how to program for the first time i myself am very much in the habit when writing a new program of just starting simple and not solving the problem i ultimately want to but something simpler just as a sort of proof of concept to make sure i have the right plumbing in place so by that i mean this let's go ahead and write a quick program that simply opens up this file the csv file iterates over it top to bottom and just prints out each of the titles just as a quick sanity check that i know what i'm doing and i have access to the data they're in so let me go ahead and import csv and then i can do this in a few different ways but by now you've probably seen or remembered my using something like the open command and the with keyword to sort of open and eventually automatically close this file for me this file is called favorite tv shows dash form responses 1 dot csv and i'm going to open this up in read mode strictly speaking the r is not required you might see examples online not including it that's because read is the default but for parity with c and f open i'm going to be explicit and actually do quote-unquote r and i'm going to go ahead and give this a variable name of file so this line 3 here has the effect of opening that csv file in read-only mode and creating a variable called file via which i can reference it now i'm going to go ahead and use some of that csv functionality i'm going to give myself what we keep calling a reader which i could call it xyz anything else but reader kind of describes what this variable is going to do and it's going to be the return value of calling csv.reader on that file and so essentially the csv library per last week has a lot of fancy features built in and all it needs as input is an already opened text file and then it will then wrap that file so to speak with a whole bunch of more useful functionality like the ability to uh read it column and row at a time all right now i'm going to go ahead and you know what just for now i'm going to skip the first row i'm going to skip the first row because the first row has my headings timestamp title and genres and i know what my columns are so i'm just going to ignore that line for now and now i'm going to do this for row in reader let me go ahead and print out quite simply row and i only want title so i think if it's three columns from left to right it's 0 1 2 so i want to print out column bracket 1 which is going to be the second column 0 indexed all right let me go ahead and save that go down to my terminal window and run python of favorites.pi and cross my fingers okay voila it looks like it flied it flew by super fast but it looks like indeed these are all of the tv shows that folks have inputted indeed there's a few hundred if i keep scrolling up so it looks like my program is working but let's improve it just a little bit it turns out that using the csv reader isn't necessarily the best approach in python many of you have already discovered a dict reader a dictionary reader which is nice because then you don't have to know or keep double checking what number column your data is in you can instead refer it to by the header itself so by title quote unquote or by genres this is also good because if you or maybe a colleague are sort of messing around with the spreadsheet and they rearrange the columns by dragging them left or right any numbers you have used in your code 0 1 2 on up could suddenly be incorrect if your colleague has reordered those columns so using a dictionary reader tends to be a little more robust because it uses the titles not the mere numbers it's still fallible if someone yourself or someone else changes the values in that very first row and renames titles or genres then things are going to break but at that point we kind of have to blame you for not having kept track of your code versus your data but still a risk so i'm going to change this to dictionary reader or dict reader here and pretty much the rest of my code can be the same except i don't need this hack here on line five i don't need to just skip over to the next row from the get go because i now want the dictionary reader to handle the process of reading that first row for me but otherwise everything else stays the same except for this last line where now i think i can now use row as a dictionary not as a list per se and print out specifically the title from each given row so let me go ahead and run python of favorites.pi again and voila it looks like i got the same results several hundred of them but let me stipulate that it's doing the same thing if we actually compared both of those side by side all right before i forge ahead now to actually augment this with new functionality any questions or confusion on this python script we just wrote to open a file wrap it with a reader or dict reader and then iterate over the rows one at a time printing the titles any questions confusion on syntax at all it's okay we've only known or seen python for a week it's fine if it's still quite new anything brian we should address yeah so why is it that you don't need to close the file using the syntax that you're using right here really good question last week i more pedantically used open on its own and then i later used a close function that was associated with the file that i just opened now the more pythonic way to do things if you will is actually to use this with keyword which didn't exist in c and it just tends to be a useful feature in python whereby if you say with open dot dot dot it will open the file for you then it will remain open so long as your code is indented inside of that with keywords block and as soon as you get to the end of your program it will automatically be closed for you so this is one of these features where python in some sense is trying to protect us from ourselves it's probably pretty common for humans myself included to forget to close your file that can create problems with saving things permanently it can create memory leaks as we know from c so the with keyword just assumes that i'm not going to be an idiot and forget to close the file python is going to do it for me automatically other questions or confusions brian how does dict reader know that title is the name of the key inside of the dictionary really good question too so it is designed by the authors of the python language to look at the very first row in the file split it on the commas in that very first row and just assume that the first word or phrase before the first comma is the name of the first column that the second word or phrase after the first comma is the name of the second column and so forth so a dict reader just presumes as is the convention with csvs that your first row is going to contain the headings that you want to use to refer to those columns if your csv happens not to have such a heading whereby it just jumps right in on the first row to real data then you're not going to be able to use a dict reader correctly at least not without some manual configuration all right so let's go ahead and now i feel like there's a whole mess here and you know some of these shows are pretty popular and as i'm glancing over this i definitely see some duplication a whole bunch of you like the office a whole bunch of you like breaking bad game of thrones and a whole bunch of other shows as well so it would be nicer i think if we kind of narrow the scope of our look at this data but just looking at unique values looking at unique values so rather than just iterate over the file top to bottom printing out one title after another why don't we go ahead and sort of accumulate all of this data in some kind of data structure so that we can throw away duplicate values and then only print out the unique titles that we've accumulated so i bet we can do this in a few ways but if we think back to last week's demonstration of our dictionary you'll recall that i used what was called a set and i'm going to go ahead and create a variable called titles and set it equal to something called set and a set is just a collection of values it's kind of like a list but it eliminates duplicates for me and that would seem to be exactly the characteristic that i want for this program now instead of printing each title which is now premature if i want to first filter out duplicates i'm going to go ahead and do this i'm going to go ahead and add to the titles set using the add function the current rows title so again i'm not printing it now i'm instead adding to the title set that particular title and if it's there already no big deal the set data structure in python is going to throw away the duplicates for me and it's only going to go ahead and keep the uniques now at the bottom of my file i need to do a little more work admittedly now i have to iterate over the set to print out only those unique titles so let me do this for title in titles go ahead and print out title and this is where python just gets really user friendly right you don't have to do int i get 0 i less than n or whatever you can just say for title in titles and if the titles variable is the type of data structure that you can iterate over which it will be if it's a list or if it's a set or even if it's a dictionary another data structure we saw last week in python the for loop in python will just know what to do this will loop over all of the titles in the titles sets so let me go ahead and save this file and go ahead now and run python of favorites.pi and it looks like yeah the list is different in some way but i'm seeing fewer results as i scroll up definitely fewer than before because my scroll bar didn't jump nearly as far down but honestly this is kind of a mess let's go ahead and sort this now in c it would have been kind of a pain to sort things we'd have to whip out the pseudo code probably for bubble sort selection sword or god forbid merge sort and then implement it ourselves but no with python comes really the proverbial kitchen sink of functions so if you want to sort this set you know what just say you want it sorted there is a function in python called sorted that will use one of those better algorithms maybe it's merge sort maybe it's something called quick sort maybe it's something else altogether it's not going to use a big o of n squared sort someone at python probably have spent the time implementing a better sort for us but it will go ahead and sort the set for me now let me go ahead and do this again let me increase the size of my terminal window and rerun python of favorites.pi okay and now we have an interesting assortment of shows that's easier for me to wrap my mind around because i have it now sorted here and indeed if i scroll all the way up we should see all of the shows beginning with numbers or a period which might have just been someone playing around followed by the a words the b words and so forth so now it's a little easier to wrap our minds around this but something's up i feel like a lot of you like avatar the last airbender and yet i'm seeing it indeed four different times but i thought we were filtering this down to uniques by using that set structure so what's going on and in fact if i keep scrolling i'm pretty sure i saw more duplicates in here bojack horseman breaking bad breaking bad brooklyn nine-nine brooklyn nine-nine cs50 and several different flavors um and yes uh keeps going friends so i see a lot of duplicate value so what's going on yeah uh kadana yeah so your current sort is case insensitive this sorry is case sensitive meaning that if someone spells avatar with capital a's in some places then it's going to be a different result each time yeah exactly some of you weren't quite diligent when it came to capitalization and so in fact the reality is is kudana notes that there's differences in capitalization now we've addressed this before in fact when you implemented your own spell checker you had to deal with this already when you were spell checking in arbitrary text some words might be capitalized somewhat might be all lowercase all uppercase and you wanted to tolerate different casings and so we probably solved this by just forcing everything to uppercase or everything to lower case and doing things therefore case insensitively so give me just a moment here and i'm going to go ahead and make a quick change to my form here let's go ahead and change this in such a way that we actually force everything to uppercase or lowercase it doesn't really matter which but we need to canonicalize things so to speak in some way and to canonicalize things just means to format all of your data in some standard way so to katana's point let's just standardize the capitalization of things maybe all uppercase all lowercase we just need to make a judgment call so i'm going to go ahead and make a few tweaks here i'm still going to use a set i'm still going to read the csv as before but instead of just adding the title with row bracket title i'm going to go ahead and force it to uppercase just arbitrarily just for the sake of uniformity and then let's go ahead and check what exactly has happened here i'm not going to change anything else but let me go ahead and increase the size of my terminal window rerun python of favorites.pie and voila it's a little harder to read just because i'm not used to reading all caps kind of looks like we're yelling at ourselves but i don't see wait a minute i still see the office over here twice if i keep scrolling here so far i see strangers things and strange stranger things that just looks like a typo i see two sherlocks though this is a little suspicious so kadana you and i don't seem to have solved things fully and this one's a little more subtle what more should i perhaps do to my data to ensure we get duplicates removed olivia maybe trimmer around the edges and we'll trim around the edges i like the sound of that but what do you mean what is that oh like trim off the extra spaces in case someone put a space before or after the words yeah exactly it's pretty common for humans intentionally or accidentally to hit the space bar where they shouldn't and in fact i'm kind of inferring that i bet one or more of you accidentally typed sherlock space and then decided nope that's it i'm not typing anything else but that's space even though we can't quite see it obviously is there and when we do a string comparison or when the set data structure does that it's actually going to be noticed when doing those comparisons and therefore they're not going to be the same so i can do this in a few different ways but it turns out in python you can chain functions together which is also too kind of a fancy feature notice what i'm doing here i'm still accessing the titles set i'm adding the following value to it i'm adding the value row bracket title but not quite i'm that is a string or an str in python speak i'm going to go ahead and strip it which means if we look up the documentation for this function to olivia's point it's going to strip off or trim all of the white space to the left all of the white space to the right whether that's the space bar or the enter key or the tab character or a few other things as well it's just going to get rid of leading and trailing white space and then whatever is left over i'm going to go ahead and force everything to uppercase in the spirit of kadana suggestion 2. so we're sort of combining two good ideas now to really massage the data if you will into a cleaner format and this is such a real world reality like humans you and i cannot be trusted to input data the way we are supposed to sometimes it's all lowercase because we're being a little lazy or a little social media like even if we're checking out from amazon and trying to input a valid postal address uh sometimes it's all capitals because uh i can think of a few people in my life who don't quite understand the caps lock thing just yet and so things might be all capitalized instead this is not good for computer systems that require precision to our emphasis in week zero and so massaging data means cleaning it up doing some mutations that don't really change the meaning of the data but canonicalize it standardize it so that you're comparing apples and apples so to speak not apples and oranges well let me go ahead and run this again in my bigger terminal window python of favorites stop high voila and scrolling up up up i think we're in a better place i only see one office now and if i keep scrolling up and up and up i'm seeing typos still but nothing related to white space and i think we have a much cleaner unique list of titles at this point of course if we scroll up i would have to be a lot more clever if i want to detect things like typographical errors it looks like one of you was very diligent about putting f period r period i period and so forth but then got bored at the end and left off the last period but that's going to happen when you're taking in user input we've of course got all these variants of cs50 that's going to be a mess to clean up because now you can imagine having to add a whole bunch of if conditions and else's and l-tips to sort of clean all of that up if we do want to canonicalize all different flavors of cs50 as quote unquote cs50 so this is a very slippery slope like you and i could start writing a huge amount of data just to clean this up but that's the reality when dealing with real world data well let's go ahead now and improve this program further do something a little fancier because i now can trust that my data has been canonicalized except for the actual typos or the weird variants of cs50 and the like let's go ahead and figure out what's the most popular favorite tv show among the audience here so i'm going to start where i have before with my current code because i think i have most of the building blocks in place i'm going to go ahead and clean up my code a little bit in here i'm going to go ahead and give myself a separate variable now called title just so that i can think about things in a little more orderly fashion but i'm not going to start adding things to this set anymore in fact a set i don't think is really going to be sufficient to keep track of the popularity of tv shows because by definition the set is throwing away duplicates but the goal now is kind of the opposite i want to know which are the duplicates so that i can tell you that this many people like the office this many people like breaking bad and the like so what tools do we have in python's toolkit via which we could accumulate or figure out that information any thoughts on what data structure might help us here if we want to figure out show popularity show popularity and by popularity i just mean the frequency of it in the csv file santiago um i guess one option could be to use dictionaries so that you could have like the office i don't know 20 votes and then game of thrones another one so that a dictionary could really help you visualize that yeah perfect instincts recall that a dictionary at the end of the day no matter how sophisticated it's implemented underneath the hood like your spell checker it's just a collection of key value pairs and indeed it's maybe one of the most useful data structures in any language because this ability to associate one piece of data with another is just a very general purpose solution to problems and indeed to santiago's point if the problem at hand is to figure out the popularity of shows well let's make the keys the titles of our shows and the frequencies thereof the votes so to speak the values of those keys we're going to map title to votes title to vote title to vote and so forth so a dictionary is exactly that so let me go ahead and scroll up and i can make a little tweak here instead of a set i can instead say dict and give myself just an empty dictionary there's actually shorthand notation for that that's a little more common to use two empty curly braces that just means the exact same thing give me a dictionary that's initially empty there's no fancy shortcut for a set you have to literally type out set open paren close paren but dictionaries are so common so popular so powerful they have this little syntactic shortcut of just two curly braces open and close so now that i have that let me go ahead and do this inside of my for loop instead of printing the title which i don't want to do and instead of adding it to the set i now want to add it to the dictionary so how do i do that well if my dictionary is called titles i think i can essentially do something like this titles bracket title equals or maybe plus equals one maybe i can kind of use the dictionary as just a little cheat sheet of counts numbers that start at zero and then just add one add two add three so every time i see the office the office the alphas do plus equals one plus equals one we can't do plus plus because that's not a thing in python it only exists in c but this would seem to go into the dictionary called titles look up the key that matches this specific title and then increment whatever value is there by one but i'm gonna go ahead and run this a little naively here let me go ahead and run python of favorites dot pi and wow or it broke already on line nine so it's sort of an apt uh choice of show to begin with we have a key error with punisher so punisher is bad something bad has just happened but what does that mean a key error is referring to the fact that i tried to access an invalid key in a dictionary this is saying that literally in this line of code here even though titles is a dictionary and even though the value of title singular is quote unquote punisher i'm getting a key error because that title does not yet exist so even if you're not sure of the python syntax for fixing this problem what's the uh intuitive solution here i cannot increment the frequency of the punisher because punisher is not in the dictionary it almost feels like a catch-22 uh great uh i think that you need first of all to create a for loop and maybe assign a value to every thing in the dictionary for example the value zero and then at one yeah so good instincts and here i can use another metaphor i worry we might have a chicken in the egg problem there because i don't think i can go to the top of my code add a loop that initializes all of the values in the dictionary to zero because i would need to know all of the names of the shows at that point now that's fine i think i could take you maybe more literally gray and open up the csv file iterate over it top to bottom and any time i see a title just initialize it in the dictionary as having a value of zero zero zero then have another for loop maybe reopen the file and do the same and that would work but it's arguably not very efficient it is asymptotically in terms of big o but that would seem to be doing twice as much work iterate over the file once just to initialize everything to zero then iterate over the file a second time just to increment the counts i think we can do things a little more efficiently i think we can achieve not only correctness but better design any thoughts on how we can still solve this problem without having to iterate over the whole thing twice yeah some of it um i think we can add in an if statement to check if that key is in the dictionary and if it's not then add it and then go ahead and increment the value after nice and we can do exactly that so let's just apply that intuition if the problem is that i'm trying to access a key that does not yet exist well let's just be a little smarter about it and to some of its point let's check whether the key exists and if it does then increment it but if it does not then and only then to grid's advice initialize it to zero so let me do that let me go ahead and say if title in titles which is the very pythonic beautiful way of asking a question like that way cleaner than in c let me go ahead then and say uh exactly the line from before else though if the that title is not yet in the dictionary called titles well that's okay too i can go ahead and say titles bracket title equals zero so the difference here is that i can certainly inc i can certainly index into a dictionary using a key that doesn't exist if i plan at that moment to give it a value that's okay and that has always been okay since last week but however if i want to go ahead and increment the value that's there i'm going to go ahead and do that in this separate line but i did introduce a bug i did introduce a bug here i think i need to go one step further logically i don't think i want to initialize this to zero per se does anyone see a subtle bug in my logic here if the title is already in the dictionary i'm incrementing it by one otherwise i'm initializing it to zero any subtle catches here yeah olivia what do you see i think you should initialize it to one since it's the first instance exactly i should initialize it to one otherwise i'm accidentally overlooking this particular title and i'm going to go ahead and under count it so i can fix this either by doing this or frankly if you prefer i don't technically need to use an if else i can use just an if by doing something like this instead i could say if title not in titles then i could go ahead and say titles bracket title get zero and then after that i can blindly so to speak just do this so which one is better i think this second one is maybe a little better in that i'm saving one line of code but it's ensuring with that if condition to someone's advice that i'm not indexing into the titles dictionary until i'm sure that the title is in there so let me go ahead and run this now python of favorites dot pi enter and okay it didn't crash so that's good but i'm not yet seeing any useful information but i now have access to a bit more let me scroll down now to the bottom of this program where i have now this loop let me go ahead and print out not just the title but the value of that key in the dictionary by just indexing into it here and you might not have seen the syntax before but with print you can actually pass in multiple arguments and by default print will just separate them with a space for you you can override that behavior and separate them with anything but this is just meant to be a quick and dirty program that prints out titles and now the popularity thereof so let me run this again python of favorites.pai and voila it's kind of all over the place office super popular with 26 votes there a lot of single votes here a lot of big bang theory has nine you know this is all nice and good but i feel like this is going to take me forever to wrap my mind around which are the most popular shows so of course how would we do this well to the point made earlier with spreadsheets my god in microsoft excel or google spreadsheets or apple numbers you just click the column heading and boom sort it we seem to have lost that capability unless we now do it in code so let me do that for us let me go ahead and go back to my code and it looks like sorted even though it does work on dictionaries is actually sorting by key not by value and here's where our python programming techniques need to get a little more sophisticated and we want to introduce another feature here now of python which is going to solve this problem specifically but in a pretty general way so if we read the documentation for sorted the sorted function indeed sorts sets by the values they're in it sorts lists by the values they're in it sorts dictionaries by the keys they're in because dictionaries have two pieces of information for every element it has a key and a value not just a value so by default sorted sorts by key so we somehow have to override that behavior so how can we do this well it turns out that the sorted function takes another optional argument literally called key and the key argument takes as its value the name of a function and this is where things get really interesting if not confusing really quickly it turns out in python you can pass around functions as arguments by way of their name and technically you can do this in c it's a lot more syntactically involved but in python it's very common in javascript it's very common in a lot of languages it's very common to think of functions as first class objects which is a fancy way of saying you can pass them around just like they are variables themselves we're not calling them yet but you can pass them around by their name so what do i mean by this well i need a function now to sort my dictionary by its value and only i know how to do this and perhaps so let me go ahead and give myself a generic function name just for the moment called f f for function kind of like in math because we're going to get rid of it eventually but let me go ahead and temporarily define a function called f that takes as input a title and then it returns for me the value corresponding to that key so i'm going to go ahead and return titles bracket title so here we have a function whose purpose in life is super simple you give it a title it gives you the count thereof the frequency the popularity thereof by just looking it up in that global dictionary so it's super simple but that's its only purpose in life but now according to the documentation for sorted what it's now going to do because i'm passing in a second argument called key the sorted function rather than just presume you want everything sorted alphabetically by key it's instead going to call that function f on every one of the elements in your dictionary and depending on your answer the return value you give with that f function that will be used instead to determine to determine the actual ordering so by default sorted just looks at key what i'm effectively doing with this f function is instead returning the value corresponding to every key and so the logical implication of this even though the syntax is a little new is that this dictionary of titles will now be sorted by value instead of by key because again by default it sorts by key but if i define my own key function and override that behavior to return the corresponding value it's the values the numbers the counts that will actually be used to sort this thing all right let's go ahead and see if that's true in practice let me go ahead and rerun python of favorites.pi i should see all the titles and voila conveniently the most popular show seems to be game of thrones with 33 votes followed by friends with 27 followed by the office with 26 and so forth but of course the list is kind of backwards i mean it's convenient that i can see it at the bottom of my screen but really if we're making a list it should really be at the top so how can we override that behavior turns out the sorted function if you read its documentation also takes another optional parameter called reverse and if you set reverse equal to true capital t in python that's going to go ahead and give us now uh the reverse order of that same sword so let me go ahead and maximize my terminal window rerun it again and voila if i scroll back up to the top it's not alphabetically sorted but if i keep going keep going keep going keep going the numbers are getting bigger and voila now game of thrones with 33 is all the way at the top all right so pretty cool and again the new functionality here in python at least is that we can actually pass in functions two functions and leave it to the ladder to call the former so that just complicated just to say but any questions or confusion now on how we are using dictionaries and how we are sorting things in this reverse value-based way any questions or confusion anything in the chat or verbally brian uh looks like all questions are answered here okay then in that case let me point out a common mistake notice that even though f is a function notice that i did not call it there that would be incorrect the reason being we deliberately want to pass the function f into uh the sorted function so that the sorted function can take it upon itself to call f again and again and again we don't want to just call it once by using the parentheses ourselves we want to just pass it in by name so that the sorted function which comes with python can instead do it for us santiago did you have a question yes i was i was going to ask why didn't we put f of title uh so like why we didn't i was going to ask that question specifically oh with the with the parentheses oh okay perfect so uh because that would call the function once and only once we want sorted to be able to call it again and again now here's actually an example as we've seen in the past of a correct solution this is behaving as i intend a list of sorted titles uh from top to bottom in order of popularity but it's a little poorly designed because i'm defining this function f whose name in the first place is kind of lame but i'm defining a function only to use it in one place and my god the function's so tiny it just feels like a waste of keystrokes to have defined a new function just to then pass it in so it turns out in python if you have a very short function whose purpose in life is meant to be to solve a local problem just once and that's it and it's short enough that you're pretty sure you can fit it on one line of code without things wrapping and starting to get ugly stylistically it turns out you can actually do this instead you can copy the code that you had in mind like this and instead of actually defining f as a function name you can actually use a special keyword in python called lambda you can specify the name of an argument for your function as before and then you can simply specify the return value thereafter deleting the function itself so to be clear key is still an argument to the sorted function it expects as its value typically the name of a function but if you've decided that this seems like a waste of extra a waste of effort to define a function then pass the function in especially when it's so short you can do it in a one-liner a lambda function is an anonymous function lambda literally says python give me a function i don't care about its name therefore you don't have to choose a name for it but it does care still about its arguments and its return value so it's still up to you to provide zero or more arguments and a return value and notice i've done that i've specified the keyword lambda followed by the name of the argument i want this anonymous nameless function to accept and then i'm specifying the return value and with lambda functions you do not need to specify return whatever you write after the colon is literally what will be returned automatically so again this is a very pythonic thing to do it's kind of a very clever one-liner even though it's a little cryptic to see for the very first time but it allows you to condense your thoughts into a succinct statement that gets the job done so you don't have to start defining more and more functions that you or someone else then need to keep track of all right any questions then on this and i i'm pretty sure this is as complex or sophisticated as our python code today will get yeah over to us sofia i was wondering why lambda is used as like specifically rather than some other keyword yeah so there's a long history in this and if in fact you take a course on functional programming at harvard it's called cs51 there's a whole etymology behind keywords like this let me defer that one for another time but indeed not only in python but in other languages as well these things have come to exist called lambda functions so they're actually quite commonplace in other languages as well and so python just adopted the term of art mathematically lambda is often used as a symbol for functions and so they borrowed that same idea in the world of programming all right so seeing no other questions let's go ahead and solve a related problem still with some python but that's going to push up against the limits of efficiency when it comes to storing our data in csv files let me go ahead and start let me go ahead and start fresh in this file favorites dot pi all of the code i've written thus far though is on the course's website in advance so you can see the incremental improvement i'm going to go ahead and again import csv at the top and now this let's write a program this time that doesn't just automatically open up the csv and analyze it looking for the total popularity of shows let's search for a specific show in the csv and then go ahead and output the popularity thereof and i can do this in a bunch of different ways but i'm going to try to make this as concise as possible i'm first going to ask the user for to input a title i could use cs50's getstring function but recall that it's pretty much the same as python's input function so i'm going to use python's input function today and then i'm going to go ahead and as before open up that same csv called favorite tv shows form responses 1 dot csv in read-only mode as a variable called file i'm then going to give myself a reader and i'll use a dict reader again so i don't have to worry about knowing which columns things are in passing in file and then let's see if i only care about one title i can keep this program simpler i don't need to figure out the popularity of every show i just need to figure out the popularity of one show the title that the human has typed in so i'm going to go ahead and give myself a very simple int called counter and set it equal to zero i don't need a whole dictionary just one variable suffices now and i'm going to go ahead and iterate over the rows in the reader as before and then i'm going to say if the current rows title equals equals the title the human typed in let's go ahead and increment counter by one and it's already initialized because i did that on line seven so i think i'm good and then at the end of this program let's very simply print out the value of counter so the purpose of this program is to prompt the user for a title of a show and then just report the popularity thereof by counting the number of instances of it in the file so let me go ahead and run this with python of favorites.pi enter uh let me go ahead and type in the office enter and 19. now i don't remember exactly what the number was but i remember the office was more popular than that i'm pretty sure it was not 19. any intuition as to why this program is buggy or so it would seem a few people in the chat are saying you need to remember to deal with capitalization and white space again yeah so we need to practice those same lessons learned from before so i should really canonicalize the input that the human i just typed in and also the input that's coming from the csv file perhaps the simplest way to do this is up here to first strip off leading and trailing white space in case i get a little sloppy and hit the space bar where i shouldn't and then let's go ahead and force it to uppercase just because it doesn't matter if it's upper or lower but at least we'll standardize things that way and then when i do this look at the current rows title i think i really need to do the same thing if i'm going to canonicalize one i need to canonicalize the other and now compare the all caps white space script versions of both strings so now let me rerun it now i'm going to type in the office enter and voila now i'm at 26 which i think is where we were at before and in fact now i the user can be a little sloppy i can say the office i can run it again and say the office and then for whatever reason hit the space bar a lot enter it's still going to work and indeed though we seem to be like belaboring the penantic here with uh trimming off white space and so forth just think in a relatively small audience here how many of you accidentally hit the space bar or capitalize things differently this happens massively on scale and you can imagine this being important when you're tagging friends in some social media account you're doing at brian or the like you don't want to have to require the user to type at capital b lowercase r i a n and so forth so tolerating disparate messy user input is such a common problem to solve including in today's apps that we all use all right any questions then on this program which i think is correct then let me ask a question of you in what sense is this program poorly designed in what sense is this program poorly designed this is more subtle but think about the running time of this program in terms of big o what is the running time of this program if the csv file has n different shows in it or n different submissions so n is the variable in question yeah what's the running time andrew yeah it's big o of n because i'm literally using linear search by way of the for loop that's how a for loop works in python just like in c starts at the beginning and potentially goes all the way till the end and so i'm using implicitly linear search because i'm not using any fancy data structures no sets no dictionaries i'm just looping from top to bottom so you can imagine that if we surveyed not just all of the students here in class but maybe everyone on campus or everyone in the world maybe we're internet movie database imdb there could be a huge number of votes and a huge number of shows and so writing a program whether it's in a terminal window like mine or maybe on a mobile device or maybe on a web page for your laptop or desktop it's probably not the best design to constantly loop over all of the shows in your database from top to bottom just to answer a single question it would be much nicer to do things in log of end time or in constant time and thankfully over the past few weeks both in cnn and python we have seen smarter ways to do this but i'm not practicing what i've preached here and in fact at some point this notion of a flat file database starts to get too primitive for us flat file databases like csv files are wonderfully useful when you just want to do something quickly or when you want to download data from some third party like google in a standard portable way portable means that it can be used by different people on different systems csv is about as simple as it gets because you don't need to own microsoft word or apple numbers or any particular product it's just a text file so you can use any text editing program or any programming language to access it but flat file databases aren't necessarily the best uh structure to use ultimately for larger data sets because they don't really lend themselves to more efficient queries so csv files pretty much at best you have to search top to bottom left or right but it turns out that there are better databases out there generally known as relational databases that instead of being files in which you store data they are instead programs in which you store data now to be fair those programs use a lot of ram memory where they actually store your data and they do certainly persist your data they keep it long-term by storing your data also in files but between you and your data there is this running program and if you've ever heard of oracle or mysql or postgres or sql server or microsoft access or bunches of other popular products both commercial and free and open source alike relational databases are so similar in spirit to spreadsheets but they are implemented in software and they give us more and more features and they use more and more data structures so that we can search for data insert data delete data update data much much more efficiently than we could if just using something like a csv file so let's go ahead and take our five-minute break here and when we come back we'll look at relational databases and in turn a language called sql all right we are back and the goal at hand now is to transition from these fairly simplistic flat file databases to a more proper relational database and relational databases are indeed what power so many of today's mobile applications web applications and the like now we're beginning to transition to real-world software with real-world languages at that and so now uh let me introduce what we're going to call sql lite so it turns out that a relational database is a database that stores all of the data still in rows and columns but it doesn't do so using spreadsheets or sheets it instead does so using what we're going to call tables so it's pretty much the same idea but in with tables do we get some additional functionality with those tables we'll have the ability to search for data update data delete data insert new data and the like and these are things that we absolutely can do with spreadsheets but in the world of spreadsheets if you want to search for something it's you the human doing it by manually clicking and scrolling typically if you want to insert data it's you the human typing it in manually after adding a new row if you want to delete something it's you right clicking or control clicking and deleting a whole row or updating the individual cells they're in with sql structured query language we have a new programming language that is very often used in conjunction with other programming languages and so today we'll see sql used on its own initially but we'll also see it in the context of a python program so a language like python can itself use sql to do more powerful things than python alone could do so with that said sql lite is like a light version of sql it's a more user-friendly version it's more portable it can be used on macs and pcs and phones and laptops and desktops and servers but it's incredibly common in fact in your iphone and your android phone many of the applications you are running today on your own device are using sql lite underneath the hood so it isn't a toy language per se it's instead of relatively simple implementation of a language generally known as sql but long story short there's other implementations of relational databases out there and i rattled off several of them already oracle and mysql and postgres and the like those all have slightly different flavors or dialects of sql so sql is like a fairly standard language for interacting with databases but different companies different communities have kind of added or subtracted their own preferred features and so the syntax you use is generally constant across all platforms but we will standardize for our purposes on sql lite and indeed this is what you would use these days in the world of mobile applications so it's very much germaine there so with sql lite we're going to have ultimately the ability to query data and update data delete data and the like but to do so we actually need a program with which to interact with our database so the way sql lite works is that it stores all of your data still in a file but it's a binary file now that is it's a file containing zeros and ones and those zeros and ones might represent text they might represent numbers but it's a more compact efficient representation than a mere csv file would be using ascii or unicode so that's the first difference sqlite uses a single file a binary file to store all of your data and represented inside of that file by way of all of those zeros and ones are the tables to which i alluded before which are the analog in the database world of sheets or spreadsheets in the spreadsheet world so to interact with that binary file wherein all of your data is stored we need some kind of user facing program and there's many different tools to use but the most uh the the standard one that comes with sqlite is called sqlite3 essentially version 3 of the tool this is a command line tool similar in spirit to any of the commands you've run in a terminal window thus far that allows you to open up that binary file and interact with all of your tables now here again we kind of have a chicken in the egg problem if i want to use a database but i don't yet have a database and yet i want to select data from my database how do i actually load things in well you can load data into a sqlite database in at least two ways one which i'll do in a moment you can just import an existing flat file database like a csv and what you do is you save the csv on your mac or pc and your cs50 ide you run a special command with sqlite3 and it will just load the csv into memory it will figure out where all of the commas are and it will construct inside of that binary file the corresponding rows and columns using the appropriate zeros and ones to store all of that information so it just imports it for you automatically approach two would be to actually write code in a language like python or any other that actually manually inserts all of the data into your database and we'll do that as well but let's start simple let me go ahead and run for instance sqlite3 and this is pre-installed on cs50 ide and it's not that hard to get it up and running on a mac and pc as well i'm going to go ahead and run sqlite3 in my terminal window here and voila you just see some very simple output it's telling me to type period help if i want to see some usage hints but i know most of the commands and will generally give you all of the commands that you might need in fact one of the commands that we can use is dot mode and another is dot import so generally you won't use these that frequently you'll only use them when creating a database for the first time when you are creating that database from an existing csv file and indeed that's my goal at the moment let me take our csv file containing all of your favorite tv shows and load it into sqlite in a proper relational database so that we can do better than for instance big o of n when it comes to searching that data and doing anything else on it so to do this i have to execute two commands one i need to put sql lite into csv mode and that's just to distinguish it from other flat file formats like tsv for tabs or some other format and now i'm going to go ahead and run import then i have to specify the name of the file to import which is the csv and i'm going to go ahead and call my table shows so dot import takes two arguments the name of the file that you want to import and the name of the table that you want to create out of that file and again tables have rows and columns and the commas in the file are going to delineate where those columns begin and end i'm going to go ahead and hit enter it looks like it flew by pretty fast nothing seems to have happened but i think that's okay because now we're going to go ahead and have the ability to actually manipulate that data but how do we manipulate the data we need a new language sql structured query language is the language used by sql lights and oracle and mysql and postgres and bunches of other products whose names you don't need to know or remember anytime soon but sql is the language we'll use to query the database for information and do something with it generally speaking a relational database and in turn sql which is a language by which you can interact with relational databases support four fundamental operations and they're sort of a crude acronym uh pun intended that is just helpful for remembering what those fundamental operations are with relational databases crud stands for create read update and deletes and indeed the acronym is crud crud so it helps you remember that the four basic operations supported by any relational database are create read update delete create means to create or add new data read means to access and load into memory new data we've seen read before with opening files update and delete mean exactly that as well if you want to manipulate the data in your data set now those are generic terms for any relational database those are the four properties typically supported by any relational database in the world of sql there are some very specific commands or functions if you will that implement those four uh functionalities they are create and insert achieve the same thing as create more generally the keyword select is what's used to read data from a database update and delete are the same so it's kind of an annoying inconsistency the acronym or the term of art is crud create read update delete but in the world of sql the authors of the language decided to implement those four ideas by way of these five keywords or functions or commands if you will in the language sequel so what you are looking at as are five of the keywords you can use in this new language called sql to actually do something with your database now what does that mean well suppose that you wanted to manually create a database for the very first time what do you do well back in the world of spreadsheets it's pretty straightforward right you like open up google spreadsheets you go to like file new or whatever and then you just voila you get a new spreadsheet into which you can start creating rows and columns and the like microsoft excel apple number same thing file menu new spreadsheet or whatever and boom you have a new spreadsheet now in the world of sql sql databases are generally meant to be interacted with code however there are graphical user interfaces gui's by which you can interact with them as well but we're going to use code today to do so and programs at a command line it turns out that you can create tables programmatically by running a command like this so if you literally type out syntax along the lines of create table then the name of your table indicated here in lowercase then a parenthesis then the name of your column that you want to create and the type of that column a la c and then comma dot dot some more columns this is generally speaking the syntax you will use to create in this language called sql a new table now this is in the abstract again like table in lowercase is meant to represent the name you want to give to your actual table column in lowercase is meant to be the name you want to give to your own column maybe it's title maybe genres and dot dot just means of course you can have even more columns than that but literally in a moment if i were to type in this kind of command into the terminal window after running the sql light 3 program i could start creating one or more tables for myself and in fact that's what already happened for me this dot import command which is not part of sql this is like the equivalent of a menu option in excel or google spreadsheets dot import just automates a certain process for me and what it did for me is this if i type now dot schema which is another sql light specific command anything that starts with a dot is specific only to sql lite3 this terminal window program notice what's outputted is this by running.import that automatically for me created a table in my database called shows and it gave it three columns timestamp title and genres where did those column names come from come from well they came from the very first line in the csv and they all looked like text so the type of those values was just in assumed to be text text text now to be clear i could have manually typed this out created these three columns in a new table called shows for me but again the dot import command just automated that from a csv but the sql is what we see here create table shows and so forth so that is to say now in this database there is a file or rather there is a table called shows inside of which is all of the data from that csv how do i actually get at that data well it turns out there's other commands recalled not just create but also select it turns out select is the equivalent of read getting data from the database and this one's pretty powerful and the reason that so many data scientists and statisticians use and like using languages like sql they make it relatively easy to just get data and filter that data and analyze that data using new syntax for us today but relatively simple syntax relative to other things we've seen the select command in sql lets you select one or more columns from your table by the given name so we'll see this now in just a moment here how might i go about doing this well let me go ahead and now at my prompt after just clearing the window to keep things neat let me try this out let me go ahead and select let's say title from shows semicolon so why am i doing this well again the conventional format for the select command is to say select then the name of one or more columns then literally the preposition from and then the name of the table from which you want to select that data so if my table is called shows and the column is called title it stands to reason that select title from shows should give me back the data i want now notice a couple of stylistic choices that aren't strictly required but are good style conventionally i would capitalize any sql keywords including select and from in this case and then lowercase anything that's a column name or a table name assuming you created those columns and tables in in fact lowercase there's different conventions out there some people will uppercase some people use something called camel case or snake case or the like but generally speaking i would encourage all caps for sql syntax and lowercase for the column and table names i'm going to go ahead now and hit enter and voila we see rapidly a whole list of values outputted from the database and if you think way back you'll might recognize that this actually happens to be the same order as before because the csv file was loaded top to bottom into this same database table and so what we're seeing in fact is all of that same data duplicates and miscapitalizations and weird spacing and all but suppose i want to see all of the data from the csv well it turns out you can select multiple columns you can select not only title but maybe timestamp was of interest and this one admittedly was capitalized because that's what it was in the spreadsheet that was not something i chose manually so if i just use a comma separated list of column names notice what i can do now it's a little hard to see for us humans because there's a lot going on now but notice that in double quotes on the left there are all of the time stamps which represent the time at which you all submitted your favorite shows and on the right of the comma there's another quoted string that is the title of the show that you liked although sequel light omits the com the quotes if it's just a single word like friends just by convention you know in fact if i want to get all of the columns turns out there's some shorthand syntax for that star is the so-called wildcard operator and it will get me all of the columns from left to right in my table and voila now i see all of the data including all of the genres as well so now i effectively have three columns being outputted all at once here well this is not that useful thus far in fact all i've been doing is really just outputting the contents of the csv but sql is powerful because it comes with other features right out of the box somewhat similar in spirit to functions that are built into google spreadsheets and excel but now we can use them ultimately in our own code so functions like average count distinct lower max min and upper and bunches more these are all functions built into sql that you can use as part of your query to sort of alter the data as it's coming back from the database not permanently but as it's coming back to you so that it's in a format you actually care about so for instance one of my goals earlier was to get back just the distinct the unique titles and we had to write all that annoying code using a set and then add things to the set and then loop over it again right like that was not a huge amount of code but it definitely took us what five ten minutes to get the job done at least in sql you can do all of that in one breath i'm gonna go ahead now and do this select not just title from shows let me go ahead and select distinct title from shows so distinct again is an available function in sql that does what the name says it's going to filter out all of the titles to just give me the distinct ones back so if i hit enter now you'll see a similarly messy list but including no idea someone that doesn't watch tv including an unsorted list of those titles so i think we can probably start to clean this thing up as we did before let me go ahead and now select not just distinct but let me go ahead and uppercase everything as well and i can use upper as another function and notice i'm just nesting things like the output of one function as we've seen in many languages now can be the input to another let me hit enter now and now it's getting a little more canonicalized so to speak because i'm using capitalization for everything but it would seem that things still aren't really sorted it's just the same order in which you inputted them but without duplicates this time so it turns out that sql has other syntax that we can use to make our queries more precise and more powerful so in addition to these kinds of functions that you can use to alter the data that's being shown to you and coming back you can also use these kinds of clauses or syntax in sql queries you can say where which is the equivalent of a condition you can say give select all of this data where something is true or false you can say like where you can say give me data that doesn't isn't exactly this but is like this you can order the data by some column you can limit the number of rows that come back and you can group identical values together in some way so let's see a few examples of this let me go back here and play around now with uh how about the office that was the one we looked at earlier so let me go ahead and select title from shows where title equals uh the office quote unquote semicolon so i've added this where predicate so to speak where uh title equals quote unquote the office so sql is nice similar in spirit to python it's more user-friendly perhaps than c where everything kinda sort of reads like an english sentence even though it's a little more precise and it's a little more succinct let me go ahead and hit enter and voila that's how many of you inputted the office but notice it's not everyone is it we're missing some still it seems that i got back only those of you who typed in literally the office capital t capital o so what if i want to be a little more resilient than that well let me get back any rose where you all typed in office maybe you omitted the um the article the so let me go ahead and say not title equals office but let me go ahead and say where the title is like office but i don't want it to just be office i want to allow for maybe some stuff at the beginning maybe some stuff at the end and even though this seems like a bit of an inconsistency in the context of using like there's another wild card character the percent sign represents zero or more characters to the left and this percent sign represents zero or more characters to the right so it's kind of this catch-all that will now find me all titles that somewhere have o f f o-f-f-i-c-e inside of them and it turns out like is case insensitive so i don't even need to worry about capitalization with like now let me hit enter and voila now i get back more answers and you can really see the messiness now notice up here one of you used lowercase you know that tends to be common when typing things in quickly one of you did it lowercase here and then also gave us an extra white space at the end one of you just typed in office one of you typed in the office again with the space at the end and so there's a lot of variation here and that's why when we forced everything to uppercase and we started trimming things we were able to get rid of a lot of those redundancies well in fact let's go ahead and and order this now so let me go back to selecting the distinct uppercase title so select distinct upper of title from shows and let me now order by which is a new clause the uppercased version of title so now notice there's a few things going on here but i'm just building up more complicated queries similar learn to scratch where we just started throwing more and more puzzle pieces out of problem i'm selecting all of the distinct uppercase titles from the shows table but i'm going to order the results this time by the uppercase version of title so everything's going to be uppercase and then it's going to be sorted a through z hit enter now and now things are a little easier to make sense of notice the quotes are there only when there are multiple words in a title otherwise sequel light 3 doesn't bother showing us but notice here's all the the shows and if we keep scrolling up the p's the ends the m's the l's and so forth it's indeed alphabetized thanks to using order by all right well let's start to solve more similar problems now in sql by writing way less code than we did a bit ago in python suppose i want to actually figure out the counts of these most popular shows so i want to combine all of the identical shows and figure out all of the corresponding counts well let me go ahead and try this let me go ahead and select again um the uppercase version of title but i'm not going to do distinct this time because i want to do that a little differently i'm going to select the uppercase version of title the count of those titles so the number of times a given title appears so count as a new keyword now from shows but now how do i figure out what the count is well if you think about this table as having a lot of titles title title title title title it would be nice to kind of group the identical titles together and then actually count how many such titles we group together and the syntax for that is literally to say group by upper title this tells sql to group all of the uppercase titles together kind of collapse multiple rows into one but keep track of the count of titles after that collapse let me go ahead now and hit enter and you'll see very similar to one of the earlier python programs we wrote all of the titles on the left followed by a comma followed by the count so one of you really likes tom and jerry one of you really likes top top gear if i scroll up though two of you really liked the wire 23 of you here like the office although we still haven't trimmed the issue here so we could still combine that further by trimming white space if we want but now we're getting these kinds of counts well how can i go ahead and order this as we did before let me go ahead here and add order by count of title and then hit semicolon now and now notice just as in python everything is from smallest to largest initially with game of thrones here down on the bottom how can i fix this well it turns out if you can order things in descending order d-e-s-c for short instead of asc which is the default for ascending so if i do it in descending order now i'd have to scroll all the way back up to the a's the very top to see where the lines begin whoops if i scroll all the way back up to the top we'll see where all of the a words begin up here and now if i want to whoops whoops did i do that right sorry i don't want to uh there we go order by count descending now let me go ahead and this is just a little too unwieldy to see let me just limit myself to the top 10 and keep it simple and only look at the top 10 values here voila now i have game of thrones at 33 friends at 26 the office at 23 though i think i'm still missing a few brian do you recall the sequel function for trimming leading and trailing white space i think it's just trim trim okay i myself did not remember so when in doubt google or s brian so let me go ahead and fix this let me go ahead and select uppercase of trimming the title first and then i'm going to group by uh trimming and then uppercasing it there and now enter and voila thank you brian so now we're up to our 26 offices here so in short it took us a little while to get to this point in the story in sql but notice what we've done we've taken a program that took us a few minutes and certainly a dozen or more lines of code and we've distilled it into something that yes is a new language but it's just kind of a one-liner and once you get comfortable with a language like sql especially if you're not even a computer scientist but maybe a data scientist or an analyst of some sort who spends a lot of their day looking at financial information or medical information or really any data set that can be loaded into rows and columns once you start to speak and read sql as a human can you start to express some pretty powerful queries relatively succinctly and boom get back your answer and by using a command line program like sql lite3 you can immediately see the results there albeit it's very simplistic text but as mentioned too there's also some graphical programs out there free and commercial that also supports sql where you can still type these commands and then it will show it to you in a more user-friendly way much like in windows or mac os would by default so any questions now on the syntax or capabilities of select statements one question came in where is the file with this data actually being stored where is the good question where is the file actually being stored so before quitting i can actually save this file as anything i want the file extension would typically be db and in fact brian do you mind just checking what's the syntax for writing the file manually with dot something it would be under dot help i think i think it's dot save if followed by the name of the file dot save so i'll call this shows.db enter if i now will go ahead and open up another terminal window and type our old friend ls you'll see that now i have a csv file i have my python file from before and i have a new file called shows.db which i've created that is the binary file that contains the tables that i the table that i've loaded dynamically in from that csv file any other questions on select queries or what we can do with them yeah a few people are asking about what the runtime of this is yeah really good question what is the runtime i'm going to come back to that question just a little bit if that's okay right now it's admittedly big o of n i've not actually done anything better than we did with our csv file or our python code right now it's still big o of n by default but there's going to be a better answer to that that's going to make it something much more logarithmic so let me come back to that feature when it's time to enable it but in fact let's start to take some steps toward that because it turns out when loading in data we're not always going to have the luxury of just having one big file in csv format that we import and we go about our business we're gonna have to decide in advance how we wanna store the data and what data we wanna store and what the relationships might be across not one single table but multiple tables so let me go ahead and run one other command here that actually introduces the first of a problem let me go ahead and select title from shows where genres equals for instance comedy that was one of the genres and notice that we get back a whole bunch of results but i get i bet i'm missing some i'm skimming through this pretty quickly but i bet i'm missing some because if i check if genre is equals comedy what am i omitting well those of you who checked multiple boxes might have said something is a comedy and a drama or comedy and romance or maybe a couple of other permutations of genres if i'm searching for equality here equals comedy i'm only going to get those favorites from you where you only said my favorite tv show is a a comedy but what about something like uh what if comedy what if we want to do something like like comedy instead and we could say something like well so long as the word comedy is in there then we should get back even more results and let me stipulate that indeed i now have a longer list of results now we have all shows where you checked at least the comedy box but unfortunately this starts to get a little sloppy because recall what the genres column looks like select let me select genres from shows semicolon notice that all of the genres that we loaded into this table from the csv file are a comma separated list of genres that's just the way google forms did it and that's fine for csv purposes that's kind of fine for sql purposes but this is kind of messy like generally speaking storing comma separated lists of values in a sql database is not what you should be doing the whole point of using a sql database is to move away from commas and csvs and to actually store things more cleanly because in fact let me propose a problem notice that suppose i want to search not for comedy but maybe also music like this thereby allowing me to find any shows where the word music is somewhere in the comma separated list there's a subtle bug here and you might have to think back to where we began the form that i select the form that you pulled up i can't show the whole thing here but we started with action adventure animation biography dot dot music musical was also there so distinct a music video versus a musical are two different types of genres but notice my query at the moment what's problematic with this at the moment we would seem to have a bug whereby this query will select not only music but also musical and so this is just where things are getting messy now yeah you know what we could kind of clean this up maybe we could put a comma here so that it can't just be music something it has to be music comma but what if music is the last box that you checked well then it's music nothing there is no comma so now i need to like or things together so maybe i have to do something like we're music like this or or genres like quote-unquote music like this but honestly this is just getting messy like this is poorly designed if you're just storing your data as a comma separated list of values inside of a column and you have to resort to this kind of hack to figure out well maybe it's over here or here or here and thinking about all the permutations of syntax you're doing it wrong you're not using a sql database to its fullest potential so how do we go about designing this thing better and actually load this csv into a database a little more cleanly in short how do we get rid of the stupid commas in the genres column and instead put one word comedy or music or musical in each of those cells so to speak not two not three one only without throwing away some of those genres well let me introduce a few building blocks that'll get us there it turns out when creating your own tables and loading data into a database on your own we're going to need more than just select select of course is just for reading but if we're going to do this better and not just use sqlite 3's built in dot import command but instead we're going to write some code to load all of our data into maybe two tables one for the titles one for the genres we're going to need a little more expressiveness when it comes to sql and so for that we're going to need one the ability to create our own tables and we've seen a glimpse of this before but we're also going to need to see another piece of syntax as well so inserting inserting is another command that you can execute on a sql database in order to actually add data to a database which is great because if i want to ultimately iterate over that same csv but this time manually add all of the rows to the database myself well then i'm going to need some way of inserting and the syntax for that is as follows insert into the name of the table the column or columns that you want to insert values into then literally the word values and then literally in parentheses again the actual list of values so it's a little abstract when we see it in this generic form but we'll see this more explicitly in just a moment here as well so when it comes to inserting something into a database let's go ahead and try this so suppose that um let's see what's what's a show that uh the muppet show like i grew up loving the muppet show it was out in like the 70s and i don't think it was on the list but i can check this for sure so select star from shows where uh title like let's just search for muppets with a wild card and i'm guessing no one put it there good so it's a missed opportunity i forgot to fill out the form i could go back and fill out the form and re-import the csv but let's go ahead and do this manually so let me go ahead and insert into shows what columns title and genres and i guess i could do a time stamp just for kicks and then i'm going to insert what values the values will be well i don't know whatever time it is now so i'm going to cheat there just rather than look up the date and the time the title will be like the muppet show and the genres will be it was kind of a comedy it was kind of a musical so we'll kind of leave it at that semicolon so again this follows the standard syntax here of specifying the table you want to insert into the columns you want to insert into and the values you want to put into those columns and i'm going to go ahead and hit enter now nothing seems to have happened but if i now select that same query oh okay uh it's still nothing because i made a subtle mistake uh not i'm not searching for muppets plural i'm searching for muppet singular the muppet show voila now you see my row in this database and so insert would give us the ability now to insert new rows into the database suppose you want to update uh something maybe you know some of the muppet shows were actually pretty dramatic so how might we do that well i can say update shows set let's see genre is equal to comedy uh drama musical where title equals the muppet show so again i'll pull up the canonical syntax for this in a bit but for now just a little teaser you can update things pretty simply and even though it takes a little getting used to the syntax it kind of does what it says update shows set genres equal to this where title equals that and now i can go ahead and enter if i go ahead and select the same thing just like in a terminal window you can go up and down that's how i'm typing so quickly i'm just going up and down to previous commands voila now i see that the muppet show is a comedy a drama drama and a musical well i i take issue though with one of the more popular shows that was in the list a whole bunch of you liked um let's say friends which i've never really been a fan of and let me go ahead and select a title from shows where title equals friends and maybe i should be a little more rigorous than that i should say title like friends just in case there was different capitalizations enter a lot of you really liked friends in fact how many of you well recall that i can do this i can say count and i can let sequel do the count for me 26 of you i disagree with strongly and there's a couple of you that even added all the dots but we'll deal with you later so suppose i do take issue with this well delete from shows where title equals quote-unquote friends uh actually title like friends let's get them all enter and now if we select this again i'm sorry friends has been cancelled so you can again update the you can execute these fundamental commands of crud create read update and delete by using create or insert by using select by using update literally and delete literally as well and that's about it like even though this was a lot quickly there really are just those four fundamental operations in sql plus some of these add-on features like these additional functions like count that you can use and also some of these keywords like where and the like well let me propose that we now do better if we have the ability to select data and create tables and insert data let's go ahead and write our own python script that uses sql as in a loop to read over my csv file and to insert insert insert insert each of the rows manually because honestly it will take me forever to like manually type out like hundreds of sql queries to import all of your rows into a new database i want to write a program that does this instead and i'm going to propose that we design it in the following way i'm going to have two tables this time represented here with this artist's rendition one is going to be called shows one is going to be called genres and this is a fundamental principle of designing relational databases to figure out the relationships among data and to normalize your data to normalize your data means to eliminate redundancies to normalize your data means to eliminate mentions of the same words again and again and have just single sources of truth for your data so to speak so what do i mean by that i'm going to propose that we instead create a simpler table called shows that has just two columns one is going to be called id which is new the other is going to be called title as before honestly i don't care about timestamps so we're just going to throw that value away which is another upside of writing our own program we can add or remove any data we want for id i'm introducing this which is going to be a unique identifier literally a simple integer one two three all the way up to a billion or two billion however many favorites we have i'm just going to let this auto increment as we go why i propose that we move to another table all of the genres and that instead of having one or two or three or five genres in one column as a stupid comma separated list which is stupid only in the sense that it's just messy right it means that i have to run stupid commands where i'm checking for the comma here the comma there it's very hackish so to speak bad design instead of doing that i'm going to create another table that also has two columns one is going to be called show id and the other is going to be called genre and genre here is just going to be a single word now that column will contain single words for genres like comedy or music or musical but we're going to associate all of those genres with the original show to which they belong per your google form submissions by using this show id here so what does this mean in particular by adding to our first table shows this unique identifier one two three four five six i can now refer to that same show in a very efficient way using a very simple number instead of redundantly having the office the office the office again and again i can refer to it by just one canonical number which is only going to be like four bytes or 32 bits pretty efficient but i can still associate that show with one genre or two or three or more or even none so in this way every row in our current table is going to become one or more rows in our new pair of tables we're factoring out the genres so that we can add multiple rows for every show potentially but still remap those genres back to the original show itself so what is some of the the buzzwords here what's some of the the language to be familiar with well we need to know what kinds of types are at our disposal here so for that let me propose this let me propose that we have this list here it turns out in sql lite there are five main data types and that's a bit of an oversimplification but there's five main data types some of which look familiar a couple of which are a little weird um integer is a thing uh real is the same thing as float so an integer might be a 32 bit or four byte value like one two three or four positive or negative a real number is going to have a decimal point in it a floating point value probably 32 bits by default but those kinds of things the sizes of these types vary by system just like they technically did in c so do they vary by system in the world of sql but generally speaking these are good rules of thumb text is just that it's sort of the equivalent of a string of some length but then in sql lite it turns out there's two other data types we've not seen before numeric and blob but more on those in just a little bit blob is binary large object it means you can store zeros and ones in your database numeric is going to be something that's number like but isn't a number per se it's like a year or a time something that has numbers but isn't just a simple integer at that and then we propose two that sql light is going to allow us to specify two when we create our own columns manually by executing the sql code ourselves we can specify that a column cannot be null thus far we've ignored this but some of you might have taken the fifth and just not given us the title of a show or a genre your answers might be blank uh some of you may be in registering for a website don't want to provide information like where you live or your phone number so a database in general sometimes does want to support null values but you might want to say that it can't be null a website probably needs your email address needs your uh password and a few other fields but not everything and there's another keyword in sql just so you've seen it called unique where you can additionally say that whatever values are in this column must be unique so a data a website might also use that if you want to make sure that the same email address can't register for your website multiple times you just specify that the email column is unique that way you can't put multiple people in with identical email addresses so long story short this is just more of the tools in our sql toolkit because we'll see some of these now indirectly and the last piece of jargon we need before designing our own tables is going to be this it turns out that in sql there's this notion of primary keys and foreign keys and we've not seen this in spreadsheets odds are unless you've been working in the real world for some years and you have fairly fancy spreadsheets in front of you as an analyst or financial person and the like odds are you've not seen keys or unique identifiers in quite the same way but they're relatively simple in fact let me go back to our picture before and propose that when you have two tables like this and you want to use a simple integer to uniquely identify all of the rows in one of the tables that's called technically an id that's what i'll call it by convention you could call it anything you want but id just means it's a unique identifier but semantically this id is what's called a primary key a primary key is the column in a table that uniquely identifies every row this means you can have multiple versions of the office in that title field but each of those rows is going to have its own number uniquely potentially so primary key uniquely identifies each row in another table like genres which i'm proposing we create in just a moment it turns out that you're welcome to refer back to another table by way of that unique identifier but when it's in this context that id is called a foreign key so even though i've called it show id here that's just a convention in a lot of sql databases to imply that this is technically a column called id in a table called show or shows plural in this case so if there's a number one here and suppose that the office has a unique id of one we would have a row in this table called id is one title is the office the office might be in the comedy category the drama category and the romance category so multiple ones therefore in the genres table we want to output three rows the number one one one in each of those rows but the words comedy drama romance in each of those rows respectively so again the goal here is to just design our database better not have these stupid comma separated list of values inside of a single column we want to kind of blow that up explode it into individual rows you might think well why don't we just use multiple columns but again per our principle from spreadsheets you should not be in the habit of adding more and more columns when the data is all the same like genre genre genre right the sort of stupid way to do this in the spreadsheet world would be to have one column called genre1 another column called genre2 another column called genre 3 genre 4. and you can imagine just how stupid and inefficient this is a lot of those columns are going to be empty for shows with very few genres and it's just kind of messy at that point so better in the world of relational databases to have something like a second table where you have multiple rows that somehow link back to that primary key by way of what we're calling conceptually a foreign key all right so let's go ahead now and try to write this code let me go back to my ide let me quit out of sql lite now and let me just move away i'm going to move this away my file for just a moment so that we're only left with our original data let's go about implementing a final version of my python file that does this creates two tables one called shows one called genres and then two in a for loop iterates over that csv and inserts some data into the shows and other data into the genres how can we do this programmatically well there's a final piece of the puzzle that we need we need some way of bridging the world of python and sql and here we do need a library because it would just be way too painful to do without a library it can be cs50 cs50s as we'll see makes this very simple there are other third party commercial and open source libraries that you can also use in the real world as well that do the same thing but the syntax is a little less friendly so we'll start by using the cs50 library which in python recall has functions like get string and get int and get float but today it also has support it turns out for sql capabilities as well so i'm going to go back to my favorites file and i'm going to import not only csv but i'm also going to import from the cs50 library a feature called sql so we have a a variable if you will inside of the cs50 library or rather a function inside of the cs50 library called sql that if i call it will allow me to load a sql lite database into memory so how do i do this well let me go ahead and add a couple of new lines of code let me go ahead and open up a file called shows.db but this time in write mode and then just for kick just for now rather i'm going to go ahead and close it right away this is a pythonic way of creating an empty file it's kind of stupid looking but by opening a file called shows.db in write mode and then immediately closing it it has the effect of creating the file closing the file so i now have an empty file with which to interact i could also do this as an aside by doing this touch shows.db touch kind of a strange command but in a terminal window it means to create a file if it doesn't exist so we could also do that instead but that would be on uh that would be independent of python so once i've created this file let me go ahead and open the file now as a sqlite database i'm going to declare a variable called db for database i'm going to use the sql function from cs50s library and i'm going to open via somewhat cryptic string this sqlite colon slash slash shows.db now it looks like a url http colon slash but it's sql lite instead and there's three uh slashes instead of the usual two but this line of code line six has the result of opening now that otherwise empty file with nothing in it yet as being a sqlite database using cs50s library why did i do that well i did that because i now want to create my first table let me go ahead and execute db.execute so there's a function called execute inside of the cs50 sql library and i'm going to go ahead and run this create table called shows uh the type of which the columns of which are an id which is going to be an integer a title which is going to be text the primary key in which is going to be the id column so this is a bit cryptic but let's see what's happening i seem to now in line 8 be combining python with sql and this is where now like programming gets really powerful fancy cool difficult whatever however you want to perceive it i can actually use one language inside of another how well sql is just a bunch of textual commands up until now i've been typing them out manually in this program called sqlite 3. there's nothing stopping me though from storing those same commands in python strings and then passing them to a database using code the code i'm using is a function called execute and its purpose in life and cs50 staff wrote this is to pass the argument from your python code into the database for execution so it's like the programmatic way of just typing things manually at the sql lite prompt a few minutes ago so that's going to go ahead and create my table called shows in which i'm going to store all of those unique ids and also the titles and then let me do this again db.execute create table genres and that's going to have a column called show id which is an integer also genre which is text and lastly it's going to have a foreign key it's going to wrap onto two it's going to wrap a little long here on show id which references the shows table id all right so this is a lot so let's just recap left to right db execute is my python function that executes any sql i want create table genres create a table called genres the columns in that table will be something called show id which is an integer and genre which is a text field but it's going to be one genre at a time not multiple and then here i'm specifying a foreign key will be the show id column which happens to refer back to the shows tables ids column it's a little cryptic but all this is doing is implementing for us the equivalent of this picture here i could have manually typed both of these sql commands at that blinking prompt but again no i want to write a program now in python that creates the tables for me and now more interestingly loads the data into that that database so let's go ahead and do this now i'm not going to select a title from the user because i want to import everything i'm not going to use any counting or anything like that so let's go ahead and just go inside of my loop as before and this time let's go ahead and for row in reader let's go ahead and get the current title as we've always done but let's also as always go ahead and strip it of white space and capitalize it just to canonicalize it and now i'm going to go ahead and execute db execute quote unquote insert into shows the title column the value of quote-unquote title so i want to put the title here it turns out that sql libraries like ours supports one a final piece of syntax which is a placeholder in c we used percent s in python we just used curly braces and put the word right there in sql we have a third approach to the same problem just syntactically different but conceptually the same you put a question mark where you want to put a placeholder and then outside of this string i'm going to actually type in the value that i want to plug into that question mark so this is so similar to printf in week one but instead of percent s it's a question mark now and then a comma separated list of the arguments you want to plug in for those placeholders so now this line of code 16 has just inserted all of those values into my database and let's go ahead and run this before i go any further let me go ahead and do this i'm going to go ahead now and run python a favorite stop pi and cross my fingers as always it's taking a moment taking a moment that's because there's a decent sized file there or i screwed up take this is taking too long uh oh okay i should have just been more patient all right so it just seems my connection's a little slow so uh as i expected everything is 100 correct and it's working fine so now let's go ahead and see what i actually did if i type ls notice that i have a file called shows.db this is brand new because my python program created it this time let's go ahead and run sqlite3 of shows.db just so i can now see what's inside of it notice that i can do dot schema just to see what tables exist and indeed the two tables that i created in my python code seem to exist but notice that there's if i do select star from shows let's see all the data voila there is a table that's been programmatically created and it has noticed this time no time stamps no genres but it has an id on the left and the title on the right and amazingly all of the ids are monotonically increasing from 1 on up to 513 in this case why is that well one of the features you get in a sql database is if you define a column as being a primary key in sql light it's going to be auto-incremented for you recall that nowhere in my code did i even have a line an integer inputting 1 then 2 then 3. i could absolutely do that i could have done something like this counter uh rather i could have done something like this counter equals one and then down here i could have said uh id comma title give myself two placeholders and then pass in the counter each time i could have implemented this myself and then on each iteration done counter plus equals one but with sql databases as we've seen you get a lot more functionality built in i don't have to do any of that because if i've declared that id as being a primary key sqlite is going to insert it for me and increment it also for me as well alright so if i go back to sql light though notice that i do have ids and titles but if i select star from genres there's of course nothing there yet so how now do i get all of the genres for each of these shows in i need to finish my script so inside of this same loop i have not only the title in my current row but i also have genres in the current row but the genres are separated by commas recall that in the csv next to every title there's a comma separated list of titles so how do genres how do i get at each genre individually well i'd like to be able to say for genre in row bracket genres but this is not going to work because that's not going to be split up based on those commas that's literally just going to iterate over in fact all of the characters in that string as we saw last week but it turns out that strings in python have a fancy split function whereby i can split on a comma followed by a space and what this function will do for me in python is take a comma separated list of genres and explode it so to speak split it on every comma space into a python list containing genre after genre in an actual python list allah square brackets so now i can iterate over that list of individual genres and inside of here i can do db execute insert into genres show id genre the values question mark question mark but huh there's a problem i can definitely plug in the current genre which is this but i need to put something here still for that first question mark i need a value for the show id how do i know what the id is of the current tv show well it turns out the library can help you with this when you insert new rows into a table that has a primary key it turns out that most libraries will return you that value in some way and if i go back to line 15 and i actually store the return value of db execute after using insert my the library will tell me what was the integer that was just used for this given show maybe it's one two three i don't have to know or care as the programmer but the return value i can store in a variable and then down here i can literally put that same id so that now if i am inputting the office whose id is one into the shows table and its genres are comedy drama romance i can now inside of this for loop this nested for loop insert one followed by comedy one followed by drama one followed by romance three rows all at once and so now let's go back down here into my terminal window let me remove the old shows.db with rm just to start fresh let me go ahead and rerun python of favorites.pi i'll be more patient this time because cloud's being a little slow so it's doing some thinking and in fact there's more work being done now at this point in the story my program is presumably iterating over all of the rows in the csv and it's inserting into the shows table one at a time and then it's inserting one or more genres into the genres table it's a little slow if we're on a faster system or if i were doing it on my own mac or pc it would probably go down more quickly but you can see here an example of why i used the dot import command in the first place that automated some of this process but unfortunately it didn't allow me to change the format of my data but the key point to make here is that even though this is taking a little bit of time to insert these hundreds of rows all at once i'm only going to have to do this once and was asked a bit ago was the performance of this it turns out that now that we have full control over the sql database it turns out we're going to have the ability to to actually improve the performance thereof oh okay as expected it finished right on time and let me go ahead now and run sequel light 3 on shows.db alright so now i'm back in my raw sql environment if i do select star from shows which i did before we'll see all of this as before if i select star from shows where title equals quote unquote the office i'll see the actual unique ids of all of those we didn't bother eliminating duplicates we just kept everything as is but we gave everything a unique id but if i now do select star from genres will see all of the values there and notice the key detail there is only one genre per row here and so we can ultimately line those up with our titles and our titles here we had all of these here something's wrong i want to get this right let's go ahead and take our second and final five minute break here and we'll come back and i will explain what's going on all right we are back and just before we broke up my own self-doubt was starting to creep in but i'm happy to say with no fancy magic behind the scenes everything was actually working fine i was just doubting the correctness of this if i do select star from shows i indeed get back two columns one with the unique id the so-called primary key followed by the title of each of those shows and if i similarly search for star from genres i get single genres at a time but on the left hand side are not primary keys per se but now those same numbers here in this context called foreign keys that map one to the other so for instance whatever show 512 is has five different genres associated with it and in fact if i don't go back a moment to shows it looks like game of thrones was decided by one of you as belonging in thriller history adventure action and war as well those five so now this is what's meant by relational database you have this relation or relationship across multiple tables that link some data in one to some other data in the like the catch though is that it would seem a little harder now to answer questions because now i have to kind of query two tables or execute two separate queries and then combine the data but that's not actually the case suppose that i want to answer the question of what are all of the musicals among your favorite tv shows i can't select just the shows because there's no genres in there anymore but i also can't select just the genres table because there's no titles in there but there is a value that's bridging one in the other that foreign key to primary key relationship so you know what i can do off the top of my head i'm pretty sure i can select all of the show ids from the genres table where a specific genre equals quote unquote musical and i don't have to worry about commas or spaces now because again in this new version that i have designed programmatically with code musical and every other genre is just a single word if i hit enter all of these show ids were decided by you all as belonging to musicals but now this is not interesting and i certainly don't want to execute 10 or so queries manually to look up every one of those ids but notice what we can do in sql as well i can nest queries let me put this whole query in parentheses for just a moment and then prepend to it the following select title from shows where the primary key id is in this sub query so you can have nested queries similar in spirit a bit like in python and see when you have nested for loops in this case just like in grade school math whatever's in the parentheses will be executed first then the outer query will be executed using the results of that inner query so if i select the title from shows where the id is in that list of ids voila it seems that somewhat amusingly several of you think that breaking bad supernatural glee sherlock how i met your mother hawaii five oh twin peaks the lawyer and my brother my brother and me are all musicals i take exception to a few of those but so be it you check the box for musical for those shows so even though we've sort of done things we've designed things better in the sense that we've normalized our database by factoring out commonalities or rather we've cleaned up the data there's still admittedly some redundancy there's still admittedly some redundancy but i at least now have the data in clean fashion so that every column has just a single value in it and not some contrived comma separated list suppose i want to find out all of the genres that you all thought the office was in so let's ask kind of the opposite question well how might i do that well to figure out the office i'm going to first need to select the id from shows where title equals quote unquote the office because a whole bunch of you typed in the office and we gave each of your answers a unique identifier so we could keep track of it and there's all of those numbers now this is like dozens of responses i certainly don't want to execute that many queries but i think a sub query will help us out again let me put parentheses around this whole thing and now let me say select distinct genre from genres where the show id in the genres table is in that query and just for kicks let me go ahead and order by uh genre so let me go ahead and execute this and okay somewhat amusingly those of you who inputted the office checked boxes for animation comedy documentary drama family horror reality tv romance and sci-fi i take exception to a few of those too but this is what happens when you accept user input so here again we have with this sql language the ability to express fairly succinctly even though it's a lot of new features today all at once what would otherwise take me a dozen or two lines in python code to implement and god knows how many lines of code and how many hours it would take me to implement something like this in c now admittedly we could do better than this design this table or this picture represents what we have now but you'll notice a lot of redundancy implicit in the genres table anytime you check the comedy box i have a row now that says comedy comedy comedy comedy and the show id differs but i have the word comedy again and again and now that tends to be frowned upon in the world of relational databases because if you have a genre called called comedy or one called musical or anything else you should ideally just have that living in one place and so if we really wanted to be particular and really truly normalize this database which is an academic term referring to removing all such redundancies we could actually do it like this we could have a shows table still with an id and title no difference there but we could have a genres table with two columns id and name now this is its own id it has no connection with the show id it's just its own unique identifier a primary key here now and the name of that genre so you would have one row in this genres table for comedy for drama music musical and everything else and then you would use a third table which is colloquially called a join table which i'll draw here in the middle and you can call it anything you want but we've drawn called it shows underscore genres to make clear that this table implements a relationship between those two tables and notice that in this table is really no juicy data it's just foreign keys show id genre id and by having this third table we can now make sure that the word comedy only appears in one row anywhere the word musical only appears in one row anywhere but we use these more efficient integers called show id and genre id which respectively point to those primary keys in their primary tables to link those two together and this is an example of what's called in the world of databases a many-to-many relationship one show can have many genres one genre can belong to many shows and so by having this third table you can have that many to many relationship and again the third table now allows us to truly normalize our data set by getting rid of all of the duplicate comedy comedy comedy why is this important probably not a huge deal for genres but imagine with my current design if i want if i made a spelling mistake and i misnamed comedy i would now have to change every row with the word comedy again and again or if maybe you change the cat the genres of the shows you have to change it in multiple places but with this other approach with three tables you can argue that now you only have to change the name of a genre in one place not all over the place and that in general in c and now in python and now sql has generally been a good thing not to copy-paste identical values all over the place all right so with that said what other tools do we have at our disposal well it turns out that there are other data types out there in the real world using sql besides just these five blob integer numeric real and text blob again is for binary stuff generally not used except for more specialized applications let's say integer which is an int typically 32 bits numeric which is something like a date or a year or a time or something like that real numbers which are floating point values and text which are things like strings but if you graduate ultimately from sql lite on phones and on macs and pcs to actual servers that run oracle mysql and postgres if you're actually running your own internet style business well it turns out that more sophisticated even more powerful databases come with other subtypes if you will so besides integer you can specify small int for small numbers maybe using just a few bits instead of 32 integer itself or bigint which uses 64 bits instead of 32. the facebooks the twitters of the world need to use big ants a lot because they have so much data you and i can get away with simple integers because we're not going to have more than 4 billion favorite tv shows in a class certainly something like real you can have 32-bit real numbers or a little weirdly named double precision which is like a double was in c using 64 bits instead for more precision numeric is kind of this catch-all you can have not only dates and date times but things like boolean values you can specify the total number of digits to store using this numeric scale and precision so it relates to numbers that aren't just quite integers and then you also have categories of text char followed by a number which specifies that every value in the column will have the same number of characters that's helpful for things where you know the length in advance like in the us all states all 50 states have two digit code or two character codes like m a for massachusetts ca for california char 2 would be appropriate there because you know every value in the column is going to have two characters when you don't know though you can use varchar and varchar specifies a maximum number of characters and so you might specify varchar of like 32 no one might be able to type in a name that's longer than 32 characters or varchar 200 if you want to allow for something even bigger but this is germane to our real world experience with the web if you've ever gone to a website start filling out a form and all of a sudden you can't type any more characters your response is too long why is that well one the programmers just might not want you to keep expressing yourself in more detail especially if it's like a complaint form on a customer service site but pragmatically it's probably because their database was designed to store a finite number of characters and you have hit that threshold and you certainly don't want to have a buffer overflow like in c so the database won't force a maximum value n and then text is for even bigger chunks of text if you're letting people copy paste their resumes or whole documents or even larger sets of text you might use text instead so let's then consider a real-world data set things get really interesting in all of these very academic ideas and recommendations really come into play when we don't have hundreds of favorites but when we have uh thousands instead and so what i'm going to go ahead and do here is download a file here which is a sql sql-lite version of the imdb internet movie database that some of you might have used in website form in order to look up movies and ratings thereof and the like and what we've done in advance is we wrote a script i wrote a script that downloaded all of that information in advance as tsv files it turns out that they internet movie database make it available all of their data available as tsv files tab separated values and we went ahead and imported it with a script called shows.db as follows so i'm going to go ahead in just a moment and open up shows.db which is not the version i created earlier based on your favorites this is now the version that we the staff created in advance by downloading hundreds of thousands of movies and tv shows and actors and directors from imdb.com under their license and then imported into a sqlite database so how can i see what's in here well let me go ahead and type schema recall and you'll see a whole bunch of data therein and in fact in pictorial form it actually looks like this here's a picture that just gives you the lay of the land there's going to be a people table that has an id for every person a name and their birth year there's going to be a shows table just like we've been talking which is ids titles of shows also though the year that the show debuted and the number of episodes that the show had then there's going to be genres similar in design to before so we didn't go all out and factor it out into a third table we just have some duplication here admittedly in genres but then there's a ratings table and here's where you can see where relational databases get interesting you can have a ratings table storing ratings like one to five but also associate those ratings with a show by way of its show id and then you can keep track of the number of votes that that show got writers notice is a separate table and notice this is kind of cool this table per the arrows relates to the shows table and the people table because this is a join table a foreign key of show id and a foreign key of person id refer to the shows table and the people table respectively so that a human uh person can be a writer for multiple shows and one show can have multiple writers another many to many relationship and then lastly stars the actors in a show notice that this too is a join table it's only got two foreign keys a show id and a person id that are referring back to those tables respectively and here's where it really makes sense a relational database it would be pretty stupid and bad design if you had names of all of the directors and names of all of the writers and names of all of the stars of these shows in separate tables in duplicate like steve carell steve carell steve carell all of those actors and directors and writers and every other role in the business are just people at the end of the day so in a relational database the advice would be to put all of those people in a people table and then use primary and foreign keys to refer to to relate them to these other types of tables the catch is though that when we do this it turns out that uh things can be slow when we have lots of data so for instance let me go into this let me go ahead and select star from shows semicolon that's a lot of data it's pretty fast on my mac and i switch from the ide to my mac just to save time because it's a little faster doing things locally instead of in the cloud let me go ahead and count the number of shows in this imdb database by using count thousand 153 hundred thirty one tv shows so that's a lot how about the count of people uh from the people table uh four hundred fifty seven thousand eight hundred eighty six people who might be stars or uh writers or some other role as well so this is a sizable data set so let me go ahead and do something simple though let me go ahead and select star from shows where title equals the office and this time i don't have to worry about weird capitalization or spacing this is imdb this is clean data from an authoritative source notice that there's actually different versions of the office you probably know the uk one and the us one there's other shows that are unrelated to that particular type of show but uh each of them is distinguished notice by the year here all right so that's kind of a lot um and let's do this again let me go ahead and turn on a feature temporarily just to time this query by turning on a timer in this program and let me run it again it looks like it took .0122 seconds of real time to do that search that's pretty fast i barely noticed certainly because it's so fast but let me go ahead and do this let me go ahead and create an index called title index on the table called shows on its title column well what am i doing well to answer the question finally from before about performance by default everything we've been doing is indeed big o of n it's just being linearly searched from top to bottom which seems to call into question the whole purpose of sql if we were doing no better than with csvs but an index is a clue to the database to sort of load the data more efficiently in such a way that you get logarithmic time an index is a fancy data structure that the sql lite database or the oracle database or the mysql database whatever product you're using builds up for you in memory and then it does something using syntax like this that builds in memory generally something known as a b tree we've talked a bit about trees in the class we talked about binary search trees things that kind of look like family trees a bee tree is essentially a family tree that's just very wide and not that tall it's a data structure similar in spirit to what we looked at in sea but it tries to keep all of the leaf nodes all of the children or grandchildren or great grandchildren so to speak as close to the root as possible and the algorithm it uses for that tends to be proprietary or documented based on the system you're using but it doesn't store things in a list it does not store things top to bottom like the tables we view them as underneath the hood those tables that look like very tall structures are actually underneath the hood implemented with fancier things called trees and if we create those trees by creating what they're properly called indexes like this it might take us a moment like 0.098 seconds to create an index but now notice what happens previously when i searched the titles for the office using linear search it took .012 seconds if i do the same query again after having created the index and having told sql light build me this fancy tree in memory voila 0.001 seconds so orders of magnitude faster now both are fast to us human certainly but imagine the data set being even bigger the query being even bigger these indexes can get even larger than that uh and they're rather the queries can take longer than that and therefore take even more time than that but unfortunately if i've got all of my data all over the place as in a chart as in a diagram like this my god how do i actually get useful work done how do i get back the people in a movie and the writers and the stars and the ratings if it's all over the place i would seem to have created such a mess and that i now need to execute all of these queries but notice it doesn't have to be that complicated it turns out that there's another keyword in sql really the last that we'll look at here called join the join keyword which you can use implicitly or explicitly allows you to just join tables together and sort of reconstitute a bigger more user-friendly table so for instance suppose i want to get all of steve carell's tv shows not just the office well recall that i can select steve's id from the people table where name equals steve carell so again he has a different id in this table because this is from imdb but there's his id and let me go ahead and turn the timer off for now all right so there is his id one three six seven nine seven i could copy paste that into my code but that's not necessary thanks to these nested queries i can do something like this let me go ahead and now select all of the show ids from the stars table where person id from that table is in or is equal to this result so there's that join table stars that links people and shows so let me go ahead and execute that all right so there's all of the show ids of steve carell's tv shows that's a lot and it's very non-obvious what they are so let me do another nested query by putting all of that in parentheses and now select title from shows where the id of the show is in this big long list of show ids and there are all of the shows that he's in including the dana carvey show back when uh the office up at the top and then most recently shows like the morning show on apple tv alright so that's pretty cool that we can actually reconstitute the data like that but it turns out there's different ways of doing that as well and you'll see more of this in the coming weeks and in problem sets in labs and the like but it turns out we can do other things as well and let me just show this syntax even though it'll look a little cryptic at first glance you can also use that join keyword as follow i can select the title from the people table joined with the stars table on the people's id column equaling the stars person id column so in other words i can select a title from the result of joining people and stars like this on the id column in one and the person id column in the other and i can join in the shows table on the stars dot show id equaling the shows dot id so again now i'm joining the primary and foreign keys on these two tables where the name equals quote unquote steve carell so this is the most cryptic thing we've seen yet but it just means take this table and join it with this one and then join it with this one and filter all of the resulting joined rows by a name of steve carell and voila there we have all of those answers as well and there's other ways of doing this too i'll leave unsaid now some of the syntax for that but that's felt a little slow and in fact let me go ahead and turn my timer back on let me re-execute this last query select title from people joining on stars joining on shows where name equals steve carell that took over half a second so that was actually admittedly kind of slow but again indexes come to the rescue and if again we don't allow linear search to dominate but let me go ahead and create a few indexes create an index on called person index on the stars table the person id column why well my query a moment ago used the person id column it filtered on it so that might be a bottleneck i'm going to go ahead and create another index on the called show index on the stars table on show id similarly a moment ago my query used the show id column and so that too might have been a bottleneck linearly top to bottom so let me create that index and then lastly let me create an index called name index and this is perhaps the most obvious similar to the show titles before on the people table on the name column and that too took a moment now in total this took like almost a full second but these these indexes only get created once they get maintained automatically over time but you don't incur this with every query now let me do my select again let me select title from people joining the stars table joining the shows table where name equals steve carell boom 0.001 seconds that it's an order of magnitude faster than the like more than half a second it took us a little bit ago so here too you see the power of a relational database so even though we've created some problems for ourselves over time we've solved them ultimately granted with some more sophisticated features and additional syntax but a relational database is indeed why you use them in the real world for the twitters the instagrams the facebooks the googles because they can store data so efficiently without redundancy because you can normalize them and factor everything out but they can still maintain the relations that you might have seen in a spreadsheet but using something closer to logarithmic thanks to those tree structures but there are problems and what we wanted to do is end on today two primary problems that are introduced with sql and because they are just unfortunately so commonly done notice this here there is something generally known as a sql injection attack which you are vulnerable to in any application where you're taking user input that hasn't been an issue for my favorites.pi file where i only took input from a csv but if one of you were malicious what if one of you had maliciously typed in the word delete or update or something else as the title of your show and i accidentally plugged it into my own python code when executing a query you could potentially inject sql into my own code how might that be well if logging in via yale you'll typically see a form like this or logging in via harvard to something you'll see a form like this here's an example that i'm pretty sure neither harvard nor yale or vulnerable to suppose i type in my email address to this login form as mailand.harvard.edu single quote dash dash it turns out in sql dash dash is the symbol for commenting if you want to comment something out it turns out that the single quote is used when you want to search for something like steve carell or in this case mailing at harvard.edu it can be double quotes it can be single quotes in this case i'm using single quotes here but let's consider some sample code if you will in python here's a line of code that i propose might exist in the back end for harvard's authentication or yales or anyone else's maybe someone wrote some python code like this using select star from users where username equals question mark and password equals question mark and they plugged in username and password whatever the user typed into that web form a moment ago gets plugged in here to these question marks this is good this is good code because you're using the sql question marks so if you literally just do what we preach today and use these question mark placeholders you are safe from sql injection attacks unfortunately there are too many developers in the world that don't practice this or don't realize this or do forget this if you instead resort to python approaches like this where you use an f string instead which might be your instincts after last week because they're wonderfully convenient with the curly braces and all suppose that you literally plug in username and password not with the question mark placeholders but just literally in between those curly braces watch what happens if my username maylyn harvard.edu was actually typed in by me maliciously as mailin harvard.edu single quote dash dash that would have the effect of tricking this python code into doing essentially this let me do a find and replace it with trick python into executing username equals quote mainland at harvard.edu quote dash dash and then other stuff unfortunately the dash dash again means comment which means you could maybe trick a server into ignoring the whole password part of the sql query and if the sql query's purpose in life is to check is this username and password valid so that you can decide to log the user in or to say no you're not authorized well by essentially commenting out everything related to password notice what i've done i've just now theoretically logged myself in as mainland harvard.edu without even knowing or inputting a password because i injected sql syntax the quote and the dash dash into my query tricking the server into just ignoring the password equality check and so it turns out that db execute when you execute an insert it returns to you as said the id of the newly inserted row when you use db execute to select rows from a database table it returns to you a list of rows each of which is a dictionary so this is now pseudocode down here with my comment but if you get back one row that would seem to imply that there is a user named malin at harvard.edu don't know what his password is because whoever this person is maliciously tricked the server into ignoring that syntax so sql injection attacks are unfortunately one of the most common attacks against sql databases they are completely preventable if you simply use placeholders and use libraries whether it's cs50s or other third-party libraries that you may use down the road a common meme on the internet is this picture here uh if we zoom in on this person's license plate or where the license plate should be this is an example of someone theoretically trying to trick some camera on the highway into like dropping the whole database drop is another keyword in sql that deletes a database table and this person was either intentionally or just humorously trying to trick it into executing sql by using syntax like this so characters like single quotes dash dash semicolons are all potentially dangerous characters in sql if they're passed through unchanged to the database a very popular xkcd comic let me give you a moment to just read this is another uh well-known meme of sorts now in computer science if you'd like to read this one on your own but henceforth you are now in the um family of of educated learners who know who little bobby tables is unfortunately it's dead silence in here so i can't tell if anyone is actually laughing at this joke but anyhow this is a very well-known meme so if you're a computer scientist who knows sql you know this one and there's one last problem we'd like to introduce if you don't mind just a couple final moments here and that is a fundamental problem in computing called race conditions which for the first time is now manifest in our discussion of sql it turns out that sql and sql databases are very often used again in the real world for very high performing applications and by that i mean again the googles the facebooks the twitters of the world where lots and lots of data is coming into servers all at once and case in point some of you might have clicked like on this egg some time ago this is the most liked instagram post ever as of last night it was up to like 50 plus million likes uh well eclipsed kim kardashian's previous post which is still at like 18 million or so this is to say this is a hard problem to solve this notion of likes coming in at such an incredible rate because suppose that long story short instagram actually has a server with a sql database and they have code in python or c plus or whatever language that's talking to that database and suppose that they have code that's trying to increment the total number of likes well how might this work logically well in order to increment the number of likes that a picture like this egg has you might first select from the database the current number of likes for the id of that egg photograph then you might add one to it then you might update the database and i didn't use it before but just like there's insert and delete there's update as well so you might update the database with the new count plus one so the code for that might look a little something like this three lines of code using cs50s library here where you execute select likes from posts where id equals question mark where id is the unique identifier for that egg and then i'm storing the result in a rows variable which again i claim is a list of rows i'm going to go into the first row so that's rows bracket 0 and i'm going to go into the likes column to get the actual number and that number i'm going to store in a variable called like so this is going to be like 50 million and i want it to go to 50 million in one so how do i do that well i execute on the database update posts set likes equal to question mark and then i just plug in likes plus one the problem though with the instagrams and googles and twitters of the world is that they don't just have one server they have many thousands of servers and all of those servers might in parallel be receiving clicks from you and i on the internet and those clicks translate into this code getting executed executed executed and the problem is that when you have three lines of code and suppose brian and i click on that egg at roughly the same time my three lines might not get executed before his three lines or vice versa they might get commingled chronologically my first line might get executed then brian's first line might get executed my second line might get executed brian's second line so they might get interspersed on different servers or just temporarily in time chronologically that's problematic because suppose brian and i click on that egg roughly at the same time and we get back the same answer to the select query 50 million is the current count then our next lines of code execute on the servers we happen to be on which adds one to the likes the server might accidentally end up updating the row for the egg with 50 million one both times because the fundamental problem is if my code executes while brian codes execute we are both checking the value of a variable at essentially the same time and we are both then making a conclusion oh the current likes are 50 million we are then making a decision let's add one to 50 million we are then updating the value with 50 million one the problem is though that really if brian's code or the server he happens to be connected to on instagram happens to have selected the number of likes first he should be allowed to finish the code that's being executed so that when i select it i see 50 million one and i add one to that so the new count is 50 million two this is what's known as a race condition when you write code in a multi-server or fancily known as a multi-threaded environment lines of code chronologically can get commingled on different servers at any given time the problem fundamentally derives from the fact that if brian's server is in the middle of checking the state of a variable i should be locked out i should not be allowed to click on that button at the same time or my logic code my code might should not be allowed to execute logically so there is a solution when you have to write code like this as is common for twitter and instagram and facebook and the like to use what are called transactions transactions add some few new pieces of syntax that we won't dwell on today and you won't need to use in the coming days but they do solve a fundamentally hard problem transactions essentially allow you to lock a table or really a row in a table so that if brian's click on that egg results in some code executing that's in the process of checking what is the total like count my click on the egg will not get handled by the server until his code is done executing so in green here i've proposed the way you should do this you shouldn't just execute the middle three lines you being in facebook in this case instagram should execute begin transaction first then commit the transaction at the end and the design of transactions is that all of the lines in between will either succeed altogether or fail altogether the database won't get into this funky state where we start losing track of likes on eggs and though this has not been an issue in recent years back in the day when twitter was first getting started twitter was super popular and super offline a lot of the time there was this thing called a fail whale which is like the picture they showed on their website when they were getting too much traffic to handle that was because when people are liking and tweeting and retweeting things it's a huge amount of data coming in and it turns out it's very hard to solve these problems but locking the database table or the rows with these transactions is one way fundamentally to solve this and on our final extra time today we thought we would play this out in the same example that i was taught transactions in some years ago suppose that the scenario at hand is that you and your roommates have a nice dorm fridge and you're all in the habit of drinking lots of milk and you want to be able to drink some milk but you go to the fridge like i'm about to here and you realize uh oh we're out of milk and so now i'm inspecting the state of this refrigerator which is quite old but also quite empty and the state of this variable being empty tells me that i should go to cvs and buy some more milk so what do i then do i'm presumably going to close the fridge and i'm gonna go and leave and go head to cvs unfortunately the same problem arises that we'll act out here in our final 60 or so seconds together whereby if brian now my roommate in this story also wants some milk he comes by when i'm already headed to the store inspects the state of the fridge and realizes oh we're out of milk so he nicely will go restock as well so let's see how this plays out and we'll see if there's isn't a similar uh analogous solution so i've checked the state of the variable we're indeed out of milk i'll be right back just going to go to cvs [Music] do [Music] do [Music] all right i am now back from the store i've picked up some milk gonna go ahead and put it into the fridge and oh how did this happen now there's multiple jugs of milk and of course you know milk does not last that long and brian and i don't drink that much milk so this is like a really serious problem we've sort of tried to update the very value of this variable at the same time so so how do we go about fixing this what's the the actual solution here well i dare say that we can draw some inspiration from the world of transactions and the world of databases and perhaps create a visual for here that we hope you never forget if you take nothing away from today let's go ahead and act this act this out one last time where this time i'm going to be a little more extreme i go ahead and open the fridge i realize ah we're out of milk i'm going to go to the store i do not want to allow for this situation where brian accidentally checks the fridge as well so i am going to lock the refrigerator instead let me go ahead and drape this through here a little extreme but i think so long as he can't get into the fridge this shouldn't be a problem let me go ahead now and just attach the lock here almost got it come on all right now the fridge is locked now i'm gonna go get some milk [Music] uh [Music] [Applause] [Music] [Music] [Music] [Music] you
Info
Channel: CS50
Views: 98,491
Rating: undefined out of 5
Keywords: cs50, harvard, computer, science, david, j., malan
Id: Wb0DM9I8RDo
Channel Id: undefined
Length: 157min 20sec (9440 seconds)
Published: Mon Mar 01 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.