[MUSIC PLAYING] SPEAKER 1: All right. This is CS50, and this is week 7
on the day before All Hallows' Eve. Today, we will introduce yet
another language, the goal of which is not to introduce another
language for language's sake, but to really begin to emphasize that
when it comes to the world of software programming, engineering
more generally, there's often different tools for different jobs. And if you were to try to use C to
solve all of our future problems, it would actually be
painful, as you've already seen how much more pleasant it is to
solve certain problems with Python. But today, quite quickly,
we'll realize that even Python's actually not the most pleasant
way to solve a lot of problems, especially when it comes
to data, data at scale. And, in fact, as we transition
in the final weeks of CS50 to the world of web programming, and if
you so choose with your final project mobile programming, you'll
actually need more tools in your tool kit than
C and Python alone. In fact, today we'll introduce a
database-centric language called S-Q-L, or SQL. Next week, we'll explore markup
languages, like HTML and CSS. A bit of JavaScript, as well. And then we'll synthesize all of this
together at the end of the class, as some of you might, for
your final projects, as well. But, before we do that,
and talk about data, let's actually start
to gather some data. So if you could visit this URL
here on your phone, or a laptop. Or, if easier, here's a
bar code version thereof. You can point your camera
at this bar code here, and it's going to pull up a
relatively short Google form that's just going to ask you
a couple of questions, so that we can collect
some actual live data and actually play around with
it, initially, in Python. So if you go to that URL there,
you'll see a Google form. And if you haven't been able to
scan it quite yet, that's fine. Just kind of look over the
shoulder of the person next to you. But you, or they, will see a
little something like this. So among the questions will be,
which is your favorite language? On the list thus far are only Scratch, C, and Python, and below that you'll see
another question asking about your favorite problem
across the problem sets thus far. Each of them is radio
buttons, which means you'll be able to select one
button for each of those questions. And, ultimately, what's
going to be nice is that if you've never used Google
Forms before as an administrator, all of that data is actually going to
end up being, not only in Google Forms, but also, if you so choose,
in Google Spreadsheets. Which is an example, of course, of
spreadsheet software similar in spirit to Apple Numbers on Macs or Microsoft
Excel on different platforms. And Google Sheets is
going to allow us to store all of that data in rows and columns. And so, since Google made
both of these products, they integrated one with the other. So, in fact, if I, on my laptop
here in another window, open this up-- let me flip over. Here's the live spreadsheet. And we'll see that the very
first person who buzzed in really liked Python, as did a lot
of other people thereafter. But, Hello, World was your
favorite in Python, which is great. There's a couple of
votes for Scratch here. If we scroll down, there's one hold
out for C, who really liked Credit, in this case here. And if we scroll down further, it looks
like Python and Scratch are in there. A few more C's, and so on and so forth. So suppose that we wanted to,
actually, now analyze this data. Now, any of you who have used Excel,
or Numbers, or Google Spreadsheets know that it comes with built
in functions and formulas, and we can do all of that, but suppose
there was a huge amount of data. Or, suppose that this data was
not coming in via Google Sheets, but via your own web application,
or your own mobile application, and it's just ending up in
some kind of spreadsheet. Well, wouldn't it be nice if we could
actually analyze that kind of data with code. And, in fact, you can. And the simplest way to store a bunch
of data isn't with anything fancy, in fact, but just a literal text
file, something ending in .txt, or, maybe more commonly, .csv. In fact, what we'd call
a flat file database is literally just code for a text
file containing all of your data. But your data typically has delimiters
that separate some values from others. And, in fact, the most
common approach, daresay, is to use, indeed, comma-separated
values, or, CSV files. And what that means is that in simple
text alone, just ASCII, maybe Unicode, you sort of mimic the
idea of rows and columns by using newline characters,
like backslash n, to represent row after row after row. That's pretty straightforward. To just move the cursor
to the next line. And because text files aren't graphical,
so there's no notion of vertical bars that you can put between what
you and I think of as columns, you literally just use a comma,
or some other such symbol to separate one value from another. So, in fact, let me
go ahead and do this. Even if you've never done this before,
it turns out, in Google Sheets, and also Excel and Numbers,
you can export your data, not in some proprietary Apple
or Microsoft or Google format, but in a globally portable
format known as .csv. So let me go up to file. I will go to download. And notice, I can export this
in a bunch of different formats, but the one I care about for
today is going to be .csv. On my Mac, that's going to put
it into my downloads folder. And what I'm going to go ahead
here and do, in just a moment, is let me go ahead
and open up VS Code. So, in short, I've downloaded this file. Google gave it a long default file name. I've renamed it to favorites.csv. Let me go ahead and actually
open it now in VS Code here. And this now is the exact same data. If you're still submitting the form,
I'm afraid you didn't make the cut off. So we're not going to see more
data ending up in this file because I've literally
downloaded a copy of it. But, indeed, if I scroll
through this, we'll see that it's got some
399 lines of data. Or, technically, 398
because the very first one is what we'd call a header
row, which just describes what each of these columns means. Now notice that even though the
Google Sheets interface actually put all of this data in proper
graphical rows and columns, you can still see the rows because
they're just separate lines, and you can kind of see the
columns by focusing on the comma here, the comma here, the comma here. And then, also, the comma
here, the comma here, the comma here, and so forth. Now, a quick subtlety. Hello, World is a two-word problem
name, and it itself has a comma in it. What's to stop me from confusing the comma in Hello, World with the commas that Google, apparently, inserted into this file? Yeah. So, those quotation marks were added automatically. You all did not type them; we did not type them into the form. But Google is smart enough, as
is Apple and Microsoft, when they export CSVs to somehow escape
seemingly dangerous characters, or characters that
could just break things. And, in this case, a
convention in the CSV world is just to quote any strings
that themselves have commas so that any program you're writing that
reads this file doesn't get confused. So the only thing that came
in automatically from Google is just this timestamp
here, based on the time of day in our own local timezone. That was added automatically. So we have three, and
not just two columns. So with that said, we
have three columns: timestamp, language, and problem. The latter two of which
came from all of you. Let's actually write some
code that analyzes this data. Let's figure out what the most
popular something or other is here. So I'll close the CSV file. Let me go ahead and use what seems to
be folks' favorite language thus far, and write a file called
favorites.py, for instance. And I'm going to use Python
to open that CSV file, analyze it, crunch some numbers, and
output some statistics, if you will. So the easiest way to manipulate CSV
files, as you might have gleaned, is not to just open the file yourself,
look for commas, allocate memory, or anything like that
in C. In Python, you can literally just import CSV,
which is a Python module that gives you CSV related functionality. In Python, there's a bunch
of ways to open files. One way to open a file
is like this in Python. You can say file equals
open, similar to fopen in C. You can specify the name of the file
you want to open, like favorites.csv, and you can explicitly say you
want to open the file for reading, quote unquote, "r," just like fopen. Strictly speaking, in
Python read is implied. So if you omit the second
argument, it will still work. But for parity with fopen,
I'll do the same here. Then you can, maybe, do something
with file, and then, at the end, you can close the file like this. So in Python, there really is a
mapping between fopen and fclose. But in Python the functions are
called open and close for short. But a more common way, a more
Pythonic way, so to speak, i.e. the way people tend to do
it in Python, is actually to use a keyword that didn't exist
in C, where you instead say with. And you say, with open this file
name as a specific file name, and then indent it inside of that. Now you can do whatever
you want with the file. And the implication of using
with, even though it's not obvious from the keyword itself, is that
the file will be automatically closed for you later. So this is just a minor
Python convenience so that you don't have to
remember to close the file. Unless something goes
wrong, it will just close automatically as soon as
you're outside of this with block. So how do I go about reading a CSV? The simplest way is to actually
give yourself a variable called, maybe, reader. Like if I want to read the CSV,
I'll call my variable reader. Set that equal to the return value of
a function that comes with Python CSV module called reader, in lowercase,
and you just pass in the file name. So the first line there on
line three opens the file and gives me access
to the bytes therein. Line four now actually
uses this library that comes with Python to just go read it
for me, figure out where the commas are, so that it can hand me, line
after line, the data in the file. Now the first piece of data
in the file, though, was what? What's the very first row? So it was that header row with
timestamp, language, problem. I actually want to skip that
because that's not data. That's what we might call metadata
that's describing my actual data. So one way to deal with this, I
can literally just say next reader. And next, even though the
semantics here are a little weird, it just means, no, no, give me the
next line from that reader instead of the first by default. And now inside of this with block,
I can do something like this. For each row in that reader, let's
do something super simple initially. Let's just print out row bracket one. So row bracket one. So what's going on here? Well, the CSV reader in Python
is going to return to you, inside of this loop, one row after another. Each of those rows, though,
has how many columns? Three. The timestamp, the
language, and the problem. And just like in C, our lists
in Python are zero-indexed. So zero, one, two. So if I'm printing out row bracket one,
that's the second, or middle, column. So what's this going
to print row after row? Each of the languages
that you all replied with. So that's all. This isn't doing any
kind of analytics yet. I'm just going through
the motions to, at least, print out some data of interest. So let me run this. Python of favorites.py, enter. And it happens super fast. But if I scroll back in my terminal
window, there is all of that raw data.
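[For reference, the version just run might look roughly like this -- a sketch, assuming the file has been renamed to favorites.csv as described, not the verbatim on-screen code:]

    import csv

    # Open the downloaded CSV for reading ("r" is implied if omitted)
    with open("favorites.csv", "r") as file:
        reader = csv.reader(file)
        next(reader)          # skip the header row (Timestamp, language, problem)
        for row in reader:
            print(row[1])     # the second column, i.e., each person's language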
So this is to say, once you know the function names, once you know the
keywords, it's actually pretty simple in Python to just
get up and running with a file, and start looking at the data therein. But it turns out that it's a little
sub-optimal to use the reader alone. It turns out there's better
ways to do this, and let me make this clear as follows. I don't strictly need a variable,
but let me actually declare a variable called favorite, set
it equal to row bracket one, and then print out that favorite value. This is not doing anything new. It's just declaring an
additional variable, but I wanted to highlight the fact
that I'm just kind of trusting that row bracket one is the language I care about. But suppose one of you, or I,
go into the Google spreadsheet, and like anyone might
with a spreadsheet, you might start moving things around. And you might swap some of
the columns left and right. You might delete one of the
columns, add something else. In short, spreadsheets
are arguably fragile in that it's pretty easy in the
GUI, the graphical user interface, to change them around. And so my code in Python,
accordingly, is, arguably, fragile because I'm just hoping that row
bracket one is always the data that I care about. So what would be marginally better? Well, let's actually use
that header row instead. And more common, arguably, in Python,
is not to use a simple reader, but, instead, what we would
call a Dictionary Reader. I'm going to change this to
DictReader, capital D, capital R, and that's it for the
change to line four. On line five, I'm going to
get rid of the next line because I don't want
to skip the header now. What DictReader does,
which reader does not, is it automatically analyzes
that first line in the file, figures out what are all
of your columns called, and, thereafter, when you iterate over
this reader what each of your rows now is-- it's no longer a list of
size three, bracket zero, bracket one, bracket two. Each row that you get back
in this loop on line five is now, wonderfully, a dictionary
instead, the keys of which come from the header row: Timestamp, language, problem. The values of which are whatever
each of you typed in again and again. So how do I change the code? I no longer have these numeric indices
because row is no longer a list, it's a dictionary. So if I literally want the
language that you typed in, I can use row bracket, quote unquote,
"language," treating row as a Dict not as a list anymore. Now this is, again, more robust because
if you move the columns around on me, code is still going to work,
at least so long as you don't rename the columns at the very top. So that's still one
assumption I'm making. All right. Well, beyond that, what
could I actually do here? Well, just to be clear, I don't,
strictly speaking, need this variable. So no need to highlight this again. So let me just simplify
the code a little bit and get rid of this variable,
and instead just print out the language in that row. As a quick check, let me
rerun Python of favorites.py and it seems to still work, even
though there's a lot of output. We're not going to check
all 399 lines, but it looks like it printed out all
of those popular languages. Well, what more can we actually do? Well, let's actually now
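[And the DictReader version just rerun, again as a rough sketch against the same favorites.csv:]

    import csv

    with open("favorites.csv") as file:
        reader = csv.DictReader(file)
        for row in reader:
            # Look the column up by name rather than by position
            print(row["language"])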
start to crunch some numbers and figure out how many people
like Scratch, how many people like C, how many people like Python. Let's start to analyze this. So maybe the most pedantic
way to do this in Python would be to just create some variables
and do all of the counting myself. So let me actually go ahead and do this. Let me delete this code for now. And after opening the
reader, let me create a variable called Scratch
and set it equal to zero, a variable called C, set it equal
to zero, a variable called Python, set it equal to zero, just so that I
have three counters, similar to what we did in week zero, week one,
anytime we counted anything Honestly, this looks a little
stupid, and it's not wrong. In fact, this is how you
would do it in Python, but Python also has some clever syntax. If you want to be really cool,
you can do Scratch comma, c comma, Python equals zero, zero, zero, and
do all three at once if you like that. So it tightens up the code a little bit,
even though the effect is ultimately the same. Now let's go ahead and iterate
over this file row by row by row. And if we see Scratch,
increment the Scratch counter. If we see C, increment the C counter. If we see Python, increment
the Python counter, instead. So how do I do this? Well, I could do something like,
for each row in the reader, just like before, let me go ahead
and get that favorite variable and set it equal to the language in
that dictionary that just came back as part of this iteration. And now I can do something like this, if
favorite equals, equals "Scratch" then, with my indentation, I can do Scratch
plus equals one, elif favorite equals, equals "C." I can go ahead and
increment C plus equals one. Else-- and I don't think I
want else, just to be safe. Even though we only saw three options,
I think just to be super safe, elif favorite equals, equals
"Python," then let's go ahead and increment Python. The one thing we can't do that we
could do in C is the plus, plus trick. So plus equals is as
close as we can get. So what have I done? Inside of this loop, I've just
incremented each of those counters by one if I see Scratch, or
C, or Python, again and again. All right. Outside of the loop and
outside of the with block, because once I'm done
reading all of the rows, I might as well let the with clause
close the file automatically, let's just go ahead and
print out some values. So I'm going to go ahead and print out
Scratch colon, and then inside of there let's print out whatever the
value of that variable is. Let's then go ahead
and print out C colon, and then whatever the
value of the C variable is. And then, lastly, let's print out
Python colon and whatever the value is. And now I made three typos here. This is not going to print the values. What do I need to do? Sorry? Yeah, I'm missing the f in
front of each of these strings so that I actually get formatted strings. And that is to say, the variables get
interpolated inside of the curly braces like we saw last week. All right. So, honestly, that's kind
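[Put together, the counting version described here looks something like this -- a sketch under the same assumptions, not the verbatim on-screen code:]

    import csv

    with open("favorites.csv") as file:
        reader = csv.DictReader(file)
        scratch, c, python = 0, 0, 0
        for row in reader:
            favorite = row["language"]
            if favorite == "Scratch":
                scratch += 1
            elif favorite == "C":
                c += 1
            elif favorite == "Python":
                python += 1

    # Outside the with block, the file has already been closed for us
    print(f"Scratch: {scratch}")
    print(f"C: {c}")
    print(f"Python: {python}")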
of a decent amount of code, like 18 or so lines of code, just
to count the number of responses. But let's see if I got it right. Let me open my terminal and
run Python of favorites.py, and now I see, by an
overwhelming amount, folks like Python, followed by C,
followed by Scratch, in that order. But that was a decent amount
of code to have to write, and it turns out there's actually better
ways of doing this, more Pythonic ways, more programmatic ways of doing this. And if we think back to one of
our universal data structures. Think back to how we preached
last week, and the week before, the value of these
dictionaries, more generally. Like the CSV module,
clearly a fan of them, because that's what
DictReader is giving us, dictionary, after
dictionary, after dictionary. And this was the general
idea of a dictionary. It associates keys with
values, much like you might in a two-column table
on a chalkboard, or the like. Well, this is what I need if I want
to keep track of how many people said Scratch, and C, and Python. If I had a piece of chalk, I
could just write Scratch, and C, and Python as three keys. And then with my chalk and, maybe,
an eraser, keep track of the values. They all start at zero. Then, I add one, add two, add three. Or, maybe, on a chalkboard,
I actually use hash marks. But a dictionary is kind of the perfect
data structure for just associating something like Scratch, C, Python,
with something else-- keys with values, respectively. So this is going to
look a little weirder, but it's going to be pretty conventional
to do something like this instead. Let me go back into VS Code. I'll close my terminal window. And let me go ahead and
actually delete a lot of this because I can simplify this further. Let me go ahead and now give
myself, maybe, a variable just, generically, called counts, and set
that equal to an empty dictionary. And you can actually do
this in a couple of ways. You can literally write
out dict with nothing in parentheses, which will
give you an empty dictionary, like the picture on the board,
but a little more conventional is to just use two keystrokes and use
two curly braces with nothing inside. That gives me an empty dictionary
like this picture here. Now my loop is going to be the same. I'm going to do, for
each row in the reader, I'm going to go ahead and grab the
favorite language for this person. So favorite equals row,
quote unquote, "language." But now I'm going to do
something a little different. There's two scenarios here. Either, I have seen this language before
and I want to increment it by one. Or, if I've never seen this language
before because the loop just started, what should I initialize the
count for this language to? So, one because I've only seen it once. Exactly. So now let me go ahead
and do exactly that. If this current favorite,
that I am seeing in the row, is already in the counts
dictionary-- and in Python, this is literally how
you ask that question. If favorite in counts. That will check, is there a key with
this name, Scratch, C, or Python? If so, go into that location
in the counts dictionary, index into it at the favorite
location, because favorite is a string. It's either, quote unquote,
"Scratch," "C," or "Python," and just increment it by one like this. Else, as you noted. If it's not there implicitly, then
counts bracket favorite should probably be set equal-- not to zero
because we literally are seeing it in the current row. Let's initialize it to one. And, thereafter, if
we see it again, it's going to be plus equals one,
plus equals one, plus equals one. So now outside of that loop, outside
of the with block, let me do this. For each favorite in those counts-- And this, too, in Python is a
trick if you want to iterate over all of the keys in a dictionary. That is, if you want to iterate over the
left-hand column of all of these keys, you literally can say, for
something in that dictionary. So for favorite in counts, this is
giving me a variable called favorite and updating it automatically
top to bottom in that dictionary. Let's go ahead and print
out an f string that's going to say whatever the name
of that language is colon, and whatever the value of that
language is in that dictionary. So, again, logically the only
thing that's new is this. I'm now using one dictionary
instead of three variables to keep track of three things, like
updating this chalkboard with three different things, Scratch, C, Python. And the last thing I'm doing,
which is a little different, is once I have that dictionary,
whether there's three languages, or, maybe, tomorrow there'll
be fourth because we're going to introduce SQL today. Well, this will iterate over all of
those keys and print out the values. All right, so if I
didn't do anything wrong, if I do Python of
favorites.py and hit enter. There we have it. And it happens to be in a
different order this time. That's because we saw Python
first, we then saw Scratch, and, eventually, we saw C. But if
we wanted to sort these differently, we actually could with
some different code. But, in short, what have we done? We've created this kind of
structure in memory with three keys, Python, C, and Scratch,
because each time we encounter such a language from you
all, we either set our counter to one or increment it by one instead.
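[The dictionary-based version, roughly -- again a sketch, assuming the same favorites.csv:]

    import csv

    with open("favorites.csv") as file:
        reader = csv.DictReader(file)
        counts = {}
        for row in reader:
            favorite = row["language"]
            if favorite in counts:
                counts[favorite] += 1   # seen this language before
            else:
                counts[favorite] = 1    # first time we've seen it

    for favorite in counts:
        print(f"{favorite}: {counts[favorite]}")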
Any questions on this code, or this general idea of using dictionaries as, like, a little cheat sheet for doing some math in this way? Super common paradigm. All right. Well, let me tweak this a little bit. Right now, in my output, we're
seeing Python, Scratch and C. Maybe, for the sake of discussion,
suppose we want to sort this by key. We can actually do that. Let me close my terminal temporarily. And it turns out, in Python,
there's a bunch of ways to do this, but the simplest way to
sort a dictionary by key is literally to use a
function called sorted that comes with Python,
that just does it for you. And even if you pass it a dictionary,
it will sort that dictionary by the left-hand column so you can
iterate it over alphabetically instead. So if I go back now to VS Code. If I open my terminal window and
I rerun Python on favorites.py, now that I've added the
sorted call, we should now see C, Python, Scratch, just because it's sorted alphabetically instead. Now that's not that useful, especially
if we had lots of languages. You probably don't care
about it being alphabetized as much as you care about it being
ranked by which is the most popular, which is the least popular. And, for that, there's a bunch
of ways to do this in Python. And, I think, the simplest way to sort
by value the right-hand column instead of the left-hand column is probably
to make this change instead. Let me close my terminal temporarily. Let me still use the sorted function,
which by default sorts by key, but let's change it to be as follows. Let's change it to sort by a
function called counts.get, which is a little weird,
but this comes back to last week's brief discussion of
object-oriented programming, or oop. Remember, in Python,
that almost everything is like an object of some sort. An int is an object, a dictionary
is an object, a string is an object. Which is to say that, not only do these
things have values like, quote unquote, "Hello, World," or 50, these
variables, these objects, can also have functions
built into them, a.k.a. methods. So it turns out that because counts
is a dictionary, because I made it so, that counts dictionary, like
any dictionary in Python, comes with a function called get. And if you just tell the sorted
function to use that built-in method, it will actually, for every key, get
its value, get its value, get its value, and sort effectively by the
right-hand column instead of the left. Now we'll see down the line, perhaps,
more sophisticated ways of using this, but, for now, this just
overrides the default behavior and sorts the dictionary, not
by key, but by value instead. All right. So now watch this if I run
Python of favorites.py once more. Previously, it was in the order in
which the languages appeared first in the CSV file, then it
was sorted alphabetically. Now it should be sorted by value. And, indeed. Scratch is the least with
40, C is the next with 78, Python is the biggest with 280. That's not much of a top
10, or a top three list. Let's actually reverse it. And the easiest way in
Python to do that is to pass a third argument into
sorted, and you would know this by just reading the documentation. You can literally say, reverse
equals True, capital T, and now if I rerun this one last
time, Python of favorites.py, I'll see the same values but with
the whole thing reversed in order.
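[Concretely, only the final loop changes -- something like this, with counts built the same way as before:]

    # Sort the keys by their counts, largest first, using named parameters
    for favorite in sorted(counts, key=counts.get, reverse=True):
        print(f"{favorite}: {counts[favorite]}")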
Long story short, even though this might feel like a slog, like adding this and looking up
this, so much easier than in C where you would have had to figure
out, how does bubble sort work? Let me implement bubble sort, selection
sort, any of those sorting algorithms, or use some other library. In Python, you just get a lot
more for free, so to speak. It's just built in once you get
comfy with the documentation. And, to be clear, this is an
argument, as is this, as is this. But in Python, we have not only
positional arguments, which are based on what position they
are in, left to right, just like C, you also have these
named parameters whereby they have explicit names
that you can use yourself, to make clear that you're using
this one but not this other one. In Python, more parameters can
be optional than in C. Phew. All right. Any questions about that technique yet? And if you're feeling like this is
starting to take the fun out of Python, that's actually kind of the
point of doing this the hard way. All right. Well, let's do it one other
way that's marginally better. It turns out, in Python there really
is this rich ecosystem of libraries, the code that comes with the language
itself, or, even, third parties. And coming with the
language is another module called the collections
module, or package here, whereby if I use from collections, I
can import something called Counter, capital C. And it
turns out, if this felt a little bit painful to
create a dictionary yourself, initialize it, maybe, to
zero or one, like this. Turns out, you have the same problem
that people before you have had, and so there's another way to do this. You can create a variable called
counts, set it equal to Counter, capital C, open paren, close paren. And this is a different type of object. It's a different type of
object in Python, that has counting capabilities built in. And so if I actually want to use
this counter instead, I can do this. For each row in the reader, let's go
ahead and grab the favorite language from that row, just like before. And without doing any of that headache
of like, if, elif, or any of this, you can literally just index
into that counter using favorite, quote unquote, "Scratch," or "C," or
"Python," and increment it by one. What the Counter class is going
to do for you, so to speak-- Another example of object-oriented
programming, and counts is now an object thereof. What this whole feature of Counter
is going to do for you is it's going to automatically
initialize everything to zero, even if you've
never seen it before, and then you can just blindly
start incrementing it. So, in short, there's just
more pleasant ways, sometimes, to do something in Python as well. All right. How about, lastly, let's
make things, maybe-- Oh, actually, let's do this. We can even simplify the sorting here. Let me actually take
this one step further. Instead of manually figuring out how
to sort this, I'm going to do this. For each favorite, and the count
thereof in the counts variable's most common function's return value, go
ahead and print out this as well. So in short, again, a bit new syntax. But what's going on here? Well, it turns out that
this counts class and-- sorry, this Counter class, and,
in turn, this counts variable, comes with a function built in that you
would only know from the documentation. It's literally called
most underscore common, and what it returns to you when you call
it is a pair of key value, key value. And so this, too, is a trick in
Python that we did not have in C. If you want to iterate
over something, but grab two variables at a time on
each iteration like this, you separate them by commas
and can get favorite count, favorite count, favorite count. So if I run this now, Python of
favorites.py, this, too, just works. And it's getting a little simpler,
a little tighter than before than if we had actually
done it all manually. Lastly, here is the code that's the shortest version thereof. We're down to like 14 or 15 lines.
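[Roughly, that shortest version -- a sketch of the idea, not necessarily line for line what is on screen:]

    import csv
    from collections import Counter

    with open("favorites.csv") as file:
        reader = csv.DictReader(file)
        counts = Counter()
        for row in reader:
            favorite = row["language"]   # the column being counted
            counts[favorite] += 1

    # most_common() yields (key, count) pairs from largest to smallest
    for favorite, count in counts.most_common():
        print(f"{favorite}: {count}")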
If I wanted to change this to analyze the most popular problem thus far in the class, how do I
go about changing the code to print out, top to bottom, the
most popular problem or problems? What line should change? Yeah. So, yeah. Line 10. Because I've written this in
kind of a general purpose way and using dictionaries
with keys, it suffices to change language to,
quote unquote, "problem" because that was the
third column from the CSV. And so now, if you're
curious, let's actually make my terminal window a bit bigger. Python of favorites.py, enter. And, OK. Tragically, we peaked
early with Hello, World, which is the most popular problem thus far,
followed by Filter, then Scratch. OK. Peaked even earlier. Mario, DNA, and so forth, and
a bunch of others thereafter. So based on this sample size, here's
the ranking of the problems thus far. So, we got it. More Hello, World problems
in the weeks to come. All right. Now that we've done
that in that way, let's just make this program
slightly interactive and see how we can really take a
fundamentally different approach. I'm going to go into VS Code. I'm going to keep everything the
same, except that, at the bottom, I'm going to get rid of
this loop because I don't want any more print out everything. I want to look up specific counts. Like, how popular was this problem,
how popular was this other problem? And what I'm going to go ahead and do
is to create a variable called favorite, set it equal to-- I could use get_string
in the CS50 library, but we saw last week there's no
need to for strings, certainly. Let me just use the input
function that comes with Python, and prompt the human for
their favorite problem. And then let me go ahead and print out,
for instance, an f string containing whatever their favorite is, colon,
and whatever the count thereof of that favorite is, close quote.
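[A sketch of this interactive version -- the prompt string here is just an assumption:]

    import csv
    from collections import Counter

    with open("favorites.csv") as file:
        reader = csv.DictReader(file)
        counts = Counter()
        for row in reader:
            favorite = row["problem"]
            counts[favorite] += 1

    # Ask the human which problem to look up, then print its count
    favorite = input("Favorite: ")
    print(f"{favorite}: {counts[favorite]}")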
So let me open my terminal window. Let me run Python of favorites.py, enter. And if I type in Hello, World, looks
like 65 people, indeed, like that one. If I run Python of favorites.py again. I type in Scratch, now we see that one. If I type in anything else, I'm
going to get its specific value. So this is to say, not only can we write
Python code to analyze some data pretty tightly versus the manual
code we wrote out earlier, you can also make these
programs interactive as well. And this is going to be a
super common paradigm, right, if you go into the world of consulting,
analytics, data science, more generally. Among your roles is going to be to
analyze data, to ask questions of data, get back the answer. Honestly, life gets pretty
tedious, even though you've only been programming in Python, perhaps, for
like one week, a week and a half now. When you have to write code to
solve all of the world's problems-- and there's this sort of
tenet in programming, that programmers tend to avoid
writing code as much as they can because, ideally, you
would solve problems with the right tool for the job,
minimizing the number of lines of code you actually write. So how do we actually get to that point? Well, instead of just dealing with
CSV files, pure text, it turns out there's an entire world
of proper databases. Not flat file databases, where you
store everything in text files, but a database program, a piece
of software running on a computer, running on a server, that's
always listening for you. It's got a lot of memory, it's got a
lot of space, and in turn a lot of data, and it supports a
database specific language that makes it much easier, much faster
to ask questions of the very same data. It's a relational database
in the sense, too, that it's not even
necessarily one spreadsheet, one set of rows and columns. You can have two sheets, three sheets,
30 sheets across which there might very well be relationships, or relations. So S-Q-L, or SQL, is a
database specific language, stands for Structured Query Language,
that's a declarative language whereby you're not going to be
in the habit with SQL, typically, of writing loops and
conditionals, and this kind of thing. You're instead going to describe
the data that you want to get back, you're going to describe the
question that you want the answer to, and we'll do this using a
relatively small grammar. That is to say, there's not
that many keywords in SQL. It's a pretty small language. But it's going to allow us to eliminate
dozens of lines of Python code, perhaps. SQL follows this CRUD paradigm. So C-R-U-D, which simply means
that in a relational database, you can really only do four things. You can create data, read data-- that
is, look at it or analyze it somehow. Update the data, or delete the data. So, CRUD, for short. And that really speaks to
just how relatively simple the world is, even though we'll
just scratch the surface of some of its capabilities today. And you'll explore more over time. Specifically, in SQL, there's
going to be other keywords that map to those four ideas. Technically, you don't just
create data in the world of SQL, you can also insert data, like
inserting more rows into a sheet. And it's not the word
"read" that people use. People say to "select" data. But they mean to read data, which
is sort of the opposite of writing or creating data. But the U and the D are the
same, except that there's also a keyword in SQL known as DROP,
which lets you very destructively, very dangerously delete entire
database tables, as well. So how do we do this, and what's
the connection to our favorites data thus far? Well, here is the syntax
in this language called SQL via which you can create a table. So the jargon is a little different,
but the ideas are exactly the same from the world of spreadsheets. What you call a sheet in a spreadsheet,
the database world calls a table. It's a table of rows and columns,
but it's the exact same idea. You're going to have discretion
over what to call the table, just like you can call a
spreadsheet something, or else, and you can also specify
the types of data that you want to store
in your rows and columns. And it's going to go a little more
deeply than just formatting it, like in Excel, and Numbers,
and Google Spreadsheets, you can actually control, maybe,
how big the data could be depending on the database you're actually using. In CS50, we're going to
use a light version of SQL. Literally, a language called-- an implementation of SQL
called SQLite, which has really all of the core functionality that
you would see in the real world, and with larger, more
scalable systems, but it's going to allow us to focus on
a lot of the building blocks. And SQLite's actually really popular
on Macs, PCs, and phones, nowadays. A lot of the data that games and other
applications on your phone might store, actually have a file, a binary
file with zeros and ones, that's in the SQLite format. So if you do a mobile app, for
instance, for your final project, you'll have an opportunity to
play with something like this. Well, how do you actually run SQLite3? It's just a command built
into your codespace. So this is a program you could
install on your own Mac, your own PC, or the like. We'll do everything as we've
done before, in the cloud, and actually use your code space. And by that I mean, we can just
start to play with this data now using SQL instead of Python. So let me do this. Let me open up my terminal
window here, and let me go ahead and maximize my
terminal window just because we'll focus now on the files here. Recall that I have a file
called favorites.csv, and that CSV file is just text. But let me load it
into a proper database so I can actually use this
other language called SQL on it. To do this, I'm going
to run SQLite3, which just means the third
version of it, and I'm going to create a new
database called favorites.db. That's just a convention,
but it means here comes a database that I'm going to create. Notice, I'm not using
the tabbed code editor. I'm not using the code command
because the code command is generally for text files. SQLite3 is going to create a binary
file, zeros and ones, ultimately. When I run that, it's going
to ask me to verify yes. I'm going to hit y and then enter. And now I'm at the SQLite prompt,
which is not the dollar sign. It literally says SQLite
with an angled bracket. Now, one time only, I want to
go ahead and load favorites.csv into this database so I can actually
play around with it using not Python, but this new language called SQL. And the way I'm going to
do this is as follows. I'm going to do .mode csv, enter. And that just puts SQLite into CSV mode. It has different modes for
different file formats. I'm going to .import and then I'm
going to specify the file that I want to import, which is favorites.csv. And then this one's up to me. What is the name of the
table I want to create? And table, again, is essentially
synonymous with sheets. So I'm going to call
everything the same. I'm going to call my
table favorites as well. So what this command is
essentially going to do, is all of those lines of Python code
that open the file, read it row by row, and do something with it-- This is just built into SQLite. It's going to load the whole darn
CSV into this new favorites.db file, and then that's it for now. I'm going to go ahead and literally
type .quit to get out of SQLite. I'm back at my dollar sign prompt. If I type ls, I have
not only favorites.csv, I also have favorites.db now as well, a
brand new file, and in that file now is an optimized version of the CSV file. In that DB file now is a
version of the data that's going to lend itself to CRUD
operations, creating, reading, update, and deleting, using this
new language called SQL. All right, so how do I get into this? Well, let me clear my terminal window
and pretend that I'm doing this now the next day. I've already created the database. That's a one time operation. Once you've got the
data, now I'm going to go ahead and again run SQLite3 favorites.db
just to open the file again. But it's already now-- all
of the data is in there. Just as a teaser, let
me go ahead and do this. .schema is a SQLite command that just
shows me the schema of this database table. And we'll see more about
this in a little bit, but for now this is
showing me, essentially, the SQL command that
was automatically run when I imported this database
the first time around. And, for now, just notice
that it mentions timestamp, it mentions language,
it mentions problem. Very loosely, it calls
each of those texts. So we're not trying very hard
to distinguish one type of data from another. It's all text. But notice, create table
If not exists favorites. This is essentially
the create table syntax that I alluded to
earlier via which you can create a table in a SQLite database. But more on that in just a bit. Here now is how we can actually
get at the data in that database. It turns out that we can select
one or more columns from a database table using syntax like this. Literally, the keyword select, then
the name of one or more columns that are in that database, and then from
the specific table that you care about. And notice that in capital
letters here are all of the SQL specific keywords, select,
and from, in particular. And in lowercase, by convention,
here are the placeholders for the columns that you, or I, have
created, and the tables that you, or I, have created. So if I go back to SQLite here. Let me just clear with Control L, which
will just freshen up the screen here so we can focus on what's new. If I want to select everything
from the table called favorites, here's what I can do. Select star from favorites semicolon. And, do forgive me, semicolons
are back for SQL, in this case. But select star from favorites uses a
syntax you might not be familiar with. Star here has nothing
to do with pointers. Star is a wild card. It means give me everything,
no matter what it's called, from this particular table. When I hit enter, what
we're going to see is the entire contents
of the favorite table that's the result of having imported
that CSV into this database. So when I hit enter,
there is all of that data. And SQLite, just to be
friendly, it's using what we might call ASCII art, just
very simple text, like hyphens, and vertical bars, and pluses on
the corner, to make it look pretty and make it look like
it is a proper table. But what you're really seeing
is the contents of favorites.db, specifically in that table. Specifically, if I only
care about languages, let me try something
more specific than star. Select language from
favorites semicolon. This is going to give me
just a single column now, of all of the favorites that you
selected for language specifically. This is a little overwhelming to
see all 399 or so pieces of data, so let me actually
truncate it a little bit. Let me do select language
from favorites limit 10. So we're about to see that
there's little tricks you can use to tweak the
behavior of the language in order to get back more or less data. In fact, it turns out
there's a bunch of keywords like these built into SQL, much like
Google Spreadsheets, Apple Numbers, Microsoft Excel, and certainly Python. There's a lot of functionality that you
just get for free with the language. If you want to calculate an average,
count the number of things in a file, get the unique or distinct values,
force everything to lowercase, force everything to uppercase,
get the maximum value, minimum value-- much like spreadsheets,
if you're familiar with that world, you get all of that functionality
in SQL but also more. So, for instance, if I go back
to my terminal window here. Let me go ahead and select
the total number of favorites in this table, the total number
of rows that you all inputted. So I could do select star
from favorites semicolon, and then I could literally
start counting these. Like 1, 2, 3, 4-- there's
clearly a better way. And, indeed, on our list of
functions was a count function. And so the way I can use
that in SQL is like this. Select count of star-- so pass star in as an argument
to the count function. You don't care what columns you're
counting just count them all. From favorites semicolon. And now, you're actually
going to get back like a little baby table that has just
one row, one column inside of which-- one cell of which has
the total actual count. And it's 398 because 399, recall,
included the actual header row from the file. All right. So suppose you want to-- actually, note that this
is the exact same thing as counting a specific
column because every row has the same number of columns, three. We could just say select
the count of languages, or select the count of problems. All of those are going to
give me back the same answer. It is, therefore, conventional
in SQL if you're just trying to count the
number of rows, don't even worry about what they're called. Just do count star to get
back everything more simply. All right, but what if we want to
get back the distinct languages and we didn't know a priori that
this came from a Google form with three radio buttons? Well, we could do something like this. We could select the distinct languages
from the favorites table, enter. And that gives me Python,
Scratch, C because distinct is one of the other functions
that comes with SQL. This is, obviously, very easily
countable with my human eyes, but if I wanted to do
this more dynamically, I could change this to be
count the distinct languages. And just like in C, just like
in Python, just like in Scratch, I can nest these functions
and pass the output of one into the input of another. If I hit enter now, I now
get three in this case here.
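[For reference, the same kinds of queries can also be run from Python via its built-in sqlite3 module -- a sketch, assuming the favorites.db created earlier; in lecture these are typed directly at the sqlite3 prompt instead:]

    import sqlite3

    db = sqlite3.connect("favorites.db")

    # SELECT DISTINCT language FROM favorites;
    for (language,) in db.execute("SELECT DISTINCT language FROM favorites"):
        print(language)

    # SELECT COUNT(*) FROM favorites;
    row = db.execute("SELECT COUNT(*) FROM favorites").fetchone()
    print(row[0])  # the number of rows, 398 in lecture

    db.close()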
OK. Let me pause to see if there are any questions or confusion just yet. Yeah. [INDISTINCT SPEECH] Does SQLite-- [INDISTINCT SPEECH] SQLite3 is a program. And it's an implementation
of the SQLite language, which itself is a lightweight
version of what the world knows as SQL, which is a very
convoluted way of saying there's lots of humans in the world. Not everyone agrees what SQL should be. Microsoft might disagree with Oracle,
might disagree with other companies, as well. So there's a common
subset of SQL in the world that almost everyone knows,
and learns, and uses, but there are also some
vendor specific features. SQLite tries to distill things
really into the essence, and so that's what you increasingly see
on Android, on iOS, on Macs, and PCs, as well. So we use it because it's
relatively canonical. Good question. All right. So let's do a few other things by
introducing a few other keywords without trying all of these right now. Here in this list is a
bunch of new keywords that are going to give
us even finer control. And we saw limit already, and
that just limits the output. But you can also have what
are called predicates. You can literally use the keyword
where to start filtering the data, without using an if, and an elif, and
an elif, and an elif, and so forth. You can just in one line
express something conditionally, you can order the data, and you can
even group similar data together. So what do I mean by this? Let me go back to VS Code
here, and let me play around with a few different queries. Let me select, maybe, the count of
rows from favorites, which previously was going to be 398 if I just get back
all of the rows, but suppose I only want to know how many of you
liked C. I can then say something like where the language in each
row equals, quote unquote, "C," and the convention here is to
use single quotes, though SQLite is tolerant of other formats as well. If I hit enter here, I'll see, indeed,
as we saw with Python, the 78 number. That, honestly, took what? 13, 14, 15 lines of code? Now I've distilled that kind of
query into a single line of SQL code instead, by using this
built in functionality. Suppose I really want to get specific,
and how many of you really liked Hello, World in C as your favorite? Well, I could change this query. And just like your dollar
sign prompt, your shell, you can go up and down in your
history in SQLite to save keystrokes. You can use Boolean logic. And I can say language
equals C AND, maybe, problem equals, quote
unquote, "Hello, World," and the number of you that
liked that problem was seven.
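[The same predicate query, sketched from Python against the assumed favorites.db:]

    import sqlite3

    db = sqlite3.connect("favorites.db")

    # SELECT COUNT(*) FROM favorites
    # WHERE language = 'C' AND problem = 'Hello, World';
    row = db.execute(
        "SELECT COUNT(*) FROM favorites "
        "WHERE language = 'C' AND problem = 'Hello, World'"
    ).fetchone()
    print(row[0])  # 7 in lecture

    db.close()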
So relatively few of you really liked Hello, World in C. Now notice a couple of key differences. One, I'm using AND, and not
ampersand, ampersand like in C. I'm using single equal signs. So SQL behaves like Scratch does,
which is not like Python or C. Why? Different people have implemented
different languages differently. A single equal sign means equality in the world of SQL for comparing things left and right. All right. Things are now going to get
a little more interesting, but the whole goal of
all of that Python code was to analyze the ranking of
languages and popularity thereof. Turns out in SQL, once
you have the vocabulary, it's pretty easy to do
something like that. I'm going to do this. I'm going to select all of
the languages in the table, but I'm also going to
select the count thereof. And then I'm going to do that
from the favorites table, but I'm going to group
by language because I claimed a moment ago that group by is
another one of our key phrases in SQL that's going to let us group data. And what this effectively
means is that if you've got this table with a lot of
duplicate languages, again, and again, and again, you can group by
that column, and, essentially, smush all of the Python rows together,
all of the Scratch rows together, all of the C rows together, but
figure out how many of those rows just got smushed together. Effectively, doing all of
that dictionary legwork, or the counter legwork,
that I did in 13-- 15 lines of Python code. So if I hit enter here,
this now is the motivation for what we're now starting to do. I have distilled into
a single line of code in a language called SQL
what, indeed, took me more than a dozen lines of Python
code just to get back an answer. And I can do the same
thing with problem. I can just change language here,
for instance, to problem instead. But, per this list, I can not only
group things, I can order them. So if you actually want to get a top
10, or a top three list, well let's just change this query slightly. Before the semicolon, let me order
by the count of those rows semicolon. And now what I get is
from smallest to largest. 40, 78, 280. If you want to flip that, that's fine. By default, order by uses
ascending order, abbreviated A-S-C. If you want to do descending order,
D-E-S-C, you can do that as well. And now we have a top three
list, from largest to smallest. Now, honestly, this is a bit of a
mouthful to use count star over here, count star over here. There's a nicety in SQL, too, where
you can create little aliases of sorts. So if I use the same query again-- let me scroll over to the left. I can actually use the
keyword as here, and I can rename this weird looking column,
count star, to anything I want. I can rename it to n. And then at the end of this query,
I can order by n, essentially, creating a synonym, if you
will, for one versus the other. So if I hit enter now, same exact
thing, but my little baby table that came back-- not a technical
term-- has two columns, one of which is more simply called n
now instead of count star. It just makes it minorly more
convenient in your actual SQL code to reference things that might
actually be a little annoying to type.
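[Putting those keywords together from Python -- again just a sketch against favorites.db; in lecture the query is typed at the sqlite3 prompt:]

    import sqlite3

    db = sqlite3.connect("favorites.db")

    # Group rows by language, count each group, rank largest first,
    # and keep only the top three -- aliasing COUNT(*) as n along the way.
    rows = db.execute(
        "SELECT language, COUNT(*) AS n FROM favorites "
        "GROUP BY language ORDER BY n DESC LIMIT 3"
    )
    for language, n in rows:
        print(f"{language}: {n}")

    db.close()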
Lastly, suppose we want to get a top one list, and we just want the
most popular language. Honestly, I can just
do limit one, enter. That gives me just this tiny little
table, a temporary table, really, with one row. And, honestly, if I don't even
care about what the language is, I can omit that entirely. Just see how many people really
like the most popular language. 280, in this case. But, of course, it's more interesting
to see what it actually is. So, in short, just by turning
these knobs syntactically, it's relatively easy to start
getting at more and more data. And more answers there, too. Phew. Questions on this thus far? Any questions? No? OK. Well suppose that this
week, for instance. One of our new problems is
going to be called Fiftyville, and it's going to allow you
to explore the world of SQL in the context of a
place called Fiftyville. Suppose that suddenly becomes
your favorite problem. Well, how can we go about
adding more data to a database? Well, we've seen create
table for creating the table, we've seen select for
selecting data there from. Turns out there's also
an insert into command that you can use to insert
new data into a table. Now, I did this in bulk by just
importing that whole CSV file, and SQLite3 did it all
for me automatically. But in the real world, if you don't have
a captive audience, every one of whom is submitting the form
at the same time-- but maybe it's an application
that's running 24/7, you're going to get more and more data
over time, just like Google itself. So if you write code like
this, you can insert one row at a time, one row at a time, and
actually change the data in your table. So just as a check, let me do
select star from favorites, enter. Just to see all of the data. And the last data we got was
at 1:41 PM, and 21 seconds. Suppose now I've decided I
want to insert one new row. I can do this. Insert into favorites--
and then I have to specify what columns do I want to insert into. I'm going to insert a new language
column, and a new problem column. Timestamp? I could. I don't really want to look up the time,
so I'm going to leave that one blank. And I'm going to put in
values as follows for this. SQL for the language, which wasn't
even an option on the form earlier, and Fiftyville for the name
of the problem, semicolon. So there's a bit of dichotomy here. In the first set of parentheses,
you specify a comma separated list of the columns that you
want to put data into. In the second set of
parentheses, you actually specify the values that you
want to put into those columns. So when I hit enter
nothing seems to happen, which, in general, is a
good thing at my terminal. But if I now rerun select star
from favorites, we will see-- voila, a brand new row. We don't know what time or
date it was inputted at. In fact, we see an old friend, null,
which indicates the absence of a value, but we do indeed see that SQL, in
Fiftyville, is actually now in there. So in the world of SQL, null has nothing
to do with pointers or addresses. The world of SQL, it's just using the
same word to represent the same idea, that there's no data
here, but it has nothing to do with actual memory
addresses in this case. But suppose that you don't want
to do that, and, like, no, no, no. Let's just delete that. Fiftyville hasn't even
been released yet, nor have we even finished
talking about SQL. How do we delete data
from a database table? Well, there's a delete from command. Let me go back to VS Code here. Let me go ahead and clear my
terminal just to keep things clean. Let me go ahead and
delete from favorites, and let me not hit enter
here after a semicolon. This is one of the
most destructive things you can do as a database administrator. If you Google around,
there are horror stories of interns in the real world executing
commands like this at their companies. This will delete
everything from favorites. So if you ever do this, remember,
we told you not to today. But if we add a where clause here, only
delete rows where the timestamp column is null, this is more reasonable. And, frankly, any companies
you work for should also have backups of their
database, so we shouldn't even be reading about these horror
stories, but such is the real world. So this is going to delete any
row from the favorites table where Timestamp, capital T, because
that's how Google did it, is null. I go ahead and hit enter. Nothing seems to happen, but if I do
select star from favorites semicolon that now row is, again, gone. So you can use these predicates, these
where conditions, coupled with select, coupled with delete, and
other operations as well. What if I actually
What if I actually want to make a change? So if you want to update existing data like this-- well, we could do this. I could update this table, setting one column equal to this value where some condition is true. So how might this work? Well, let me boldly
claim that a lot of you are really going to like
Fiftyville in the world of SQL, so all of these favorites
are sort of passe now. So let's do this. Let me go ahead and update
the favorites table, setting the language column
equal to "SQL," quote unquote. And with a comma let me go ahead and
also update the problem column to be equal to, quote unquote, "Fiftyville." I'm not going to have any kind of where here, which means this is just going to do its thing on all of the rows.
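For reference, the update being typed is roughly the following; with no WHERE clause, it touches every row:

    -- overwrite language and problem in every row of the table
    UPDATE favorites
    SET language = 'SQL', problem = 'Fiftyville';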
So if I hit enter, nothing seems to have happened, but if I now do select star from favorites, everyone's favorite is literally that problem. So this, too, is destructive. Unlike elsewhere in the digital world, there's no Control-Z, or undo, here. You better have made a backup
of your database, otherwise that's not a good thing. In this case, I do have this
CSV file, so I could just delete my favorites.db file. I could re-import the data, so I haven't
really lost anything of importance, but you could in the case of
the real world and any data you're actually working on. So just to make the
point, let me go ahead and delete from favorites
semicolon, enter. Let me reselect. There's no data there anymore. And in fact, if I do select
count star from favorites, we'll see as much that the
answer is, in fact, zero because everything has now been deleted. Any questions, then, on that code there? No? All right. So if not too scared yet, let's go
ahead and take our 10-minute break now. Halloween candy is served,
and we'll be back in 10. All right. So we are back, and before we dive
back into SQL and some real world data, it turns out, unbeknownst to me,
we've had a Halloween costume contest. So it's now time to announce the winners
of this year's CS50 costume contest. If our two winners would
like to come on up-- who, I'm told during break, dressed up as me. [APPLAUSE] Come on over. Would you like to introduce yourselves? SPEAKER 2: Yes. Hi, everyone. I'm David. I am living in Matthews as a first year. I'm planning on studying
gov and computer science. SPEAKER 3: Hi, everyone. I'm David. I'm a second year in
Mather, and I'm planning on just studying computer science. [LAUGHING] SPEAKER 1: Well, thank you. We have some Oreos for you. Thank you, both, so much. Did anyone else dress like this? We have two more Oreos, if you'd like? Intentionally? [LAUGHING] Oh, that's pretty good. OK. Yes, we have one more winner. Come on down. [LAUGHS] Thank you. SPEAKER 4: Hi. SPEAKER 1: This is intentional? SPEAKER 4: No. SPEAKER 1: Oh, OK. [LAUGHING] SPEAKER 4: Hello, my name is David. I'm from Canada, and I'm a first year. I'm not sure what I'm going to study. SPEAKER 1: OK. Welcome, as well. SPEAKER 4: Thank you. SPEAKER 1: All right. [APPLAUSE] Up until now, we've played
around with the data that you all gave us, which was based,
very simply, on your favorite language and your favorite problems. But it turns out there's a lot
of real world data in, indeed, the real world, some of
which is quite voluminous. And, indeed, there can be not just
dozens, or hundreds, but thousands, hundreds of thousands, even millions
of rows in the biggest of databases. And so what we thought we'd do
in the latter part of today, is really, actually, get our
hands dirty with a real world data set from the Internet Movie
Database, otherwise known as IMDb. And, in fact, if you go to
imdb.com, you'll be able to answer, via their web interface, some
of the very questions we'll do today using SQL alone. But what you'll find, ultimately,
is that what websites like imdb.com, or their mobile app versions
thereof, are probably doing is-- yes, giving you a nice pretty
graphical interface to type queries, but underneath the hood they are
passing your input into SQL queries, or similar queries, that
they've mostly formed already. They're just waiting for
placeholders, like the keywords, that you're actually searching for. So let's go ahead and experiment,
maybe, with just some real-world data, initially, before we consider how
to actually store it at scale. So let me open up, just
for the sake of discussion, an actual, empty
spreadsheet, just so I have some rows and columns to play with. And let me propose that we want to
model TV shows from the real world. How can we go about doing this? Well, maybe I could start in
this first column A, so to speak, and I could create a title column. And then, maybe, a column
for the star of that show. And a very popular show,
of course, is The Office. So I might put this into the
second cell in that first column. And under star, I could put
someone like Steve Carell. But, of course, he wasn't
the only star of the show. There are others as well. And so if I want to put in
someone like Rainn Wilson, well, maybe, I need
a second star column. So Rainn Wilson. But even as early as
the first season, there was also another star in
the credits, John Krasinski. So he was a star. Jenna Fischer was top
credited in the first season. So Jenna Fischer. And then BJ Novak, Harvard alum, was
also in the first season's opening credits, as well. So we've got all-- one, two, three, four,
five of these folks. Hopefully, I didn't
misspell anyone's name, but here's the beginnings
of a real-world data set. And we could imagine doing this
for everyone's favorite shows, adding more, and more rows. But let's consider, as we often
do, not just the correctness of this implementation, but the design. It's pretty straightforward. It's very readable. So I think it's good in that sense. But if you start to nitpick
what's poorly designed, even in the world of spreadsheets,
about what I've done here-- assuming that the next
row is another show, the next row is another
show, and so forth. What's bad about this? Yeah? [INDISTINCT SPEECH] Yeah, so each row is going to have
a different number of columns, and even I, kind of, couldn't
make up my mind from the get go. Like, do I have just one star column? Or two? Or maybe now I'm up to 5? For even bigger shows,
and later in The Office, when more people got
top billing, we're going to need more than five
columns for stars. So that's fine. We can clearly scroll to the right,
and just keep adding more columns, but there should be
something about that design that like rubs you the
wrong way, like something feels a little off if some
rows have this many columns, others have this many. The data would be very jagged
along the right hand side. It would be very sparse, which
would be another way to describe it. There's probably a better way. So maybe I should flip
the data around, and maybe a better approach here would be
to just have one column for stars. So let me do this. Let me just move Rainn Wilson over
here, and John Krasinski over here, and Jenna Fischer over here,
and BJ Novak over here, as well. I'll get rid of all of these
superfluous, identically named columns. And now this is sort of better because
now I can have any number of stars in the vertical, although it's a
little weird to leave this blank. So maybe I should copy-paste here. So in some sense this is
better, in that now I only have one title column, one
star column, and I can just keep adding row, row, row,
for each show and its stars. But what's now poorly
designed about this? Yeah? [INDISTINCT SPEECH] Yeah. I'm repeating the title. And, in general, copy, paste,
repeating yourself in code has generally been a bad thing. It's generally gotten us in trouble
if I make a change, or maybe a typographical error somewhere. Maybe it propagates elsewhere. And if nothing else, it's
just a lot of wasted space. If this is actually going to be
stored in a database, in a CSV file, why are you duplicating the same
string again, and again, and again, for large TV shows? That's just wasteful. It just doesn't seem wise. So how can we eliminate that redundancy? Well, unfortunately, in the world
of spreadsheets, things kind of now escalate quickly to be kind of annoying. But let me do it, nonetheless,
with just a small bit of data. Let me propose that we do this instead. Let me create not one sheet
but maybe multiple sheets, and assume that there's
some kind of relationship, or relation, across these sheets. So, just to be pedantic, let me call
this sheet, not the default sheet one, but let's call this shows. And in this sheet, I'm going to
have a title column for every show, and I think I'm going
to be proactive here. I'm going to start giving every show
a unique ID number, much like Harvard affiliates have Harvard
IDs, Yale affiliates have Yale ID numbers, and so forth. Let's go ahead and give each
show its own unique identifier, for reasons we'll soon see. So for The Office let
me, just for consistency with the actual Internet
Movie Database, I'm going to give it a
unique number of 386676. The specifics don't
really matter, but that happens to be what people in
the real world actually do. But that's it for TV shows,
even though I could imagine there being many more in this sheet. Let me create another sheet now
here, and I'll call it people. And in the people sheet, let me
keep track of all of those TV stars. So one column will be name, another
will be also called ID here, but it's going to be a
person ID, not a show ID. And here we have, for
instance, Steve Carell. We have Rainn Wilson. We have John Krasinski. We have Jenna Fischer. And we have BJ Novak. And, this is going to
be a little tedious, but just to be consistent
with reality, turns out that according to IMDb Steve Carell's
unique number in the world is 136797, Rainn's is 933988, John's is
1024677, Jenna's is 278979, and, lastly, BJ Novak's is 1145983. Phew. OK. So now we have the same people, but
they each have a unique ID number. Lastly, let's associate those shows
with those people in a way that avoids the two problems we
identified earlier, which was having a variable number
of columns in one case versus redundancy in the second case. Let's really tighten things
up so nothing is in duplicate that doesn't actually need to be. So I'm going to create a third
sheet here, and I'll call it stars, like the TV stars for these
shows, and what I'm going to do is have only two columns,
a show ID, and a person ID. I could write these in
different ways, but it's conventional in the database world
to use snake case, so to speak, where everything is lowercase,
with underscores instead of spaces. And for show ID-- well, we're
only mocking up one show for now, but I'm going to go ahead and say
386676, which is The Office, I claimed. And now I'm going to go ahead
and have all of those person IDs, and this time it's OK to copy
paste, if only to save time. So I'm going to grab all of these
five stars IDs, paste them there, and I am going to indulge
by duplicating the show ID, even though we didn't like that earlier. Now this indeed has escalated
quickly because none of these sheets are very useful to look at to the
human eye because none of them has the complete picture. We have shows in one, people in another,
and then like this cryptic mapping of numbers in the third. But I propose that this is the
right way to implement data if your goal is to have a
canonical source of truth for every show and every person. That is to say, you only say the
show's name once in one place. You only write the TV star's
names once in one place. And you associate them, you
relate one with the other, by way of this third sheet here. So if you've not seen it already,
notice that if The Office has this unique ID, 386676,
notice in the stars table, that same value appears multiple times. But what this third sheet is doing
is associating that same show ID with one, two, three, four,
five different people. Now I can see that it's similar
in spirit to what we already indicted as bad design a moment ago. The Office, The Office, The Office. But think about our world of C. In
the world of C, and really computers in general, data takes up a finite
amount of space, typically. Like an integer is four bytes, 32 bits. So even though, yes, I'm
duplicating this value, it's just the same four
bytes, four bytes, four bytes. It's not t-h-e space
o-f-f-i-c-e, null character. It's not the same 11
bytes again, and again. It's just a number, and numbers
tend to be much more efficient. That computers can crunch
numbers much more quickly. Duplicating numbers is in
general allowed, or smiled upon. Duplicating strings will
get you into trouble. So with that said, is this
a useful spreadsheet now? Would you want to be handed this
in your job and asked questions? Like, hey, who stars in The Office? You can answer it, but you
have to look up one sheet, then another, then a third. Or you need to use VLOOKUP, or special
functions in Excel, or Google Sheets. I mean, you're just creating
a lot of work for yourself. But, academically, if you
will, systematically, this has a lot of merit because
we've avoided all duplication. We've normalized the data, so to speak, by factoring out any duplication.
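For reference, one way to capture this three-sheet design in SQL would be roughly the following sketch, using hypothetical column names and the primary and foreign keys that come up shortly:

    -- one row per show, each with a unique ID
    CREATE TABLE shows (
        id INTEGER,
        title TEXT NOT NULL,
        PRIMARY KEY(id)
    );

    -- one row per person, each with a unique ID
    CREATE TABLE people (
        id INTEGER,
        name TEXT NOT NULL,
        PRIMARY KEY(id)
    );

    -- one row per (show, star) pairing, relating the two tables above
    CREATE TABLE stars (
        show_id INTEGER NOT NULL,
        person_id INTEGER NOT NULL,
        FOREIGN KEY(show_id) REFERENCES shows(id),
        FOREIGN KEY(person_id) REFERENCES people(id)
    );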
So where are we going with this? Well, it turns out that we'll play now with some actual data from the real world, from the actual Internet Movie Database. And in a moment, it's going to
look a little something like this. This is an artist's rendition
of five different-- sorry, six different tables. So not one, but six
different tables that we created using freely available
Internet Movie Database data. They kindly provide, not CSVs,
but TSVs, tab separated values, which are essentially
the same thing, but you look for tab characters
instead of commas in order to convert them, as
we did, into our own format, SQLite in this case. But we'll see that there
is a file that we've made available from today, which you
can download off the course's website, called shows.db, and it contains
all of this same information. And in that shows.db file there
are indeed these six tables, but let's focus on just
two of them initially. IMDb, the Internet Movie Database,
is all about rating TV shows, and tracking that kind of
information, so let's actually take a look at some of these
ratings and figure out how we can actually
answer actual questions. So let me go over to VS Code. And let me run SQLite of shows.db,
which is a file that already exists. There's no CSVs, no TSVs, we
did all of this for you already. When I hit enter, I
get my SQLite prompt. And the first thing
I like to do whenever I'm playing with a SQL database
for the first time-- maybe I got it from a class,
or my boss, or the like-- is just to wrap my mind around
what's inside of the database because you're not typically going to
be provided with pictures like this. You're just going to be given a file. So let me just select some data. Let me select star from the shows table. I don't really want to see all
of it, so let me just limit it to the first 10 shows in the table. And here, we can infer what
the shows table looks like. Every show has an ID, a title, the
year in which it debuted, apparently, and the number of episodes as of
last night when we exported the data. So that seems to reflect this
picture, and this is technically an entity relationship diagram, a
standard way of depicting things. And you'll see that in
our picture shows, indeed, have an ID column, title column,
year column, and episodes column. Well what about these ratings? Well, according to the picture,
that has a show ID, a rating, and a votes column. So let's go back to VS Code
here, and let's do select star from ratings, limit 10, just to wrap
our mind around some of the data. And there we have a
show ID in the left, we have rating in the middle, which seems
to be like a floating point value, and then votes, which
seems to be an integer. So we have some different
types of data here. But there's a lot of data. In fact, if I do this now-- select star from shows-- let's not select all of the data. Let's do select count star from shows, and it turns out there are 214,000 shows in this database.
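For reference, the exploratory queries so far, roughly as typed:

    -- peek at the first few rows of each table
    SELECT * FROM shows LIMIT 10;
    SELECT * FROM ratings LIMIT 10;

    -- count how many shows there are in total
    SELECT COUNT(*) FROM shows;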
So we're well past the 398 rows that we've been talking about thus far. So it turns out, per this
diagram, there's actually a standard relationship between
these two tables, shows and ratings respectively, and it's what
we would call a one-to-one relationship whereby every show
in this design has one rating. And this is indicated,
technically, if you look at what the arrowheads
look like on these diagrams. This indicates that it's a
one-to-one relationship, which means every show has one rating, which
means every row in the shows table has a corresponding row
in the ratings table. Strictly speaking, they
could be in the same table. You could just join them together
and make one wider table, but IMDb keeps the
data separate so we too kept them separate in
two separate tables here. So what does this
actually mean in practice? Well, let's actually take a look in VS
Code at the schema for these tables. Let me clear my screen,
and let me .schema, but specifically look at the schema,
or the design of, the shows table. So you can do .schema shows, and any
command in SQLite with a dot is SQLite specific. In the real world, if you're
using other products like Oracle, or Postgres, or MySQL, or others,
they have different commands, but anything else that
we've been typing, especially the capitalized
keywords, is indeed standard SQL. If I hit enter here, here is what the
shows table apparently looks like. In other words, here is the create
table command that we, the staff, ran in order to create
this table for you, and then we imported a
bunch of data into it. Every show has an ID, has a title,
has a year, has a number of episodes. But there's more detail here. Apparently, the ID is an integer, the
title is text, and it is not null, cannot be null, the year is
numeric, whatever that means, and the episodes is an integer. Well, that is now familiar. And then primary key. So there's some other
stuff going on there, which we'll come back to, but let
me also do .schema ratings, enter, and we'll see a couple
of other data types here. There's show ID, which is
still an integer, but not null. Rating, which is a real number, a.k.a. float, but it, too, cannot be null. And then some number of votes,
which is an integer, cannot be null. And then there's mention of a foreign key.
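For reference, the two schemas being described look roughly like this; the exact statements in shows.db may differ slightly:

    CREATE TABLE shows (
        id INTEGER,
        title TEXT NOT NULL,
        year NUMERIC,
        episodes INTEGER,
        PRIMARY KEY(id)
    );

    CREATE TABLE ratings (
        show_id INTEGER NOT NULL,
        rating REAL NOT NULL,
        votes INTEGER NOT NULL,
        FOREIGN KEY(show_id) REFERENCES shows(id)
    );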
So unlike our data set for the favorites a bit ago, which we just did automatically, and imported,
this database by us, and by IMDb, has been more thoughtfully
designed, where there's actually some relationships across multiple
tables, rather than previously just one. Now in the world of SQL, we have
indeed different data types. For instance, we have
these five primarily. One, cutely named
BLOB, which is actually Binary Large Object, which generally
means like a file or some piece of data that's zeros and ones. Though, generally, it's best
to store files on file systems, like in folders, on disks, so
to speak, not in your database. There's integers, which we've seen. There's numeric, which is more
like dates and times, things that are numbers, but not
necessarily integers or floating point values. There's reals, which do
have decimal points in them. And then there's just text. In other SQL databases in the
real world, in your future jobs, or your future classes that
you might use SQL again in, there are even more data types in other
databases: Oracle, MySQL, Postgres, and so forth. But these are sort of
representative of them. They just get more
precise in other systems. But there are also some keywords, which we've seen already, that you can specify when designing a database-- for instance, that a column cannot be null. If you want to make sure that
no one can insert or update data unbeknownst to
you that is null, you can impose that when creating the table. And unlike Excel, and
Google Spreadsheets, and Apple Numbers, which will generally
let the human type in or not type in anything they want, with a
database you have more protections over the integrity of your data. Moreover, you can specify that a
column's values must be unique. If you want to avoid
duplicates, like you don't want the same person to be able to
register twice for your website thereby making sure they have
one unique email address, your database can help with that too. You don't have to rely on Python
to check if it already exists. But there's this other feature
of relational databases, that is databases that have multiple tables
across which there are relationships, and that's these keywords
we saw briefly a moment ago. Primary key, and foreign key. And we started to scratch
this surface here. It turns out what I
was doing was actually best practice in the world
of relational databases. I gave-- or, really, IMDb gave-- every
show in the world a unique ID. And that unique ID, in this case
386676, is a numeric value, an integer, that uniquely identifies that TV show. In other words, this is the
primary key for this table. Technically a sheet, but I'm using
that just because it's easier to type in than my black and white window. This ID column is the primary key for
shows, as I was mocking up earlier. What is the primary key
in the people sheet here? It is also ID. It's a different ID, but it's by
convention often called the same thing. But this people column,
called ID, is its primary key. And you can perhaps see
where this is going. Those same numbers also happen
to appear in this third table, but in that context, they're
sort of foreign keys. They didn't come from this star sheet,
but they are in this star sheet, so they're sort of
relatively foreign to it. So foreign keys is simply the presence
of primary keys in some other table, in some other data set. And so it's just a
description of relativity, but the primary key is the column that uniquely identifies your data. Foreign keys are just the appearance
of those same numbers elsewhere. So what does this mean? If we go back to VS
Code here, you'll see that when we created this
table using IMDb's real data, we specified that in our shows
table, which is bigger than the one I mocked up with Google Sheets there,
has not only an ID and a title; it, again, also has the year in
which the show debuted, and the total number of episodes. Because that's juicy data
that comes from IMDb, beyond what I mocked up a moment ago. In the ratings table,
meanwhile, there's a show ID, which is also an integer, just like
this ID, but as the name implies, this show ID column is actually going
to be a foreign key that references the shows table's ID column. So this is the relational
in relational databases. These are two tables
that have a relationship, and that relationship is that show ID
is referring to this actual ID here. And it allows us
essentially, conceptually, to link these two tables together. So what does this actually mean? Well, let me go ahead and do this. Let me go back to VS Code here. I'll clear my terminal, and let's
play around with some of this data. So let's go ahead and do
this just to experiment. Select star from ratings where-- let's get all of the good shows
where, just like Rotten Tomatoes, we'll do the cut off at 6.0 out of 10. So where rating is greater
than or equal to 6.0. And just so I don't overwhelm
my screen, let me just limit this to the first 10 results. In other words, this is
SQL syntax for selecting all of the ratings that are at
least 6.0 or higher from that table. Enter. And we see just the first 10 of them. Not the top 10 because we've not sorted
or grouped or anything like that, but the first 10 in the table. So what is interesting
here is that we've seen just some sampling
of the data, if you will. But this isn't all
that interesting here. Let me actually distill this just to
the show ID because, in other words, I want to know 10 good shows to watch. So let me just select show ID. So same result, but less data. It's just that first column thereof. Now this is going to
be a little annoying, but if I want to find out
the names of these shows-- think about the picture
from whence we came. All of the show's names are in the shows
table, but all of the show's ratings are in the ratings table. So even if I do select
star from ratings, I'm never going to know
what show I'm looking at. Like, what the heck is show ID 62614? Well, I could do this. I could select star from shows where the
ID of the show equals 62614 semicolon. Enter. OK, so I could watch
this show from 1981. Let me do another one. Select star from shows
where ID equals 63881. So I'm just grabbing
the second ID from here. OK. So Catweazle, a kid's show from 1970. All right. So I'll watch that. So now let's do another one. I'll just copy this. Like suffice it to say, this is
not the best way to look up data, where I'm literally copying and pasting
values from one query into the next. But this is where SQL
gets a bit powerful. I can have nested queries. I can put one inside of the other. So let me instead do this. Let me clear the screen,
and let me instead do this. Select star from shows
where the ID of the show is in the following list of IDs. Select. And, actually, I'll do
this on a separate line. Select show ID from ratings
where the rating value is greater than or equal to 6.0 semicolon. So I've separated this onto two lines. The dot, dot, dot is just
a continuation character, which means same query, multiple lines,
but the parentheses are deliberate. Just like grade school math, I want
what's in parentheses to happen first. And so what the database
will do is we'll select, as before, all of the
show IDs from the ratings table where the rating value is
at least 6.0 out of 10. And that's going to
return to me, effectively, a list, some kind of
collection of show IDs, which previously I was
copying and pasting, now the database will
do the legwork for me. It will now select everything from the
shows table, where the ID of the show is in that list of values. And it's actually going to be
more than 10 unless I go in there and say limit 10, which I can do. So let me go ahead and hit enter now. And now I see more useful
information not just the ratings information,
which in a vacuum tells me nothing about what to watch. Now I see the show ID, the title,
the year, and the episodes. But notably, what is-- and if I want
to distill this into just the title, let me actually go back
here and instead do select just the title from shows
where the ID is in this whole list. And I'll re-execute it by just
copying and pasting the same. The only difference now is instead of star, I'm selecting title.
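For reference, the nested query being built up here is roughly:

    -- titles of the first 10 shows whose rating is at least 6.0
    SELECT title FROM shows
    WHERE id IN (
        SELECT show_id FROM ratings
        WHERE rating >= 6.0
    )
    LIMIT 10;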
Here's now how the data analyst at IMDb might be selecting 10 shows that are really
good to watch, according to ratings. But what, of course, is
missing from the output, whether I do star or just title? What's missing? Yeah? AUDIENCE: Rating. - The actual rating. I know these are at least
6.0, but which is 7.0? Which is 10.0? It'd be nice to actually
combine the data in some way. So we can actually do that,
too, because it turns out that when you have two tables in
the world of SQL, or even more, you can actually join them together. You can join them together, literally
using a keyword called join. And you can do this as follows. Let me propose, with a sample
data set, these two tables. And dot, dot, dot just means we don't
really care about the specifics. We just care about the structure. So on the left here is a
simplified version of my shows table that has a show
ID and a show title, but I've omitted year and
episodes as just intellectually distracting from the
basic structure here. But they're there in the real table. On the right hand side here,
we have the ratings table with just two of its columns,
the show ID and the rating, but I've omitted the votes
because it doesn't really add anything to the discussion. But let me propose this. Notice that these two
tables and these two rows therein definitely have commonalities. Like, they both have the same ID. In the left table it's 386676, a.k.a. The primary key of that row. But it's also appearing in duplicate
in the right hand table, 386676, and in that context it's a foreign key. The point, though, is
that they're the same. So wouldn't it be nice if I could
treat one table here, one table here, and if my fingertips represent these
identical values, kind of like glue them together so I get one wider
table with all of the information together to satisfy your
concern, that we don't even know what the ratings are of those shows. Well let me go ahead and do this. Just for artist's rendition,
let me flip title and ID, which has no functional effect. It's just going to put the numbers
closer together on the screen. Let me then literally highlight the fact
that these two numbers are identical, and let me propose that we
do the equivalent of this, we somehow join these two
tables on that common value. Strictly speaking, I
don't need both values because they're duplicates, so I
don't care if one of them goes away. But what I'd really like
to do is select, indeed, a temporary table that is the joined
version of that original data. And frankly, I don't really
care as the user what the ID is. Heck, all I care about is what show
to watch and what its rating is. Give me the title and the rating. All of these numbers are,
again, metadata, things that the computer cares about,
but we humans probably do not. So how can we implement that idea? Of taking one data set
that has a relationship with this data set, and
somehow combine it together? Well let me go back to VS Code here. Let me clear my screen. And this is going to be a bit cryptic
at first, but it's very step by step. Let me do this. Select star from shows,
but not from shows alone. Let me join it with the ratings table. So let me select everything
from shows joined with ratings, but I need to tell the database, well,
what do I want to join things on? What are my fingertips? Specifically, I want them to
join on those common integers. So I can literally say, on, and then
I can specify one table on the left. Shows.id on the left should equal the ratings table's show ID column on the right. Again, if I'm joining shows with
ratings it's called ID in one, it's called show ID in the other,
but it's the exact same thing. In fact, if I rewind, this
is where we came from. Two tables with the same value. So with this query here, if I go ahead
and now specify not just that but let me further say, where rating is
greater than or equal to 6.0. And, heck, let's limit it to 10. Just fits on the screen, so it's more
of a mouthful, but when I hit enter now we have a wider table that indeed
contains everything, star, from having joined these two tables left and right. Now, again, I don't really
care about much of this data, like year and episodes and
definitely not the IDs. So let me actually hit up. Let me go to the beginning
of the query, and let me just select the title of the
show and the rating of the show. The query is getting a little long and
it's wrapping, but it's the same query. Except, instead of star,
I've done title comma rating. Now when I hit enter-- like, this is the list that would have
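For reference, the join being typed is roughly:

    -- glue shows and ratings together on their common ID
    SELECT title, rating FROM shows
    JOIN ratings ON shows.id = ratings.show_id
    WHERE rating >= 6.0
    LIMIT 10;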
Now when I hit enter-- like, this is the list that would have been nice to see the first time around. Show me 10 shows with a
rating of 6.0 or higher, but remind me what the
rating actually is so, maybe, I can prioritize the sevens, the
eights, the nines, and even the tens, if any. Any questions about this
technique of joining two tables? This sort of solves the problem that
we created in the world of this sheet, where I was just kind of playing
around, where I sort of moved all the data into its
separate locations, which is not at all pleasant to use. But with SQL, with join, you can
still get any of the data you want. Yeah. [INDISTINCT SPEECH] Correct. So, yes, I should have called
that out more explicitly. In my query here, I was using dot
notation, which we've seen in Python, we've seen in C. It means
something similar in spirit here, but it has nothing to do
with structured objects. In this case, it has to do
with tables and columns. So shows.id just makes clear that I
want the ID column from the shows table to line up with the show ID
column from the ratings table. Strictly speaking, I don't need
to do that because in this case there is no ambiguity. One table has a column
called ID, the other table has a column called show ID. So certainly the database can
just figure this out for me. But, for best practice, and for the
sake of being explicit, using the dot notation and table names
can help, especially if there's some common
language across them. All right. Well let's go back to
the bigger data set here. These are all six tables in IMDb. We focused for just a moment there
on like shows and ratings alone, but what about genres. So, genres, like comedy, and
documentary, and drama, and so forth. Turns out that this actually implements
a different type of relationship. Previously we saw a
one-to-one relationship, but it turns out that
IMDb supports what's called a one-to-many relationship
when it comes to genres. Why? Well, shows like The Office, I do
think are generally considered comedy and that's it, but there
are certainly other TV shows that might have multiple
genres associated with them. Maybe it's comedy and a bit of romance
thrown in, like rom-coms, and so forth, so you could imagine some shows
having two or three or more genres, and so one-to-many means that
one show can have many genres. One-to-one would mean one show can
have one rating, as we've seen. So why don't we go ahead and focus,
maybe, on how about a query like this. Let me go back to VS Code
here, clear my screen, and let's just look at
some of those genres. Select star from genres and
then I'll limit it to 10. And, again, I do this just to wrap
my mind around a new data set. I could look at the schema, but
that tends to be more cryptic. I just want to look at the raw data. OK. It looks like here there are a bunch
of genres: comedy, adventure, comedy. So two comedies, which is interesting. Oh, interesting. Family, action, sci-fi, family. So the values here are
duplicated, which it turns out is not the best design of IMDb. We literally just imported
the data as they implement it. But notice that show ID
62614 is a comedy, but so is show 63881, and so is show 65270. So it turns out that in the real world
sometimes data is somewhat messy. There's duplication of
comedy, comedy, comedy, but such is the way IMDb's data is. But what's more interesting to
me, for now, is notice this. This show ID in three rows is the same. So there's some show out there
that's considered by the world to be an adventure, a
comedy, and a family show. So let's see what that is. Let me just highlight and copy
that value, 63881, and do this. Select star from shows where the
ID of the show equals that value. And it turns out, we
saw it briefly before, it's a show from the 1970s called
Catweazle, which falls into all three of those categories. So by using a one-to-many relationship,
sort of depicted by this picture here, you can implement that same idea
without having that jagged edge. When we looked at the spreadsheet
earlier, in an earlier version of this, we had star, star, star, which we
could do again, genre, genre, genre, but now we instead have
two separate tables where this one-to-many
relationship is implemented across. So let's actually play around with it. Let me go back to VS Code
here, and let's actually take a look at the schema for genres. And we'll see that it's pretty
small, as the picture suggests. It's called genres. Every row has a show ID, which
is an integer, cannot be null. It has a genre, which
is text, cannot be null. And that show ID is a foreign key
in this table that references an ID column in the shows table. So very similar in
spirit, and so it really is just kind of on the
honor system that we're only putting one row for
each show in ratings, but 0 or more rows in the genres table for each show, as well. So what can we do once we
want to tinker with genres? Well, let me do this. How about we select the show ID from the
genres table where the genre is comedy. Like, I'm in the mood for some comedy. Let's see all of the available comedies,
except let's just limit it to 10. Here are the show IDs for 10 comedies
according to the Internet Movie Database. Well, that's not very interesting. I care about the title,
so we can do that. Select title from shows where the
ID of the show is not equal to, but rather, in the following
subquery, if you will. Select show ID from genres where
genre equals, quote unquote, "Comedy," limit 10, just
to keep things simple. So same query as before, but now
I'm using it as a nested query to select the actual titles
whose IDs match those there. Enter. And there are those titles for 10 comedies, maybe a couple of which we've seen. Catweazle we've seen before. But what if we want to now-- let's see. Maybe we want to get all of
the-- let's flip it around. Catweazle keeps coming up. Why don't we figure out, with a
query, what all of its genres are? So Catweazle. Let's see. So let's do select star
from genres limit 10, because I know it's in the first 10. This was it-- 63881 was Catweazle's ID. So let's do this. So select genre from genres where
the show ID equals that value. OK. So there's the same
query as we did before. Can we make this dynamic? Well, we can, too. What if I instead, more
dynamically do, select genre from genres where the show ID-- and not in. If I'm looking for a specific
show, now I can actually do equals. And in my subquery, I
could do this, select ID from shows where the
title of the show equals, quote unquote, "Catweazel,"
semicolon, enter. So, again, even though I'm
typing these very quickly, I'm really just composing
similar, smaller ideas that we've seen before into larger and larger queries to just get at more of this data.
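For reference, those two genre queries are roughly:

    -- titles of 10 comedies, via a subquery against genres
    SELECT title FROM shows
    WHERE id IN (
        SELECT show_id FROM genres
        WHERE genre = 'Comedy'
    )
    LIMIT 10;

    -- all of Catweazle's genres, looked up dynamically by title
    SELECT genre FROM genres
    WHERE show_id = (
        SELECT id FROM shows
        WHERE title = 'Catweazle'
    );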
So what's really going on underneath the hood? Well, you can think of it like this. If we've got this relationship
between shows and genres, here's an excerpt from
shows, and I didn't bother showing the thousands of other shows. Here's an excerpt from
genres on the right. What is that query-- what are
we essentially trying to do? Well, let me flip this around here. Let me highlight the fact that
this is the same, this is the same, this is the same. So wouldn't it be nice if I could
kind of get these all together? Well, if I join these
tables, we're actually going to notice an interesting artifact. If I join them together, as
we did before with ratings, I'm going to need to fill in the gap
there because this is not a table. Tables, by definition, always have
the same number of columns in every row. You can't have gaps in them like this.
just to fill that in this way. But if I were to try to
combine two tables that have this one-to-many
relationship, you're actually going to get duplication. It's not duplication
in the original tables, but in the temporary tables,
otherwise known as a result set, that's coming back to us. So what do I mean by this? Well, if we actually implement
this same idea as before, where we try to join
these two tables, let me propose that we do
it with this syntax. Let me do select star
from shows join genres, which is just like we did with ratings
but now let's join it on shows.id equals genres.show_id. But let's just do this for Catweazle, where ID equals 63881, semicolon.
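For reference, that join is roughly:

    -- a one-to-many join: Catweazle's row is repeated once per genre
    SELECT * FROM shows
    JOIN genres ON shows.id = genres.show_id
    WHERE shows.id = 63881;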
With the ratings, it worked perfectly because it was a one-to-one relationship, so
the rows just got wider, if you will. But now, because it's a
one-to-many relationship, when you execute these queries, you are
going to get back duplicate data but this is OK. It's considered OK because
this is sort of ephemeral. These result sets,
these temporary tables, exist just for us to
look at the data, just for us to crunch the numbers somehow. It's not actually stored in
duplicate in the database itself. If I wanted to tighten
this further, though, let me actually get rid of the star
and let me just do title comma genre. And, indeed, we can now see-- OK, Catweazle, three times, has three different genres. But, generally, we don't
even care about that so I can even whittle this query
down to just selecting genre, and that, too, will just give me
the result, effectively, hiding the duplication. But when you join data with
a one-to-many relationship you're temporarily going to get
duplicates, which is actually useful because it's very easy then
to get at the show's title no matter where you are in some loop. All right. Well, what more can we do here? Well let me propose that we revisit the
main database, here with six tables, and let's look at, perhaps, the
juiciest, and the one that's really what most people use IMDb for
is to look up shows and people therein. Let's focus on these three tables. And we can infer, from this diagram,
that there's now, for the first time, three tables involved in a relationship. There's people, there's shows, but
I've proposed this intermediary stars table, much like I
temporarily, in Google Sheets, gave us a third sheet to
link the two together. This stars table we're about
to see, its purpose in life is to join two other tables together. And, in fact, it's only going to have
two columns, show ID and person ID. So what this is going to do
for us is implement this idea, many-to-many relationship. Why? Because any TV show can
obviously have many people in it, but one person can presumably
star in many different shows. Like, Steve Carell has been in
multiple shows, not just The Office. So when you have a
many-to-many relationship, you actually do need this third table
to bridge the two any number of times. But it's going to make our life
a little more unpleasant to get the data we want because it's going to
add some additional steps, if you will. So let me do this. Suppose that I want to get
everything I know about The Office. Well, let's start with a
single query here in VS Code. Select star from shows where title
equals, quote unquote, "The Office." And I should see-- Oh, interesting. Several attempts at creating a TV
show called The Office over the years. You can, perhaps, infer the year in
which the most popular of them began. 2005? So I presume this is the one
we all know and have watched, at least in the US, which is this one,
386676, which matches the ID that I very carefully used earlier. So let me actually be a
little more deliberate, where the title equals The
Office and the year equals 2005. That query now gets us The Office that
we all in the US and, perhaps, love. But now let's actually
do something like, get all of the people who
starred in it, at least according to IMDb,
whoever had top billing. So how can I do this? Well, unfortunately, in the
shows table there are no people, and there's no stars even, but I
could do a nested query like this. Why don't I select the person
ID from the stars table where-- whoops, where the-- sorry, where show
ID equals, and then in parentheses let me do that same query as
before and for time's sake, I'll just copy paste so that
we get back the one and only Office in that subquery. So what I'm going to do is take
an intermediate step, a baby step, if you will. Right now, I have found in the
shows table The Office I care about. But if I want to get
to the people table, I have to take a step through the stars
table, this intermediate table, just to get anywhere close to
the names of those people. So what can I get from the stars table? Well, why don't I at least
select all of the person IDs in that table that are somehow
associated with the same show ID? So in VS Code, what I'm doing is this. Select person ID from
that intermediate stars table, where the show ID in question is
whatever the show ID is for The Office. I could literally type
386676, but I'm trying to do this more dynamically so
I've used a nested query instead. All right, this is correct. Whoops, ironically. Where show ID-- Oh, thank you. Oh, who said that? OK. Batman. I owe Batman Oreos after class. Thank you, Batman. That's de facto. Yes. Other good prize for that costume. [LAUGHS] OK, so let me fix this. My apologies. So let's go ahead and
select person ID from stars where show ID equals, and this
is where I messed up before. I did select star from
shows, but I can't look for a show ID equaling
an entire row of information. I instead need to do select ID from
shows where the shows title equals, quote unquote, "The Office" and
the year of that show is 2005. And just to call this out, much
like in C, I'm quoting strings, but I'm not quoting numbers. It's not necessary for
something like 2005. And just to be super clear, too, I have
generally adopted a style already today of capitalizing any SQL keywords,
like select, like from, like where, and so forth, and then using
lowercase for everything else. Strictly speaking, that's not necessary. Stylistically, we would
encourage you to be in the habit of using
uppercase for your SQL keywords because they just pop
more on the screen. It sort of makes things more readable. But strictly speaking,
SQL itself does not care about that capitalization of keywords. All right, now let me cross my fingers. And now I get back this
list of person IDs. And, again, my goal
was to figure out who is in The Office that debuted in 2005. This is not that interesting because
I don't know who any of these people are, but here's where we
can do one additional step and nest a nested query. So let me actually select the
names from the people table where the ID of those people is in. And then on a new line,
just to make it pretty, where the ID equals this
query, which I'll paste here, and then that equals, and then
another line and indent further, this query here. So just to save time, I'm copying
and pasting the previous query, but I'm wrapping it with one
outermost query, now, that's saying select names from people
where the ID of those people is in this result set, where the ID
of those shows is in this result set. So the parentheses make clear, ideally, what's happening, and in what order, from the inside out.
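For reference, that doubly nested query is roughly:

    -- names of the people who star in the 2005 version of The Office
    SELECT name FROM people
    WHERE id IN (
        SELECT person_id FROM stars
        WHERE show_id = (
            SELECT id FROM shows
            WHERE title = 'The Office' AND year = 2005
        )
    );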
Enter. And there we have it-- at least according to IMDb, for the latest season, these are the top-billed stars that are in this here database. All right. So how can we do something else? Well, let me just do it
in the other direction. Suppose we want to see all of Steve
Carell's shows, not just The Office. Select title from shows where
the ID of the show is in. And then here, I'm
going to do select show ID from stars where person ID equals. And then here, indenting
for clarity, I'm going to select Steve
Carell's ID by saying, select ID from people where
the name of that person is, quote unquote, "Steve Carell." And so in this way, I'm writing the SQL
query sort of in reverse. I'm asking first what I care about. But before I can even
answer that, I have to answer this nested query,
what is Steve Carell's ID? Once I have that, what are all the
show IDs that person ID has been in? And then, please tell me what the
title of all of those shows is.
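For reference, that query in the other direction is roughly:

    -- titles of every show Steve Carell has starred in, per IMDb
    SELECT title FROM shows
    WHERE id IN (
        SELECT show_id FROM stars
        WHERE person_id = (
            SELECT id FROM people
            WHERE name = 'Steve Carell'
        )
    );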
Let me go ahead and cross my fingers. And, voila. Some of these you might have heard of, some you might not have, but if you were to go on imdb.com
and search for Steve Carell, you would presumably see this here
list of shows that he's been in, in some particular order. Just to show you two other syntaxes,
but let me not emphasize this because it will look complicated. There are other ways to
solve this same problem. If you prefer the
approach of joining, we can actually join not just
two, but three tables together. But question first. [INDISTINCT SPEECH] No, just stretching first. So two final ways to execute the same
idea, but the first of them that I just did is arguably, relatively, simpler. You could do this. I could select the title
from the shows table by joining it on the stars
table, on the shows table's ID column equaling the stars table's show ID column. And then I can further join it on the
people table, on stars dot person ID equaling people.id. This is a mouthful, and even I am kind
of crossing my fingers that I didn't
it from my printout here, but what I'm effectively doing is
joining one, two, three tables all together by telling the database how
to join the shows table, with the stars table, and the people table. Specifically, the way to bridge
that picture, per the diagram, is to specify that shows.id should
be lined up with stars.show ID. And stars.person ID should
be lined up with people.id. And that's it. That essentially allows
us to connect these three tables with their common fields. If I hit enter now, I'm going
to get back, somewhat slowly actually, a really long
list, with some duplication, of all of those particular shows. Actually, all shows in
the database because I didn't practice what I'm preaching. I wanted to search for just
Steve Carell, but what you're seeing is the entirety of the tens
of thousands of TV shows. Control C is your friend. Let me go ahead and reload SQLite,
and let me type that again. Let me type that once more, sorry. Select title from shows, join stars
on shows.id equals stars.show ID, join people on stars.person
ID equals people ID where-- this was the part I left out. Name equals Steve Carell. And if I didn't screw up
by typing so fast-- enter. Now we get, a little
more slowly than before, those same shows that Steve Carell starred in.
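For reference, the corrected three-table join is roughly:

    -- explicit joins across shows, stars, and people, filtered by name
    SELECT title FROM shows
    JOIN stars ON shows.id = stars.show_id
    JOIN people ON stars.person_id = people.id
    WHERE name = 'Steve Carell';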
So this is just to say there's another way of doing this. But maybe a third way, which
is a little simpler than that, explicitly joining them in that way. You can, alternatively, still do
this, select title from shows, stars, and people. You just literally enumerate,
with commas, what three tables you want to join somehow. And then you can instead
of using join, you can just use where clauses to
make sure they line up properly. You can say, where shows.id
equals stars.show ID, and people.id equals stars.person ID, and name equals, quote unquote, "Steve Carell."
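For reference, that third formulation is roughly:

    -- implicit join: list all three tables, then line them up with WHERE
    SELECT title FROM shows, stars, people
    WHERE shows.id = stars.show_id
    AND people.id = stars.person_id
    AND name = 'Steve Carell';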
And I realize it's hard to keep track of everything now, all these darn different
ways to do this, this is just to say that there's
different approaches to solving the same problem and
different people might think about things a little differently than someone else. If I hit enter here, this, too, is a
little slower than the nested selects it seems, but it does, in
fact, give us that same answer. And just for thoroughness,
if I go back to our diagram. Besides the tables we've seen,
there's actually another writers table in there, as well. If you're curious to see what writers
is, let's just glance at that real fast. In VS Code, let me do .schema writers,
and it's actually almost the same as stars. Except in this case, in writers, we are associating a show ID with a person ID, both of which in this context are foreign keys that indeed reference back to the shows and people tables' ID columns. Which, again, if I do .schema
stars, which we didn't see before, is structurally the same. So the relationship is essentially
embodied, in this case, by IMDb, and in turn by us, by way of
the table's name-- TV stars, or writers thereof. All right. I know that's a lot. Any questions before we take
a higher level step back? Yeah. [INDISTINCT SPEECH] A good question. Does SQL provide any way to figure
out the mapping between tables without looking at the database? Short answer, no. Like, this is the dia--
well, that's not quite fair. Depending on the database
you're using, you might be able to click
a button, for instance, and get a nice pretty picture like
this, that shows the relationships. Indeed, we use software
to generate this. We didn't do this diagram,
for instance, by hand. SQLite itself does not
provide you with that. In SQLite the best you
can do is run .schema. And if you don't specify
a table name, you'll get every table in the database described. Once you get comfortable
with SQL, though, the idea is that you can read
the text and sort of infer what the structure is. But, yes, there are
graphical programs that can generate prettier
pictures like this, but it depends on the software you're using. Yeah. [INDISTINCT SPEECH] SQL is not case sensitive
with respect to its keywords. But table names, and other
things that you chose, you've got to be consistent
with how you capitalize them. I've done everything in lowercase,
but that tends to be one convention. Other people might use camel case,
where you alternate caps appropriately. All right, so let's take a
higher level look at this, and also consider some of the actual
real world problems that, tragically, are still with us in some form today. Notice that some of the
queries we executed a bit ago were actually relatively slow whereas
I hit enter and got a lot of my results like that. Those last two queries,
where I was joining all of those tables looking
for Steve Carell's shows were actually relatively slow. And let's try to take a simpler case. Let me do this. In SQLite you can actually time your
queries by running .timer and then turning it on. This is just going to keep track now
of how many seconds or milliseconds any of your queries take if
you're curious to figure out what's faster, what's slow. Let me do something
relatively simple like this, select star from shows where
the title of the show equals, quote unquote, "The Office," semicolon. All right, that was pretty darn fast. And it took 0.044 seconds in reality. If you care further, you can break
this time down into user time, like spent in my account,
versus system time, which means spent in the operating
system, essentially. But we'll focus on the real
wall-clock time of 0.044 seconds. Pretty darn fast. But we can actually do better because
it turns out in the world of SQL, you can create what are called indexes,
which is a fancy way of saying a data structure that makes it
faster to perform queries, like selects, and even
other queries as well. To create an index, you can
use syntax like this, create index, the name of the
index, on a specific table on the specific columns. And what I mean by this is if you know
that your application, like imdb.com or their mobile app, is going to
search on certain columns frequently, you can prepare the database
in advance to build up some fancy data structures in
memory so it can get back answers even faster than that. So case in point, let
me go back to VS Code here, and let me create an index
called, whatever, title index, for instance on the shows table,
specifically on the title column. So that's simply the syntax for
telling the database in advance: index this column, because I'm going to do a lot of searching on it, so I want the queries to be fast.
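For reference, the statement typed here is roughly the following; the name title_index is arbitrary:

    -- build an index over shows.title to speed up searches on that column
    CREATE INDEX title_index ON shows (title);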
Enter. It took a moment-- almost half a second-- but the index I only have to build once. Even though, technically, you have
to maintain it over time if you're doing updates, deletes, and inserts. But now let me do the
same query as before, select star from shows where title
equals, quote unquote, "The Office." Previously, that query
took 0.044 seconds. When I hit enter now? Boom. I mean, it takes no time
at all, or less time than it's even keeping track of
in terms of significant digits. Now, that might not seem like a big
deal to us humans and our human eyes, but if you've got hundreds,
thousands, millions of users, as maybe the real imdb.com
has, you just saved yourself a fortune in servers and complexity. Why? Because the same server
can, clearly, now handle way more people per unit of
time, per second, because each query takes less and less time. I mean, we're all too familiar
here and at Yale, surely, with certain University applications
that are just so darn slow. When you click a button
and the stupid thing spins and makes you wait and
wait, a lot of the time that can be explained by poor database
design, or databases that might not have been indexed properly. So when you're searching for
some course, for instance, in the course catalog,
it's taking forever because underneath the
hood it's essentially doing linear search over everything. But, by contrast, in
a relational database, when you create an index
in advance because you have a hunch that maybe users are
going to search on that column, like show titles. Essentially, you're
building up in memory what's called a b-tree,
which is not a binary tree. It's still a tree, though, if
you think back to week five. But it's a very short,
fat tree instead, where every node might have two, or three,
or 30 children, which essentially pulls the height of the tree way down. Which is to say that when you
search for some value in a b-tree, it's invariably going
to be in the leaves. So the shorter the tree is, the fewer
steps it takes to find the value you care about. So when you run create index,
that kind of data structure is being magically created
for you by the database so it's not a simple linear
search through the entire column, top to bottom. So with that said, we can see this
So with that said, we can see this really with more complicated queries. And let me go back to VS Code here. Let me propose to run that same, slow query from before, even though it's fine if you're not comfortable with the syntax. It was relatively
slow, though, to watch. So let's do select title from shows, stars, and people where shows.id equals stars.show_id, and people.id equals stars.person_id, and name equals "Steve Carell". So this was the last of those queries.
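Spelled out, that query looks something like the following; the table and column names follow the dictation above, and the filename is the same assumption as before.

# The "slow" three-table query, written out. Table and column names follow
# the dictation above; shows.db is an assumed filename.
import sqlite3

conn = sqlite3.connect("shows.db")
query = """
    SELECT title
    FROM shows, stars, people
    WHERE shows.id = stars.show_id
    AND people.id = stars.person_id
    AND name = ?
"""
for (title,) in conn.execute(query, ("Steve Carell",)):
    print(title)
conn.close()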
That just searches for all of Steve Carell's TV shows without using joins explicitly but by just enumerating all three tables in question, and then using where to cleverly connect the dots, so to speak. But my timer is still on. So notice now, when I hit enter,
it doesn't just feel slow. It actually took 2.763 seconds. Like, that's slow. That's expensive. That's going to annoy your users. That's going to annoy your students if
the database is thinking, and thinking, and thinking, and taking that much time. But let's note this. That same query I just executed
touched a bunch of columns, and it turns out that
whenever you declare a primary key in a database, at least
in SQLite, you get an index for free. Primary keys are automatically
indexed, so searching for a number in that column? Super fast. Not linear search, it's something
logarithmic, most likely, or, ideally, closer to something
like constant time even. Here, though, I'm touching
not just shows.id, but I'm also filtering on
stars.show_id, so a foreign key. Foreign keys are not indexed by default. I'm looking at people.id. That's a primary key. That's indexed. But stars.person_id, not indexed
by default, is a foreign key. Lastly, I'm filtering by
name in the people table. Names are not indexed by default. So
I'm touching three separate columns, two foreign keys, one name
field, that have no fancy tree structure built for them. But I can do that. Let me go down to my terminal here. Let me create one index called,
say, person index, though, I could call it anything I want, on the
stars table on the person ID column. So that indexes that foreign key. Took 1.7 seconds, but I
only have to do it once. Create index, show index
on the stars table, on the show ID column. So another foreign key
is getting its own index. Took 1.4 seconds, but
it's a one time thing. And, lastly, let's index all of those actors' names. Create index called name index on the people table on the name column. Enter. That took 1.0 seconds, but a one time operation. So, essentially, I've built up like three of these trees in memory now, specifically for these columns.
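Written out, those three one-time statements look like this; the index, table, and column names follow the dictation above, and the filename is the same assumption as before.

# The three one-time index builds just described.
import sqlite3

conn = sqlite3.connect("shows.db")  # assumed filename
conn.execute("CREATE INDEX person_index ON stars (person_id)")  # a foreign key
conn.execute("CREATE INDEX show_index ON stars (show_id)")      # another foreign key
conn.execute("CREATE INDEX name_index ON people (name)")        # actors' names
conn.commit()
conn.close()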
So now recall, previously, that slow query. If I scroll back up, that took, what was it? 2.7 seconds? I think. 2.7 seconds, previously. But if I now run the same
thing, select title from shows, stars, people where shows.id equals stars.show_id, and people.id equals stars.person_id, and name equals "Steve Carell". Same query as before previously took
2.7 seconds, which was the most annoying of them yet. Now, when I hit enter? Boom. 0.001 seconds, which is the
difference, again, between bunches of linear searches and, in this case,
searching a fancier, week-five-style b-tree. So indexes matter. So what's then, maybe,
the trade off here? Like, why not index every
column in every table? Because this is feeling great. Like, we're speeding things up
by factors of 1,000, practically. What's the trade off? [INDISTINCT SPEECH] Lots and lots of memory, or space. Yeah, so you're just
trading off space for time, which we said a couple of weeks ago
is an acceptable trade off depending on what resources you have. But it's probably an
overcorrection to index everything, especially since it will slightly slow
down inserts, updates, and deletes because you have to maintain this tree
structure so it doesn't devolve back into a linked list or something linear. But, in fact, being selective about
it is perhaps the best strategy. All right. So that we can now solve some
other problems more generally, let me just connect two dots. Even though we focused today on
SQL, specifically for databases, you can actually combine
one language with another and solve different problems. And so, in fact, let me do this. Let me revisit our
favorites.py from earlier, but let me actually now use the
favorites database as follows. Let me go into VS Code here. Let me remove favorites.db because if
you recall we previously updated it so that everyone's favorite became Fiftyville. So let me remove that file. Let me run SQLite3 on
favorites.db again. Let me create the file anew. Let me set the mode to CSV again. Let me import that file called
favorites.csv into an identical table as before called favorites. And then quit. So I've just reset things to my
backup, if you will, from the CSV file. So, I again have a favorites.db. Let me now minimize my terminal
window here and reopen favorites.py, and let me just go ahead
and get rid of that version entirely and focus this time
on not talking to a CSV file, opening it, iterating over the rows. We can actually use Python
to execute SQL queries and kind of get the best of both worlds. So let me do this here. Let me, from the CS50 Python library,
import our own SQL functionality. And this is a training wheel
we still provide for SQL because it's just much easier
than using the industry-standard libraries for SQL, which are just painful for simple tasks. So now, let me create a variable
called db, for database. I'm going to set it equal to this
SQL function that CS50 wrote. And this is going to look weird, but
the way you open a DB file in Python, whether it's with CS50
library or someone else's, you say SQLite colon slash,
slash, slash favorites.db. So weird syntax, but it's commonplace. Three slashes, not two
like a URL is usually. Now let's use a variable
called favorite and set it equal to the return value of input by
asking the human for their favorite-- not their favorite TV show but, rather, their favorite problem. And now, previously we opened up the CSV file and iterated over it, looking for the problem they typed in. Let me instead do this. Let me use the db variable's execute function, which comes with the CS50 library,
and let me execute this SQL query. Select count star as n from favorites
where problem equals question mark. And the question mark is a little
weird, but think of it, for now, like C's percent s. Comma favorite. I'm going to plug in whatever the
human typed in into that query where the question mark is. So no percent s, I'm using a
question mark in this world of SQL. This is going to give me
back a temporary table, and I'm going to store that temporary
table in a variable called rows. Because the temporary table is
essentially zero or more rows, so I'm going to name my variable rows. And then if I want to get
back the first and only row, I can literally do row equals rows
bracket zero just to be pedantic, just to get, literally, the first row
in that temporary table, or result set. And now if I want to print back the n column therein, I can do print row, quote unquote, "n."
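Put together, the whole program dictated above is only a few lines. Everything here follows the dictation; only the prompt string passed to input is an assumption.

# favorites.py, as dictated above; the prompt string is an assumption.
from cs50 import SQL

# Open the database; note the three slashes after sqlite:.
db = SQL("sqlite:///favorites.db")

# Ask the human for their favorite problem.
favorite = input("Favorite: ")

# The ? placeholder is filled in safely with whatever the human typed.
rows = db.execute("SELECT COUNT(*) AS n FROM favorites WHERE problem = ?", favorite)

# The result is a temporary table: a list of rows, here just one, with one column, n.
row = rows[0]
print(row["n"])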
So let me take a step back and do this. Let me go into SQLite3 of favorites.db and let me literally type something like this. Here I'm in SQLite3 at the bottom, and the problem I'm searching for, for instance, is Scratch, semicolon. Notice that is how in SQL, in
the command line of SQLite3, I can get back the answer I want. But what if I want to
make a Python program that queries the database for that value
and then prints something out? In two weeks' time, what if
I want to do that to make a web application, like
imdb.com or a mobile app, that writes code to
generate a user interface but that pulls the data from a database? You're not going to have your
human users using SQLite3, you're going to generate
the output for them. So let me close my terminal window here. Rather, let me close out of SQLite 3. Let me now run Python
of favorites.py, enter. I'm prompted for my favorite
using the input function. I type in Scratch and hit
enter, and there's my 34. So this is an incredibly
common practice, to use one language
for what it's best at. Like SQL is best at reading
data from databases, Python is, maybe, best in this
case for creating a user interface or, eventually, making
a web application. But it's certainly fine to
use one inside of the other. The documentation for this library,
if and when you get curious, is at this URL here, along with a
lot of CS50's own documentation. But there are some
problems, nonetheless, we might encounter in this world, and we
thought we'd end on these challenges. It turns out that in the world of
SQL, even though we haven't touched upon this yet, you're generally
working not with like hundreds of people and their favorite
languages and problems, not even thousands of movies, but like
millions of things in the database, like Instagram posts or
TikTok videos or the like. Those are huge databases
with millions of rows. The problem with SQL and,
really, databases in general, is if you have thousands,
millions of things happening at once, things can get out
of order and your math can be wrong. You can lose track of how
many likes something has. And so, for instance, as of last night
this remains the most popular Instagram post, to date, in the world. It was clicked on by so many people. And you might think that's
pretty straightforward to keep track of: click,
click, click, but not when there's millions of devices in
the world and thousands of servers, probably, at Meta running
Instagram's back end. So how do you actually keep
track of all of these likes? Well, maybe, Meta is
using code like this to implement the counter
for likes on Instagram. Maybe they are using lines of
code similar to what we just wrote: db.execute select
the current number of likes from the posts table
where the ID of the post equals whatever the one is
that the user clicked on. So the post a moment ago, presumably,
has a unique ID, a primary key, and that number just gets plugged in
here when Meta wants to figure out, all right, someone just
clicked on this post. Let's figure out what the current
number of likes is so we can add 1 to it and update the database. So this query here gives
us a temporary table containing the current number of likes
before you or someone else clicked. Maybe then we just
declare a variable called likes to get at the
first row's likes column. So this is just Python
syntax, similar to what I did, just to get the actual number you care
about, like 34 or whatever million it is here. But then suppose there's a second
database query and a third line of code that updates the posts table setting
the likes equal to this value where the ID of the post is this value. So these question marks are similar,
again, in spirit to printf's percent s. They're placeholders for
things that are going to be plugged in after the commas. So if I want to update the number of likes to be whatever the current number is plus one, I put it there as a second argument, and then I plug in the ID of that post.
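As a sketch, those lines might look like the following; the posts table and its likes and id columns are as just described, while the filename and the surrounding function are assumptions for illustration.

# A sketch of the like counter just described; instagram.db and the like()
# function are assumptions, the table and column names are as described.
from cs50 import SQL

db = SQL("sqlite:///instagram.db")

def like(id):
    # 1. Read the current number of likes for this post.
    rows = db.execute("SELECT likes FROM posts WHERE id = ?", id)
    likes = rows[0]["likes"]

    # 2. Write back that number plus one.
    db.execute("UPDATE posts SET likes = ? WHERE id = ?", likes + 1, id)

# If two servers run like() for the same post at the same time, both may read
# the same old value, and one of the two likes is lost: a race condition.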
The problem, though, with large systems like the Metas, the Googles, the Microsofts, and others of the world, is that they are executing code like this on multiple servers, thousands of servers, that might be executing slightly out of order. One might be faster,
one might be slower. Which is to say, even though
these three lines of code represent what should happen
when I click on that post and you click on that post
and you click on that post, the lines of code chronologically
might get shuffled a little bit. Like, maybe, this line of
code gets executed for me and then it gets executed
for you and then you, and then the server moves
on to the next line of code. So it's sort of multitasking,
handling lots of users at once. The problem with this is that you
run into a race condition of sorts, where the servers are sort
of racing to handle one user but other users requests are
happening at the same time. So the analogy that I was taught
years ago in an operating systems class actually pertains to
something like a refrigerator here. So we have a mini refrigerator here. Suppose you've got one in
your dorm room or your house, and you come home one day. And you really like milk, so you
open the fridge and you look inside, and there's-- Oh, we're out of milk. So you close the fridge, you walk
out to CVS, or somewhere else, and go to get more milk. Meanwhile, though, your roommate
comes home, who in this story also likes milk, and
so they decide that-- Oh, I'm out of milk in the fridge. So they, maybe, head out. Maybe they follow a different
path to go get more milk as well. Some number of minutes pass,
you both come home later on. Like, Oh, darn it. Now we have twice as
much milk as we need. We don't really like it that much. And some of it's going to go
sour now, so it's wasted. We've made a mistake. We should not have bought
twice as much milk. Now, stupid story, but the
point is that both of you made decisions based on
the state of a variable. But the problem was that variable
was in the process of being updated when someone else looked at it. The first person in the story
was on their way to the store so the variable was
about to be incremented in terms of quantity of
milk, but the other person didn't know that yet so they
too tried to increment it. And, in that case, we ended
up with too much milk. But suppose what might happen
here is similar in spirit. Suppose that post, at some point
in time, had just 1,000,000 likes. And suppose this line of code got
executed for me, for you, and for you after all three of us clicked on it. Well, the value of our likes
variable in Meta servers might be 1,000,000,
1,000,000, and 1,000,000. They therefore update 1,000,000
to be 1,000,000 plus one. And so what they update the
database to be is 1,000,001, but they do it three times. 1,000,001, 1,000,001, 1,000,001. But they've lost two of
those likes because they might have inspected the variable while
some other server, some other user's like, was being processed. So long story short, when
you have lots of data on lots of servers, all of
which is happening very quickly, you run into these so
called race conditions. And code like this can be
dangerous, even though it might not look incorrect at a glance. Thankfully, in the world of
SQL, though you won't generally have to deal with this, certainly not for the problem sets' sake, there are solutions to this. But too many engineers in the world
don't know this, don't remember this, or don't appreciate this reality. There are keywords in
certain databases that let you instead begin a transaction,
which means, essentially, that these three lines
of code should either all happen together or not at all. They should be atomic. That is to say, they should
all happen without interruption or they just shouldn't happen at all. And that ensures that the
math does not go wrong because my like will get
counted, and then your like, and then your like, as opposed
to them being intermingled and lost track of accordingly. So in Python, using
the CS50 library, you could wrap these three lines of code by first saying begin transaction to the database and, at the end, commit the transaction to the database, and that relatively simple solution avoids this problem of race conditions.
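Here's the same sketch as before, now wrapped in a transaction as just described; the filename and function name are the same assumptions as in the earlier sketch.

# Same sketch as before, now atomic; names are the same assumptions as above.
from cs50 import SQL

db = SQL("sqlite:///instagram.db")

def like(id):
    db.execute("BEGIN TRANSACTION")
    rows = db.execute("SELECT likes FROM posts WHERE id = ?", id)
    likes = rows[0]["likes"]
    db.execute("UPDATE posts SET likes = ? WHERE id = ?", likes + 1, id)
    db.execute("COMMIT")

# Either all three statements take effect together or none of them do, so
# concurrent likes can no longer read the same stale value.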
But this, too, is a topic unto itself. If you Google, invariably you'll see that this is a problem that has hit various servers and apps over time. But there's one other problem that
the world is still not very good at, and that's known as a
SQL injection attack. And it turns out that even with what
we've been doing, even code like this here, is all too easily vulnerable
to being hacked, if you will. That is, misused in some way, unless
you practice what I'm preaching, which is using placeholders like this. It turns out that it's very dangerous
to take user input in, generally. Like, most of your users
might be nice, good people, but there's always going to be
someone who's malicious, or curious, or who just executes something you don't
expect, and things can go wrong. So in the world of SQL,
here's what can happen. For instance, here is the
login screen for Yale accounts. Here's the comparable
screen for Harvard accounts. And all of us are in the habit
of using one of these screens or something similar or another. You're often asked for a username
or email address and a password. But suppose that Harvard or Yale
or Google or Microsoft or wherever, are taking user input from forms
like this, be it on the web or on a mobile app, and they're just
plugging your input into a SQL query that they wrote in advance that just is
waiting for your username or password to be plugged in so they can ask
a complete query of the database. It turns out that if I'm a
bit malicious, or curious, I could, maybe, type in some
funky syntax to fields like this that I know have special
meaning to certain databases. And it turns out in the world of SQL,
SQLite in particular, single quotes are clearly important. Because I've been using them
all day long for strings, and I didn't mention this yet, but the
comment character in SQLite is dash, dash. If you want the rest of a line to be
ignored, you just say, dash, dash. So it's not hash, it's not slash, slash,
like in Python and C, respectively. It's dash, dash, or two hyphens. So suppose that I'm a
hacker or a curious student, and I want to see if Harvard
knows what it's doing when it comes to SQL injection attacks. I could literally type in, maybe
my username with a single quote and then dash, dash. Well, why would I do this? Well, suppose for the sake of discussion
that some developer at Harvard, or any website, really,
has written a line of code like this to check if the
username and password just typed in match what's in the database. So how might I do this? And if so, let the user log in, show
them their account, or whatever. So here's the line of code in question. Select star from users where username
equals question mark and password equals question mark. This is correct. This is green because it's good. It is not vulnerable to attack because I'm using these
placeholders here, which, even though we've implemented them in the CS50 library, most SQL libraries support using the same syntax. But it can vary by system. This is safe. What is not safe is to
use some of the stuff you learned last week where
you can just use f strings, or format your own strings and
interpolate values with curly braces. For instance, suppose
you took me at my word last week that you
can do this in Python. This code here, unfortunately,
in yellow, is not safe. Here's a format string in Python. Here's the beginning of a SQL query. Here's a placeholder
for the user's username. Here's a placeholder
for the user's password. And I've proactively, correctly,
put single quotes around them. The problem is, if you just blindly plug users' input into pre-made queries like this, they can finish your thought for you in ways you don't expect, if you trust the users.
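For the sake of discussion, here's a sketch of both versions side by side; the users table, the database filename, and the prompts are assumptions for illustration, and only the second, placeholder version is safe.

# A sketch of the unsafe and safe versions; users.db, the users table, and
# the prompts are assumptions for illustration.
from cs50 import SQL

db = SQL("sqlite:///users.db")

username = input("Username: ")
password = input("Password: ")

# UNSAFE: an f-string pastes the user's input straight into the query, so
# input like  malan@harvard.edu'--  comments out the password check.
rows = db.execute(
    f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
)

# SAFE: placeholders let the library escape whatever the human typed.
rows = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?", username, password
)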
For instance, if I plug in for my username malan@harvard.edu, single quote, dash, dash. Notice what happens. Here's the single quote from the query. Here's what I typed in. But, wait a minute. It looks like this single
quote that I typed in finishes the thought that the developer started. The dash, dash means, heck,
ignore the rest of that. And, indeed, I've literally grayed it out because what effectively happens is that the only thing executed is what's in yellow here at the moment. Why? Because everything after that close quote, which finishes the developer's
thought, is just ignored. So I mean, this is literally an example
of how you can hack into a database by injecting SQL. Like dash, dash is an example of SQL. Silly as it is, it's a comment
that tells the database to ignore the password. So what does this mean? Well, of course you're going to get
rows back now because if you're only searching for someone by username, it
doesn't matter what their password is, what they typed in, you've
essentially blacked out that part and you're not even
checking the password. The effect then would be that you could
log in as me or Carter or anyone else, just by knowing their usernames if
the Harvard developer wrote code in this way. And even though I'm pretty sure
HarvardKey does not suffer from this, so many darn websites have in the past. And if you Google "SQL injection attacks"-- filtering, for instance, by the past month or the past year-- you will, tragically, likely
see results because humans continue to make this mistake. The solution though,
ultimately, is actually to just use placeholders, to just
use the library that escapes potentially dangerous input. And this looks a little
weird, but in C we saw that the escape
character was a backslash? And that made backslash n, or backslash
something else, be treated specially. Weirdly, in SQL it tends to
be just another single quote. So if you do quote, quote
that actually means a literal single quote. It's not, like, the empty string,
or nothing in between it. So it looks weird. But, long story short, if
you use a library like CS50, or anything in the
real world that handles escaping of user input,
the whole query you can think of as being now good and green
again because it doesn't matter what the human types in. Any scary characters will be
properly escaped by the library. So it'll depend on the
library you're using, but it almost always is
the syntax using question marks or some similar placeholder. All right. So with that said, you are
now inducted into the Hall of People who know now a little
something about databases. We've only just scratched the
surface of using the language, but it's now something we'll use to
build up more and more interesting applications, especially for
final projects when we soon transition to web programming or mobile
app development, if you go that route. But you'll soon be able to speak a
language, literally and figuratively, that those before you
have acquired as well. So you are now qualified to
understand this sort of internet joke. This is someone who, if
I Zoom in, was trying to get out of paying some camera based
tolls by tricking the state or the city into deleting or dropping
their whole database. Drop means delete the whole
thing, not just the rows therein. OK, so maybe not that funny. But this is now the note we'll end
on, similar to other xkcd comics we've introduced you to. Every CS person out there
has seen, knows this comic. So if you ever refer to little Bobby Tables with a wink, if it's a computer scientist on the
other end, they'll know whom you mean. [LAUGHING] OK, there we go. All right. Tough crowd. All right, Batman, come on down for your
cookies, and we'll see you next time. [APPLAUSE] [MUSIC PLAYING]