There are more pages on the web than people
on Earth. And while I haven’t checked, I am sure each
one is full of original, high quality content that would make our ancestors proud. Most people access web pages through a browser,
but as programmers we have other methods... Today, we will learn how to use Python to
send GET requests to web servers, and then parse the response. This way you can write software to read websites
for you, giving you more time to browse the internet. In a browser, you access a web page by typing
the URL in the address bar. URL stands for “Uniform Resource Locator”
and this string can hold a LOT of information. At the beginning is the protocol, which is
sometimes called the scheme. Next is the host name. Sometimes you will see a colon followed by
a number. That number is the port. If the port is not explicitly specified, you
can determine it from the protocol. HTTP uses port 80, while HTTPS uses port 443. After the host name comes the path. The text after the question mark is called
the “query string”. It holds a collection of key-value pairs separated
by ampersands. And lastly, you may see a hash symbol (“#”) at the end
followed by a string. This value is called a fragment and is used
to jump to a section within the webpage.
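As a rough sketch, here is a made-up URL taken apart with the urlsplit function from urllib.parse (a module we will meet in a moment); every value in it is just an example:

    from urllib.parse import urlsplit

    # A made-up URL containing every piece described above
    parts = urlsplit("https://example.com:8080/videos/watch?v=abc123&t=45s#comments")

    print(parts.scheme)     # https - the protocol, or scheme
    print(parts.hostname)   # example.com
    print(parts.port)       # 8080 - the explicit port (None when it is left out)
    print(parts.path)       # /videos/watch
    print(parts.query)      # v=abc123&t=45s - key-value pairs separated by ampersands
    print(parts.fragment)   # comments - the part after the "#"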
Python 3 comes equipped with a package that simplifies the task of building, loading, and parsing URLs...
The urllib package... This package contains five modules: request,
response, error, parse, and robotparser. The request module is used to open URLs.
The response module is used internally by the request module - you will not work with
this directly. The error module contains several error classes
for use by the request module. The parse module has a variety of functions
for breaking up a URL into meaningful pieces, like the scheme, host, port, and query string
data. And finally there is robotparser. An exciting name, for a less than exciting
module... It is used to inspect robots.txt (“robots-dot-t-x-t”)
files for what permissions are granted to bots and crawlers.
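For example, here is a rough sketch of asking Wikipedia's robots.txt whether a crawler may fetch a particular page (the URLs are only illustrative):

    from urllib import robotparser

    # Point the parser at a site's robots.txt file and read it
    rp = robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    # Ask whether a given user agent may fetch a given URL
    print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Black_hole"))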
Today we will focus on the request module, since this is where the action lies. To begin, import urllib. Now use the “dir” function to see
what is available. Not much… This is because urllib is a package holding
the modules that do the actual work. So instead, import the module inside urllib
that you want to use. We want to use the “request” module. If you call the dir function on the
request module, you will see a lot of classes and functions.
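In the interpreter, that sequence looks roughly like this:

    import urllib
    print(dir(urllib))           # only a handful of names - urllib is just a container package

    from urllib import request   # pull in the module that does the real work
    print(dir(request))          # dozens of classes and functions, including urlopen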
The function which enables you to easily open a specific URL is the “urlopen” function. Just as the “open” function is used to
open files, “urlopen” is used to open URLs. As an example, let us open the home page for
Wikipedia. The function returns a “response” object. If you look at the type, you will see it is
NOT the response in the urllib package, but an HTTPResponse object from the http.client
package. To see what you can do with the response, use the dir function.
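A minimal sketch of opening Wikipedia's home page and inspecting what comes back:

    from urllib.request import urlopen

    response = urlopen("https://www.wikipedia.org/")

    print(type(response))   # <class 'http.client.HTTPResponse'>
    print(dir(response))    # includes read, peek, status, getheaders, and more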
First, let us check if the request was successful
by looking at the response code. 200… This is actually good news. A 200 response code means everything went
OK. You may ask why the number 200 was chosen. I may ask the same thing... Next, let us see how large the response is. This is the size of the response in BYTES. We can use the “peek” function to look
at a small part of the response, rather than the full value. This most definitely looks like HTML, but
notice that this is not a string. The “b” at the beginning tells us this
is a “bytes object”. The reason for this is that web servers can
host binary data in addition to plain HTML files.
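Continuing with that response, the checks described above look roughly like this (the exact numbers depend on the page Wikipedia serves when you run it):

    from urllib.request import urlopen

    response = urlopen("https://www.wikipedia.org/")

    print(response.status)       # 200 - the request went OK
    print(response.length)       # size of the body in bytes (None if the server does not report it)
    print(response.peek()[:80])  # a first look at the body: a bytes object, not a str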
Let us now read the entire response. If you look at the type, it is indeed a bytes object. And it is the correct size... We can convert this to text by decoding it. If you look at the peek value, the character
set in the response is “UTF-8”. So to decode this bytes object, call the “decode”
method and specify the encoding that was used. We now have a string… And if you display the value, you can see
all the HTML for the web page. By the way, look what happens if you try to
read the response a second time. Nothing… This is because once you read the response,
Python closes the connection.
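Here is a sketch of reading and decoding the body (fetching the page again so the snippet stands on its own):

    from urllib.request import urlopen

    response = urlopen("https://www.wikipedia.org/")

    data = response.read()         # the entire body, as a bytes object
    print(type(data), len(data))   # <class 'bytes'> and its size

    html = data.decode("utf-8")    # the page declares a UTF-8 character set
    print(html[:200])              # now a str: the start of the HTML

    print(response.read())         # b'' - a second read returns nothing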
As a second example, let us send a search request to Google. How rude! Earlier we said that a 200 response code meant
everything was OK. So things are definitely not OK. A 403 response code means that while our request
was valid, the server is refusing to fulfill it. I can understand their reaction. If they let anyone scrape their search results
without restriction, then competitors would use this information to their advantage.
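In code, the attempt looks roughly like the sketch below. Note that urlopen reports an error status such as 403 by raising an HTTPError, which we catch here to read the code; the search terms are only an example:

    from urllib.request import urlopen
    from urllib.error import HTTPError

    try:
        response = urlopen("https://www.google.com/search?q=black+holes")
        print(response.status)
    except HTTPError as e:
        print(e.code)   # 403 - the request was understood, but the server refuses to fulfill it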
Let us try a different example... We will now load the YouTube page for this incredible video on Black Holes. Here is the URL. Notice that this URL contains two parameters
in the query string: v and t. Here, v is the video ID, and t is the time in the
video to begin playback. One way to construct the querystring is to
append a lot of strings together. But there is an easier way. To see this, import the “parse” module. Looking at the dir output, you can see a large
collection of functions for working with URLs. Here, we will use the “urlencode” function. First, we create a dictionary containing the
querystring parameters. Next, call the “urlencode” function. The result is a string that is suitable for
use as the querystring. Notice, however, that the question mark is NOT included. We can now build the URL.
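A sketch of that, with placeholder values standing in for the real video ID and start time:

    from urllib.parse import urlencode

    # Placeholder values - substitute the actual video ID and start time
    params = {"v": "VIDEO_ID", "t": "2m30s"}

    query_string = urlencode(params)
    print(query_string)   # v=VIDEO_ID&t=2m30s  (no leading question mark)

    url = "https://www.youtube.com/watch?" + query_string
    print(url)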
Next, open the URL using the “urlopen” function. If you call the “isclosed” method on the response, you can see we still have a connection with the server. The response code is 200, so our request was
fulfilled. How generous… We can then read and decode the server response
in a single line. Looking at the first 500 characters of the
html, we see everything looks to be in order.
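Putting the whole example together in one sketch (again with a placeholder video ID):

    from urllib.request import urlopen
    from urllib.parse import urlencode

    params = {"v": "VIDEO_ID", "t": "2m30s"}   # placeholder values
    url = "https://www.youtube.com/watch?" + urlencode(params)

    response = urlopen(url)
    print(response.isclosed())   # False - the connection is still open
    print(response.status)       # 200 if the request was fulfilled

    html = response.read().decode("utf-8")   # read and decode in a single line
    print(html[:500])                        # the first 500 characters of the page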
You have now taken your first step towards bypassing the browser, and interacting with web servers programmatically. But there is much more to learn. What if you want to send a POST or PUT request? How do you include cookies in your request? What if authentication is required? And what if you aren’t subscribed to Socratica? Why don’t we make videos more quickly? Be patient. You will soon learn how to solve all of these
problems…