Practical Python Project: Web Scraper Prototype (Semi-Livecoding)

Captions
All right, so your first instinct is probably to rip open a text editor and start writing some code, and that would probably be wrong. One of the things you want to do when you're starting a new project is to figure out what specifically you want for the prototype. The prototype is gonna be the first phase of development. It's a quick proof of concept, just to let you figure out what this thing might really look like when you build it. So the prototype has to accomplish one basic task, maybe the core task of your program.

What a lot of people do is they rip open a text file and start writing code that's just something they know how to do, that they think is gonna be part of this project. That's a great way to get yourself into a lot of lines of code that don't actually help you solve the problem, and that just make things more confusing and complex to deal with later. So the first thing you want to do is actually stop and look at the problem you're trying to solve.

In our case it's fairly simple and straightforward: we've got a website and we want to scrape that website. That means we want our program to make HTTP requests to it, just like a browser, and then we want to grab some data from the responses that this website sends us. Right now we're looking at the Humble Bundle cloud computing books bundle. This is what we want to scrape, so we have a URL that brings us to what we want to scrape. Eventually we want to scrape all bundles, but that's going to be a separate part of our program. Right now we just want to do a proof of concept: a basic script that makes a request to the site, then grabs and parses the response.

What I'm thinking here for the prototype is that we basically have it separate these into tiers, so it'll be like: the one dollar tier contains these books, then the $8 tier, and so on. So we'll split it into tiers and then list all of the products that each tier has. So I'm gonna go to my code directory and I'm gonna create a
new directory just for this project, and we'll call it bundle scraper. I've opened this directory in my text editor. It doesn't really matter what you use as long as you're comfortable with it; I'm using Sublime Text here.

So the first thing I'm going to do, as in any Python script, is write the shebang line, #!/usr/bin/env python. That means that if this script is marked as executable, Linux will know to use whatever this environment's Python is to run it. I'm just going to change that to python3. Right now we'll just name it bundle_scraper.py. Okay, so we've got the very beginning of a script. If you're a beginner, just to make sure everything's working, you can write a little hello world and save that. We'll just get bundle_scraper.py to be executable, and now we can run bundle_scraper.py and you can see it prints out hello world. So we basically have an end-to-end program that works, our environment basically works, and now we can start developing.

Now, the first thing I like to do is set up a virtualenv. I've made a whole video on this so I'm not gonna cover it here in detail, but I will show you basically exactly how I'm gonna set that up. First I'm gonna create a new file, I'm gonna call it requirements.txt, and this is where we'll list all of the Python packages that we need: all the libraries or modules that we're going to be installing just for this project. What I'm gonna do now is go back to my shell and create a virtual environment for this project. So we'll say virtualenv (you'll need to install the python-virtualenv package for this), the Python is gonna be python3, and then we'll say venv. That's gonna create a new virtual environment, which is just a directory that's filled with all the binaries and stuff that we need to develop without messing up our system, keeping all of our dependencies, all of our stuff, contained in this directory. Again, just check out the virtualenv video for more
information, but we can basically activate this by saying source venv/bin/activate, and you'll see that our prompt changes and we have this little venv thing here, which just means we are in the venv virtual environment. You can name this whatever you want, so here you could just say the name of your project or something. To get out of this you can just close the shell or, I think, type deactivate. Yep, so now we're back out.

What that does (again, this is in the other video, but I'm just going to show you very quickly): when we've activated this, you can see that instead of using system dependencies, now we're using the dependencies that we have locally in this venv. So a new Python binary has been installed here. Likewise, all of the packages that we install will be accessible just in this virtual environment, so they don't clutter up our system and we don't have dependency conflicts, like version conflicts, between different projects that we're developing. Okay, that's all I'm gonna say about venvs. Check out the other video for more details, but this is all you need to know for right now.

Okay, we have a virtualenv. Now here's what we actually need to do: we need to scrape this thing. So here's our first line of code, you ready? This is our URL, this is what we're going to scrape. We are literally gonna do this in the simplest format possible, and we're gonna enter this string here, which is just the URL we want to scrape. Then we need to scrape it, so we actually need to make an HTTP request and then inspect the result of that.

As always, I'm gonna open a shell next to me and get into the REPL so I can experiment along with what I'm doing. Here I need to go into code, bundle scraper, and then I need to do that source blah blah blah thing so I'm using the same Python version my script is going to use. And here I am. So the first thing I need is the library we're going to use for this. It's called requests, so if you
search for Python requests you'll be brought to the documentation. It's an HTTP library for Python that gives you a really nice and simple interface, or abstraction, for dealing with HTTP. So this is one of the ingredients we're gonna use so that we don't have to do more low-level stuff or anything really more complex ourselves. You can see that what we really want to do is make a GET request. That's the HTTP request that asks for a resource, in this case this web page. So you can literally grab this. We don't need the auth, the user/pass thing; it's just there if you do need it. And you can see we can start inspecting that as soon as the request has been made.

So why don't we play with that. We don't need this, and we're just gonna pass the url here, right? So this is our string. Why don't we do this in the REPL, so let's paste these things in, and here's our first error: requests is not defined. So we need to import requests.

If you haven't installed requests yet, here's how you do that. You could just say pip install requests, but then you're not really tracking this in any way, so getting back to a state where you can run this program is gonna be a little weird. So here, in requirements.txt, you're just gonna say requests. You don't really care which version we have; you can also specify that in requirements.txt. Now, because we have this requirements file that just says requests, which is the library we want, we can say pip install -r requirements.txt, and then pip will go out, grab every package in there and install it, build it, whatever it needs to do. Some packages will require that you have build essentials installed, which is just a C compiler, header libraries, and various other things. If you run into errors, usually one quick Google will tell you exactly what you need to install to be able to build that stuff.

Okay, so now we can go back to Python and now we can import requests, right? So this is the library we need, and we'll do that in here
too: import requests. And now we can paste these things in here again. Now it's taking, well, did you see that? It just blocked before returning, and that's because this is actually going out and making an HTTP request. So if we say r, we can see the string representation of the object that has been returned to us. It's a response object, but that's actually full of information.

One cheap and easy way to inspect an object is to look at its magic __dict__ attribute (underscore underscore dict underscore underscore), and that'll show you the thing as a dictionary. In this case that's not all that useful, because you can see this is the HTML, and JavaScript it looks like. So this is everything this object contains, and part of that is the markup. Where's the beginning of this object? Okay, so you can see it also contains some other information. This is a Python dictionary now. You can see how long the request took, content consumed, all this stuff. I have no idea what half of this means, but you can see this is some very HTTP-like stuff which you could read out of this object if you wanted: the cross-site request forgery cookie, blah blah blah.

What we really care about is just the status, which is helpfully printed here, and the actual markup. So if we say r.status_code, that gives us the integer 200. Now how do I know to do that? Well, it's because I read the docs. You can see that r.status_code is a thing that you can ask this object for. This is why reading the documentation is so important. Don't waste time not reading docs and then googling for every single thing you want to do and looking on Stack Overflow. Spend a few minutes reading the documentation. I've worked with requests a whole bunch, so the basics are second nature to me now, but if you're doing this for the first time, read through this, read through the Quickstart, look at some of the features that it offers, so that you get an idea of how you can use this library.
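The inspection steps above can be sketched without touching the network. This is only an illustration: FakeResponse is a hypothetical stand-in class (it is not part of requests), and the markup string is made up. It shows that __dict__ is an attribute holding an object's data as a dictionary, and that the two things the video cares about are the status code and the text.

```python
class FakeResponse:
    """Hypothetical stand-in for a response object; not the real requests class."""

    def __init__(self, status_code, text):
        self.status_code = status_code  # integer HTTP status, e.g. 200
        self.text = text                # the markup, as one flat string


# With the real library this would be something like: r = requests.get(url)
r = FakeResponse(200, "<html><head><title>Bundle</title></head></html>")

# __dict__ holds the instance's attributes as a dictionary;
# vars(r) returns that same dictionary.
internals = vars(r)
print(sorted(internals))  # the attribute names stored on the object

# The two pieces of information the prototype needs:
print(r.status_code)
print(r.text[:6])
```

Dumping __dict__ is fine for a quick look, but the documented attributes are the ones to rely on in the script itself.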
You're not gonna listen to me, but after you waste a lot of time you will listen to me, and you'll say: Dave was right, I was wrong, and I'm so sorry. You're gonna leave a comment below: I've wasted tens or hundreds of hours, and now I finally have learned to read the docs first so that I don't look like a big dumb idiot to everyone on the internet when I ask questions that are answered on the second page of the documentation. Whoo, got emotional here for a second, sorry. I guess something about having wasted hundreds of hours of my own time rubs me the wrong way when I talk about it.

Okay, so we don't really care about this stuff, but what we do care about is the actual markup. So let's try r.text and see if that's what we want. We'll say r.text, and that is definitely looking like what we want. This is the actual markup, the HTML document that we've gotten back from the server. That is excellent; that's actually one of the first things we want.

So now we have this HTML, and that doesn't actually do us all that much good, because we'd have to look through here for all the product names and whatever. How do we actually do that? Well, that's a process called parsing. We're gonna take this, and we're gonna take a Python module that understands XML or HTML (HTML is kind of like XML), and we're gonna parse this into a Python data structure that we can work with. Right now this is just a string, right? It's sort of flat. You can't ask this string: hey, please give me the third div that you've got, or please give me the fifth list item in the section of products. Right now it's just a string. Python has no idea what to do with this except string-like things; it's just a blob of stuff.

What we want to do is parse this, so we are going to use a library called Beautiful Soup. The package is called bs4, it's Beautiful Soup 4. So we're gonna do the same thing again where we pip install -r requirements.txt. Now it already knows that we've got
the first one: requirement already satisfied, requests. But you can see that it's installing Beautiful Soup now, and it's successfully done that. Again, if you run into issues here, just google.

Read the docs again: Beautiful Soup Python. Also, if you just search for HTML parsing, that will lead you eventually to Beautiful Soup. This is Beautiful Soup 3, this might actually be the wrong one. Beautiful Soup 4, Beautiful Soup documentation. Okay, this looks more like it. We've got yet another Quickstart here. Obviously programmers clearly can't be trusted to read documentation, so there's a quick start on the landing page of every single package you're gonna use.

You can see that they have, just as an example, an HTML document, which actually looks a lot like our response text. Let's see how they actually use this. They import BeautifulSoup from this package we just installed, and then they're capturing something in a variable, and that's BeautifulSoup; they're just using this as a function, with html_doc and, it looks like, html.parser. So you're telling it that you want to use the HTML parser, which kind of implies that it can do other types of markup.

So, as any good programmer knows, we're gonna steal this and we're gonna stick it right in our program. We're doing this, and this has to come after, so we've got our request. You know, this is getting a little ugly, I'm gonna move this over to eight, so now we're gonna have our exploration environment over here. I'm gonna do this again, go back to my program. There we go, a little more room for you to see. And paste this at the bottom.

So html_doc is not defined. If you remember, what we wanted was r.text. We're just gonna call this resp, just to make it a little longer, and then we're gonna say resp.text is actually the thing that we want. You could tack .text on to the end of this to grab just that, but then you kind of lose everything else that it got
in case we want to use that later. You could also create a new variable and just say, you know, markup equals resp.text, and then just use markup in here. I leave that up to you, but this is just how I'm doing it quickly. We're gonna run the HTML parser on this and just see where that gets us.

So we're gonna do all this again, paste it into our REPL, and you can see it's fetching the HTML page, and soup has been assigned, so that's already been parsed, looks like. Let's look at what this is. Okay, this is actually looking a little bit better. As you can tell, it's not just one crazy string; it looks like it's somehow been formatted, and, well, we know it's been parsed. There we go. So we can inspect this object in a second, but it just... oh my, oh my, so much data. Wow. I think this might blow our scrollback buffer. No, okay, I found the beginning.

So soup just poops out all of this information. This is the stringified object, basically. When you type something in the REPL, Python will basically take the string representation, which the developer of Beautiful Soup has decided is a nicely formatted representation of the parsed HTML.

So let's see what we can do with this object. How do we do that? We read the docs. Correct, that's what you said, right? Read the docs, always read the docs. Okay, so it looks like if we just do soup, it actually does this for us. We don't really need to do this, because we don't really care about printing this out. So we've got it parsed, and now we need to work with it.

Let's see what we can do. It looks like parsing this lets us access different parts of the markup, so that if you know HTML you can access different parts of that HTML markup, because this object has stored them in certain attributes. So we can say soup.title. Let's see what our title is. Looks good to me. Let's see what paragraphs are here: soup.p. Lovely, beautiful. And now we're getting into the territory that actually looks interesting for our project. Now it's
totally normal to have to keep reminding yourself of what you're doing, because it's easy to get caught up in the features of this library or that, and you start playing around and you kind of forget what you came for in the first place. So especially when you're building a prototype, stay focused on the actual thing you want to accomplish.

Before we get into this, we need to figure out what we actually want to collect. For that we go back to the web page. This is what we just scraped. What we want is some way to identify the tiers, like pay $1 or more, pay $8 or more, blah blah blah, and then the products. So let's just see how we can do that.

One of the really useful tools is to right-click on any element and then choose Inspect Element. I think all the major browsers have this; I use Firefox because, yay, freedom. So right-click on the thing you're curious about and it will be highlighted in this representation of the DOM, the Document Object Model. You can see that you can inspect each of these, and you can open or close them just like code folding in your editor. It looks like the class that we want is dd-header-headline. Remember that for now; we can copy this out. Let's see if the second thing is also dd-header-headline. Yeah, it looks like these are all dd-header-headlines. So it looks like the element we want has a class of dd-header-headline.

If you are totally new to HTML, just try to follow along. You don't need to perfectly understand it, but basically HTML is made up of elements. Elements can have IDs or classes. Classes are often used for styling, for markup, and for semantic ordering of this information hierarchy that is HTML. Just open your mind and follow along. These are basically tags that we can search this document by, right? So if we search for everything that has a class of this, every element that has this class assigned to it, like an h2 with class blah,
will show up. There can be h2s that have a different class, and there could be h2s that don't have a class added to them at all; those won't show up in our search. So for now just stick with me and you'll figure out how this works. If you want to understand more about how HTML works, it's really simple. You can learn the basics, which is about everything you need, in an hour or two. I recommend you do that at some point, but it's not that important that you're an HTML expert right now.

Okay, so let's just mark this down. We'll say bundle tiers, and that'll remind us that each tier seems to have its headline, or the name and the price, stored in this dd-header-headline class. Classes are basically accessed through this dot notation in CSS selectors, which you see in things like, well, JavaScript.

So how do we actually access that element that we want? Let's just test to see if this is what we want. We think we know. We've got this soup.find_all, and that actually looks like it's something we might want to try. So we're gonna paste this in here, and instead of 'a' we're gonna say we want the class dd-header-headline. Let's just see if this is what we want. This is not working for us. I wonder if it is... what was it, an h2? Okay, yeah, so it's close. It doesn't look like this method actually does what we want, right? The find_all thing seems to take a tag, an element type. So this can let us find all the heading twos, but it returns a list of things. Well, this is close, this might be useful. Why don't we call this testy, and then we'll say testy, what's the first element?

This could work for us, but I've also found this select method, which looks like it can get classes specifically, and that's kind of what we want. This looks like we could work with it for now, but if there's another h2 on this page, ever, this will break our code, because we're not narrowing down by the class we want. This is just any heading twos. To make this a little
bit less brittle (and scraping web pages is always brittle, because people change their websites and then stuff breaks, and you have to fix it and figure out what they changed, so there's a bit of reverse engineering), why don't we try using that select on the thing that we want, dd-header-headline. So let's do soup.select, and let's try just the class, in case they change it from an h2 to something else. Now this basically looks like the same thing, but it's gonna be a little bit better because it grabs elements with that class regardless of which tag they are. Whether they're an h2 or a paragraph doesn't matter.

Let's try testy. Let's get the first element again, and it's basically still the same thing, so that's great. So we're gonna use this select thing, and we're gonna say we want the text of this element. And it's a list, so I'm a dumb-dumb. So we'll say testy[0].text, so an actual element. That's looking realistic; this is the text without the tags around it. What we actually want to do here is (oops, not stip) strip whitespace from around it, and now this is looking like what we want. So this is actually one of the pieces of information that we want.

So now we've got to figure out how we got this again. It's this, and then for each of those elements we're gonna grab the text and strip the whitespace off of it. If that doesn't make sense, pause the video and just look over what we've done here, and you'll see it in the code in just a second.

So we'll say soup.select, and we're gonna capture this in, we'll say, tier_headlines, I guess. Naming things is hard, and you can definitely change the name of these things if they turn out to be something else. So then we've got this list. If you remember, this is a list object like this, and it contains all these elements that are actually the ones we want. So what we'll do is a for loop: for tier in tier_headlines. What do we do to it? We stripped it: we got the text and then stripped that text, and
that gave us what we wanted. So don't be afraid to copy and paste here, kids. We'll just print this out for now. Okay, so we'll say print the stripped text of this tier. So this will be the element; this will be each one of those h2s in here, like this. So that'll be one of these, and then we're gonna print the stripped text of that, like here. Let's try this in our REPL and see if it does what we want. This is really just the Pythonized version of what we did.

Okay, this actually looks great, except we've got one straggler in here: support charity. That's annoying. I wonder if this is not an h2, let's see. Yeah, I totally missed that last h2. Whoops, my bad. Okay, so why don't we do for tier in tier_headlines, and why don't we only select h2s. I wonder if this is not an h2. Print tier.text.strip(), whoops, oh, and then we'll do the for tier in tier_headlines. Nope, it's still in there. Huh, support charity. This is annoying. Let's inspect it. Damn, it is definitely a dd-header-headline in a dd-header thing. Is there anything that makes this different from the others? Because then we can use that to isolate the others. Main content. I wonder if it's not in the main content. Oh, I see, so it's like three of these main content divs stacked on top of each other. Huh, is this one main content? Damn, it is a main content one too.

So I am gonna leave this for now. Let's just see how it affects us, and we can worry about it later. I don't want to get hyper-specific by only accessing, let's say, one, two, three. Let's just see what this gives us. We can figure out a way to get rid of it later, once we have a better idea of what this program is gonna look like. A couple of rough edges are fine in a prototype as long as you keep track of them. So in the README we're gonna say (I like to put the TODOs at the top): we're capturing support charity as a title by mistake. Just so we know that this is a trade-off we're making. So this is sort of, kind of, ish, good enough
for right now. We have these things and we can print them out, and we can get rid of this last one later, figure it out.

At this point we need to ask ourselves what specifically we're trying to achieve. We understand that we can access this data now; it's actually possible for us. So let's actually design the data structure we want, and we can do this right in here. We'll just do it as some comments, or I can just comment it out afterwards. So the data that we actually want: what's the goal? We've established that we can connect, we can parse the response, we can access stuff in there, but what's success for this prototype actually gonna look like?

Well, what I think we want, just for the first phase of this project, is a data structure that looks like this. I'm not gonna tie it to a specific data structure yet, because I want you to guess at the data structures in Python that we might use for this: some kind of collection that for each of these tiers has the tier name and price, and then for each of those tiers it'll have, you know, product one, product two, etc., and then the same thing for tier two, right? So this is kind of what we want.

Now, which Python data structures would actually map to this? Well, a list. This is definitely a list of things, right? You've got tier one, tier two. Could it be a dictionary? Do we want these as keys that are then looked up? I think that's not a bad idea. tiers equals a dictionary, and then it would be like tier one, and then we really just want name, price, and products. So I guess we'll say price will be, I don't know, some number of cents, right? So we'll say 500 cents. I suppose we'll do a list of products, right? This will be a list, and that list might contain product objects, but we really just care about the names, right? So that should be fine. So this would be like name one.

So let's think about this data structure for a second, right? What are the common
operations that we're gonna want to run against this data structure? Are they easy to do? Are they intuitive? Are they efficient? Are they gonna prevent us from making silly bugs because this is weird and complicated to access? So let's think about it. Let's make another tier, just because we're gonna have multiple tiers. We'll name this tier two, and this should be fun.

So why don't we paste this into our REPL. Did we name this something yet? tiers, okay. We've got our tiers data structure. Am I missing a... yeah, okay. So now we've got our tiers data structure. Let's try accessing some things. What are we gonna want to do? We are going to want to get a list of tiers, and that would be tiers.keys(). So then we could do, we'll just say, key for key in tiers.keys(). That is not what we wanted at all. This is wrong; I pasted this in wrong. So tiers, where does this close? I forgot to close a tier, didn't I? Correct. So I'm closing tier one here, opening tier two here, and then closing the whole data structure. Okay, let's try this again. There we go. tiers, this looks correct now. We have tier one mapped not to tier two but to a price. Okay, so let's try it again: tiers.keys(). There we go, so here are the two keys that I want.

If we want to make a list from that, we could do a list comprehension. If you're not comfortable with list comprehensions, it's one of the most powerful tools you have in Python. It's basically making a list from some other operation. You can use it a lot like collect functions in functional languages. It's really powerful, but it's still really intuitive because it maps to how we linguistically think about this thing. So I want each key, for each key in tiers.keys(): just give me the key without doing anything to it. You could do some other operations on this. For example, for each key in tiers.keys(), which is just some iterable that we can iterate over and collect stuff out of, if you wanted to see how, like, is
it upcase? Goddamn it. Upcase, downcase, uh, upper? Come on Dave, remember basic Python. upper, it's upper, isn't it? Yes. Okay, so for example, if I wanted to modify this, I could do some operation: running a function on the key, accessing some attribute of the key, whatever. That key can be whatever name you want, for key in tiers.keys(). So this is a for loop, and then you're doing something for each one, and then this returns a new collection that has the modified thing, without actually modifying tiers.keys().

Let me show you how you might actually use that. It's a little bit of a brain twister the first couple of times, but it's super useful and you use it all the time in Python. So this three-line thing, you know, it's fine, it's probably good for readability, but we could actually replace the entire thing with one list comprehension. I'll just show you right under it. We'll say stripped_tier_names equals, and then we'll do our list comprehension: tier.text.strip(), the thing we're gonna do to each one, for tier in tier_headlines. I think we still have this loaded up in our REPL, so let's try it. You can see that this has actually done all of the work of the for loop for us.

So it's occasionally a tighter and more intuitive way to write something where you might have used a for loop before. It's still a for loop, it's still essentially doing the same thing, but in some cases it's a lot more readable, and that's really the test of when to use this: is it more readable? Occasionally, is it more performant? If the answer's no, then don't use this. Don't just use this to be cool, clever, or to save a couple of lines of code. That's almost never worth it; no one's charging you by the line. Well, if you're getting paid by the line, you've got to have a chat with your employer. But I hope that kind of explains it, right? So this is equivalent to this, except that this isn't printing it out. So really the equivalent of this would be even
bulkier. So this would be like new_tiers (oops), this is your empty list that you're initializing, and then you'd say new_tiers.append for each of these things as you go through them, right? Just to show you that that's equivalent: for tier in tier_headlines, new_tiers.append. New_tiers starts as an empty list, so you initialize the list, then for tier in tier_headlines, new_tiers.append, right? So these three lines really are the same thing as that one line. new_tiers is this, and stripped_tier_names is the same thing. Make sense? I just wanted to show you; it's a really, really powerful tool, and in this case we can actually use it and have it be really readable. You can see, oh, we're just going through the tier headlines, we're getting the text and stripping the whitespace for each of those things.

If you wanted to get even crazier, and this is where I start saying readability is affected: tier_headlines is just this statement right here, so you could actually reduce this entire chunk of code to this right here. My stripped_tier_names, which we can really just rename to tier_headlines, because that's what we're actually talking about, is a list of each item that is returned by this function, with the text of that item stripped. You understand how powerful this can be. So many things can be happening, but that's also the downside. So again, if it makes things more readable, more debuggable in the future, then I recommend you use it. If it doesn't, then maybe you don't want to use it. Make sense?

Okay, so I'm actually gonna leave this in two parts, because, you know, I feel like we're just getting started on our beautiful Python journey and I don't want you to get confused the next time you look at this. So I'm gonna put this back, and we're gonna split this again across two lines. It's still better than four or five. And now you have the very basics of how Python list comprehensions work. I hope that's been useful.

Okay, so what are we actually doing with this thing? Well, we have a data structure here.
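To make the list comprehension detour concrete, here is a small sketch of the equivalence described above. The names tier_headlines, new_tiers and stripped_tier_names follow the transcript; the SimpleNamespace objects are hypothetical stand-ins for the parsed elements (anything with a .text attribute), and the headline strings are invented for illustration.

```python
from types import SimpleNamespace

# Stand-ins for the elements a parser would return: each just has a .text
# attribute with whitespace around it (assumed sample values).
tier_headlines = [
    SimpleNamespace(text="\n   Pay $1 or more!   \n"),
    SimpleNamespace(text="\n   Pay $8 or more!   \n"),
]

# The bulkier for-loop version: initialize an empty list, then append.
new_tiers = []
for tier in tier_headlines:
    new_tiers.append(tier.text.strip())

# The list comprehension: same loop, same operation, one line.
stripped_tier_names = [tier.text.strip() for tier in tier_headlines]

print(stripped_tier_names)
print(new_tiers == stripped_tier_names)
```

Both versions build a new list and leave tier_headlines untouched, which is why they can be swapped freely whenever one reads better than the other.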
let's make sure I still have it in here: tiers. Wow, a little bit of a learning detour. I mean, that's the way to learn: when it's actually in front of you and it can be useful. So let's talk about how we want to access this. We definitely want to be able to list the tiers, and then we want to be able to say for tier in... actually, what we'll say is for tier_name, tier_info in tiers.enumerate. What this will do (it's a Python function) is go through and, for each key-value pair, assign the key and the value so that you have access to both. So you can do something like for tier_name, tier_info in tiers.enumerate: we'll just print tier_name, and then we'll print the name of each product. We'll say "priced at", then the price, and then we'll say tier_info price, because now we're talking about the value. So this is our value. Iterating over a dictionary like this is a little weird the first couple of times you do it. Just always imagine which thing we're talking about. So tiers is obviously the whole data structure. For each key-value pair in this set... okay, so now we're talking about a key-value pair: it's got a name, so that's the first bit here, this is what's gonna print, and it's got a value, which is this chunk. The key and the value. And so now tier_info is assigned to this. That's just how this works. I know it's a little confusing at first, but just stay with me; you'll see this working, and it'll become intuitive as you see it over and over. So what we want is the price, so we're doing just another dictionary access, tier_info price, and then we'll get a list of the products. This looks good: one, two, three, four. And then we'll print "products", which announces that they're coming, and then what we can do (this is super basic, just to show you) is print each product. Maybe we'll join those together into a single string, so we can do a comma join. So this is a
string join of tier_info's products, and that should get us kind of what we want. And maybe, just for some visual distance, we'll print a couple of newlines after each one, just so we get some space. Let's see if this works. And we have invalid syntax. Whatever; I'm actually gonna bring this over here, and then we can just copy and paste nicely. So let's see... oh, these get replaced with a single tab. for tier_name, tier_info... print tier_name... You guys are probably seeing the bug and I'm not, which is probably super frustrating for you. Yeah, it's definitely like I'm missing a... oh, there it is. Sorry about that, guys; you must have been freaking out if you'd seen that. Okay, so just a missing paren. Okay, I totally thought this was a Python dictionary thing, enumerate for iterating items. I'm so sorry. Enumerate... Python noob alert, I've only been doing this for nine years. enumerate is actually something we use on lists, not dictionaries. What we actually want is items; in Python 2 this was called iteritems. Just for clarity, I'm pasting the data structure here again, and now we're gonna iterate through it. Yikes, sorry about that. So this is the output we want: we have the tier name, "priced at" blah, and the products are name 1 and name 2; then we've got some newlines, and then we start again for the next product. I think that's reasonable output for what we want, so we're gonna say this data structure works. There are certainly other ways to do that: you could make this a list, a list of dictionaries for example, but I like this. We're gonna use this data structure to store bundle info, so I'm gonna basically comment this out and just leave it here for reference. And if we need to change that, if we realize that there's something we want to access and store from the markup that this doesn't really support, we can just redesign the data structure, change all the places where we access it, and continue on our merry way. I'm gonna comment this out;
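As a rough sketch of the access pattern just described, assuming a dictionary that maps each tier name to a price and a product list: the tier names, prices, and product names below are placeholders, and format_tiers is a hypothetical helper that builds the output as a string instead of printing line by line.

```python
# Hypothetical stand-in for the tiers data structure:
# tier name -> dictionary with a price and a list of product names.
tiers = {
    "Tier 1": {"price": "$1", "products": ["Name 1", "Name 2"]},
    "Tier 2": {"price": "$8", "products": ["Name 3", "Name 4"]},
}


def format_tiers(tiers):
    """Build one block of text per tier by iterating with dict.items()."""
    blocks = []
    for tier_name, tier_info in tiers.items():
        lines = [
            tier_name,
            "priced at " + tier_info["price"],
            "products: " + ", ".join(tier_info["products"]),
        ]
        blocks.append("\n".join(lines))
    # Blank lines between tiers for some visual distance.
    return "\n\n\n".join(blocks)


print(format_tiers(tiers))

# enumerate() numbers the items of any iterable; on a dictionary it
# yields (index, key) pairs, not (key, value) pairs.
for index, tier_name in enumerate(tiers):
    print(index, tier_name)
```

The last loop shows why enumerate was the wrong tool here: it gives a counter and the key, so the value would still need a separate lookup.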
this is gonna be just... we'll name this "common access pattern", for example. Okay, so now really all we need to do is... we have our tier names, right, so that's gonna be this bit. Now we need the price and the product list. So the price is in here... actually, let's take a look at where the products are, and then we can work on extracting the price. So I'm just gonna inspect one of the products. Okay, now we're gonna hover until we see the whole thing. There we go. Okay: dd-image-box-list. I'm just gonna keep this class name, and let's see if the other products are in one of these too, because I think that might be what we want. Yep. Okay, so it looks like each product box has a class of dd-image-box-list, and then the thing we want in there, an actual product... dd-image-holder, that looks good. Just go for the dd-image-holder... I guess it's not the image, sorry, I guess it's the text we want, right? There we go. So it's the dd-image-box-caption. dd-image-box-text, interesting. Callout, subtitle... I don't think we care about the subtitle right now. dd-image-box-caption, that looks like a contender. Okay, so we'll say product names: let's do a soup.select for that, dd-image-box-caption, and let's just see what that gives us. I'm just gonna do this again for ease of reading; I'll select this, and... okay, that's a bunch, but that is a bunch of products. This actually looks like what we want: dd-image-box-caption. So why don't we capture that and just call it testy again, and we'll say testy[0].text, and yes, this looks good. So again, this text attribute is what we want, and then again we want to strip it, so this is the same process as before. Let's just make sure this works with the last element too, that we don't have something weird in here that we accidentally matched. And it looks good; it actually looks really good. Okay, so how did we assign testy again? soup.select dd-image-box-caption. Perfect. So let's say product_names, that's this,
and we'll kind of use the same thing that we used up here, the same kind of structure. So we'll say stripped_product_names (these variables are getting a little long, but that's okay), and we'll just say prod_name.text.strip() for prod_name in... and it's not tier_headlines, it's product_names. So let's try this out and see if this is what we want. product_names, stripped_product_names, let's see... yeah, that looks good, that's what we want. So let's just pick a random one, the sixth one: Cloud Native Python. Ooh, cool. Okay, so we have kind of the data structures that we want: product names, so that's this. And the last thing we need to do is to grab the price. Let's think about grabbing the price for each tier. If you remember, our stripped tier names actually include the price. It looks like every single one of them starts with "Pay"; those are the ones we actually want to match. We're gonna have to clean this up a little bit, so this is where we have to deal with the "Support charity" problem. Let's just do a kind of ugly attempt at this. We will say name for name in stripped_tier_names if name... and what we want is to make sure that it starts with "Pay". Python strings have a startswith method, which is super useful, so: if name.startswith "Pay". That will just grab the ones we want; "Support charity" doesn't start with "Pay". But what we actually want is to split this out for the price, so we're gonna split, which by default splits on whitespace, and then we want the zeroth (first) element of that. So what it's doing is splitting on every space and making a new list; of that list we want the zeroth element. We could remove the dollar sign, but it doesn't really matter; we'll just leave it as a string for right now. There we go. So you can see that what we've actually got is: if we do name.split and then grab the first... or rather the second element, index one (you want index one, yikes), for each name in these stripped tier names, as long as that
starts with "Pay", then we actually get the list that we want. I think this is a little bit ugly. I don't think this is actually a great solution, but it gets us a little bit closer. So I think what we'll do is... we'll definitely have to iterate through this and split this as long as it starts with this. In fact, I think we'll solve this as part of the next step: we're gonna build this data structure, and we'll do it as part of that. So we have our tier headlines, we've got our product names. I wonder if this will help us: why don't we try dd-game-row? So I'm gonna try a little experiment here and see if this targeting gets us a little bit closer. Let's look at what we get from accessing dd-game-row. So this is a little bit ugly, but it might actually let us target specifically what we want, because we're gonna grab this by row, and then for each row we'll be able to pull out the title and then the products for that thing, so that we don't have to hope that we scraped them all in the right order. Not that that would probably be an issue, but you see, here we're kind of doing this separately: we're basically selecting two different things and just hoping that the order of the lists this gives us matches. There's no real reason why it wouldn't, but I like this method a little more. So we're gonna select dd-game-row instead of dd-header-headline. We're still gonna do this text strip thing, but now we're gonna build this data structure, and we're gonna do that in a loop. Instead of tier_headlines, we're just gonna call this tiers, and we'll say for tier in tiers: if it has one of these (just so we don't accidentally match something weird), if this doesn't come back empty, we're gonna do this text strip thing for this. If it has it, then we're gonna say tier_name is this. So now we're back to working with that same headline that we were using here. I'm gonna start here by initializing the tier
data structure; how about tier_dict, set to an empty dictionary. And then we can use this: tier_name is that, and we're gonna set this, so tier_dict[tier_name] will be the actual tier data structure, which we don't have yet. So now we need to do the products. Let's do some comments here: "only for rows with a headline", "grab tier name and price", and now we're gonna have to "grab tier product names". So here's where the products thing will go; I'm just showing you what I'm gonna do here. So we're gonna grab the tier product names by... what did we do? dd-image-box-caption, but not on the whole soup thing, just on this tier that we're working with. So instead of getting all dd-image-box-captions on the page, we're gonna get just those that match inside of this thing we've already selected. So this is us filtering down more and more. These will be our products; I guess we should name it product_names. So we will select the dd-image-box-caption just as we did, but now instead of selecting everything in the whole document, we're just selecting the ones inside of this original game row. And we can still use this, so we can actually just do... where is it... prod_name.text.strip() for prod_name in product_names. We'll reuse this to get all the stripped product names. We can just reassign product_names, so we're reassigning it to a cleaned-up version of all of these product names. And now we can actually build our data structure. So this is a key-value pair, but what we're really going to say is "add one product tier to our data structure". So this tier_dict, which is presumably empty at this point: we're gonna use the tier name as the key, which is the stripped name of this tier, and then we're gonna add the products here, which are actually the product names. So this will be a dictionary again; the dictionary's "products" key will be product_names. Okay, I suppose we could do this on one line; it's still nice
and readable. So this is like one element in this list, with products and price... that's right. You know what, we'll worry about the price later; we can do all that fancy stripping and splitting stuff then. I just don't want to go down the rabbit hole right now. Okay, so we'll have products: product_names. Let's run this. I'm just gonna remove whitespace here, because otherwise the REPL is gonna throw up. Oh, of course, we've changed this, so tiers is now this different selection (there are three game rows), and then for each one of those we're gonna drill down and parse out the separate information. And: "list object has no attribute text". Interesting. Line five, the .text... sure, that's okay. So we're gonna simulate this. So "for tier in tiers": we're gonna simulate it, testy again will be tiers[0], so testy is one of these tiers, and we're gonna do all these operations on it. Does this have a headline element? (tier is going to be testy.) Yes, so it has one of these headlines, so it is one of the ones we're gonna use. So let's do testy.select on this thing; I think that's where the error was. testy.select, boom. Let's see what we get when we select it. Ah, this is a list. It'll only have one element, so that will always be the zeroth element, and that's the one that we pick out and strip the text from. That makes sense: this returns a list because there could be more than one h2 dd-header-headline in here, but we know there won't be, because we're already drilling down into a single product stripe here. This is the element we're dealing with, no longer the whole page. We've already split the tiers, which are each a section like this, and now we're just searching inside of the markup that makes up this chunk of the page. Okay, so where was that? Here: tier.select, and then we're gonna grab the first header headline that we find (it will be the only one) and strip the text from it. And there we go. Again, this will be a list that's returned, so we'll select again the dd-image-box-caption, and then for each of those
we do this text strip thing. So we are iterating over those, grabbing each element out, and assigning it back to product_names. So this will start as this kind of dirty list and then become a clean list as we iterate over it, grab the text, strip the whitespace, and then assign it back to product_names. Then we create the data structure. So why don't we create tier_dict; this will be our kind of final test of this, to see if we have any more bugs. Let's see what happens. Okay, so this worked; well, it didn't have any obvious bugs, anyway. Let's look at tier_dict. This is actually looking like what we want. So let's test it out: we'll say tier_dict.keys, right... so what was the access pattern that we saved before? It looks like, yeah, these are actually the "Pay blah or more to unlock" headlines. That's pretty exciting. What was our access pattern again? "Common access pattern": for tier_name, tier_info in tiers.items. Instead of tiers, we're calling it tier_dict. Why don't we uncomment this and see if it does what we expect. Remove price, right: we didn't set price, but that's totally fine; we will just remove that and run the rest. Come on, baby. All right, yeah, so this is basically what we wanted. I think we have what we need. So I'll just uncomment this and leave it at the bottom, just for reference. In an actual program that I was writing, I would cut out all the stuff that I'm not using, but because I'm gonna push this up to GitHub, and you guys can clone it and take a look at it, I want you to have at least the reference of the stuff that we worked on and cut. Okay, so why don't we actually try running this as a program, now that we've basically done all of our REPL work and tested each chunk of this. Let's make sure that this actually works. So we are going to run python bundle_scraper.py and see what happens. Boom. Beautiful. Okay, this is a pretty good prototype. It's very basic; I mean, you can see that by the time we're done, and we spent like an hour or two on this,
we've really got a working product in approximately ten lines of code, right? And the hardest part of this... it's not that the programming is so complicated or complex. The hardest part was figuring out how to target the content we want, how to get at the information that we want. And that's pretty representative of scraping and data munging: finding out how to get the data, dealing with rate limiting, cleaning the data, accessing the data reliably, building it in a way that isn't gonna break as soon as something small gets updated on the website. Those are the things that are actually difficult, and as you saw, they're the things that are the most time-consuming. If you've never built anything, then you think the hard part is all these programming constructs of the language, but in reality, for most programs you're gonna write, the programming bit itself is simple. The things that are hard are, like you saw, getting the right information, thinking about the problem correctly, building an appropriate data structure, and thinking ahead about what you actually want to see and what you actually want to do. And that's why doing this in a project-based way is so much more helpful than just reading a Python book or doing a Python course, even though that's naturally what you think of when you're like, "oh, I want to learn to program". It's like, no, you don't; you really want to learn how to build useful tools. Learning programming is just a thing that you have to do on the way to that, and for most people it's not even the most difficult or the most interesting part of that journey. So what I'm really hoping for is that this kind of ties things together in your head, so that you can see that there's no magic at any step and that none of it, in isolation, is particularly difficult. It's really just about keeping a clear idea of what you want, staying open to changing your mind or changing how you've implemented something for
it to be better, and then getting to a working prototype as quickly as possible and building on that. I hope that's been fun; I hope that's been helpful. If you're enjoying this, definitely let me know in the comments, and based on the feedback I'll do more of these, and we'll build this up into something more complex. So I hope you've been typing along and experimenting, and I'll see you in the next section, the next phase, the next video. Peace.
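The price step that the video sets aside for later (keep only the headlines that start with "Pay", split on whitespace, take the second token) might look roughly like the sketch below. The sample headlines are assumptions modeled on the "Pay ... or more to unlock" wording mentioned in the video, and extract_price is a hypothetical helper name, not something from the original script.

```python
def extract_price(tier_name):
    """Return the price token from a 'Pay $1 or more ...' headline, else None."""
    if not tier_name.startswith("Pay"):
        return None
    parts = tier_name.split()  # split() with no arguments splits on whitespace
    # Index 0 is "Pay"; index 1 is the price, kept as a string like "$1".
    return parts[1] if len(parts) > 1 else None


# Hypothetical stripped headlines, including one that is not a price tier.
stripped_tier_names = [
    "Pay $1 or more to unlock!",
    "Pay $8 or more to also unlock!",
    "Support charity",
]

# The "ugly attempt" from the video as a filtered list comprehension.
prices = [name.split()[1] for name in stripped_tier_names if name.startswith("Pay")]
print(prices)
```

Leaving the dollar sign in place keeps the value a plain string, which matches the "leave it as a string for right now" decision in the video.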
Info
Channel: tutoriaLinux
Views: 77,391
Keywords: computer, how-to, tutorial, sysadmin, Python, programming, software development, programming project, python project, web scraper, web crawler, python software development, python for beginners, programming project for beginners, python project for beginners, python web crawler tutorial, python tutorial, live coding, treehouse, thenewboston python, thenewboston
Id: 7SWVXPYZLJM
Length: 62min 43sec (3763 seconds)
Published: Sun Oct 29 2017