Beautiful Soup 4 Tutorial #1 - Web Scraping With Python

Video Statistics and Information

Captions
Hello everybody, and welcome to a brand new tutorial series on this channel, which is on Beautiful Soup 4. Beautiful Soup 4 is a web scraping and HTML parsing module, so it allows you to extract information from HTML documents and to modify HTML documents as well. You could use this for web scraping, or you could use it to read in, say, an HTML file, modify it programmatically using Python code, and then create a new HTML file that has those modifications in it. It's very versatile and there's a ton of stuff to show you, but in this first video I'll just be giving you an introduction to how it works: showing you how to read in a local file, showing you how to read in HTML from the web, and then giving you a brief walkthrough of how Beautiful Soup works and some of the most common functionality that you're going to want to know. In the very last video of this series I will show you how to write a relatively automated web scraping program that goes and looks for prices of graphics cards. I know a lot of people are looking for graphics cards right now, so I thought that would be an interesting application to conclude everything with. Anyways, I hope you guys are excited. If you are, make sure to leave a like, subscribe to the channel, and let me know in the comments anything you want to see in this series. Let's go ahead and get started. [Music] All right, so in front of me I have the Beautiful Soup 4.9 documentation. I'll leave a link to this in the description in case you'd like to read it yourself. Pretty much everything I'm going to show you here is coming directly from this documentation page; I've just summarized it and grabbed what I figured was the most important stuff. Anyways, if you want to see all of the functionality, and you can see there is quite a bit of it since this is quite a long document, then you can click the link in the
description. All right, so the first thing we need to do when we start working with Beautiful Soup is install it. What we need to install is the Python package, which comes from pip. If you're on Windows, open up Command Prompt; if you're on Mac or Linux, open up your terminal, and then type the following: pip install beautifulsoup4, like that. I think I spelt that correctly. So you're going to pip install beautifulsoup4, and that should install the package for you. Now, if for some reason this command does not work for you, try pip3 install beautifulsoup4. If that doesn't work for you, try python -m pip install beautifulsoup4, and if that doesn't work, try python3 -m pip install beautifulsoup4. Those are the different combinations you can try. If none of those work, I do have some videos, which I'll leave in the description, that show you how to fix your pip. Anyways, at this point I'm going to assume that you've installed that Python package. I'm using Python version 3.8, I believe, right now, but you can do this in pretty much any version and it should work the same. All right, so now that we've got that installed, we can start writing our Python code. I am currently in Sublime Text. You can use any editor that you like; this is just the one that I prefer for these types of videos. What I'm going to do is start by importing: from bs4 import BeautifulSoup, like that. This is what you need to do to get started. After this I'm going to show you how to read in an HTML file and then modify that file, and later in the video I will show you how to read in a web page. So if you want to read an HTML file, first of all you need an HTML file. I have this dummy HTML file here. I'll leave a link in the description to a GitHub repository that has all the code that I write here, including this document, so you can grab it from there if you want, but
this is just a dummy HTML file. Okay, so this is in the same directory as my web scraping .py file. Make sure it's in the same directory, otherwise it's going to be a bit of a headache. What you're going to do is open this file and then use Beautiful Soup to read it. So we're going to say with open, and then this is going to be "index.html", and then "r", because we're opening this in read mode, and I'm just going to call this f, standing for file. Okay, so with open("index.html", "r") as f, and then what I'm going to do is say doc is equal to BeautifulSoup, put f as the document that I want to read in, and then pass "html.parser". There are a few other parsers you can use here. I'm not really going to talk about what those are, but since this is an HTML document we want to parse it as an HTML document, so we write "html.parser"; this is an accepted parser name for the Beautiful Soup module. Okay, then I'm going to show you what this looks like as a Python object, so I'm going to print out the doc and run my code, and you can see that we get the HTML document like that. That's as easy as it is to read in an HTML file that is local on your machine. Now I'm going to show you one cool thing that you can do here. Because your HTML is usually going to be all jumbled together, it's better to prettify it before you print it out. If you print doc.prettify(), this gives you all the indentation, and you can see this is a lot nicer and much easier to read. Okay, so that is how you read in a document. Now I'm going to show you a few pieces of the functionality. Usually when you read an HTML file like this you want to search for a specific thing. If we're going with the example I
mentioned at the beginning of the video, maybe you're looking for the price of something, maybe you're looking for the name, maybe you're looking for a table. You're usually searching for some type of information, so you need to be able to find that in the document. The first thing I'm going to show you is how you can find things by the tag name. Actually, I should go here and just print this out again. So obviously we have the head tag, the html tag, the center tag, all of these things. It's really easy to find a tag with a specific name in Beautiful Soup. What you can do is doc, dot, and then the tag name, and this will give you access to the first tag that has this name in the document. So just bear with me for a second here: I'm going to say tag is equal to doc.title, and then I can print out the tag, and you'll see that this is the title tag, right? So if you want to access a specific tag, just put the name. Now obviously if there are multiple tags with the same name, it's only going to give you the first one. I'll show you how to get all of them in a second. Okay, so now that we have the tag, what if I just want to access what's inside of it? Well, to access the string that is being held inside of a tag, you can use .string. So I can say tag.string, and notice it gives me "Your Title Here". Now one of the cool things about this is that I can also modify these tags. I can do something like tag.string is equal to "hello", and now if I print out my tag, notice that it has modified this in place and changed it to hello. I can also show you that when I print the entire document again, so print(doc), we don't need to prettify it, and we find the title, it has changed in the document. So the same way that you can access things, you can change them. Pretty straightforward. Now there are a lot of other things you can change as well. I'll
show you those in later videos, but those are the basics: that's how you access a tag and then how you get the string within the tag. Okay, now what else can we do here? Well, we need to be able to find tags that aren't just the first ones that occur in the document. In order to do that, you can say doc.find and then put the tag name. So if I put the tag a, for example, this will give me a link, but again this is only going to give me the first a tag that occurs. What you can do instead is find_all, and if I now print tags, you'll see this will give me all of the a tags in the document. Actually, I'm going to go with p, because I don't know if there are multiple a tags here. When I do this, notice I get all of the p tags being printed out right here, and it also shows me what's inside of these p tags. Okay, so that is how you can get that. As you probably noticed, these p tags have things inside of them, right? Like this p tag has another tag inside of it. So I'm going to show you now how you can access the nested tags. This is the exact same way that you would access the tags from your regular document, but now you're going to do it on an existing tag. This is pretty straightforward, but let's just have a look here. Let's say we want to access the very first tag, so tag 0. In fact, let's just put a [0] right here. Let me print this out and see what we get. Maybe I want to access the b tag, or all of the bold tags. Well, if I want to do that, I can say tags.find_all, just like I did on my whole document, and then pass "b". When I do this, it gives me all of the different b tags, right? And then, same thing within here, I could go and access the text of these b tags, or I could
go and access maybe the italics tag or whatever I want. That's how you can search through and parse the document, and again, I'll do an entire video on how you can find stuff in more detail. So we will continue in one second, but I need to quickly thank the sponsor of this video and this series, which is AlgoExpert. AlgoExpert is the best platform to use when preparing for your software engineering coding interviews. They have over 160 coding interview practice questions on the platform, taught by the best instructors, one of which is me. If you want to prepare for your technical coding interviews, make sure to check out AlgoExpert today by clicking the link in the description and using the code techwithtim for a discount on the platform. All right, so now that I've shown you how to read in an HTML file from your local system (remember, I have this file here in the same directory), I'm going to show you how you can read in HTML from a website. What I'm going to do is go to my command prompt here, and in the same way that we installed Beautiful Soup, we are now going to install requests. Follow the same format of installing as I showed you previously, but with pip install requests. You'll notice I already have this installed, but you guys most likely won't. Now we can access a website. The website that I want to access is Newegg, and as I mentioned, we're going to be looking for GPU prices later in the video series, but for now let's say I just want to check the price of a specific GPU. So I'm going to steal this link right here. This is for a 3080, and this is the price; I'm going to show you how we can find and access this price. Okay, so I need to leave this import and also import requests. Now I'm going to say that url is equal to the URL of whatever website I want to access, and then I'm going to send a request. So I'm going to say my result is equal to
requests.get, and I'm just going to put my url in it like this. Super simple. All this is doing is sending an HTTP GET request to this URL. It's going to return the content of the page, and the content of the page will be stored in result.text. So if I print this and run my code, notice we get a bunch of gibberish here, but we are actually getting an HTML document. Okay, now to prove this to you, I'm going to read in result.text using Beautiful Soup. I just need to jump in here for one second and quickly mention that the URL we're using here does allow us to grab its HTML from a script. There are a lot of websites, and Amazon is one of them that I tried and failed with, that have bot protection built in and don't let you grab the HTML of a page by just doing what I'm doing right here. This is a super simple way; we're just sending a GET request from a Python script. Websites can detect that you're using a script, and they'll try to actively block you. There's also some policy and legal related stuff when it comes to scraping websites, so just make sure you're not spamming requests on any websites, or DoSing or DDoSing anyone, or something like that. What we're doing here is most likely fine, but I just want to mention that there are a lot of websites this won't work for, and if it doesn't work for them, I'm not necessarily going to show you how to get around the anti-robot stuff. Regardless, let's continue the video. So I'm going to say that doc is equal to BeautifulSoup(result.text, ...), and in the same way as before we want to use "html.parser", and then I'm going to print out doc.prettify(). Okay, so let's run this. I'll go back to that code in a second in case I went too fast for you. Now notice it's obviously quite long, but we are getting the HTML document. Perfect. We can see all the div tags and everything like that. So now what I want to do is find the price of
this GPU. Let me go back to the website right here, and notice that this is what the price looks like. Now I'm going to assume that I don't know what the actual figure is. I don't know that it's two thousand six hundred dollars, and I just want to look for the dollar sign and then find the price after it. Doing that is actually pretty easy. Let's make a new variable and say prices is equal to doc.find_all, but this time we're not looking for a specific tag, we're looking for some text. The text I'm looking for is a dollar sign, so I'm going to say text is equal to "$", like that, and then I'm just going to print out prices and show you what we get. Run this, and notice we get two dollar signs. Now that's not very helpful. Obviously we want the actual price, not just the dollar sign, but these dollar signs allow us to access what the price is, and the way we can do that is by using this thing called a parent. The way that Beautiful Soup is set up is that everything is in a tree-like structure. When you read in the document, the html tag is the first, I don't even know what to call it, branch of the tree if you want to call it that, or the root of the tree, and then there are all kinds of tags inside of the html tag, right? So if I have html here, I have a head tag inside of the html tag, and inside of the head tag I have the title tag. We have this tree-like structure where the descendants of html are the head tag and the body tag, and a descendant of the head tag is the title tag. These tags also have a parent: the title tag's parent is the head tag, and the head tag's parent is the html tag. Pretty straightforward, but it just works as a general tree structure. So for what we've accessed here, let me just write some pseudo-HTML: imagine we have a p tag, okay, and another p tag, and then we have our dollar sign and
2613, whatever it is. We've accessed this single dollar sign right here. So if I access this dollar sign and I want the entire price, what I want is the parent of this dollar sign, because this, just like everything else, is a descendant of whatever its parent is. If I access the parent, this will give me the contents of the entire tag that this dollar sign is in, and then I can search it for the 2613. Hopefully that makes sense; that's the best way that I can explain it to you. So anyways, we have these prices. What I'm going to do now is say prices[0].parent, and let's just print out what this is. So let's say parent is equal to that, print the parent, and run this. Notice we get this large tag here, right? We have this list item tag, then we have the price-current-label, then we have strong, and then we have what the actual price is. What I want here is what's inside of this strong tag; I want the actual value. So I'm going to search for the strong tag within the parent tag, and then look at the contents of the strong tag. I'm going to say strong is equal to parent.find("strong"), and then I will print strong, like that. Now when I do this, notice that we get the strong tag containing two thousand six hundred and thirteen. Perfect. Now I just want to get the two thousand six hundred and thirteen, so what do I do? I use my .string, and I get two thousand six hundred and thirteen. All right, so with that I am going to end the video here. I just wanted to give you a quick introduction to how this module works. In later videos I will show you more advanced stuff and all the other features that you need to know. In my opinion this is a pretty cool thing and really easy to use. I hope you guys enjoyed the video. If you did, make sure to leave a like, subscribe to the channel, and I will see you in another one. [Music]
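The steps walked through in the captions (parsing, access by tag name, editing .string, find and find_all, nested searches, and going from the dollar sign to its parent) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code shown in the video: the inline SAMPLE_HTML string is a hypothetical stand-in for both index.html and the fetched product page, its markup is invented for illustration, and it passes string= where the video says text= (the same filter under its newer name, assuming Beautiful Soup 4.4 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for index.html and for result.text.
SAMPLE_HTML = """
<html>
  <head><title>Your Title Here</title></head>
  <body>
    <p>First paragraph with <b>bold text</b> and <i>italics</i>.</p>
    <p>Second paragraph with <b>more bold</b>.</p>
    <ul>
      <li class="price-current">$<strong>2,613</strong><sup>.00</sup></li>
    </ul>
  </body>
</html>
"""

# Parse the markup; a file object opened in read mode could be passed instead.
doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# prettify() returns the document as an indented string.
pretty = doc.prettify()

# doc.<tag name> returns the first tag with that name.
tag = doc.title
original_title = tag.string

# Assigning to .string modifies the parsed document in place.
tag.string = "hello"

# find() returns the first match, find_all() returns every match.
first_paragraph = doc.find("p")
paragraphs = doc.find_all("p")

# The same search methods work on an existing tag (nested search).
bold_in_first = paragraphs[0].find_all("b")

# Search for a text node instead of a tag name, then walk up to its parent.
prices = doc.find_all(string="$")
parent = prices[0].parent

# Look inside the parent for the tag that holds the actual figure.
strong = parent.find("strong")
price_text = strong.string

print(tag)
print(price_text)
```

To try this against a live page, the sample string would be replaced by result.text from requests.get(url), subject to the bot-protection caveats mentioned in the captions.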
Info
Channel: Tech With Tim
Views: 36,117
Rating: 4.9842253 out of 5
Keywords: tech with tim, beautiful soup 4, web scraping, web scraping with python, python web scraping, web scraping python, web scraping python beautifulsoup, python, html, html files, tag name, parsing website html, locating text, tree structure, beautiful soup tree structure, beautiful, soup, beautiful soup 4 python tutorial, beautiful soup module in python, beautiful soup 4 python, html parsing python, python web scraping tutorial, web scraping using python, python tree structure code
Id: gRLHr664tXA
Length: 17min 0sec (1020 seconds)
Published: Fri Sep 03 2021