5 Things You Might Not Be Using in BeautifulSoup

Video Statistics and Information

Captions
Beautiful Soup is the go-to HTML parsing library in Python. It's lightweight, it's simple, it's easy to use, and it's really powerful. When we're web scraping we can just give it HTML pages, let it do its thing, and extract the data that we're actually after. But I bet there are a few things in there that you guys don't use or don't know about, and in this video I want to cover five or so of those things that I think you might find interesting.

So the first one is to use regex when you're searching. If we look at the code that I've got here, I'm importing requests and I'm using import re, which is for regex. The page we're scraping is this one. I've actually got a copy of the HTML here, but I'm just going to keep it like this for now. We can search the same way within the HTML using Beautiful Soup, but instead of passing in the class name we can just do a regex match. This is particularly useful if your classes might have a certain phrase in them that you're after, or if, for example, they have loads of numbers at the end generated automatically by the page and you just want to make sure you get the right ones. So here we are saying for tag, as I'm calling it in this case, in soup.find_all, and we're looking for all the span tags with a class that matches re.compile of the word headline. I'm just going to run that, and we can see that we've got all of these back. If I then go to the HTML code, copy this, and search in here (let's remove that), we can see there's the first one there.

The next thing that is quite useful is that you can actually search by using a list. This is particularly useful for this example, where you might want to get all the headline text off a page. In modern HTML there are many different h tags for the headlines; it could be h1 through h6, I think. But what if you just want all of them? You can just give it a list and it will return all of them. I'm just going to remove the .text.strip() so we can see the full tags when we run this. If we look at the results, we can see the first few that we have returned are the h1, there are some h2s all the way down, and we are still picking up the h3 tags here. Again, this is particularly useful if the data you're after is coming in multiple tags; you can just go ahead and get that specific bit out. This is quite cool, and one that I do use quite a lot.

Another thing when it comes to searching is that we can actually write our own functions to scrape data from the page. I'm just going to zoom out one more so that it all fits on one line. What I'm saying here is that this function will return any tag or element that has the attribute title and the attribute href, but doesn't have a class. This could be quite useful if, say, you're trying to get all the specific links, as in href, where there is no class attached but there is a title. By using this we can dial right in and only get those specific links off the page. So I'm just going to run that, and we can see that we have all of these links that have come back. If I scroll up to the first one, we can see that there is the href and it has a title, but there is no class attached to it. We could change that around as necessary or however we required. Again, really useful for zeroing straight in on the tags that you're actually after.

Now we can combine a few of those things that we've looked at already, and we can use regex when we're searching within our own function as well. So in this example here, which I have one step bigger so it's nice and clear, we can use our own function to search for category links, so links that mention the word category in them. I'll show you what I mean. If I come to here and search for the word category, we can see that a lot of the internal category links for Wikipedia have the wiki category in the href. We can just search using regex to get these links out, and these are more like the internal links. So again, if I run that we'll see them pop up, and we can see every link that I'm returning has the word category in the href. This is particularly useful, again, for really honing in on the links or the tags that you're after.

Another thing we can do is search by using string, and this will just match all the tags where the actual text of the tag is the word data. I'll run that and show you what I mean. So here we have returned just one, and it has data as the text. Again, this could be quite useful if all of the elements or tags or links that you're after match a single text word, and again you can use regex with this too to match it properly and find the ones that you're after. So I thought that was pretty useful. It also works with lists, so we can then find all of the links, in this case, whose text matches one of these three words, and we should get back three links, which is what we were expecting. And the regex example here will find any link whose text matches data, so we should get a few more back, and we can see that we do. For example, if we just look at this one, this one has data mediation, which is why it's matched with our regex but not with our specific string match.

The next one I want to talk about is actually using CSS selectors to dive deeper into the tree, as another way of finding the information that you want. So the first one I have here is that we want to find all of the a tags, which are the links, but within the body tag. If I search the HTML that we're looking at and find the opening body tag here, this one obviously has a corresponding close tag, and by using this method we're going to find only the links that are within this body tag. Again, this would work with any other tags you wanted. So I'll just run that, and we can see that we get all the links back, and the first one is this one here. If I just search for it here, we can see that inside our body the first a tag that we got back is right here. So this is quite useful for quickly finding parts of the information.

I'm going to go and comment that out, and here's another example: we want to find all of the links, the a tags, that are inside p tags. When I say inside the tag, what I mean is, if we scroll down we'll find one: we can see here is the opening of a p tag, and there is a link within that paragraph. That is what I mean by inside. So I'm going to run that, and that will probably be one of our first examples. Here we go, this one here. If I copy this element and we go to our HTML and search, that was the first one here, so we can see that there is the opening p tag and it has a link inside it. So by using this p, the greater-than symbol, and a, we're finding all of the a tags inside paragraph tags. This is particularly useful, and I find it very helpful when looking for specific parts of the data.

There is another thing we can do as well: we can use a very specific selector. Remember, you can get this by going to the inspect element part of the browser, clicking the actual piece of information that you want, and copying the CSS selector. We can see that this one is quite specific, and if I run this we will get just this element back. (I really need to clear my terminal every time.) And that was just that specific element. So that's a really nice way of making sure you get that specific one on that page, which can be useful.

Our last tip is going to be using the SoupStrainer. If you think about it, when we give all of our HTML data to Beautiful Soup, we have to load it into the object and then we have to parse through it. What the SoupStrainer does is that when we actually give the HTML to that object, it will only take part of the information, so it kind of strains it out, hence the name. I have this imported at the top here, after my import of BeautifulSoup. What we can do is write our own little thing here: we're going to say only a tags, so it's only going to let us get the a tags from that HTML. So we put it there, and then in our BeautifulSoup call (I'm actually just going to go ahead and comment this one out so there are no issues) we are saying parse_only equals only a tags, which is what we have here. So now if I print the whole soup without passing any information, we can see that it is all links and links only. This could be useful if you have extremely large pages of HTML and you only want specific bits of information. It will save you memory and it will save you time.

So those are the main ones that I found really useful. Hopefully some of them have been interesting to you. The last one that I've got, just down the bottom here as a bonus, because I think that's what you have to do when you do lists on YouTube, is to use next_element. There's also a previous_element. I've got my soup commented out, so I'm just going to put that back on. What this does is just find the next element after the one your search returned. So we're going to search for the h2 tags, and then we're going to find the next element that comes after each h2 tag. This can be useful if, let's say, you've got a heading tag like I've got here and the next one underneath it has no identifying information; this could be a nice and easy way to get that. So if I just run this, here we go: after every h2 tag we can see that there is this one. I'm just going to copy this one; it's actually one of the ones that we were looking at before too. There it is. So again, there's our h2 and there's our span that we were just looking at. So that can be quite useful, although I don't tend to use it so much; I would use a different method, but it can be useful.

So that's it for this one, guys. Thank you very much for watching. Hopefully you found some value in this, some of these were new to you, and you can then implement them in your own projects and scripts. So thank you for watching, and I will see you in the next one. Goodbye.
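The on-screen code is not part of the captions, so the following is only a minimal sketch of the first tip (searching by class with a regex). The HTML snippet and the class names in it are made up for illustration; only `find_all`, `class_`, and `re.compile` come from the video.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the scraped page.
html = (
    '<h2><span class="mw-headline" id="History">History</span></h2>'
    '<h2><span class="mw-headline-1234">Usage</span></h2>'
    '<span class="reference">[1]</span>'
)

soup = BeautifulSoup(html, "html.parser")

# class_ accepts a compiled pattern; each class value is searched for the
# pattern, so any class containing "headline" qualifies.
headlines = [
    tag.text.strip()
    for tag in soup.find_all("span", class_=re.compile("headline"))
]
print(headlines)
```

The second span shows the case mentioned in the captions, where an auto-generated suffix is attached to the class name.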
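The list search from the second tip could look roughly like this. The markup is an assumption; the point is only that `find_all` accepts a list of tag names and returns matches in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with several heading levels.
html = "<h1>Title</h1><p>Intro</p><h2>Section</h2><h3>Subsection</h3><h4>Detail</h4>"

soup = BeautifulSoup(html, "html.parser")

# A list of names matches any tag whose name is in the list.
headings = soup.find_all(["h1", "h2", "h3"])
heading_names = [tag.name for tag in headings]
print(headings)
```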
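A sketch of the custom-function search described in the captions: a filter that keeps tags with both `title` and `href` but no `class`. The function name and the sample links are hypothetical; passing a function to `find_all` and checking attributes with `has_attr` are standard Beautiful Soup features.

```python
from bs4 import BeautifulSoup

# Hypothetical links: only the first has title + href and no class.
html = (
    '<a href="/wiki/A" title="A">A</a>'
    '<a href="/wiki/B" title="B" class="mw-redirect">B</a>'
    '<a href="/wiki/C">C</a>'
)

soup = BeautifulSoup(html, "html.parser")


def has_title_and_href_but_no_class(tag):
    """Return True for tags with title and href attributes and no class."""
    return (
        tag.has_attr("title")
        and tag.has_attr("href")
        and not tag.has_attr("class")
    )


# find_all calls the function once per tag and keeps those returning True.
matches = soup.find_all(has_title_and_href_but_no_class)
print(matches)
```

Swapping the conditions around (for example, requiring a class but no title) only needs a change to the returned expression.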
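The captions mention combining a custom function with regex to pull out category links. The exact function used in the video is not shown here, so this is one plausible shape for it, with a made-up function name and sample hrefs.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; two of them point at category pages.
html = (
    '<a href="/wiki/Category:Web_scraping">Web scraping</a>'
    '<a href="/wiki/Python">Python</a>'
    '<a href="/wiki/Category:Data">Data</a>'
    '<a name="anchor-without-href">Anchor</a>'
)

soup = BeautifulSoup(html, "html.parser")
category_pattern = re.compile("Category")


def is_category_link(tag):
    """Return True for <a> tags whose href contains the word Category."""
    href = tag.get("href")  # None when the attribute is missing
    return tag.name == "a" and href is not None and bool(category_pattern.search(href))


category_hrefs = [tag["href"] for tag in soup.find_all(is_category_link)]
print(category_hrefs)
```

For this simple case, `soup.find_all("a", href=re.compile("Category"))` gives the same result without a helper function; the function form is useful once more conditions are added.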
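The three string searches described in the captions (exact text, a list of texts, and a regex) can be sketched as follows. The link texts are assumptions, since the three words used in the video are not named in the captions.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; "Data mediation" echoes the example in the captions.
html = (
    '<a href="/wiki/Data">Data</a>'
    '<a href="/wiki/Data_mediation">Data mediation</a>'
    '<a href="/wiki/HTML">HTML</a>'
    '<a href="/wiki/Python">Python</a>'
)

soup = BeautifulSoup(html, "html.parser")

# Exact match: only tags whose text is exactly "Data".
exact = soup.find_all("a", string="Data")

# List: tags whose text equals any item in the list.
any_of = soup.find_all("a", string=["Data", "HTML", "Python"])

# Regex: tags whose text contains "Data" anywhere.
contains = soup.find_all("a", string=re.compile("Data"))

print(len(exact), len(any_of), len(contains))
```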
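A sketch of the CSS selector examples: links anywhere inside body, links that are direct children of a paragraph, and one very specific selector. The markup and the `#footer > a` selector are illustrative stand-ins for the page and the copied selector used in the video.

```python
from bs4 import BeautifulSoup

# Hypothetical page; the /nested link sits inside <b>, not directly in <p>.
html = (
    "<html><head><title>Demo</title></head><body>"
    '<a href="/top">Top</a>'
    '<p>See <a href="/inside">inside</a> and <b><a href="/nested">nested</a></b>.</p>'
    '<div id="footer"><a href="/bottom">Bottom</a></div>'
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <a> anywhere inside <body>.
in_body = [a["href"] for a in soup.select("body a")]

# Child selector: only <a> tags that are direct children of a <p>.
direct_in_p = [a["href"] for a in soup.select("p > a")]

# A specific selector, similar to one copied from the browser's inspector.
specific = soup.select_one("#footer > a")

print(in_body, direct_in_p, specific)
```

Note that `p > a` matches direct children only, which is why the link wrapped in `<b>` is excluded; `p a` would include it.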
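The SoupStrainer tip might look like the sketch below. The variable name `only_a_tags` is a guess at what the captions call "only a tags", and the HTML is made up; `SoupStrainer` and the `parse_only` argument are part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page containing headings, paragraphs, and two links.
html = (
    "<html><body><h1>Title</h1>"
    '<p>Text <a href="/one">one</a></p>'
    '<a href="/two">two</a>'
    "</body></html>"
)

# Keep only <a> tags (and their contents) while parsing.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_a_tags)

links = [a["href"] for a in soup.find_all("a")]
print(soup)
```

One caveat worth knowing: `parse_only` is not supported by the html5lib parser, so this approach needs `html.parser` or lxml.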
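Finally, a sketch of the bonus tip using `next_element`. The markup is an assumption modelled on the h2-then-span structure mentioned in the captions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first h2 wraps a span, the second holds plain text.
html = (
    '<h2><span class="mw-headline">History</span></h2><p>Text</p>'
    "<h2>Plain heading</h2><p>More</p>"
)

soup = BeautifulSoup(html, "html.parser")

# next_element is whatever was parsed immediately after the tag's opening,
# which is often its first child rather than its next sibling.
following = [h2.next_element for h2 in soup.find_all("h2")]

for element in following:
    print(repr(element))
```

Because `next_element` follows parse order, it returns the span inside the first h2 and the text string inside the second one. `previous_element` walks the same order backwards.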
Info
Channel: John Watson Rooney
Views: 3,514
Rating: 4.98 out of 5
Keywords: beautifulsoup tutorial, beautifulsoup python 3, beautifulsoup python web scraping, beautifulsoup python, beautifulsoup and requests, beautifulsoup basics, beautifulsoup css selector, beautifulsoup code example, beautifulsoup example, beautifulsoup extract specific text, beautifulsoup guide, beautifulsoup html, beautifulsoup methods, beautifulsoup navigate html, beautifulsoup regex, beautifulsoup scraping tutorial, web scraping with python and beautifulsoup
Id: 3tUUVenpxbc
Length: 10min 31sec (631 seconds)
Published: Wed Jan 06 2021