Beautifulsoup vs Selenium vs Scrapy - Which tool for web scraping in 2021?

Captions
When it comes to web scraping with Python these days, we have three main options for tools to get the job done. Picking the right one depends on what you're trying to achieve, the website you're scraping, and what you want to do with the data.

The first option is Beautiful Soup. This is a powerful Python library specifically designed for parsing HTML: we can give it a page of HTML and use its functions to find and extract the data we're after. It's very easy to use and has really good documentation, which is why I always recommend it to beginners. However, it's worth noting that it is not a complete web scraping tool, although some people might refer to it as such. For me, a successful web scraper has three main functions: the first is to get the data, the second is to parse the data, and the third is to output the data. Beautiful Soup only fulfils the second part, parsing the information. It is extremely powerful, though, with lots of features that many people don't use or don't know about; my last video was on that subject, so if you're interested, go and check that out. It is really good for structured HTML, and it is powerful and lightweight. For me, it's best for beginners and people learning web scraping and/or Python, for smaller projects where you only need a certain amount of data or don't need complete control, and for simpler web scrapers.

Our second option is Scrapy. Scrapy is a full-featured web scraping framework for Python: it can do everything. It has really good ways to control the data flow: you can create specific items, use spiders to crawl the data out, and use the item pipeline to send it where you want it to go, whether that's a database, a CSV file, or a JSON object. It's also very customizable and powerful, and you can use add-ons to make it even more useful, like incorporating Splash to be able to scrape JavaScript websites. It's quite complex, though, and the downside for me is how hard it can be to learn and how daunting it might be for a new user. The documentation is all there and complete, but it's not necessarily beginner-friendly, and getting your first Scrapy project off the ground can be quite challenging if you're not used to it. I've got a video on that too, where I go from nothing to a complete basic HTML scraper in Scrapy, which you might find interesting. So for me, this is definitely best for advanced users, for full and complete web scrapers that require a lot more control, and also for repeat scrapers, where you might want to run it at a certain interval every day and compare the data. It's worth noting that Scrapy is a complete package, and it's all you need: it controls everything for you.

Our third option is Selenium. A lot of people will refer to Selenium as an actual web scraper, but it's not: it's a purpose-built browser automation tool, designed for testing websites. You can control it with your Python scripts, getting it to perform actions and drive your browser for you. A kind of byproduct of this is being able to web scrape, because we can interact with elements and get their data out. It's not ideal, though, because it's quite slow and resource-heavy, and not that easy to debug if something isn't quite working for you. However, in some cases you do just need to use it. I don't often recommend Selenium, because I usually think there's a better option available, but sometimes you just do need to load up a browser. It's worth noting that even though you're running an actual browser instance, you can still be blocked: websites can still detect that you're a bot, because Selenium sends that information over, and although you can remove it, that's a little more difficult and not quite as straightforward. So this is only really best as a last resort, or when you need to click on something or input something into a field: if you need to type something in, or click on a button and then get a specific piece of data out, this could be what you're after.

So those are the main three, but there are other options, although they follow along similar lines. Requests-HTML uses the Python version of the Puppeteer browser, called Pyppeteer, which is a slimmed-down, lightweight Chrome browser. It will actually load the page up for you when you run the render command, and it will send you the HTML data back. You can then parse that with Requests-HTML itself, or give it to Beautiful Soup to extract whichever piece of information you're after. Again, because we are running an actual browser in the background, albeit a slightly lighter one, we still have to wait for the page, so it is a bit slower and a bit more resource-heavy. I do use this library quite a lot for my own web scrapers, though, because I find it quite powerful.

Another option for rendering JavaScript pages is to use Splash. Splash is created by Scrapinghub, the same people that created Scrapy, and it is basically a purpose-built web scraping browser. You use its API to send it the URL of the website you're trying to scrape; it does the rendering for you and then sends you the HTML back. It is useful and I do use it. It does require Docker to run, but it can be run on servers as well, so it's definitely more of an advanced tool, although getting it to work for simple scrapers is really easy.

If you're just doing some basic HTML table scraping, I would recommend checking out pandas. It has a read_html function: you can give it a URL, and it will look through the page and pull out just the table data for you. This works really well on websites like Wikipedia that keep their data in tables that you just need to extract. You can't really do much more with it than that, but if it's just table data you're after, this could be a good option for you too.

The last option I'm going to talk about is not so much a scraping tool as a method: we can use Requests to simulate any request to the server that the browser makes. Quite often, especially with modern JavaScript websites, when you do something on the page, that information is requested from the server and sent over by an API. We can often find that API endpoint and simulate the request ourselves, bypassing the whole JavaScript part of the website: we don't have to render it, we don't have to wait for it to appear, we can just request the data from the server and get a nice load of JSON back. It's worth noting that this works really well in some cases and not so great in others, but it's always worth looking for, in my opinion.

So that's it for this one. I hope you've enjoyed it and that it's brought some value to you. Leave me a comment down below and like this video if you did. Also consider subscribing: I've got loads of web scraping content on my channel already and more to come, along with more videos like this. Thank you very much for watching, and I will see you in the next one. Goodbye.
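The "parsing" role Beautiful Soup plays can be sketched in a few lines. This is a minimal illustration, not from the video: the HTML snippet, the `product` class names, and the structure are all made up for the example.

```python
# Minimal Beautiful Soup sketch: the HTML and class names are invented
# purely to illustrate find_all() and get_text().
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">4.50</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

products = []
for div in soup.find_all("div", class_="product"):
    products.append({
        "name": div.h2.get_text(),
        "price": float(div.find("span", class_="price").get_text()),
    })

print(products)
```

In a real scraper the HTML string would come from the "get the data" step (e.g. a `requests.get()` call), which is exactly the part Beautiful Soup does not do for you.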
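A Scrapy spider that yields items, as described above, might look like the following sketch. It targets `quotes.toscrape.com`, Scrapinghub's public practice site; the selectors are assumptions based on that site's markup, and the spider would normally be run with `scrapy crawl quotes` inside a Scrapy project, with the item pipeline handling output.

```python
# Hedged sketch of a basic Scrapy spider; run via `scrapy crawl quotes`
# from within a Scrapy project, or `scrapy runspider` on this file.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each yielded dict flows through Scrapy's item pipeline,
        # which can send it to a database, CSV file, or JSON output.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```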
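The "load a browser, click or type, then grab the data" use of Selenium might be sketched like this. The URL and the `h1` selector are placeholders, and the sketch assumes Selenium 4 with a Chrome driver available on the machine; it defines a function rather than running anything on import.

```python
# Hedged Selenium sketch: placeholder URL/selector, assumes Selenium 4
# and a matching chromedriver installed.
from selenium import webdriver
from selenium.webdriver.common.by import By


def scrape_heading(url):
    """Load a page in a real (headless) Chrome instance and return one element's text."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible window; still a full browser
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Because this is a real browser, we could also click buttons or
        # fill in form fields here before extracting data.
        return driver.find_element(By.TAG_NAME, "h1").text
    finally:
        driver.quit()  # always release the browser process
```

Spinning up a whole Chrome process per page is what makes this slow and resource-heavy compared with a plain HTTP request, which is why it's best kept as a last resort.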
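The pandas table-scraping option is a one-liner. This sketch parses an inline HTML table instead of a live Wikipedia URL so it runs offline; the country figures are invented for illustration. In practice you would pass the page URL straight to `pd.read_html` (it needs lxml or html5lib installed as a parser backend).

```python
# pandas read_html sketch: inline HTML with made-up numbers, instead of
# a live URL, so the example runs without network access.
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Iceland</td><td>372000</td></tr>
  <tr><td>Malta</td><td>519000</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table found on the page.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

On a page like a Wikipedia article, `tables` would contain every table on the page, and you'd pick the one you want by index.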
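The last option, hitting the site's own API endpoint with Requests, might be sketched as below. The endpoint URL is a placeholder: in practice you find the real one by watching the browser's network traffic (e.g. the Network tab in devtools) while the page loads its data.

```python
# Hedged sketch: API_URL is a placeholder for an endpoint you'd discover
# in the browser's devtools Network tab.
import requests

API_URL = "https://example.com/api/products"  # hypothetical endpoint


def fetch_json(url, params=None):
    """Request the site's JSON API directly, bypassing the JavaScript front end."""
    headers = {
        # Mimic a normal browser request; some endpoints reject default clients.
        "User-Agent": "Mozilla/5.0",
        "Accept": "application/json",
    }
    resp = requests.get(url, params=params, headers=headers, timeout=10)
    resp.raise_for_status()  # fail loudly on 4xx/5xx
    return resp.json()
```

Because nothing is rendered, this is by far the fastest approach when it works; the trade-off is that the endpoint is undocumented and can change or be protected at any time.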
Info
Channel: John Watson Rooney
Views: 14,033
Rating: 4.9886847 out of 5
Keywords: Beautifulsoup vs Selenium vs Scrapy, web scraping with python, python web scraping, python scraping data, web scraping python selenium beautifulsoup, python for web scraping, best tool for web scraping, beautifulsoup vs selenium, beautifulsoup or scrapy
Id: J82SxHP5SWY
Length: 6min 53sec (413 seconds)
Published: Wed Jan 13 2021