Python Web Scraping Example: Selenium and Beautiful Soup

Captions
Hello guys, welcome to Python and Machine Learning Daily. Today let's have a demo project related to Python web scraping, which is kind of a shady topic, to be honest. But there are so many jobs around it on Upwork, and since I'm trying to be as practical as possible on this channel, and Upwork is one of my sources for you guys to earn real money, I cannot avoid the topic. If you search for "python scraping" on Upwork, there's a job posted 58 minutes ago, then 1 hour ago, then 2 hours ago, 3 hours ago, four, five, six, and there are many more pages of jobs.

So in this video I want to show one of those scraping jobs, implemented in Python. This is the actual job, a screenshot from my phone as I found it while browsing. The task is to scrape all 140 Y Combinator companies listed on the Y Combinator page. This is the actual page, the startup directory with those companies, and the client wants to scrape them into a database or Excel, but it doesn't really matter how we save them, because we will scrape them into a pandas DataFrame, which you can then store wherever you like.

Now, a disclaimer: scraping may not be legal or ethical. In the ideal scenario you should use the API provided officially by that website or company. Some of those APIs may be public; some may be behind authentication. Scraping a website is a shady tactic: if the website isn't actually providing the data in that format, it may be the case that they don't want you to get that data. And it's not even so much about the scraping itself; what matters is what you do with the data. If you publish it somewhere else under your own name, or something like that, there's a high chance that it's not legal. I actually found a good website listing the legal and ethical considerations, so I will link it in the description below. It discusses copyright, denial of service (DDoS), and other considerations, but it all comes down to: be nice and ask first, or get the data through official channels if possible. Now,
that said, let's go to the technical solution. This is the Python code, and I have two versions of it, with two tools. Selenium is the main tool under the hood, but on top of it you may use another tool called Beautiful Soup to process the data with a slightly different syntax. There are more tools for scraping (a more complex one is called Scrapy, for example), but on this channel I don't want to dive too deep into the scraping topic because, as I said, it's quite edgy and shady; I want to talk about processing the data.

So what are we doing here in Selenium? We import a few libraries; requirements.txt has the full list. Selenium itself, then optionally Beautiful Soup; to fetch pages from the web we need the requests library; and pandas is for saving the data into Excel or whatever kind of sheet you want. We define the URL and prepare the options for our Selenium Chrome driver, which is imported at the top.

Now, the tricky part: a lot of web pages these days are dynamic with JavaScript, like SPAs or other kinds of dynamically formed page elements, so just reading the HTML wouldn't give you the data, because you need to wait a few seconds for the data to load. But tools like Selenium have ways to work around that: for example, you can wait five seconds to get the elements that appear later. The exact implementation depends on the structure of the page, but in this case we open the URL, sleep for five seconds, and then use a for loop to load five pages of data. The page has infinite scrolling, and this is how to work around it (again, just one of the options, for this specific page).

Then we need to process that HTML. With Selenium only, you work with CSS selectors: you find the element like this. It's not a pretty CSS class, because it's probably generated with JavaScript, and there's a big chance this scraping script would stop working after some time if they recompile with other class names. That's
just one of the challenges you get with scraping, especially if you're doing it without consent from the website owner. Anyway, we find the div element that we're interested in and save it into a box, and then inside that box we find the other elements we're interested in, like the company name, location, description, and so on. We save those into variables and append each company to a Python list. Then we close the driver to finish our session, and we use a pandas DataFrame just for more convenient saving to a CSV file: the first list element becomes the columns and all the others become the company data. And this is the actual result, the CSV that has been saved.

If we launch the script, we need to wait quite a while for the Chrome driver to be installed properly. In this video I also kind of skipped the pip install part: from requirements.txt you can install all those libraries with pip. Now, as you can see on the screen, page one, page two, page three are loaded, and the pause is 5 seconds each, so the whole script takes roughly 30 to 40 seconds. Then the process finishes, and this is our resulting CSV, which I showed you a minute ago.

The alternative solution is with Beautiful Soup, which is imported in the other script. This part is almost identical, or in fact it is identical; what differs is the HTML handling. We get the HTML source and create a BeautifulSoup object, which lets us work with the data using syntax like soup.find_all() and element.find_next(). It's pretty similar to plain Selenium, probably a bit shorter and more convenient, with more options specifically for working with HTML. Everything else is pretty much the same: we build the list, we close the driver, we save the data into a CSV.

So, yeah, what do you think about the script? Would you have done anything differently? Let's discuss in the comments below. And what do you think
about web scraping in general? Do you work with similar projects, and do you consider it legal and ethical? What challenges do you see with scraping? Maybe we can discuss together how to overcome them, or what tools to use to process the data. But again, in the future I don't want to dive too much into scraping on this channel; I want to focus on processing and manipulating the data. For example, this line is what interests me: it gets all the elements, then uses a list comprehension and a string join to format the data properly. So if you want more examples of things like these, again with practical projects from Upwork and elsewhere, subscribe to the channel, and see you guys in other videos.
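The Selenium flow described in the video can be sketched roughly like this. The URL points at the real Y Combinator startup directory, but the `a._company` selector is a hypothetical placeholder: YC's class names are generated by the build and change between deploys, which is exactly the fragility mentioned above. Selenium is imported inside the scraping function so the pure DataFrame helper still works without a browser installed.

```python
import time

import pandas as pd

YC_URL = "https://www.ycombinator.com/companies"


def scrape_companies(url=YC_URL, scrolls=5, pause=5):
    """Open the directory, scroll a few times, and collect raw rows.

    Requires Chrome plus a matching chromedriver. Imported lazily so the
    helper below stays usable without Selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(pause)  # let the JavaScript-rendered list appear

        # Infinite scrolling: jump to the bottom repeatedly, pausing so
        # each new batch of companies has time to load.
        for _ in range(scrolls):
            driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"
            )
            time.sleep(pause)

        rows = [["name", "location", "description"]]
        # "a._company" is a hypothetical stand-in for the generated class.
        for box in driver.find_elements(By.CSS_SELECTOR, "a._company"):
            cells = box.find_elements(By.CSS_SELECTOR, "span")
            texts = [c.text.strip() for c in cells[:3]]
            rows.append(texts + [""] * (3 - len(texts)))  # pad short rows
        return rows
    finally:
        driver.quit()  # always end the browser session


def rows_to_dataframe(rows):
    """First list element is the header, the rest is company data."""
    return pd.DataFrame(rows[1:], columns=rows[0])


# Usage (needs a working Chrome install):
# rows_to_dataframe(scrape_companies()).to_csv("companies.csv", index=False)
```

The header-row-first convention mirrors what the video describes: the first list element becomes the DataFrame columns, everything after it becomes data.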
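The Beautiful Soup variant splits fetching from parsing: Selenium still renders the page, but the parsing runs on `driver.page_source`. A sketch of the parsing half, again with hypothetical class names (`_company`, `_coName`) standing in for YC's generated ones:

```python
from bs4 import BeautifulSoup


def parse_companies(html):
    """Extract one [name, location, description] row per company.

    The class names are hypothetical placeholders; on the real page you
    would copy the current generated names from the browser's dev tools.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = [["name", "location", "description"]]
    for box in soup.find_all("a", class_="_company"):
        name = box.find("span", class_="_coName")
        # find_next() walks forward in document order, as in the video
        location = name.find_next("span") if name else None
        description = location.find_next("span") if location else None
        rows.append([
            name.get_text(strip=True) if name else "",
            location.get_text(strip=True) if location else "",
            description.get_text(strip=True) if description else "",
        ])
    return rows
```

With Selenium you would call `parse_companies(driver.page_source)` after the scrolling loop, then hand the rows to pandas exactly as in the Selenium-only version.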
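Finally, a small illustration of the "list comprehension plus string join" line the video calls out, using stand-in objects rather than live Selenium elements (real Selenium elements expose the same `.text` attribute):

```python
from types import SimpleNamespace

# Stand-ins for elements returned by find_elements(); the .text
# attribute mirrors what Selenium elements provide.
tags = [
    SimpleNamespace(text=" B2B "),
    SimpleNamespace(text="SaaS"),
    SimpleNamespace(text=" Fintech"),
]

# Pull the text out of every element, strip stray whitespace, and
# flatten the pieces into one clean CSV-friendly field.
tag_field = ", ".join(tag.text.strip() for tag in tags)
print(tag_field)  # B2B, SaaS, Fintech
```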
Info
Channel: Python ML Daily
Views: 3,513
Id: 4HK5tRy1fVc
Length: 7min 44sec (464 seconds)
Published: Tue Mar 12 2024