Selenium Browser Automation in Python

Video Statistics and Information

Captions

What is going on, guys, welcome back. In today's video we're going to learn how to automate website interaction and web scraping using Selenium in Python, so let us get right into it. [Music]

All right, so we're going to talk about browser automation, about automating website interaction using Python and Selenium. The basic idea is that we're going to do all of this interactively. This is not the classic web scraping where we send a request, get a response, analyze the HTML code and find certain tags. This is an interactive approach: we're going to actually open up a browser automatically, we're going to click, scroll and interact with a website, and then we can still extract information. So what we're going to do today is still web scraping, but it's interactive web scraping. This is way more powerful, because oftentimes websites have certain tags or sections that only appear if you scroll, hover or click on something, and not just by sending a simple request. So oftentimes you have to actually open a website and interact with it to see the tags that might be interesting to you, and especially if you want to automate something like a browser game, you need something like Selenium.

So today we're going to talk about the basics by example. We're going to use neural9.com as an example: we're going to go to the books page, click on a book, go to the Amazon page, switch the tab, parse the price and then display the price. A very trivial and artificial example, but it's going to show you the basics of Selenium and how to work with it.

Let's get started by opening up a command line and saying pip install selenium. In addition to that, we're also going to need a webdriver manager to be able to use the Chrome driver in the script; that is the best-practice, state-of-the-art way to do it right now. So we're going to say pip install webdriver-manager, and
once this is done we can go and say from selenium import webdriver; this is the actual browser that we're going to use, the actual web driver. Then we're going to say from selenium.webdriver.chrome.service import Service. This is, again, as I said, the state-of-the-art way. You can of course also just use the Chrome driver that you maybe already have installed and already have on your PATH, without providing anything. This might work; in my case it doesn't work because the version is problematic, but you can try it. You can also provide an absolute path, but the best-practice way right now is to use a service. Then we're also going to say from selenium.webdriver.chrome.options import Options, and then from webdriver_manager.chrome we want to import ChromeDriverManager.

We're going to start by creating a driver instance. So we're going to say driver = webdriver.Chrome, and we're going to pass service here, which is going to be equal to a Service object, and this Service object is going to get ChromeDriverManager().install() as a parameter. So this is how that works.

Also, what I want to do here is add an option, because what happens with Selenium is that you do certain things, you click, you load, you parse something, and once it's done the browser closes. If you like that behavior you can leave it like that; otherwise you want to add an option. So we're going to say options = Options(), and then options.add_experimental_option("detach", True). This will leave the browser open even when everything is completed, so you keep the window open. Of course we need to pass the option here, so we're going to say options=options.

Then we're going to navigate to neural9.com, so driver.get with https://neural9.com, and this should already work. Let's see if that works. Install is missing... oh, this needs to be
constructed, of course. There you go, so now it loads and you can see that it navigates to neural9.com.

Now we can first of all automate maximizing the window. I don't like to use it like that, I want to have it like that, so we're going to tell the driver to automate the maximization process, and we do this by saying driver.maximize_window(). You're going to see that after loading neural9.com it maximizes the window automatically, as you saw.

So the goal here, and this is now not automated, is that we want the script, we want Python, to find this Books link. We want to click on it, go to that page and find the book that is The Python Bible 7 in 1. We want to click on that and go to the Amazon page that we have here. This is a German version, don't be confused; it's just the default here because I'm located in Austria. But essentially you want to find this page and then scrape the paperback price. Again, this is German, but in the English version it would be Paperback, and here we have the price. We want to parse this price and we want to know what the price is, interactively. Again, this is quite a trivial and artificial example. Of course you can do it way easier: you can just immediately load this link and scrape the price, even without Selenium. This is just for the example, so that we learn the basics of Selenium.

What we want to do first is go to neural9.com and find all the links of the website. So we want to find all the a tags, all the anchor tags, and we're going to do this with XPath. We're going to say links = driver.find_elements, and we say here that we want to do this with XPath, and the path is going to be //a[@href]: slash slash, so across the whole page, give me all the a tags that have an href attribute, so a link, basically. Then we can say for link in links: print(link). This is one way that we can do that, and the first thing
we're going to do is quite simple; we're not going to do it the professional XPath way. What happens right now is that it loads the page, and if we go to the Python output here, you can see we have a bunch of elements. We can also say get_attribute with innerHTML; the innerHTML is the text, or whatever is inside of the anchor tag. So we do this again: it loads the page, maximizes it, and when we go to Python you can see we have a bunch of a tags, and inside of those we have a bunch of things as well.

So what we're going to do here, without XPath, just with a simple for loop, which is not the most efficient way to do it, is say: if the string "Books" is inside of link.get_attribute("innerHTML"), we click on this link and break out of the loop. Again, this is not the cleanest version. For the queries after that, so we're going to have two more queries, we're going to do almost everything with XPath, but for now we're going to do it like that. So we just get all the links, and this is the syntax: slash slash means, in the whole page, find all the anchor tags, and then with these square brackets we say this anchor tag has to meet the following condition: it has an attribute href, which is the link. If we find among all these links one that has the string "Books" in it, we click on this link and break the loop.

So if I run this now, and of course don't forget to close these Chrome browsers if you have the detach option active, you can see it clicked automatically. I didn't do anything; it clicked, and now we are at the books page.

What we want to do here is click on this particular link, so either here or on the View on Amazon button. What we can do is inspect the page, and hopefully you can see that without my camera blocking it, but
essentially what we have here is a bunch of div boxes, and you can see here, for example, that we have the div box, the div container, with the class elementor-column-wrap. So this is a class that we can look for: we have a div box with elementor-column-wrap, and inside of that we have all the information. We want to find the link that we can click on, but of course we want to find only the one div tag with elementor-column-wrap where we also have this Python Bible 7 in 1 string. In this case this is an h2, you can see this is a heading 2. So we want to find the div container, which is the, what was it, elementor-column-wrap, where we have an h2, so a heading, that has this Python Bible 7 in 1 text, and then we want to click on the link that is inside of that same box.

This might sound quite complicated, and if you've never worked with XPath, which is not the focus of this video, it might be a little bit confusing, but I'm going to try to explain it as simply as possible. So we're going to get all the book_links here. Actually, we're not going to get all the book links, so maybe that's not the perfect name for the variable, but we're going to leave it like that for now. We're going to say the driver has to find the elements based on XPath, and now we're going to formulate the query that I just explained to you. As I said, we want to find a div box with the respective class that has a heading 2 that has the text 7 in 1 in it, and we want to find the links in that and click on the one link.

So what we're going to do is say: look for a div container that meets the following condition: it contains a class, and the class is called elementor-column-wrap. This is how we filter by class, and this is basically the
condition. So we have the div container, and the div container has to have the class name elementor-column-wrap. If that is the case, we add another condition here with square brackets, and we say we also want this div container to have a descendant, so dot slash slash, anywhere below the div container, we want to have a heading 2, an h2, and the h2 has to meet the following condition: we want the text of this heading to contain, so contains is basically a function itself, the text itself should contain 7 in 1, with capital letters. And we're going to close this here. Let me just make sure that we closed all the brackets, like that.

And then what we want to do, and this might now be a little bit complicated, is also count... or actually, before we do that, let me show you what the result of this is. So we can say for book_link in book_links: print book_link dot innerHTML, sorry, not innerHTML, book_link.get_attribute("innerHTML"), so we can see what this actually gives us. So it clicks on Books and then we get the results here, and you can see we have quite a lot of stuff, way more than we expected.

The problem is that we oftentimes have div containers above these div containers, and they contain all the links. What we're looking for is actually a div container where only two links are present. If we look here, in this div container we should only have two links, whereas the whole page also has multiple div containers that contain all these things, so they also contain this heading and also contain the links and all that. This is problematic, but we want to find the one div container where only two links are present, and this is going to be this link and this link. So what we're going to do now, in the end here, is have another condition, which is going to be the count of dot
slash slash a: the count of all the anchor tags that we have as descendants has to be two, and then from that we want to get all the anchor tags.

So this is a little bit complicated; let me repeat the whole process. And again, don't focus too much on the XPath. XPath is a language that we can use to query XML. You can also use different methods here, but XPath is the go-to way, it's the best-practice way, and it's not too complicated. So I'm going to repeat the process: we have all the div tags that we can find. We want these div tags to contain this class, elementor-column-wrap. They have to have a heading 2 that contains the text 7 in 1, and we want to find all the anchor tags in that div, but the div also has to be a div where only two anchor tags are present.

So we get the whole page, and we have a bunch of div containers that contain the class name elementor-column-wrap. We want to find those where this heading is inside of them. Now, this heading is inside of this box, but this div box is inside of other div boxes that also contain all of this. So if a div box contains these three, for example, we have the problem that we get all these links in here. So we want to find the one div box that only contains two links and the heading, which is going to be this box here. I hope this is not too complicated, but if it is, just ignore it, because XPath is not the focus of this video.

I think we should now have a pretty limited result set, and we can just pick the first one to get the actual link. So when I look into PyCharm... oh, actually we don't want to get the innerHTML, we want to get the href, the actual link. Now let's see if that works. There you go: you can see now we have the link, and if I click on it, this opens up the respective page, so we seem to have found the correct element. Let's close this again and go back here. What we want to do now is just click on that particular book link, so we don't want to get
the href and navigate to it. We just want to get one of those, so we're going to say book_links[0], the first one, and we're just going to click on this one. We can run this again, and you can see it locates that particular link, clicks on it and navigates to Amazon automatically. I didn't do anything; this was all Selenium, and now we are on the Amazon page.

Now, one thing that is tricky here, and I struggled with that a couple of times in the past: even though we are right now on the Amazon page in the browser, if I try to parse something now, if I try to find something, Selenium is still here on the books page. So I need to switch the tab inside of Selenium to make it move to the Amazon page. So we click, and then we say driver.switch_to.window, and we want to switch to driver.window_handles, and here we say 1 for the second tab. So 0 would be the first tab, 1 would be the second tab, and this is how we switch to the tab.

Then we're going to just add a simple time.sleep, which is not the most professional way to synchronize, but we're going to let the page load; for that we need to import the core Python module time. What we're going to do then is look for the element that we want to find. Let me just do this manually now... this was the wrong one. What we want to find is this particular box here. In this case, what is the whole thing here? This should probably also be inside of a div container, right? Or what did I find... there you go, you have a span. So actually we don't want to find a div container, we want to find a link. Inside of that link we have a string, which is the one for paperback. Again, this is German, so don't be confused by that, but in my
case this is the string, the German version of paperback; in the English version I think it's Paperback, so you have to look for Paperback. We want to find the a tag, the anchor tag, that has a span that contains this string for paperback and that also contains a span that contains the euro symbol.

So what we're going to do now is say buttons = driver.find_elements, and again this is an XPath. We say we want to find all the anchor tags, all the links, that meet the following condition: there is a span inside of that anchor tag for which the following condition is matched, namely that the text of the span contains the string for paperback in your respective country or language. This is the condition that has to be met; we need one more square bracket here. And from all these anchor tags we want to find the actual spans that meet the following condition. Notice that this part is not inside of a square bracket; it is not a condition. The condition is inside of this one. So we want to find anchor tags where there is a span that contains the string which represents paperback, and once we have found those anchor tags, we go deeper into them to find the span elements whose text contains the euro symbol, or maybe the dollar or the pound, depending on your country. So we want to find those.

Then we can just say for button in buttons, and we print, from that button, the attribute innerHTML. So this should give us the price, hopefully automatically. It goes to Books, it goes to Amazon, it switches the tab, it hopefully finds the box, and when we go to PyCharm you can see we get the price here as a result. Of course we can also format it: we can say replace the ampersand nbsp
semicolon, so &nbsp;, by a space, and then this is going to work as well.

So this is one example; you can do that with all the examples that you want. Of course, this is not entirely applicable to automating games, so you cannot automate Cookie Clicker like that; this is a little bit more complicated, and maybe I can make a video on this in the future as well. But if you just want to go to websites, you can go to Amazon, you can scrape, interact, click and all that. In this case you can see we just get the price. The task that I showed you in this video can also be done very simply with requests and Beautiful Soup, but still, you can use Selenium to open up a browser and actively tell the browser what to look for, how to do something and what to do: to maximize, to scroll, to click and all that.

If you want to go deeper into that, I recommend going to the documentation, or you can Google certain things you want to do, for example how to scroll with Selenium, how to go to the end of the page with Selenium, how to go page down with Selenium and stuff like that. You have functions for all these things. But this is how you use Selenium in general: you create a driver, a webdriver.Chrome, you go to URLs, you find elements, you interact with them. XPath is your friend in this case. You can also do it with different things: you can find by class name, by ID, by certain other things, but essentially XPath is the most efficient way, and it's quite a simple query language to use. So yeah, this is how you use Selenium.

That's it for today's video. I hope you enjoyed it and I hope you learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below, and of course don't forget to subscribe to this channel and hit the notification bell to not miss a single future video for free. Other than that, thank you so much for watching, see you in the next video, and bye. [Music]
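
The walkthrough described in the captions can be sketched as a single script. This is a reconstruction from the narration, not the code shown on screen, so several details are assumptions: the function names (book_link_xpath, price_xpath, clean_price, run_walkthrough) are hypothetical, the class name "elementor-column-wrap", the link text "Books", the heading fragment "7 IN 1", the label "Paperback", the "$" symbol and the three-second sleep are guesses at what the page and the video use, and By.XPATH stands in for whatever locator argument was typed in the video.

```python
import time


def book_link_xpath(css_class: str, title_fragment: str) -> str:
    """Anchors inside a div that has the class, holds a matching h2,
    and contains exactly two links (which excludes the outer wrapper divs)."""
    return (
        f"//div[contains(@class, '{css_class}')]"
        f"[.//h2[contains(text(), '{title_fragment}')]]"
        "[count(.//a) = 2]//a"
    )


def price_xpath(format_label: str, currency_symbol: str) -> str:
    """Spans showing the currency symbol, inside anchors that also
    contain a span naming the book format."""
    return (
        f"//a[.//span[contains(text(), '{format_label}')]]"
        f"//span[contains(text(), '{currency_symbol}')]"
    )


def clean_price(inner_html: str) -> str:
    """Replace the HTML non-breaking-space entity with a plain space."""
    return inner_html.replace("&nbsp;", " ").strip()


def run_walkthrough() -> None:
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    # Keep the browser window open after the script has finished.
    options.add_experimental_option("detach", True)
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    driver.get("https://neural9.com")
    driver.maximize_window()

    # Step 1: plain loop over all links, click the first one mentioning Books.
    for link in driver.find_elements(By.XPATH, "//a[@href]"):
        if "Books" in (link.get_attribute("innerHTML") or ""):  # assumed text
            link.click()
            break

    # Step 2: click the first link in the box for the chosen book.
    book_links = driver.find_elements(
        By.XPATH, book_link_xpath("elementor-column-wrap", "7 IN 1")
    )
    book_links[0].click()

    # Step 3: Selenium stays on the old tab until told to switch.
    driver.switch_to.window(driver.window_handles[1])
    time.sleep(3)  # crude pause so the new page can load (assumed duration)

    # Step 4: print every price span found for the chosen format.
    for span in driver.find_elements(By.XPATH, price_xpath("Paperback", "$")):
        print(clean_price(span.get_attribute("innerHTML") or ""))
```

Calling run_walkthrough() starts Chrome and needs both selenium and webdriver-manager installed; the three helper functions can be tried on their own without a browser. For the pause before parsing, an explicit wait such as WebDriverWait from selenium.webdriver.support.ui is usually more robust than a fixed sleep, though the video uses the sleep for simplicity.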

Info

Channel: NeuralNine

Views: 167,438

Keywords: python, web scraping, python web scraping, selenium, selenium web scraping, automation, selenium automation, selenium browser automation, python browser automation, selenium website interaction, selenium website automation, automated web scraping

Id: SPM1tm2ZdK4

Length: 21min 37sec (1297 seconds)

Published: Sat Aug 27 2022