Login and Scrape Data with Playwright and Python

Captions
Like Selenium, we can use Playwright to control a browser with our code. It's designed for automation and testing, and in this video I'm going to show you how you can log into a website and pull out some data. So let's go over into our code editor.

The first thing we need to do is install Playwright, so you're going to want to do pip3 install playwright or pip install playwright. Once that's done, you need to do playwright install. This is going to install the browsers for us: there's Chromium, Firefox and WebKit. With installing it this way, we don't have to worry about any Chrome drivers not being in the right path or in the right place, or having issues like that; it does it all for us. It also allows us to change the browser that we're using on the fly.

Once that's done, the first thing we need to talk about is using it synchronously or asynchronously. Playwright can be used with its sync or async API. Now, this is really going to be quite useful and powerful going forward, but for this demo we're going to use the synchronous API just to keep things nice and easy. So I'm going to do from playwright.sync_api import sync_playwright. This is going to give us access to everything that we need.

We're actually going to use a context manager: with sync_playwright() as p. What this is going to do is close our browser when our code is finished. That means we are not going to have anything left open that we didn't mean to, causing massive memory issues. So now we can say that we want a browser, which is our browser instance, equal to p.chromium, and I'm going to use in this case .launch. This is going to give us a browser object that we can work with. Now in here I'm going to put in headless equal to False, and I'm also going to put in slow_mo equal to 50.
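The setup described so far can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: the helper names launch_options and open_browser_briefly are hypothetical and only there for illustration, and Playwright is imported inside the function so the options helper can be used on its own.

```python
# One-time setup in a terminal, as described in the captions:
#   pip install playwright
#   playwright install


def launch_options(show_browser: bool = True, slow_mo_ms: int = 50) -> dict:
    """Build keyword arguments for p.chromium.launch().

    Hypothetical helper: Playwright runs headless by default, so showing
    the browser window means passing headless=False. slow_mo delays each
    Playwright operation by the given number of milliseconds.
    """
    return {"headless": not show_browser, "slow_mo": slow_mo_ms}


def open_browser_briefly() -> None:
    """Launch Chromium and close it again straight away."""
    # Imported here so launch_options() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    # The context manager shuts Playwright down when the block exits,
    # so no browser process is left running by accident.
    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_options())
        browser.close()
```

Calling open_browser_briefly() should make a Chromium window appear and vanish again. Swapping p.chromium for p.firefox or p.webkit is how the browser would be changed.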
Now, by default Playwright will always run headless, so if you want to actually see the browser you need to put headless equal to False in here, and slow_mo 50 is just going to slow it down a tad so we can hopefully see what's going on a bit better.

So now we have our browser instantiated, we need a page object that we can interact with, something that we can tell Playwright to click on links, follow things, type into boxes. So we're going to say page is equal to browser.new_page. From here we can use page.goto to go to certain websites or wherever we want to, so I'm going to do https and we'll just do google.com. I'm going to save it there and run it now, and what we should see is a browser pop up and then just disappear straight away. That's our context manager working: our code finished with page.goto, it went to that page, and then it closed the browser, so there's no chance of it being left open.

The actual website that we're going to be going to is this: it's a demo page for OpenCart. I'm going to click login and it takes us to a nice-looking dashboard. We are going to use Playwright to log in, and then we're going to pull out some of this information. So we want to go to this website with Playwright (I'm just going to copy the URL), and then we want to tell it to type into these boxes. Now, I can see that this already has the username and password typed in. That's not my browser saving it, it's just there on the website, but that's okay, I'll show you how to do it anyway.

Let's put the correct URL in here and save that. To type into a box we're going to use page.fill. This tells Playwright to put whatever information we give it into whatever matches the selector that we give it. So let's go back over here, get our inspect element tool and hover over these inputs. As we can see here, we have this input with a type of text, a name of username and an id of input-username. If I hover over the password one we can see exactly the same: it's an input tag with the id of input-password. So I'm going to use these ids, and we're going to use CSS selectors. I believe that you can use XPath as well if you want to; I've just always used CSS selectors, so that's what we'll use.

Now, it's just an input tag, so let's do page.fill and have our input, and for an id we use a hash, input-username, and we'll do demo. We're going to copy this because it's almost exactly the same: it was input-password, and the password is also demo.

From here we want to actually click on that login button, so we're going to do page.click. Nice and easy syntax. Now we just need to find the actual selector for this button. I'm going to hover over it here, and we have a button with type submit, so I can use a CSS selector on the type. I'm just going to copy that, come back, and it will be button, open square bracket, type equal to submit, close the bracket. This is going to hopefully load this page up, type demo in the username and password, and then click the submit button. So let's run this. We should see it flash up... there we go. So once it clicked on this button, it finished.

So what do we want to do from here? Well, we have a couple of options. Playwright does have a query_selector and a query_selector_all for selecting bits of information using CSS selectors or whatever. But what I'm going to do is the same thing that I used to do with Selenium a lot: I would use Selenium to load the page up, capture parts of the HTML and send that to Beautiful Soup. That's what I'm going to do here, because I think it just makes it a lot easier, especially if you're already familiar with bs4: you don't have the extra worries about new selectors or new ways of getting information out. So we need to pip install bs4 if you haven't done so already, and I'm going to do from bs4 import BeautifulSoup at the top.

So now we have our HTML parser installed, we need a way to actually get all of the HTML on that page, and one of the easiest ways to do that is to use page.inner_html. We need to give this a selector, though, because it's going to pull in just the HTML from inside the selector that we give it. So if we come back, log in and hover over the piece of information that we want (we want to get the total orders), then start going back up the tree a little bit, we get this div with an id of content, which has actually got all the information in it that we're after. If we move that over, we can see this is the content here. So I'm going to copy that and tell it that we want all of the HTML from this selector, and it was an id, so we use the hash: #content. We'll save this into a variable and then print out html. Let's run this. It's going to pop our browser up, and there we go. Let's move this up and see what we've got: a load of information, which is great, and we can see that it's all appearing there.

Now, sometimes what you might find is this loading thing here. If it's not loaded properly for you, or you want to wait for a certain piece of information to appear, the easiest way to do that would be to use page.is_visible, so we can say wait until an element is visible before carrying on. If we come back to this one, let's say that this tile body is important to us, so we could say div.tile-body, and it will wait until that is visible before it gives us the HTML content. I didn't have that issue in this case, but just in case you do, that's one way that you can do it.

We're now going to pass this into Beautiful Soup, so we'll do soup is equal to BeautifulSoup, give it the html, and html.parser will do fine for us. We'll just try printing out soup.find_all on the h2 tags and see if that gives us some elements back, so we know that it's working. We see it logging in and flash back, and down here you can see that we do have a few elements that match the h2 tag, and this one here is the one we were after: the 10.7K, which was the number of orders. If you wanted any of this other information, we now just have a Beautiful Soup object that you can parse and work with.

Now we've seen that the elements are working fine, we'll do total_orders is equal to soup.find, and we're going to use that h2 tag that we just saw, and it was a class, and it was pull-right, and we can have .text on there. Then let's just do print with an f-string: total orders is equal to our total_orders. That should work fine for us. Let's run that and see what we get: logging in, and there's our total orders down there, with part of it missed off in our text, but you get the idea.

So if you want to run this headless, we can remove all of these (I'll remove slow_mo as well), clear this up and run it one more time. We won't see anything pop up; we'll just know that it's working in the background headlessly, and we get our total orders back.

So there we are, we're done. We've logged into a website with Playwright, we've done it headlessly, and we've very quickly pulled out little bits of data after the login. It's not something that you could do very easily without some kind of browser automation, so it's very, very handy to know. This has been the first in a very short mini series which I'm doing on Playwright, but in the meantime, whilst you're waiting for the next one, if you've enjoyed this video you might find this one interesting as well.
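Put together, the walkthrough in the captions might look something like the sketch below. Several things here are assumptions rather than details taken from the video: the captions never spell out the demo site's address, so LOGIN_URL is a placeholder; the selectors are reconstructed from how they were spoken (the hyphens in #input-username, #input-password and div.tile-body are guesses); and the function names are hypothetical. One deliberate swap: the captions mention page.is_visible for waiting, but that call returns a boolean straight away, so this sketch uses page.wait_for_selector, which by default blocks until the element is visible.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Placeholder only: the real demo URL is not given in the captions.
LOGIN_URL = "https://example.com/admin/"


def fetch_dashboard_html(login_url: str = LOGIN_URL, headless: bool = True) -> str:
    """Log in with the demo credentials and return the HTML inside #content."""
    # Imported here so the parsing helper below works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        page.goto(login_url)

        # Selectors are assumptions reconstructed from the spoken captions.
        page.fill("input#input-username", "demo")
        page.fill("input#input-password", "demo")
        page.click("button[type=submit]")

        # Swapped in for page.is_visible: waits until the tile is visible.
        page.wait_for_selector("div.tile-body")

        html = page.inner_html("#content")
        browser.close()
    return html


def extract_total_orders(html: str) -> Optional[str]:
    """Return the text of the first h2.pull-right element, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h2", class_="pull-right")
    return tag.text.strip() if tag else None
```

With a real login URL in place, something like print(f"Total orders: {extract_total_orders(fetch_dashboard_html())}") would run the whole flow headlessly. Keeping the parsing in its own function means it can be tried on saved HTML without opening a browser at all.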
Info
Channel: John Watson Rooney
Views: 7,061
Keywords: playwright, playwright tutorial, playwright python, scrape data behind login, browser automation, python browser automation, playwright automation, playwright scraper, web scraping, web scraping login, john watson rooney
Id: H2-5ecFwHHQ
Length: 10min 22sec (622 seconds)
Published: Sun Nov 14 2021