How to Scrape JavaScript Websites with Scrapy and Playwright

Captions
All right, so we're going to use our new favorite tool, Playwright, and our old faithful, Scrapy, so we can web scrape those really heavy JavaScript websites. Using scrapy-playwright, we have access to Playwright within the Scrapy request: it uses Playwright to load up the page and then sends the response back to Scrapy, so we still have access to all of the Scrapy features that we know and love.

What you want to make sure you have is your virtual environment set up, then pip install scrapy and scrapy-playwright, and don't forget to run playwright install if this is the first time running Playwright.

Okay, so now we've got everything installed, let's run scrapy startproject to create a new project, and we'll just call this one pw_demo. Let's cd into the directory and use the genspider command, because we're going to be using a generic spider here: scrapy genspider pwspider test.com. I'll just put in test.com because we're going to be changing all of this in just a minute. Now that that's done, we have all of our Scrapy project files set up for us. There are two that we need to change: we're going to need to change some of the settings, and we're also going to need to edit the spider, so let's get these open.

Within the settings we need to add in a couple of Playwright-specific things. I will leave a link in the description to the scrapy-playwright GitHub repo, which has these in it for you to copy and paste, and there are also some other cool examples in there; I'm just going to copy them over from it. We need the download handlers here, and we also need this Twisted reactor here. I'm just going to put a comment above that says "playwright", so should I ever come back to this, I know what I did. That's all we need to do for the settings. I'm going to remove these spider defaults, though, and we're going to use start_requests in Scrapy instead.
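The two settings just mentioned come straight from the scrapy-playwright README; as a config fragment for settings.py, they look like this:

```python
# settings.py -- copied from the scrapy-playwright README

# playwright: route http(s) downloads through the Playwright handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

With only these two settings changed, the rest of the project behaves like any other Scrapy project.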
With self in there, we can then yield out our scrapy.Request. This all looks the same so far; we need to put our URL in, but then we add a couple of extra things. We want meta equal to a dictionary, and in there we put playwright: True. This is basically all you need for a really basic setup: Playwright loads the page at the URL you give it and sends the response back to Scrapy.

I've got the website URL copied over here, which I'm going to grab and paste into our code, and we'll go and have a quick look at the page. This is it here; again, it's a test website, and it's extremely JavaScript-heavy. If you were to inspect the source, you would see the DOM and think, okay, this looks all right, we can work with this. However, with these sorts of websites, what you want to do is make sure you view the page source. If I line-wrap it, you'll see this really important sentence: "We're sorry but this store doesn't work properly without JavaScript enabled." This is why we need to use a browser to load this page, so we can actually get past the JavaScript check and load the elements and all the data we want to scrape.

Let's close this and come back to our code. When we get the response back, we're just going to yield it out in a dictionary: we'll say text: response.text. All this is going to do is have Playwright load the page and hit us back with all of the text it gets, which should include all the HTML that we can access using our Scrapy selectors and all that stuff. To run this, we just do scrapy crawl pwspider, and we'll add an output file, because we're expecting a lot of stuff to come back: -o output.json. I'm going to run this, and you'll see a couple of key things in the log: "launching browser" and "browser chromium launched".
That's a good sign: it means we're actually opening up our Playwright browser, which is headless, and we can see that we're getting information back. One thing I want to show you, which is quite important: once we get past all of the log output, you can see that we're doing a GET request for all of the JavaScript files, which is quite cool to see.

Let's go and open up our output file. Now, this looks like an awful lot of text, and if I come to the website and look at the first product, which is called Oxford Loafers, then search for "oxford" in our output: no results. That's interesting, because weren't we supposed to get back all of the DOM and HTML that we can actually work with? Well, yes. However, if I come back to the page and refresh it, watch what you see. What's happened is that although we've used Playwright, all it's done is load the page up, and as soon as it found some information it sent that response straight back to Scrapy. It hasn't given the page enough time to load the actual content that we're after, which is all of this. This is probably going to be quite common; fortunately, we can get around it. We can use the page coroutines, which give us access to a certain set of actions, one of which is wait_for_selector.

So I'm going to quickly go to the inspect tool, go up to the top of the page, and find a selector that we want to wait for. This div here with the id product_listing has all of the product information in it that we're after. If we tell Playwright to wait until it finds this div before sending the response back, we should get all of our information back into Scrapy. Let's go back to our code and close that out, because we don't want that any more.
We're going to change up our meta here a bit. I'm going to get rid of this and just write the dict keyword, so this all gets turned into a dictionary. Now we need to import the coroutines at the top, so we'll do from scrapy_playwright.page, because it's the page that we're going to be working with, import PageCoroutine. It's these coroutines that let us perform those actions.

Within our dict we still need playwright=True, but we also need to actually keep the page so we can work with it, and to do that we add playwright_include_page=True. Without this we don't have a page object to work with, to tell it what to wait for. Now that we have that, we can set the page coroutines we want to use, lots of Playwright words here: playwright_page_coroutines is equal to a list, and yes, it does include the PageCoroutine, thank you very much VS Code for filling that in for me. Here is where we select the actions we want to perform; like I said, we want wait_for_selector. If you're doing your own project with this, or if you're following along, you can also put a scroll action in here, which is also quite cool, and which I'll show another time. So we say PageCoroutine("wait_for_selector", ...), then a comma, and then the selector; it was a div with an id, so #product_listing. What we're saying is that it will use this page coroutine to wait for this selector before it returns the response.

Because we're using coroutines, this is asynchronous, so we need to put the async keyword in front of our parse function, otherwise it won't work properly. That's important, but we don't need to do anything else here, so this should be good. I'm going to clear this up and we'll do exactly the same thing again, this time outputting to output1.json.
It will do its thing: load the page, hopefully wait for our selector, and then send us back all of the stuff that we want in the response. Now that it's finished, let's go to output1.json and search for the word "oxford", and I can see it there already. We have this card class here with the information in it. So you can see that if we didn't use these page coroutines, we wouldn't actually be able to use actions like wait_for_selector and scroll, which we really do need when we're working with pages like this that load data dynamically once the page has been accessed.

From here, all we need to do is change our yield so we actually get the information that we want, and not a whole load of text. We'll do for product in response.css, and now we need to find the selector that's going to contain all of that, and we'll put our yield inside. Let's go back to the page; we're looking for that card I saw. Here we go: card-body. Let's copy that, and underneath we have our h3 with the name. So we loop over div.card-body, and we'll yield the title: we access product.css, and it was the h3, so we'll try that with the double colon to ask for the text, then .get so we get just the first match. Then we'll do the price, which will be product.css again. Let's go find our selector: if we go down, click on the select tool, and hit the price, we get this label, and it's got nothing really useful on it, but fortunately there's this div with class form-group above it, so we can go there and then find the label really easily. We'll do div.form-group, a space, then label, then ::text and .get again, just like that.

Let's get rid of that, and from here we'll run our scraper again; this time we'll just call the output products.json. We'll hit enter, and we'll see that it's launching our browser again and the page has loaded.
Once the responses have come back to products.json, we have our list of titles and prices for the products that are on that page. Now that you have proper access to the response from the page object, you can do anything you like: items, item loaders, all of the good stuff that we've talked about with Scrapy in my previous videos. I'll just quickly show you that these are the products here, so any of the information from here would be accessible. You could use any of the Scrapy functionality: items, item loaders, saving to databases, pipelines, all of that stuff. All we've done is put Playwright in front of it to access the page and load it up for us, just like I talked about in this video here, which was my introduction to Playwright.
Info
Channel: John Watson Rooney
Views: 3,103
Keywords: web scraping, playwright, scrapy, scrapy playwright, scrape dynamic data, render javascript, scraping javascript, python web scraping, web scraping tutorial, web scraping with python, learn scrapy, python scrapy, web scrapping, john watson rooney
Id: 0wO7K-SoUHM
Length: 11min 12sec (672 seconds)
Published: Sun Nov 28 2021