EASIEST way to do web scraping using Playwright!

Video Statistics and Information

Captions
Hey, how's it going? Today I want to talk to you about web scraping, and I want to show you how easy it is to get started. By the time you're done with this video, you'll have all the fundamentals you need to extract virtually any data you want from any publicly accessible website.

Once you have this skill set, think about the possibilities it opens up. You could automate virtually any kind of searching you do online: maybe you're looking for a job post that matches your specific criteria, maybe you're planning a trip and looking for flights or hotels, maybe you're watching for a price drop on a specific product. What if you could delegate that work to a script that runs automatically, produces the results for you, and maybe even sends you notifications? That's what I'm going to show you in this video.

We're going to use Playwright. Playwright is a web automation tool; a lot of people actually use it for end-to-end testing, which has a lot of the same goals: you write a script to automate working with a website. Those same APIs we'd use for testing can also be used for web scraping.

So let's talk about what we're going to do in this video specifically. I was thinking we could do a flight search. Currently I'm on United Airlines' united.com, and on their homepage there's a small form where you provide the airport you're coming from and the destination, and there's a date picker. We can have Playwright fill out this form and then click "Find flights". Once we get to the results page, you'll see it lists out a bunch of flights and prices, and that's the page we're actually going to extract data from. I have an example output file for you here, results.json; you can see we have an array of flights. Our script can even take screenshots, so if you want actual visual pictures of the browser as you're working with it, our script can do that. This is a snapshot of a results page we extracted this data from, and if you look closely, the first flight here is exactly the same as the first row over there.

Now, before we get started, I need to quickly talk about the biggest blocker to web scraping, which is that a lot of websites nowadays will try to prevent it. For example, if a website thinks you're possibly a bot, it will block your IP address. Or you might have seen things like CAPTCHAs, where you have to select which squares in a grid contain a car and prove you're not a robot. Those problems are kind of hard to work around, and that brings me to the sponsor of this video, Bright Data. Bright Data is a company that revolves around creating solutions to make web scraping significantly easier. They have a massive proxy network, so you don't have to worry about your own personal IP address getting blocked; and I mentioned CAPTCHA solving, they have a tool for that too. There are also a lot of other ways they help with web scraping. For example, maybe you don't want to write code at all: they have data sets that they scrape themselves, which you can simply get from them. So if you want to start from just the data (maybe you're more of a data scientist), you can just get data sets. They also have several scraping solutions.
One interesting one is the Web Scraper IDE. It's sort of like VS Code, but specifically for web scraping, and the nice thing is that you can pull in templates for a lot of popular websites; you can see here Amazon, YouTube, Walmart, Zillow, booking.com. You can start from a template and just manipulate it a little to get exactly what you need, so if you want to get started quickly, that's another option.

For our case specifically, we're going to use what they call the Scraping Browser. The best way I can explain the Scraping Browser is this: if we weren't using Bright Data, we could still use Playwright to launch a browser, connect to it, and do our attempted web scraping. But, like I said, that browser would be running on our local network, on our own machine, which again can be blocked. If we instead connect to a remote browser living inside Bright Data's infrastructure, their massive proxy network, then we're basically protecting ourselves from getting blocked. That's exactly what the Scraping Browser is all about. Go to brightdata.com if you want to try it out; they have free trials. When you log in, you'll see a page that says "View proxy products", where you'll see all the different options you have, but specifically we're going to use the Scraping Browser. When you get to that page, it shows you the credentials you need to connect: the host, username, and password. We'll provide those as environment variables to our script later.

All right, let's start writing our script. In a terminal, I recommend creating a new directory; for example, you can do mkdir scrape. Once you're in there, run npm init. It's going to ask a couple of questions, and you can just hit enter on every single one of them; that should generate a very basic package.json.

We do need to install a few dependencies, just async-retry and playwright, so make sure to npm install those in the terminal. We also need to modify package.json to have "type": "module", which, if you're not familiar, makes us work with ES modules; that just gives us the nicer import/export syntax and things like top-level await. Next, we replace the test script with a start script; all the start script does is run node on index.js. So we need to create an index.js in our file system, and that's where our script is going to live. You'll also notice there's a .env file in here, which is where we provide our Scraping Browser credentials. I already have a .env file, but I created an example to show you what it looks like: you'll have a username, password, and host, and you'll grab those values from the Bright Data website.

In our index.js file, we're going to import Playwright. We'll also construct the URL we'll use to connect to our Scraping Browser, using our username, password, and host. We'll use that in a second, but first let's set up a function; let's just call it main, and this main function is basically where our scraping code is going to live. When we run index.js, we're really just running main.
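For reference, here's a minimal sketch of the scaffold at this point. The dependency versions are illustrative, and the .env variable names are assumptions that just need to match whatever the script reads:

```json
{
  "name": "scrape",
  "type": "module",
  "scripts": {
    "start": "node index.js"
  },
  "dependencies": {
    "async-retry": "^1.3.3",
    "playwright": "^1.40.0"
  }
}
```

And a .env example with the Scraping Browser credentials from the Bright Data dashboard:

```
USERNAME=your-scraping-browser-username
PASSWORD=your-scraping-browser-password
HOST=your-scraping-browser-host
```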
There are a couple of different approaches here. You can do something that just runs main and exits if there's an error. However, a slightly better approach, I think, is to add an automatic retry, which is why we installed async-retry (we'll import it up top as well). So instead, we do await retry, providing main as the function to run, and you can provide a number for how many times you want it to retry if it fails. The purpose of this is just to make our script a little more resilient to failures.

At this point we can actually start connecting to our Scraping Browser. The way to do that is browser = await pw.chromium.connectOverCDP(), and this is where we provide the URL we constructed up here. CDP is the Chrome DevTools Protocol; you can see the docs say it attaches Playwright to an existing browser instance using that protocol. I also personally like adding console logs throughout the script, so that if it fails, I know exactly where it failed; I'll add a console log here saying "connecting to scraping browser". Once we're connected, we can create a new page, and from that page object we can actually start interacting with the browser.

So what I'm going to do is await page.goto() with united.com, the US website. I'm also going to provide a timeout here, because I believe if you don't, it will just try to connect indefinitely; this way we at least allow it to time out if there's a connection problem for some reason. Then I'll add another console log to say we've navigated to that URL and can start scraping.

At this point, I just want to verify that we're actually able to connect to our Scraping Browser in Bright Data and navigate to the United Airlines website, so I'm going to take a screenshot of the page once we get there. The way to do that is await page.screenshot(); you can provide a path option for the file name, I'll just do page.png, and I'm also setting fullPage to true, which makes sure that if the page is really long, it takes a screenshot of the whole thing. Then at the end of our script we close the browser, and that completes our initial script.

On the right, I'll run npm start (remember, we made a script for that in package.json), and there you go: our script finished, and if we go to our file explorer, there should be a page.png there.
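Assembled from the steps above, the initial index.js might look roughly like this. It's a sketch, assuming the .env values are loaded into the environment (for example via node --env-file=.env index.js on newer Node versions) and that the Bright Data endpoint follows the wss://user:pass@host pattern shown in their dashboard:

```js
import pw from 'playwright';
import retry from 'async-retry';

// Scraping Browser credentials from the .env file (variable names are assumptions).
const { USERNAME, PASSWORD, HOST } = process.env;
const url = `wss://${USERNAME}:${PASSWORD}@${HOST}`;

async function main() {
  console.log('Connecting to scraping browser...');
  // Attach Playwright to the remote browser over the Chrome DevTools Protocol.
  const browser = await pw.chromium.connectOverCDP(url);
  const page = await browser.newPage();

  // Give up after 60s instead of waiting on a bad connection indefinitely.
  await page.goto('https://www.united.com/en/us', { timeout: 60_000 });
  console.log('Navigated to united.com');

  // Full-page screenshot so we can verify what the remote browser sees.
  await page.screenshot({ path: 'page.png', fullPage: true });

  await browser.close();
}

// Retry main a few times to make the script more resilient to transient failures.
await retry(main, { retries: 5 });
```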
Zooming in on that page.png, you can see this is the United Airlines website. Now we know we can connect, and we can start actually scraping.

There are a few improvements I want to add right away. I want to be able to take screenshots but also produce a nice log every time I do, so I'm going to introduce a takeScreenshot function that takes in the page and does that same exact screenshot we did before; it just logs and screenshots together. That way we can replace code like this with a takeScreenshot call plus some kind of log message. We're basically going to take a screenshot at multiple steps in our script, so we can get a visual of what the browser currently looks like as we work with it, since we can't see it otherwise (it's remote). I also want to wrap this in a try/catch, where we move our scraping code inside the try after we're able to connect. Within the catch block, I'm going to take a screenshot, so that every time my script fails I get a screenshot of what the page looks like, and then I'll just rethrow the error. Then, finally, I want to close the browser whether or not the script completes.
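Putting those improvements together, the restructured script might look something like this minimal sketch, building on the previous one (the helper name takeScreenshot and the log messages follow the walkthrough; the exact United URL is an assumption):

```js
// Helper that logs a message and refreshes page.png in one step.
async function takeScreenshot(page, message) {
  console.log(message);
  await page.screenshot({ path: 'page.png', fullPage: true });
}

async function main() {
  console.log('Connecting to scraping browser...');
  const browser = await pw.chromium.connectOverCDP(url);
  const page = await browser.newPage();
  try {
    await page.goto('https://www.united.com/en/us', { timeout: 60_000 });
    await takeScreenshot(page, 'Navigated to united.com');
    // ...all the scraping steps that follow go here...
  } catch (err) {
    // Capture what the page looked like at the moment of failure, then rethrow.
    await takeScreenshot(page, 'Script failed, capturing page state');
    throw err;
  } finally {
    // Close the remote browser whether or not the run completed.
    await browser.close();
  }
}
```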
At this point, we have a script that gets to this web page. What we need is to tell Playwright: find this form and actually type into these fields. For example, in the "From" field we want to type something like "New York", which brings up suggested results, and we'll ask Playwright to select JFK (or any of those airports). Similarly, we want to search for, say, Orlando and select MCO. We'll talk about the rest, but fundamentally the first thing we need is to be able to interact with the page. How do we do that?

Before you can interact, you need to be able to locate elements on the page. You can see Playwright has a Locators API, which gives you several built-in locators. For example, if you're trying to find a button, you can do getByRole('button') with the text that's on the button, and once you have that element, you can do actions on it, like click. There's also getByLabel, for when an input has a label; think of it visually: if you're trying to find an input box, you'd probably scan for its label first. It works a lot like Testing Library in React, if you've ever used that. If it doesn't have a label, you can use getByPlaceholder, which gets you the input with that placeholder. And getByText is your most generic way to search for anything visible on the page with the given text.

So on our website, we're trying to select these input boxes. How do we do that? We can see the field has this floating label of "From*", so we could probably use getByLabel('From*') here. You can also inspect it: if I target the input box directly, the source code shows it also has a placeholder with that same text, so getByPlaceholder('From*') works too. Let's give that a shot. In the code, I'll do getByPlaceholder('From*') and fill it with "New York"; fill is basically the same as typing it in. And remember, from the website, once you type, it's going to give us an autocomplete of options.

By the way, a really quick note: this is why the Scraping Browser from Bright Data is so useful. It allows Playwright to engage with the page and actually pull in more data than was previously in the DOM. For example, previously we didn't have the airport options in the DOM; we type into the page and it gives us the autocomplete box we can select items from, and that's only possible because of this integration between the Scraping Browser and Playwright.

Now we need to look into what exactly these options are and how to select one so that I can click it. Let's go ahead and inspect that as well, and we'll see that each option is actually a button. So all we need to do is tell Playwright to find a button that says JFK in it and click it; pretty straightforward. We do await page.getByRole('button'), in this case saying we want the one that says JFK. Again, once you've located the element, you can interact with it using actions; we'll do click.

Next, we do the exact same thing for the "To" field: fill it with Orlando and select MCO. After this we need to select our flight dates, but just to take a pause, let's take a screenshot and log it out to see how our scraper is doing so far. Back in the terminal, npm start, and there you go, it completed; we're at the point where it's about to select dates. Looking at the screenshot, our form now shows JFK and MCO, so it's able to select our origin and destination.

Anyway, back to the browser: we want to click on the date field. We have a couple of different options here. It's actually still an input box, so we could just try to type something like "Jan 6" in there, and that might work; but instead, what I want to do is click on it and then select a specific date from the date picker. Again, we inspect the element, and you can see that each of these dates is actually a td, but it has the role of a button, and it also has an aria-label containing the full date. So I could query something with role button that has, say, "15" in it, although that's a little tricky because there are multiple 15s in there. Instead, we'll query something that has the label, like "January 15" or any other specific date.

Let's go ahead and try that. To target the date input, I'm going to find the one that says "Depart", and I'm just going to click on it instead of filling it out. At this point we know the date picker should be open, and we can do something like getByLabel('January 26'); it's only going to show the current year, so we don't have to provide the full label text here. Let's do January 26 to January 30. That fills in our dates, and once you select the dates, you pretty much have enough to submit; we just need to click "Find flights". If you're following along, you can further extend your script to select one of the other options, or perhaps change travelers; I'm just doing the simplest example here: pick the airports, pick the dates, and then click "Find flights". So at this point I'm going to log and screenshot again to say that we're about to submit.
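Here's a sketch of the form-filling steps just described. The locator strings ('From*', 'To*', 'Depart', and the date labels) are assumptions based on the markup inspected in the video, and may differ if United changes its page:

```js
// Type the origin; the autocomplete suggestions render as buttons.
await page.getByPlaceholder('From*').fill('New York');
await page.getByRole('button', { name: /JFK/ }).click();

// Same approach for the destination.
await page.getByPlaceholder('To*').fill('Orlando');
await page.getByRole('button', { name: /MCO/ }).click();

await takeScreenshot(page, 'Airports selected');

// Click the date field (rather than filling it) to open the date picker,
// then pick each date by its aria-label; the year isn't needed.
await page.getByPlaceholder('Depart').click();
await page.getByLabel('January 26').click();
await page.getByLabel('January 30').click();

await takeScreenshot(page, 'About to submit');
```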
Then we do getByRole('button') with the name "Find flights" and click it, and that triggers the submit.

Now, at this point in the script, we need a way to tell whether we're actually on the results page or not, because when you hit submit, it takes a couple of seconds to actually search for results; you'll see a skeleton page. Back on the website, let's take a look at what happens when I click "Find flights": it goes into the skeleton page, and once it's actually finished loading, you'll see there's this "Depart on" text. That can tell me the page is ready: if "Depart on" is on the page, the results have loaded. So what we can do is await page.getByText('Depart on'). And a quick note: if you actually look at the HTML, there are two elements that say "Depart on", and it doesn't really matter which one we select; however, Playwright will fail when it finds more than one element for the thing you're looking for. So I'll call .first(), which just gets me the first element it finds, which is all I need. I just need to know it's on the page; we're not going to do anything with it, so that's fine. Then waitFor() is a generic way to say "wait for this to show up". This is another good spot for us to log and take a screenshot, but this time we'll say "results loaded".

If our script is working correctly, at this point we should get to the page that actually shows the flights, and then we can work off of that to scrape data. So npm start, and if you want, you can open up that page.png; as the script takes screenshots, it updates automatically, which kind of gives you a window into what the browser currently looks like. You can see it's already filled out our "From" field and the dates, it's loading the results... and there you go, our script completed. It says "results loaded" and took a screenshot, and the screenshot now shows the actual results.

So we're almost at the finish line. We know our script is able to get to this page; we have a screenshot of it. Now we just need to understand how to query this grid of flights, so that we can construct a data set out of it. How do we extract it? We need to do a little bit of inspection. Using the Chrome DevTools selector, let me try to find what this element is. I can see that if I move just a little higher, each flight is wrapped in a div that has a role of "row"; makes a lot of sense. Every single one of these flights is a div with role row. Keep that in mind, because we can use getByRole to query the entire row if we want. Now let's dig a little deeper: once we have the row, how do we get the individual cells? If we take a look inside a single row (it's a bit hard to see), there's a div with a role of "gridcell" that pretty much marks each individual cell within that row, and within those there are more divs with gridcells.

Ultimately we want to be able to construct a data set, so let's create an empty array that we'll fill in as we extract data. We said there are rows, so we'll query all of them with getByRole('row').
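Sketched out, the submit-and-wait step plus the row query might look like this (the .all() call at the end is explained next):

```js
// Submit the search form.
await page.getByRole('button', { name: 'Find flights' }).click();

// "Depart on" appears (twice) once the results have loaded; .first() avoids
// Playwright's multiple-match failure, and waitFor() blocks until it shows up.
await page.getByText('Depart on').first().waitFor();
await takeScreenshot(page, 'Results loaded');

// Every flight in the results grid is a div with role="row".
const rows = await page.getByRole('row').all();
```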
Calling .all() on that tells Playwright to get every single match; it's going to return an array of those divs. At this point, to get our data, we just loop through the rows and then select the grid cells within each one. So we do for (const row of rows), and to keep things simple we'll do a slice here, just the first 10 items. Now, you might be wondering why we start at 1: you'll notice the header row, the one that doesn't actually have the information we want, also has a role of "row", so it's actually the first item our query returns, but we want the data starting at the second row.

At this point we know a row basically represents a flight, so we do const flight, and we'll create an options array in there, which is going to hold the different pricing options for that given flight; that's the data we want to fill in. Similarly, we mentioned that within a row there's a bunch of grid cells, and we'll use those to get each individual item: cells = await row.getByRole('gridcell').all(). Again, this gets all the cells within that particular row; you can kind of think of it as a nested query inside each individual row.

So imagine we now have the cells. To get the data, we loop through them, and this time I actually want the index, so we'll do for (const [index, cell] of cells.entries()). Why do we want to know the index? Because on the site, if you remember, the very first box is the flight details; it's the only one with the airports, the times, and the duration. So if the index is 0, we parse the flight information off of that; otherwise, we parse the pricing information from the other cells. But first we need a way to extract text out of each cell.
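As a sketch, the nested row/cell loop looks something like this (the text extraction on the inner lines is covered next):

```js
const data = [];

// rows[0] is the header row (it also has role="row"), so start at index 1;
// slice(1, 11) keeps just the first 10 flights.
for (const row of rows.slice(1, 11)) {
  const flight = { options: [] };

  // Nested query: the cells inside this row are divs with role="gridcell".
  const cells = await row.getByRole('gridcell').all();

  for (const [index, cell] of cells.entries()) {
    // allInnerTexts() resolves to an array of strings; we only need the first.
    const [cellText] = await cell.allInnerTexts();
    console.log(cellText);
    // index 0 holds the flight details; the later cells hold pricing options.
  }

  data.push(flight);
}
```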
So we do cellText = await cell.allInnerTexts(), and you can see this returns a promise of a string array; I really only want the first item in there. Let's go ahead and log out what cellText looks like, so you can understand how we're going to go about parsing that information into our data set. At this point we need to run our script again, so back to the terminal: npm start.

All right, I fast-forwarded a little here, and you can see it's starting to log out the results in the terminal. You can kind of use this to infer how we're going to extract the data; we just need to know what each of these lines is. So let's think about some intuitions we can get out of this. First of all, cellText is actually just one long string: this entire thing is pretty much one string, it just has a newline character at the end of every piece. So in order to get the individual values, we just need to split the cell text at the places where it goes to a new line; if we turn it into an array, then based on the position of each string, we'll know what data we're working with. Let me show you what that looks like: const lines = cellText.split('\n'), and now we basically have these as an array.

We said the flight information only exists in the first cell; that's why we grabbed the index earlier. So if the index is 0, it's the first one, and we can start filling in our flight details. flight.isNonstop: how do we know if a flight is nonstop? Basically, if the second item here says "Nonstop", so we just check lines[1] === 'Nonstop', which turns into a boolean. Similarly for the other things: lines[2] is the time it departs at, and we probably want the airport it's departing from, which is actually a little further down (the EWR here), at position 6. Same for the arrival information: the arrival time is at position 4, and the arrival airport, the MCO, is at position 10. The duration of the flight is at position 8; that's the 2 hr 59 min. So that covers the first cell; the rest of the cells are pretty simple.
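Inside the cell loop, the first-cell parsing might look like this sketch. The property names are mine, and the line positions come from the logged output in the video, so treat them as fragile assumptions about United's markup:

```js
// cellText is one long string with newline separators; split it into lines.
const lines = cellText.split('\n');

if (index === 0) {
  // The first cell holds the flight details at fixed line positions.
  flight.isNonstop = lines[1] === 'Nonstop';
  flight.departureTime = lines[2];
  flight.arrivalTime = lines[4];
  flight.departureAirport = lines[6]; // e.g. EWR
  flight.duration = lines[8];         // e.g. 2 hr 59 min
  flight.arrivalAirport = lines[10];  // e.g. MCO
}
```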
For those, we just do options.push; and remember, we're running this for each cell. If we take a look at what the rest of the cells look like, the cost is the second item, so to add the cost to each option we just need lines[1]. Then we need to determine what type of fare this really is: for example, there's United basic economy, regular economy, Economy Plus, and business. What we can do is notice that these always show up in a certain order: within our cells, the first one with a cost is basic economy, followed by economy, then economy that's refundable, then Economy Plus, and then business. Having that information, we can just create a map of types; maybe up here I'll add a flightTypes array representing what each of the cells is, and then down here we do type: flightTypes[...], deciding what to record based on the cell's index.

All throughout this, what we've been doing is filling in a flight object that we eventually want to push into our data set, so at the end of the loop we just do data.push(flight). Our expectation is that at the end, data has all of the flights we selected; in this case, just the first 10 items. If you want to, you can do another screenshot of the page here, just to capture the final result. Another thing you can do is write the results to a file, just to complete the example. In Node.js you can just use the fs module (we'll use the promises version of it), and to wrap things up, we write a file, results.json, put the data in there, and console.log that we've completed.

At this point our script should be complete, so let's do one more final run-through: back to the terminal, npm start. All right, we can see in the terminal that the script completed and results.json was created. If we take a look at the file, you should have an array of flights with information on whether it's nonstop, where it departs from, where it arrives, and then the different cost options you have.

All right, so now we have data. Where do we go from here? Really, what happens next is up to you. Once you have data, you can do all sorts of stuff with it. For example, you could take the script, put it into a cron job, and have it run every day or every hour, however often you want, and as it builds up this data set, you can create alerts: for example, maybe "if I see a flight under $200, notify me", if you want to find the best deal. So again, whatever happens next is up to you and your use case. But the fundamental takeaway from this video is that you can do this for any website; it doesn't have to be flights. You just need to connect, navigate, interact with forms (Playwright is really good at that), perform actions, click on things, and once you get to the page you want, write a little bit of JavaScript to parse the data out of that page. That's basically web scraping in a nutshell.

Hopefully you learned a lot from this video; let me know in the comments what you think. Anyway, this wraps up the video. I hope you enjoyed it, and I'll see you in the next one!
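For reference, here's a sketch of that final step from the walkthrough: the pricing-cell branch and the write-out. The fare-type ordering is as described above, but the exact strings in the flightTypes array are my reconstruction:

```js
import fs from 'fs/promises'; // at the top of index.js

// The pricing cells always appear in the same order, so map cell index to fare type.
const flightTypes = [
  'Basic Economy',
  'Economy',
  'Economy (refundable)',
  'Economy Plus',
  'Business',
];

// Inside the cell loop, for every cell after the flight-details cell:
if (index > 0) {
  flight.options.push({
    type: flightTypes[index - 1],
    cost: lines[1], // the price is the second line of the cell's text
  });
}

// After the row loop: write the data set out and finish up.
await fs.writeFile('results.json', JSON.stringify(data, null, 2));
console.log('Completed');
```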
Info
Channel: Marius Espejo
Views: 11,031
Keywords: web scrape, web scraper, python scrapy, create a bot, web crawler, web crawling, bright data, selenium, puppeteer, playwright, scraping browser, captcha solver
Id: VH3gj1J_Ba8
Length: 29min 15sec (1755 seconds)
Published: Thu Nov 16 2023