Web Scraping Google Maps Using Playwright Python And OpenAI API

Video Statistics and Information

Captions
Hey guys. I think this is going to be a pretty interesting tutorial. In this video we're going to learn how to use Playwright to scrape Google Maps in Python. Google Maps is one of the most difficult websites to scrape because there's no organized structure in the results. For instance, if I search for restaurants in Chicago, I get a list of results, and if I inspect the element behind each result block, we can see that each element here represents one record in the list. The reason Google Maps is such a difficult website to scrape is that every attribute is a key, and Google Maps constantly changes those keys, so you can't have a single script that will always work; a script may work today but stop working tomorrow.

To build the scraper I'm going to break the process into two steps. Step one is to download the results list into an HTML page. When we save the HTML page, we're only going to save the results panel, not the entire page, and that way we still retain the record for each result item in the list. Step two is to parse the information and convert the items into a JSON object. Because the attribute IDs keep getting updated, I'll be using OpenAI's API to parse the data into a table, JSON records, or any data format you're looking for, and you can use any generative AI to perform this task.

First, launch a terminal and install Playwright by typing pip install playwright. Once Playwright is installed, we need to download the browser drivers, which we can do by running the command playwright install. Now create a blank Python script. For the import statements I'm going to import time, and from playwright.sync_api I'm going to import sync_playwright. Then I'll define my variables: a base URL, which points to Google Maps, and my search query; in this case I want to search for bakeries in Chicago, Illinois.
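A minimal sketch of this setup; the variable names are my own, and the values follow the walkthrough:

```python
# Run once in a terminal:
#   pip install playwright
#   playwright install

import time
from playwright.sync_api import sync_playwright

# Variable names are my own; the values follow the walkthrough.
BASE_URL = "https://www.google.com/maps"
SEARCH_QUERY = "bakeries in Chicago, IL"
```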
All right, so if we look at this code block: I'm creating a Playwright instance, then a Chromium browser instance (you can use a different browser like WebKit or Firefox), and make sure you set the headless option to False so the browser window is displayed. Then I create a context instance; a context is basically an isolated environment to store cookies and session data. From the context we create a new tab using the new_page method, and I name the object page, so page represents a tab. Then we navigate to the Google Maps website and wait until the page is fully loaded. Let me run this code block to launch the web page, and it takes us to the Google Maps homepage.

Now I want to type something into the search box, which is this input field. Here I provide the XPath pointing to the input field, enter my query using the fill method, and then press Enter; I'll run these three lines to populate the input.

To locate the element containing the results, I've already defined the XPath. If I take this XPath and, back in Firefox, paste it into the inspector's search field and hit Enter, it navigates to this element, and if we look at the attributes it has role="feed". Under this element are the child elements, one for each restaurant in this case, so this is the parent element and these are the child elements. What I'm doing here is defining the XPath to locate that list element; then, using wait_for_selector, I can check whether the element exists. Next I create a reference to the list element using page.query_selector, passing the XPath, and I set the focus to the target element using scroll_into_view_if_needed.

Now, to scroll the page we can press the space key to move to the next page of results. Here I insert a while loop with a flag indicating that I want to keep scrolling, and while that condition is true I keep pressing the space key. Because I don't want Google Maps to block my IP, I wait 2.5 seconds before running the next iteration. I also insert a check for whether I'm at the bottom of the page; if that condition is met, I press the space key one more time and set the keep-scrolling flag to False. Then I save the list element into an HTML file, and finally I can close the context, the browser, and the Playwright object.

Let me terminate everything and run the script. Right now my project folder does not have the HTML page, so let me run the script; it may take a while to finish, and I'm going to speed this part up a little. Looks like the script has finished, and here is the HTML page. If we open it, we can see that we do not have the entire Google Maps page like the view we saw before; the only thing we have is the results list, which corresponds to that feed element. That's step number one: downloading the HTML page as the raw data.
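Here's a rough sketch of step one as described above. The XPath for the search box, the end-of-list check, and the output filename are my own assumptions, not the exact selectors from the video; Google Maps markup changes often, so inspect the page for the current values.

```python
import time
from playwright.sync_api import sync_playwright

BASE_URL = "https://www.google.com/maps"
SEARCH_QUERY = "bakeries in Chicago, IL"

with sync_playwright() as p:
    # Launch a visible Chromium window (Firefox/WebKit work too).
    browser = p.chromium.launch(headless=False)
    # A context is an isolated environment for cookies and session data.
    context = browser.new_context()
    page = context.new_page()  # a page represents a browser tab

    # Navigate to Google Maps and wait for the page to finish loading.
    page.goto(BASE_URL)
    page.wait_for_load_state("networkidle")

    # Type the query into the search box and press Enter.
    # This selector is an assumption; check the page for the current one.
    search_box = page.locator('//input[@id="searchboxinput"]')
    search_box.fill(SEARCH_QUERY)
    search_box.press("Enter")

    # The results list is the element with role="feed".
    feed_xpath = '//div[@role="feed"]'
    page.wait_for_selector(feed_xpath)
    feed = page.query_selector(feed_xpath)
    feed.scroll_into_view_if_needed()

    # Keep pressing Space to page through the results, pausing between
    # presses so Google Maps doesn't block the IP.
    keep_scrolling = True
    while keep_scrolling:
        page.keyboard.press("Space")
        time.sleep(2.5)

        # One way to detect the bottom of the list (an assumption of mine,
        # not necessarily the check used in the video).
        if "You've reached the end of the list" in feed.inner_html():
            page.keyboard.press("Space")
            keep_scrolling = False

    # Save only the results feed, not the whole page.
    with open("google_maps_results.html", "w", encoding="utf-8") as f:
        f.write(feed.inner_html())

    context.close()
    browser.close()
```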
The second step is to create a parser to parse the information, and in this case I'm going to use the OpenAI API. Here I'm going to get my API key first: I go to the dashboard, then to API keys, and create a new key. To parse the information there's one more library that I forgot to install, which is BeautifulSoup; we can install the package with pip install beautifulsoup4. And to use the OpenAI API we can use the openai package, so we run pip install openai.

For this module I'm going to name the script maps_parser.py. For the import statements I'm going to import json, from bs4 I'm going to import BeautifulSoup, and from openai I'm going to import OpenAI. Then I initialize an OpenAI client instance: I define my API key here and pass it when creating the OpenAI object.

For the parser module I'm going to create two functions first. The first function loads the HTML file and returns it as a BeautifulSoup object. Then I create another function called extract_text_elements, which extracts the text from the given elements; to use the function we pass the BeautifulSoup object, then specify the tag and attributes.

To organize the text strings into JSON records (which you can later convert into a CSV file), I'm going to create a function called generate_json_prompt. In the prompt I say: convert the following list into a JSON object, with each record based on this JSON record schema. The schema extracts information such as the business name, rating, reviews, price, category, location, hours, services, and actions; actions are the things you can do directly on Google Maps, for example ordering food online or making a booking. And to ensure the output is returned as a JSON object, the prompt ends with: reply only with the JSON itself and no other
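A sketch of the parser module as described; the function names, the text-extraction details, and the exact prompt wording are my paraphrase of the walkthrough, not a verbatim copy.

```python
# pip install beautifulsoup4 openai
import json
from bs4 import BeautifulSoup
from openai import OpenAI

# Placeholder key, as in the walkthrough; avoid hard-coding real keys.
API_KEY = "YOUR_OPENAI_API_KEY"
client = OpenAI(api_key=API_KEY)

def load_html_file(file_path: str) -> BeautifulSoup:
    """Load the saved results HTML and return it as a BeautifulSoup object."""
    with open(file_path, "r", encoding="utf-8") as f:
        return BeautifulSoup(f.read(), "html.parser")

def extract_text_elements(soup: BeautifulSoup, tag: str, attrs: dict) -> list[str]:
    """Extract the text of every element matching the given tag and attributes."""
    return [el.get_text(separator=" ", strip=True) for el in soup.find_all(tag, attrs=attrs)]

def generate_json_prompt(text_list: list[str]) -> str:
    """Build the prompt that asks the model to convert raw listing text into JSON records."""
    schema = {
        "name": "", "rating": "", "reviews": "", "price": "",
        "category": "", "location": "", "hours": "",
        "services": "", "actions": "",
    }
    return (
        "Convert the following list into a JSON object, with each record based on "
        f"this JSON record schema: {json.dumps(schema)}. "
        "Reply only with the JSON itself and no other descriptive or explanatory text.\n\n"
        f"Input:\n{json.dumps(text_list)}"
    )
```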
descriptive or explanatory text. Then we provide the input, which is going to be the HTML text string. To make the OpenAI API call we use the function get_openai_response, which takes the OpenAI client object and the prompt. I noticed that GPT-3.5 doesn't always extract all the records, and I believe that's because GPT-3.5 isn't capable enough, so I recommend using GPT-4o or GPT-4, which are a lot more powerful and better at extracting large amounts of information. I'm setting max_tokens to 4,000, here's the prompt, we request only one response, and I set the temperature to 0.1. From the response we retrieve the message content and use strip to remove the extra spaces on both ends, and just in case the output is returned in Markdown format, we remove the code fence.

Those are all the functions we need to create. For the entry point I create a main function. Here I load the HTML file, then extract the text, passing the tag and the attributes, and I create a list to hold the records. Because there's a limit on how many characters you can provide in a prompt, just to be safe I'm only going to send a batch of records at a time. I call generate_json_prompt with the input list, which contains only the text strings for that batch, print a message that the batch is running, then run the prompt with get_openai_response, passing the client and the prompt. From the response content we use json.loads to convert it into a JSON object, and I name the output data. With this print statement I just want to make sure how many records I'm getting back from each response, and then I append the records to the record set.

Just to be safe, let me also install pandas, since I have a step in my script that translates the JSON records into a CSV file; here I also import the pandas package as pd. We can create a DataFrame by passing in the JSON record set, in this case a list of dictionaries, and from the DataFrame we can save the output as a CSV file using the to_csv method, providing the file path and the encoding and setting index to False.

Now, to demonstrate, I'm going to run the script. Looking at the print statements: for the first response we're getting 32 records, and I believe the other two batches behave the same way. If we look at the CSV file, it should contain 100 records in total; the row count here is 101, but once we take out the header that's 100 records, which matches the loop here. I also want to make sure we're getting all the records in the correct format, and here we have the business name, rating, reviews, price, category, location, hours, services, and actions.

All right, that's going to be it for this tutorial; I just wanted to show you how we can use Playwright and generative AI to scrape Google Maps. If you enjoyed this video, please don't forget to like the video and subscribe to the channel. I'll see you guys in the next video.
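Continuing the parser module sketched above (it reuses client and the helper functions defined there), here's a sketch of the response helper and the entry point. The batch size, the tag/attribute filter, the response-shape handling, and the file names are illustrative assumptions.

```python
import json
import pandas as pd  # pip install pandas

def get_openai_response(client, prompt: str) -> str:
    """Send the prompt to the chat completions endpoint and return the text content."""
    response = client.chat.completions.create(
        model="gpt-4o",      # GPT-3.5 tended to drop records; GPT-4o/GPT-4 are more capable
        max_tokens=4000,
        n=1,
        temperature=0.1,
        messages=[{"role": "user", "content": prompt}],
    )
    content = response.choices[0].message.content.strip()
    # Remove a Markdown code fence in case the model wraps the JSON anyway.
    content = content.strip("`")
    content = content.removeprefix("json").strip()
    return content

def main():
    soup = load_html_file("google_maps_results.html")
    # The tag/attribute filter here is an assumption; inspect the saved HTML
    # to pick the element that wraps each listing's text.
    text_items = extract_text_elements(soup, "div", {"jsaction": True})

    records = []
    batch_size = 35  # stay safely under the prompt-size limit; adjust as needed
    for i in range(0, len(text_items), batch_size):
        prompt = generate_json_prompt(text_items[i:i + batch_size])
        content = get_openai_response(client, prompt)
        data = json.loads(content)
        # The response may come back as a bare list or wrapped in an object.
        batch = data if isinstance(data, list) else data.get("records", [])
        print(f"Batch {i // batch_size + 1}: {len(batch)} records")
        records.extend(batch)

    # Convert the combined records into a DataFrame and save them as CSV.
    df = pd.DataFrame(records)
    df.to_csv("google_maps_results.csv", encoding="utf-8", index=False)

if __name__ == "__main__":
    main()
```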
Info
Channel: Jie Jenn
Views: 509
Keywords: playwright, playwright python, web scraping, python
Id: 6WxaEbkOPKM
Length: 16min 15sec (975 seconds)
Published: Sun Jun 23 2024