This AI Agent can Scrape ANY WEBSITE!!!

Captions
So lately there have been a couple of libraries that can use the power of large language models to scrape the web without us having to do much of anything: they read the URL and give us either markdown or sometimes even structured data that we can add to an Excel sheet or store in a database. And since we have really strong and really cheap large language models today, it makes sense to use these libraries instead of going with the free alternative of Beautiful Soup or other approaches where we have to inspect the web page, understand its structure, and locate the specific elements we want to scrape.

The advantages of these new packages are many. The biggest, as we said, is saving effort, but they also let you create a script that acts as a universal web scraper for your specific use case. Say you want to scrape data out of a news website: you can use the same code to scrape multiple news websites, and sometimes you can even use that same code on a website that has nothing to do with news, where you are looking for totally different information. Today we are going to see how to create such code and how it can help you scrape the web with minimal changes. So let's jump to my screen.

All right, before opening VS Code and starting to work, let's first look at Firecrawl, the library we are going to use. By the way, it's open source and it has 4,000 stars. If we come back to Firecrawl and create an account, we can go to Accounts here and get an API key; that is the API key we will use later in our project. If we go to the Playground and, say, we want to scrape OpenAI's pricing page and click Run, we receive markdown of the entire page. The thing is, we don't have any divs or lists or any of the tags that live inside the HTML. This is very important, because previously, if you wanted to do extraction from HTML, you had to pass the whole structure into a large language model, and that is a lot of tokens, sometimes well over 100,000. The fact that we get markdown helps tremendously in producing data clean enough to pass to a large language model, and from there it makes sense financially to use any of the new cheap models to do the extraction and turn it into structured data.

Now let's look at the universal web-scraping agent workflow before opening VS Code, so you know exactly what I am doing as I write the code. Our input is the URL; that is always the case. The URL is passed to Firecrawl to get the markdown. Once we get the markdown from Firecrawl, we give it to a large language model. It could be an OpenAI model like GPT-3.5 or GPT-4o, or Gemini Flash from Google, or any other model. Then we ask it to extract something from the markdown according to our fields; we have to tell it exactly which fields we want extracted. After that we get semi-structured data: even though we get a JSON answer from the extraction, we cannot control 100% of the key names inside that JSON, which is why I call it semi-structured even though it is technically structured. Once we have that data, we go to another stage where we format and save it.
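As a quick preview of that first Firecrawl step (the same thing the Playground did a moment ago), here is roughly what the call looks like from Python with the firecrawl-py package. This is a minimal sketch, not the project's code: it assumes a FIRECRAWL_API_KEY environment variable and an older firecrawl-py release in which scrape_url returns a dict with a "markdown" key; newer releases return a document object with a .markdown attribute instead.

```python
# Minimal sketch: fetch a page as markdown with firecrawl-py.
# Assumes FIRECRAWL_API_KEY is set and that scrape_url returns a dict
# containing a "markdown" key (early firecrawl-py behaviour).
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
result = app.scrape_url("https://openai.com/api/pricing/")  # example URL

markdown = result.get("markdown", "")
print(markdown[:500])  # clean markdown, no divs or other HTML tags
```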
We format it as JSON, then put it into a pandas DataFrame, and we save both. Here you could use a database or whatever storage medium you prefer; I chose JSON and Excel.

So let's open VS Code. As always, we create a new folder; let's call it listing-firecrawl. Inside it we create a new file, app.py, and of course we create a virtual environment: python -m venv venv, then activate it with venv\Scripts\activate. Let's clear the terminal. The last step in the initiation of every project is creating a .env file; this is where we place our API keys. If I go to Accounts on Firecrawl, I can copy this API key. I will of course also use OpenAI, and at this point everyone knows how to get an OpenAI API key, so let's copy it and place it in here as well.

Now we install all of the requirements we need. I already have a requirements file that I'll drop right here; these are all the packages we need in our project. So we run pip install -r requirements.txt. By the way, the package is firecrawl-py, not just firecrawl, if you want to install it on its own without the other ones.

Now that everything has finished installing, we can start coding. Let's clear the terminal and do our imports: from firecrawl we import FirecrawlApp, from openai we import OpenAI, then import os, import json, import pandas as pd, and lastly import datetime.

The first function we start with is scrape_data, and it only takes the URL. The first thing it does is call load_dotenv so we can load the API keys from the .env file. We then use that API key to initialize the FirecrawlApp, and we use that app to scrape the URL. (We'll delete this extra bit for now; we won't need it.) Then we check for markdown in the response, in case we got an empty response or some other problem: if it's there we return the markdown, otherwise we return an error.

The second function simply saves that raw data, because I am not a fan of running an extraction and then not saving the data anywhere. We save it inside a folder called output, which I've just created here. It's pretty straightforward: we write a .md text file inside the output folder, named according to the datetime of when we run the process.
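Here is a small sketch of what those two functions might look like, under the same assumptions as above (a FIRECRAWL_API_KEY entry in .env and a dict-style scrape_url response). The exact function names match the walkthrough, but the error handling and the file-naming scheme are illustrative rather than the video's verbatim code.

```python
# Sketch of scrape_data / save_raw_data as described above.
# Assumes FIRECRAWL_API_KEY lives in .env and that scrape_url returns
# a dict with a "markdown" key (early firecrawl-py behaviour).
import os
from dotenv import load_dotenv
from firecrawl import FirecrawlApp


def scrape_data(url: str) -> str:
    load_dotenv()  # pull API keys from the .env file
    app = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY"))
    result = app.scrape_url(url)
    # Guard against an empty response or a failed scrape.
    if result and "markdown" in result:
        return result["markdown"]
    raise KeyError("No markdown found in the Firecrawl response.")


def save_raw_data(raw_data: str, timestamp: str, output_folder: str = "output") -> str:
    os.makedirs(output_folder, exist_ok=True)
    # One .md file per run, named after the run's timestamp (illustrative name).
    path = os.path.join(output_folder, f"rawData_{timestamp}.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write(raw_data)
    return path
```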
Now we get to the most important part, and please, it is not very complicated; I'm just going to paste it in here. The system prompt is quite long, but other than that it's not complicated. This is the format_data function, the function responsible for taking the raw data and extracting structured data from the markdown we got before. Here we initialize our OpenAI client in client, and if we don't provide the optional fields parameter we fall back to this default list of fields.

By the way, the use case we are going to use is Zillow; this is the website we are going to extract. As you can see, we have a map here and the listings around it, and what we want to do is extract data out of this page and structure it. The fields you see here are the fields you would normally find in a real estate listing: the address, real estate agency, price, beds, and so on. That's the information we are going to extract from the website.

Then we have the system message and the user message, which we store in variables here. The system message says: "You are an intelligent text extraction and conversion assistant"; its role is to take raw data and extract JSON from it. This is very important: we have to state that the output is JSON, and later on you'll see the response format set to a JSON object. This is incredibly important, because if we don't do this we're not guaranteed to get a JSON response every time.

Then we get the response from the OpenAI chat completion. We're not going to use GPT-4o or GPT-4 Turbo; we are only going to use gpt-3.5-turbo-1106 (this "110" in my code should be 1106), and we pass in the system message and the user message. Inside the user message we have "Extract the following information from the provided text", then the data, which is what we already saved and got back from scrape_data, and of course the fields we provided. Finally, if we got a response and it is not empty, we parse it with json.loads to turn that string into a JSON object that we will save in the next function.
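A condensed sketch of such a format_data function follows. The prompt wording and the default field list are abbreviated versions of what the video describes, not its exact text, and the function assumes OPENAI_API_KEY is already in the environment (for example loaded from .env).

```python
# Sketch of format_data: ask the model to return the requested fields as JSON.
# Prompt text and DEFAULT_FIELDS are abbreviated/illustrative; OPENAI_API_KEY
# is assumed to be available in the environment.
import json
from openai import OpenAI

DEFAULT_FIELDS = ["Address", "Real Estate Agency", "Price", "Beds",
                  "Baths", "Sqft", "Home Type", "Listing Age", "Listing URL"]


def format_data(data: str, fields=None) -> dict:
    client = OpenAI()
    fields = fields or DEFAULT_FIELDS

    system_message = (
        "You are an intelligent text extraction and conversion assistant. "
        "Extract structured information from the given text and return it "
        "as pure JSON with no extra commentary."
    )
    user_message = (
        "Extract the following information from the provided text:\n"
        f"{', '.join(fields)}\n\nText:\n{data}"
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        # Forcing a JSON object is what guarantees parseable output every time.
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )

    content = response.choices[0].message.content
    if not content:
        raise ValueError("Empty response from the model.")
    return json.loads(content)
```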
Here we have our last function, which saves the formatted data, first as JSON and then as Excel so we can visualize it easily. As always it goes into the output folder: the return value from format_data is passed in here and written out as JSON. Then we have a little helper whose purpose we'll see shortly, and of course we put the formatted data into a DataFrame that we later save as an Excel sheet.

Okay, now let's add the last bit of code to run the process. In this final part we create a timestamp to use inside our functions, and we call our functions with the URL we have here; this is the first website we are going to extract. We save the raw data we just got using our function, then pass that raw data to format_data, and finally save the result as JSON and as an Excel sheet. So let's run our code and see what happens.

Okay, we have a problem, and it's because of datetime: we should use from datetime import datetime, because there is a naming collision and the code was using the module instead of the datetime class inside it. But that's not important; let's run it again and see what happens.

All right, we can already see that it saved the raw data here. If we open it, you can see this is basically the markdown of the whole page; this markdown is then handed to OpenAI so it can give us JSON and then an Excel sheet. This is the JSON it came up with, and from this JSON it was able to format the data and save it as Excel. Here we have the structured data and then the Excel sheet; let's look at both.

As you can see in the output, instead of giving us just the information we want, the model usually gives us one top-level key and puts all of the information inside that one key. This is why in my code, as I mentioned, I check whether our dictionary has only one key, and if that's the case I go inside that dictionary to get all the entries for my formatted data. So if I go to my project and then to output, I find my Excel sheet, and when I open it I find all of the information I want.

So from this website I have been able to generate structured JSON and then a structured Excel sheet that contains all of the information I want, including URLs that take me to the exact listing I scraped the data from. If we click on this listing, it takes me exactly to the listing, and I can compare my scraped data with the listing itself, all of this without using any tags or any kind of page inspection that I would normally do with Beautiful Soup or traditional ways of scraping. Even better, we can use the same code to scrape other websites that have nothing to do with the structure of the one we just scraped.
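To round out the walkthrough, here is a sketch of the save step and the run block, including the single-key unwrapping trick mentioned above. It reuses the scrape_data, save_raw_data, and format_data sketches from earlier; the file names and the example URL are illustrative, and writing the Excel file assumes openpyxl is installed alongside pandas.

```python
# Sketch of save_formatted_data plus the run block described above.
# File names and the example URL are illustrative; .to_excel() needs openpyxl.
import json
import os
from datetime import datetime

import pandas as pd


def save_formatted_data(formatted: dict, timestamp: str, output_folder: str = "output") -> None:
    os.makedirs(output_folder, exist_ok=True)

    # Save the JSON exactly as returned by the model.
    json_path = os.path.join(output_folder, f"sorted_data_{timestamp}.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(formatted, f, indent=4)

    # The model often wraps everything in a single top-level key
    # (e.g. {"listings": [...]}), so unwrap it before building the DataFrame.
    if isinstance(formatted, dict) and len(formatted) == 1:
        formatted = next(iter(formatted.values()))

    df = pd.DataFrame(formatted if isinstance(formatted, list) else [formatted])
    df.to_excel(os.path.join(output_folder, f"sorted_data_{timestamp}.xlsx"), index=False)


if __name__ == "__main__":
    url = "https://www.zillow.com/salt-lake-city-ut/"  # illustrative listings URL
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    raw_data = scrape_data(url)
    save_raw_data(raw_data, timestamp)
    formatted = format_data(raw_data)
    save_formatted_data(formatted, timestamp)
```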
This is the closest thing to a universal web scraper that I have ever created, and it is amazing how easy it is to build these processes with large language models today.

If I go back to the website to compare the data, we can see $630,000, which is exactly the price, and the square footage matches exactly what we want; then we have the home type and the listing age. I would say the listing age part has not been very successful (it should have left it empty), but for all the other fields it was able to get the data, and with a simple date subtraction we could get the exact date this house was listed, so that is a problem that can easily be resolved.

Okay, so that was the first URL. Now let's go to a totally different website that has nothing to do with this URL. Let's open it here and see if the code can get data from it. Let's run it. Okay, we got the raw data, it produced the structured data, and then it gave us the Excel sheet. First let's look at the structured data and see what it was able to extract; there it is, and if we go back to our folder we find this Excel file, and when we open it we find this information. So literally this code has been able to scrape two different URLs that have nothing to do with each other, simply because we used large language models. If we go back to our Excel sheet and do the same check, we find, for example, that the first entry is this address, with this real estate agency, this price, two beds, one bath, and 1,000 square feet. That is already very good.

Now let's see whether we can do this with a website in a foreign language, for example a website in French. This is a French site showing homes in the city of Lyon, France; let's see if it can get those listings and figure out how to extract them even though our prompts are in English. Let's run our code, and here we have an error, and this is very important: the error says the model's maximum context length is 16,000 tokens. The model I used doesn't have a big enough context window to handle the raw data we just got; the raw data here is simply too long, so it couldn't take all of this information and process it. If I go back to the response, I am using gpt-3.5-turbo-1106; we can just change that to GPT-4o, run the code again, and see if it can handle it this time.

Okay, we got the raw data here, and this time it's going to take longer to process because there are so many tokens. And finally we got an answer; finally, I waited so long for this one. The raw data is very long, as we said, but we got an answer, so let's visualize it. Let's sort the output files; yep, this is the latest one, so let's open it. As you can see, even in French it was able to scrape all the data: I can see the price (in euros, of course), the address, the real estate agency, the number of beds, and if I click here and open it, it opens exactly the right listing.
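Going back to the context-length error from a moment ago: the video fixes it by switching the model by hand. One way to make that automatic is a rough length check before the call. The sketch below is not from the video; the four-characters-per-token rule of thumb and the 15,000-token cutoff are approximations, and the model names are just the ones discussed above.

```python
# Rough heuristic (not from the video): pick a larger-context model when the
# markdown is too long for gpt-3.5-turbo-1106's ~16k-token window.
# The 4-characters-per-token estimate is an approximation, not an exact count.
def choose_model(markdown: str, limit_tokens: int = 15_000) -> str:
    approx_tokens = len(markdown) // 4
    if approx_tokens > limit_tokens:
        return "gpt-4o"          # much larger context window
    return "gpt-3.5-turbo-1106"  # cheaper, fine for short pages
```

You could then pass the chosen model name into the chat-completion call instead of hard-coding it.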
If we compare the two, we can see the address is the right one; we have the agency (I'm not sure where it found it, but yes, here it is, so the agency is right); the price is correct; and then we have three beds here. I'm not sure exactly what "beds" maps to: in French, rooms that can be closed off are called "chambres", so maybe that is the equivalent, but it covers everything, rooms that can be closed, living rooms, and so on. The listing here shows four, so I don't know whether it should have said four; you tell me in the comments what "beds" means in English and whether this is correct or not, because you can have three beds in one room and I don't understand the logic of it. We don't have any indication of baths; it's not really something French listings mention that much. But here we have the first clearly wrong field: the column says square feet while the listing gives square meters, so one is the imperial system and the other is the metric one, the correct one, and the model did not understand that this value is actually in square meters; it should at least have indicated that. And of course here we have the rest of the information.

So basically that's it: the same code was able to extract from a totally different URL, and that is already very good. This code would have been impossible a year or two ago; it was impossible to create this kind of universal web scraper that works in any instance without having to deal with any inspection or page-specific details.

Anyway, that has been it, guys. Thank you so much for watching; it has been a long video, I know, but thank you for staying all the way through. I really appreciate it, and I'll catch you next time. Peace.
Info
Channel: Reda Marzouk
Views: 43,531
Keywords: open source, python, gpt, chatgpt, mistral, Ollama, gemma, gpt-4o, scraping, data, agent, AI, AI Agent
Id: ncnm3P2Tl28
Length: 17min 44sec (1064 seconds)
Published: Fri May 24 2024