Easy web Scraping, crawling, web search for LLMs with GPT-4o chat

Video Statistics and Information

Captions
Hey everyone, in this video we are going to do a deep dive on Firecrawl, which turns websites into LLM-ready data by crawling and scraping URLs, and also has a search functionality. When you sign up you get 500 free credits. I covered an alternative, Jina AI, in my latest video before this one, so if you want to check that one out you can find it on my YouTube channel.

We are going to look at how we can scrape a website, how we can scrape with options (many are available), how we can scrape structured data out of URLs using Pydantic, how we can crawl a website, and also how we can perform search. Firecrawl can return search results both with plain requests and with the SDK. I have brought all of this together in Firecrawl Chat, in which you can scrape documents, crawl websites, and perform searches, and then chat with GPT-4o about those things. You can add to those documents during the chat and start a new conversation as well. We are going to take a look at each one of these files, review them, and see how they work, but let's just run the chat functionality first, since all of the options are present there.

Before I continue, I would like to say that the code files for this project will be available at my Patreon. Everything except Firecrawl Chat will be available for free to free members, and Firecrawl Chat will be available to Booster and plus level subscribers. My currently active patrons have been supporting me for about eight months on average, so they are really enjoying the content that I provide; consider becoming a member yourself.

When we run Firecrawl Chat we get some options: c to crawl a URL, s to scrape a URL, q to perform a search, n to start a new conversation, or Enter to continue chatting. If I were to press Enter now I would just chat with GPT-4o, but we want to start with scraping, so let's press s. We are then asked to enter a URL to scrape. Let's choose the Firecrawl Python SDK page and simply paste it here; Firecrawl will automatically scrape it and save it. There we go: the scraped data is saved to a JSON file. We can add to this if we want by selecting s again, or press Enter to start the chat.

Before I ask anything, I do want to say that the chat also has commands: new to start a new conversation and add to add more content to the conversation. I can just ask "what is this about", and it says that this document appears to be the homepage and documentation for Firecrawl, so now our LLM has some additional context. I can go ahead and type add, copy the URL for the universal Turing machine Wikipedia page, select scrape again, and paste this URL. After it has been scraped, the script asks whether we want to start a new chat with this content or append it to the ongoing chat: "do you want to add this content to the context or replace it?" We select a for append. Now we can add more URLs, crawl, or search, but let's enter the chat again and ask "what is this about" one more time. It says the content provided is about two distinct topics: one is the Firecrawl documentation, the other is the Wikipedia page on the universal Turing machine.

Let's perform a crawl this time by selecting c and copying and pasting the main page of the Firecrawl documentation. This time we are asked to enter a limit for the number of pages to crawl; the default is three, but
you can set it to anything you like. At this stage the script checks the status of the crawl while Firecrawl is crawling: here we see the crawl status ("active"), the current page, and the current step. We are actually printing the entire status object, which we didn't need, so let me fix that real quick. Okay, let's try this again: we should now print just the crawl status ("active") every 3 seconds until it's done. I have just fixed it so it won't print like this in the future; it will print with a new line instead. Anyway, our crawl has completed and been saved to the file, and now we can press Enter to start a chat, ask "what is this about", and we get that the content provided is the documentation for Firecrawl. So this is how we can crawl; we can then go back, add, and crawl or scrape more pages.

Let's now start a new conversation by pressing n and use the query functionality to see how we can search with Firecrawl. Let's ask the question "what do developers think of Cursor IDE". We can limit the number of search results as well; the default is three, so let's stick with three in this case. Firecrawl is now retrieving some search results for us; we will also be writing them to a file, and after that they will be automatically added to our chat context. Searches do take a few seconds to complete, just keep that in mind. Okay, our search results are written to search_result.json. Now we can start a chat with this grounding text and ask the same question, and since we know GPT-4o has this context, "based on the provided context, the opinions of developers are..." is being written out. So this works really well, and it gives you scraping, web crawling, and web search abilities.

Now let's start reviewing the code for all the files. Before I do that, I'd like to mention that if you enjoy my videos you can find all of them at my website, echohive.live, along with the code download links for patrons. When you become a patron you get access to all my projects' code downloads; there are over 250 of them at my Patreon right now. I also recently started working on my 10x master class, and I was just running a poll among my patrons to see if they actually find it useful. So far everybody who has used it found it useful; the only other response is that they haven't tried it yet, so I recommend you check it out. My 10x master class currently includes code walkthroughs for 11 projects, and I add to it on a regular basis. These are not just regular coding walkthroughs; the goal is to build these entire projects with Cursor, using AI assistance. I barely wrote any code building any one of these projects, so the knowledge I want to transfer is how to use AI-assisted coding effectively. So check that out as well; the links to the code files for this project and the 10x master class will be in the description.

Let's start reviewing the code with the simplest example: scraping a simple website. The requirements for this are the firecrawl library, plus openai for Firecrawl Chat. As you can see, we are importing firecrawl, json to save the result to a file, and os to pass our API key; I'm passing it from my environment variables, but you can enter your key here as a string. Now we simply specify a URL, call app.scrape_url, and save whatever data comes back to a JSON file.
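For reference, a minimal sketch of that simple scrape script might look like this, assuming the firecrawl-py SDK as it behaved around the time of this video (scrape_url returning a plain dict); the URL and output filename here are just illustrative choices:

```python
import json
import os

from firecrawl import FirecrawlApp

# Pass the API key from an environment variable (or paste it here as a string).
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

url = "https://docs.firecrawl.dev/sdks/python"

# Scrape the page and save whatever comes back to a JSON file.
scraped_data = app.scrape_url(url)
with open("scrape_result.json", "w") as f:
    json.dump(scraped_data, f, indent=4)
```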
Now let's take a look at scrape with options. Again we import Firecrawl and set the API key, specify a URL, and then we can pass in parameters, for example to only return the main content of the page. If you visit Firecrawl's documentation, click on API Reference, and then click on scrape or crawl, you can see all the options available and read more about them there. Then we scrape as we did before, except we are passing the parameters to the scrape_url method.

Now this one is very cool: let's take a look at the scrape structured data example. This allows you to define a Pydantic schema and then scrape a URL with that schema in mind. It costs more than the usual amount of credits, just keep that in mind, but everything else is the same. You define a Pydantic schema; if you don't know much about Pydantic, just have a chat with ChatGPT about it — it lets you define data types and the classes around them. Here we are going to scrape news.ycombinator.com, which has many posts, so we define a schema that includes a title, points (how many points each post got), by, and a comments URL, based on BaseModel. Then we define another schema that is a list, saying that each item in the list will have title, points, by, and comments, and we specify the type of each key. We also define a field and set the max items to five, so the large language model in the background only returns five items, with the description "Top 5 stories". We then simply scrape the URL with these options, setting the extraction schema to the top articles schema's model JSON schema and the mode to LLM extraction, and save the result.

Let's quickly run this. Like I said, it costs many more credits than usual scraping and crawling, but when it's done, you not only get the content of the entire page: at the bottom, under the LLM extraction key, you get exactly the predefined schema — the top posts with title, points, by, and comments URL. As you can see, we have five of them. This is how you can use scrape_structured; you can define any type of schema you like.
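As a rough sketch of that structured scrape, assuming the parameter names Firecrawl documented at the time (an extractorOptions block with an extractionSchema and "llm-extraction" mode, plus pageOptions); newer SDK versions expose this differently, so treat the exact keys as illustrative:

```python
import json
import os

from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field

class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

class TopArticlesSchema(BaseModel):
    # Cap the list at five items so the LLM only returns the top five stories.
    top: list[ArticleSchema] = Field(..., max_length=5, description="Top 5 stories")

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

data = app.scrape_url(
    "https://news.ycombinator.com",
    {
        "extractorOptions": {
            "extractionSchema": TopArticlesSchema.model_json_schema(),
            "mode": "llm-extraction",
        },
        "pageOptions": {"onlyMainContent": True},
    },
)

# The page content comes back as usual; the schema-shaped items appear
# under the LLM extraction key at the bottom of the result.
with open("structured_scrape_result.json", "w") as f:
    json.dump(data, f, indent=4)
```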
Now let's take a look at the simple crawl. Again you define your parameters, pass in a URL, call the crawl_url method with the URL and the parameters, and save the result to a JSON file.

Now let me show you a more complex case for the crawling. Here we can set parameters for things we want excluded from the crawl: if you add something like "blog/*", then any URL that includes blog will be excluded. You can also set includes, which likewise applies to the URLs themselves, and you can set a limit; again there are page options, and if you want to know more, just go to the crawl section of the documentation and check out the different options. Normally you can start a crawl just like we did before and save its results, but a better way to do it is to set the wait_until_done parameter of the crawl_url method to False (it is True by default). When you do that, the crawl call returns an ID; you can grab that ID and then check the status of the crawl. The better way to do this is with a while loop: while the status is not "completed" or "failed", we print the status, sleep for 3 seconds, and check the status again. When the status is completed (you can also handle the failed case), we save the result to a JSON file.
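A minimal sketch of that non-blocking crawl with status polling, again assuming the SDK as it was at the time of the video (crawl_url with wait_until_done=False returning a job ID, and check_crawl_status for polling); the target URL and option values are illustrative:

```python
import json
import os
import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

params = {
    "crawlerOptions": {
        "excludes": ["blog/*"],   # skip any URL containing blog/
        "limit": 5,               # cap the number of pages crawled
    },
    "pageOptions": {"onlyMainContent": True},
}

# Start the crawl without blocking; we get back a job ID instead of results.
job = app.crawl_url("https://docs.firecrawl.dev", params, wait_until_done=False)
job_id = job["jobId"]

# Poll the crawl status every 3 seconds until it is completed or failed.
status = app.check_crawl_status(job_id)
while status["status"] not in ("completed", "failed"):
    print("Crawl status:", status["status"])
    time.sleep(3)
    status = app.check_crawl_status(job_id)

if status["status"] == "completed":
    with open("crawl_result.json", "w") as f:
        json.dump(status["data"], f, indent=4)
```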
Now let's take a look at search. We can perform a search both with requests and with the Firecrawl SDK, and it's so much simpler with the SDK, so let's look at that first. We pretty much define a query and pass it to app.search; you can pass in the query, and I did want to show you that you can pass in the options as parameters here as well, and then I simply save the result. If you wanted to use requests instead, you define the endpoint, define your payload with your options along with the query, set your authorization header with the Firecrawl API key, make the request with the URL, payload, and headers, and when you receive the response you save it to a file.
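Here is a small sketch of both approaches, assuming the SDK's app.search method and the v0 search endpoint that was current when this video was made; the payload keys and filenames are illustrative:

```python
import json
import os

import requests
from firecrawl import FirecrawlApp

api_key = os.environ["FIRECRAWL_API_KEY"]
query = "what do developers think of Cursor IDE"

# 1) SDK version: one call, with options passed as parameters.
app = FirecrawlApp(api_key=api_key)
sdk_results = app.search(query, {"pageOptions": {"fetchPageContent": True}})
with open("search_result_sdk.json", "w") as f:
    json.dump(sdk_results, f, indent=4)

# 2) Raw requests version: build the payload and headers yourself.
response = requests.post(
    "https://api.firecrawl.dev/v0/search",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
    json={"query": query, "searchOptions": {"limit": 3}},
)
with open("search_result_requests.json", "w") as f:
    json.dump(response.json(), f, indent=4)
```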
This pretty much concludes all the simple examples, which will be available at my Patreon for free to free members. Now let's review the code for Firecrawl Chat, which manages all of that for us automatically and inserts scrape results, crawl results, and web search results into the LLM's context.

We are doing quite a lot of imports here; we also need termcolor for the chat project, because it allows colorful printing in the terminal, and I added it to the requirements. Other than that, we set the OpenAI API key and the Firecrawl API key (we don't need this one line, it was left over from the Jina project). I defined the OpenAI client as AsyncOpenAI. This wasn't strictly necessary, and it is why all our functions are async, but I thought that if you wanted to modify this for parallel use cases — maybe performing multiple searches at the same time, multiple crawls, or multiple calls to GPT-4 — you can easily do that this way. To use async with OpenAI we simply import AsyncOpenAI and define it with our API key, and then everything else is exactly the same, except we define async functions, await the results, and use async for when processing the streaming responses.

So here we have our function which makes a call with a messages list and stream=True; we process the streaming responses, print them to the terminal, and return the full response in case you want to do something else with it, for example appending it to the messages list.

Then we have a handler for crawl mode, which handles crawling a URL. We pass in the messages list, which is shared with the GPT-calling function, take a user input for the URL, and take a user input for the limit on how many pages to crawl; if that part is skipped, the default is three. Then we create our crawl parameters, set the limit, set it to return only main content, and call crawl_url with the URL, the parameters, and wait_until_done=False, with a timeout of 10 seconds — this API can be slow at times, so feel free to change this. We retrieve the job ID and check the status once (I guess we didn't really need to do that here, because we are going to check it continually in the while loop). While the status is not completed or failed, we check the status, print it, and sleep for 3 seconds. Once the status is done, the loop breaks, and we extract the content from the status data by looking at the content keys, combine it into a list, and save it into a JSON file with nice indentation.

Then, if this is the first message we are encountering, we won't have a messages list to begin with, so we insert a system role saying "please answer the user's questions based on all the content provided" and insert the initial content with json.dumps, since we created the list right here on line 65. Otherwise, that means we already have a messages list, which means we have already added context, so this time we ask the user whether they want to add the new content to the existing list or replace it. If the answer is a for append, we add to the system message, with two new lines and "additional content", and just dump the new content in there; otherwise, if it was replace, we create an entirely new messages list with the exact same prompt we used before. That is how crawling is handled.

The scrape mode is pretty much the same: we take in a URL, scrape it, get the scraped content from the returned scrape data, save it to a file, and then perform exactly the same checks — if we are at the beginning of the chat loop we initialize the messages; otherwise we either append the new content to the end of the system message or clear the messages list and start a new system message with the new content as the initial content. The search mode takes a search query from the user and a limit (again defaulting to three if none is provided), performs a search with the query and the limit, extracts the content from the search results, turns it into a list, saves it to a JSON file, and performs the same process: initialize the system message with the initial content, or, if there is already context, ask whether to append or replace with a or r. We also have a handler for new-conversation mode, which just resets the messages, prints "started a new conversation", and returns the messages as an empty list.

Then we enter our main loop. As you can see, all these functions are async, so you could call them in parallel. We print the options to the user, and we also have the chat loop — this is where we actually chat with the large language model. It is not the initial menu the user sees; that is the one we will get to in a moment. When the user enters the chat, we tell them the available commands: new to start a new conversation, add to add more content, and exit to quit the chat. Within a continuously running while loop we take user input; if it is exit or quit we break out, if it is new we call the new-conversation handler, get the messages (which will be an empty list), and await the main loop, and if it is add we again just await the main loop, because the main loop will show its input prompt and go back to the beginning. So in a way the main loop is our main while True loop, and the chat loop is a while True loop that runs within the main loop. After all these options are checked, we append the user message, make a call with await to GPT-4o, get the response, and append it to the messages as the assistant response.

Now let's take a look at the main loop. But before I continue, I keep forgetting to mention that since we are using the OpenAI library, if you wanted to use an open-source model with Ollama or LM Studio, all you have to do is change the base_url for this AsyncOpenAI client — you can enter the base URL for your model right here. If you are running Ollama, for example, enter its base URL; as long as the model is compatible with the OpenAI API, you should be able to run your open-source models, or use OpenRouter or Groq, for example. I guess I should have mentioned that at the beginning, but better late than never.

So in our main loop we ask the user the initial question: enter c to crawl, s to scrape a URL, q to search the web, n to start a new conversation, or press Enter to chat. If it is c, we await the crawl-mode handler; if it's s we do likewise, and the same for q and n; otherwise we enter the chat loop we were just looking at. And that's it: we run our main function, which initializes messages to an empty list and runs the main loop, and then we run the entire script by calling asyncio.run(main()).
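To make the async streaming part concrete, here is a stripped-down sketch of that pattern with hypothetical function and variable names (not the exact code from the project); the commented-out base_url line is where you would point the client at an OpenAI-compatible local server such as Ollama:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    # base_url="http://localhost:11434/v1",  # for an OpenAI-compatible local server
)

async def call_gpt(messages: list[dict]) -> str:
    """Stream a chat completion, print tokens as they arrive, and return the full reply."""
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    reply = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        reply += delta
    print()
    return reply

async def main():
    # The scraped/crawled/searched content would be dumped into this system message.
    messages = [{"role": "system", "content": "Please answer questions based on the content provided."}]
    messages.append({"role": "user", "content": "What is this about?"})
    assistant_reply = await call_gpt(messages)
    # Keep the conversation going by appending the assistant's reply to the context.
    messages.append({"role": "assistant", "content": assistant_reply})

if __name__ == "__main__":
    asyncio.run(main())
```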
I hope you enjoyed this. If you like large-language-model-related programming content, do join our Discord channel; we have over a thousand members there, and I hope to see you there. Thank you, I hope you enjoyed this video, and let me know what you think.

This was the end of our video, but I'd like to talk quickly about my Auto Streamer version 3 project. Auto Streamer version 3 is a PyQt-powered, PyInstaller-packaged Python project I came up with that uses your OpenAI API key to create course websites like this one in real time. This one is also deployed at Railway, complete with audio — here it is reading about the elif clause in Python. You have quite a lot of choices: six different voices and over 50 languages, and you can choose a light or dark theme. When you go to generate a course, you just enter a topic you would like to generate — for example we just entered "Permaculture Basics" — and pick how many chapters you'd like. I'm just going to go ahead and generate it real quick; this shouldn't take too long. Our curriculum was created successfully. I can go into "view course outline" and search for that permaculture course, and as I can see: ethics and principles, design methods and tools, and practical applications. I can then select this course outline and continue to generate the course in light mode, and the website will launch automatically and be created in real time. You can record it, as I'm doing right now, or live stream it — it's really up to you. Once it begins — "ethics of care for the Earth: permaculture revolves around three core ethics, one of which is the care for the Earth; this ethic..." — I'm going to go ahead and pause it. If I let this run, the entire course would be generated live and I could listen to it live. If I let the course finish generating, it would then appear under "view and launch generated courses". For example, I just created a course called Financial Basics, and it looks like this; I can switch to dark mode as well, revisit the course, and zoom in on the text — "the importance of emergency funds" — yeah, it has three chapters which I can easily navigate.

The benefit of this, and what you'll get out of it, is that instead of chatting in a disorganized manner, it allows you to create structured courses that you can run and listen to before you go to sleep, or just to fill your time when you only have 5 or 10 minutes, and you can revisit these courses anytime you like. You can download a free demo of Auto Streamer from autostreamer.live; I'll put the link in the description, and a Mac version is coming soon. If you click on "download free demo" it will take you to my Google Drive download, and these are the files you'll be downloading: the Auto Streamer demo .exe is the same thing as the full version, except with limited features. If you want the full version, clicking this will take you to my Patreon shop, where it's currently only $200 instead of $300; you can read all about it on the website. You do need an OpenAI API key for this to work, and since this is a PyInstaller-packaged PyQt Python application, your McAfee or Malwarebytes may flag it, but all you have to do is make an exception for the program. If you have any questions, feel free to join our Discord and ask me there. Well, thank you for watching, and do let me know what you think of this project — I was really proud of this one. Like I said, the code files will be available at Patreon, and I also have special tiers for one-on-one meetings with me if that's something you're interested in. Thank you for watching, and I'll see you in the next video.

I would like to take a moment to talk about the benefits of becoming a patron. As some of you may know, in the last year and a half I've spent 3,000 hours on over 300 projects. As a patron you will have access to all the code files, so you can get inspiration and iterate quickly. Another benefit is that you'll have access to all my courses, including my most recent one, the one I'm most proud of, the 10x master class, teaching what I've learned about how to code fast and efficiently, as well as the Streamlit course and the FastAPI course on my Patreon. I also have tiers in which you can connect with me one-on-one, so check those out as well.
Info
Channel: echohive
Views: 874
Id: 6k7Qyt-V8EA
Length: 26min 1sec (1561 seconds)
Published: Wed Jun 12 2024