How to scrape the web for LLM in 2024: Jina AI (Reader API), Mendable (firecrawl) and Scrapegraph-ai

Captions
2024 is when I started to see a couple of examples of startups, especially from the recent YC batch, starting to pivot into web scraping, and I'm just here to try to connect the dots. This probably has something to do with Perplexity and the amount of interest in scraping the web so you can have the best up-to-date answers for your LLMs, or the best up-to-date search for a platform.

Mendable is an example. In the early days, if you were ever on the LangChain or LlamaIndex documentation site, you might have seen a little robot icon in the corner. You click on it and you can run a natural language search query over their documentation. They recently came out with this thing called Firecrawl, which is specifically for scraping the web using large language models, and we'll see a live example in a quick second.

Jina AI is a really cool company. They have embedding models, I think no language models, just embedding models, and you can try them without an API key. I don't know who's footing the bill for this, but they keep coming out with really cool free tools that you can try. One of them is the Reader API: all you have to do is add https://r.jina.ai/ before any URL and you get back clean data from that website. It's almost mind-blowing.

Last, I'm going to show you an open-source project called ScrapeGraphAI. This is a very elaborate orchestration of different Python modules that create graphs, so you can build a pipeline to scrape the web using large language models. The first two only give you back clean inputs, but this one actually incorporates AI and answers your question at the very end, or you can have 10 different steps of what to do when you go to a website and scrape it.

So what I'm going to do today is scrape my competitors' pricing pages. I'm doing this for myself; I'm
building a product right now and this matters to me. This is not a made-up use case; I'm actually looking forward to seeing what comes out of this. I'm building in the learning and development space, so my competitors are obviously popular tools like Articulate 360, plus some challengers that are new to the market like 7taps, Mindsmith, and some other companies. I have about four websites here today, and I'm keeping them all in a nice array. I'm just going to run this to save it in memory. The cool thing is that once I give you this, you can go home, play around, and try to do some market research yourself. Or a denial-of-service attack on those four companies.

Now I'm also going to set up this thing called tiktoken. Do you guys know what tiktoken is? I know you know what tiktoken is, but does anyone else know? Okay. So for a large language model, text gets encoded into tokens. Maybe you can explain this better than me, because I'm a software engineer. It depends on the model too, on what kind of tokenization or encoding mechanism it uses. For example, GPT-3 had a different encoding scheme than GPT-4 or GPT-4o. That's why they were able to reduce the cost for newer generations a little bit, just because of the way tokenization works. You get billed by the number of tokens, so fewer tokens for the same sentence is usually cheaper for you as a consumer.

We're using tiktoken here, which is straight from OpenAI. This is the exact library that OpenAI uses to encode text for their GPT models. We're not using it to create a new model; we're using it to count the number of tokens we get from the scraped content of each website. I want to know how expensive it is for me to scrape all these things, and I want to convert that to a dollar amount, comparing Beautiful Soup and
Jina AI and Mendable, to see which one will save me the most money. So I ran this on an example sentence: "What's the difference between beer nuts and deer nuts? Beer nuts are about $5; deer nuts are just under a buck." Anybody get the joke? This one costs about 0.135 if you're using GPT-4o, and the funny thing about OpenAI is that they haven't updated tiktoken for GPT-4o yet, so this is just a guesstimate.

The last thing I want to set up is PrettyTable. PrettyTable is just a Python library for making tables in your terminal. I want rows and columns so I can see which one costs more than which. That's all I'm doing here. It's a very long function, but all it does is take the scraped content and put it in the right columns and rows. You can read this at home.

All right, now let's set up the scrapers. First we're going to install a good old friend, Beautiful Soup 4. This is the most straightforward way to scrape any website, and probably the easiest way for them to detect that you're scraping and ban you. It's not very sophisticated unless you bundle in a bunch of other tools. Okay, so we've got a function to scrape the web with Beautiful Soup. We're not going to use it yet; we're going to run all of them at the same time, so I'm just going to set up a bunch of stuff here.

Here's Jina AI. This is what I'm talking about when I say this one is dead simple: all you have to do is add the string before the actual URL that you want to scrape, and it's completely free. I don't know what the deal is, but you use it, it's free, no strings attached. Maybe they're trying to do some market research or something, but it's great. "Gobs of VC money they can blow." Oh yeah, absolutely, like free rides on Uber. In the early days you'd just get free rides over and over.

Okay, the last provider is Mendable. Mendable, like I said, recently
pivoted, or I guess not fully pivoted, from doing documentation chatbots. You go on LangChain, LlamaIndex, all these companies, and you can chat with a little chatbot at the bottom. We're going to try this one too, but this one requires an API key, and they give everybody, I think, 300 tokens for free. I'm not sure how many websites you can scrape with that.

"So from these other tools, is the output all text, normalized?" It's only Markdown. They all output Markdown for some reason, maybe because of large language models; they made these tools specifically to output Markdown from a bunch of HTML tags. And it's also just a pure string. You get a string back, a very long one.

Okay, this is the moment that we run everything. All I'm doing here is going to these websites and trying to scrape them with the three different tools that I have. It's going to take about a minute, so we can watch this bar go from left to right, or you guys can ask questions.

We got a table back. Good thing I've got my PrettyTable so I can see the difference between all of them. This is my biggest competitor, by the way. They have so much money I don't even know what to do with them. With Beautiful Soup you get your regular HTML stuff, very, very clean. Well, not really. This is Firecrawl. For Firecrawl you get something a little better, very much Markdown: you've got brackets, you've got links, and all that stuff. You can already see that you can skip a bunch of stuff here, and your large language models will love you if you give them clean data like this. But take a look at Jina on this side. Jina is even more human-readable. Even though they say it's supposed to bring you back Markdown, they actually took away all the brackets and everything. So you've got bad
and then you've got Markdown in actual Markdown format, and this one promises you Markdown but it's actually a string in a human-readable format. We can see that for pretty much every single example here, the Jina one is usually the one that you can actually read. This one in the middle, from Mendable, is not that human-readable. Especially if your large language model doesn't care about URLs and you just want it to know facts, you might not want all of this stuff in there. You might want just what's human-readable, maybe for a reasoning task. Anybody have any questions so far? Okay, so that was surprisingly fast.

All right, we also got a cost table here. Let's take a look at this real quick. That's a great question. It reminds me of something: I think when I was playing with this, Adobe actually blocked me with Beautiful Soup. So let me show you. What I'm going to do is comment these two out, show more content, and run this again. Okay, let's see, Articulate. This is the most bare-bones Beautiful Soup setup, like your intern is doing it. It's not very sophisticated. So you get a 403. Yeah, that's why there was nothing there. Great catch. So you can't compare the costs, not for the first one; the first one was blocked. I've got to run this again, but in the table here all we're doing is comparing the input cost between the three if we were to put this into GPT-4. Let's just forget about the first one. This one is seven times less, and this one is even way less; compare these two, 100 times, or 70 times.

Any questions? "Right now these costs are just the cost of the input tokens and output tokens, right?" Just input tokens for GPT-4, yeah. But it has separate cost for input and output,
right? It does, but this table alone is just input, because for the outputs I'm going to show the last step, which is using an LLM to extract a JSON of just the things that I want: the pricing tier names and the actual pricing. So the output is going to be the same regardless, but if you use different tools you get different amounts of input, dramatically different. Especially this one, for example; it's pretty nuts. "I guess it's because of the extra tags that are there in Firecrawl?" Yeah. I think for this one, if you hit an image that has the entire binary in it, then you get the entire thing. This one also gets the entire thing. This one must scrub all of that information. I don't know what they're doing here, but they're doing some crazy stuff, and this is very clean, by the way; it's human-readable. But again, if you want the actual URL for most things, you might be better off using this one, otherwise you lose a lot of resolution. If you want a reasoning task done, you don't need that; you just need the factual things.

All right, so it's pretty obvious which one we should be paying for, not knowing anything else about these companies. Now we're going to use OpenAI to do some extraction, because I don't want to look at just the input. I want just one JSON with all the data that I was looking for in the beginning: how many tiers each of these competitors has, and how much each tier costs. That's all I want to know.

In this part, all I'm doing is setting up an OpenAI client, and then I'm going to use my OpenAI key here. We got a fresh key this week; the key from last week has been deprecated. I'm using the latest and greatest, GPT-4o. You guys know what the "o" stands for? Omni, there you go. You guys know why, in the demos, ChatGPT-4o sounds
so flirtatious? "Her." Scarlett Johansson, yeah. There are so many memes about that on Twitter. Okay, so this is just a utility function to display the extracted content in another table, because a table in the console is what we need to compare these things.

What I'm doing here is running GPT-4o over each of the inputs that we got before, the entire input, so it could be 50,000 tokens. It's not my money, it's Invest Ottawa's money. So I'm just running it through GPT-4o right now.

Okay, so this is a very simple entity extraction task using GPT-4o. All I did was put in a kind of LLM chain and say: get me the three pricing tiers from this website's scraped content. I give it one website at a time, and it returns a JSON with three keys. "cheapest" is the cheapest tier, with its name and then the price, and I give each one a type so it knows what I'm actually looking for. Then there's the middle tier, and the most expensive one. That's it; that's my extraction prompt. What OpenAI tells you is that if you want JSON back, you should always use type json_object, but also say in your prompt that you want JSON back. You always need those two things.

It's not really surprising that from Beautiful Soup we get nothing. We can't extract anything here; our JSON is completely zeros and empty strings because we got a 403 Forbidden. We got Forbidden from Adobe, Articulate 360. Firecrawl seems to have been able to get the personal plan at 0199, the teams plan at 399, and then Reach 360 Pro, which is like their enterprise tier, and there the price is "contact sales for pricing". This is weird, because I gave it a type, which was a float, and it gave me a string here. It's still useful, but not technically accurate. Then it comes to Jina AI, where the price says "variable". So they're kind of ignoring my instructions a little bit. Maybe I should have said, you know,
float, or null if it's not found, or something like that. It just goes to show that if you're working with large language models, unless you're working with DSPy or something like that, you need to be very specific about your prompts and take care of most edge cases.

So again, it seems like most of these tools pass the test. It's just a matter of whether you want to burn 10 times the amount of money and do the scraping manually using Beautiful Soup and your custom tools, or use one of these third-party tools, get clean Markdown or clean human-readable text back, and just worry about your LLM stack instead of your web scraping.

One last thing I want to show you guys is ScrapeGraphAI. This is completely open source, not like the startups that we just saw earlier, and it's in Python. Anybody here familiar with graph data structures? I think we had a conversation about this before. Seems like you're a big fan of graph structures too. Okay, so: OpenAI API key, and then a link to scrape. Does anyone have a link that they want to scrape? "Which model are you going to use?"

Okay, so this is what the website looks like. Oh, pretty cool. What would be a question that you want to ask? What I meant to say is, what do you want to get out of this website? "Model and speed, perhaps." What are these called? "Electric unicycles." Okay, EUC. "What color is the fastest model of electric unicycle?" "Model and weight." Model name and speed, okay. Initially I thought that you could just ask questions, but I guess the prompt here is what you want to get or scrape out of the website, not what you want to know about the website.

Okay, so we got an array of answers back. How do I get rid of this? Okay, so, electric unicycle models. You tell me if these models are legit. "Yeah, the names are correct." And the speed is... which one's yours? It's the Begode Master,
this one, the Master. 50-plus miles per hour? "That's true, on a full battery." Okay, are you serious? "You know, I drove at like 70, 72 kilometers per hour, my top speed, and it was scary. You see everything flying by you, and you think, if I fall and hit something, my body protection is not going to help." Wow, that's pretty... yeah, Jesus. So this is accurate? "That's accurate. V13, yeah."

You know what, I actually went on their project and asked, because I was trying to build this and I couldn't figure out how many tokens it was consuming. Probably a lot. So I asked, and apparently there's a function that you can use to get that. There are two ways. If you use this library, just go to the Discussions tab, look at my question, and you'll find the answer. I asked this last night, like, why are you trying to hide this away from us or something? No, there are two ways to do it, but the documentation doesn't cover it. But yeah, that's all I've got. Anybody have any questions about anything?
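The input-cost comparison described in the captions can be sketched roughly as follows. This is a minimal sketch, not the notebook from the talk: the function names are hypothetical, the price of 5 USD per million input tokens is only an illustrative assumption (check current pricing), and `cl100k_base` is used as a stand-in encoding, in the same guesstimate spirit the speaker describes. The only outside facts relied on are the `https://r.jina.ai/` Reader prefix and tiktoken's `get_encoding` / `encode` calls.

```python
from typing import Callable, Dict

# Jina Reader works by putting this prefix in front of the page URL.
JINA_READER_PREFIX = "https://r.jina.ai/"


def jina_reader_url(url: str) -> str:
    """Return the Reader API address for a target page (hypothetical helper)."""
    return JINA_READER_PREFIX + url


def tiktoken_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Build a token counter backed by tiktoken.

    The import is done lazily so the rest of the sketch works without the
    package installed. The encoding name is an assumption; pick the one
    that matches the model being priced.
    """
    import tiktoken

    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() keeps scraped text containing strings such as
    # "<|endoftext|>" from raising an error; they are encoded as plain text.
    return lambda text: len(enc.encode(text, disallowed_special=()))


def input_cost_usd(
    text: str,
    count_tokens: Callable[[str], int],
    usd_per_million_tokens: float = 5.0,  # assumed price, for illustration only
) -> float:
    """Estimate what it costs to send `text` as model input."""
    return count_tokens(text) * usd_per_million_tokens / 1_000_000


def compare_costs(
    scraped: Dict[str, str],
    count_tokens: Callable[[str], int],
    usd_per_million_tokens: float = 5.0,
) -> Dict[str, float]:
    """Map each scraper name to the estimated input cost of its output."""
    return {
        tool: input_cost_usd(text, count_tokens, usd_per_million_tokens)
        for tool, text in scraped.items()
    }
```

With tiktoken installed, something like `compare_costs({"beautifulsoup": html, "firecrawl": markdown, "jina": reader_text}, tiktoken_counter())` would give one estimated dollar figure per tool, where the three strings are whatever each scraper returned for the same page. Passing the counter in as a parameter keeps the price arithmetic separate from the choice of tokenizer.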
Info
Channel: LLMs for Devs
Views: 39,177
Id: QxHE4af5BQE
Length: 20min 22sec (1222 seconds)
Published: Fri May 17 2024