Web Scraping 101: A Million Dollar Project Idea

Video Statistics and Information

Captions
One of the best projects that you can work on, one that has real, legitimate potential to make you a ton of money, has to do with web scraping. The ability to collect real-time data about travel, e-commerce, healthcare, real estate, you name it, is already a multi-billion dollar industry, and you can tap into that with a project like the one that I'm just about to show you.

So first, let's discuss the potential. Imagine you're a drop shipper or an e-commerce seller and you're constantly competing against hundreds, maybe even thousands, of competitors, where your main differentiation is price and maybe stock or availability. Now, if you had access to real-time information and were able to undercut your competitors' prices, or raise your prices when your competitors went low or out of stock, think about how much money that could potentially make you. Seriously, think about this for a second: if you have access to real-time information, this can inform million-dollar business decisions, and if you can provide that data to different clients or people who can actually use it, you can take a cut of those profits.

Now, I won't lie to you, this is not as trivial as I might make it sound. Most companies are going to actively block you from grabbing this type of information, and even if they do provide an API, you're most likely going to deal with rate limiting, out-of-date information, and a whole bunch of other issues that can completely ruin the whole point of even having this system. That's why what you most likely need to do is build a web scraper.

Now, that's exactly what I attempted to do for this video. I set out to build an automated web scraper that would scan e-commerce marketplaces for products of interest for either me or my clients, automatically tracking their prices and alerting me or my clients of any changes in those prices, such that we could react and modify our prices to take advantage of an opportunity. Now, before I did all of this, I did ask my community if they had any advice on how
to go about building a project like this. What they suggested is to use a popular framework like Playwright or Selenium, set up a web scraper, go to the website, scan for a specific product, parse the HTML, grab all of the prices of that product, store that in a database, and then scan the database and see if there have been any changes since the last time we updated that product. They then said I could run the scanner automatically at a set interval, so every day, every hour, every minute, whatever interval I choose, and that would be how we can work on this project.

Now, unfortunately, every time I attempted to do this, my IP address got blocked, I got stuck behind captchas, and I just got completely denied and shut out of the website after just a few attempts at scraping it. It turns out that websites are pretty smart: they can detect bots, and they really don't want you scraping their website and doing exactly what I was attempting to do.

That's when I remembered that about a year ago I had worked with a company called Bright Data. Bright Data essentially unblocks your browser for you; it'll automatically solve captchas and all that kind of stuff. So I reached out to them, we connected, and they agreed to sponsor this video so that I could continue building this project. What Bright Data gave me access to is something called their Scraping Browser, which is a headful browser that works with Puppeteer, Selenium, and Playwright and automatically unblocks websites for you. It will connect to a proxy network, rotate your IP address, and automatically solve captchas, essentially allowing you ungated access to a website to perform web scraping. Most importantly, it also allows you to scale, meaning I could run hundreds of instances of my scraper at the exact same time, and I wasn't limited to a single instance or whatever my local machine could handle. So that's what I ended up using for this project. I'm going to show that to you in a second, but if you do want to
check out Bright Data, they are the sponsor of this video. You can check them out from the link in the description, and you'll get access to some free credits so that you can try this out for yourself and see the power of a browser like this.

All right, so now let me show you the project that I actually ended up building, which is fully functioning and working right now. If you want, you can extend this project. I'm leaving it completely open source, so click the GitHub link in the description and feel free to do whatever you want with this code. You can make something really cool; this is kind of a solid base, and you could extend it and make a legitimate business from it.

Anyways, you can see that what I have here is a product search tool. Really, it's analyzing e-commerce prices, and right now it goes to Amazon and automatically scrapes Amazon every single day for all of the different products that I have set up here. I can either enable or disable tracking of a product, I can add a new product that I want to track, and then I can view the individual prices for these products. So let's click into Ryzen 9 5950X. That's going to generate a table for us down here of all of the information we currently have on this product from amazon.ca. Now, you could scrape multiple sources at once. For right now I've just set it up to do Amazon, but you'll see that the way I wrote my code, you can add other web scrapers and set up different sites.

It shows me all of the relevant listings that contain the Ryzen 9 5950X processor from Amazon, and gives me the current price and any price change since the last time this scraper ran. You can see down here we have kind of a combo of a CPU and a motherboard, and this has gone up 0.76 percent since the last time I ran this. You can see here we have this Micro Center AMD whatever. It gives you all the information: you can view the product URL, the source, the newest price time, and the current price, and then it gives you a graph actually telling
you the whole history of this product from when this web scraper has been running. Again, the way this works is every single day it automatically scrapes all of your tracked products and then gives you any updates on them. Now, you're not always going to have a price change, and it depends on how frequently you run the scraper, but in my case I've seen a few different changes. I was actually pretty surprised to find out that even large sites like Amazon are constantly updating their prices, and you can see quite a few different price changes here. So we'll go into one more to see if there are any changes in something like Dove Men shampoo. I don't think we're going to see any here, just because this is a pretty basic kind of product and I only actually ran this scraper one time, so it makes sense that we wouldn't see a price change there.

Now, if I want, I can even track another product. I can go up here and just search for one one time, or I can add it to my tracked product list to be automatically updated and run every single day. So let's just track a random product here. Let's do something like Intel i9-10900K. I don't know if that's recent or not, but let's scrape this. Okay, we'll give that a second to update and then I'll show you the table that it generates.

All right, so this is finished. I'll click on this, and now it gives me the table and shows me all of the different listings from Amazon that were relevant and contained Intel i9-10900K. I can then run this every single day if I wanted to, and I could add this into my tracked product list if I wanted to automatically run the scraper on it every few days. Then again, I can go here and toggle them on or off if I no longer want to track a specific product. Obviously there's a lot more stuff you can do here, but you can see we have a very solid base set up. So now let me hop over to the code. I'll give you a bit of information on how this code functions so you have some insight into how you could change this if you wanted to work on a project
similar to this.

All right, so I've just hopped over to the code here, and I'm going to give you a very quick overview of the architecture of this project. Again, feel free to look at the code from the link in the description and extend it as you see fit. What I have for this is three or four main components. I have a front end that's written in React, I have a back end that's written in Flask with Python, and then I have the actual scraper, which is a separate process I run from my back end that's written in Python and uses Playwright. I also have an automation script, which is very simple and just automatically sends an API request to my back end every day. Now, you can set this to every hour or every minute, you can use a cron job, you can use all kinds of stuff, but that's how I've set this up. So I'm just going to walk you through the back end very briefly, and then the web scraper, and we'll go from there.

Okay, so in my back end I have my app.py. I haven't organized this a ton. What I do is connect to a local SQLite 3 database. This is what I'm using to store all of my data. Obviously you could use something like Postgres or MongoDB, but this was the simplest for this project. I set up my different database tables here using the SQL connector, or SQLAlchemy, whatever you want to call this, in Flask, and then I have a bunch of different endpoints. /results is actually where the web scraper submits its results, so the API really allows the web scraper to interact with my database and the back end. I send a POST request to /results that submits the results, they get added to the database, and then the front end can view that data. So that's what /results is. Unique search texts: this gives all of the unique searches that we've done, so the product names, essentially. We then have a GET request here for /results, which gives all of the results for a specific product name, which is the search
text. We then have all results, which gives all of the results for all of the different products in our database. We have start scraper, which is a POST request that starts the web scraper based on a URL and a specific search text. We have add tracked product, which is straightforward. We then have a PUT for a tracked product, which allows us to update it, so we can toggle whether we're tracking it or not. Then we have the ability to get all our tracked products, and then to update the tracked products, which automatically runs the web scraper on every single product instance. You can see that what this does is create a new subprocess, where it then runs our Python scraper script, which I'll go into now.

I'm just going to very quickly run through this, but I have this main.py file, and what it does is dynamically allow you to add different websites that you could be scraping. For now I've just set it up with amazon.ca, but you can connect this with really any website you want, so long as you write a very specific piece of logic that gets the elements you're looking for, which I'll show you in one second. What I do here is connect right away to the Bright Data Scraping Browser. It's actually extremely simple; I was surprised how easy it was. All I have to do is have my username, password, and then my credentials, which is what I get access to from my auth.json file, which is inside of here, and then I just use that as the URL in Playwright for the browser I want to connect to. It also works with Puppeteer and Selenium, but it's literally like three or four lines of code and you're automatically connected to this scalable network where you can run multiple instances of the scraper. I won't really run through all of the text here, but you can see we fill the input field, we press the search button, we retrieve the products, we grab all of that, and then eventually we post the results here to our
back end, where they get submitted to the database. The core logic here is connecting to the browser. You can see we're connecting over CDP to this browser URL, generating a new page, going to the URL we want, loading the initial page, etc. We then have amazon.py, which is what's actually responsible for scraping the search page on amazon.ca. It goes in, gives us the image, name, price, and URL, gets access to all of that, and then just returns that for each individual product. Obviously there's a lot more logic going on here, but that's the basics and really everything that you need to know. So that's how this works.

Then I have my React front end; I don't need to walk through that. Lastly, I have this scheduler here, where all it does is hit this URL, update tracked products, once a day. The way this works is I've set up a Windows batch file that I put in my Windows process scheduler, where every day it's automatically going to run this: it spins up the API, then waits 10 seconds, then calls python main.py, which is right here, which sends that POST request, which then goes and retrieves all of the different products, which are spun up in parallel in separate processes, where we are scalably grabbing all of this data and adding it to the API.

Now, just to quickly show you, in case you want to do this for yourself, what you would need to do is go to Bright Data, create a new account (you can do that from the link in the description), and access the Scraping Browser. I'll leave some links in the description, but you can see it gives you some information here. So from Bright Data, I'm going to go to my Proxies and Scraping Infrastructure. I have Scraping Browser right here. I'll click on that, and you can see that if I went to the parameters here, it would give me the hostname, password, etc., which I don't want to share with you, and that's really all you need. Once you activate this and you have access to that information, you put that
inside of the file, so you get your username, password, and host, and then you can just start using this. There's really no more configuration required. If we look at my stats here, you can see I used it a ton when I was actually developing this project, and then recently I've been running it a few more times. I believe you pay per gigabyte. Anyways, it's a super useful tool, and again, you get some free credits if you want to check it out from the description.

So with that said, I think I'm going to wrap it up here. The last thing I will mention is that if you want to extend this project, the next logical thing to do would be to build an alert system that automatically tells you if any of the prices change. I haven't done that, just because I figured that'd be a nice thing that you guys could work on if you want, but what I would do to accomplish this is simply check, every time we submit new results, whether any of the prices have changed on the same product since the last call. If they have, then we just grab all of those products, throw them in some type of email template, and send an email to someone, or you could send a text message, or do whatever you want.

With that said, I'm going to wrap up the video here. I hope you guys found this helpful and that this gave you some inspiration for a great project idea. If it did, leave a like, subscribe to the channel, and I will see you in the next one.
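Editor's sketch: the alert idea described at the end of the captions (compare each new batch of results with the previous one and report any listing whose price moved) could look roughly like the Python below. This is not code from the video's repository. The row layout (dicts with `url`, `name`, and `price` keys) and the names `PriceChange`, `find_price_changes`, and `format_alert` are assumptions made for illustration, and actually sending the email or text message is left out.

```python
from dataclasses import dataclass


@dataclass
class PriceChange:
    """One listing whose price differs between two scrape runs (hypothetical type)."""
    url: str
    name: str
    old_price: float
    new_price: float

    @property
    def percent(self) -> float:
        # Percentage change relative to the old price; guard against a zero old price.
        if self.old_price == 0:
            return float("inf")
        return (self.new_price - self.old_price) / self.old_price * 100


def find_price_changes(previous, latest):
    """Compare two runs, each a list of dicts with 'url', 'name', and 'price' keys."""
    old_by_url = {row["url"]: row for row in previous}
    changes = []
    for row in latest:
        old = old_by_url.get(row["url"])
        # Skip listings that are new in this run or whose price is unchanged.
        if old is None or old["price"] == row["price"]:
            continue
        changes.append(
            PriceChange(row["url"], row["name"], old["price"], row["price"])
        )
    return changes


def format_alert(changes):
    """Render the changes as a plain-text body for an email or text message."""
    lines = ["Price changes since the last scrape:"]
    for c in changes:
        lines.append(
            f"- {c.name}: {c.old_price:.2f} -> {c.new_price:.2f} "
            f"({c.percent:+.2f}%) {c.url}"
        )
    return "\n".join(lines)
```

Matching listings by URL is a design choice of this sketch; a real product page URL may carry tracking parameters, so a cleaned URL or a product ID might be a more stable key. The text returned by `format_alert` could then be handed to whatever delivery channel you prefer.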
Info
Channel: Tech With Tim
Views: 207,018
Keywords: tech with tim, intro to web scraping, web scraping, easy money online, make money online, web scraper, easy money box, easy money, data selling, youtube money, data reselling, easiest way making money, creating bots, data collection, make money, web bots, web development, bots, code, programming, learn to code, web scraping tutorial, web scraping projects, web scraping 101, web scraping python
Id: DJnH0jR8y5Q
Length: 13min 6sec (786 seconds)
Published: Tue Jul 04 2023