How to Rotate Proxies with Python

Video Statistics and Information

Captions
One of the main problems we face when web scraping is getting blocked from the website we are trying to scrape for sending too many requests. That happens because all of our requests come from the one IP address of our main computer; the site sees too many requests from that IP address and temporarily stops you accessing it. One way around that is to spread the requests out over multiple IP addresses, and that is basically what rotating through proxies is. In this video I'm going to show you how that works in principle and explain a few upsides and downsides, and stick around, because I also want to talk about free proxies and why they are not actually any use to us at all. So hi everyone, welcome, my name is John, and let's get right into it.

The crux of it is that when we send a request to a server with requests in Python, we want to be able to send it through a proxy. The easiest way to see this is to import requests and set up r = requests.get() against a website called httpbin. It has an /ip endpoint that sends back a JSON response telling you which IP address you connected with, so we can request httpbin.org/ip. If I run this and print r.status_code, we get a 200 response, which means we connected fine. I'm not going to show you the JSON response, because that's my own IP.

If we want to use a proxy, we need to go and get one. If you google "free proxy list" you might come up with a site like this one. I'll refresh the page and find an entry where the HTTPS column says yes, copy the address, and note the port, 3128. Then we set proxy equal to that address and port, and after our URL we pass proxies= as a dictionary with both "http" and "https" keys pointing at the proxy. I'm also going to add a timeout of three seconds, because a lot of these fail. Let's run that and see if this one works: we get connection refused, which tells me it does not work, so let's try a different HTTPS one, this one on port 8080.
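As a rough sketch, the script at this point in the video looks something like the following. The proxy address here is only a placeholder (a dead or blocked proxy will raise a ProxyError or time out rather than return a response):

```python
import requests

# Placeholder proxy copied from a free proxy list (not a real, working proxy)
proxy = "203.0.113.45:3128"

# Route both plain HTTP and HTTPS requests through the same proxy
proxies = {
    "http": f"http://{proxy}",
    "https": f"http://{proxy}",
}

# httpbin.org/ip echoes back the IP the request arrived from, so we can
# confirm the request really went out through the proxy and not our own IP
r = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=3)
print(r.status_code)  # 200 means the proxy accepted and forwarded the request
print(r.json())       # {"origin": "<proxy IP>"}
```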
Okay, this one worked. So let's print r.json(), the JSON response from the website, and run it again. We can see that the origin we've come from is 164.100.130.128, which is the proxy we specified, so basically that's it. The downside with free proxies is the fact that they are free, widely known and open, and Google blocks them all, so you cannot access Google's sites with them. We just saw this one work, but if I change the URL to https://google.co.uk and run it again, we get the same connection refused error as before. In fact we get a JSON error, because we tried to print out something that doesn't exist. So I'm going to put the request into a try/except: if it fails we just print "failed" and pass, and if it works we print the JSON response instead. Let's try it again: after our timeout of about three seconds we get "failed". I'll change the URL back to httpbin.org/ip and run it again to see if this proxy still works, and there it does, great.

This is all very well, but what if we had a long list of proxies that we know work with Google, or with whatever sites we're trying to scrape? The quickest and easiest way to go through them, if you had them in a CSV file, would be to import them with the csv module, so I'll show you how I would do that. I've got a long list of proxies, about four or five hundred, saved as a CSV file on my computer, which I downloaded from various free proxy sites; there are 458 of them. What I'll do is write a script that loops through all of these and checks whether they work.

I'm going to get rid of the earlier code for now, keep the import, and add import csv. Then we create a blank list to add all of our proxies to, and open the CSV file with a context manager, with open(), which keeps the file open for us and closes it when we no longer need it; mine is proxylist.csv and I'm opening it read-only, as f. reader is then csv.reader(f), and for each row in reader we append it to our proxy list; because each row comes through as a list, we index it and take just the first element. To check this is working, I print the length of the proxy list. The first run fails with "csv_reader is not defined", because that underscore should be a dot, sorry about that. With that fixed we get 458, which is what I said it was.
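A minimal sketch of that loading step, assuming a file named proxylist.csv with one proxy ("ip:port") per row in the first column:

```python
import csv

proxy_list = []

# Read every row of the CSV and keep only the first column (the proxy string)
with open("proxylist.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        proxy_list.append(row[0])

print(len(proxy_list))  # 458 in the video
```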
Now let's write a new function to rotate through every proxy in our big long proxy list. I'm going to call this one extract, and we pass in the proxy that we want to use. It does pretty much what we had before, inside a try block, because we want to make sure we don't fall over when one of the proxies fails: r = requests.get() with the URL typed out again, and proxies set to a dictionary with "http" and "https" keys pointing at the proxy that was passed in. I'm lowering the timeout to two seconds, because there's quite a lot to get through and I think most of them are going to fail. If the request works, we print r.json(), the response from the website, followed by a little dash and "working" so we can see it's all good. If it fails, we just pass, because otherwise we're going to get a lot of text on the console and it will be hard to see which ones work and which don't. After that the function returns the proxy; we'll just return it for now.

Let's test the function. We can use the same proxy that worked before, the one on port 8080: we call extract with it, so it becomes our proxy argument, and we see "working" come out. Fantastic, so we know our function works. Basically we're giving it a proxy and trying the request with that proxy; if it works it prints to the console, and if it doesn't it just skips over and goes to the next one.

If you watched one of my recent videos, I talked a little about concurrent.futures and how we can use it to speed through a long list of requests or items; in that case I used URLs that we could potentially scrape and got one piece of data from each one. This is another great example where we can use it, so I'm going to import concurrent.futures and run through this quickly to check whether any of these proxies work. If you're not quite sure what this is, just follow along for now, and I'll put a link somewhere to the video I did recently which explains it in more detail. We use with concurrent.futures.ThreadPoolExecutor() as executor, which lines it all up, and then executor.map(), giving it our extract function and our proxy list. When I run this it does it all more or less simultaneously, and we should hopefully get some responses back showing that some of the proxies in my CSV file work. At first it doesn't appear that any of them do, but you can see them slowly coming through, and so far there are four that actually work. Of course this just means they work at all, not that they work with Google, which these ones won't.

Okay, that's finished. We went through 458 free proxies, which I downloaded and saved into a CSV, and nine of them actually work, as in you can use them. However, remember that none of these will work with Google, so they're kind of pointless. But that isn't really the point of the exercise I wanted to show you: it was more that if you had a list of proxies that you know work, you could use this method, passing your proxy into your request just like this, to spread out your requests and hopefully not get blocked from whatever website you're trying to scrape.
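Put together, a rough sketch of the whole checker from this part of the video might look like this. The file name, the "ip:port" format of each row, and the scheme prefix added to each proxy are assumptions, and the results of executor.map are simply discarded, as in the video:

```python
import csv
import requests
import concurrent.futures


def extract(proxy):
    """Try one request through the given proxy and print it if it works."""
    try:
        r = requests.get(
            "http://httpbin.org/ip",
            # proxy is assumed to be a bare "ip:port" string from the CSV
            proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
            timeout=2,
        )
        print(r.json(), "- working")
    except Exception:
        # Dead or slow proxies are skipped quietly to keep the console readable
        pass
    return proxy


# Load the proxy list as before
proxy_list = []
with open("proxylist.csv", "r") as f:
    for row in csv.reader(f):
        proxy_list.append(row[0])

# Check every proxy concurrently instead of one at a time
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(extract, proxy_list)
```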
All of the code I've written out today will be on my GitHub; I'll put the links down below. I also have another version where we web scrape the free proxy list site and pull the information from there, so that might give a slightly more up-to-date list and maybe a few more of them will work. So give it a go, follow along, and see what your outcome is. Again, if you have a list of good proxies, you could use this to scrape websites properly and quickly without getting blocked. Hopefully you guys have found this one interesting; it's quite useful and good to know how to do. I'm going to do a follow-up video where I try to find some proxies that actually do work with Google and turn that into a proper exercise. So thank you for watching, guys, and don't forget to like, comment and subscribe. There's lots of web scraping content on my channel already and more to come, more Python stuff to come. Main videos are on Sundays, and some more live streams are coming up next week as well, so make sure you keep your eye open for those and come and join in, and we'll have a chat. Thank you very much and see you next time. Bye.
Info
Channel: John Watson Rooney
Views: 33,232
Rating: 4.9413919 out of 5
Keywords: proxy python requests, proxy python script, rotating proxy python, proxy checker python, python http proxy server, python proxy scraper, proxy with python, python web scraping proxy, python proxy server tutorial, proxy server using python, random proxy python, how to rotate proxies with python
Id: vJwcW2gCCE4
Length: 13min 4sec (784 seconds)
Published: Sun Sep 27 2020