Reddit Image Scraper using Python. Get the data you need for any project!

Video Statistics and Information

Captions
What is up, Clarity Coders! Today I'm going to show you how to use my Reddit image scraper. I love scraping images from Reddit when I need images for a project, because they're already classified. You can go to a subreddit like aww and get cute animals, or you can go to a subreddit like OldSchoolCool and get a ton of old school cool pictures. This sets up really nicely for AI projects and things like that, and I use it exclusively in my next video, coming out in a week or so. If you're not subscribed, make sure you subscribe so you don't miss out on that upcoming Reddit bot. So today I'm going to walk you through the program that I created to scrape images from Reddit. Let's jump right in. [Music]

First thing: if you haven't already, you're going to need a Reddit account. I've shown this in videos before, so I'm going to go quickly here. You're just going to sign up for a Reddit account and go through that process. Once you're logged in, you're going to go to this URL (I'll have it in the description below) and click this button: "are you a developer? create an app". You're going to name your app; you can name it anything you want. For the redirect URI you're going to use http://localhost, and press "create app". Once you do this, you'll be redirected to a page that looks like this. Don't use these numbers, as I've already disabled this account, but you're going to grab from here the client ID and the client secret. Once you have those, you're ready to start.

I have all the code formatted for you in this nice little GitHub package here, so I'm going to make some assumptions. I'm going to assume you have Python installed on your computer; if you don't, go ahead and install that. I'm also going to assume that you have pip installed with it and can download these packages. I'm going to be using Git Bash, which you can grab from here, or use whatever you would like to download this code base. On the GitHub page itself (the URL, remember, is in the description of this YouTube video) I'm going to hit this Code button and then copy. I'm just doing this on my desktop as an example. I'm going to open a Git Bash shell here, and then I can simply do git clone and paste in that URL that we copied. Once we do that, it's going to download the entire code base onto our desktop.

Now that we've got it on our desktop, we can go ahead and cd into that directory. So I'm going to cd into the Reddit image scraper directory, and from here we can go ahead and run our program. If we run it right now, and we run our Python sub download, you'll see that we're missing modules. When you clone a project like this, you don't necessarily have all the modules. Well, I have a nice little requirements.txt file, so you can download all the requirements in one swoop. We're going to do pip install -r requirements.txt while we're inside this directory, and that's going to get us all the requirements we need.

Once all the requirements finish downloading, you're ready to run this program. However, you probably don't want to download the same subreddits I did, so let's go ahead and open this folder in some type of coding environment. I'm going to open mine in Visual Studio Code. As you can see, we have our project opened in Visual Studio Code, and I'm going to walk you through a few files here. The most important is your sub_list. These are the subs that it grabs images from. In my case I'm looking at the subreddits aww and food, and you can add whatever other subreddits you want in here, any subreddit. We also have a couple of ignore images in this ignore images folder. If you look at these, they are the standard images that show up if someone deletes an image from Imgur or wherever, so that's just to help you clean your data.

If we go ahead and open up sub download, this is the heart of the program itself. The first setting is how many posts you want to search in each subreddit. Right now I'm searching five posts in each subreddit. That doesn't mean you'll get five images back, because some posts aren't image posts, so let's up that to, say, 20. You can see here it's going to create an images directory if one doesn't exist. The important part here is this token.pickle part. If a token.pickle doesn't exist, it's going to ask you for all your credentials when you first run this program. If you happen to make a mistake when you enter your credentials, you can delete the token.pickle file and it will ask you for your credentials again. In this case we're just downloading images, so you don't need to enter a username or password, but I left it in here because you may want to do something with that in the future. And that's pretty much it. It goes through each subreddit that we listed in our CSV. In my case I'm looking for new; you can look for top or hot, whatever you want here. And we're limiting that search amount to whatever we set above.

Now we're going to go ahead and run this. You'll see that because I don't have the pickle file yet, it's going to ask me for all my credentials. The first thing we're asked for is our client ID, which was this number here, then our client secret, which was here, and our user agent. For the user agent, I think pretty much anything will work here. They do have some suggestions on Reddit of what you should put in here, but I'm just going to say "bot version one", something like that. Our username doesn't really matter, because we're not going to be making posts, so we don't really have to authenticate our account. I'm going to leave it blank for right now. If you want to make posts, you'd want to fill this out, and the same with this password: if you wanted to make posts, you'd want to fill out your password as well.

Now you can see it's off to work, so it's going to start downloading images from each subreddit. You'll notice as this process goes, it's creating files in this images folder. Each file name has the subreddit name and the post ID. I did this so we don't get duplicates: if you wanted to run this every 10 minutes or something like that, you wouldn't get duplicate posts. If you look at any image, you of course have the nice cute image there for you, or whatever you're downloading for your purposes.

You'll also see in here that you could do some transforms. Right now I do a transform to compare the image to our ignore images: I resize it to 224 by 224. But you could do other things to the images that you're actually saving. I don't save that resized version, so you'll notice the saved file is still in whatever format the picture was in. But you could grayscale all your images, or whatever you're trying to do, if you're doing some sort of image classification.

And that's pretty much it. You can do this with any subreddit in just a couple of minutes. If you have any questions, jump in the Discord and ask away; we always have people online working on different projects and working together. If you haven't, please subscribe and hit that bell so you don't miss my next project. It's going to be an awesome Reddit bot that's running 24/7 and lives on Heroku. And as always, keep coding.
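The workflow described in the captions (cache credentials in token.pickle, walk each subreddit's new posts, keep only image posts, and name files by subreddit and post ID so reruns skip duplicates) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code from the video's repository: the helper names, the dictionary layout stored in the pickle, the `<subreddit>-<post id>` file-name pattern, the extension check, and the use of `requests` are all illustrative guesses. Only `praw.Reddit`, `subreddit(...).new(limit=...)`, and the `url` and `id` attributes of a post are taken from the PRAW library itself.

```python
import os
import pickle
from pathlib import Path

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set of image types
TOKEN_FILE = "token.pickle"                   # file name mentioned in the video
POST_SEARCH_AMOUNT = 20                       # posts to inspect per subreddit


def load_credentials(token_file=TOKEN_FILE, prompt=input):
    """Return cached credentials, prompting once if no cache file exists."""
    if os.path.exists(token_file):
        with open(token_file, "rb") as fh:
            return pickle.load(fh)
    # Hypothetical layout: a plain dict of the three values read-only access needs.
    creds = {
        "client_id": prompt("client_id: "),
        "client_secret": prompt("client_secret: "),
        "user_agent": prompt("user_agent: "),
    }
    with open(token_file, "wb") as fh:
        pickle.dump(creds, fh)
    return creds


def is_image_url(url):
    """True when the URL (ignoring any query string) ends in an image extension."""
    return url.lower().split("?")[0].endswith(IMAGE_EXTENSIONS)


def image_path(out_dir, subreddit, post_id, url):
    """Build '<out_dir>/<subreddit>-<post_id><ext>' (assumed naming pattern)."""
    ext = os.path.splitext(url.split("?")[0])[1].lower()
    return Path(out_dir) / f"{subreddit}-{post_id}{ext}"


def download_subreddit(reddit, subreddit, out_dir="images", limit=POST_SEARCH_AMOUNT):
    """Save image posts from a subreddit's 'new' listing; return the paths written."""
    import requests  # third-party; imported lazily so the helpers work without it

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for post in reddit.subreddit(subreddit).new(limit=limit):
        if not is_image_url(post.url):
            continue  # not every post is an image post
        target = image_path(out_dir, subreddit, post.id, post.url)
        if target.exists():
            continue  # already downloaded on an earlier run
        response = requests.get(post.url, timeout=10)
        if response.ok:
            target.write_bytes(response.content)
            saved.append(target)
    return saved


def main():
    import praw  # third-party Reddit API wrapper

    # No username or password is passed, so PRAW runs in read-only mode.
    reddit = praw.Reddit(**load_credentials())
    for name in ["aww", "food"]:  # the two subreddits named in the video
        print(name, len(download_subreddit(reddit, name)))
```

Calling `main()` would prompt for the client ID, client secret, and user agent on the first run and reuse token.pickle afterwards; deleting that file forces the prompts again, as the captions describe. Swapping `.new(...)` for `.top(...)` or `.hot(...)` changes which listing is searched.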
Info
Channel: ClarityCoders
Views: 4,356
Rating: 4.9597988 out of 5
Keywords: image scraper, image scraper python
Id: sEIv8UcR3Go
Length: 7min 7sec (427 seconds)
Published: Fri Oct 23 2020