PRAW - Using Python to Scrape Reddit Data!

Video Statistics and Information

Captions
Hello and welcome back to Bits and Bytes. Today I'm going to show you how you can use Python to scrape data from Reddit. The question we'll be answering is: what were the top 10 most popular questions asked in the data science subreddit in the past year? Now, you might be asking yourself, "Why do I care? Why do I need to know how to scrape data from Reddit? I can just open Reddit in my app and see things." Well, I'm glad you asked, because it's actually really interesting. Reddit has been in the news a lot recently, especially because of everything that happened with WallStreetBets and the GameStop stock, and it's amazing what happened when that community came together with a shared goal and all acted towards it; they accomplished a lot. But how does this tie into scraping data? Well, think about the stock market and predicting stock prices. There are companies that get paid a lot of money to try to forecast what the stock market is going to do, and with traditional forecasting methods you typically use a model that, in one way or another, incorporates past performance to predict future performance. If the stock price was four dollars yesterday and three dollars the day before, the price tomorrow is probably going to be between four and five dollars: it's trending upwards, and we're using that past behavior to predict future behavior. The problem is that these models don't account for the human element: what are people going to do? If people don't buy or sell stocks, the prices aren't going to change. That's an oversimplification, but in essence, how do you control for that human element? What the big consulting companies do is go out to Twitter and Reddit and scrape massive amounts of this
text data, put it together, and feed it into an NLP (natural language processing) model. With that model they're able to identify human sentiment, and once you know what the human sentiment is, you can use it to control for the human element in these predictive and forecasting models. There are political campaigns that do this too; if you watched The Great Hack on Netflix, they talked about it a little there. It is very legitimate data science work, it does happen, and it sounds like this big, complicated thing, but it's really not. I'm going to show you that even with a beginner Python skill set, you have the ability and the skill to go out and scrape that data. Maybe in a future video we'll use some of that data to build a sentiment analysis model and see what we can do with it, but in this video we're going to get started with the Python Reddit API, I'll show you how to scrape that data, and we'll answer the question we talked about at the beginning. So again, don't forget to like and subscribe if you haven't yet. Also, before we get started, I just want to mention that the Google Colab notebook is available to download; it's linked in the description below. If you click on that link you can download the notebook to your computer, save it in your Drive, and follow along as we go through these exercises; you can also keep it if it's code you want to be able to reference in the future. All right, so what are we going to be doing today? We're going to use the Reddit API to scrape data from the data science subreddit. Once we have that data, we'll do some basic manipulation and analysis on it so we can find out what the top 10 questions asked in the data science subreddit in the past year were. First of all, it's helpful to know what an API is. API is short for
application programming interface, and it essentially allows you to use a language like Python or JavaScript to write some basic code that interfaces with an application like Reddit or Twitter, for example. In this example, the way we're going to interface with the application is by using the API to scrape data, but you can use the API for so much more: you can send upvotes, you can send Reddit coins, you can create new topics, you can comment on topics; you can basically use the API to fully interface with the application, like you would from the actual Reddit app or website. So it's pretty cool stuff. I've linked the PRAW documentation here, and taking a quick look at it, there's a quick start section that lists some prerequisites: you need to know a little bit of Python to use PRAW; you need some basic Reddit knowledge (you have to know what subreddits and topics are, and things like that, which makes sense); you need a Reddit account; and you need some authentication tokens. I'll show you how to get those authentication tokens set up, and then we'll go through some examples and I'll explain how the code works for interfacing with the API. The really cool thing about this documentation is that it's really thorough, and after I go through my examples with you, I'm very confident you'll be able to return to it and do so much more; you'll be able to do pretty much anything listed as functionality in this documentation once you have an understanding of how to write code against this API. So we're going to jump right in, starting with installing the API. Let's go ahead and install PRAW first of all. This first line will install it in Google Colab, and then we have additional options for
installing in other environments. Once it's installed, we're going to import it; I've actually already installed it, so we're good to go there. Now we need credentials to log in. Click on this link (it's in the notebook), then click "are you a developer? create an app". Whatever you call the app doesn't matter; I've gone through this once before, so it has some information saved here. The only field that does matter is the redirect URI: you want it to be http://localhost:8080. Click "create app" and it will give you your credentials. I'm going to leave this up on screen; I don't think there's any risk in the public having access to it, but regardless, after this video I'm going to delete the app, so you won't be able to do anything malicious with it, if there's any risk of that at all. All right, once we have our credentials, we want to log into the app. I've provided sample code here, and it's basically the same code they provide in the documentation, so you can copy and paste it in. Your client ID goes here; going back to the app page, it's this value right here, this little guy. Your client secret is this token (I don't know why they don't just call it a password; maybe some programming convention I'm not privy to), and it goes right here. Then your user agent is the app name, which we'll paste here. Run that code, and now we're logged in. All right, let's go ahead and check out some "hot" submissions in a given subreddit. "Hot" is standard Reddit terminology; it's kind of like "trending". There's also controversial, new, rising, and top: these are the different ways you can filter through different types of topics or submissions on
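The setup described above can be sketched like this, assuming the standard PRAW constructor. The credential strings here are placeholders, not real values; you'd substitute the client ID, secret, and app name from your own app page:

```python
# Install PRAW first (in Colab: !pip install praw).
import praw

# Placeholder credentials -- replace with the values shown on
# https://www.reddit.com/prefs/apps after creating your app
# with redirect URI http://localhost:8080.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="my-reddit-scraper",
)
```

The `user_agent` can be any descriptive string; Reddit just asks that it identify your script.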
Reddit. We're going to start with hot, and there are a couple of different ways to do this; I'll give you a couple of examples of how you can write this code so that it works. Basically, we point to the datascience subreddit, call hot, limit it to 10 results, and assign that to the object subs (as in submissions). Then we loop over it: for submission in subs, print the submission title. We haven't talked a lot about loops yet, but the loop variable name here doesn't matter; it's just a placeholder for the item you're iterating over. A lot of people will write "for i in subs: print(i.title)", or they'll spell the name out; that's pretty common too. The point is that what you call the loop variable doesn't matter; what matters is the object you're iterating over and the output inside the loop. And then you can see it prints out the top 10 hot submissions in the data science subreddit: we've got a pinned thread here, the official 2020 year-end salary thread, a weekly thread (probably another pinned one), "why did Python become the language of choice", "best DataCamp courses", and there you go. All right, next we're going to look at the top submissions in the past year, and I told you I'd give you a couple of different ways to write this code so you can choose whichever makes the most sense to you, whichever is most interpretable for you. I'll expand this accordion. This code is very similar to the code above, just written two different ways; instead of calling the hot submissions we're going to call the top submissions, and instead of
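The loop pattern described above looks like this. Since hitting the live API needs credentials, this sketch uses stand-in objects; with a real session, the list would instead be `reddit.subreddit("datascience").hot(limit=10)`:

```python
from types import SimpleNamespace

# Stand-ins for PRAW Submission objects; with a live session you'd use:
#   subs = reddit.subreddit("datascience").hot(limit=10)
subs = [
    SimpleNamespace(title="Official 2020 year-end salary thread"),
    SimpleNamespace(title="Why did Python become the language of choice?"),
]

# The loop variable name is arbitrary -- "submission", "i", anything works.
for submission in subs:
    print(submission.title)
```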
limiting to the top 10 results, we're going to look at a time frame of the past year. If I remember the documentation correctly, this can be the past month, the past day, I think even the past hour, but we're just going to look back over the past year, because that's what we're filtering down to: eventually, the top 10 questions in the data science subreddit in the past year. In option one we're using the same format as before, where first we create an object and then loop over that object. In option two we get the exact same result, but instead of looping over an object we created, we loop through the query directly. The output is a little different here; I'm showing you that instead of just getting the title, which isn't of much value on its own (it's interesting, we can see the titles), we can also get the number of comments, which we'll consider engagement, and the score, which is your upvotes; that's what Reddit calls it. If we run this, there we go: we can see a lot of results, automatically sorted top-down. The number one result has the title "data science", with 75 comments and 3,000 upvotes; my guess is that's probably a meme, I'm not sure. The next one, "shout out to all the mediocre data scientists out there", has 260 comments and 3,000 upvotes, so a lot more engagement; I guess there are a lot of people who resonate with being a mediocre data scientist and wanted to talk about it. "It's never too early": my guess is that's some kind of motivational story about an older guy or gal who got into data science; it got 59 comments and 3,000 upvotes, so people really
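The two looping styles compared above can be sketched like this, again with stand-in data; with PRAW, the iterable would be `reddit.subreddit("datascience").top(time_filter="year")`, and `time_filter` also accepts values like "month", "day", and "hour":

```python
from types import SimpleNamespace

# Stand-in for reddit.subreddit("datascience").top(time_filter="year")
def top(time_filter="year"):
    return [
        SimpleNamespace(title="data science", num_comments=75, score=3000),
        SimpleNamespace(title="shout out to all the mediocre data scientists out there",
                        num_comments=260, score=3000),
    ]

# Option 1: assign the query to an object, then loop over it.
subs = top(time_filter="year")
for submission in subs:
    print(submission.title, submission.num_comments, submission.score)

# Option 2: loop over the query directly -- same result.
for submission in top(time_filter="year"):
    print(submission.title, submission.num_comments, submission.score)
```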
appreciated that story, I guess, and it goes on. So we can compare these results to those results: slightly different code, but it accomplishes the same thing, and you can choose whichever option works best for you, whichever is most interpretable, whether you create the object and loop over it or just loop over the query directly. All right, we're going to suppress that output to get it out of the way. Up to this point, all we've done is print the results, and we can't really do anything with that data except view it in our output, so we want to get it into a format that can be manipulated, and a pandas DataFrame is a very manipulable format (if that's even a word). We're going to import pandas as pd; this is standard nomenclature in the industry, basically a nickname for pandas, so that any time you call pandas you can just type pd instead of typing it out. I'm actually going to comment this line out because I want to show you what it does; I'll circle back to it. These are all individual lines of code; they don't run together until we get into our loop here. We're creating an object called subreddit, which is the datascience subreddit, and then we're creating a loop: for post in subreddit.hot with a limit of 1,000, so we're going to get up to 1,000 results, and we're building a DataFrame with the post title, the score, the link, the number of comments, and the message body. We'll go ahead and run that, and we get this warning. I think the warning is because I'm running this in Colab; if you're running it in Colab, you'll get these warnings as well. If you run this code in
a Jupyter notebook on your local machine, you probably won't get the warnings, but unless you set up Async PRAW, which I haven't, you're going to see that warning. Nothing to worry about; though if you do this a lot, you'd want to do it the right way. All right, this is our result: 507 rows by five columns, with the post title, score, link, comments, and message body. Depending on what you're doing with your data and your analysis, this is helpful, but only so helpful, because we can't see everything that's relevant here. In many instances, if you're building a predictive model, for example, you're not so concerned with seeing the individual values; you'll do some exploratory data analysis and create plots that visualize what's happening in your data, and you'll often have way too much data to look through each record line by line. So it doesn't matter that pandas shrinks things down and cuts them off; the data is there, it's just not displayed. For this exercise, though, and at least while I'm in the exploratory data analysis phase of a project, I do like to see what's in the columns, within each feature, because it helps me initially understand what's happening in the data, and for this example it'll be helpful to see it all too. That's what this line of code up here does: we're setting the pandas option for max column width to None, removing the preset parameter in pandas that gives us a fixed column width, so it will display everything in every column of the DataFrame. This isn't something you'd necessarily
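A sketch of the DataFrame-building step described above, using a couple of hard-coded posts in place of the live `subreddit.hot(limit=1000)` iterator; the column names here are my own choices and may differ from the notebook's:

```python
import pandas as pd
from types import SimpleNamespace

# Stand-ins for the posts returned by reddit.subreddit("datascience").hot(limit=1000)
posts_iter = [
    SimpleNamespace(title="Best DataCamp courses?", score=120,
                    url="https://reddit.com/abc", num_comments=45,
                    selftext="Which courses are worth it?"),
    SimpleNamespace(title="Weekly thread", score=10,
                    url="https://reddit.com/def", num_comments=3,
                    selftext=""),
]

# Collect one row per post, then build the DataFrame in one shot.
rows = []
for post in posts_iter:
    rows.append([post.title, post.score, post.url,
                 post.num_comments, post.selftext])

df = pd.DataFrame(rows, columns=["title", "score", "url", "comments", "body"])

# Remove the fixed column-width limit so full titles, links, and bodies display.
pd.set_option("display.max_colwidth", None)
print(df)
```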
want to do with a massive data set, because the output can blow up and get big really fast, but here we go: now we can actually see the title, the full link, and the message body, and it's nice; you can read through it. Again, it really depends on what you're doing and whether you need this or not, but what we're going to be doing is finding the top 10 questions in the past year, so it'll be helpful to see whether what we're looking at is an actual question. All right, that's that; moving on to the next section: finding all of the question topics. We're going to leverage regular expressions for this, and we haven't really used regular expressions yet, but I'll give you a brief overview: in this instance, it basically lets us identify all of the topics that end with a question mark. We're saying: create DataFrame df1 from our original DataFrame where the title field ends with a question mark, keeping those rows and assigning them to the object df1. Is this perfect? No. Why not? Because if there are topics that are in fact questions but the user didn't end them with a question mark, this won't pick them up. For what we're doing here, that's not a big deal, but if you wanted to be super accurate, you could do some more sentence-structure analysis and really figure out all the questions that exist in our data set, even the ones that aren't labeled with a question mark at the end. Anyway, that was a long, drawn-out explanation, but when we run it, you'll see what happens: now we have all questions, and we still have their score, the URL, the number of comments, and the
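The question-mark filter described above can be written with pandas string methods. A minimal sketch on toy data; `str.endswith("?")` is used here, though a regex via `str.contains(r"\?$")` would behave the same way:

```python
import pandas as pd

# Toy data standing in for the scraped posts.
df = pd.DataFrame({
    "title": ["Best DataCamp courses?", "Weekly thread",
              "How hard is data science actually?"],
    "score": [120, 10, 500],
})

# Keep only the rows whose title ends with a question mark.
df1 = df[df["title"].str.endswith("?")]
print(df1)
```

As noted in the video, this misses questions the poster didn't punctuate with a "?"; it's a pragmatic heuristic, not a full question detector.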
message body. This is a good starting point. Now we want to get the top 10 questions, and how you determine what the top 10 questions are is up to you. I included the score and the number of comments because I think those are the two most important factors for deciding what the top questions are: you can go by the number of comments, which is engagement (how many people were interested enough in the question that it sparked a conversation), or you can go by score, which is just upvotes (how many people thought the question had some value, so they gave it a quick upvote). I think we ended up going with score, yep. We're using the sort_values command from pandas, and a nice thing is that we get this little help dialog that pops up when you highlight things in Colab. Either way: take DataFrame df1 and sort values by the score variable. By default this sorts ascending, so it would give us the low-scored items first, a bunch of zeros at the top. We want it descending, so we just pass the option ascending=False, and then we take df1, output the title, score, comments, and message body, and call head to request the top 10.
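The sort-and-take-ten step above, sketched on toy data:

```python
import pandas as pd

# Toy question data standing in for df1.
df1 = pd.DataFrame({
    "title": ["Q1?", "Q2?", "Q3?"],
    "score": [514, 577, 571],
    "comments": [344, 141, 96],
})

# sort_values is ascending by default; pass ascending=False
# so the highest-scored questions come first, then take the top 10.
top10 = (df1.sort_values(by="score", ascending=False)
            [["title", "score", "comments"]]
            .head(10))
print(top10)
```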
So let's run that, and there we go: we have the top 10 questions from our data science subreddit in the past year. Let's take a look at a couple of them. "How many of you are hybrids of data analyst, data scientist, and data engineer?" got 577 upvotes and 141 comments; I guess a lot of people resonate with that. I think it's pretty accurate: in many cases you do wear many different hats. In a larger corporation you're going to be more specialized, more siloed; the analyst does more raw analytical work, the data scientist is often focused on the predictive modeling or forecasting side of things, and the data engineer ensures the data is moving through pipelines the way it should. But that's a larger corporation; if you're working for a startup or a smaller business, you might just be a data scientist who does all of those things. It just depends. "Would anyone be interested in a soft data science series?" got 571 upvotes and 96 comments, so I guess there's a lot of interest among people in the data science world in improving their soft skills. There's definitely value in that: you need to be able to communicate the results of your analysis and research upwards, downwards, left, and right within an organization, and having those soft skills can prove invaluable. "Does anyone get annoyed when people say AI will take over the world?" got 514 upvotes and 344 comments; I think that's the best engagement yet. Yeah, 344 comments, is that the most? Yes, it is. I guess a lot of people feel annoyed when people say AI will take over the world. The poster says: "I don't know, maybe it's just me. A lot of friends that are not in data science say AI is bad, job destruction; they don't know what machine learning is, they always say AI this, AI that. I don't know, thought I'd see if anyone else feels the same." Yeah, I don't know, I think it's
funny; I honestly try not to engage in conversations about analytics with people who aren't familiar with analytics, just because you end up having conversations like this. "How hard data science actually is. How hard is data science actually. How actually hard is data science." Well, apparently grammar isn't a requirement for engagement and popularity on Reddit, and I'd go as far as to say that in the data science subreddit, even more so, we're not as concerned with your grammar as we are with the accuracy and predictive power of your models. Interesting. "How much of data science is lying? I just saw my old company post a seminar they held (I won't name and shame), and it was a project I witnessed and gave input on. The head of the project never validated a model, large biases were made, and they used k-means clustering with binary data. Maybe this worked, and I don't know the true results, but this is a grossly incompetent error in data science. Is there more of this? Because this is scary. Is data science becoming just a nice wrapper on intuitive insights that a domain expert could guess?" This guy's fuming; don't get in his way, he's a bulldozer, a wrecking ball, he's going to take some people down with him. Anyway, that basically sums up what we wanted to cover today. You can do a lot with this, and even as I was going through this exercise, I was thinking of a really interesting analysis I might do for a future video. Within the PRAW documentation there's actually a tutorial on comment extraction and parsing, and I think it would be really interesting to take the 2020 year-end salary thread, which has 237 comments (I'd expect that to be higher, actually; it's a little disappointing), and see if we can extract salary data from those comments and generate an interesting analysis
based on the salary data from those comments. Anyway, there's a lot you can do with it, and it's honestly a lot of fun; a little more interesting, if you ask me, than data types, which we'll be circling back to in next week's video. I just had to get something out there that was maybe a little more interesting for you and for me; talking about data types was putting me to sleep, and probably you as well, even though it is necessary. I hope you enjoyed the content; don't forget to like and subscribe, thanks again for watching, and we'll see you next time. So now you know how to scrape data from Reddit using Python, and I hope you realized, as we went through that exercise, that it's not as difficult as it might seem to get that data into a workable format in Python. There's a lot you can do with it; it's really interesting stuff, especially with how NLP has advanced over the last few years. If you enjoyed the video, go ahead and like and subscribe, and if there's any future content you'd like to see, leave it in the comments below. Again, this is Bits and Bytes, where we provide you with bits of analytics content in bite-sized chunks.
Info
Channel: BitsInBytes
Views: 699
Rating: 4.8461537 out of 5
Keywords: Reddit, python, text mining, web scraping, wall street bets, GameStop, Predicting stocks, PRAW, Wallstreetbets
Id: Y7BSe7EiBTs
Length: 28min 30sec (1710 seconds)
Published: Mon Feb 08 2021