Parsing Reddit comments - Python Reddit API Wrapper (PRAW) tutorial p.2

Video Statistics and Information

Captions
What's going on, everybody, and welcome to part two of the Python Reddit API Wrapper, or PRAW, tutorial mini-series. In this tutorial, what we're going to be talking about is at least beginning to parse comments. Like I said at the end of the last video, comments represent a different kind of challenge for a variety of reasons. Mainly it's just the fact that comments aren't, you know, perfectly in order. It's a tree of data, not a linear form of data.

So anyways, I'm going to go ahead and remove subreddit.subscribe(), but the rest of this stuff can remain, and just underneath this let's go ahead and continue. First of all, I want to limit this. There are two stickies, so I'm just going to limit this to three, so we just do one submission for now. Now I'm going to come down here, and we can reference the comments by just saying comments = submission.comments. So this gives us the comments.

Now what we can do is say for comment in comments, and let's print this 20 times, just to give us some separation. Then we're going to print comment, but just like a submission, the comments are these objects, PRAW objects, and the object is just going to show the ID. So then you reference an attribute, and one of the attributes is body, for the body of that comment. So that's our comment, and we can at least iterate through comments that way.

So let's just run that real quick. These are all our comments. Now let me pull up the thread. I guess I closed out of it. [Music] So there are six comments total here, but some of these are replies, like for example "if you're unfamiliar, do yourself a favor and look into pandas." So for example, let me look for this one. Okay, anyway, it's not
here. Okay, so what we have to do is iterate through it. At least, I'm pretty sure it's not there, so these would be just the top levels. I'm pretty sure; I just want to be a hundred percent. Sorry for wasting your time. Anyway, I think I closed it again, because I'm bad at closing things. I'm pretty sure it's not there, so what we need to do is get the replies.

So now we could say for reply... or rather, first, there might not be any replies, so we could say if len(comment.replies) > 0. And again, if you didn't know replies existed, you could have done a dir() on the comment, or you can read the docs. Anyway, if len(comment.replies) is greater than zero, we have some replies, so then: for reply in comment.replies. Then we can print, let's just say, "REPLY:" and then reply.body.

Okay, so here you get a reply: "It's just me." A really great, high-quality reply. And here's another reply. This is that comment I just searched for a second ago, so there we caught that reply about pandas. But then... I think I closed this, let me open it again. Someone complained on one of my videos that I just murder my Enter key. It's true. Okay, there you go. So, "look into pandas," but then there's another comment underneath that, right? So then, at this print reply, we would have to say okay, if len(reply.replies) is greater than zero, and so on. But you have no idea how deep down the rabbit hole the comment tree goes, right? So that's slightly problematic.

So the solution is, on submission.comments, we can add .list(), and this will list out all of the comments. So .list() I believe is purely a Python Reddit API Wrapper thing, so purely a PRAW feature.
That's not something that's actually available to you in the Reddit API itself. But anyways, that doesn't matter. Let me go ahead and close this so we've got a nice clean thing. And then also, we just want to print comment.body; we don't really want to do the replies, so let's take that out and run this real quick.

So in this case we've run through all of them. Here you go, here's the second-level reply. Now unfortunately, we have absolutely no contextual data for this. We don't really know where this was in the whole thing; for example, you wouldn't know which reply it was in reply to. What list() does is basically take all the top-level comments and list those out, then it goes down to the second-level comments and lists all those out, then the third level, and so on.

So one option you have is, rather than just comment.body, you can also print the parent ID, and that would be comment.parent(). Now do note, that's not an attribute, that's an actual new API call, which in my opinion is super unfortunate. I wish that was supplied. I don't think that's a mistake; I believe that's just in Reddit. And I realize not every comment is necessarily going to have a parent, but pretty much every comment would, right? The parent is the actual submission, or the parent is another comment. And these are little tiny ID strings, so I really think that should be included, but it's not; it's a new API call.

So anyway: comment.parent(), and then the comment ID, which is just comment.id, and that one actually is an attribute. Crazy. I can't remember... I'm pretty sure the submission contains the subreddit ID. I could be wrong, though. Anyway, that's okay. So now what we could do is get the parent ID and the comment ID of every
comment, and then print the comment body, and then you've got the parent ID and the comment ID of everything. From that point you could begin to do some pretty cool stuff.

But the first thing I want to show you is this: let's say we don't do Python, and instead we do news, a very, very popular subreddit. If this doesn't work I'll do politics or something, but we should hit an error here. Let's go... there we go, there's the error. So if you use .list() and you actually do iterate through all comments, chances are eventually you're going to wind up with this stupid error: 'MoreComments' object has no attribute 'parent'.

So what's happening there is on really long comment chains. For example, let me go to the news subreddit. That would be this one: "Marijuana company buys entire US town to create 'cannabis-friendly municipality'". That's going to have lots of comments. Right away you can see here this "load more comments"; that's a MoreComments object. And actually, even though Reddit looks super simple, when you click this I'm pretty sure you're making a new call, an actual call to their database. Same thing with "continue this thread": that's a new call, and it's going to load that data. All this data is not loaded on your page load. That would be nuts; you'd never load the page.

So anyways, if you wanted to continue iterating through those comments, you would need to either handle that with an exception or something like that, or one option you have is to replace the mores. So coming down here to comments.list(), you can just use .replace_more(). We're kind of starting to add a little too many things here, but I'll add the .list() down here, and then what we'll say is .replace_more, and for now we'll say limit=0. But at some point you will run into limits with
the replace_more; there's only so many mores it will add. I think it's 30 or something like that, which is still a ton of comments, because each replace_more will load in a bunch of comments. But just keep that in mind: you're going to run out eventually. It won't error if you do run out of the option to continue replacing; instead it's just going to toss them, so you won't hit an actual error anymore.

So anyways, let's go ahead and run this real quick. Probably I should remove the parent call; that's going to slow me down. Whoa, hmm. Let's see: submission.comments, okay, replace_more... hmm. Okay, fine, fine, fine. .list(), and then we'll come over here, comments.replace_more. Okay, so first we've converted it to list form, which then creates this MoreComments object, and now we can replace them. I just did it backwards. This should work. That's still going to be a lot of queries to the API, but hopefully we'll get through it.

Are you kidding me? Please. What have I done? comments.replace_more... so comments = submission.comments. I think I had it right the first time. So where it says submission.comments: submission.comments.replace_more(limit=0). Now, for comment in comments... let's see, no, for comment in submission.comments. I really feel like I should have been able to string that together. Someone can comment below what the fix should have been, because I don't see why I wasn't able to string those together, but obviously I'm messing up something. So, for comment in submission.comments.list(). Let me try that, and drink some more coffee. There we go, not a problem.

That's going forever, though, so I'm just going to break that. That's a ton of API calls; eventually it would probably throttle me anyway. As you can see, now we've got all the parent IDs, the comment IDs, everything's hunky-dory, we're doing great. So go ahead and close this out. So that's how you can iterate through all the comments and all that.
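The walkthrough above can be condensed into a short sketch. It deliberately avoids importing PRAW so that it runs without credentials: FakeComment and FakeMore are hypothetical stand-ins for PRAW's Comment and MoreComments objects, and flatten_breadth_first is an illustrative helper that mimics the level-by-level order described for list(), not PRAW's own code. With a real submission, the equivalent would presumably be to call submission.comments.replace_more(limit=0) on its own line and then iterate over submission.comments.list(); replace_more does not hand back the comment forest, which is likely why chaining .list() directly onto it failed in the video. The sketch also assumes each comment carries a parent_id string with a t1_ or t3_ prefix, as PRAW comments do, which may be a way to get the parent without a separate call.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FakeComment:
    """Hypothetical stand-in for a PRAW Comment (id, parent_id, body, replies)."""
    id: str
    parent_id: str
    body: str
    replies: list = field(default_factory=list)


@dataclass
class FakeMore:
    """Hypothetical stand-in for a 'load more comments' placeholder (no body)."""
    count: int = 0


def flatten_breadth_first(top_level):
    """Yield every item level by level: top level first, then replies, and so on."""
    queue = deque(top_level)
    while queue:
        item = queue.popleft()
        yield item
        # Placeholders have no replies attribute, so fall back to an empty list.
        queue.extend(getattr(item, "replies", []))


def describe(items):
    """Return (parent_id, id, body) for real comments, skipping placeholders."""
    rows = []
    for item in items:
        if not hasattr(item, "body"):  # e.g. a 'more comments' placeholder
            continue
        rows.append((item.parent_id, item.id, item.body))
    return rows


# Tiny hand-made thread: two top-level comments, one nested chain, one placeholder.
deep = FakeComment("c3", "t1_c1", "Agreed, pandas is great")
reply = FakeComment("c1", "t1_a", "Look into pandas", replies=[deep])
thread = [
    FakeComment("a", "t3_post", "First!", replies=[reply]),
    FakeComment("b", "t3_post", "Nice video"),
    FakeMore(count=12),
]

rows = describe(flatten_breadth_first(thread))
for parent_id, comment_id, body in rows:
    print(20 * "-")
    print("Parent ID:", parent_id)
    print("Comment ID:", comment_id)
    print(body)
```

Skipping anything without a body is a duck-typed shortcut chosen here so the sketch needs no PRAW import; with PRAW installed, an isinstance check against its MoreComments class would be the more explicit test.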
Now, the question is, how might you rebuild that comment tree? Because at some point you've got to rebuild that tree. One option is to build a dictionary or something like that, where you've got a parent ID, then the parent content, and then all the replies: parent ID, content, all replies; parent ID, content, all replies. If you did that, you could rebuild the tree yourself. Now, I'm not going to go through all that; I don't really see too much point covering that in video. But if you're interested in truly rebuilding those comment trees, you can go to part 2 of this tutorial series on pythonprogramming.net, and there'll be an example there. That's one way you could do it; that's how I would do it, anyway. If you have a better way... I'm sure somebody could come up with a better way.

Anyways, in the next tutorial what we're going to talk about is basically just streaming from Reddit. This has all been historical grabbing from Reddit, but there's also a way you can actually just stream data from Reddit. So that's what we're going to be doing in the next tutorial. If you've got questions, comments, concerns, whatever, feel free to leave them below. Otherwise, I will see you in the next tutorial.
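One possible reading of the dictionary idea is sketched below: key each comment by its ID, store its body plus the IDs of its direct replies, and treat anything whose parent is not a known comment as top-level. The record tuples and the names strip_prefix, build_tree and render are hypothetical, the sketch assumes parent IDs carry Reddit's t1_ or t3_ type prefix, and it is not necessarily the approach used in the written version of the tutorial.

```python
def strip_prefix(fullname):
    """Drop a Reddit type prefix such as 't1_' or 't3_' if one is present."""
    return fullname.split("_", 1)[1] if "_" in fullname else fullname


def build_tree(records):
    """Map each comment ID to its body and the IDs of its direct replies.

    records: iterable of (parent_id, comment_id, body) tuples, in any order.
    Returns (nodes, roots); roots lists comments whose parent is not itself
    a comment in records (typically because the parent is the submission).
    """
    records = list(records)  # iterated twice below
    nodes = {cid: {"body": body, "replies": []} for _, cid, body in records}
    roots = []
    for parent_id, cid, _ in records:
        parent = strip_prefix(parent_id)
        if parent in nodes:
            nodes[parent]["replies"].append(cid)
        else:
            roots.append(cid)
    return nodes, roots


def render(nodes, ids, depth=0):
    """Return the thread as indented lines, walking the tree depth-first."""
    lines = []
    for cid in ids:
        lines.append("    " * depth + nodes[cid]["body"])
        lines.extend(render(nodes, nodes[cid]["replies"], depth + 1))
    return lines


# Hand-made records in the flat, level-by-level order a flattened list gives.
records = [
    ("t3_post", "a", "First!"),
    ("t3_post", "b", "Nice video"),
    ("t1_a", "c1", "Look into pandas"),
    ("t1_c1", "c3", "Agreed, pandas is great"),
]
nodes, roots = build_tree(records)
print("\n".join(render(nodes, roots)))
```

Building the node dictionary in a first pass and linking replies in a second means the records can arrive in any order, at the cost of holding them all in memory.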
Info
Channel: sentdex
Views: 22,980
Rating: 4.8850574 out of 5
Keywords: Parsing, comments, Python, tutorial, Reddit, PRAW, Python Reddit API Wrapper, programming, API
Id: KX2jvnQ3u60
Length: 14min 13sec (853 seconds)
Published: Mon Aug 07 2017