Twitter Scraper Python Tutorial

Captions
In this video I'm going to show you how to scrape Twitter data using Python, so that you can mine the data for research, machine learning, sales leads, or whatever else you want to use it for. Twitter is a bit more complex to scrape than some of the other projects I've done so far in this series, and you might ask: hey, isn't there a Twitter API? Yes, there absolutely is, and if you're okay paying for API access or working within the strict limitations of a developer account, it's definitely something you should check out. It will be easier, and you'll have a lot more rich metadata available to you. However, if that's not an option for you, keep watching.

We need to automate the browser for this project, so we'll be using Selenium. I'm going to assume you already have it installed, but if not, check out my video on installing Selenium. I'm going to use the Edge browser, but you can use any Selenium web driver you want, such as Chrome or Firefox; there aren't many differences, and I'll point them out when I see them.

Here are a few things we'll be tackling in this project. First, logging into Twitter: yes, we need to log in to scrape the data, so you'll need a Twitter account. Second, continuous scrolling: Twitter handles pagination by continuously loading data as you scroll down the page, which is different from typical pagination, where you can simply click on the next page or get a link to it. Third, I'll be using a lot of XPath syntax, because XPath is the simplest and most effective way of getting at the elements of a complicated website like Twitter.

The first thing you need to do is import the required libraries. If you're using Firefox or Chrome, your webdriver imports will look the same except for one key difference: you'll import the Firefox or Chrome web driver instead of using the msedge-selenium-tools library. Next, create an instance of the web driver; this will differ slightly depending on which driver you're using, as I've indicated here. A web browser should now open.
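As a rough sketch of that setup (assuming the msedge-selenium-tools package for Edge, or plain selenium for Chrome/Firefox; the helper name start_driver is mine, not from the video):

```python
def start_driver(browser="edge"):
    """Create a webdriver instance. Imports happen lazily, so only the
    library for the browser you actually pick needs to be installed."""
    if browser == "edge":
        # Edge (pre-Selenium-4) uses the msedge-selenium-tools package
        from msedge.selenium_tools import Edge, EdgeOptions
        options = EdgeOptions()
        options.use_chromium = True
        return Edge(options=options)
    elif browser == "chrome":
        from selenium import webdriver
        return webdriver.Chrome()
    else:
        from selenium import webdriver
        return webdriver.Firefox()
```

Calling start_driver() should pop open a browser window that the rest of the script controls.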
Use the get method of the driver to navigate to the Twitter login page. What you should see now is the Twitter login screen, and what we need to do is find the username and password input boxes, fill them in, and submit. Right-click the username input box and then click Inspect. You'll see the developer tools open up on the right-hand side or at the bottom of your screen, depending on how you have it set up; you may need to click Inspect on the input a few times to get to the right level of detail. What you should see is an input tag with a lot of stuff in it, but one of its properties is called name, and it has a value of session[username_or_email]. This is what we'll use to identify the element. As for the password, if we right-click it and click Inspect, you'll see another input element; this one also has a name property, with a value of session[password].

Let's take this information to our code editor and log into Twitter. Use the method find_element_by_xpath and pass in the following string, which I'll explain. I'm not going to give a full XPath tutorial here, but I do have a link in the description to some resources if you'd like to learn more about it, which I highly recommend. You might be able to tell already what this means: a single forward slash means I'm starting from the root of the document, whereas a double forward slash means I'm looking for something that can start anywhere in the document. Here I'm looking for an input tag with certain qualifications, which I've included within the square brackets; in this case it should have a property called name with a value of session[username_or_email]. Assign this element to the variable username, then use the send_keys method to input your username. You should now see your username entered into the input box.

Now for the password. Because I don't want to hardcode my password on the screen for all to see, for obvious reasons, I'm going to use the built-in getpass function from the Python standard library.
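A sketch of the login steps just described, using the name attributes we found (the login URL and the function shape are mine; Twitter's markup changes over time, so treat the XPaths as a snapshot):

```python
from getpass import getpass  # prompts for input without echoing it

USERNAME_XPATH = '//input[@name="session[username_or_email]"]'
PASSWORD_XPATH = '//input[@name="session[password]"]'

def login(driver, username):
    from selenium.webdriver.common.keys import Keys  # lazy: needs selenium
    driver.get('https://www.twitter.com/login')
    user_input = driver.find_element_by_xpath(USERNAME_XPATH)
    user_input.send_keys(username)
    password_input = driver.find_element_by_xpath(PASSWORD_XPATH)
    password_input.send_keys(getpass())        # asked for at the console
    password_input.send_keys(Keys.RETURN)      # same as clicking Log in
```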
getpass essentially works like the input function, except that it hides the input from the user to keep it private. Let's find the password input the same way we found the username input, except this one has a name value of session[password]. Then use the send_keys method to enter your password. Finally, you can use send_keys again to send the Return key, which is essentially the same action as clicking the login button.

All right: if everything went as planned, you should be logged into your Twitter account. What we need to do now is search for the term we want to extract data for. At the upper right-hand corner of the screen there's a search input box; right-click on it and then click Inspect. This one appears to have a few properties we can use, such as an aria-label with a value of "Search query" and a data-testid with a value of "SearchBox_Search_Input", but we're going to go with the aria-label because it's shorter. Same as before, use the find_element_by_xpath method to find the input tag that has an aria-label with a value of "Search query", and assign it to a search_input variable. Next, use the send_keys method to enter a search term into the input box; I'm going to type in #polynote. Then use send_keys again to send the Return key.

You'll notice that at the top of the page you have a few tab options, such as Top, Latest, People, Photos, and Videos. You can find and click any of these tabs with your web driver, depending on your goals; however, since I'm interested in pulling historical data, I'm going to click on the Latest tab. Right-click on the Latest tab link and then click Inspect. As you can see, this is an anchor tag with other stuff embedded in it, but it looks like the easiest way to identify it is to search for the text of the link. Fortunately there's a built-in method for this, called find_element_by_link_text, so use that method and click the link to navigate to the Latest tab.
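The search and tab-switch steps might look like the following sketch (function name is mine; "Latest" must match the link text exactly):

```python
SEARCH_XPATH = '//input[@aria-label="Search query"]'

def search_latest(driver, term):
    """Enter a search term, submit it, and switch to the Latest tab."""
    from selenium.webdriver.common.keys import Keys  # lazy: needs selenium
    box = driver.find_element_by_xpath(SEARCH_XPATH)
    box.send_keys(term)
    box.send_keys(Keys.RETURN)
    driver.find_element_by_link_text('Latest').click()
```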
The next thing we need to do is figure out how to collect all the tweets on this page; I like to refer to them as cards. Because this is a continuously scrolling page, only about 9 or 10 tweets have loaded so far, but as I continue to scroll down, more and more tweets will load. So scraping these tweets is going to be an iterative process of scraping and scrolling.

Right-click on a tweet and let's see how we can identify it as a unit. As you scroll down the code, you'll notice different elements of the page being highlighted, and this tweet has several layers of containers. However, there's a div container that has a data-testid with a value of "tweet"; this looks like it's going to be our ticket. So let's go back to the code and collect all of the tweets on the page. Use the find_elements_by_xpath method to find all div tags that have a data-testid value of "tweet". Notice that the method is plural, elements, not element: it returns a list of elements rather than a single element. Assign this to the cards variable, then get the first card in the list so that we can prototype a data extraction model on a single card, which we can then apply to all of them.

If you take a look at this tweet, there are a few pieces of information we want to collect: username, Twitter handle, timestamp, count of comments, count of likes, count of retweets, and the text of the tweet itself. The containing div tag is the one with the tweet data-testid, which we already identified. One of the difficulties with scraping Twitter data is that you don't have access to a lot of unique identifiers that let you quickly reach the data you want, so a lot of what we do has to rely on relative relationships. If we drill down one level into this tweet, you'll see that there are two divs: the first contains the user image, and the second contains the body of the tweet, which you can see on the right-hand side.
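Collecting the cards can be sketched like this (helper name is mine; the data-testid value is a snapshot of Twitter's 2020 markup):

```python
CARD_XPATH = '//div[@data-testid="tweet"]'

def get_cards(driver):
    """All tweet 'cards' currently loaded in the DOM (a growing list)."""
    return driver.find_elements_by_xpath(CARD_XPATH)

# card = get_cards(driver)[0]  # first card, for prototyping the extraction
```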
We can access that body using the following XPath notation: the period means the search starts from the current element; then we look for a div tag; and since there are two, we can use square brackets to access the second one, so I'll put a 2 in there. If we drill one level into the body div, you'll see two more divs: the first contains the username, Twitter handle, and date, and the second contains the content of the tweet as well as the counts of replies, retweets, and so on.

So let's go ahead and grab the username, Twitter handle, and post date. We already have the second div containing the body of the tweet, so to get the first div tag within it, we can access the next node in a similar way, except this one will be the first element of that node, so we use a 1. Now, to get the username, we could start drilling down, and down, and down... okay, this is getting a bit ridiculous, to be honest. At the end of the day, the username is in a span tag, and fortunately with XPath we can skip all of the intermediate levels and go straight to the first span tag by using double forward slashes followed by span. Putting this all together, we can use the find_element_by_xpath method with this XPath to get the username, and then grab its text. I wanted to show you how to slice a node here, in case you hadn't seen that before, but we can actually shorten this a bit further: since this span tag is the first one in the tree, we can just use an abbreviated XPath that means "give me the first span tag below the current tag".

The next item we need is the Twitter handle. If you drill down further, you'll see that the handle is located in a span tag as well, and it will be the first span tag that contains an @ symbol, so we can use a little XPath magic to make this really simple: using the find_element_by_xpath method, we look for the first span tag whose text contains an @ symbol and return the text of that element.
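Those two lookups, written against a single card element, might look like this (the function wrapper is mine; note both XPaths are relative, starting with a period):

```python
def get_user_info(card):
    """Username and handle, relative to one tweet card.

    './/span' is the abbreviated form: the first span tag anywhere below
    the current element, skipping the long './div[2]/div[1]/...' drill-down.
    """
    username = card.find_element_by_xpath('.//span').text
    handle = card.find_element_by_xpath('.//span[contains(text(), "@")]').text
    return username, handle
```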
The next element is the post date, and this one is actually pretty easy: the post date is located in a time tag, which has an attribute called datetime representing the timestamp of the tweet. You can use Python's datetime library to convert this to a datetime object, but I'm just going to leave it as a timestamp string.

The next item is the text of the tweet. We actually already know where this is: you saw before that the card contains two div tags, that the div containing the tweet body contains two more divs, and that it's the second of these that holds the content of the tweet and the tweet stats. However, we can't just grab the text of that second div, because then we'd end up picking up the likes, retweets, and all that, tacked on at the end of the text. If you drill into that second div, you'll find three more divs: the first is the user's comment, the second is the content the user is responding to, and the third is the group of tweet stats and other buttons at the bottom. As far as I know, these three divs always exist; if the person isn't responding to anything, the second one is simply empty. So what we're going to do is grab the first two divs separately and then concatenate them together to represent the content of the tweet.

The last three items are the reply count, the retweet count, and the like count. These are easy to get because they have attributes that uniquely identify them, as you can see in this code: use the find_element_by_xpath method on the card element and extract the counts for reply, retweet, and like. If there's no count for one of these, it will just return an empty string, which is fine.

Now that we've prototyped a model for a single tweet, we can generalize it by creating a function containing all of the code we've written up to this point.
So create a function called get_tweet_data that accepts a single argument, card, and copy and paste all of the code above into it. One thing I should point out: if you look through these tweets, you'll occasionally see sponsored content, and the thing about sponsored content is that it doesn't contain a post date. In those cases, when we try to get the datetime element it won't exist, and Selenium will raise a NoSuchElementException. This is actually a good thing, because it means we can easily filter out sponsored content by handling that exception when we try to get the post date. We then consolidate everything into a tuple and return it; that tuple is basically the tweet data. You can now test the function on your card.

To apply the model to all cards, simply write a for loop that appends the result of the function to a tweet_data list whenever data is returned, then print the first item in the list to check that it turned out as expected.

As I mentioned before, one of the challenges of getting data from a website like this is the continuously scrolling page: the data keeps loading as you scroll. To handle this, we need to execute a bit of JavaScript that tells the browser to scroll down the page. Don't worry, it's very simple, and all you need to do is copy what I'm doing here. This bit of code scrolls down to the bottom of the page, at which point more content loads onto the screen.

So now we have all the pieces we need to create our Twitter scraper. Let me quickly outline how this is all going to work, and then we can put it together. First we import the required libraries; then we define our get_tweet_data function; then we start up the web driver, navigate to Twitter, and log in; we enter our search term, extract all available tweets, scroll down, and repeat that process until we've extracted all of the tweets on the page.
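The get_tweet_data function described above might be sketched as follows (the relative XPaths and data-testid values are a snapshot of Twitter's 2020 markup; the ImportError fallback is only there so the sketch can be read and tested without Selenium installed):

```python
try:
    from selenium.common.exceptions import NoSuchElementException
except ImportError:  # allows inspecting this sketch without selenium
    class NoSuchElementException(Exception):
        pass

def get_tweet_data(card):
    """Extract one tweet's fields from a card element, or None if the
    card is sponsored content (which has no timestamp)."""
    username = card.find_element_by_xpath('.//span').text
    handle = card.find_element_by_xpath('.//span[contains(text(), "@")]').text
    try:
        postdate = card.find_element_by_xpath('.//time').get_attribute('datetime')
    except NoSuchElementException:
        return None  # sponsored content has no post date -> skip it
    # tweet text = user's comment + content being responded to
    comment = card.find_element_by_xpath('.//div[2]/div[2]/div[1]').text
    responding = card.find_element_by_xpath('.//div[2]/div[2]/div[2]').text
    text = comment + responding
    reply = card.find_element_by_xpath('.//div[@data-testid="reply"]').text
    retweet = card.find_element_by_xpath('.//div[@data-testid="retweet"]').text
    like = card.find_element_by_xpath('.//div[@data-testid="like"]').text
    return (username, handle, postdate, text, reply, retweet, like)
```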
The loop we wrote to extract the card data works for a single page; however, there are some optimizations we need to make for it to work on a continuously scrolling page. First, as we continue to scroll, the HTML that has already loaded doesn't disappear; it keeps growing. So instead of finding 10 tweets on the page, we might find 20, 30, and so on. It doesn't necessarily grow consistently, but the fact is it does grow, which means there's potential to keep re-scraping tweets we've already scraped. There are two things we can do to fix this. First, instead of iterating over every card in our page cards list, we'll just look at the last 15 items. This saves time by not rechecking every single tweet in the list, when really only the last 15 or so are the ones we care about: the new tweets that have just been loaded onto the page. Second, to keep from duplicating tweets, we'll track the tweets we've already scraped. Unfortunately we don't have a tweet ID, but we can create one by concatenating elements of the tweet into one long string, which essentially acts as a unique identifier. Then we check whether that tweet ID has already been collected by looking it up in a set of tweet IDs; if not, we append the tweet to the tweet_data list and add the tweet ID to the set.

Next, let's add the pagination. Every time we page down, I'm going to make the program sleep for one second, which gives the page time to load before we start scraping the data. What we have now is a loop that continuously scrolls down the page and scrapes additional tweets. However, you might be able to see a problem: we don't currently have a way to test whether we've reached the end of the scroll region.
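The de-duplication idea can be sketched as a small helper. The function name and shape are mine, but the logic (only the last 15 cards, a makeshift string tweet ID, a set of seen IDs) is exactly what's described above:

```python
def collect_new_tweets(cards, tweet_ids, tweet_data, get_data):
    """Scrape only the tail of the card list, skipping tweets seen before.

    cards:      current list of card elements (keeps growing as we scroll)
    tweet_ids:  set of makeshift IDs already collected (mutated in place)
    tweet_data: list of tweet tuples (mutated in place)
    get_data:   maps a card to a tuple, or None for sponsored content
    """
    for card in cards[-15:]:  # only the freshly loaded cards
        tweet = get_data(card)
        if tweet:
            tweet_id = ''.join(str(field) for field in tweet)
            if tweet_id not in tweet_ids:
                tweet_ids.add(tweet_id)
                tweet_data.append(tweet)
```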
The key here is to keep track of the Y position of the scroll bar using a little piece of JavaScript. We'll track the position before and after the scroll, and if the scroll position doesn't change, we'll know we've reached the end and can break out of the loop. The first instance of this code goes right in front of the while loop, and we'll call the variable last_position. The second instance goes right after we scroll down the page, and we'll call that variable curr_position. Then we compare the current and last positions, and if they're the same, we break out of the loop.

As it is, this will work, but sometimes it won't. Unfortunately the internet isn't always as speedy and consistent as you'd expect, so sometimes you'll execute a scroll down and the page won't load fast enough, meaning the starting and ending positions still read as the same. This causes the script to stop prematurely. What we can do to fix this is allow a certain number of scroll attempts before finally breaking out of the loop. To implement the fix, we insert a variable called scrolling, set it to True, and place it before the while loop; then, instead of while True, we change the condition to while scrolling. Next, we create a variable called scroll_attempt, place it right before the code we used to scroll down the page, and set it to zero. We then put the scrolling code inside an inner while loop: while it's true, we scroll down the page, pause, and compare the current and last positions. If they're the same, we increment scroll_attempt; if scroll_attempt reaches 3 or more, we set scrolling to False, which breaks us out of the outer loop, and we break the inner loop as well for completeness. If scroll_attempt is less than three, we sleep for two seconds and let the loop try the scroll again.
If the current position is not equal to the last position, it means we've successfully scrolled down the page; we reset last_position by setting it to the current position, break out of the inner scrolling loop, and get back to the outer loop that scrapes tweets. Now we can let our bot run until it's finished scraping all of the tweets.

After the scraper is finished, you'll want to save the data. You can use the csv library to do that, or something else if you prefer.

And that, my friends, is how you scrape tweets from Twitter. I hope you've enjoyed this video; if so, please hit that like button and subscribe to see more content like this in the future. See you in the next video.
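Putting all the pieces together, the main scrape-and-scroll loop plus the CSV save might look like the sketch below. The function names, column headers, and default filename are mine; get_tweet_data is the extraction function built earlier, and the JavaScript snippets are the standard scroll-to-bottom and read-scroll-position one-liners:

```python
import csv
from time import sleep

def scrape_tweets(driver, get_tweet_data, max_attempts=3):
    """Scrape, scroll, repeat until the page stops growing."""
    tweet_data, tweet_ids = [], set()
    last_position = driver.execute_script('return window.pageYOffset;')
    scrolling = True
    while scrolling:
        cards = driver.find_elements_by_xpath('//div[@data-testid="tweet"]')
        for card in cards[-15:]:                     # only the new cards
            tweet = get_tweet_data(card)
            if tweet:
                tweet_id = ''.join(str(f) for f in tweet)
                if tweet_id not in tweet_ids:
                    tweet_ids.add(tweet_id)
                    tweet_data.append(tweet)
        scroll_attempt = 0
        while True:
            driver.execute_script(
                'window.scrollTo(0, document.body.scrollHeight);')
            sleep(1)                                 # let content load
            curr_position = driver.execute_script('return window.pageYOffset;')
            if curr_position == last_position:
                scroll_attempt += 1
                if scroll_attempt >= max_attempts:   # truly the end of the page
                    scrolling = False
                    break
                sleep(2)                             # slow connection: retry
            else:
                last_position = curr_position
                break                                # scrolled: back to scraping
    return tweet_data

def save_tweets(tweet_data, path='tweets.csv'):
    """Write the collected tuples to a CSV file."""
    header = ['UserName', 'Handle', 'Timestamp', 'Text',
              'Comments', 'Retweets', 'Likes']
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(tweet_data)
```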
Info
Channel: Izzy Analytics
Views: 30,820
Rating: 4.9612589 out of 5
Keywords: twitter scraper python tutorial, twitter scraper python, twitter scraper python example, python tutorial, python web scraping, python web scraping tutorial, twitter web scraping python, how to scrape twitter data, how to scrape twitter data python, how to scrape twitter data using python, how to web scrape twitter, web scraper twitter, data science, web scraping, python scraper, web scraping in python, twitter scraper, web scraping twitter, tweepy
Id: 3KaffTIZ5II
Length: 22min 15sec (1335 seconds)
Published: Thu Oct 08 2020