Scrape Tweets with No Limitation and No API Key with Twint

Captions
What's up everyone, how are you doing today? I've decided to show you a quick way to extract data from Twitter without the API, without authenticating yourself to your account, and also without using Selenium or any browser emulator. This package, Twint, is very powerful: it allows you to fetch Twitter data directly from the command line or from a script. It has many features that let you, for example, specify the username you want to scrape, search keywords, and more advanced parameters like the date or the geolocation, and at the end you can specify the output format you want: CSV, JSON, SQLite, etc. You can run the same search directly from a notebook or a Python script by creating a configuration object and setting the same parameters. You can also build some advanced visualizations, for example by plugging this tool into a graph visualization tool or Kibana, which is based on Elasticsearch. So we can use this tool in two ways, and I will show you both, but before doing that, let me show you how to install the package. If you go to the README of the repo you'll see a pip install command; don't use it, because you'll have a problem executing the search operation. The right way to install this package is by cloning the repo and then installing the requirements locally. Once you're done, you're ready to go.

Let me show you how to use Twint directly from a Jupyter notebook. We'll copy-paste a small snippet: first import twint, then set up the configuration object. I'm going to ignore Donald Trump because I don't really like him, and let's search, for example, for "covid". Let's run this. As you see we have an error, "this event loop is already running", and yet we're getting some data; we can work around this error by importing a package.
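The notebook setup described above might look like the following sketch. It assumes Twint is installed from the cloned repo; the attribute names (`Search`, `Store_json`, `Output`) follow Twint's configuration options, and the search keyword and file name are just the ones used in the video.

```python
import nest_asyncio
import twint

# Twint uses asyncio internally; Jupyter already runs an event loop,
# so patch it to allow nested use and avoid the
# "This event loop is already running" error.
nest_asyncio.apply()

c = twint.Config()
c.Search = "covid"             # keyword to search for
c.Store_json = True            # write results as JSON Lines
c.Output = "covid_demo1.json"  # output file

# Run the search; interrupt the cell to stop collecting.
twint.run.Search(c)
```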
To make Twint run inside the Jupyter notebook, we have to import a package called nest_asyncio and then call nest_asyncio.apply(). Let me relaunch the kernel. Now I think we're good to go: as you can see, we are extracting the tweets. These tweets are not generated live; if you look closely, they are generated right now, but they are not the streaming tweets you would get from the API. If you go through them you will see that some are fetched from the history. Basically, with Twint you won't have any API limitation, which is very good if you want, for example, to harvest a lot of data on a specific use case.

Let's say now that I want to store this data in JSON format, which is the most handy way to store objects. We'll have to set the Store_json parameter to True, and then specify an output file; let's say the output is called covid_demo1.json. Let me run this script for about ten seconds and we'll see afterwards what we gathered as data. One, two, three... let's stop it now. I'm going to import pandas. You may see a warning because we are running nest_asyncio, but it's not a problem in our case. Now let's say I want to import this new data, covid_demo1.json. As you see, we have a problem importing it. That's normal, because this data is formatted as JSON lines; it's not a list of JSON objects. To work around this error, we have to add the lines argument and set it to True. Now I have imported our data, and if you look at the shape (let me just restart the kernel because the warning is really annoying), we ran our script for maybe 20 seconds and we already have six thousand tweets, which is really awesome: imagine if you run the script for a day or two, you'd get millions of tweets. Let me just check the data.
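The lines=True fix can be illustrated with a tiny synthetic example (the records and field names here are made up to mimic Twint's output, not actual scraped data):

```python
import io

import pandas as pd

# Twint's Store_json output is JSON Lines: one JSON object per line,
# not a single JSON array, so a plain read_json() call fails on it.
raw = (
    '{"id": 1, "date": "2020-11-27", "tweet": "first tweet", "language": "en"}\n'
    '{"id": 2, "date": "2020-11-27", "tweet": "second tweet", "language": "fr"}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(io.StringIO(raw), lines=True)
print(df.shape)                    # (2, 4)
print(df["language"].tolist())     # ['en', 'fr']
```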
I'm going to tweak the display a little bit to show all the columns, with max_columns set to 100, let's say. As you see here, we are gathering the id, the creation date, the time zone, etc. We also get the name of the Twitter account, the tweet, and the language, which is very good if you want to filter on specific languages, and we get the retweets, the videos, the thumbnail, and so on; there's a lot of stuff. All this metadata is what you would get directly from the API, but here you are using Twint, you are not using the API, and you don't have any limitation. Isn't that great?

Now let me show you how to run the same operation from the command line, because some people would like to do that as well. Imagine, for example, that you are launching an Amazon machine on the cloud for about two hours and you want to run the script inside it: you can run it from the terminal, detach it in a screen session so it runs in the background, and come back two hours later to grab your data. To do this, after installing Twint, you can call it from the terminal. If you hit the twint command, you'll see the options you can use. For example, I can search for "covid" and "vaccine", and as you see, we are getting the results right now. Now let's say I want to store this in JSON format: that option can be added with --json, and then I specify the output file, let's say covid_demo2. It will run the same fetching operation, and once it's done I can collect my data and read it into data_v2, again with lines set to True; data_v2.shape shows we have around 700 tweets. Maybe Twint running from the command line is slower than from the Jupyter notebook; I haven't really tested, but maybe this is something related to Twint. So that's it.
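The from-source install and the detached command-line run described above might look like this (flag names follow Twint's CLI; exact options can differ by version, and the screen session name is just an example):

```shell
# Install from source, as recommended above (not via plain pip install twint)
git clone https://github.com/twintproject/twint.git
cd twint
pip install -r requirements.txt
pip install .

# Search and store results as JSON Lines, detached in a screen session
# so it keeps running in the background after you log out
screen -dmS twint_job twint -s "covid vaccine" --json -o covid_demo2.json
```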
You now have a powerful package: you can extract as many tweets as you want and do a lot of things with them. For example, if you want to analyze a specific topic, you can collect this data and apply natural language processing on it, directly on the fly: sentiment analysis, topic extraction, or named entity extraction, for example if you want to monitor what people say about a specific topic and which entities they mention in correlation with that subject. You can feed all this data into an Elasticsearch database, plug a Kibana dashboard on top of it to visualize the data, and then set alerts on your metrics: for example, if for a given search keyword the volume exceeds a specific threshold, you can trigger an alert. So you can do a lot of things. Thank you guys for watching, I hope this was useful to you. If you like the content of this channel, please hit the subscribe button or the like button on this video, and follow for more content about data engineering, data science, and data extraction. See you in the next videos.
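As a toy illustration of the kind of on-the-fly analysis mentioned above, here is a minimal lexicon-based sentiment scorer. The word lists and example tweets are invented for the sketch; a real pipeline would use a proper model or library (e.g. VADER) rather than hand-picked word sets.

```python
# Toy lexicon-based sentiment scoring for scraped tweets.
POSITIVE = {"great", "good", "hope", "effective", "safe"}
NEGATIVE = {"bad", "scared", "dangerous", "fake", "worse"}

def sentiment(tweet: str) -> int:
    """Crude score: count of positive words minus count of negative words."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "the vaccine looks safe and effective",
    "this is fake news and dangerous",
]
scores = [sentiment(t) for t in tweets]
print(scores)  # [2, -2]
```

In a real monitoring setup, a score like this (or its rolling average per keyword) is exactly the kind of metric you could index into Elasticsearch and alert on from Kibana.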
Info
Channel: Ahmed Besbes
Views: 18,855
Rating: 4.9516907 out of 5
Keywords: twitter scraping, scraping twitter data with python, scraping twitter no limitations, scraping twitter wit no API key, scraping twitter with Twint, twint vs tweepy, scraping twitter without selenium, scraping twitter anonymously, scrape covid tweets, quick way to scrape twitter, social media monitoring with twitter, twitter, twitter scraper python, twitter scraper python tutorial, twitter scraping python tutorial
Id: _SqgSh3aR1g
Length: 10min 18sec (618 seconds)
Published: Fri Nov 27 2020