Introduction to data scraping in Python using Requests and BeautifulSoup

Captions
Hello everyone, I'm making this small video to show you how easy it is to scrape data from the internet using Python. If you're not familiar with scraping, it is the process of automatically extracting data from virtually any website. You can use scraping for news extraction, stock prices, sports results, or any use case you can imagine. For example, if you're a fashion retailer, you may want to use scraping to analyze your competitors: extract their prices and product descriptions, and mine their social media to learn what your customers think of them. If you're an investor, you may want to collect stock prices and build predictive models on top of them. I have personally used scraping for various clients in multiple sectors, and I also use these techniques to collect data for machine learning model training; for example, I have extracted images, text and videos to train classification models.

Today I'm going to show you how to use libraries such as Requests, BeautifulSoup and pandas to collect and harvest data from a website. We will be interested in the website Premium Beauty News, and we will analyze the section that covers the market and trends in this sector. If you look at this website, a grid of articles appears on the front page, and if we scroll to the bottom there is a pagination. What we want is to collect this data automatically: go through each article, grab its URL, open it, and extract the title, the date, maybe the abstract, and the full content. We want to do this automatically by paginating over all the pages and then dumping the results into a CSV or JSON file. We will see how this can be done; you'll see that it's fairly easy. In general, if you start scraping, you'll find it's a lot of fun, and once you can turn any page into structured data, the internet becomes your database.

Before starting to code, I should mention that we will be using three main libraries. First, Requests, which is a Python library to send HTTP requests such as POST and GET. Second, BeautifulSoup, which is an HTML and XML parser. Finally, pandas, to structure all the data into data frames and export it to a JSON or CSV file, for example.

Now let's get started. I'm going to use a Jupyter notebook to make this process interactive, so that you can follow along with me, and at the end you will see how we can turn this notebook into a Python script in order to schedule the scraping process over days or weeks, whatever you want, to make data scraping more automatable. So I fire up a Jupyter notebook, which I have done already, and I import the libraries I need: first requests, then BeautifulSoup, and finally pandas. Now I will define a function that, given a URL, returns the BeautifulSoup representation of the HTML or XML content. Once we have a URL, I generate a response with requests.get(url), get the body through the response's content attribute, and finally parse it: the parsed response is BeautifulSoup applied to that content with the "lxml" argument. I should also mention that to make this work you have to install the lxml package with a pip command; I have already done that, so I'll skip it. The function then returns the parsed response.
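Here is a minimal sketch of that setup and of the parse_url helper as described; treat it as an illustration rather than the exact notebook code.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    def parse_url(url):
        # Fetch the page and return its parsed BeautifulSoup tree
        response = requests.get(url)
        content = response.content
        parsed_response = BeautifulSoup(content, "lxml")
        return parsed_response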
Now let's see what the result looks like. If I take the URL and pass it to the function, I get this object. It looks like a string, but it isn't: it's a bs4 BeautifulSoup object, and it is basically a tree in which each element is a node. The beauty of BeautifulSoup is that we can query this tree through two main methods, find and find_all. For example, if I want to find div tags inside the tree, I call find("div"); the find method returns a single element, basically the first one, whereas find_all returns all the divs in the tree, searching recursively at every level of its structure, so if I compute the length of that list it will most probably be a large number. If I want to look for a specific element based on its attributes, I can do that too: for example, to grab all the divs that carry a given class, I simply pass the class name, which is fairly easy.

Now let's look at the HTML structure of the website in order to build our first scraper. On the front page I want to collect information about each article that appears in the grid. The first thing to do is to inspect the page's source code; if you're using a Mac you can do this by pressing Command, Option and I at the same time, and you will see the elements of your page. I can hover over each element and see its source code, and going through it I see that all my posts appear in a section that has the class "content". So I can extract this section by calling soup.find on the section with that class. Inside this section I want each post: each post is a div with a particular class (it looks like a Bootstrap naming convention, so we just use it), so posts is section.find_all on the divs with that class. I should normally get 10 posts, which is good.

Now let's take the first post and see how we can extract its URL. What I'm interested in is its link, but it is not a complete URL; we have to append it to the base URL of the website, and it should then correspond to the article's address. To make this more generic, let's loop over each post in posts and grab the URI. If you look at the code, the href is inside an a tag, which is itself inside the h4 that holds the title of the post. So I can do this very easily by calling find("h4"), then find("a") inside it, and grabbing the href. I can do this over all the posts, printing the URI each time, and it looks good. Now I want to prepend the base URL to each URI: url_post is base_url plus uri, and we're good.
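A sketch of that single-page extraction. The base URL, the section path and the post card's CSS class are assumptions made for illustration (the captions only say the card class follows a Bootstrap-like naming), so they may need to be adjusted against the live site:

    base_url = "https://www.premiumbeautynews.com"   # assumed base URL of the site
    url_page = base_url + "/fr/marches-tendances"    # hypothetical path of the market/trends section

    soup = parse_url(url_page)

    # All article cards live inside the <section class="content"> element
    section = soup.find("section", class_="content")

    # Each post is a div with a Bootstrap-like grid class (name assumed)
    posts = section.find_all("div", class_="col-md-4")

    for post in posts:
        # The link sits in the <a> tag nested inside the post's <h4> title
        uri = post.find("h4").find("a")["href"]
        url_post = base_url + uri
        print(url_post)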
So far we have done this for one page only; what I want to do now is paginate over all the pages until I reach the end. We can do this by inspecting the next button in the source page: it sits inside a p tag with the class "pagination", which contains a span with the class "next". If we go to the very end, for example page number 81, we see that the next button still exists but is disabled: inside the pagination element there is no longer a span with the class "next", only a "next disabled" one.

So how can we implement this logic? To paginate over all the data, I create a while loop. My url starts at the first page, and I extract the next button: next_button is soup.find on the p with the class "pagination", and then, inside it, find the span with the class "next". The loop runs while next_button is not None: I parse the soup of the current url, extract the section and the posts, and for each post I extract the post's URL. Once I'm done with a page, I have to update the url so I can move on to the next one. To make this work I have to define the soup from the url inside the loop, and I initialize next_button to something that is not None, for example an empty string, before the loop starts. At the end of each iteration I recompute next_button from the current soup, and if it is not None I update my url from the span: next_button.find("a") gives a relative href, so the next url is the base url plus that href. We can check the logic by printing a page number that starts at 1 and is incremented at each iteration; I won't print the post URLs, but you get the logic. This loop should stop at page number 81, so I'm going to pause the video until it reaches the end. Here we are: the loop has finally ended, and we have gone over all the pages.
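A sketch of that pagination loop, reusing parse_url, base_url and url_page from the previous sketches; the "pagination" and "next" class names follow the narration and are assumptions:

    url = url_page       # start from the first page of the section
    page_number = 1
    post_urls = []
    next_button = ""     # anything that is not None, so the loop runs at least once

    while next_button is not None:
        print(page_number)
        soup = parse_url(url)
        section = soup.find("section", class_="content")
        posts = section.find_all("div", class_="col-md-4")  # post card class assumed

        for post in posts:
            uri = post.find("h4").find("a")["href"]
            post_urls.append(base_url + uri)

        # The plain "next" span is assumed to disappear on the last page,
        # where only a disabled variant remains
        next_button = soup.find("p", class_="pagination").find("span", class_="next")
        if next_button is not None:
            url = base_url + next_button.find("a")["href"]
            page_number += 1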
What we want now is, for each post on each page, to grab the content. Let's take the first article as an example: from its URL I want to grab the title, the date, the abstract and the full content, and if there is additional data such as tags or the author name, I should get that as well. So, back to the code: I create soup_post, which is parse_url applied to the post URL. First, the title of the article: I see from the source that it is an h1 with the class "article-title", so I grab that element and take its text. Next, the datetime: it appears inside a span, in a col-md-7 column, inside a header with the class "row sub-header", so I find that header on soup_post, get the span inside it, and read its datetime attribute. Then I want to extract the abstract: looking at the code again, it is the bold text inside the h2 that has the class "article-intro", so I find that h2 and grab its text. Finally, I want the full content of the article: there is a div with a dedicated class that holds it. There are a lot of sub-tags inside this div, but we can get all the text and concatenate it automatically by calling the text attribute, which is quite handy in our situation.

As you can see, we have collected and extracted all the data we need, so why don't we wrap this in a function? I'll call it extract_post_data: it takes a post URL, creates soup_post with parse_url(post_url), extracts all the information we just went through, puts it inside a dictionary (the title, the datetime, the abstract, the content, and also the post URL), and returns that data. Now I can call this function inside my loop. I define a posts_data list, and for each post I call extract_post_data on its URL and append the result to posts_data. I also want to monitor the progress of the execution, so I import tqdm, wrap it around posts, and set leave to False, and this should be good.

Now the code is executing over the first few pages; I will interrupt the execution at the 10th page to see whether the results are structured the way we want. We're going through the seventh, the eighth, the ninth, and finally the tenth page. If we have scraped 10 pages with 10 posts each, we should have about 100 posts; with the offset we end up with 90. If I put this inside a data frame built from posts_data, I can check the number of unique URLs, which is 90.
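A sketch of that per-article extraction and of the collection step, here applied to the post_urls list gathered in the pagination sketch. The class names "article-title", "row sub-header" and "article-intro" follow the narration, while the content div's class and the tqdm import path are assumptions:

    from tqdm import tqdm

    def extract_post_data(post_url):
        # Parse one article page and return its fields as a dictionary
        soup_post = parse_url(post_url)
        title = soup_post.find("h1", class_="article-title").text
        datetime = soup_post.find("header", class_="row sub-header").find("span")["datetime"]
        abstract = soup_post.find("h2", class_="article-intro").text
        # .text concatenates the text of every sub-tag inside the content div
        content = soup_post.find("div", class_="article-text").text  # content class assumed
        return {
            "title": title,
            "datetime": datetime,
            "abstract": abstract,
            "content": content,
            "url": post_url,
        }

    posts_data = []
    for url_post in tqdm(post_urls, leave=False):
        posts_data.append(extract_post_data(url_post))

    df = pd.DataFrame(posts_data)
    print(df["url"].nunique())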
So you've seen that we have collected all this information very easily. Now, to make our code generic, executable and automatable using scheduling and cron jobs, we have to move it into a script, and for this I'm going to use Visual Studio Code. I save everything as scraping.py, bring in all my dependencies, and replace the notebook version of tqdm with the plain one, because we're now inside a script. I define the parse_url function, select an interpreter and a formatter, replace tqdm, define the base URL, and that's it; a minimal outline of what the resulting script could look like is sketched at the end of this section.

So we've seen here that with a little bit of HTML knowledge and a little bit of Python programming, using BeautifulSoup and Requests, we can easily harvest and extract the content of a full website, and we can most likely replicate our analysis and our code on any other section of the website, because I honestly think that all these sections follow the same structure. You can get the whole job done within maybe a few minutes of execution, and finally you can run your analysis on this scraped data. I will post all the code on my GitHub account, and I will post the links to BeautifulSoup, Requests and pandas in the comments section. If you have any questions regarding the code or the analysis, don't hesitate to post a comment below, and don't hesitate to share and like this video. Thank you for watching, bye.
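For reference, a minimal sketch of what the standalone scraping.py could look like once the pieces above are assembled; the section path, the post card and content classes, and the output file name are assumptions, not the exact published code:

    # scraping.py - minimal outline of the standalone script (sketch)
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from tqdm import tqdm

    base_url = "https://www.premiumbeautynews.com"   # assumed base URL

    def parse_url(url):
        return BeautifulSoup(requests.get(url).content, "lxml")

    def extract_post_data(post_url):
        soup_post = parse_url(post_url)
        return {
            "title": soup_post.find("h1", class_="article-title").text,
            "datetime": soup_post.find("header", class_="row sub-header").find("span")["datetime"],
            "abstract": soup_post.find("h2", class_="article-intro").text,
            "content": soup_post.find("div", class_="article-text").text,  # content class assumed
            "url": post_url,
        }

    def main():
        url = base_url + "/fr/marches-tendances"     # hypothetical section path
        posts_data = []
        next_button = ""
        while next_button is not None:
            soup = parse_url(url)
            posts = soup.find("section", class_="content").find_all("div", class_="col-md-4")
            for post in tqdm(posts, leave=False):
                uri = post.find("h4").find("a")["href"]
                posts_data.append(extract_post_data(base_url + uri))
            # Follow the "next" button until it is no longer present
            next_button = soup.find("p", class_="pagination").find("span", class_="next")
            if next_button is not None:
                url = base_url + next_button.find("a")["href"]
        pd.DataFrame(posts_data).to_csv("posts.csv", index=False)

    if __name__ == "__main__":
        main()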
Info
Channel: Ahmed Besbes
Views: 992
Rating: 5 out of 5
Keywords: beautifulsoup, python, data extraction, scraping, requests, jupyter notebook, data mining, beautiful soup, beginner tutorial, machine learning, scrape data, scrape data from website python, python beautifulsoup tutorial, python web scraping tutorial for beginners
Id: 7Odi2_u-yDk
Length: 34min 10sec (2050 seconds)
Published: Fri Oct 30 2020