Data Extraction Using Python | Python Requests, BeautifulSoup, PyPDF2 | Python Training | Edureka

Video Statistics and Information

Captions
[Music] Hello everyone, and welcome to this Edureka session in which I am going to talk about data extraction using Python. Let's take a look at the agenda. I'm going to start with an introduction to data extraction, then cover the different libraries we can use for data extraction in Python, and after that I'll jump right into the demos, where I'll explain scraping data from a website, from a PDF file, and from an API using Python. I hope you're clear on the agenda. Also, don't forget to subscribe to Edureka for more exciting tutorials, press the bell icon to get the latest updates, and enroll in Edureka's Data Science with Python certification course; the link is given in the description box below.

Now, without any further ado, let's understand data extraction using Python. What exactly is data extraction? Data extraction is the act or process of retrieving data out of data sources for further data processing or data storage, in our case data analysis. The import into the intermediate extracting system is usually followed by data transformation, and possibly the addition of metadata, prior to export to another stage in the data workflow. In other words, we have different sources we can extract data from, and we can then use that data for different purposes, say transformation, processing, or storage. That's a simple definition of what data extraction is.

Now let's move on to what Python offers for data extraction. Python has several libraries we can use here. First of all, there is the Requests library, written by Kenneth Reitz and licensed under Apache 2.0. As the official documentation puts it, it is a human-friendly HTTP library, quite easy to use, and it is used for making all sorts of HTTP requests; if you're trying to extract data from the World Wide Web, you're going to need HTTP requests, and that's where Requests comes in. Why do we need it? The reason is pretty simple: with Requests you don't have to manually add query strings to your URLs or form-encode POST data, which makes the job of making HTTP requests much easier.

Then there is Scrapy, a free and open-source web-crawling framework written in Python. It was originally designed for web scraping but can also be used to extract data through APIs, and it is maintained by Scrapinghub Ltd. Scrapy is a complete package when it comes to downloading web pages, processing them, and storing the data in databases, and it is a powerhouse for web scraping, with multiple ways to scrape a website. Scrapy handles bigger tasks with ease, scraping multiple pages or a group of URLs in less than a minute; it uses Twisted under the hood to achieve concurrency, provides spider contracts that let us create generic as well as deep crawlers, and offers item pipelines to create functions in a spider that perform operations such as replacing values in the data. If you want to know more about Scrapy or Requests, we have detailed tutorials on both, with demonstrations of spiders and worked examples.

One more library is Beautiful Soup, a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work.

Now that we are clear on the library support for data extraction in Python, let's jump right into the demos. The first demo is about extracting data from a PDF file, and for this we'll use the PyPDF2 library. A brief introduction to PyPDF2: it is a pure-Python library built as a PDF toolkit, capable of extracting document information, splitting documents page by page, merging documents page by page, cropping pages, merging multiple pages into a single page, and encrypting and decrypting PDF files, among many other features. Being pure Python, it runs on any Python platform without dependencies on external libraries, and it can work entirely on string objects rather than file streams, which allows PDF manipulation in memory and makes it a useful tool for websites that manage or manipulate PDFs.

In this demo I will show you how to read from the different pages of a PDF file, so let's go right to PyCharm. First, here is the PDF file itself: a simple two-page sample document. In PyCharm I have made a project named data-extraction, with separate files for each of the demos I'm going to show you. We won't be working with Scrapy here, because that is a different tutorial altogether; we have a full tutorial on how to scrape data using Scrapy, so you can check that out. Otherwise, make sure you install all the dependencies first.
Open the Project Interpreter and install the dependencies there: requests (already installed in my project, so I'm not doing it again), bs4 for Beautiful Soup 4, PyPDF2, and urllib3, which you'll use while web scraping with Beautiful Soup.

Now, to the PDF demo. I import PyPDF2 as p. Then I take a variable, file, and use the open function with the file's location. After that I take another variable, pd = p.PdfFileReader(file). Now I get the pages: x = pd.getPage(0) for the first page and y = pd.getPage(1) for the second, and print x.extractText() and y.extractText(). That's the whole program, so let's run it. As you can see in the output (commenting out one print at a time to check each page separately), the first page shows exactly what was written in the PDF: this is a simple PDF, and Python is a high-level, object-oriented programming language. Similarly, printing the second page shows its content, which covers important Python concepts such as data operations, file operations, and so on. And that is how you extract data from a PDF file using the PyPDF2 library in Python.
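The PDF demo above can be condensed into a small helper. This is a minimal sketch, assuming the older (pre-3.0) PyPDF2 API used in the video; `extract_pages` is my own name for it, and `"sample.pdf"` in the usage note is a placeholder path:

```python
def extract_pages(path):
    """Return the extracted text of every page of the PDF at `path`."""
    # Deferred import so the sketch only needs PyPDF2 when actually called.
    # Newer PyPDF2 releases (3.0+) renamed PdfFileReader to PdfReader and
    # extractText() to extract_text().
    import PyPDF2
    with open(path, "rb") as f:                  # PDFs must be opened in binary mode
        reader = PyPDF2.PdfFileReader(f)
        return [reader.getPage(i).extractText()  # pages are zero-indexed
                for i in range(reader.numPages)]

# Usage, assuming a two-page file like the sample in the video:
# first_page, second_page = extract_pages("sample.pdf")
```

Looping over `reader.numPages` avoids hard-coding `getPage(0)` and `getPage(1)`, so the same helper works on a PDF of any length.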
Next up we have one more program, which is about extracting data using an API. I'll clear the file and write it from the beginning. First we import our dependencies, starting with requests: import requests as r (we'll import a couple more things later). Then I set up the URL, which is basically api.openweathermap.org. Let me open the site for you: OpenWeatherMap, from which we're going to get the current weather data. To call it we use the current-weather API call, which takes a city name and an API key. The API key is unique to everybody; for that you have to sign in on the login page and follow the instructions, which I'll leave you to do on your own. So we have a city name, which I read from the user with city = str(input("City name: ")), and a key, which I import from a secrets file: from secret import key.

After this I make the request, using the requests library on the URL I mentioned: response = r.get(url). Then I print response.status_code to check what we get. Typing the city as New Delhi, we get a successful status code, so we can carry on. Next I print response.text to see what we're getting; trying a different city, say Chicago, the output shows the coordinates, the weather, everything. That's raw text, though, so let's get it as JSON instead, which is easier to work with: take another variable, data = response.json(), in place of response.text; you don't have to convert it to any other format yourself. Trying London this time, we get the coordinates, the weather (clouds, with the description "broken clouds"), an ID we don't really need here, and so on. And that is how you extract data using an API in Python with the requests library.

Also, one thing I forgot to explain about the PDF demo: PdfFileReader initializes a PDF file reader object, an operation that can take some time, as the PDF stream's cross-reference tables are read into memory; and the getPage method we used retrieves a page from the PDF by page number.

Back to the API example: I hope you've understood how we used it. I skipped the part where you get the key from openweathermap.org; you have to sign up there and follow the steps. Let me also give you a quick introduction to APIs: an Application Programming Interface, or API, is a computing interface to a software component or a system that defines how other components or systems can use it. It defines the kinds of calls or requests that can be made and how to make them, as well as the data formats that should be used, the conventions to follow, and so on; all those things are covered by the API.
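The API demo above can be sketched as a small function. `current_weather` is my own name; the endpoint and the `q`/`appid` parameters come from OpenWeatherMap's documented current-weather call, and the `sample` payload below only illustrates the shape of the JSON, not a live response:

```python
def current_weather(city, api_key):
    """Call OpenWeatherMap's current-weather endpoint and return parsed JSON."""
    # Deferred import so the sketch only needs `requests` when called.
    import requests
    url = "https://api.openweathermap.org/data/2.5/weather"
    # `params` builds the query string for us: ?q=<city>&appid=<key>
    response = requests.get(url, params={"q": city, "appid": api_key})
    response.raise_for_status()  # fail loudly on a bad status code
    return response.json()       # parsed JSON, not raw response.text

# The parsed JSON is nested dictionaries and lists; picking fields out of a
# sample payload (shaped like the London response in the video):
sample = {"name": "London",
          "weather": [{"main": "Clouds", "description": "broken clouds"}],
          "coord": {"lon": -0.13, "lat": 51.51}}
description = sample["weather"][0]["description"]
print(description)  # broken clouds
```

Passing the city and key through `params` instead of pasting them into the URL string is exactly the convenience the Requests library was introduced for earlier in the session.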
For our program here I have used the OpenWeatherMap API, and to use this API you need a key in the URL; after the city, we pass the key, which sits in my secrets file and which I'm obviously not going to show you, because it's unique and you have to generate it for yourself. That can be a little exercise for you: figure out what kind of URL you need for this API. If you search the internet you'll also find a lot of other APIs you can use to get data; this was a very simple example just to demonstrate how you can extract data using Python.

Now that we're done with that example as well, we move on to the final example of the session: web scraping using Python. I'll close this, clear everything, and write it from the beginning. Before we begin, though, I want to tell you something: web scraping, in Python or any other programming language, is quite a controversial topic because of the legal questions that come with it. It is not automatically ethical to scrape a website, but we can always follow certain guidelines, and there are genuine legal issues with web scraping, so make sure you are thoroughly aware of the permissions of any site you are going to explore or scrape. Most websites have a robots.txt file where you can see the permissions, and even if a website does not have a robots.txt file, it does not mean you have permission to scrape it. There has even been a case in which Craigslist sued a company for more than 300 million dollars over this. I've found an article about this, from prowebscraper.com, that I want to show you; it's quite good and very informative about the legality of web scraping, and it describes some aspects you can check to figure out whether a website may be scraped.

For example, let's take a look at edureka.co. First we go to the website (and meanwhile, do check out some of the certifications there, like the PG certification in data science, which is quite good and affiliated with IIT Guwahati, along with many other courses). Back to the tutorial: you type /robots.txt after the domain, and there we have all the permissions; edureka.co has disallowed everything, so you cannot scrape this website. You can always run this check for any website you're planning to scrape: look for any Disallow rules in its robots.txt file, and if something blocks you, just move on to the next website you had in mind. And remember, the absence of a robots.txt file does not grant permission either; scraping can be a serious issue in some cases, so if you can't find anything in robots.txt indicating that you may scrape, either move on to the next site or ask the owner of the website for permission, which also works.

So let's move on to our example in PyCharm. First we import a few dependencies: from bs4 import BeautifulSoup, and from urllib.request import urlopen. (I made a mistake here at first: the module is urllib, not urllib3, and with that fixed it works fine.)
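Incidentally, the robots.txt check we just did by hand can be automated with Python's standard-library urllib.robotparser. In this minimal sketch the rules are fed in directly so it runs offline, and example.com is a placeholder; against a live site you would use `set_url()` and `read()` instead:

```python
from urllib.robotparser import RobotFileParser

# Feed a rules snippet directly; for a live site you would instead do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",   # the same blanket disallow edureka.co uses
])

# can_fetch(user_agent, url) answers whether that agent may crawl the URL.
allowed = rp.can_fetch("*", "https://example.com/any/page")
print(allowed)  # False: everything is disallowed for every user agent
```

Running this check at the top of a scraper, and bailing out when `can_fetch` returns False, bakes the permissions advice above directly into the code.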
Now I paste the URL we're going to scrape: quotes.toscrape.com. Let me take you to the webpage first, so it's easier to understand what we're actually scraping. The title itself says "Quotes to Scrape", and the page holds all these quotes that we can scrape inside our program and store.

Before that, a few things about the steps involved in web scraping. First, find the URL; be very specific when choosing a URL for scraping, because of the legal issues we just discussed: sometimes you don't have permission, sometimes there is no robots.txt file at all, so you never know whether you're allowed, and only when it is clear should you scrape the data. Second, inspect the page: explore the web page to see how you can scrape it and what kind of data you're trying to get. Third, find the data to extract; in our case, the quotes on the page. Once those three steps have told you what data you're scraping, from which web page, and which tags you'll target from inspecting the elements, the fourth step is to write the Python script using BeautifulSoup and urlopen, and the final step is to extract the data and store it in some format.

Back to PyCharm. We have our URL, and now I write html = urlopen(url). Then I take a variable soup and create it with BeautifulSoup, passing html and "html.parser". Next comes all_links = soup.find_all(...), where we have to specify what we actually want to scrape, so let's inspect the page. Opening the element inspector, we see each quote's text inside a <span>, within a <div> whose class is "quote", and it's the same for every quote. So we find all the divs whose class is "quote": all_links = soup.find_all("div", {"class": "quote"}). After this we take one more variable and turn it all into strings, and since we don't want the tags left in, we run it through BeautifulSoup once more, passing the strings and "html.parser", and take the text. The HTML parser is basically what gets the data into a format understandable to the user, because not every piece of text on the internet is readable as-is; sometimes it's in XML, and it has to be parsed as HTML so you can understand it better. Now we print the cleaned text and check the output: we have the text of every div, with the tags showing too. So that is how you scrape data from a website.
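Putting the quotes demo together, here is a minimal self-contained sketch. It parses an inline HTML snippet shaped like the markup on quotes.toscrape.com so it runs without a network call (with urlopen you would feed the live page instead), and it uses `get_text()` rather than re-parsing `str(...)` through BeautifulSoup a second time, which achieves the same tag-stripping more directly:

```python
from bs4 import BeautifulSoup

# Inline snippet shaped like quotes.toscrape.com's markup; for the live
# site you would use: html = urlopen("http://quotes.toscrape.com")
html = '''
<div class="quote"><span class="text">Quality is not an act, it is a habit.</span></div>
<div class="quote"><span class="text">Simplicity is the ultimate sophistication.</span></div>
'''
soup = BeautifulSoup(html, "html.parser")

# find_all returns every <div class="quote">; get_text() strips the tags.
quotes = [div.find("span", class_="text").get_text()
          for div in soup.find_all("div", class_="quote")]
print(quotes)
```

The `class_` keyword (with the trailing underscore, since `class` is a Python reserved word) is Beautiful Soup's shorthand for the `{"class": "quote"}` attribute dictionary used in the narration above.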
Similarly, I have written the same code, with only the URL changed, for another site: books.toscrape.com. Let me take you to that URL first and show you the webpage; it's basically a replica of an e-commerce platform, holding all these books we can scrape. Say you want the names of all the books on the page: inspect the element and check, and you'll see each title sits in the link inside an <h3> tag. So what I've done here is find all the <h3> elements and take the text from them, and when I run it, the output is the names of all the books on the page. That's another very simple example of web scraping you can try.

And with that, we've come to the end of the session. Don't forget to subscribe to Edureka for more exciting tutorials, and press the bell icon to get the latest updates. Also check out Edureka's Python Programming certification program and Python for Data Science certification program; the links are given in the description box below. Thank you; I hope you have enjoyed listening to this video. Please be kind enough to like it, comment any of your doubts and queries and we will reply at the earliest, look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning!
Info
Channel: edureka!
Views: 90,834
Keywords: yt:cc=on, Python for data extraction, python data extraction, data extraction in python, data extraction, data extraction tutorial, web scraping In python, extracting data from API, Data extraction from API in Python, python web scraping tutorial, Pypdf2, data extraction from pdf file, steps involved in web scraping, web scraping, web scraping guidelines, robots.txt file, requests in python, python web scraping, web crawler, python tutorial, python edureka, edureka
Id: kEItYHtqQUg
Length: 25min 21sec (1521 seconds)
Published: Mon Apr 06 2020