Web Scraping Using Scrapy Tutorial For Beginners: Learn Scrapy From Scratch

Video Statistics and Information

Captions
So, what does web scraping mean? Well, before you ask me this question, I have a question to ask you: have you ever wondered how Google fetches data from the entire World Wide Web and indexes it in its search engine? Surprisingly, that's what we call web scraping. It's the process of extracting data from websites in an automatic fashion, through what we call web spiders or web crawlers. It's like visiting someone else's website and copying and pasting its content, and that's exactly what Google does, except it's not manual, it's automatic.

Now, out there in the market there are plenty of tools you can use for web scraping, like HTTrack and wget, but these tools only help you download an entire website, not extract specific parts of the data. For example, let's say you want to scrape all the tutorials from this web page. If you use HTTrack or wget, you won't be able to extract specific parts of the web page, like the tutorial title and description. You won't be able to export the data to a database, for example, and you won't be able to structure the data; I mean, you won't be able to fully control the data. The only advantage I can see in these tools is the ability to download an offline version of a website; they are meant for basic tasks only. Now, enough about those: you are a developer, you should build your own.

So a more robust solution is to use a web crawling framework called Scrapy. Scrapy is different from the tools we just explored. It's the number one framework that will let you create your own spiders. Not only that, you will be able to extract or select specific parts of a web page using what we call selectors, like CSS and XPath selectors, and not only from one page but from the entire website, in a structured way.

Now you could ask me: can you explain how Scrapy is different, how it really works? Well, to answer this question, you first need to know the core components built into Scrapy. We have the engine, the spider that you create, the downloader middleware, the item pipeline, and the scheduler. All of these together, cooperating with each other, perform what we call web scraping.

Now let's see the flow of interaction between these components. First, the engine receives an HTTP request sent by the spider. Then the engine delivers that request to the scheduler. Now, why did it transmit the request to the scheduler first and not to the other components? Because the scheduler is responsible for tracking the order of requests: what comes first is served first. If there isn't another request with higher priority, it returns the same request back to the engine, and the engine takes that request and sends it to the downloader middleware to generate a response back to the engine. When the engine receives that response, it sends it to the concerned spider for processing. Finally, the engine, which is the coordinator, passes the items the spider produced to the item pipeline, which handles the specific parts of the data that you told it to extract. All of these interactions happen in a matter of seconds, and this is why Scrapy is attracting more and more interest from developers over time.
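To see how that flow looks from the developer's side, here is a minimal sketch of a spider. The spider name, URL, and field below are placeholders made up for illustration, not from the video; the point is that you only write the spider class, while the engine, scheduler, downloader middleware, and item pipeline run behind the scenes every time you yield a request or an item.

    import scrapy

    class PreviewSpider(scrapy.Spider):
        # "name" is what you use to launch the spider from the command line
        name = 'preview'
        # the engine turns each start URL into a request and hands it to the scheduler
        start_urls = ['https://example.com']

        def parse(self, response):
            # the downloader middleware fetched the page; we receive the response here,
            # and anything we yield travels on to the item pipeline
            yield {'title': response.xpath('//title/text()').extract_first()}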
Now, the last question I want to answer is why companies, and people like you and me, do web scraping. Well, of course we do web scraping to perform data analysis on the extracted data, and through data analysis we can predict the future. For example, maybe you want to do lead generation, like building an email list for marketing purposes. Maybe you want to monitor the prices of a specific product to get the right deal. Maybe you want to build an application that showcases some projects. There are plenty of fields where web scraping is indispensable, and this is why companies today are searching for experts who really shine at web scraping.

Now, to follow along with this course, you should have Python 3 installed, along with the package manager pip, so we can install Scrapy as I'll show you later. A virtual environment is not required, but it's recommended. And of course you should have a code editor. I'll be using VS Code, a super fast, cross-platform code editor that I really like, but you can choose whatever editor you want, as long as you are familiar with it.

Now, before I show you how to install Scrapy, I'm going to set up a virtual environment first; you can skip this step if you want. So: virtualenv virtual_env, which creates a new Python executable that is specific to our project. Now let's cd into that folder, cd virtual_env, and then activate it: source scripts/activate, press Enter. I'm using Git Bash; if you are using the regular Windows command line you should use backslashes instead of slashes, and on Linux you should replace scripts with bin in order to activate the virtual environment.

Now let's see how to install Scrapy. I'm already on Scrapy's official website, which is scrapy.org, and at the time of recording this video Scrapy is at version 1.5. To install it, we can copy the pip install command from the site and paste it into the terminal, then press Enter. This will take some time, so I'll cut the video. Good.

Now let's create a new project: scrapy startproject, and then the project name, let's say demo_project. Basically, this sets up the project template for us. If we type ls we should see the project folder, so let's cd into it: cd demo_project. Now, in VS Code, let's open the project we've just created: File, Open Folder, it's inside the virtual_env folder, click once on demo_project and then Select Folder.
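For reference, the project template that scrapy startproject generates follows Scrapy's standard layout, roughly like this (a sketch; the video doesn't walk through it file by file):

    demo_project/
        scrapy.cfg            # deploy/configuration file
        demo_project/         # the project's Python module
            __init__.py
            items.py          # item definitions (we edit this later)
            middlewares.py
            pipelines.py
            settings.py
            spiders/          # our spiders live here (jokes.py will go in this folder)
                __init__.py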
Now, in the next video we will explore the website that we'll be scraping, and we'll define what we'll be extracting using XPath selectors. Let's say we want to scrape this website, which offers random jokes. Basically, what we want to do is extract the joke text from all the pages available on this website, and the first step is to figure out how we can select specific parts of the web page, because, as I said, we only want to get the joke text. The way we do that is through what we call CSS and XPath selectors. CSS selectors are the same ones we write to target, for example, a specific tag in the HTML markup in order to style it. I'm not going to go through them, but I'll give you a brief introduction to XPath, because that's what we're going to be using.

So XPath is a query language used to select nodes in XML documents. Maybe you could ask: hey, we are dealing with web pages written in HTML, and in the definition you said XPath selects nodes from XML documents. Well, surprisingly, yes, we can also use XPath with HTML web pages, and trust me, it's more powerful than CSS selectors once you get familiar with it. Now, at the time of recording this video, XPath has four versions: XPath 1.0, introduced back in 1999, XPath 2.0 in 2007, XPath 3.0 in 2014, and XPath 3.1, introduced recently in 2017. I only include this to make you aware that web browsers support only XPath version 1.0, so when you read the documentation or use certain functionalities, make sure they belong to the appropriate version.

Now I'm going to give you a summary of how to write XPath expressions. Let's say we have an HTML document and we want to extract the anchor tag; in other words, how can we select the anchor tag that is a direct child of the div tag? To do that, you write //div/a, which means: select all the anchor tags that are direct children of a div tag, and executing this expression will return the full markup of the anchor node. Now, maybe we are interested only in the text content of that anchor node. In this situation, you can use the same XPath expression and add the text() function at the end to return only the text content: //div/a/text(). Let's explore another example: maybe you want to get the value of the href attribute. This time, instead of using the text() function, we write an @ sign, so XPath understands we are targeting an attribute, and then the attribute name: //div/a/@href. More importantly, I want to show you how to select any type of node based on its id or class attribute. To do that, you write a double slash, then the target node (say, the paragraph tag), then two square brackets to define a predicate, and inside that predicate you write the @ sign, id, and the target id value: //p[@id="some-id"]. If you want to select by the class attribute, you only need to replace id with class.

Now, enough theory, let's explore a real-world example. As I told you before, we want to get the joke text from all the jokes, and first we need to see where a joke fits in the HTML markup. I'm going to press Ctrl+Shift+I to open the Chrome developer tools, click on the inspection tool, and select a joke. The first thing we notice is that all the jokes are wrapped within a div with the class attribute set to jokes, and inside that div we have another div with a class attribute set to joke-msg, then another one with a class attribute set to joke-text, and then a paragraph tag that contains the actual joke text. Now let's press Ctrl+F to open the search box; this is where we will test our XPath expressions. To select all the jokes, we write //div[@class="jokes"]. Look, this XPath expression returned twelve nodes, in other words twelve jokes. Now let's get the actual joke text: we can target the paragraph tag inside the div with a class attribute set to joke-text, so I'm going to replace jokes with joke-text and then add /p, giving //div[@class="joke-text"]/p. As easy as that. Now, as an exercise, I want you to get the href attribute of the next button, and in the next lecture, when we build the spider, I'll show you the solution.
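To make these patterns concrete, here is a small sketch using Scrapy's Selector on an inline HTML snippet. The markup below is invented for illustration, borrowing the joke-msg and joke-text class names we just found in the dev tools:

    from scrapy.selector import Selector

    html = '''
    <div class="jokes">
      <div class="joke-msg">
        <div class="joke-text"><p>Why did the spider cross the web?</p></div>
      </div>
    </div>
    '''
    sel = Selector(text=html)

    # full markup of the paragraph node
    print(sel.xpath('//div[@class="joke-text"]/p').extract_first())
    # only its text content, using the text() function
    print(sel.xpath('//div[@class="joke-text"]/p/text()').extract_first())
    # selecting a node by its class attribute with a predicate
    print(sel.xpath('//div[@class="joke-msg"]').extract_first())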
Now let's build the spider that will extract all the jokes. I'm going to create a new file inside the spiders folder and name it jokes.py. Let's import the scrapy module. Spiders in Scrapy are defined as a class, and inside that class we should have a unique class attribute called name; basically, it lets us call or launch the spider from the command line. Obviously, we build spiders to extract data from websites, and the way we tell the spider about the target website is through the start_urls list. More importantly, we specify what data we want to extract through the parse method.

Now, something I want you to be aware of is that in Scrapy we have five types of spider classes, which define the manner in which we scrape or crawl the web. I'm not going to go through the details, but just remember that we have the scrapy.Spider class, the CrawlSpider, the XMLFeedSpider, the CSVFeedSpider, and the SitemapSpider. The one we'll be using is the scrapy.Spider class, which some people refer to as the base spider class, so our spider class, the jokes spider, must inherit from the scrapy.Spider class.

Now, in the jokes.py file, let's define a new class called JokesSpider that inherits from scrapy.Spider. Let's add the class attribute name, so name = 'jokes', and then let's define the start_urls list, and inside it I'm going to paste the website URL as a string. More importantly, let's define the parse method: def parse, which takes self as the first argument and, as the second argument, the response sent back by the downloader middleware. Now, we know we have a list of jokes, not only one, so we need to iterate through them: for joke in response.xpath(...). The xpath method takes the XPath expression as a string and returns a selector object, so between the double quotation marks we put //div[@class="jokes"]. If you want to use a CSS selector, you can replace xpath with css.

Now, Scrapy returns the scraped items as a dict, so we yield two curly braces where the key is joke_text and the value is extracted from the joke selector: joke.xpath('//div[@class="joke-text"]/p'), and then we call the extract_first method to return a string. More importantly, if we leave this XPath expression as it is, it will always return the first joke text that occurs in the HTML web page, so to fix this we need to add a period at the beginning of the XPath expression to limit the scope to the joke selector being iterated: './/div[@class="joke-text"]/p'.

Finally, let's deal with the pagination. Outside the for loop, I'm going to create a new variable that will hold the href attribute value of the next link: next_page = response.xpath('//li[@class="next"]/a/@href').extract_first(), using @href to get the href attribute value. The next step is to check that next_page is not empty, because obviously when we reach the last page next_page will be set to None. So: if next_page is not None, then next_page_link = response.urljoin(next_page) to join the website URL with the next_page variable. Finally, we need to send another request with the new next_page_link, so we yield scrapy.Request, setting the url to the next_page_link variable and the callback to self.parse.

Now let's launch this spider: scrapy crawl jokes (the name is jokes), followed by -o data.json to save the data in a JSON file. In our project explorer we will see that a new file called data.json has been created, and inside it we have all the jokes from all the available pages. We've scraped the jokes successfully, but as you can see we still have the HTML tags. We could fix this with XPath by adding the text() function, as I showed you in the slides, but I want to show you how to do it using Scrapy.
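Putting the walkthrough together, the finished jokes.py should look roughly like this. It is reconstructed from the narration, and the start URL is a placeholder, since the jokes website isn't named here:

    import scrapy

    class JokesSpider(scrapy.Spider):
        name = 'jokes'
        # placeholder: substitute the jokes website used in the video
        start_urls = ['http://the-jokes-website.example/']

        def parse(self, response):
            # each selector in this list is one joke block
            for joke in response.xpath('//div[@class="jokes"]'):
                # the leading period limits the scope to the current joke selector
                yield {
                    'joke_text': joke.xpath('.//div[@class="joke-text"]/p').extract_first()
                }

            # pagination: follow the "next" link until there is none
            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page is not None:
                next_page_link = response.urljoin(next_page)
                yield scrapy.Request(url=next_page_link, callback=self.parse)

Launched with scrapy crawl jokes -o data.json, this writes every joke from every page into data.json.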
So, inside the items.py file, let's create a new item class; actually, let's rename the default one to JokesItem. Within that class, let's add a new field that will hold the joke text: joke_text = scrapy.Field().

Now, back in the jokes spider, let's import the JokesItem class we've just created, so from demo_project.items import JokesItem, and then let's import another class called ItemLoader, so from scrapy.loader import ItemLoader. In the parse method, within the for loop, let's instantiate the ItemLoader class: l (for loader) = ItemLoader. We need to tell it about the target item class, so item=JokesItem(), and as the second argument we pass the selector object: selector=joke. Just underneath, we call l.add_xpath; it's similar to the xpath method we covered in the previous video, except that this one takes the field name as the first argument, so between two quotation marks we pass the field name joke_text, and as the second argument we add the XPath expression, so let's copy the previous one and place it inside the add_xpath method. Now, instead of returning the items as a dict like before, we call yield l.load_item(). Good. If you launch this spider, you will get the same output as before, because all we did is return the items, which are the jokes, using the ItemLoader class.

To clean the data, we're going to need what we call input and output processors. Back in the items.py file, let's import two classes we'll work with: from scrapy.loader.processors import MapCompose, TakeFirst. Just underneath, let's import a utility method that will help us remove the HTML tags: from w3lib.html import remove_tags. Now we can declare input and output processors as arguments inside the Field class, so input_processor=MapCompose(remove_tags). MapCompose, as I said, is a class that takes methods as arguments and applies them to the data being scraped; we want to remove the HTML tags, so we pass remove_tags. Notice here that we are not calling the method with parentheses; instead, we are passing it as a reference. Now let's add the output processor: output_processor=TakeFirst(). TakeFirst is a class that does the same job as the extract_first method.

Let's test this further: scrapy crawl jokes -o data.csv to save the data in a comma-separated values file. Now, in the project explorer, let's open data.csv. Good, we don't see any HTML tags, but surprisingly we still have some whitespace. We can fix this by adding a custom function to the MapCompose class, so outside of the JokesItem class let's create a function called remove_whitespace. It takes a value as an argument and returns value.strip(); strip will remove any surrounding whitespace from the data being scraped. Now, back in the MapCompose class, let's add remove_whitespace. Let's relaunch the spider, changing the output file to data_clean, and open that new file. Beautiful, we no longer see any whitespace.
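For completeness, here is roughly what the finished items.py and the updated spider look like after these changes. This is reconstructed from the narration, so treat it as a sketch, and the start URL is still a placeholder:

    # items.py
    import scrapy
    from scrapy.loader.processors import MapCompose, TakeFirst
    from w3lib.html import remove_tags

    def remove_whitespace(value):
        # strip the stray whitespace left around each scraped value
        return value.strip()

    class JokesItem(scrapy.Item):
        joke_text = scrapy.Field(
            # passed as references (no parentheses), applied to each scraped value
            input_processor=MapCompose(remove_tags, remove_whitespace),
            output_processor=TakeFirst(),
        )

    # jokes.py
    import scrapy
    from scrapy.loader import ItemLoader
    from demo_project.items import JokesItem

    class JokesSpider(scrapy.Spider):
        name = 'jokes'
        start_urls = ['http://the-jokes-website.example/']  # placeholder URL

        def parse(self, response):
            for joke in response.xpath('//div[@class="jokes"]'):
                l = ItemLoader(item=JokesItem(), selector=joke)
                l.add_xpath('joke_text', './/div[@class="joke-text"]/p')
                yield l.load_item()
            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page is not None:
                yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

Relaunching with scrapy crawl jokes -o data_clean.csv now yields clean, tag-free joke text.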
Info
Channel: Human Code
Views: 99,137
Rating: 4.8369493 out of 5
Keywords: Scrapy, Web Scraping, Python Web Scraping, Python Scrapy, Scrapy Spider, XPath
Id: Wp6LRijW9wg
Length: 23min 37sec (1417 seconds)
Published: Sat Sep 15 2018