Web scraping simple tutorial - with Ruby

Video Statistics and Information

Captions

Hey guys, I hope y'all are staying safe during this crisis that we're all experiencing. Today I'm gonna make a really quick video about how to scrape a website and extract data, and as a bonus I'm also including how to simply store this data inside of a CSV. So let's go ahead and create the file, scraping.rb.

Alright, so the very first thing we want to do is require what we are going to use. The first thing is open-uri, which will allow us to access any URL, any website that we want, through our computer. We're also going to require nokogiri, which will allow us to access every node on the website that we're going to scrape. And finally we need to require csv.

The first thing you need to know is that all of the things I'm gonna do are available in the documentation, OK? I'm not making it up, I'm not guessing. You just need to go through the documentation. It's not really learning by heart: I've done it several times so I know most of it, but I still don't remember everything, so I still need to go through the documentation and look for the methods that I need to use.

So let's create a method called scraping, which will take in a parameter, which will be the URL. In this case we're just gonna scrape the website Etsy, but you can make it dynamic in the future. We'll create a variable and assign it whatever website we are opening: we pass it the parameter and we read it. Alright, so this variable here contains the website that we're gonna pass to the method once we call it. The next step is to take this website and convert it into Nokogiri, right? We want every node to be part of Nokogiri.

So for example, let's print this here, just so I can show you what it looks like. We just pass in the URL, that's it, as an argument, and run our script. So what is happening right now? There are a lot of things, it's very, very confusing. It's scary, isn't it? But don't get scared, because this is not what we're actually gonna look at. It would be way too messy, and thanks to Ruby we have methods that take care of this for us. We're gonna automate everything. This is the website, alright, this is the page that we passed as an argument, and we see all of its HTML content here, all the nodes.

Just before I go ahead: if you are still practicing with scraping, the best thing to do is to download the web page that you're going to scrape, so that you don't get blacklisted by the website. Because if you do something wrong, for example you run a loop and you keep doing many GET requests on their server, they will block you, meaning they will think it's a robot or whatever. You can download the web page from the terminal by running curl: in quotation marks you enter the URL, and it will download it into a file. You give the file name, of course.

So that's it for the HTML. Now we want to convert it into Nokogiri elements. We'll create another variable, let's call it nokogiri block, and this one we will assign the Nokogiri parser. This is also in the documentation, I'm not making it up. And there we go. As an argument this takes the website, which is the variable we created before and assigned the website that we are scraping. If you downloaded it, it's directly the file path that you pass as an argument. Alright, and we want an empty array in which we are going to store everything we scrape. Nice.

Now let's get to the Nokogiri built-in methods. What we're gonna do is call the variable that we created, which contains all the nodes, and call the search method on it. The search method is going to look for any nodes on the website that have a certain class. So in this case we're going to look at the website, inspect it, and find the classes that are the parents of the elements we want. We'll pass them as arguments, and Nokogiri is going to search for all of these classes. So we need to iterate through it. I don't need to explain how that works.
Awesome, so this is where the magic happens. We're going to iterate through each node based on the classes. So you go on the website and inspect it, and what we are going to scrape for this demo is simply the title here. If you hover with your dev tools, you can see that it's inside of a listing... no, sorry, it's inside of listing card info, right? This is the parent div that we're going to get, which is more or less precise. So let's copy and paste this class and pass it as the first argument. You can put as many arguments as you want. On its own that gives us everything that's inside of this div, which is not what we want. So if you keep looking, you can see that the title is of h3 type, and there is no other h3 in there, so let's simply scrape this one.

Then we simply say that the element, which is the parameter that we're sending to this each, is equal to the element's text, which is another method from Nokogiri that gives us the content of the node. If we don't do that, it simply returns the entire node, which is not what we want either, because we want to store it in a CSV, and it wouldn't look nice if it says h3 and just prints everything. So what has happened now? We got the content of each node that has this class and is of this type. The next step is to store it in our array, right? That's pretty straightforward, and that's pretty much it, actually, for the scraping part.

So now we've scraped it and we have everything inside of our array, and we want to be able to see it. So let's iterate through the array so that we can print it in the terminal, just a simple each on the array. Cool. Let's take this at index zero, since it's an array itself. Alright, so let's try calling it and pass it the URL. There we go: you can see that we have 64 titles printed, and these are all the titles, all the elements that we scraped, right? So that's amazing, actually, we just scraped our first website. Now what we want to do is actually be able to use this data, so I'm gonna store it first, and we're going to store it in a CSV file.
Alright, so now that we actually have the content that we scraped inside of an array, we want to take everything from this array and store it in a CSV so that we actually keep the data, right? It's kind of useless if it's just in the terminal: it only lives in Ruby's memory, and it's gone once the program is over. So I assigned the result of the method call to a variable called script, and we're going to iterate through this variable and store everything inside of that CSV file. So I need a file path variable, and then we want to pass some CSV options to have headers, so you simply say headers, first row [unclear].

Cool. So let's call CSV's method, which is simply open. We pass the file path, we want to overwrite whatever is inside, and we also wanna pass in the CSV options that we looked at here. So let's iterate through the CSV file. The first thing we want to do is put the names of the headers inside of the CSV file. The first one is going to be title, and the second one is going to be the index. Let's put it with the index, it's nicer, it looks organized. Then we just iterate with the index through the variable that we declared here, and we push each item inside of the CSV, after we push the headers, obviously. And that's actually pretty much it, guys.

So let's try it out. We might get some errors. Let's see if we actually created a file. We can see that there is a file called Etsy, so let's open etsy.csv, and there you go: you can see that we have everything here inside of a CSV file. So that's awesome, guys. We just made our first scraping method and we stored everything inside of a CSV. Thank you so much for watching, and best of luck with your learning journey.
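The CSV step can be sketched roughly like this, using only Ruby's csv library. The method name save_titles, the sample titles, and the output file name are hypothetical, and the sketch leaves out the extra CSV options mentioned in the video because the captions do not make them clear; "wb" is assumed as the mode that overwrites existing content.

```ruby
require "csv"
require "tmpdir"

# Hypothetical helper: writes a header row, then one row per scraped title.
def save_titles(titles, filepath)
  # "wb" opens the file for writing and overwrites whatever is inside
  CSV.open(filepath, "wb") do |csv|
    csv << ["title", "index"] # headers go in first
    titles.each_with_index do |title, index|
      csv << [title, index]
    end
  end
end

# Written to the system temp directory so the example leaves no file behind
# in the project; any writable path works.
filepath = File.join(Dir.tmpdir, "titles.csv")
save_titles(["Handmade mug", "Wool scarf"], filepath)
puts File.read(filepath)
```

Once the rows are on disk they outlive the Ruby process, which is the point of this step.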

Info

Channel: CodeWithHassan

Views: 2,087

Keywords: ruby, programming, learn programming, hacking, web scraping, data extraction, hack, programming tutorial, coding, coder, programmer, web development, learn coding, ruby on rails, csv, web developer, python, nokogiri, openuri, curl, terminal

Id: NwmlUXZahmk

Length: 13min 15sec (795 seconds)

Published: Sun May 24 2020