Tutorial - How to Scrape Data from Websites with Data Miner

Captions

Hey everybody, this is Derek, and this is a tutorial on how to extract information from a website using the freemium data scraping tool Data Miner. I often run into this problem at work where I need to get information off of a complex website, and I don't want to copy and paste with, you know, my mouse and cursor, because the information that comes out of it is really messy and it takes a lot of time to clean. There must be a better way.

Generally speaking, though, if you've got a website that's built off of a structured database, like a social media website or Wikipedia or something like that, then you'll find that the information it displays is something that a machine can more easily read. So let's take a look at my current Goodreads feed. You can see that each one of these posts that show up is structured pretty much the same way: there's a name up top, a book title, a picture of a book, a description, and this repeats over and over and over again. Machines are really good at performing repetitive, detail-oriented tasks, better than humans, and this is why people have created data scraping tools. Rather than having a human pull this information off a website, a machine can be trained to look at the website, identify the key areas of information that you want to extract, and pull from it. So let's give that a shot.

Now I'm gonna go to a website called data-miner.io, and I'm gonna load it up in Chrome, and I'm gonna click on Add to Chrome to download the Chrome extension. I'll click Add to Chrome here and authorize it to add the extension. You're basically giving Data Miner permission to read the website as you're reading it. You should get a window that pops up that says, hey, welcome to Data Miner. It'll point out that you have a brand new spiffy icon in your browser next to your URL bar, it'll tell you basically how Data Miner works, and then down at the very bottom it'll give you the option to sign in with your Google account, and you're gonna need a Google
account in order to do this. Just go ahead and say OK and log in with your account. OK, now you should be all set up and you can close this tab.

I'm gonna demo what it's like to pull information off of my social media feed, and I chose this site because it is a pretty well structured site, it's fairly simple, and it hasn't really been designed to make data scraping difficult like some social media websites have been. Let's start by clicking on the icon right next to our URL bar. It says, oops, you'd better restart the page because you just logged in with Data Miner. No problem, we'll do that, and now it should be very friendly and say, OK, here's your main screen.

Now, sometimes when you go to a website, Data Miner has a list of public pre-made recipes. A recipe is basically the list of instructions that you give Data Miner in order to pull information from a website. In this case there are a few recipes that the public has already generated for Goodreads, but we're interested in learning how to scrape ourselves, so we're gonna go ahead and click here to create a new recipe, and it'll pop up a little window on the side. Oops. All right, so now you'll see the start of a seven-step process, and bear with me, because it's actually fairly simple once you get used to it.

The first thing you have to do is tell Data Miner whether you're dealing with a list page or a detail page. A list page is probably the more common type of page that you'll have to analyze. It's when you want to extract multiple rows of information from the same page, when things are listed on a page, like posts on social media or products on Amazon. A detail page is when you want all the details of a page to be extracted into one row, so, say, you click on a product on Amazon and then want to pull in the information that shows up on that one page. In this case we're gonna do a list page. The process is fairly similar, but, you know, here's how it works. So we select list page
and then go to rows. Now we're going to try to instruct Data Miner to identify the content that we care about. Remember, our ultimate goal with this process is to get an Excel spreadsheet out of this website that we're looking at. So what we're going to do is say, OK, each row on that Excel spreadsheet will be represented by a piece of content on this page. In this case it'll be a post, an update post on Goodreads.

You're gonna click the Find button to identify a row, and then it's gonna instruct you to hover over the content that you care about using your little selector tool. You can see that my mouse pointer has been replaced with a crosshair, and every time I hover over a different part of the website it'll change color. What Goodreads, or rather what Data Miner, is doing here is looking at the code that makes up the website, looking at the different components, and trying to find what it thinks I care about. In this case I'm gonna hover over the post that I want to extract, and then I'm gonna hold down the left Shift button, or tap it once. That tells Data Miner that this is the piece of content that I care about, and you can see that there's a little dotted-line box that is now being drawn around the content. If I messed up and grabbed this other piece of content, it's no problem, you can just tap Shift again and move it around.

All right, now look on the right-hand side. It has identified a couple of different guesses about what it is that I might be looking for. There are element classes and there are HTML element types. It helps to know a little bit of HTML here, honestly, but even if you don't, you can fudge it by just looking around at the different elements and seeing what you find. Let's first click on div and see what happens. Oh, all right, this is very, very busy and not at all what we're looking for. By clicking div I basically told Data Miner to look for every single div in the code of the website, so if you
looked at the HTML, it's pulling up every box, every little frame of possible information. This is way too much information, this is not what we're looking for, there's a whole bunch of extraneous detail. I'm gonna unclick it here. Instead I'm gonna look here, and as luck would have it there's a little piece of information called gr newsfeed item. Well, that actually sounds like what we're looking for, because this is a newsfeed and we're looking for individual items. So look what happens when I give this a click. OK, now you can see that not only the first box that I selected but each additional box following it has been identified. This is what we want to see. Essentially we've identified the type of content that we care about and that we want Data Miner to pay attention to, so I feel pretty good about this gr newsfeed item. Moving forward, there are a couple of things that might not be so easy the first time, but this, in principle, is how it works. So I'm gonna just click Confirm and move on. All right, so it said, OK, I see 33 different rows here on this page, and as luck would have it, I think 33 newsfeed items have loaded, so that's good to hear.

Next we're gonna go on to the columns. If the row is the individual newsfeed item, the column is each facet of that newsfeed item that we want to capture. So let's use this as an example. I first want to capture the name of, you know, my friends on Goodreads. I'm thinking about buying them books or something like that, right? So I'm going to start with a name, and I'm going to say that I want to extract the text of the name. This is the simplest piece of information that you can extract. If you want to, you can extract a URL, so if a name has a hyperlink on it, you could get the address that you would go to if you clicked on the name. If the piece of content that you care about is an image, you could grab the image URL, that sort of thing. But we'll start with just the name. OK, we're gonna click Find, we're
gonna go through the exact same process, where we'll hover over the name that we're looking for, and now we see that we have a bunch of different options here. I could click gr hyperlink. No, again, that's selecting a bunch of content that I don't care about or that I don't want to capture right now: the prints at night is clearly the book title, daniel and hack, that's clearly the author, and that's not what I'm looking for. I'm just looking for the stuff that shows up in this little box area right up here. So I'm gonna unclick it. I'm gonna try... no, this doesn't seem to work. gr user profile link, now that looks pretty good. That's the user, that's what we care about, and if we click it and go down we see that this, excuse me, user name is repeated. So I'm gonna say yes, this is exactly what we're looking for, and I'm gonna hit Confirm.

Now, something interesting has popped up, because it's previewing 47, and I seem to recall getting 33 rows. It might be worth checking to see what is popping up here. I'm gonna click on Preview, that little eyeball, and this actually looks pretty good to me. It may be that as I scrolled down I queued up more things to load, but that's OK. What matters is that this is a preview of the content that shows up, so in the first row under name, my friend's first name will show up. Here we go, second, third. This looks accurate to me, I think this is what I want.

OK, next I'm gonna add a new column, and this time I'm going to extract the book title. Excuse me, sorry about that. Again, I'm just gonna get the text. I'm gonna go to Find, hover over, and it looks pretty clear here: book title link. Yep, great, that's what I'm looking for. Hit Confirm and again Preview, and here's a list of book titles. OK, this is going pretty well.

What if we want to know whether our friend is looking to buy something or is currently reading it? Right, like in our demo use case, where we're trying to think of presents to buy our friends, it'll be useful to know if somebody actually
wants to buy a book as opposed to if they've already gotten it. So we'll start another column. I will click Find and hover over "wants to read", and then we'll look at span. OK, this doesn't look right, and it's also a little frustrating because it doesn't look like we have a lot of options here. If I'm clicking span, it's also showing "by", it's showing the Goodreads logo, it's showing a bunch of other things that are, you know, qualified as an HTML span tag.

Is there something I can do? One thing you can do is mess around and choose a sibling element, basically try to find a nearby piece of content. If I click up on this button, it'll make a guess at the content and say, maybe you meant this specific thing. In this case, choosing a sibling element seems to have isolated the piece of information that we care about. It's not perfect, because it's also showing how a book rating occurred, and I have to tell you, at some points when you're scraping information this happens and you say, OK, whatever, I'll clean it in post. I'm gonna try real quick to see if I can isolate it even further. No, because now it's only selecting "by" in some area, so I'm gonna click no. I'm also gonna try selecting a parent. No, that doesn't seem like something I want either, because that just gets me up to selecting more. So in this case I'm gonna accept a little bit of incorrect data, and let's just see what it looks like. Actually, this has lucked out pretty well, because it says "is currently reading" instead of showing the star rating for the book. Because they're stars, and because I'm just extracting text, I'm good. This is actually a pretty clean piece of information. All right, so I'm gonna call this title action.

I'm gonna do one more column just to demo the principle, and in this case I want to... actually, maybe I'll do two. Book image, and here I'm gonna grab the image URL. Click Find, hover over the book title, and then gr book title, or book image large, and here
I get a list of URLs. This is great. And then say that I want to get a list of URLs for books to go to. I'll call this a local URL, and I'll extract the URL. Click Find, and now I'll hover over the thing itself. Mmm, nope. I'll go for book title hyperlink. Yes, OK, that's what I'm looking for, and this should give us a bunch of book hyperlinks. Great. OK, I could extract more information here, but this is the data that I'm gonna get. I'm gonna move on to step four and see what I can do.

All right, now, if I'm looking to scrape multiple pages of information, that's another thing that I can do. This is particularly useful if you're loading up a site that gives you, like, ten results, and then you have to load up the next ten results, and then the next ten results after that. That's useful sometimes. That's not how Goodreads is structured, so I'm going to move on, but generally speaking, the way you do it is you click Nav, you find the Next button, select it, and then you can test that. I can also say at this point that I want to scroll to the end every three seconds, and this will get me some more information. Let's see if that works for me, because that seems to be the way that Goodreads loads: you scroll to the bottom, it thinks for a second, and it queues up more content, so it behaves a little bit like Twitter in this case. You also have the option to load up JavaScript. I never actually run JavaScript, because most of the sites that I scrape are pretty simple, but you have that option as well. It's more of an advanced function.

And then I say save. OK, so "Goodreads book", it's great, and hit Save. I'm gonna run this recipe and see what happens. OK, so it's gonna load up the initial page, and you can see that it's scraping and moving down the website, reviewing things as it goes, and it's gonna pull up information. Let's see what happens. Bam! OK, so it extracted 117 rows here, and I imagine that if I wanted to keep going I could continue scrolling down and load stuff up, or do the
pagination, or try to find some repeating way to gather this data. Because I'm just showing you the way that this script works, I'm just gonna leave it here, but let's see the data that we extracted. Here's a preview: we got a name, we got a book, we've got an action, we got book assets, and we got a book URL. We seem to be in pretty good shape.

OK, so I'm gonna click Download. I can either copy this to the clipboard or I can save it as an Excel spreadsheet. Rather than save it, I'm just gonna open this with Microsoft Excel. We'll see it pop up, and there you go: now you've got a pretty clean Excel spreadsheet that you can then use to change the world.

OK, so thanks for watching. This is a little bit longer of a video than I expected to make, but I think it goes pretty deep into detail about how you can use scraping tools like Data Miner to pull structured information off of a structured website, and hopefully this saves you some time as you're out there researching the world and trying to pull information from the websites that you visit. Thanks for watching, and good luck.
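
For readers who think more easily in code, the row-and-column recipe built in the video can be sketched in Python roughly as follows. This is only an illustration of the idea, not how Data Miner itself works internally: the sample markup, the class names (`newsfeed-item`, `user-profile-link`, `book-title-link`, `book-image`, `action`) and the helper names `extract_rows` and `to_csv` are all made up for this example, loosely modelled on the labels mentioned in the video, and are not real Goodreads markup. It uses only the standard library and assumes a small, well-formed snippet; a real page would usually need a proper HTML parser.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a feed page (NOT real Goodreads markup).
SAMPLE_PAGE = """
<html><body>
  <div class="newsfeed-item">
    <a class="user-profile-link" href="/user/1">Alice</a>
    <span class="action">wants to read</span>
    <a class="book-title-link" href="/book/10">Example Book One</a>
    <img class="book-image" src="/img/10.jpg" />
  </div>
  <div class="newsfeed-item">
    <a class="user-profile-link" href="/user/2">Bob</a>
    <span class="action">is currently reading</span>
    <a class="book-title-link" href="/book/20">Example Book Two</a>
    <img class="book-image" src="/img/20.jpg" />
  </div>
  <div class="sidebar">Not a feed item</div>
</body></html>
"""

# "Rows" step: one spreadsheet row per element carrying this (assumed) class.
ROW_CLASS = "newsfeed-item"

# "Columns" step: column name -> (tag, class, what to extract).
# "text" extracts the element's text; anything else names an attribute.
COLUMNS = {
    "name": ("a", "user-profile-link", "text"),
    "book": ("a", "book-title-link", "text"),
    "action": ("span", "action", "text"),
    "book_image": ("img", "book-image", "src"),
    "book_url": ("a", "book-title-link", "href"),
}


def extract_rows(page):
    """Return one dict per row element, with one key per configured column."""
    root = ET.fromstring(page.strip())
    rows = []
    for item in root.findall(f".//div[@class='{ROW_CLASS}']"):
        row = {}
        for column, (tag, cls, what) in COLUMNS.items():
            element = item.find(f".//{tag}[@class='{cls}']")
            if element is None:
                row[column] = ""  # column missing in this row
            elif what == "text":
                row[column] = (element.text or "").strip()
            else:
                row[column] = element.get(what, "")
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(COLUMNS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_PAGE)
print(to_csv(rows))
```

Selecting by the row class rather than by the bare `div` tag is the same choice made in the video: matching every `div` would also pick up the sidebar, while the class narrows the match to the repeating feed items.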

Info

Channel: Derek Caelin

Views: 37,594

Rating: 4.9330144 out of 5

Keywords: data miner, scraping, tutorial, guide, data, extraction, excel, csv, dataminer, data scraping, websites, information, free, freemium, easy

Id: Zrq5E0zagGw

Length: 18min 22sec (1102 seconds)

Published: Wed Dec 04 2019