HELIUM for simple DYNAMIC web scraping with Python

Captions
Hi everyone, and welcome. John here, and in today's video we're going to be taking a look at Helium for Python. In some of my previous videos we've used Selenium to load web pages for us and extract data, and although this isn't always a perfect solution, it can work very well when we're trying to get data from certain websites that could otherwise be quite difficult to scrape. It's also a very powerful tool for automating a web browser to do repetitive tasks.

Now, what Helium is: it's actually built on top of Selenium, and its aim is to make it lighter and, more importantly, easier to use. This is one of the reasons why I wanted to demo it, show how it works, and show how we could use it for web scraping in particular. With Selenium we relied heavily on CSS selectors, IDs, and maybe even the full XPath to navigate our way around the page, but Helium simplifies it all for us, meaning hopefully shorter scripts that are easier to understand and create. You can also mix commands from the two together if you can't quite get what you need from Helium, which is a cool feature. It also comes with the necessary geckodriver and ChromeDriver already included, so there's no need for you to go out and get those separately and make sure they're on your PATH, which is a really nice touch. I haven't tested it with loads of different versions, but I'm running the latest versions of Chrome and Firefox on my machine and it's been absolutely fine.

Okay, so let's get into it, and first things first, let's read the docs. Here is the Helium for Python GitHub page, and this just runs through some of the basic things that I've talked about already. ChromeDriver and geckodriver come with it, so you don't have to install them. It talks about the waits that it does automatically for you and about how to install it, and a bit further down there's a cheat sheet. This is the page that tells us the basic commands we need to know: how to import, how to start either of the browsers, and, one of my favorite features, how easy it is to run headless. That basically means the browser runs in the background and doesn't actually show on your screen, which is pretty useful once you've got everything down and don't need the browser to pop up. It talks about how to interact with a website with these really simple commands, which is quite cool: write, press, click, go_to. A bit further down, another thing that I really liked is how to interact with elements: you simply use this S( command with a hash for an ID or a dot for a class. And then it talks about how you can actually combine Helium and Selenium commands as well.

So what I'll do now is run some of this so you can see how it works, and then I'll show you a couple of short scripts that I've already written. If we make this a bit bigger, open up the Python interpreter, and do what we just saw, from helium import everything, and then start_firefox, this should load up our browser for us. There we go. Now if we do go_to and type the URL, let's say youtube.com, we can see that it loads the page for us. The other command that was quite cool was click. So if we go to click, and let's say Trending down the side, we can simply type in the name of the button and it will do everything for us to find the element. If this was Selenium, you might need to find the actual identifier or something, but here, because there is a button called Trending, if we just do this it should click on that for us. There we go, it's loaded up. So you can see already how it's going to be a lot easier to use and much simpler for basic things.

So let's type something into the search bar. We can just do write and then what we want to type in, let's see if I can spell my own name, and then where we want to put it is into equals. You can see here it says Search, so we can literally just type Search and that will find that box for us and hopefully write that in there. There we go. And then we can just press ENTER, or you could click on the button if it was a Go button or something like that. So you can see how simple and easy it is to use.

I've got a couple of scripts written already. This one goes to the Steam store, filters by the specials, and then brings back all the titles. We've got the import and the start_firefox, and we can put the URL we want straight in here instead of having to start and then do go_to. I've then got a press PAGE_DOWN to page down the page, and then I'm creating a games list with the find_all command. This is the S we talked about earlier, with .title, and I'll show you where that comes from in a minute. Then there's some basic list comprehension to get the element text from each element we get back from find_all, and then we print it out. So I'll just run this now and we'll see it working. Okay, so: start Firefox, go to the page, page down, and there's our list. This is a really long list. We could page down more times, or maybe there's a better solution for it, but I found page down works just fine for me, and that's our games list.

If we make this a bit bigger and go to inspect element, you can just see down here, it's quite small, that it's a span whose class is title, and that holds the name. In the documentation, to get a class you just use a dot; if it was an ID you would use a hash here instead. So that's gone out and got all those names for us. This was running without the headless option, but if we put that in, headless is equal to True, like that, with a capital T, and run this now, we won't see anything pop up but we should get exactly the same list back. This is a really cool feature; I think I will probably use this quite a lot. There we go, it's gone out and done it, nice and easy.

I have another one here which is basically the same. You can see I've got the hash for the video title for the YouTube channel, which is my YouTube channel, on my video list, and it does exactly the same thing: it goes out and gets the video titles. Again we're running headless, so nothing's coming up, and we should get all the names back. Here we go, nice and straightforward.

So that's it, guys. Hopefully you found this useful in some way, and maybe you can use Helium for one of your next browser automation or scraping tasks, or maybe you just learned something. Let me know in the comments below what you think, and whether you think you could replace Selenium with Helium and make your scripts shorter. Cheers guys, bye.
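
The interpreter session and the Steam script described in the captions can be sketched roughly as below. This is a minimal sketch rather than the script shown in the video: the function names (element_texts, demo_search, scrape_titles) are made up for illustration, the Steam URL in the comment is an assumption because the captions never read it out, and the "Trending" and "Search" labels are the ones mentioned in the video, which the sites may have changed since. The helium calls themselves (start_firefox, go_to, click, write, press, find_all, S, kill_browser) are taken from the library's cheat sheet.

```python
def element_texts(elements):
    """Return the non-empty visible text of each helium element.

    Each item returned by helium's find_all(S(...)) exposes the underlying
    Selenium element as .web_element, whose .text holds the visible text.
    """
    texts = (el.web_element.text.strip() for el in elements)
    return [text for text in texts if text]


def demo_search(query):
    """Replay the interpreter session: open YouTube, click, search."""
    # helium is imported inside the functions so that element_texts()
    # can be used (and tested) without a browser or helium installed.
    from helium import ENTER, click, go_to, kill_browser, press, start_firefox, write

    start_firefox()               # opens a visible Firefox window
    try:
        go_to("youtube.com")
        click("Trending")         # label as seen in the video; may have changed
        write(query, into="Search")
        press(ENTER)
    finally:
        kill_browser()


def scrape_titles(url, selector, page_downs=1, headless=True):
    """Load url, page down a few times, and collect text for a CSS selector."""
    from helium import PAGE_DOWN, S, find_all, kill_browser, press, start_firefox

    start_firefox(url, headless=headless)  # URL passed directly, no go_to needed
    try:
        for _ in range(page_downs):
            press(PAGE_DOWN)      # scroll so more results are on the page
        return element_texts(find_all(S(selector)))
    finally:
        kill_browser()


# Example call (assumed URL for Steam's specials search, not shown in the video):
# for title in scrape_titles("https://store.steampowered.com/search/?specials=1", ".title"):
#     print(title)
```

Passing headless=False to scrape_titles shows the browser window, which is handy while working out the selector; a hash selector such as "#some-id" would target an ID instead of a class. Dropping empty strings in element_texts is a choice made for this sketch, not something the video does.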
Info
Channel: John Watson Rooney
Views: 6,161
Rating: 4.9827585 out of 5
Keywords: python, learn python, web scraping, python web scraping, selenium, scrape dynamic sites
Id: Texh_xJfzEM
Length: 7min 41sec (461 seconds)
Published: Wed Apr 15 2020