Web-Scraping Tutorial Using Python + Selenium + Beautifulsoup - Part 1

Hey everyone, I hope you're doing well and staying safe during these crazy times. Today I'm going to show you how to create an Instagram web scraper for photos using Python and Selenium.

Before we get started on the main programming: my IDE of choice is Microsoft Visual Studio Code. If you want to use Visual Studio Code to follow along, you can; it supports Mac, Windows, and Linux. All you do is go to their website, then download and install it.

We will also need the Chrome web driver, ChromeDriver. Search for it in your search engine and go to the ChromeDriver downloads page. The driver version has to match your Chrome version; for what we're using it's not 84, it's 83 at the time of recording. Click on 83, then on chromedriver_win32.zip, and save the file to your desktop or wherever your project is located. I have mine on my desktop in a folder called projects, so I'm just going to save it there. Once it's saved, extract it and put it inside your project folder; in my case, the project is named Insta
Scrappy. In Visual Studio Code, go to File and then Open Folder. I already had mine open because I was testing beforehand, but just navigate to wherever your project is located; for me it's on my desktop, in projects, and inside that I have Insta Scrappy. Click Select Folder. This is where we'll create the Python file. Now you can see I have my projects here with Insta Scrappy, and inside Insta Scrappy I have my driver. In Insta Scrappy I'm going to create a new file called instascrappy.py. This is the Python file we'll use to program our web scraper.

Before we do any of this, I also need some dependencies installed. Go to Terminal, then New Terminal, and run pip install for each package. The first one is urllib; I tried pip install urllib and pressed Enter, but the package name is actually urllib3, so it's pip install urllib3. That's going to be used for retrieving the photos. Then pip install selenium; this is the web driver library that lets us mimic an actual web browser, Chrome in my case. Next is pip install for Beautiful Soup, which is bs4; because I've already installed it, it's not going to install again. Then pip install regex, and then pip install pynput; that one is going to let us get the follower and following information. Again, because I've already installed these, the output won't look the same for me as it does for you, unless you already had these packages installed.

To begin, I'm going to create a function called main, and after it I'll have the if __name__ == "__main__" check, where I call my function main. In my main function, I'm going to begin by connecting my driver to my program. As you
can see, chromedriver.exe is in my folder here. It doesn't have to be in the same folder, but for consistency and to save time I just put it here; you can put it wherever. I'm going to have a variable called DRIVER_PATH, all caps. You can name it whatever suits your needs, but I'm going to set DRIVER_PATH equal to the relative path of the driver. You can right-click the file, choose Copy Relative Path, and paste it. I put two dots in front, which goes to the parent directory and then to chromedriver.exe, so the program knows where ChromeDriver is. Then I'm going to have a variable for the driver itself; I'll call it wd, for web driver, and set it equal to webdriver.Chrome with executable_path equal to DRIVER_PATH.

I completely forgot that I haven't imported any of my dependencies, so this isn't going to work yet. Let me import the libraries now. I'm going to import urllib and import urllib3. Then from selenium import webdriver, and from selenium.webdriver.common.keys import Keys. I'm going to import time. After that, from bs4 import BeautifulSoup as bs; you can alias it however you want, but I'm just going to use bs. Then I'm going to import os, because I want to make a directory for each Instagram account I take photos from. I'm also going to import getpass, for privacy reasons. You can import re for regex; you don't really need it, but I like to have regex in case I want to do some data
analysis or stripping of strings. The last thing you want to import is from pynput.keyboard: import Key and Controller. This is going to be used for gathering the follower and following information of the accounts.

Now that I have my main function and my web driver connected, let's see if everything works. From here I'm going to say wd.get; get is the function that opens the URL in a browser window. In here I'm going to put https://www.instagram.com/accounts/login/, because we have to log in to our account in order to get the information.

If you still have your terminal open, type ls; if you see your Insta Scrappy folder listed, type cd followed by your folder name, in my case Insta Scrappy. cd stands for change directory, and Insta Scrappy is the directory I'm changing into. Type ls again and you should see chromedriver and instascrappy.py; that's what I have here. Then I type python, because that's the interpreter for the language I'm using, followed by instascrappy.py. This runs the Python script.

It's going to open up... yes, that was my mistake: I should have put one dot, not two, so my driver path was incorrect. That's why it said it couldn't find the driver. Now it finds it: it opens up, goes to Instagram, and closes immediately. That's because once the script is done, it says, hey, I'm done, I don't need to be open anymore, and it closes. To prevent that, we can create a while loop: while
True. This will be the starting point for getting the Instagram account and everything else. In here I'll say test is equal to str(input(...)), with a prompt along the lines of "Please enter an Instagram account to find, or quit: ", ending with a colon and a space. Actually, we can change the name; I'll call the variable instagram_account. Then, if instagram_account is equal to quit, we return False; otherwise we run the code that we'll write later on.

Now that we have our main set up and this works, the first thing we have to do is handle our Instagram login. This lets us access the actual accounts, because when you're logged in, Instagram gives you a lot more information; otherwise there's only so much you can scrape. So we define instagram_login, and we have to pass wd as a parameter. In here I'm going to say user is equal to... and before I continue, let me go to Instagram.

On the login page, if you open Inspect Element and click on the username field, you see an input tag whose class is equal to some value and whose name is equal to username. If you look at the password field, its name is equal to password. What we want is for the web driver to find the username and password fields, as well as the Log In button to click, so we can enter our username and password and submit that information so Instagram can log us in. For this I'm going to say user is equal to wd.find_element_by_name, and the name is going to be username. The same thing
goes for the password: password is equal to wd.find_element_by_name, and this time the name is password. Once we've found the fields, in case there's already something in them, we clear them: user.clear() and password.clear() as well. That clears the fields of any information.

From here we ask the user for their login information. I'm going to say instagram_username is equal to getpass.getpass. You don't need to do this; you can do as I did earlier and just use str(input(...)) to set the variable. But because I'm recording this and posting it online, I don't want people to know my information, so I'm going to use getpass.getpass with the prompt "Please enter your Instagram user account". Then I say user.send_keys(instagram_username). Next, instagram_password is equal to getpass.getpass with the prompt "Please enter your Instagram password", and then password.send_keys(instagram_password).

What this does is take the information we entered and send it into these fields: the phone number, username, or email goes into the first field, and the password goes into the password field.

Once that's done, we put in a timer so the web driver has enough time to process the information and submit it. Sometimes this runs so fast that the web driver doesn't catch up, and the whole program stops working. To prevent that, we put some time.sleep calls in the program in general, just to help it run without any
issues. After this, if you go back to the Instagram website and click on the Log In button using the inspector, you can see the button tag, its class, and then type equal to submit. What we do next is have the web driver find this HTML tag by its attribute. Notice that once we type in our data, the button becomes active. So once the information has been entered into the fields, we click Log In.

To accomplish this, we use a call like the one we used previously, but instead of finding the element by name we find it by XPath: wd.find_element_by_xpath. In here we put two forward slashes and then button, because it's the button that we want to find. After button we put square brackets, because this is how the web driver knows what to look for: first we find the HTML tag (I forget the exact term, the tag or element), and then we match an attribute of that element. So inside the brackets we put an at sign, then type, equal to submit in single quotes, because the button's type is submit. At the very end, after the closing parenthesis, we put .click(). This lets the web driver know: I found what I want. The .click() mimics a mouse click, so once the information is entered and the button becomes available, it clicks the button.

Once this is done, I again put in a time.sleep of five seconds, because after the form submits we want it to go on to the main screen after login, and that is going to take a couple of seconds depending on your internet speed
and on what's cached, and so on. Once that is done, we go to our own account page first. It's going to be wd.get, and to start off we go to our profile page: the string 'https://www.instagram.com/', then a plus, then instagram_username, then another plus and a forward slash in quotes. You probably don't need the trailing slash; I'm sure Instagram corrects the URL by default. Then we put in a time.sleep of another five seconds, and then you can just return; you don't really need the return, but I include it.

So this is the login function. Now we call it in main. Actually, I should put the call outside of the loop, so I'm going to do that: the login goes here, and then I'll put the other function calls after it. So the flow is: once we get the driver path, it initiates the login; the login asks for the username and password, finds the elements, submits, and then goes to our profile page. It should work without any issues.

To run it again, you can press the up arrow on your keyboard to recall the previous command. Or you can type python and the start of the file name and press Tab, which will usually complete the name; in case Tab doesn't work, just type python instascrappy.py, or whatever you called your file.

So it opens up and loads... and it fails to find name equal to username. You know what, it's probably because I didn't give it enough time to find the field. Let me put in a sleep. Yeah, that's because it didn't have enough time to find it. Okay, one more time. Now it loads. That's why you have to put in the time.sleep; otherwise it's not going to work.

It says "Please enter your Instagram user account", so here we'll say Bob George, and the password is little cat 2509. It's going to wait
for the Log In button to be pressed, but because that login isn't real, it's not going to work. It then falls back to our prompt, "Please enter an Instagram account to find", and if you type quit, it quits. So now that we know that works, the function is good.

Now it's time to get into the main meat of the program. After we do our login, we want to create a directory to store all of our Instagram account folders and photos in. You want to keep the Instagram accounts separate; we don't want one folder with five quintillion photos, because that takes up too much space and is very disorganized, especially if you later want to do some sort of facial recognition or deep learning. So instead we create one main folder that holds everything, to keep it nice and clean.

I'm going to define make_main_directory. At first I was going to pass it nothing, but actually I'll pass it directory, because we want to give it the name of the directory that will hold all the photos. Inside, the variable main_directory is equal to the passed-in parameter directory. Then I say if not os.path.isdir, passing in main_directory, meaning if this directory does not already exist, then os.mkdir, passing in the directory name, and then os.chdir into it.

In hindsight, instead of hard-coding a name like Instagram accounts, I can just say main_account_directory is equal to str(input(...)) with a prompt asking for the name of the folder to store the Instagram accounts in, and then pass this into the main directory function in
main. So I call make_main_directory and pass in main_account_directory. Once you make the main directory, it's created inside our Insta Scrappy folder right here.

If we run this (by the way, just press the up arrow to bring the command back and press Enter), it asks me for my user account information. I'll type some random gibberish. It clicks the Log In button, and after a couple of seconds Instagram says, hey, this doesn't work. Then it redirects to the profile URL for what we typed in, and the program asks for the name of the folder to store the accounts in. I'm going to say accounts. Before I continue, see how there's an accounts folder right here now: it created the folder with the name I typed in and changed directory into it. Then it asks for an Instagram account; I'm just going to type quit, and there it quits the program.

Now that we have our directory done, what we want to do next is the heart of everything for the account we're scraping. We create a function that locates the account and gets all the photos and everything; it will be one function that calls multiple functions. So, for locating and storing the Instagram account, I'm going to define get_instagram_account and pass it two parameters. The first is the Instagram username. The parameter name doesn't have to match what you pass in, because it's just a parameter; you can call it Billy, Bob, George, Frank, Ted, or Susie, whatever you want, as long as you know it's going to be the username. I'll just call it instagram_username. Then I have to pass wd, because wd is what we use for the web driver. Once we
pass in the username, we first want to give the browser time to finish the login and load the site without any issues or bottlenecks, so we sleep. Then we say instagram_holder is equal to instagram_username, because we want to take that value and keep it in its own variable so things are less confusing later on. Once we have this variable, we say wd.get, put in single quotes 'https://www.instagram.com/', and add plus instagram_holder. You don't really need the extra variable if you don't want it; it's just my style of programming, so do whichever you like.

After that, we're going to have functions for scraping photos, scraping followers, and scraping following. First I'm going to do the scraping of photos. While I'm at it, I'm going to take the function and call it in main here: get_instagram_account, passing in the instagram_account variable and then wd, because again we need wd to get the information.

Now that I have my instagram_holder and everything, I want to begin with the photos. To start off, I'll log in to my account; of course this will all be blacked out so you can't see it. Let's go to, say, Kim Kardashian's page, why not. Then Inspect Element. We want to figure out how to get the photos and the posts, so we need to inspect the element. If you click on the inspector icon and then click on a photo, you'll see that it has this a tag here with an href. Most Instagram web scrapers that I found just try to
scrape this, and then it doesn't work. What you actually have to do is go down to where the image is actually stored; in this case it's in the img tag, and the class is FFVAD. That's the class name. So what we're going to do is create a function that does the infinite scroll until it hits the very bottom, takes all of the FFVAD images it finds along the way, and stores their links in an array; those are then downloaded and stored on our computer or whatever device you use.

For the photo scraping, I define a function called scrape_instagram_account_images and pass it instagram_holder (you can call the parameter anything, but I like to stay consistent) and wd, because again I have to pass the web driver into it. The scrolling part is code I found online; I will put the source in the description to cite it.

It goes: lenOfPage is equal to wd.execute_script, and in the script string I say window.scrollTo(0, document.body.scrollHeight), camelCase because this is JavaScript, then a semicolon; then var lenOfPage equals document.body.scrollHeight, semicolon; and then return lenOfPage, with lenOfPage capitalized the same way each time. What this does is scroll down from the top as far as the page currently goes and then pause; then it keeps going and going until it hits the bottom, and you'll see that in the next lines of code.

Next, match is equal to False. Then I put in a variable here, x is equal to 100. This helps when capturing photos, because the web scrapers I found would all only get 30 photos and that's it; they
didn't have any means to get more than that. It would just be the same 30 photos, overwriting each other, until the end of the page was hit. What I found was that adding this variable helps a lot.

Following the code that I found online, next is while match is equal to False. Inside the loop I put the directory for the Instagram account: directory is equal to instagram_holder. Then lastCount, with a capital C, is equal to lenOfPage. I put in a time.sleep of 30 seconds; this is for the infinite scroll, so it can gather all the images without rushing through the entire web page.

Then instagram_urls is equal to an empty array, because we're going to take all of the elements we find and store their sources in it. Then instagram_capture is equal to wd.find_elements_by_xpath. Again, we want to find the img tag whose class attribute is equal to FFVAD, all caps. So the XPath is two forward slashes, img, then square brackets, and inside the brackets an at sign, class, equal to FFVAD in single quotes; single quotes inside because I used double quotes on the outside. This class name may change, because Instagram seems to generate these names, so whatever the class name currently is, make sure you use that one here. It is crucial for capturing the photos.

Then I say for i in instagram_capture, to iterate through the elements, and instagram_urls.append of i.get_attribute, with src in single quotes, because the source is the link to where the photo is stored. If I right-click
on the image here and say Open link in new tab, this is the photo that's going to be captured and stored for us to download. The source, the src, is what we have to store in the array to accomplish that.

So we have the sources. Below that, I'm now going to make a directory for the Instagram account we're scraping. It's if not os.path.isdir, like before, and we pass in directory. Again, you don't really need a separate directory variable; you can just use instagram_holder here, but this is how I enjoy programming. Then I say os.mkdir, passing in directory. So this says: if the directory does not exist, make the directory for the account.

After this, we want to iterate through all the Instagram URLs we found and, from those, download the images and store them in the folder. So I say for i, link in enumerate(instagram_urls). Inside the loop, path is equal to os.path.join; inside of that I pass the Instagram account name, a comma, and then in quotes curly braces containing colon 06, followed by .jpg, and then .format with i plus x. What this does is create a path inside my account folder with a six-digit file name, so 000000.jpg if the number were zero. The number is i plus x because, again, most Instagram web scrapers I found would only get thirty photos and that's it. This variable x gets incremented every time the loop goes around, so that each photo can be downloaded and stored on the drive. Once this is done, I'm going
to say try. I have to use a try/except block in case, for whatever reason, the download doesn't work. So: try, urllib.request.urlretrieve(link, path). I'm just saying, try to save this photo at this path using the link provided. Then it's except, and we just print "Unable to download and place inside of folder".

After the for loop, at the end, I increment x: x plus-equals 100. This bumps x to keep the file numbering going. Then I say lenOfPage is equal to wd.execute_script, and it's the same code as before; you can copy it from above if you want, but I'm just going to type it: window.scrollTo, open parenthesis, 0, comma, space, document.body.scrollHeight, close parenthesis, semicolon; var, space, lenOfPage (make sure it's camelCase) is equal to document.body.scrollHeight, semicolon; return, space, lenOfPage. This keeps scrolling until it hits the bottom. Once it does hit the bottom, I say if lastCount is equal to lenOfPage, match is equal to True. This is the final piece; it says, hey, if you're at the bottom of the page and there isn't anything else to load, I'm done and I can return.

Now I go to my get_instagram_account function right here. Under the scraping-photos comment, I call scrape_instagram_account_images and pass in instagram_holder and then wd, because again I need to have the web driver available.

If we run this, there should hopefully not be any errors. Again, the terminal is bugging out, I don't know why. So: ls, cd into Insta Scrappy, ls, and then python instascrappy.py. Let me stop this, because I need to fix something first. I'm just going to comment this out; actually, I'm going to get rid of it, because I don't
need that. Once that's done, it returns. Yeah, so that was my mistake. Now it's python instascrappy.py again; let's try this. It should ask me for my login information. In this case the account I'm going to scrape is Elon R Musk, spelled with two k's. I type that in and press Enter. Oh, whoops, that was my mistake; whatever. When it asks for the Instagram account, I say Elon R Musk with the two k's. As you can see, I now have a folder with that account's name; this is where the images for the account get stored. I put a sleep here, so it then goes to that account's profile, and from there it goes through and scrapes all of the photos on the account. If you click on the folder here, you should see the photos coming in. Here's the folder being created, and inside we have all of the photos we're storing on the computer. It goes from 0 to 35, so that's 36 photos it stores.

And welcome back. As you can see, all the photos were downloaded. There is one thing wrong with this program: when I first made it, it did not make all these duplicates, but for some reason, after I switched from my Jupyter notebook over to Visual Studio Code, something changed. Now that we have all the photos, that's basically it; you can stop now if that's all you wanted. But what I also found while making this was how to get the following and followers information.

For that, I'm first going to create a function. Not here; I'm going to put it right here, underneath the function that scrapes the account images. I'll add a comment, actions for getting followers, and then define get_instagram_actions. At first I was going to pass it actions and wd, because I need the web driver, but what I'm actually going to pass to get_instagram_actions is
instagram_holder, because I'm passing in the Instagram username, and wd, the web driver. From here I say wd.get, and in quotes 'https://www.instagram.com/', plus instagram_holder, plus a forward slash in quotes. Then I put in a time.sleep of 5 seconds, so the web driver has time to load all the data.

Then I say href_temp is equal to wd.find_elements_by_xpath. Going back to Instagram, and again this is the Kardashian page: if you click on following with the inspector, see how the link has a class of -nal3, but all of these are inside li tags; see how it's li with class Y8-fY. So what we're looking for is the li with this class name right here. If I copy the outer HTML, open up Notepad, and paste it, this is the class we're looking for. Hopefully Instagram will have some way to generate this dynamically every time so that people can't do this, but for now it's static.

So the XPath is two forward slashes, li, square brackets, at sign, class, because class is the attribute of the li, equal to the class name in single quotes. I just copy and paste it, but it's capital Y, 8, dash, lowercase f, capital Y, and then a space. Again, this may well have changed by the time this video is uploaded; if so, just inspect this element, find the current class name, and replace it here and you should be good to go. Then I return href_temp.

Then, in my following section right here, I first call get_instagram_actions, passing in instagram_holder as well as wd, because I need my web driver. This gets the list of li elements: posts, followers, following. You can actually just scrape this
data if you really want the posts, but for me, I'm just doing the followers and following. Once this is done, under scraping of following, we want to create another function. I know you guys probably hate functions at this point, but I am big on object-oriented; my first language was C++. So you're going to put def get following information. I love to name my functions after what their intentions are. I put actions as a parameter, and then wd, because it just makes sense for me. So I'm going to pass in the actions from href_temp as well as wd, and actually, up here, I'm going to say href_actions is equal to the Instagram actions call. So down here, in get following information, first create an array, following_names; you can name it whatever you want on your side. Then I'm going to say following is equal to actions[2], because if we go back here, it goes 0, 1, 2. The web driver is going to go, hey, I have found three elements with the class Y8-fY space, so the first one is 0, the second one is 1 and the third one is 2, because it was storing them in an array and returning them. So I want the third action, following, to get the following information. From here I'm going to say following.click(), and that is going to give us the information needed for this, and then I put a time.sleep of 20 seconds. Then I say followers_temp is equal to wd.page_source. The reason I'm doing that: if I click on following here, it gives me this popup, but the way I have this set up, it's going to actually give me this list inside the entire browser window, so not just one scroll of an infinite scroll; it's going to give me the entire list in a humongous web page like this. Anyways, I have that set to followers_temp, and now I'm going to say followers_data is equal to BeautifulSoup, yes,
because I have it imported as bs right here, yes. So I say followers_data is equal to bs, and in here I'm going to put my followers_temp, the page source, comma, space, and I'm going to say html.parser. html.parser tells BeautifulSoup to take all the page source and parse it as HTML with Python's built-in parser, because sometimes, you know, a page could be XML, or you could be using the lxml parser, or whatever format the people who programmed the website chose. So I want to say, hey, I want this page source, but parsed with html.parser, so I can use my necessary functions to get things going. So I'm going to say for i in followers; forgive me, it's not followers, it's following, so it will be following_data, and this would be following_temp. Actually, I'm getting ahead of myself, so I'm going to have one more variable: following_name is equal to following_data.find_all, and then 'a', because that's where they are stored. On following, right-click, inspect: see, they're in an a, the HTML tag a, with this class, and the title is, well, this person's username. Anyway, I want all the a's. So for i in following_name, I want following_names, the array that I created, dot append, i.get, and in parentheses 'title'. This is going to take the title of the a and append it to my array, so I'm getting the names of all of the people that this person is following. Then I'm going to say clean_following_names is equal to, in case there's something we can't read or Python can't use, because that has happened quite a bit for me in web scraping, x for x in following_names if x does not equal None. So it's going to clean the names by dropping every entry that came back as None, which is what get returns when an a has no title, and then I'm going to say return clean_following_names. So I'm going to take this function and call it here: I say get following information, pass in href_actions as well as wd, and then to
go back to the main page, because when I go to my following information, it's going to take me to the big following page, and I'm not going to be able to do anything from there. To get the followers, I have to go back to this page again and then get the followers. But before I do this, I forgot that in order to get this information I actually need get inspector. This is going to allow me to open the inspector of the Chrome web driver so that I can do all the necessary functions. So, under my Instagram actions function, right here, I say def get_inspector; again, call it what you'd like. I don't actually have to pass anything in. I'm going to say keyboard is equal to Controller; Controller is the constructor that we imported from pynput.keyboard, so it stands in for the keyboard. Then I say keyboard.press, and inside the parentheses Key.ctrl, so this is going to mimic the Control key being pressed. Next we need keyboard.press Key.shift, and then keyboard.press 'i', because Mozilla Firefox can use a different key to inspect, but for Chrome it's Ctrl+Shift+I. Then I say keyboard.release Key.ctrl, because once you're done you want to release all of the keys you pressed. Then keyboard.release Key.shift, and the last one is keyboard.release 'i'. Then I'm going to put a time.sleep here, because I need the inspector to open (I have typing dyslexia, by the way); that way there's time between the keys being pressed and me going on to the next function. Then I take this function, and inside of my inspector section up here, I say get_inspector(). That's all I need to do; I don't need to do anything else regarding that. So it's going to open up the inspector.
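The key sequence described here can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the code from the video: press_chord, open_inspector and RecordingKeyboard are hypothetical names, and the 2-second pause is an assumed value. It presses the keys in order and releases them in reverse order, which is a common convention; the video releases them in pressing order, which has the same effect. The pynput import is kept inside open_inspector, so the helper itself can be tried with a stand-in keyboard object that only records what would have been sent.

```python
import time


def press_chord(keyboard, keys, hold_seconds=0.0):
    """Press every key in order, then release them in reverse order.

    `keyboard` is any object with press(key) and release(key) methods,
    such as pynput.keyboard.Controller.
    """
    for key in keys:
        keyboard.press(key)
    time.sleep(hold_seconds)
    for key in reversed(keys):
        keyboard.release(key)


def open_inspector(pause_seconds=2.0):
    """Send Ctrl+Shift+I to whichever window currently has focus."""
    # Imported here because pynput needs a desktop session to load.
    from pynput.keyboard import Controller, Key

    press_chord(Controller(), [Key.ctrl, Key.shift, "i"])
    # Assumed pause so the inspector can open before the next step runs.
    time.sleep(pause_seconds)


class RecordingKeyboard:
    """Stand-in keyboard that records events instead of sending them."""

    def __init__(self):
        self.events = []

    def press(self, key):
        self.events.append(("press", key))

    def release(self, key):
        self.events.append(("release", key))
```

Calling press_chord(RecordingKeyboard(), ["ctrl", "shift", "i"]) fills the events list without touching the real keyboard, which is a cheap way to check the order before letting open_inspector send real key presses.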
Here the options get retrieved, then I get all my following: in the scraping of following, right, following is equal to get following information. Then I get my actions again, and then I do the followers. So after my actions function, after my inspector function, after my following function, I'm going to have one last function. I'm very sorry, you guys are probably sick of this, but I love my functions. So I say def get followers information, and we're passing in two things, actions and wd. In get followers information I set it up the same way: I say followers_names is equal to an empty array, then followers is equal to actions, and instead of 2 it's going to be 1, because again, 0, 1, 2. So we're getting the second one, which is index 1, because Python starts at 0. Then I say followers.click(). Once this is done I put a time.sleep so the page can load without any problems. Then I put followers_temp is equal to wd.page_source, so again we want to get this source, and then followers_data is equal to BeautifulSoup, bs, with followers_temp, comma, space, then in quotes html.parser, so it parses it as HTML. Then followers_name is equal to followers_data.find_all('a'), because again I want all the a's; that's where the names are stored. So I say for i in followers_name, and then followers_names.append, i.get, and then 'title', because again, click on followers, open the inspector, we have the a tag here, and the title is what we're trying to get. So we're collecting all the titles. Then at the very end I want clean_followers_names is equal to x for x in followers_names if x is
not equal to None. Just so you know, I found this code online, because I was having issues with the BeautifulSoup followers names. I forget where I found it on Stack Overflow; if I find it, I'll put it in the description below. If not, please know that this isn't my code. Then I return clean_followers_names. I try to attribute the people who create the code, so they get the rightful ownership. Anyways, now that I have this function completed, I'm going to go up here to my scraping of followers and say followers is equal to get followers information, passing in href_actions, comma, space, wd. Let me check: get followers information, actions, wd. Okay, I just wanted to make sure. Once this is done, what you could do is write it out to a CSV file; I'm just going to say print following. You can do what you want with this; I just wanted to show you that it is capable of actually picking up the information. How you want to use this data is up to you. So print followers, followers, print following, following. We should now have a complete working project for the Instagram web scraper. One side fact I'd like to mention: by default, when you run the program, it's going to create this folder multiple times. I haven't been able to fix this yet, so if you've already run the program once, just delete your old folders and then rerun it. So I'm going to say python and then the name of our .py file. Now it's going to load, and then it's going to ask me for my Instagram account, and I'm going to black this out so you don't see it. You can reduce the duration of the sleeps if you want, but me, I actually like to have these sleeps in here. So I enter my Instagram account, and then I want to type in Elon Musk. Yeah, so I misspoke: they changed it, because it used to be something else. As you can see, it didn't work, because the classes are different. In Firefox it was Y8-fY, but on here, see, it's a different class, one that starts with underscore underscore, which is why it didn't pick up the followers or following. And actually, I had this in my code. So what you want to do, if you're using Chrome: go here, copy the outer HTML, and all you want is the class value right here, including the space at the end. Then in the code, where it says Y8-fY, just change it to that class, and that will fix it. Again, I was using Firefox; the class does depend on the browser, so in Firefox it's Y8-fY and in Chrome it's the other one. Regardless, it works. One thing I forgot to mention: make sure that you have the Chrome driver window clicked on while this is running, because otherwise the inspector function will still input Ctrl+Shift+I. So if I'm playing a video game while this is going and the script runs into this code right here, then whatever game you're playing, if Ctrl+Shift+I does something, it's going to trigger that function in the game. So be sure to have the driver window clicked on when it reaches this code. So yeah, I hope you all enjoyed it. I know it's a bit long of a video; it's my first video too, so please be kind. Constructive criticism is always welcome, and please be sure to let me know what you'd like to see. I'm trying to make more programming videos so that I can help myself understand how things work too. Anyways, thank you all for watching, have a great day, night, evening, and see you in the next video.
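Since the class name turned out to differ between Firefox and Chrome, one option is to pass it in rather than hard-code it. The following is a minimal sketch under stated assumptions, not the code from the video: li_xpath, clean_names, extract_titles and find_profile_counters are hypothetical names. Only "Y8-fY " (with its trailing space) comes from the video; the Chrome value has to be copied from the inspector. The Selenium and BeautifulSoup imports sit inside the functions that need them, and find_elements(By.XPATH, ...) is used because the find_elements_by_xpath call from 2020 was removed in later Selenium 4 releases.

```python
def li_xpath(class_name):
    """Build the XPath for <li> elements whose class attribute is exactly class_name.

    The match is exact, so a trailing space in the page's class value
    must be included. Class names containing a single quote are not handled.
    """
    return "//li[@class='{}']".format(class_name)


def clean_names(titles):
    """Drop None entries, which a.get('title') returns for <a> tags without a title."""
    return [title for title in titles if title is not None]


def extract_titles(page_source):
    """Collect the title attribute of every <a> tag in the page source."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_source, "html.parser")
    return clean_names([a.get("title") for a in soup.find_all("a")])


def find_profile_counters(wd, class_name):
    """Return the posts, followers and following <li> elements, in page order."""
    from selenium.webdriver.common.by import By

    return wd.find_elements(By.XPATH, li_xpath(class_name))


# Value read out in the video for Firefox; replace it with whatever
# the inspector shows in your own browser.
FIREFOX_LI_CLASS = "Y8-fY "
```

With this shape, switching browsers means changing one string, for example find_profile_counters(wd, FIREFOX_LI_CLASS), while the followers and following functions stay the same.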
Info
Channel: David Grice
Views: 2,965
Rating: 4.9402986 out of 5
Keywords: python, selenium, beautifulsoup, instagram, webscraping, webscraper, web, scraper, programming, project, tutorial, urllib, fun
Id: beF2zLsw2ws
Length: 75min 17sec (4517 seconds)
Published: Fri Jun 12 2020