Building a bot to scrape job data… How NOT to collect data

Video Statistics and Information

Captions
So I built a web scraper that logs into LinkedIn, searches for jobs in the data science field, and then scrapes the job description data. Everything was going fine until I woke up one morning to check on the status of my bot and found this. So let's go over how I got here and how you can avoid this.

What up, data nerds! I'm Luke, a data analyst, and my channel is all about tech and skills for data science. This video is part of a series where I'm building a data science project in order to build up my portfolio. In my last, and also first, video of the series, I detailed outlining a problem in order to get started with a data science project. The problem that I'm trying to solve, which I feel many of you can relate to, is how to become a data analyst. Specifically, I want to have more transparency around this, and I'm going to be looking into things like what skills are required in this field and what the job market is like.

So once you know the problem that you want to solve, the next step is actually collecting the data, and the data can come from a lot of different sources, so let's break this down. The first aspect is the availability of the data: is it publicly available, or do we have to sign in to some service or go behind a paywall to access it? The other major aspect is whether it's clean or not. If it's clean, it's usually in the form of CSVs or a database; if it's not so clean, it may be a multitude of data spread over web pages.

So I started in the easiest section, looking for clean, publicly available data. The main sites that I use to look for this kind of data are Kaggle, Google, data.gov, data.world, and even some GitHub locations. Although there was a lot of publicly available data around jobs, there wasn't anything that actually provided the in-depth detail that I needed to solve my problem. So I moved on to the next section, looking at clean data that is not necessarily publicly available, and for this I think a good example
are APIs. So what's an API? In simple terms, it involves using some code, such as Python, to contact a server and then request the data that you want. I consider this not public because an API typically requires you to go through some sort of authentication to approve and allow you to access that data. If you're interested in learning more about using Python to access APIs, I highly recommend you check out the Python for Everybody course, as it provides an introduction to this.

So anyway, back to the project. I started to look into top websites that provide job data and whether they provide APIs for it. I decided to look at some of the most popular job search sites, including LinkedIn, Indeed, Monster, Glassdoor, and even Google Jobs. Out of all the websites that I looked into, none of them really had an API that would allow me to collect job data. Funnily enough, LinkedIn used to have an API that allowed you to access this job data but has since deprecated it, it turns out for good reason: they didn't want anybody building a LinkedIn competitor.

All right, the next section, which we'll look at quickly, is data that is unclean and not publicly available. As a data analyst, I find that this is typically where I work in my normal day job, in that I have access to, if you will, unclean data within my company that's not publicly available and that I have to aggregate and make usable. Unfortunately, I don't work at LinkedIn or even Google, so this wasn't an option for me.

So I moved to the last option, and that is data that is not clean but is public. I like to think of this as all the publicly available data on the internet that you can access through web scraping or web crawling. For my project, I felt that the best way to solve this problem was collecting data from job postings. I looked at all those websites that I previously mentioned and decided to go with LinkedIn. This is the social media platform that most of my
subscribers are on when looking for a job. LinkedIn actually makes it really simple: you can go in, search for a specific job title in a specific location, and then get job postings for it. So that's what I decided to do for this part of the project: build a web scraper to go in and scrape that data.

Over the course of a few days, I worked to build a script for this purpose. Since I'm most familiar with Python and feel it's a superior language, I decided to go with it, along with Selenium, a popular Python library. Truth be told, I'm no expert on web scraping, so initially a lot of my time was spent learning the basics with DataCamp's course Web Scraping in Python. This course was great at getting me up to speed fast, and once I got into building the scraper, I switched to using Google to answer my questions. And trust me, I used Google a lot.

To make this bot easier to build, I broke it up into sections: logging into LinkedIn with my login information; navigating to the job search page and searching for data analyst jobs; going through and selecting each job posting in order to scrape the job data I wanted; and, once all entries on a page were cycled through, selecting the next page and repeating this process for all remaining pages. During all this, the data is saved to a daily CSV file. I wanted only the most recent job postings, so the bot only searched for jobs posted in the last 24 hours. I then used a cron job to run this script automatically every night, so theoretically I was scraping all the jobs posted for data analysts.

Anyway, I do want to note some caveats about web scraping and some problems that I ran into. First, I noticed that I had to throttle the speed at which I was scraping the data. If I scraped too fast, I would get those "Are you a robot?" checks, and I had to manually complete the check in order to continue. Because of this, it sometimes took as
long as half a day just to pull all the data that I wanted. The second limitation was that LinkedIn only provides around 1,000 job results per job search. Even if there were hundreds of thousands of job postings, I could only scrape a thousand jobs at a time. The final and most annoying thing was that I had to log on to LinkedIn daily; otherwise I would continue to get the "Are you a robot?" prompt and it would mess up my script.

Besides that, everything seemed to be going fine with pulling this data. I even made a video here where I dived into the initial data that I pulled in order to find out what skills were being requested of a data analyst. So I was really optimistic about the data I was pulling, and I was hoping to just keep pulling it indefinitely so that I would always have the most up-to-date data on this field.

But one morning, when I woke up to check on the status of my bot, I noticed that it wasn't pulling data. I initially thought that this was caused by that third problem I outlined, of not logging in daily. So I tried to log into LinkedIn, and when I went to the job postings, I couldn't search for jobs anymore. Apparently LinkedIn identified that I was a bot and restricted my access to job data. I was a little pissed, but full disclosure: I did use a burner account, not my actual LinkedIn account. I set up a fake account to log in with in case something like this happened. Nonetheless, I was still upset because they had restricted my access to looking for jobs.

So I decided to look more into whether web scraping is legal, and I found this: in 2019, a U.S. circuit court ruled that web scraping public sites does not violate the law. Interestingly enough, this case was actually about LinkedIn being upset that another company, hiQ, was scraping its publicly available data. The court ruled that for publicly available
non-copyrighted data, users are allowed to do web scraping. However, the ruling excludes sites that require some sort of authentication and that have you sign terms and conditions that basically forbid you from web scraping. This year, LinkedIn brought this case up to the Supreme Court, and the ruling was vacated, meaning it was made legally void, and the case is now up for review again.

After learning all this, I decided to look more into what LinkedIn's terms and conditions say on this, and come to find out, they specifically ban their members from scraping any data. Probably should have read that first.

So where is this project going now that I've hit this road bump? Well, I did find that this job data is available without logging in or having to authenticate and agree to LinkedIn's terms and conditions. I'd have to redesign my bot in order to scrape this publicly available data, if you will, and I'm not sure I'm comfortable doing that just yet, so I'm still thinking about it.

As always, if you got value out of this video, smash that like button. With that, see you in the next one. [Music]
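
Editor's sketch (not from the video): the loop described in the captions, cycling through result pages, throttling between requests, stopping at roughly 1,000 results, and saving everything to a daily CSV, might look something like the Python below. This is only a minimal illustration under stated assumptions. The names `scrape_to_daily_csv` and `fetch_page`, the column names, and the 5-second default delay are all hypothetical, and `fetch_page` stands in for whatever browser automation actually loads one page of results. Luke's real script is not shown in the video and may differ.

```python
import csv
import time
from datetime import date
from pathlib import Path
from typing import Callable, Dict, List

# The captions mention a cap of roughly 1,000 results per search.
MAX_RESULTS = 1000

# Hypothetical column names; the real scraper's fields are not given in the video.
FIELDNAMES = ["title", "company", "description"]


def scrape_to_daily_csv(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    out_dir: Path,
    delay_seconds: float = 5.0,
    max_results: int = MAX_RESULTS,
) -> Path:
    """Cycle through result pages with a pause between them and save rows to a dated CSV.

    `fetch_page(n)` is an assumed callable that returns the job rows on page n,
    or an empty list once there are no pages left.
    """
    rows: List[Dict[str, str]] = []
    page = 0
    while len(rows) < max_results:
        if page > 0:
            # Throttle: wait before requesting each subsequent page.
            time.sleep(delay_seconds)
        batch = fetch_page(page)
        if not batch:
            break  # no more pages
        # Never keep more than max_results rows in total.
        rows.extend(batch[: max_results - len(rows)])
        page += 1

    # One file per day, e.g. jobs_2021-11-15.csv
    out_path = out_dir / f"jobs_{date.today().isoformat()}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Passing the page-loading step in as `fetch_page` keeps the throttling and CSV logic separate from the site-specific part, so it can be tried with a fake function that returns canned rows instead of hitting a live site.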
Info
Channel: Luke Barousse
Views: 86,677
Keywords: data viz by luke, business intelligence, data science, bi, computer science, data nerd, data analyst, data scientist, how to, data project, data analytics
Id: 1kU_ASADlPY
Length: 9min 0sec (540 seconds)
Published: Mon Nov 15 2021