HIDING Data with JavaScript? Web Scraping Obfuscation

Video Statistics and Information

Captions
So I found this on Stack Overflow. I was looking here and found this question, and I looked down through here and this is where this popped up. I thought, hang on a minute, that looks a lot like some kind of obfuscation attempt, and I thought it was really interesting. Then I scrolled down and found that this guy had already answered it and had already pasted a whole answer to it, which was pretty cool, so credit to this guy down here for doing that over two years ago. But I thought this was really fascinating, and that it would be a nice introduction to the general obfuscation attempts that people can make using JavaScript to stop you trying to get their data.

Now, obviously this one is near the top of the results for a certain search term in Google, and it's also not really that great an attempt at doing it. So this is for educational purposes, because it could be interesting, and if you're trying to prevent people from taking your data then maybe you wouldn't want to do it this way; you'd want to do it a better way.

So this is the website. This is a slightly different URL. If we hover over here, you can see the email address is right here. It's actually a mailto link, all in plain sight on the website, so we're not doing anything bad here. But when you go to it, you see that there's this script tag above it, and this is where the JavaScript is, and this is the obfuscation. What this does is create this a tag here with all the information in it. So if you were to try to get this tag out without rendering the JavaScript, all you would get back is this data and nothing else, which is what the person on Stack Overflow was doing.

Now, we don't want to have to render a load of pages to get this information, so I thought I would walk you through how you can understand what this does and how you can de-obfuscate it to get this data out. So what this does: this code turns into this
element here. So we're going to have a quick look at the code in the code editor. You can see it's creating a new array, which is like a list in Python, called l, and for each entry in it you can see it's got these pieces of data. The thing that gave it away to me right away was that you can see here there's the opening of a tag, there's an a, there's a slash, and then there's the closing of a tag, and then we've got all these numbers with this line in front. If you go to the end, you'll see there's another tag, a, and then h-r-e-f. So that showed me straight away that this was creating an HTML tag, which was a link, which was probably going to have the data in it.

So what this does is create this array with this data, and then it checks the substring of each entry: basically it's looking for this line. If it's there, it removes it and then unescapes what's left. So these numbers are the ASCII codes of characters, and you can see that if it doesn't find the line it just unescapes the entry anyway. So it's a fairly simple attempt. All they've done is turn it around, reverse it, and put a line in front of a number, which is the ASCII code of that character.

Now, we can undo this nice and easily using Python, so let's get started on that. We'll use requests-html for this, so let's do from requests_html import HTMLSession, and then s is going to be our session. Now we can do r is equal to s.get, and let's grab that URL, that one will do, and put that in there. If we print r.text we should get exactly what you'd expect back. Okay, good, fine. If we were to search in this for the part where the email address is, you're not going to find it, because it doesn't exist: the JavaScript on this page hasn't run, so it hasn't executed the script code which generates that element. So let's find that element
first. We're going to call this text, and we'll do r.html.find, because we want to find it, and I'll show you how I pulled it out. If we go here, and let's close that down, all I did was find this ul with a class of icon list, and I'm indexing the second list item in it, which is going to be index one. So it's ul dot icon list, then a space and an li, as CSS selectors. Then we can index it: zero would be the first one on the list and one is the second on the list, and then we can do .text. So if I print out the text of this element now, we should end up with the JavaScript code that we were just looking at. There it is, we can see it.

What I'm going to do now is copy all of this, and we're going to use regex to pull out the parts of the data that we want. I've got this online regex testing tool. This isn't the one that I normally use, and I can't remember what that one is called, but any will do. I'm going to paste the data in here, and here we can construct the regular expression that's going to pull out just the bits of data we actually want. Now that I know that this forms the element that has the email address in it, what I want is each and every item that they created in their array, and those are in between the single quote marks there.

So we want everything within the quote marks. I'm going to type those, but that obviously doesn't find anything, because that's not how regex works. We want everything that's in between those single quotes, so we use the brackets, which tell it we want to capture everything within them. If we put a dot, that represents a single character, so you'll notice that it's only pulling out the ones that are a single character. That isn't good enough, because we have ones that are three and four long. So instead of just putting more dots, which would match those but not these ones, what we can do is put a star.
That's going to match all of them. Now you can see it's picking up a lot of information at the moment, and that's because the star is greedy, so it's finding them all here and matching them all as one. So we want to put in a question mark, which makes it lazy, so it matches as few characters as possible, and you can see we get them all out here. We actually do pick up this one as well, but that's okay; we can just ignore that later on down the line. Not a big issue. If you had lots more outside this, maybe you'd want to do something different to match them, but this will work just fine for us here. That is a really short and simple piece of regex.

Just before we go back to our code and complete our regex, let's have a look at some of these numbers. We can see that the first one here is 109, so I know that this is the ASCII code of a character that just has this line in front of it. If we go to the end, because we know this is actually reversed, there's our href equals mailto: you can see m-a-i-l-t-o, then our colon. So the first one is a 99. This website here will tell you what they are, so if we find 99, we'll see that 99 is a lowercase c, and when you go here, the first one in the mailto is a lowercase c. And what do we have after 99? That's 46, and 46 is a dot. So there you go, that's just showing you how I figured out what that was.

So now that we have our piece of regex, let's bring that into our code. Import re, and instead of printing this text, let's say chars is equal to re.findall, because we want to find everything. Then we put an r, and I'm going to use double quotes here, because my regex has single quotes in it, so the double quotes go around it all, just like that. Then we pass in the text here. So this should, let's print it out, give us just those characters that we found, which we showed you in the regex tool, in a list in Python now, which means we can
actually turn them into what we want. So that's good. Let's do a quick loop: for c in chars, print c. We're going to see that this is still backwards, but that's okay, we'll sort that out. There's the beginning of our mailto, and this is where the ASCII codes are, and you'll notice that it's there twice, because it's once in the mailto bit and once in the actual text of the tag. But we still have this line in front of the numbers, and we don't have it in front of these other characters. So we need to say that if there is a line in front, we want to do something with that entry instead. I'm going to say if c, and I'm going to index the first character in c, is equal to this pipe, we'll print c from index one all the way to the end, and then we'll have our else here, which is just going to print it normally. We need a double equals there. This should give us just the numbers now. There we go.

Now that we've got our numbers, we can change these back into characters nice and easily using Python. We know that this is the actual number, so we want to turn it into an integer and then turn that into a character. At the moment it's a string, so we need to turn it into an int first, so let's put it in int, and then we use chr, with another bracket, to turn it back from the Unicode code point into the actual character, i.e. the ASCII character that we want. Let's run this again, and there we go, we're starting to see it. There's our at sign, and there's our dot com, and there's the end of it; that's the company, and c, dot.

So what we want to do now, instead of having it as a mess like that, is create a new list. Let's just call that output, and instead of printing we will add to the list, so output.append, and we will do the same thing for where it
doesn't have the pipe. Now what we want to do is print reversed output. We're actually throwing an error, and that's because we picked up that entry that had just the pipe and nothing after it. So I'm just going to put a try in here, and we'll have an except for whatever that exception was. I can't spell except. What was that exception? It was a ValueError, great. So let's do except ValueError as er, and then let's just print our error and carry on. Okay, so we should get no error now. There's our error, but we've carried on through it. That's our list object, so let's do list of reversed, and now we've got it all in the right order.

You can just about see it there: there's the start of the tag, href equals. So all we want to do now is join it all together. Let's make this a quick variable here, and then we can print it joined together, and run. And there is the full tag, the full element that we were looking for on the page originally, which was obfuscated by JavaScript and which we have undone: we used regex to find it and pull out the parts of the array, we reversed the array, we removed the pipe to get the number, then we turned the number back into a character, and we've generated for ourselves the actual element that was being built by JavaScript. So we've undone it all.

If you've enjoyed this video, you're probably going to like this one as well, which is my preferred method for scraping data from a website.
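
The steps walked through in the captions (regex extraction, stripping the pipe, chr(int(...)), reversing, joining) can be sketched roughly as below. This is a minimal sketch under assumptions, not the exact code from the video: SAMPLE_SCRIPT is a hypothetical stand-in for the page's inline script with a made-up address, and the function name deobfuscate is invented for illustration. In the video the script text comes from a requests-html lookup rather than a hard-coded string.

```python
import re

# Hypothetical stand-in for the inline script described in the captions.
# The address (a@b.com) is made up; the real page's data is not reproduced.
SAMPLE_SCRIPT = """
var l=new Array();
l[0]='>';l[1]='a';l[2]='/';l[3]='<';
l[4]='|109';l[5]='|111';l[6]='|99';l[7]='|46';l[8]='|98';l[9]='|64';l[10]='|97';
l[11]='>';l[12]='"';
l[13]='|109';l[14]='|111';l[15]='|99';l[16]='|46';l[17]='|98';l[18]='|64';l[19]='|97';
l[20]=':';l[21]='o';l[22]='t';l[23]='l';l[24]='i';l[25]='a';l[26]='m';
l[27]='"';l[28]='=';l[29]='f';l[30]='e';l[31]='r';l[32]='h';l[33]='a ';l[34]='<';
for (var i = l.length-1; i >= 0; i=i-1){
if (l[i].substring(0, 1) == '|') document.write("&#"+unescape(l[i].substring(1))+";");
else document.write(unescape(l[i]));}
"""


def deobfuscate(script_text):
    """Rebuild the element that the obfuscation script would write to the page."""
    # Lazy match: capture as little as possible between each pair of single quotes.
    tokens = re.findall(r"'(.*?)'", script_text)

    output = []
    for token in tokens:
        if token.startswith("|"):
            try:
                # Drop the leading pipe, then turn the character code into a character.
                output.append(chr(int(token[1:])))
            except ValueError:
                # The bare '|' from the script's own comparison has no number after it.
                continue
        else:
            output.append(token)

    # The array is stored back to front, so reverse it before joining.
    return "".join(reversed(output))


print(deobfuscate(SAMPLE_SCRIPT))
```

The sketch skips the stray pipe silently instead of printing the error, and it does not reproduce JavaScript's unescape on the plain entries, which would matter only if the array contained percent-encoded values.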
Info
Channel: John Watson Rooney
Views: 2,043
Keywords: obfuscation, obfuscate, python web scraping, web scraping, javascript obfuscation, javascript website obfuscation, hidden website data, web scraping with python, requests html, scraping javascript, what is obfuscation, obfuscation explained, email obfuscation
Id: ks-iekIJy6M
Length: 12min 50sec (770 seconds)
Published: Fri Nov 05 2021