EVERYONE Needs to Learn a Little Bit of AWK!

Video Statistics and Information

Captions
Hey there, my name is Gary Sims and this is Gary Explains. Today we'll talk about awk, which is a scripting language for manipulating data in files. Today we often see that people are using spreadsheets for holding all kinds of information, and they're trying to process that data inside of the spreadsheet, and they might have multiple worksheets open and there's lots of stuff going on. I think people have generally forgotten, or maybe they've never even known, about the power of awk and how it can automate a lot of tasks for you. Awk basically reads data from a file: it can be space separated, it can be comma separated, it can be tab separated. It processes that data, you can manipulate it, and then output something else. So if you're the kind of person that manipulates data for your job, or even for fun, then I think you'll find this interesting. So if you want to find out more, please let me explain.

What we'll do is go over to the command line, and I'm basically going to show you practical examples of working with awk, and these will be kind of a stepping stone to teach you the fundamentals of how it works. So let's go straight to the command line.

Okay, so here we are on the Linux command line. Now, I've got a text file here called LS user bin dot txt. It's basically a list of all of the binary files that you find in /usr/bin, and then the number of bytes that file is, the size of the file. This is quite a long file: if we do a word count, wc -l, on that file, you can see there are 664 different lines, each describing one file and its size.

Now, awk takes in a program written in its scripting language so that you can process files exactly like this, and the simplest program is just to print out the file, so that's what we'll do first of all. You start with awk, then single quotes to specify that this is the program. All actions in awk go inside curly brackets, curly braces, and we'll talk more about why that is in a minute, because there are other parts to the
syntax that we can put outside these curly brackets. So that's it: you just say print, and then we give it the file that we're going to look at. If we run that now, it will basically just print out the file, and that's what it did. So there you go: very quickly it has printed out every single line of that file.

Now, just to introduce you to a thing: the way awk works is it treats each word as a field, separated by a space by default. You can change that if we were dealing with, for example, comma-separated files. To access each field you use the dollar sign and then the field number, and actually $0 is the whole line. So we're going to run this program again with $0, and in fact it will do exactly the same thing, because $0 is every single line. But if I just do $1, then we'll only see the binary name come out. There we go; you notice there are no numbers now on the right-hand side. If I do $2, we'll just get the file sizes. So there you go: it works by processing a line of text and splitting it up into fields, and then you can do things with those different fields.

Now, I said earlier on (let's change this back to $1, I want to print out the name) that you can do things outside of the brackets. What you do outside the brackets is pattern matching. So for example, if we do this: notice how I've got the forward slash, gcc, forward slash. The first slash says this is the beginning of a regular expression, and the end of it is where the other slash is, and then print is the action. So what it will do is, whenever a line matches the word gcc, it will print out the name of that file. If you want to know more about regular expressions, I have a video on this channel about grep and regular expressions. Now if we do this, we just see the files with gcc. That's much shorter, just four lines there.

Just like all regular expressions, you can apply different things to it. For example, the up arrow, the caret, is for the beginning of a line, so we
can say any line that begins with a w, print that out, and then we can see those lines there that just start with a w. We could go on doing all kinds of different things. We could match on different letters, so if we want to see anything that begins with "ws", then we just get those. So any regular expression you can put in there. We could do anything that has, let's see, let's just put the word "path" in there, and then you can just see those. So anything that matches that regular expression will get printed out.

Equally, we could print out the file size and the file name, and we could do that by saying comma, $2. So we're actually getting both of those printed out, but now as individual fields rather than just printing out the whole line.

Now, what's interesting is that awk is clever enough to know that some of these are numbers. So for example I could say, well, let's divide this by 1024. At the moment this is being given to me in bytes; if we look here, this is in bytes. Let's say I wanted to divide by 1024 to get it in K, in kilobytes. So I'm actually saying print out $2, but divided by 1024, and so we're starting to do some processing on the file. As you see, now when we print that out it says 30.4688, 34.1 and so on, so we know those are the sizes in K. We could add on here at the end, just in quotes, the letter K, and there we go, we've got a nice K number.

So you can do lots of different things in terms of actually starting to process these files, and you can match on multiple things. So for example we could say here: everything with path, and, so you do two ampersand signs, I want to check $2, that's the size field, and make sure that's greater than, what should we say, greater than 15,000 bytes. Okay, so that will now print out exactly the same thing, but some of those files won't be matched. So there we go, we can
see that three of them have been printed out because they are greater than 15,000 bytes.

Okay, now before we go on further with awk, I'd also like to mention this deal here that you can get at the moment: the machine learning and data science training bundle. There are eight different courses, over 48 hours of content, and it's going to cover things like TensorFlow, and it covers Python, and it covers R as well: this one, the regression analysis, statistics and machine learning. There's also more down here, you can see, to do with R: a boot camp in R, then there's machine learning and deep learning in R. So really quite a lot of stuff here. If this kind of stuff interests you (data manipulation, data learning, data science, big data, any of those things), this looks like a pretty good deal. This is an affiliate link, which means if you do buy this course you also help out this channel, but more importantly you get to increase your knowledge.

Now, as you can see, these programs are getting a little bit more complicated. It's not just a case of printing something out: we've got regular expressions and we're checking field values and things. So we can actually put this all into a script and then say to awk, please run this script. Let's write a little program, let's call it path 15k, because that's what we're looking at: anything with the word path in it that's greater than 15K. And we can give it the extension .awk. Before we go into it, let's just cut and paste this part of the text here so that we've got it available to us in our program. Okay, so nano, and then all we're going to do is just paste that in there. Now you can get rid of the single quotes at the beginning and the end, and so there's the one line of this program: basically, when you match path and $2 is greater than 15,000, 15K, then print out the name and do that K conversion there. Simple as that. So we can save that, and now here we do awk -f, then the name of the script file, and then the file we're going to run it on, LS user bin, and it does exactly the same thing, but now
our command line is a lot simpler, and we're running an actual awk script.

Once you get into awk scripts, you find out you can do lots of interesting things. So for example, if we go back into our little program here and we add another line: let's say we look for everything that begins with the letter a and is greater than 25K, and print that out as well. So we're just adding in a second line with a second set of matches and then a second set of things that we want it to do. We can just save that and run it the same way, and now we also get things that begin with the letter a that are greater than, what did I say, 25K, and we get the other ones that have the word path in them that are greater than 15K. They're all listed there now, and the patterns are matched against each line of that file that we're going through.

Now, if you look at this, you can see that the K sizes have, you know, three or four digits after the decimal point. You notice 42.156 there, for example, and that one there, 38.0890. So these are not very human readable. So in fact let's just change this now to make it more human readable. We go back into this file, and first of all we're going to get rid of that line, we're not going to do that again. Now, I actually know that there are some interesting files that begin with the letter w, because I want to show something about rounding numbers. In awk there are built-in functions. There is a function called int, which basically truncates a floating-point number and turns it into just a normal integer. So if this was 42 point something, it will just say 42. So we can now run this program, okay, and now we can see it just says 18K, 32K, 22K, 13K. Much, much nicer, much more human readable, which is great.

However, if we just modify the program again and change this slightly to print out the real byte value divided by 1024 and then print out the truncated one, let's just see here what happens. What I want to show
is there are some of these here where it's definitely truncating. Here's a good example: it's 46.9766, and that gets truncated to 46. Really that should be 47; this should really say 47K if we wanted a proper approximation. You know, this is just a truncation, it just drops off everything after the decimal point. So it would be good if we could do something clever so that we get a better representation of these numbers, and this is a great way to introduce the idea of a function.

Now, inside of awk you can define a function. We're going to write a function called round, so this is us writing a function, and we're going to pass in a parameter called n, which is the size. There's a quick trick you can do in math to round something to the nearest whole number: you can say n is equal to n plus 0.5, so we add a half to it, and then we truncate it. Okay, so we're saying n is equal to the truncated version of n, and then we can return it, so that's the number that we return. How does that work? What if it was 42.1? If you add a half on to it you get 42.6, so when you truncate that you still get 42. But if it was 46.9 and you add a half, that goes over into 47, 47.4 or whatever, and when you truncate that you now get 47. So it's a great trick to get the truncation to work in the right way, and it also shows us how we can write a function. So the function takes in n, the number we pass in, it adds 0.5 to it, it then truncates it, and then it returns it. So now we can call the function here inside of the action part, and rather than calling int to truncate it, we can call round on $2, and of course we're dividing it by 1024 to turn it from bytes into K, and it's now going to round that number to a better one. So now if we run that, what do we get? Here we go: if we look at this one here, 46.9766 is actually now 47. So we've written a function and we've done some rounding. So that shows you now that
with the power of awk you can write programs that can parse all kinds of lists and logs and comma-separated files and whatever data you've got, line by line, and it can actually start to process it, run functions on it, actually start to do things with it.

So let's have another quick example of what you can do. I have another file here called numbers, which is just basically, what is that, six numbers I've just typed in there: 3, 7, 12, 15, 16 and 31. Now what we can do is write a program that's going to read in each line and then print out all the numbers up to the number that it read in, and we're going to call this loop. So what we're going to do is write a function called, well, let's just call it print list, and we're going to pass in n, which is the number we're going to read in from the line, and then we're going to do a for loop. Now, the for loop is very much like the for loops in C. Basically you start with the initialization, so we're going to set i to 1, that's a counter. Then you have the expression you test: when does this thing stop? You say keep going while i is less than or equal to n, and stop when it's greater than n. And then finally you have an iteration step, something that happens every loop; in this case it adds one. So i starts at 1, it keeps going while i is less than or equal to n, and each time it adds one on to i. That's the basic for loop, the same as in C and other similar languages.

Then all we do is print, but now we're going to use printf, which is kind of borrowed from the C programming language. %d means give me an integer, and we're going to print out the integer i. Okay, so that's printf, print formatted: %d says print out a number. It's a special format; you put in here a special placeholder that's going to be an integer, and in this case it's going to be i. And then all we do in the actual code for the action part is call print list, remember we're in curly
brackets here now, and we're going to say $1. $1 is going to be the first field, so it was 3, 7 and all those other numbers that I put in there. So every time it reads a line it's going to call print list, and it's going to call it with the field that gets passed in. Now, one other thing we want to put in here to tidy up the output: after it's printed the list we want to print a blank line. So again we can use printf, which is borrowed from C; backslash n means a newline. And that's it. So after it's printed out 1, 2, 3, for example, for the first line, it prints a blank line, and then every list will be nicely formatted on the screen. So now we can save that and we can run it. How do we do that? awk -f loop, and we're running it on the file called numbers.txt. And there you go: 1 to 3, 1 to 7, 1 to 12, 1 to 15, 1 to 16 and 1 to 31, which are the numbers that I had there in that file, and it knew to do those numbers because it read each one from the file.

So imagine now you want to write programs that take data from a file, and then you want to do things with that data: you want to create reports, you want to do multiplications, you want to add things up, you want to work out averages, whatever it is you want to do with those numbers. You can do it here by reading in each number and then producing a result using the awk programming language.

Okay, so there you have a gentle introduction to awk. There are lots of things I haven't covered; this is only an introduction. Most importantly, I haven't talked about the BEGIN and END keywords. They're special patterns that match before any input is read and after all of the input has been processed. And there's loads more you can do with awk: for example, you can output that data and maybe pipe it into another program to plot graphs, for example, something a lot of people do if they're working inside a spreadsheet. In fact, awk is so powerful you can even write small
interpreters, language interpreters, or even compilers using awk, because it's so good at tokenizing the data so that you can process it and actually produce another language, and so on and so on. So I really do suggest you get into awk, and I hope this was a useful introduction to it.

Okay, that's it. My name's Gary Sims, this is Gary Explains. I really hope you enjoyed this video. If you did, please do give it a thumbs up, and if you like these kinds of videos, stick around by subscribing to the channel. Okay, that's it, I'll see you in the next one. [Music]
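The commands typed on screen are not part of the captions, so what follows is only a minimal sketch of the steps described above, not the exact programs from the video. The file names (sizes.txt, round.awk, loop.awk), the sample data, and the function name print_list are assumptions made up for illustration; the awk features themselves (fields, regular-expression patterns, &&, int, user-defined functions, for loops, printf) are the ones the captions describe.

```shell
#!/bin/sh
# Sketch only: file names and sample data are invented for illustration.
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical stand-in for the /usr/bin listing: "name size-in-bytes" per line.
cat > sizes.txt <<'EOF'
gcc 31200
wget 48104
pathchk 16000
pathfix 9000
awk 27000
EOF

# 1. Print a single field: $1 is the name, $2 the size, $0 the whole line.
awk '{ print $1 }' sizes.txt

# 2. A pattern in front of the action: only lines starting with "w".
awk '/^w/ { print $1 }' sizes.txt

# 3. A regular expression combined with a numeric test, plus arithmetic.
#    Prints: pathchk 15.625K
awk '/path/ && $2 > 15000 { print $1, $2 / 1024 "K" }' sizes.txt

# 4. The same idea as a script file with a user-defined function.
cat > round.awk <<'EOF'
# Round to the nearest integer by adding a half and truncating.
# This simple trick is only correct for non-negative values.
function round(n) {
    return int(n + 0.5)
}
/^w/ { print $1, round($2 / 1024) "K" }
EOF
# 48104 / 1024 is about 46.9766, so this prints: wget 47K
awk -f round.awk sizes.txt

# 5. A C-style for loop and printf, run once per input line.
printf '3\n7\n' > numbers.txt
cat > loop.awk <<'EOF'
# The extra parameter i is the usual awk idiom for a local variable.
function print_list(n,    i) {
    for (i = 1; i <= n; i++)
        printf "%d ", i
    printf "\n"
}
{ print_list($1) }
EOF
awk -f loop.awk numbers.txt
```

Running the script prints the output of each step in turn. Putting the longer programs in files and running them with awk -f keeps the command line short, which is the reason given in the video for moving from one-liners to scripts.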
Info
Channel: Gary Explains
Views: 395,762
Keywords: Gary Explains, Tech, Explanation, Tutorial, AWK, Spreadsheets, Excel, text processing, big data, data science, data extraction, reporting, generating reports, data-driven, Alfred Aho, Peter Weinberger, Brian Kernighan, Unix, Linux, command line tools, Linux text processing tools
Id: jJ02kEETw70
Length: 16min 29sec (989 seconds)
Published: Thu Feb 20 2020