Lecture 4: Data Wrangling (2020)

Captions
All right, so welcome to today's lecture, which is going to be on data wrangling. Data wrangling might be a phrase that sounds a little odd to you, but the basic idea of data wrangling is that you have data in one format and you want it in some different format, and this happens all of the time. I'm not just talking about converting images: you might have a text file or a log file, and what you really want is that data in some other format, like a graph, or statistics over the data. Anything that goes from one piece of data to another representation of that data is what I would call data wrangling.

We've already seen some examples of this kind of data wrangling earlier in the semester: basically, whenever you use the pipe operator, which lets you take output from one program and feed it through another program, you are doing data wrangling in one way or another. What we're going to do in this lecture is take a look at some of the fancier, really useful ways you can do data wrangling.

In order to do any kind of data wrangling, though, you need a data source, some data to operate on in the first place, and there are a lot of good candidates for that kind of data. We give some examples in the exercise section of today's lecture notes. In this particular one, though, I'm going to be using a system log. I have a server running somewhere in the Netherlands, because that seemed like a reasonable thing at the time, and on that server runs the regular logging daemon that comes with systemd, a relatively standard Linux logging mechanism. There's a command called journalctl on Linux systems that will let you view the system log, and so what I'm going to do is run some transformations over that log and see if we can extract something interesting from it.

You'll see, though, that if I run this command, I end up with a lot of data, because there's just a lot of stuff in this log. A lot of things have happened on my server, and this goes back to January 1st, and there are logs that go even further back. So the first thing we're going to do is try to limit it down to only one kind of content, and here the grep command is your friend. We're going to pipe this through grep, and we're going to grep for ssh. SSH is something we haven't really talked about yet, but it is a way to access computers remotely through the command line, and in particular, when you put a server on the public Internet, lots and lots of people around the world try to connect to it, log in, and take over your server. I want to see how those people are trying to do that, so I'm going to grep for ssh, and you'll see pretty quickly that this also generates a bunch of content. At least in theory; this is going to be real slow... there we go. This generates tons and tons of content, and it's really hard to even just visualize what's going on here.

So let's look at only the usernames people have used to try to log into my server. You'll see some of these lines say "Disconnected from invalid user" and then some username. I want only those lines; that's all I really care about. I'm going to make one more change here, though. Think about what this pipeline does: if I grep here for "Disconnected from", this pipeline at the bottom will send the entire log file over the network to my machine.
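Roughly, the pipeline built up so far looks like this (tsp is the name of the remote server, as mentioned later in the lecture; the grep strings follow the captions):

    ssh tsp journalctl                                        # the whole log: far too much
    ssh tsp journalctl | grep ssh                             # only ssh-related lines
    ssh tsp journalctl | grep ssh | grep "Disconnected from"  # only the failed-login lines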
Then grep runs locally to find only the lines that contain ssh, and then they're filtered further, also locally. This seems a little bit wasteful, because I don't care about most of these lines, and the remote side is also running a shell. So what I can actually do is have that entire command run on the server: I'm telling ssh that the command I want to run on the server is this pipeline of three things, and then what I get back, I want to pipe through less.

So what does this do? Well, it's going to do the same filtering that we did, but it's going to do it on the server side, and the server is only going to send me the lines I care about, and then locally I pipe it through the program called less. less is a pager. You'll see some examples of this, and you've actually seen some already: when you type man and some command, that opens in a pager. A pager is a convenient way to take a long piece of content and fit it into your terminal window, letting you scroll down and scroll up and navigate it so that it doesn't just scroll past your screen. So if I run this, it still takes a little while, because it has to parse through a lot of log files, and in particular grep is buffering, and therefore it decides to be relatively unhelpful. Let me try this without it; let's see if that's more helpful. Why doesn't it want to be helpful to me? Fine, I'm going to cheat a little, just ignore me. Or the internet is really slow; those are two possible options.

Luckily there's a fix for that, because previously I ran the following command, which just takes the output of that pipeline and sticks it into a file locally on my computer. I ran this when I was up in my office, and what it did is download all of the ssh log entries that matched "Disconnected from", so I have those locally. This is really handy: there's no reason for me to stream the full log every single time, because I know that starting pattern is what I'm going to want anyway. So we can take a look at ssh.log, and you will see there are lots and lots and lots of lines that all say "Disconnected from invalid user", "authenticating user", etc. These are the lines we have to work on, and this also means that going forward, we don't have to go through this whole ssh process; we can just cat that file and operate on it directly.

Here I can also demonstrate the pager: if I do cat ssh.log and pipe it through less, it gives me a pager where I can scroll up and down (let me make that a little smaller) so I can scroll through this file, and I can do so with what are roughly vim bindings: Ctrl-U to scroll up, Ctrl-D to scroll down, and q to exit.
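A sketch of the two commands described here (the filtering runs remotely, so only the matching lines cross the network):

    # do the filtering on the server, page the result locally
    ssh tsp 'journalctl | grep ssh | grep "Disconnected from"' | less

    # ...or save the matching lines to a local file once, and work on that
    ssh tsp 'journalctl | grep ssh | grep "Disconnected from"' > ssh.log
    less ssh.log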
This is still a lot of content, though, and these lines contain a bunch of garbage that I'm not really interested in. What I really want to see is: what are these usernames? And here the tool we're going to start using is one called sed. sed is a stream editor; it's a modification of a much earlier program called ed, which was a really weird editor that none of you will probably want to use. (Yes? Oh, tsp is the name of the remote computer I'm connecting to.) So sed is a stream editor, and it basically lets you make changes to the contents of a stream. You can think of it a little bit like doing replacements, but it's actually a full programming language over the stream it is given. One of the most common things you do with sed, though, is to just run replacement expressions on an input stream.

What do these look like? Well, let me show you. Here I'm going to pipe this through sed, and I'm going to say that I want to remove everything that comes before "Disconnected from". This might look a little weird, but the observation is that I don't care about the date, the hostname, and the process ID of the SSH daemon; I can just remove that straight away. And I can also remove the "Disconnected from" bit itself, because it's present in every single log entry, so I just want to get rid of it. So what I write is a sed expression; in this particular case it's an s expression, a substitute expression. It takes two arguments, enclosed between these slashes: the first one is the search string, and the second one, which is currently empty, is the replacement string. So here I'm saying: search for the following pattern, and replace it with nothing. And then I'm going to pipe it into less at the end. Do you see? What it's done now is trim off the beginning of all these lines, and that seems really handy.

But you might wonder what this pattern is that I've built up here, this ".*": what does that mean? This is an example of a regular expression. Regular expressions are something you may have come across in programming in the past, but once you move to the command line, you will find yourself using them a lot, especially for this kind of data wrangling. Regular expressions are essentially a powerful way to match text (you can use them for other things than text too, but text is the most common example), and in regular expressions you have a number of special characters that say: don't just match this character, but match, for example, a particular type of character, or a particular set of options. It essentially generates a program for you that searches the given text. Dot, for example, means any single character, and if you follow a character with a star, it means zero or more of that character. So in this case, the pattern is saying: match zero or more of any character, followed by the literal string "Disconnected from", and replace all of that with nothing.

Regular expressions have a number of these kinds of special characters with various meanings you can take advantage of. I talked about star, which is zero or more; there's also plus, which is one or more, saying I want the previous expression to match at least once. You also have square brackets, which let you match one of many different characters. So let's build up a string, something like "aba", and say I want to substitute a and b with nothing. Here what I'm telling the pattern to do is replace any character that is either a or b with nothing, which produces "ba"; and if I make the first character b, it will still produce "ba". You might wonder why it only replaced once. It's because regular expressions, at least in this default mode, will just match the pattern once and apply the replacement once per line; that is what sed normally does. You can provide the g modifier, which says: do this as many times as it keeps matching, which in this case would erase the entire line, because every single character is either an a or a b. If I add a c here, it removes everything but the c; if I added other characters in the middle of this string somewhere, they would all be preserved, but anything that is an a or a b is removed.
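A small sketch of the substitution behavior just described:

    echo 'aba'  | sed 's/[ab]//'     # -> ba   (only the first match is replaced)
    echo 'bba'  | sed 's/[ab]//'     # -> ba   (still just one replacement)
    echo 'abac' | sed 's/[ab]//g'    # -> c    (g: replace every match on the line)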
You can also do things like add modifiers to this. For example, what would this do? This is saying I want zero or more of the string "ab", replaced with nothing. This means that if I have a standalone a, it will not be replaced; if I have a standalone b, it will not be replaced; but if I have the string "ab", it will be removed. Which... yeah, OK, sed is being stupid here. That's because sed is a really old tool, and by default it supports only a very old version of regular expressions. Generally you will want to run it with -E (capital E), which makes it use a more modern syntax that supports more things. If you're in a place where you can't, you have to prefix these parentheses with backslashes to say "I want the special meaning of parenthesis"; otherwise they just match a literal parenthesis, which is probably not what you want. So notice how this replaced the "ab" here, and it replaced the "ab" here, but it left this c, and it also left the a at the end, because that a alone does not match the pattern anymore.

And you can group these patterns in whatever ways you want. You also have things like alternations: you can say "anything that matches ab or bc, I want to remove". Here you'll notice that this "ab" got removed, but this "bc" did not get removed even though it matches the pattern, because the "ab" had already been removed: this "ab" is removed, but the c stays in place; this "ab" is removed, and this c stays, because it still does not match. If I remove this a, then the "ab" pattern will not match this b, so it'll be preserved, and then "bc" will match, and it'll go away.

Regular expressions can be all sorts of complicated when you first encounter them, and even once you get more experience with them, they can be daunting to look at. This is why you very often want to use something like a regular expression debugger, which we'll look at in a little bit. But first, let's try to make up a pattern that will match the logs we've been working with so far. Here I'm going to extract a couple of lines from this file, let's say the first five. These lines all now look like this, and what we want is to keep only the username.

So what might this look like? Well, let me show you one thing first. Let me take a line that says something like "Disconnected from invalid user Disconnected from", then some address and port, whatever ("disconnected", missing an s... "Disconnected", thank you). So this is an example of a login line where someone tried to log in with the username "Disconnected from". You'll notice that this actually removed the username as well, and this is because when you use ".*", and any of these range expressions in regular expressions, they are greedy: they will match as much as they can. So in this case, this was the username that we wanted to retain, but the pattern actually matched all the way up until the second occurrence of it, the last occurrence of it, and so everything before it, including the username itself, got removed. So we need to come up with a slightly cleverer matching strategy than just saying ".*", because with particularly adversarial input, we might end up with something we didn't expect.
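A sketch of that greedy-match problem, with a made-up log line (the address and port here are invented for illustration):

    # the username is literally "Disconnected from"; the greedy .* eats it too
    echo 'Disconnected from invalid user Disconnected from 1.2.3.4 port 22 [preauth]' \
      | sed 's/.*Disconnected from //'
    # -> 1.2.3.4 port 22 [preauth]    (the username is gone)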
OK, so let's see how we might try to match these lines; let's just do a head first, and let's try to construct this pattern up from the beginning. We first of all know that we want -E, because we don't want to have to put all these backslashes everywhere. These lines look like they say "from", and then some of them say "invalid" but some of them do not: this line has "invalid", that one does not. The question mark here is saying "zero or one", so I want zero or one of "invalid" followed by a space, then "user". What else? Well, the space has to go inside the group, because otherwise we'd end up with a double space, and we can't have that. Then there's going to be some username, and then there's going to be what looks like an IP address. Here we can use our range syntax and say zero to nine and a dot (that's what IP addresses are made of), and we want many of those. Then it says "port", so we're just going to match the literal word port, and then another number, zero to nine, and we want plus of that.

The other thing we're going to do here is what's known as anchoring the regular expression. There are two special characters for this: there's caret, or hat (^), which matches the beginning of a line, and there's dollar ($), which matches the end of a line. So here we're going to say that this regex has to match the complete line. The reason we do this is: imagine that someone made their username the entire log string; then, if you tried to match this pattern, it would match inside the username itself, which is not what we want. Generally you will want to try to anchor your patterns wherever you can, to avoid those kinds of oddities.

OK, let's see what that gave us. That removed many of the lines, but not all of them. This one, for example, includes this "preauth" at the end, so we'll want to cut that off: if there's a space, then "preauth" in square brackets. Square brackets are special, so we need to escape them. Right, now let's see what happens if we try more lines of this. No, it still gets something weird: some of these lines are not empty, which means the pattern did not match. This one, for example, says "authenticating user" instead of "invalid user". OK, so let's match "invalid" or "authenticating", zero or one time, before "user". How about now? OK, that looks pretty promising.

But this output is not particularly helpful: here we've just successfully erased every line of our log file, which is not very useful. What we really wanted is, when we match the username right over here, to remember what that username was, because that is what we want to print out. The way we can do that in regular expressions is using capture groups. Capture groups are a way to say "I want to remember this value and reuse it later", and in regular expressions, any parenthesized expression is going to be such a capture group. We actually already have one here, this first group, and now we're creating a second one here. Notice that these parentheses don't do anything to the matching, because they're just wrapping this expression as a unit with no modifiers after it, so it just matches one time. The reason capture groups are useful is that you can refer back to them in the replacement: in the replacement here, I can say \2, which is the way you refer back to a capture group. In this case I'm saying: match the entire line, and then, in the replacement, put in the value you captured in the second capture group. Remember, this is the first capture group, and this is the second one. And this gives me all the usernames.
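Putting it all together, roughly the command built up in this passage (the character classes follow the description above; the exact pattern typed in the video may differ slightly):

    cat ssh.log \
      | sed -E 's/^.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/' \
      | less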
Now, if you look back at what we wrote, this is pretty complicated. It might make sense now that we've walked through it and seen why it had to be the way it was, but it's not obvious that this is how these lines work, and this is where a regular expression debugger can come in really, really handy. We have one here; there are many online. Here I've pre-filled in the expression that we just used, and notice that it tells me everything the matching does. In fact (this window is a little small with this font size), if I look here, the explanation says ".* matches any character between zero and unlimited times", followed by "Disconnected from" literally, followed by a capture group, and it walks you through all the stuff. That's one thing, but it also lets you give it test strings, matches the pattern against every single test string you give, and highlights what the different capture groups are. So here we made the user a capture group, so it'll say: OK, the full string matched (the whole thing is blue), green is the first capture group, red is the second capture group, and this is the third, because "preauth" was also put into parentheses. And this can be a handy way to try to debug your regular expressions.

For example, if I put "Disconnected from", and let's add a new line here, and I make the username "Disconnected from"... now, that line already had the username be "Disconnected from" (look at me, thinking ahead). You'll notice that with this pattern, this was no longer a problem, because the username got matched. But what happens if we take this entire line and make that the username? Now it gets really confused. This is where regular expressions can be a pain to get right, because it now tries to match the first place where "user" appears, or the first "invalid" (in this case the second "invalid"), because this is greedy. We can make this non-greedy by putting a question mark here: if you suffix a plus or a star with a question mark, it becomes a non-greedy match, so it will not try to match as much as possible. And then you see that this actually gets parsed correctly, because this ".*?" will stop at the first "Disconnected from", which is the one that's actually emitted by sshd, the one that actually appears in our logs.

As you can probably tell from the explanation so far, regular expressions can get really complicated, and there are all sorts of weird modifiers that you might have to apply in your pattern. The only way to really learn them is to start with simple ones and then build them up until they match what you need. Often you're just doing some one-off job, like hacking out the usernames here, and you don't need to care about all the special conditions. You don't have to care about someone having an SSH username that perfectly matches your log format; that's probably not something that matters, because you're just trying to find the usernames. But regular expressions are really powerful, and you want to be careful if you're doing something where it actually matters.

You had a question? Regular expressions by default only match per line anyway; they will not match across newlines. The way sed works is that it operates per line, so sed will apply this expression to every line.

OK, questions about regular expressions or this pattern so far? It is a complicated pattern, so if it feels confusing, don't be worried about it; look at it in the debugger later. Yep? So, keep in mind that we're assuming here that the user only has control over their username, so the worst they could do is take this entire entry and make that the username. Let's see what happens... right, that works, and the reason for this is that this question mark means that the moment we hit the "Disconnected from" keyword, we start parsing the rest of the pattern.
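sed's regular expressions don't support non-greedy matching, so here is a sketch of the greedy vs. non-greedy difference using perl instead (the log line is invented for illustration):

    LINE='Disconnected from invalid user Disconnected from 1.2.3.4'
    echo "$LINE" | perl -pe 's/.*Disconnected from //'
    # greedy: .* runs to the LAST occurrence  -> 1.2.3.4
    echo "$LINE" | perl -pe 's/.*?Disconnected from //'
    # non-greedy: .*? stops at the FIRST      -> invalid user Disconnected from 1.2.3.4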
And the first occurrence of "Disconnected from" is printed by sshd before anything the user controls, so in this particular instance, even this will not confuse the pattern.

Yep? Well, when you're doing data wrangling, this sort of odd matching is in general not security-related, but it might mean that you get really weird data back. So if you're doing something like plotting data, you might drop data points that matter, or you might parse out the wrong number, and then your plot suddenly has data points that weren't in the original data. The point is more that if you find yourself writing a complicated regular expression, double-check that it's actually matching what you think it's matching, even when it's not security-related.

And as you can imagine, these patterns can get really complicated. For example, there's a big debate about how to match an email address with a regular expression, and you might think of something like this. This is a very straightforward one that just says letters, numbers, underscores, dots, and percent signs, followed by a plus (because in Gmail you can have pluses in email addresses with a suffix; in this pattern, though, the plus means "any number of these, but at least one", because you can't have an email address with nothing before the @), and then similarly after the @ for the domain, and the top-level domain has to be at least two characters and can't include digits: you can have .com, but you can't have .7. It turns out this is not really correct. There are a bunch of valid email addresses that will not be matched by this, and a bunch of invalid email addresses that will be matched by it. So there are many, many suggestions, and there are people who've built full test suites to try to see which regular expression is best (this particular one is for URLs; there are similar ones for email), where they found that the best one is this one. I don't recommend trying to understand this pattern, but this one apparently almost perfectly matches what the internet standard for email addresses says is a valid email address, and that includes all sorts of weird Unicode code points. This is just to say: regular expressions can be really hairy, and if you end up somewhere like this, there's probably a better way to do it. For example, if you find yourself trying to parse HTML or parse JSON with regular expressions, you should probably use a different tool, and there is an exercise that has you do exactly this without regular expressions. Yeah, there are all sorts of suggestions, and they give you deep dives into how they work, if you want to look that up; it's in the lecture notes.

OK, so now we have this list of usernames. Let's go back to data wrangling, because this list of usernames is still not that interesting to me. Let's see how many lines there are: if I do wc -l, there are 198,000 lines. wc is the word count program, and -l makes it count the number of lines. This is a lot of lines, and if I start scrolling through them, that still doesn't really help me. I need statistics over this, I need aggregates of some kind. The sed tool is useful for many things (it gives you a full programming language, and it can do weird things like insert text or print only matching lines), but it's not necessarily the perfect tool for everything.
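Counting the extracted usernames, a sketch continuing from the pipeline above:

    cat ssh.log \
      | sed -E 's/^.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/' \
      | wc -l
    # one line per login attempt; roughly 198,000 in the lecture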
Sometimes there are better tools. You could, for example, write a line counter in sed, but you just shouldn't: sed is a terrible programming language, except for searching and replacing. But there are other useful tools. For example, there's a tool called sort. On its own, this is also not going to be very helpful, but sort takes a bunch of lines of input, sorts them, and then prints them to its output. So in this case I now get the sorted output of that list. It is still two hundred thousand lines long, so it's still not very helpful to me, but now I can combine it with the tool called uniq. uniq will look at a sorted list of lines and only print those that are unique: if you have multiple instances of any given line, it will only print it once. And then I can say uniq -c, which says: count the number of duplicates for any lines that are duplicated, and eliminate them. What does this look like? Well, if I run it (it's going to take a while): there were 13 "zze" usernames, there were 10 "zxvf" usernames, etc., and I can scroll through this. This is still a very long list, but at least now it's a little more collated than it was. Let's see how many lines I'm down to now... OK, 24,000 lines. It's still too much, it's not useful information to me yet, but I can keep boiling this down with more tools.

For example, what I might care about is which usernames have been used the most. Well, I can do sort again, and I can say I want a numeric sort on the first column of the input: -n says numeric sort, and -k lets you select a whitespace-separated column from the input to sort by. The reason I'm giving "1,1" here is that I want to start at the first column and stop at the first column; alternatively, I could say I want you to sort by this range of columns, but in this case I just want to sort by that one column. And then I want only the last ten lines. sort by default outputs in ascending order, so the entries with the highest counts are going to be at the bottom, and then I want only the last ten lines. Now, when I run this, I actually get a useful bit of data. It tells me there were eleven thousand login attempts with the username root, there were four thousand with 123456 as the username, etc. This is pretty handy, and now suddenly this giant log file actually produces useful information for me; this is what I really wanted from that log file. Now maybe I want to quickly disable root login over SSH on my machine, for example, which I recommend you do, by the way. In this particular case we don't actually need the -k for sort, because sort by default will sort by the entire line, and the number happens to come first, but it's useful to know about these additional flags.

You might wonder: how would I know that these flags exist? How would I know that these programs even exist? The programs you usually pick up just from being told about them, in classes like this one. For the flags: if you think "I want to sort by something that is not the full line", your first instinct should be to type man sort and read through the page, and it will very quickly tell you how to select a particular column, how to sort by a number, etc.
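A sketch of the aggregation pipeline at this point:

    cat ssh.log \
      | sed -E 's/^.*Disconnected from (invalid |authenticating )?user (.*) [0-9.]+ port [0-9]+( \[preauth\])?$/\2/' \
      | sort | uniq -c \
      | sort -nk1,1 | tail -n10
    # the ten most-used usernames, most frequent last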
OK, what if, now that I have this top-ten list (let's say top twenty), I don't actually care about the counts? I just want a comma-separated list of the usernames, because I'm going to send it to myself by email every day or something like that: "these are the top 20 usernames". Well, I can do this. OK, that's a couple more weird commands, but they're commands that are useful to know about. awk is a column-based stream processor. We talked about sed, which is a stream editor, so it primarily edits the text in its input; awk, on the other hand, also lets you edit text (it is still a full programming language), but it's more focused on columnar data. By default, awk will parse its input into whitespace-separated columns and then let you operate on those columns separately. In this case I'm saying: just print the second column, which is the username. paste is a command that takes a bunch of lines and pastes them together into a single line (that's the -s) with the delimiter comma (that's the d). So running this gives me a comma-separated list of the top usernames, which I can then use for whatever I might want. Maybe I want to stick it in a config file of disallowed usernames, or something along those lines.

awk is worth talking a little more about, because it turns out to be a really powerful language for this kind of data wrangling. We mentioned briefly what this "print $2" does, but it turns out you can do some really, really fancy things with awk. For example, let's go back to where we just have the usernames (let's still do sort and uniq, because otherwise the list gets far too long), and let's say that I only want to print the usernames that match a particular pattern. Let's say, for example, that I want all of the usernames that appear only once and that start with a "c" and end with an "e". That's a really weird thing to look for, but in awk this is really simple to express: I can say I want the first column to be 1, and I want the second column to match the following regular expression (this middle part could probably just be dot-star), and then I want to print the whole line. So unless I messed something up, this will give me all the usernames that start with a c, end with an e, and appear only once in my log.

Now, that might not be a very useful thing to do with the data. What I'm trying to do in this lecture is show you the kinds of tools that are available, and in this particular case the pattern is not that complicated, even though what we're doing is sort of weird. This is because with Linux tools in particular, and command-line tools in general, the tools are built around lines of input and lines of output, and very often those lines have multiple columns, and awk is great for operating over columns.

Now, awk isn't just able to do things like match per line; it lets you do more. Let's say I want the number of these: I want to know how many usernames match this pattern. Well, I can pipe through wc -l, and that works just fine: there are 31 such usernames. But awk is a programming language. This is something that you will probably never end up doing yourself, but it's important to know that you can, because every now and again it is actually useful. (This might be hard to read on my screen, I just realized; let me try to fix that... yeah, apparently fish does not want me to do that.) So here, BEGIN is a special pattern that only matches the zeroth line, END is a special pattern that only matches after the last line, and then this is a normal pattern that's matched against every line. So what I'm saying here is: on the zeroth line, set the variable rows to zero; on every line that matches this pattern, increment rows; and after you have matched the last line, print the value of rows.
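A sketch of the awk and paste fragments from this passage. Here counts.txt is a hypothetical file holding the "count username" lines produced by sort | uniq -c above, and the [^ ]* in the middle of the regex (any run of non-space characters) is an assumption, since the captions don't spell it out:

    # comma-separated list of the top 20 usernames (column 2 of each line)
    sort -nk1,1 counts.txt | tail -n20 | awk '{print $2}' | paste -sd,

    # usernames used exactly once that start with c and end with e
    awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $0 }' counts.txt | wc -l

    # the same count, computed entirely inside awk
    awk 'BEGIN { rows = 0 }
         $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += 1 }
         END { print rows }' counts.txt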
This will have the same effect as running wc -l, but all within awk. In this particular instance, wc -l is just fine, but sometimes you want to do things like keep a dictionary or a map of some kind, or compute statistics, or do things like "I want the second match of this pattern", so you need a stateful matcher: ignore the first match, but print everything following the second match. And for that, this kind of simple programming in awk can be useful to know about. In fact, we could get rid of the sed and sort and uniq and grep that we originally used to produce this file and do it all in awk, but you probably don't want to do that; it would probably be too painful to be worth it.

It's worth talking a little bit about the other kinds of tools you might want to use on the command line. The first of these is a really handy program called bc. bc is the Berkeley calculator, I believe (man bc... I think bc is originally "Berkeley calculator"; anyway). It is a very simple command-line calculator, but instead of giving you a prompt, it reads from standard input. So I can do something like echo "1+2" and pipe it to bc -l (because many of these programs normally operate in a stupid mode where they're unhelpful), and here it prints 3. Wow, very impressive. But it turns out this can be really handy. Imagine you have a file with a bunch of lines, let's say, oh I don't know, this file, and let's say I want to sum up the counts of the usernames that have not been used only once. So for the ones where the count is not equal to one, I want to print just the count: this gives me the counts for all the non-single-use usernames. And then I want to know how many of these there are in total. Notice that I can't just count the lines; that wouldn't work, because there are numbers on each line, and I want their sum. Well, I can use paste to join the lines with "+": this pastes every line together into one big plus expression, which is now an arithmetic expression, so I can pipe it through bc -l. And now I know there have been 191,000 logins that share a username with at least one other login. Again, probably not something you really care about, but this is just to show you that you can extract this data pretty easily.

And there's all sorts of other stuff you can do with this. For example, there are tools for computing statistics over inputs. So for this list of numbers (I just took the counts and printed out the distribution of numbers), I could use R. R is a separate programming language that's specifically built for statistical analysis, and I can say... let's see if I got this right. This is, again, a different programming language that you would have to learn, but if you already know R (and you can pipe things through other languages too), this gives me summary statistics over that input stream of numbers: the median number of login attempts per username is 3, the max is around 10,000 (that was root, which we saw before), and it also tells me the average. This might not matter in this particular instance, these might not be interesting numbers, but if you're looking at things like output from your benchmarking script, or anything else where you have some numerical distribution you want to look at, these tools are really handy. We can even do some simple plotting if we want. So this has a bunch of numbers.
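A sketch of the bc and R one-liners from this passage (counts.txt is the same hypothetical counts file as above, and R must be installed for the second command):

    # total number of logins whose username was used more than once:
    # print the counts, join them with "+", and hand the sum to bc
    awk '$1 != 1 { print $1 }' counts.txt | paste -sd+ | bc -l

    # summary statistics (min / median / mean / max) over the same counts
    awk '$1 != 1 { print $1 }' counts.txt \
      | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'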
Let's go back to our sort -nk1,1 and look at only the top 5. gnuplot is a plotter that can take things from standard input. I'm not expecting you to know all of these programming languages, because they really are programming languages in their own right; this is just to show you what is possible. So this is now a histogram of how many times each of the top 5 usernames has been used on my server since January 1st, and it's just one command line. It's a somewhat complicated command line, but it's just one command-line thing that you can do.

There are two sort of special types of data wrangling that I want to talk to you about in the last little bit of time that we have, and the first one is command-line argument wrangling. Sometimes you might have something like what we looked at in the last lecture: things like find that produce a list of files, or maybe something that produces a list of arguments for your benchmarking script. Like, let's say you had a script that printed the number of iterations to run a particular project (you wanted an exponential distribution or something), and it prints the number of iterations on each line, and you want to run your benchmark for each one. Well, here a tool called xargs is your friend. xargs takes lines of input and turns them into arguments. This might look a little weird, so let me see if I can come up with a good example.

I program in Rust, and Rust lets you install multiple versions of the compiler. In this case, you can see that I have stable, beta, a couple of earlier stable releases, and a bunch of different dated nightlies. This is all very well, but over time, I don't really need the nightly version from, like, March of last year anymore; I can probably delete that. Every now and again, maybe I want to clean these up a little. Well, this is a list of lines, so I can grep for nightly, and I can get rid of (-v means "don't match") the current nightly, which I don't want to match. OK, so this is now a list of dated nightlies, and maybe I want only the ones from 2019. And now I want to remove each of these toolchains from my machine. There's a "rustup toolchain remove", or uninstall maybe, "toolchain uninstall", right. So I could manually type out the name of each one, or copy-paste them, but that gets annoying really quickly, because I have the list right here. So instead, how about I sed away this suffix that it adds (so now it's just the name), and then I use xargs. xargs takes a list of inputs and turns them into arguments, so I want these to become arguments to rustup toolchain uninstall. And just for my own sanity's sake, I'm going to prefix this with echo, so it will show which command it's going to run. Well, it's relatively unhelpful, hard to read, but at least you see the command it's going to execute: if I remove this echo, it's rustup toolchain uninstall and then the list of nightlies as arguments to that program. And so if I run this, it uninstalls every toolchain, instead of me having to copy-paste them.

So this is one example where this kind of data wrangling can actually be useful for tasks other than just looking at data; it's just going from one format to another. You can also wrangle binary data. A good example of this is stuff like videos and images, where you might actually want to operate over them in some interesting way.
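Sketches of the gnuplot histogram and the rustup cleanup (counts.txt is the hypothetical counts file from above, and the -x86... target suffix stripped by sed is an assumption about rustup's toolchain naming):

    # bar chart of the 5 most-used usernames, read from stdin ("-")
    sort -nk1,1 counts.txt | tail -n5 \
      | gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'

    # uninstall all dated 2019 nightlies in one go
    rustup toolchain list \
      | grep nightly \
      | grep -v 'nightly-x86' \
      | grep 2019 \
      | sed 's/-x86.*//' \
      | xargs rustup toolchain uninstall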
For example, there's a tool called ffmpeg. ffmpeg is for encoding and decoding video, and to some extent images. I'm going to set its log level to panic, because otherwise it prints a bunch of stuff. I want it to read from /dev/video0, which is my webcam's video device, and I want to take just the first frame (I just want to take a picture), and I want an image rather than a single-frame video file, and I want it to print its output, the image it captures, to standard output. "-" is usually the way you tell a program to use standard input or output rather than a given file: here it expects a file name, and the file name "-" means standard output in this context. Then I pipe that through a program called convert. convert is an image manipulation program; I tell convert to read from standard input, turn the image into the gray color space, and write the resulting image to the file "-", which is standard output. Then I pipe that into gzip, which will compress the image file, also operating on standard input and standard output. And then I pipe that to my remote server, and on the server I decode the image and store a copy of it. Remember, tee reads its input and prints it both to standard out and to a file, so this makes a copy of the decoded image file as copy.png and then continues to stream it out. So now I bring that back into a local stream, and here I display it in an image displayer. (A sketch of this whole pipeline is at the end of these captions.) Let's see if that works... hey! So this did a round trip to my server and then came back over pipes, and there's now a decompressed version of this file, at least in theory, on my server. Let's see if it's there: let me scp that copy.png over here... yeah, hey, the same file ended up on the server, so our pipeline worked.

Again, this is a sort of silly example, but it lets you see the power of building these pipelines, where it doesn't have to be textual data; you're just taking data from any format to any other. For example, if I wanted to, I could cat /dev/video0 and pipe that to a server that, say, Anish controls, and then he could watch that video stream by piping it into a video player on his machine. You just need to know that these things exist.

There are a bunch of exercises for this lab, and some of them rely on you having a data source that looks a little bit like a log. On macOS and Linux, we give you some commands you can try to experiment with, but keep in mind that it's not that important exactly which data source you use. The point is to find some data source where you think there might be an interesting signal, and then try to extract something interesting from it; that is what all of the exercises are about. We will not have class on Monday, because it's MLK Day, so the next lecture will be Tuesday, on command-line environments. Any questions about what we've covered so far, or the pipelines, or regular expressions? I really recommend that you look into regular expressions and try to learn them; they are extremely handy, both for this and for programming in general. And if you have any questions, come to office hours and we'll help you out.
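For reference, a sketch of the webcam pipeline described above (the hostname tsp and ImageMagick's display as the local image viewer are assumptions; the captions only say "image displayer"):

    # grab one webcam frame, grayscale it, compress it, round-trip it through
    # the server (keeping a decoded copy there as copy.png), display it locally
    ffmpeg -loglevel panic -i /dev/video0 -frames 1 -f image2 - \
      | convert - -colorspace gray - \
      | gzip \
      | ssh tsp 'gzip -d | tee copy.png' \
      | display -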
Info
Channel: Missing Semester
Views: 106,631
Keywords: mit, lecture, tools, command-line, bash, scripting, shell, linux
Id: sz_dsktIjt4
Length: 50min 3sec (3003 seconds)
Published: Sat Feb 01 2020