Advanced Programming Techniques in MATLAB, Part 1 | Master Class with Loren Shure

Captions
Perfect. Hello everybody, my name is Loren Shure, and I'm the first person hired at MathWorks. As such, in almost 35 years I've contributed broadly to many of our products, including making some additions to the MATLAB language. I'm excited to be running master classes live on YouTube, two this May and two coming in October. My goal is to help you learn tips, tricks, and best practices so you can get the most out of MATLAB. To make sure you don't miss any of these sessions, please subscribe to the MATLAB YouTube channel to watch the next sessions live. Thank you.

Okay, well, today I'm going to talk about one of the two techniques in Advanced Programming Techniques. This week I'm going to talk about memory, and what you as a MATLAB user, a MATLAB programmer, would benefit from knowing about how MATLAB uses memory. To follow on from that, in two weeks I'll be talking about the kinds of functions there are in MATLAB, and I'll also talk about what their implications are for memory. Then in the autumn I'll be talking about some other programming things. Please put your questions in the chat and we will try to answer them. You will get access to the slides and code; there should be a link near the chat that you can use to download the materials.

So, advanced programming techniques: I'm going to do just the first one today. I'm going to talk about MATLAB and memory and what you as a programmer in MATLAB really should know. I'm going to talk about passing arrays, and things like how structures are structured from a memory point of view. The functions we're going to ignore until two weeks from now. My plan here is to have a slide for each topic within memory and then a MATLAB example that I'm going to run in my MATLAB, which you see sitting in the background here.

So, when MATLAB passes arrays to functions, what's going on? Does MATLAB make copies of the arrays, or does it not?
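The slide shown at this point is not reproduced in the captions. As a rough sketch only, a function of the kind the talk describes (three inputs, one output, a write to an element of the second input, then a straight line) might look like the following. The body of `foo` and the amount added to `a` are assumptions made for illustration; the variable names `xx`, `aa`, `bb`, and `yy` are the ones used in the demo.

```matlab
% Caller workspace: three arrays, as in the demo.
xx = 1:3;
aa = 4;
bb = 2;

yy = foo(xx, aa, bb);

% foo wrote to its second input, but the caller's aa is untouched.
disp(aa)   % still 4
disp(yy)

function y = foo(x, a, b)
% Hypothetical body: the exact statements on the slide are not in the captions.
a(1) = a(1) + 12;   % change an element of the second input (triggers a copy of a)
y = a * x + b;      % straight line; x and b are only read, so no copy is needed
end
```

Only `a` is written inside the function, so under copy-on-write it is the only input that needs its own memory; `x` and `b` can keep sharing data with `xx` and `bb` for the whole call.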
And the answer is not necessarily completely what you might expect. Here I have a typical function in MATLAB. You see it's function foo (well named, of course), with three inputs and one output. On the first executable line I'm changing an element of the second input, and then I'm computing what looks like a straight line. When I go to call this, in this case I'm calling it with parameters that are temporary variables. But let me talk about what's going to happen in MATLAB.

When we pass arrays into functions (and by array I mean any kind of MATLAB entity, because all MATLAB entities are arrays: it might be a scalar, a vector, an m-by-n-by-p-by-q array, or an array of structures), the mental model you should have is that MATLAB calls any function as if it's making a copy of all of the inputs before it calls the function. So if the function messes around with the inputs in some way, it doesn't affect your original data values. But that would be very memory unfriendly; it would potentially use a lot of memory if your x's and a's and b's were big. So what we do instead is something called lazy copy, or copy-on-write. We basically send in the original data in its original location, and only if it's changed do we sever the relationship between the variable that you called with and the variable inside the function.

So let me show you what this means over in MATLAB. I think I want to make my font a little bit bigger for you here. I should have thought of this before, sorry. Let me come to my preferences, my fonts, and let's make this go up; see if 22 does. Well, 24 is a little bit more than I want. Apply. Okay, that might be a little bit better.

Okay, so I have a function foo. It's got three inputs and one output, the same exact thing that I had on the slide. So now I'm going to put some variables
into my workspace: xx is 1 to 3, aa equals 4, bb equals 2. What you can see here is that I've now created some arrays in MATLAB. This is not a trick question, but I want you to think about this and count them with me: how many arrays did I just create? Never mind that some of them are scalar arrays. I made three things, three arrays in MATLAB: xx, aa, and bb.

Now I'm going to call foo using these xx, aa, and bb. But before I do that, I'm going to come into the editor and put a breakpoint at the first executable line in my function. Even though I've typed this line, yy equals foo of da da da, it hasn't executed yet, so I still have only my three variables in MATLAB. Now I'm going to call foo by pressing Enter. When I press Enter, I come into foo, and you'll see it stopped on line two, but it has not executed line two yet.

At this moment, from a conceptual point of view, MATLAB now has six variables. It's got the double-letter variables in the main workspace (xx, aa, bb), and then it's got their equivalents x, a, and b in this function workspace. But from a physical point of view, at this very moment x and xx are identical, and the same with the a's and the b's. So from a physical point of view there are only three variables in MATLAB, even though I have six variables conceptually.

Now I'm going to come to the Editor tab and run a step here. Now I'm sitting on line three; I haven't executed line three. From a conceptual point of view I still have six variables, because I didn't add any new variables here. I changed one of them, and that change triggers things so that now a and aa cannot share the same memory, because one is different from the other. So now, from a physical point of view, instead of having three I have four.

Let me step again. Now I've executed line three and I'm ready to head out of the function. So at this moment, how many variables are there in all of MATLAB, from
a conceptual point of view? I have seven. I've got four now in my function workspace (x, a, b, and y), and I have my original three double-letter ones in the main workspace. So I have seven from a conceptual point of view. From a physical point of view, I have five.

Now we're done with the function, so let's continue. I'm back at the MATLAB prompt, and now how many are there in all of MATLAB? The answer may surprise some of you, but maybe not: it's four. I have my xx, aa, bb, and my yy. In fact, if I type who, you can see what variables are in that workspace. What happens in any function workspace once we leave it is that anything that got created in there and is not being sent out goes away; it gets cleaned up. So the copy that I essentially had of this scalar went away. For a brief moment when I was leaving the function foo, y and yy shared the same memory. Then foo got rid of its connection to y because it was cleaning up, but yy retained it.

Let me review that really quickly for you. We started off with three variables, and then we called foo. From a conceptual point of view we went from three to six to seven to four. From a physical point of view we went from three to four to five to four.

Why am I telling you this? It seems sort of dumb in some ways. Well, think about it. x was maybe my data here, the locations where my data are, where I'm trying to produce some calculation, in this case a straight line. I never changed x. I used it, but I never changed it. So if x was, I don't know, a terabyte, MATLAB wouldn't make an extra terabyte copy here. Of course, once I calculate y, that's going to take a terabyte and use more memory. And a being a scalar, yes, the copy is going to take more memory, but not very much, so we're not so worried about that. Hopefully that makes some sense to you.

While I'm in here, I'm just going to show you a little bit about the way I use MATLAB. I use MATLAB where I have the Quick
Access Toolbar here, and I can click in it. It's normally up here, where you can see all the tabs from the desktop. Because I have a whole bunch of shortcuts (I like using favorites), I move it below the toolstrip. Then if I want to minimize the toolstrip, I still have what I put over here, anything I want. Let me just bring this down again. You'll notice that I use Run and Advance so often that I already added it to my Quick Access Toolbar here, so that's a really nice, handy thing. And I have another one that is really helpful, which clears my workspace, makes the cursor in my Command Window go to the upper left-hand corner, and closes all my open figure windows. So if I go like this, now I have nothing in my workspace and nothing in my Command Window, so I'm free to move forward.

Okay, the next problem we want to talk about in MATLAB from a memory point of view is: when does MATLAB do calculations in place? By that I mean I have an input variable, and then I'm going to use that same name as an output variable. If I do that, let's suppose x is a terabyte again and I have a clean workspace. I started with a terabyte, and when I'm done I end up with a terabyte, because this isn't changing the size of the array we're working on. So I have a terabyte in and a terabyte out. If I say y equals the same expression, I have x in and y out, so I have one terabyte in and then two terabytes, for x and y, on the way out. So the interesting question is: okay, but what happens in the middle? The answer is not straightforward, but it's not difficult either.

Let me come and show you some code here. In order to answer this, I'm going to show you a function I have that is going to create a large array, and then you'll see I'm going to call two different functions (excuse me one second), myfunc and myfuncInPlace, and
you'll notice that in both cases I'm using xx as an input and xx as an output, instead of putting the output in a new variable. And here are my two functions. They're actually the same as each other except for (a) the name and (b) the output variable. In myfuncInPlace I'm reusing the input variable name as the output variable name, and in myfunc I'm using a new variable name.

I want to show you how this works, so let me come over here and get rid of foo. Now this is myfuncInPlace, and let's talk about this function. It's the same one I had on the slide. You probably know that dot-caret (.^) with an exponent of 2 does an element-wise square, so if I have the elements 1, 2, 3, 4, I'm going to get 1 squared, 2 squared, 3 squared, 4 squared. So I'm basically going to take x, square each element and multiply by 2, add to that 3 times each x element, and then add the scalar 4.

This could be true in n dimensions, but I can only draw with my hands in two. So think about the (i, j) element of the input. What does it influence in the output? It influences the (i, j) element and nothing else. I tend to call these things element-wise functions, because one element affects only the equivalent element in the output.

So that's myfuncInPlace, and here's myfunc: the same exact thing, but I'm putting the output into a new variable name. I'll get rid of both of those, and now here's my showInPlace. Ignore the debugger stuff right now. In fact, I'm going to make this bigger so you can see the whole thing. I'm going to set up an array, just make up some random numbers here. I'm going to first call myfunc with x as the input and x as the output, and then I'll do the same with myfuncInPlace. Then I'm going to call them each with separate names for the left-hand side, so I'm going to call myfunc with a y, and then with a yy.
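The code on screen is not part of the captions, so the following is only a sketch of the experiment as described. The expression `2*x.^2 + 3*x + 4` follows the spoken description; the spellings `myfunc`, `myfuncInPlace`, and `showInPlace` are renderings of the spoken names, and the array size `n` is an assumption.

```matlab
% Run the experiment; watch memory in a system monitor while it executes.
showInPlace(2000);

function showInPlace(n)
xx = randn(n);              % large array of random numbers (n is assumed)

xx = myfunc(xx);            % same name on both sides at the call site only
xx = myfuncInPlace(xx);     % same name at the call site and inside the function:
                            % MATLAB may be able to reuse the memory of xx
y  = myfunc(xx);            % new left-hand side: a new array is needed
yy = myfuncInPlace(xx);     % new left-hand side: a new array is needed

assert(isequal(y, yy))      % both functions compute the same values
disp('finished')
end

function y = myfunc(x)
% Element-wise calculation; the output uses a new variable name.
y = 2*x.^2 + 3*x + 4;
end

function x = myfuncInPlace(x)
% Same calculation; the output reuses the input variable name.
x = 2*x.^2 + 3*x + 4;
end
```

The calls are made from inside a function rather than from the command prompt because the rule of thumb in the talk is stated for one function calling another.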
Interesting question: someone just asked if you have to terminate a function with an end statement. You do not have to. It's a great question, by the way, and it's something I will cover in two weeks, but I'm happy to talk about it here for a moment. In this case I can put it there and it doesn't change the meaning of this program. But there are times when you have to use end statements; that won't be today, it'll be in two weeks. If you have some old MATLAB code that doesn't use end statements and you want to change it to use them, you can. There are two rules, and I'll get into this more in two weeks. If you have more than one function in a file and you terminate one function with an end statement, you have to terminate all the functions in that file with an end statement. And if you happen to be using nested functions, you must terminate them with an end statement, which means everything in that file that is a function needs to be terminated with an end. If you don't know what nested functions are, come join me in two weeks; I'll be talking about them. In the meantime, this end is completely up to you. It really doesn't matter; MATLAB won't be more or less efficient because of it either way.

Okay, so I'm going to just run this code. Oh, and then I said I was going to call the regular one with y, and yy, as output, and then I'm done. So let me run this, and what I want to do now is bring up the Windows Task Manager, because I want to see what's happening with memory. When I created this array here, I had a bump up in memory. Obviously, because I'm online, other stuff is going on on my machine, so you may see the memory jitter a little bit, but that's just from my first one. Now what I want to do is move on from here and call myfunc with the same left-hand side as right-hand
side. Whoops, I moved that and I didn't mean to. Let me go like this so that you can see what's going on, and let me continue here. We had one bump up because we made the array x. Now when I call myfunc, notice that memory goes up again by about that same amount, but then it comes down, because we're replacing the old x with a new x. Now we're going to do the in-place version, and when we do, if you look here, it did it in place and there wasn't a big bump in memory.

Now let me do the last ones. I'm going to call the regular one with a new left-hand side. When I do, we see the memory go up again, because it needs a chunk of memory the size of x. It went up as far as this one did, but it's staying up. Now I'm going to call myfuncInPlace with a new left-hand side, and because it's a new left-hand side, we bump up again. Now I'm near the end of the function, so I'm going to hit Continue, and that's going to say finished, and we're done. You see that when we finish, MATLAB gives back all the memory that we were using inside this function for xx and for y and yy, because the function is done.

So here's the rule of thumb. If I have a function here that's calling another function here, and the input name is the same as the output name both when I call the function and in the function I call (in this case it's not), then MATLAB will try to do the calculation in place. And you may say, oh man, Loren, why are you saying try? Don't you just do it? Well, we can't just do it, because there's a relationship that has to be true between x the input and x the output for this to work. If I were in front of a live audience with you, I would make you try to tell me what that is. But in the interest of not making it complicated through YouTube and all that, I'm going to tell you: x the input and x the output have to be the
same size and data type. So if x is a 1-by-3 double, the output x also has to be a 1-by-3 double. Imagine my input was an unsigned 8-bit integer, so one byte, and my output was a double, eight bytes. Well, eight bytes don't fit into one byte. That's why we need the same data type and the same size. And if you're thinking in MATLAB, it even needs to be the same kind of storage: if the input is full and not sparse, the output needs to be full and not sparse. So that's one thing that needs to be true.

The other thing that needs to be true is that we have to know an algorithm that will work. It turns out that for matrix multiplication, even if I'm talking about a square matrix and I say A equals A times A, there is no in-place algorithm that lets me do that. I actually have to make a new output to hold the new A, because if I start overwriting elements in my A while I'm doing the calculation, I will get the wrong answer. So we have to have a good combination of the right input and output relationship and an algorithm.

Now, element-wise operations always work, so things like trigonometric functions, plus and minus, and things of that sort. There's another class of functions that also works, and I'm going to attempt to show you by modeling it here on the video, so hopefully you can see me. Suppose we're using one of the cumulative functions in MATLAB. That would be cumsum (cumulative sum), cumprod, cumtrapz, cummin, cummax. Let's suppose we're doing the sum. Pretend I'm representing an element of the array, and right now I'm the (1,1) element. When I do a cumulative sum, the (1,1) element is just the (1,1) element. So I come over to number two. That element is one plus two. I have one right next door, so I can take one plus two and replace myself with one plus two. Let's see if that's going to keep working. Let me go to three. Three is one plus two plus three. I'm
three. One plus two is right there, so I can take it, add it to myself, and replace myself. So all the cumulative functions, whether you're working them forwards or backwards, can take advantage of that.

This is still pretty esoteric, because even if I can do some of these things in place, that might not cover the whole calculation I'm doing. What we have found is that sometimes it helps to refactor your code. You say, oh, I've got some element-wise stuff, or some stuff that could go in place, and I have a really large array to work on, but that's in the middle of my calculation. So you can refactor your code into a pre-processing function, then the element-wise (or whatever the in-place) part, and then a post-processing piece. Often the pre-processing piece and the post-processing piece are not using every array to do everything. They might be padding one array with a few extra elements, so I don't need copies of everything while that's going on; I might just need to make that one array grow. We found that people who were near the edge of something working, or else going into virtual memory, which would slow things down, were able by refactoring to make this work better for them without having to think about going to a much bigger machine, buying more memory, or using a parallel system. This is just within the confines of MATLAB. So that's something that may help you once in a while, but it's a little strange.

Okay, now I want to talk about for loops and their performance and memory, because there's a link between performance and memory here. Suppose I want to calculate this matrix: each element is going to be the row number times the row number, plus the column number cubed, plus 17. I'm going to show it to you three different ways. I'm going to show how to create this array looping across the rows, looping down the columns, and vectorized, because there are
different implications depending on which way you address the memory with MATLAB. That's because we have to store the MATLAB array somehow, and I'm going to show you here the way we store it. Let me do my little thing of clearing and closing things. Now you will see that instead of using a regular MATLAB code file that ends in .m (whoops, wrong way), I am now using something that ends in .mlx. It's not even a new feature anymore; it came out in 2016b. It's called the Live Editor, and it lets me create these beautiful documents that have actual code in them too, but I get to tell a whole story here. I said this on the slide; let me show you how MATLAB memory is laid out. And I see I have an extra space here. Okay, so I was telling the story in this: I used to be a Fortran programmer. I am reformed, but I was, and it was the way to...

Okay, a question: what is the case if the output of the function has fewer bytes than the input? That's a good question too. If it's a different type, we just make new memory; we don't overwrite at all, because we don't know at that point if it's going to work. So if you're going from a double to an int, there's no sense of in place, because we don't want to use the whole array and just the first byte of each element. We want to actually compact it even more, so it won't do that in-place memory thing. Another great question; thank you for asking.

Okay, so in the old Fortran days, I used to program in Fortran all the time. You really had to be careful in Fortran to arrange the code you wanted to run through your loops in the best way, so that it used the memory as sequentially as possible. That meant you needed to know how Fortran stored arrays. As it turned out, if this is your array, with column one, column two, column three, it stored column one, followed by column two, and
then underneath that column three, and so on, like a long string, column by column by column. That's the way Fortran works, and that's actually the way MATLAB works. It's not the way C works; C works row-wise. We could have made any choice. At the time, Fortran was a very popular mathematical and scientific computing language, and so we did what they did.

So let's just see: if I try to run the loops in different orders, is there some implication for that? I'm going to time it here, so this is just setting up the timing array, because I'm going to time three things. Remember, I'm going to time the loop with the inner loop going over rows, the inner loop going over columns, and then a vectorized solution. I'm going to work on a 5x500 array, and then I'm going to time this. So here my output array is filled going by rows on the outside: I'm stepping one column at a time in the inner loop, so I'm going across the rows, across row 1 and then across row 2, and so on.

There are a lot of ways I can run this. You'll notice in the Live Editor I can do that same Run and Advance, and it's up here, but I can also just put my cursor on this chunk of code, this section, and run it, and you can see that it had an indicator that it was running. Now I'm going to do the same straightforward loop. You'll see this is rows times rows, plus columns cubed, plus 17.
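The live script itself is not in the captions. The column-by-column layout and the three timing variants described above can be sketched roughly as follows. The formula (row times row, plus column cubed, plus 17) is a reading of a garbled caption, and the size `n` and all variable names are assumptions. The cube is written as repeated multiplication so that the three results are bit-for-bit identical.

```matlab
% Column-major storage: linear indexing walks down column 1, then column 2, ...
M = [1 2 3; 4 5 6];
disp(M(:).')                 % 1 4 2 5 3 6

n = 2000;                    % assumed size for illustration

% Variant 1: inner loop over columns, so memory is visited across each row.
tic
A1 = zeros(n);
for r = 1:n
    for c = 1:n
        A1(r,c) = r*r + c*c*c + 17;
    end
end
tAcrossRows = toc;

% Variant 2: inner loop over rows, so memory is visited down each column.
tic
A2 = zeros(n);
for c = 1:n
    for r = 1:n
        A2(r,c) = r*r + c*c*c + 17;
    end
end
tDownColumns = toc;

% Variant 3: vectorized, combining a column of row numbers with a row of column numbers.
tic
rows = (1:n).';
cols = 1:n;
A3 = rows.*rows + cols.*cols.*cols + 17;   % implicit expansion (R2016b or later)
tVectorized = toc;

% Check that all three computed the same thing before comparing times.
assert(isequal(A1, A2, A3))
fprintf('across rows %.3f s, down columns %.3f s, vectorized %.3f s\n', ...
    tAcrossRows, tDownColumns, tVectorized)
```

Timings vary from run to run and machine to machine, so no particular numbers should be expected from this sketch.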
And now I'm going to do the column version, and we're going to time that; that's what the tic and toc are doing. Now I'm going to do the vectorized version. This is my number of rows, this is my number of columns, and I have to do transposes and things to get things right, and I'm going to time that too.

Before we even worry about the times, we had better make sure we computed the same things. That's what I'm doing here with isequal: I am making sure that everything got computed correctly. And here are the execution times. The first one took about four seconds. It took a little bit longer going across the rows than down the columns, and the vectorized one was even faster. So the order in which you do loops is important.

Now, you might say it depends on how often I run this. If I run it again, I'm going to get a different answer, because MATLAB begins to learn things as it runs the code. So if I come here and we do this again (this, and this, and this, and this), you see that there's some difference there, and now there's really a bigger difference between the column version and the row version. I can try it one more time. I don't know that it's going to make a big difference, but let's try it. So it makes a difference; it depends what jitter is going on on my computer at the time, too.

I don't want you to learn from this that you should never use for loops. Sometimes a loop is the best way to express something, and sometimes it's the only way; not everything can be vectorized like I did here. What I do want you to be aware of is that going down the rows of each column, so that you're addressing the elements in the order they are stored, is faster. That will always be faster than going across the columns of each row, because of the way memory is laid out for MATLAB arrays. Hopefully that is helpful to you.

While I'm here, I'm just going to take
advantage of the fact that this is a live script and show you one more thing about it. I can actually export this when I'm done. Let me zoom out, hit Export, and zoom in again, and you'll see I can now take my document and export it to PDF, Word, HTML, or LaTeX. I also have a whole bunch of other things that I can do in there that I'm not going to talk about. It's a really nice way to get your thinking straight, so I encourage you to look at it if you have stuff that you need to document, whether it's for your professor, a paper or the outline of a paper you're trying to write, or for your boss, or whatever.

Okay, let me move on. Until now I've talked about numeric arrays, which were the only thing that existed in the first versions of MATLAB when I used it. But the real world doesn't really work like that; there's other information that's pertinent to what we're doing. Suppose I'm doing an experiment and I want to collect the data, but there's often metadata that I also want to have associated with it. For example, I might want to know what the ambient temperature and barometric pressure are, or the name of the person who's taking the measurements. Those things are different from the data themselves. So I might like to somehow take all my data, which might be some time sequences (maybe I'm getting a data point every second for 100 seconds, and I'm getting some other thing, a voltage), and then also put the experimenter's name and the temperature and pressure there. I want to have that all get collected together.

We have at least two ways, actually three ways, to collect things together very nicely. One is with a cell array and one is with a struct, and I'll talk about tables in just a few minutes. So let me just start with only one piece of data that we're going to
worry about, my numeric data here. I have a really tiny amount of data here, obviously. It's a double array even though it looks like integers; in MATLAB, when you input numbers, they are doubles. So I have a double array, and then I'm going to take that same array and put it into a cell of a cell array, and I'm also going to take that same array and put it into the d field of a structure.

For those of you who don't know what cell arrays and structs are, let me give you my analogy, which I think will help you. Everything in MATLAB is an array, which means it's regular: it's m-by-n, or m-by-n-by-p, or m-by-n-by-p-by-q, and so on. Well, cell arrays and structs are too. This cell array actually happens to be a 1-by-1 cell; it's a scalar array. When I was a kid, I would sometimes steal an egg carton when it was empty, before it went into the trash. I would put my pennies in one of the sections, feathers in another, and white stones that I found in another, and some of them would be empty. What unified these things, kind of like the experiment, was that it was my stuff, and I didn't want anyone touching my stuff. Think about it as a cell array: my egg crate in my case was two by six. If I told you to get the feathers and you couldn't look inside to see, you would want to know that they're in row 1, column 3, or something like that. So I could do that.

But suppose that instead of one of the dimensions, I have names. I could think about the last dimension as having names instead of column numbers. I could store something in this box, and each of the little areas would have a name on it, and those names might be pennies, dimes, feathers, stones, and so on. Then I could say, get the stones out of the stones section, and you would know what to do. So that's the idea behind cell arrays and structs. And because they're a container holding my other stuff, they're going
to take up a little bit more space than just the original stuff. But the question is how much, and I want to talk about that now. For everything that I'm going to talk about, I might refer to just a cell array or I might refer to a struct, but they're basically very similar, and if I say something for one, it's true for both of them.

So let me come over here, clear everything out, and clear all the output. Here we have the array that I'm going to make, and the cell array, and the struct, and you can see them there. The question is which one is going to be biggest. I hope I've convinced you, without thinking hard about it at all, that d is the smallest of these. So the question is, between a cell array and a struct, which one is bigger and which one is smaller? Usually I make people vote, so think in your head which one you think might be bigger and why, or smaller and why, and then I will come over here, run this for you, and explain what's going on. Let me make that a little bit bigger so you can see it.

You can see my double array takes up 16 bytes: eight bytes for each of the two doubles. On top of that, my cell array, which is only 1-by-1, has 104 more bytes. What is being taken up in those 104 bytes? Well, it has to hold information about what it's holding. What's in there is something that's a 1-by-2 array; it's double; it's full and not sparse; it's real, not complex; and so on. So we need some information about that, and the minimum amount we need is 104 bytes of overhead. If you have something that's highly dimensional, like 1 by 3 by 7 by 20 by, you know, a really long set of dimensions, it might take even more space. And then you'll notice the struct, which really only holds the same stuff, is another 64 bytes more than the cell array. So I'm going to come here, I'm going to put in a section
break and start typing something, because I want to explain what's going on: namelengthmax. This is a function whose name you probably have not heard of, and with good reason; there's no reason you need to know it normally. It has an answer of 63. Let me come here and get help on namelengthmax, and here you see it's the maximum identifier length in MATLAB. What's an identifier? An identifier is the name of anything in MATLAB that you need to know about so that you can refer to it, so that you can tell MATLAB what you want it to do. That means variable names, the field names of MATLAB structures, scripts, functions, class names, and Simulink model names. It's not the names of data files; that's different stuff. But this is basically the names that you need in MATLAB. And you'll notice that it's 63. We tried to make it big enough that you could make names that were expressive, but not so big that we're wasting a lot of memory, because most people only use three or four characters.

Now, it turns out that the struct is 64 bytes bigger and not 63 bigger. I'm letting you in on an implementation detail you kind of don't need to know, but why not tell you; it's not hurting anything. We saved the final byte, the 64th byte, for what's called the null terminator, because it's written in C, and in C that's the convention you use so you know where a string ends. We make sure there's always room for a null terminator, so a name can never get over 63 characters that way.

So now we know that a double array takes less memory than a cell array with the equivalent data, and the struct takes a little bit more space on top of that. But I've only had scalar arrays here so far: this is a 1-by-1 cell, this is a 1-by-1 struct. MATLAB, which you probably know stands for matrix laboratory, is actually an array-based language, so I can make arrays of these
things too. And so I want to talk about how we store things in structures that, first of all, have more than one field, and second of all, we'll get to what happens when we have arrays of them. So I'm going to make a scalar struct again this time, but I'm going to have two fields, not just one: an a field and a b field, and it'll be big. And then I'm going to say sNew equals s, and we'll see what happens with memory in MATLAB when I make that copy. And then I'm going to change one element of one field of s and see what's going on. So I'm going to come over here, I'm going to shorten this, and I'm going to bring back our Task Manager there so that we can see the memory here. Okay, now you'll notice that instead of running things in place, I'm running things side by side now; that's another way you can run here. Okay, and I'm going to create a structure here, and you'll see on the memory gauge that the memory goes up a bit, one little step. And now when I create b, it should go up by that same amount, because it's the same size array that I'm putting in there. And now I'm going to make a copy. I'm going to say sNew equals s, and if you think about it from a logical point of view, it should go up by those same two steps, because I'm making copies. But we also talked about copies not always needing to be made, right? So what's happening here is we actually don't do anything. I've got sNew, it tells me that it exists now, but nothing happened in memory, because at this very moment sNew and s are completely identical. And now I'm going to modify one element of one field of s, and the question is: do I go up by two bumps or one? Let me do it while they're still on the screen. I think I ran it, maybe I didn't. We go up by one bump. So let me clear this out again, and now that I've talked through it, let's run through this faster. I want to clear everything out to get our memory back. Let's come up here, and I'm going to run the section and
advance, so we're going to get one bump up. Oops, sorry. Now you know what happens: you get errors. Life happens, it's okay. Now I'm going to create my first field; there's the first bump. You should see the second bump here for b. Here's sNew, and nothing happens. And here's me changing s.a, just even one element, and it's a little hard to tell here, but it's actually only going up by another one of these little steps. The granularity is not very easy to see here, so I'll show you: I'll do this now, and let's run this as well, and when I do, it goes up again. So b had kept its connection: s.b and sNew.b were connected until I made that change just now.

So what I want you to know is that every field in a structure contains its own MATLAB array, and that's why when I changed s.a, it had to sever itself from sNew.a, but s.b and sNew.b could be the same thing, and only when I modified the b field did we see that piece disconnect as well. And now I'm going to clear the memory out again.

Okay, so I just said something kind of interesting, I hope: that every field of a structure, and this means every cell in a cell array too, is its own MATLAB array. Why do you care? Well, the reason you should care is if you're passing one of these entities into another function, and that function's going to operate on some of it, it's only going to potentially change the fields or the elements of the cells that it touches, and all the other ones don't need to have extra copies made. So if you're trying to do something small in one little piece, and you have a really large amount of data that's part of this whole big entity, all that big stuff is not going to need to have a copy made, just the small fields. Maybe you're fixing the date of an experiment because it got entered incorrectly or something like that. It's not going to make a copy of all the data, because you're not changing that. It's only going to deal with
the field or the cell that's changing right there.

Okay, another thing you will notice is that I went and changed my code here. I'm going to just show you briefly: I'm actually working with stuff in Git behind the scenes, and so if I wanted to, I could come back here and I could say, whoops, what was my original code? I will try one more time and then I won't. Current Folder... okay, it won't let me do it. I'm not going to worry about it. I can revert, at any rate, for those of you who are Git users, if I want to.

Let me go to the next topic here, which is: okay, those were scalar structures. It was s.a and s.b, but there was really only an s(1,1). I want to think about what happens when we have an array of structures, and to do that I want to motivate it. So we all these days have cell phones with cameras, and we take pictures on them, and they basically get stored, effectively, as an m-by-n-by-3 array, because we're usually doing things in color. I'm not worried about compression right now, so it's not how many bytes it takes, but it's representing the size m by n. If it's low res, it's going to be, you know, whatever you set it to be, 200 by 100, and if it's high res, it might be 3000 by 2000. So different m's and n's depending on the resolution, and then three planes afterwards: one for red, one for green, and one for blue.

Okay, so I may be in a lab, and I'm trying to collect some sort of information of the same kind, and I could take a picture and collect the m-by-n-by-3 array, but there are devices put out by various manufacturers that may be more helpful, maybe more suitable, than what I have available. And so if we use sensor number one, they say: oh well, because we have a really good red filter and a really good green filter, we're going to make sure they're separated for you. So we're going to give you red, green, and blue, but instead of giving them in separate arrays, we're going to
put them into one structure for you. And if you think about it, this array, this red plane, is m by n, the green plane should be m by n, the blue plane's m by n, so I have three m-by-n's, which is the same as m by n by 3. So, same number of numbers right now. And then we have another equipment manufacturer who says: they may have a great red filter, but we can collect more data more quickly because we use an array of sensors. And so we have an array of sensors that's m by n, whether it's, you know, again 3000 by 2000 or whatever, and in each one of these we store one red, one green, and one blue, so three numbers for each of these. So we have m by n of these, times three, so it's the same number of numbers each way.

So let me come over here in MATLAB. Let me do the usual: let's clear things out, we'll get rid of that right now, and I want to get rid of the output. Okay, so I'm going to come over here and I'm going to load an image. You just saw it, but you'll see it load in now, and we're going to use this pretty colorful image. You can see it's 135 by 198 by 3, and it's only 80,000 bytes because it's not doubles, it's unsigned 8-bit, and in fact if we come over here, I think it might say that this class is unsigned 8-bit.

Okay, so now what I'm going to do, and I love this MATLAB code, this is one of the things that's so expressive about it: this says take my array x, my image, and give me all the rows and all the columns in the first plane, second plane, and third plane. So I can come here and I can get im1. Now, there isn't a really nice way for me to do the same thing for the im2 version, so I'm going to use a for loop, and I'm going over things in the right order here, so we're going down each row, you know, row 1, row 2, row 3, and then looping over the columns as we do that, and I'm putting the data in as we go. So I have x now; if we do a who, we have x, im1, and im2.
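The two layouts just described can be sketched roughly as follows, for readers following along without the demo files. This is a minimal sketch under stated assumptions: the random uint8 array stands in for the image loaded in the talk, and the field names red, green, blue, and rgb are hypothetical, not necessarily the ones used in the demo code.

```matlab
% Stand-in for the 135-by-198-by-3 uint8 image from the talk (assumed data).
x = randi([0 255], 135, 198, 3, 'uint8');
[m, n, ~] = size(x);

% Layout 1: one scalar struct holding three m-by-n planes.
% Field names are hypothetical.
im1.red   = x(:, :, 1);
im1.green = x(:, :, 2);
im1.blue  = x(:, :, 3);

% Layout 2: an m-by-n struct array with one 1-by-3 pixel per element.
im2 = repmat(struct('rgb', zeros(1, 3, 'uint8')), m, n);   % preallocate
for col = 1:n            % outer loop over columns
    for row = 1:m        % inner loop runs down each column (column-major order)
        im2(row, col).rgb = reshape(x(row, col, :), 1, 3);
    end
end

whos x im1 im2           % compare the Bytes column for the three variables
```

All three variables hold the same pixel values; only the number of separate MATLAB arrays differs (one for x, three for im1, m times n for im2). The exact byte counts reported by whos depend on the MATLAB release.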
I hope you'll grant me that x is going to be smaller than im1 or im2, because it's just m by n by 3, as we saw up here. And let's see what happens. So if you think about it, the way to think about this, in my mind, is to think about how many arrays are being stored. If we think about im1, there's an array for each field, right? So a red, a green, and a blue. So there are three fields; that's not that many. So if I come here, there's going to be a little bit of overhead. The overhead, remember, was 104, and then we also have to deal with the fact that there's a name there; there are three names there, in fact, and you can see that in my calculation for the overhead. im2 actually has only one field, so it's going to be namelengthmax plus one, but instead of the number of arrays being 3, it's got m-by-n arrays, each of which contains a pixel. And so I've got m-by-n arrays here. So when we come and look at this, oh my goodness, look at this. Now, my im2 is, in the first two dimensions, the same as my x. My im1 is only a one-by-one array, because it's a scalar struct, but it takes almost no more space than the original x: 500 bytes or so. im2 takes a lot more. This is almost three megabytes, everyone, three megabytes for the exact same data that I can store in 80,000 bytes. And this is the calculation for the overhead here. So the first one, im1, has an overhead of 504 bytes, and the second one is almost 3 megabytes.

Now, I don't want you to think that you should never ever use arrays of structs. There are times when you should and can, and it makes sense, but I want you to know about them so that you don't mistakenly do it in the case where you don't need it and therefore get a lot of extra memory bloat.

Okay, so what do we do in this case with im2? Is there a way we could fix it? Well, I don't know if what I'm going to do next makes sense for images, but I'm going to talk about tables. And it says "new" there, but in fact tables have been around since 2013,
so it's not so new anymore. So hopefully you know about them, but my experience is that not everyone knows about them, so I'm going to show them to you. I'm going to clear my workspace here, and I'm going to clear all the output here. Okay, and this data type is basically, think about a spreadsheet: as long as each column is homogeneous in terms of what it holds, you can put that into a table in MATLAB, and it supports very flexible indexing.

When we introduced tables, we also introduced categorical arrays. These are for data that are essentially non-numeric. Think about something like colors, like our favorite colors: mine might be purple, yours might be green. On the spectrum there are different things where we could talk about wavelength or something, but in terms of purely thinking about colors we like, there isn't one ordering that makes sense. It's really an unordered list. But there are things where an order would make sense. If you're going to the store and trying to buy a coffee, you might buy a small, a medium, or a large, with the expectation that a small holds less and costs less than a medium, which holds less and costs less than a large. Okay, so that would be an example of an ordered list, even though you maybe don't have exact numbers for it. And so they can be very helpful when you're doing some of the work too.

And then I'm going to just mention strings briefly. They're a better way to work with text. They're relatively new; I think they came out in 2016, and I'll talk about them in just a minute, but I want to come over here and talk about the tables first. So I'm going to run some code here. You'll see I'm just putting in MATLAB arrays; these are each length five here, and my table is going to come out as a five-by-five table. I didn't have to have the same number of rows as columns here, obviously. And you'll see the last names: they're all cells of a cell array, each containing a string, and then we have age, height, weight, and blood
pressure. And notice blood pressure is two numbers in each row. As long as each row is a one-by-two double, then I can stack them up and get the output to be what I want. And the table is five by five, and it's just pretty convenient to see what's going on there.

Okay, well, maybe you have saved some data that would be similar to this, and you'd want to try it in a table, and you saved it in a MAT-file. What I can do is load my MAT-file here into a structure if I give it a left-hand side, and you'll see when I hover here that everything is length 100, but there are different types: there are cell arrays, there are doubles, there are logicals. Okay, and so I can find out how many fields there are, or I could have come here and counted them. And now I would like to say: well, what if I put that into a table instead? Well, I can use the struct2table function to get me there, which is very nice, and now you see I have a hundred-by-ten table. And even though I left the semicolon off, it doesn't print it all out; it lets us scroll through there, and we can see all the different column names and so on.

Okay, and so now if we look in the workspace, we see that I have my structure, which is one by one, and it takes a little bit less memory, but not much less, than my table, which is a hundred by ten. The table, just by telling me there are a hundred rows and ten columns, is helpful from that point of view, and the extra overhead is not huge; it was about a thousand bytes. Okay, but if I add more rows, I'm not adding more overhead, because it's each one of these columns that has some overhead. If I added a new column, that would be like adding a whole other variable to my structure, so of course it would take up another name in the namespace there, and another one here. So if I add a column in either one of them, it's going to add some overhead as well as the data; if I just add rows, it's only adding data and no overhead. And so I showed you this, which is what the overhead was. Okay,
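The table steps above can be sketched as follows. This is only an illustration: the sample values are made up and the variable names are assumptions, since the talk's data came from the presenter's own demo files. The small struct S stands in for the structure loaded from the MAT-file.

```matlab
% Made-up sample data; each variable has five rows.
LastName      = {'Smith'; 'Johnson'; 'Williams'; 'Jones'; 'Brown'};
Age           = [38; 43; 38; 40; 49];
Height        = [71; 69; 64; 67; 64];
Weight        = [176; 163; 131; 133; 119];
BloodPressure = [124 93; 109 77; 125 83; 117 75; 122 80];  % two numbers per row

% Five variables with five rows each give a 5-by-5 table;
% BloodPressure counts as one variable even though it is 5-by-2.
T = table(LastName, Age, Height, Weight, BloodPressure);

% A scalar struct whose fields all have the same number of rows ...
S.LastName = LastName;
S.Age      = Age;
S.Smoker   = logical([1; 0; 0; 0; 1]);

% ... converts to a table with one variable per field.
P = struct2table(S);

whos S P     % compare the Bytes column for the struct and the table
```

Here P comes out as a 5-by-3 table, one column per field of S. The byte counts from whos vary by release, so treat them as a relative comparison only.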
now look what I can do. If we look in this table here, just looking at the top, we can see the ages are kind of bouncing all over the place, and even the last names aren't in order. Nothing's necessarily in order here; in fact, it's probably ordered by something that we didn't even leave in this table, like date of visit or something, I don't know. But I can come here and I can sort my rows in the patient table by age, and when I do, we can find the oldest patient by just saying give me the end one and all the columns, and we can find out about that person. But notice here I have a bunch of 25-year-olds. Maybe I want to sort them in another way. Let me come here, and I'm not going to worry about what I call it right now, but let me sort it by two things: let me sort it first by age and then by weight, and let me run this again. And now you'll notice that I've got the 25-year-olds at the top still, but now they are in order of increasing weight, and maybe not too shockingly, because women tend to be smaller than men, more often than not, on average, the women here happen to be first. Okay, and I could add as many more variables as I want to sort on, if I want to do tertiary sorting on another thing, whether they're smokers or not; as long as the first two things are a tie, it would sort on the next thing.

Okay, and then let me show you categoricals. There's one field in the patient table that was self-assessed health status, and this is when you go in and there are those smiley and frowny faces, and, you know, you basically have to pick where you are on the spectrum. And I know, being scientists or engineers, we tend to think: well, they said one to five, I'm feeling 4.3, I'm not really a four, but I'm definitely not a five. They don't care. They just want to know: do you feel lousy or do you feel okay? Okay, so from their point of view, that's all
they need. Now, for my cell array of strings here, which is what they're stored in, it takes about 12,000 bytes. All right, and now I'm going to turn this into a categorical array, because I have four categories there, if you think about it, and I'm going to show you the categories that it thinks are held in that array: excellent, fair, good, and poor. And I didn't make it an ordered list. You can do that; just go to the help for categorical and you can find out how. And then when we look to see the difference in the sizes, you'll see my categorical array is also 100 long, but instead of 12,000 bytes it's about 600 bytes. So if you're importing a spreadsheet, say, for example, and you have several columns that can be categorical, it may be to your advantage to make sure that they get imported as categorical rather than carrying around the strings. And the nice thing is I can do things like sum up all the people who said they felt good, and there it is, and we have an answer. So I can do logical operations and, if it's an ordered list, relational operations on it.

All right, now I talked about strings, and I want to come back over here to the strings. We introduced a new data type called a string in 2016.
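The categorical conversion described above can be sketched as follows. The status values here are synthetic (a repeating pattern of 100 entries standing in for the self-assessed health status column), and the coffee-size example at the end is an assumed illustration of the ordered case the talk mentions but does not demonstrate.

```matlab
% Synthetic stand-in for the 100-by-1 cell array of health status text.
status = repmat({'Excellent'; 'Fair'; 'Good'; 'Poor'; 'Good'}, 20, 1);

% Convert to categorical; each distinct text value becomes a category.
c = categorical(status);
cats = categories(c);        % category names, sorted alphabetically

% Logical comparison against a category name, then count the matches.
nGood = sum(c == 'Good');

whos status c                % compare the Bytes column

% Ordered (ordinal) categories also allow relational operators.
sizes = categorical({'small'; 'large'; 'medium'; 'small'}, ...
                    {'small', 'medium', 'large'}, 'Ordinal', true);
isSmallerThanLarge = sizes < 'large';
```

Because the categories in `status` repeat many times, the categorical array stores each name once plus a compact code per element, which is where the memory saving quoted in the talk comes from.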
And what you'll notice here is we're using double quotes instead of single quotes. That's observation number one. Observation number two is I have an array of these things, and they're not inside a cell array; this is a string array, and it's one by three, because this is a one-by-one string and a one-by-one string and a one-by-one string. And the way it got created, it got created from a one-by-one string and another one-by-one string, and in the middle we use implicit expansion to expand out the one through three, so we make image one, image two, and image three, for example. Okay, and we did this because we wanted to have better text processing capabilities; that enabled us to release the Text Analytics Toolbox, for example, and do some machine learning and deep learning with data of that sort as well as numeric data or images or whatever.

It also simplified text manipulation. So this is what I used to write, and believe you me, I've written this sort of thing many times. I want to see if the string "dog" is in my text data, and so I do a strfind, which will find me the beginning location of where "dog" is, and it may be empty, or it may give me the number two if it starts at the second index, or it might give me an array of indices, 2, 3, 44, 99, and if it's not empty, then I found one, so it's in there. It's really hard to read that code afterwards and remember what it does. So here's what you can do now. You can say: does text data contain my string "dog"? It's so much easier. I don't know about you, but I can kind of feel my blood pressure go down, I feel myself relax, because I can read the code and I don't have to stress about what it's doing. So not only is the code about 50 times faster this way than this way, this one with a string and that one with the cell array of strings instead, but strings also tend to save about half the memory over cell arrays of strings. So let's come and see this. So now I'm
going to turn my cell array of strings into a string array here, and now you'll see we have a hundred-by-one string array, and we get about a factor of two savings with the string. And I still get the properties, so that I can do the logical operations that I expect to be able to do: how many people said they felt good, and that's however many.

All right, just about done here. I just want to tell you that today we talked about MATLAB and memory and how MATLAB passes arrays to functions, and I told you it was by value. We call the function as if we made copies, but we do it with a lazy copy, or copy-on-write, paradigm, so we only make the copy when we actually need to. And I showed you that sort of wacky in-place optimization code pattern, where the input name and the output name have to match in two different locations, both where you call it and in the function being called, and then we try to do it in place. Not everything can be done in place, so we will if we can. And then I told you about the memory used for array storage, like numeric arrays versus cell arrays and structs, which have some overhead. And then I talked about the array of structs versus the struct of arrays, and from a memory point of view only, a small number of big arrays is better than a really large number of small arrays, because each array has its own overhead. So that's only from a memory point of view; there are other reasons why you might arrange things differently, but that's the way that happens. And I told you about the for loop and which way it's best to loop over the data, and then I showed you these categorical arrays, tables, and strings. There are some additional resources; you'll be able to download these slides. And this is just a public service announcement for two weeks from now: everything you wanted to know about functions. I want to thank you first for joining me today, and I'd like to encourage you again to join me in two weeks for a
discussion about functions in MATLAB, and then again for two sessions in the fall: the first will be on programming with MATLAB and, from a programming point of view, all the tools you have available, and the second one in the fall will be on object-oriented programming. I want to remind you also that if you want to be notified about these, please subscribe to the MATLAB YouTube channel to watch live. Thank you very much.
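For reference, the string array creation and the strfind versus contains comparison from the strings part of the talk can be sketched like this. The ".png" extension and the sample text are assumptions made for illustration; the talk only says the array held image one, image two, and image three.

```matlab
% A 1-by-3 string array built from two scalar strings and the numbers 1:3.
% The numbers are converted to text and the scalars expand to match.
% (The ".png" extension is assumed.)
names = "image" + (1:3) + ".png";

% Made-up text data as a cell array of character vectors.
textData = {'hot dog'; 'cat'; 'dogma'};

% Older pattern: strfind returns start indices per cell; non-empty means found.
hasDogOld = ~cellfun(@isempty, strfind(textData, 'dog'));

% Newer pattern: convert to a string array and ask the question directly.
textStr   = string(textData);
hasDogNew = contains(textStr, "dog");

whos textData textStr        % compare the Bytes column
```

Both patterns give the same logical answer for each element; the second simply reads the way the question is asked. With only three short entries the memory difference shown by whos is not representative of the savings described in the talk.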
Info
Channel: MATLAB
Keywords: work from home live, matlab, simulink, mathworks, matlab tutorial, time series data in matlab, python
Id: rqZBmLW_1mw
Length: 65min 15sec (3915 seconds)
Published: Thu May 13 2021