Mod-09 Lec-39 Cache coherence

Welcome to lecture 39 of the course on High Performance Computing. In the previous lecture we started looking at parallel architecture. Our objective is to understand a little more about programming parallel machines, because parallel machines are fairly widespread today: they provide improved performance, since programs run on parallel machines can run much faster, and they also provide the possibility of fault tolerance, which is sometimes an important requirement. We were looking at the possibility of a parallel machine in which many processors share a single physical address space with a single shared memory. In the previous class we understood that there is a potential problem if each of the processors has a cache, and this was referred to as the cache coherence problem. We looked at it through an example of what happens when a simple program with two processes runs on a parallel machine with four processors, each of which has a cache. In that example, there was a parallel program running as two processes, process P1 and process P2, with a shared variable called x, initially equal to 0 in main memory. When process P1 reads x, a copy of the variable x comes into the cache of the processor running P1; subsequently, if process P1 reads x again, it gets the value of x out of its cache. The problem is that process P2 could be sharing the variable x, and therefore, if process P2 also reads x, a copy of x comes into its cache, and process P2 could then read the value of x again and again from that cache. But if process P1 or process P2 now modifies the value of x, by an ordinary assignment x = 1 or some kind of store instruction on the memory variable x, the effect would be that one or more of the other cached copies of x becomes incorrect. If process P2 then reads x again, it gets the old, invalid value x = 0, whereas from the perspective of the parallel program x is now equal to 1. This is a serious problem, because the correct operation of the program has been compromised, and it is referred to as the cache coherence problem, which we described on the next slide. The cache coherence problem arises on a shared memory parallel machine in which each of the processors has a cache; it is a data consistency problem, and it arises when a shared variable needs to be modified: it gets modified in one cache, but maybe not in all the caches. This has to be handled, and it cannot just be left to the programmer, because it is not possible for the programmer to correct incorrect values inside caches; either the hardware or the system software has to handle it. We looked at one simple mechanism that could be built into shared memory parallel machines, namely snooping cache controllers, in which the cache controller at each processor constantly monitors what is being done on the shared bus by it and by the other processors.
If this is the case, then whenever an update to main memory happens on the shared bus, each of the cache controllers will notice it and can take corrective action, possibly by updating or invalidating its copy of the cache block in question. We saw that this corrects the problem in the previous example: when process P1 reads x and gets a copy of the block containing x in its cache, it does not matter if process P2 also gets a copy in its cache. Subsequently, when process P1 modifies the value of x under the Write Once Protocol, the copy in its cache gets updated to 1 and the copy in main memory gets updated to 1. Updating main memory requires a shared bus transaction, which is observed by all the caches in the system, since the cache controllers are snooping, and therefore each of the other cache controllers can invalidate its copy of the block, as shown in the next step. If any of the other processes then reads the value of x, it suffers a cache miss, which is correct behaviour, as opposed to the previous situation where it read the wrong value x = 0. Now, this simple idea can result in the correct execution of programs, but it may not result in good performance for those programs, as we will realize if we think about what happens in a shared memory system which uses something like the Write Once Protocol, a snoopy cache coherence protocol, but also uses locks to implement mutual exclusion of access to shared variables. You will remember that when we talked about concurrent programming, about how the operating system shares the CPU among different processes, we used the idea of a lock, a software mechanism built using a special instruction such as an atomic read-modify-write instruction like test-and-set. The lock is required so that, in regions of a concurrent or parallel program where accesses to shared variables take place, only one process of the parallel program modifies the shared variable at a time. That requirement is still present in parallel programming with shared variables, and therefore, just as one used locks to ensure mutually exclusive access to shared variables in concurrent programming, one would use locks for mutually exclusive access to shared variables in a parallel program as well; the same concept carries through. Therefore, in thinking about the snoopy cache coherence protocol, it helps to remember that the processes P1 and P2 that we just talked about will in all likelihood be using locks to ensure mutual exclusion in their access to the variable x. So let us now concentrate on the accesses to the locks themselves. I am showing the same system as before: there are four processors and a parallel program running, but instead of worrying about accesses to the shared variables, let us concentrate on what happens to a lock. Suppose that the value of the lock is initially 0; let me just remind you about locks. We are assuming that a lock is some kind of shared variable which has the value 0 if the lock is available and the value 1 if the lock is not available.
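To make the invalidate-on-write idea just described concrete, here is a toy sketch in C. It is my own simplification: a single block x, a two-state VALID/INVALID model, and hypothetical names such as bus_write; the real Write Once Protocol has more states than this.

```c
/* Simplified sketch of the invalidate step of a snoopy protocol such as
 * Write Once: every cache controller watches bus writes and invalidates
 * its own copy of the affected block. Only the invalidation behaviour is
 * modelled here, not the full protocol state machine. */
#define NCACHES 4

enum state { INVALID, VALID };

struct cache_line { enum state st; int value; };

struct cache_line caches[NCACHES];    /* one cached copy of block x per CPU */
int memory_x;                          /* the main memory copy of x          */

/* Called when processor `writer` performs a write that goes on the bus. */
void bus_write(int writer, int new_value)
{
    memory_x = new_value;              /* main memory updated over the bus   */
    caches[writer].st = VALID;
    caches[writer].value = new_value;  /* the writer keeps an up-to-date copy */
    for (int i = 0; i < NCACHES; i++)
        if (i != writer)
            caches[i].st = INVALID;    /* snooping controllers invalidate    */
}
```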
Further, there are two functions provided for manipulating locks: one function is used to acquire the lock, and the other is used to release the lock. If there is a lock called L, a parallel program would include a call to acquire the lock before any of its critical sections and a call to release the lock at the end of the critical section, and a process is not allowed to enter the critical section until the lock has been acquired. The implementation of release lock is fairly simple: the process just sets L equal to 0, indicating that the lock is now available. In implementing acquire lock, we used a special instruction; there could be different variants of this instruction, but we were using the variant called test-and-set. The idea of the implementation of acquire lock was that the process wishing to acquire the lock executes a while loop, in which each time through the loop it does a test-and-set of the lock variable. The property of test-and-set is that it is an atomic instruction which indivisibly does its steps as if they were one step: it reads the old value of L, sets the memory variable containing the lock L to 1, and returns the old value. The net effect is that, as long as the process executing acquire lock gets a return value of 1, in other words the lock is not available, it continues executing the while loop; but as soon as the value of L becomes equal to 0, because another process has released the lock, it gets a return value of 0 from test-and-set of L and can escape from acquire lock. So this was how we talked about the implementation of a lock, and for the parallel program we assume, once again, that the same kind of lock could be used. So suppose there is a lock variable which is initially equal to 0, and suppose that a process running on the leftmost processor first tries to acquire the lock by executing the test-and-set instruction. Since the value of the lock variable is 0, which means that the lock is available, this process gets a return value of 0, escapes from the test-and-set loop within its acquire lock function, and therefore acquires the lock, setting L to 1. When it sets L to 1, the cached copy of L gets set to 1 and the main memory copy of L also gets set to 1. Now, while this process running on the leftmost processor is executing inside its critical section, suppose that a process from the same program running on the second processor attempts a test-and-set of L. What happens when it executes test-and-set of L? In order to test L, it has to have a cached copy of the lock variable, so it gets a copy of L into its cache. Notice that the test sees that the value of L is equal to 1, and then, indivisibly as part of the same operation, it sets L equal to 1, and that write is why the cached copy inside the leftmost processor suddenly disappeared.
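As a concrete sketch of the acquire and release functions just described, here is a minimal version in C; atomic_exchange from C11 plays the role of the hardware test-and-set instruction (the use of C11 atomics is my assumption, since the lecture speaks of a machine instruction).

```c
/* Minimal busy-waiting lock built on an atomic test-and-set style operation:
 * atomic_exchange writes 1 and returns the old value in one indivisible step. */
#include <stdatomic.h>

static atomic_int L = 0;               /* 0 = lock available, 1 = lock held */

void acquire_lock(void)
{
    /* Spin as long as test-and-set returns the old value 1 (lock held). */
    while (atomic_exchange(&L, 1) == 1)
        ;                               /* busy wait                        */
}

void release_lock(void)
{
    atomic_store(&L, 0);                /* make the lock available again    */
}
```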
Let me just repeat this once to make sure it is clearly understood. When the test-and-set of L is executed on the second processor, there will be a cache miss, and therefore a copy of the block containing the variable L will be brought into the cache of that particular processor. But indivisibly it is also going to set L to 1, and that is a modification of L; therefore, under the cache coherence protocol, the main memory copy also gets set to 1, and because there is bus activity, the cached copy inside the other processor gets invalidated, flushed, which is why we now have the situation that looks like this. So when the process on the second processor executes a test-and-set of L, it causes the copy of the lock in the other cache to get invalidated, but it itself did not get the lock, because, remember, the return value from that test-and-set of L was 1: L is not available, and therefore it was not successful. Now suppose at this point a process running on the third processor executes test-and-set of L. Once again it suffers a cache miss, it brings a copy of L into its cache, and it then sets L indivisibly as part of the same operation; the net effect is that the main memory copy gets updated by an activity on the bus, and the cached copy in the second processor gets invalidated. Now there is one valid copy of the cache block containing the lock L, and it is in the third processor. As you can see, when a process on any of the processors executes test-and-set of L, for example when a process running on the fourth processor now executes test-and-set of L, something very similar happens: the cached copy inside the third processor gets invalidated, because the main memory copy of the lock is updated by the test-and-set instruction. In effect, the lock variable is going to move from cache to cache among all the processors which have a process trying to acquire the lock. In the meantime, as each of the waiting processes executes test-and-set of L, it causes the copy of the block containing the lock to enter its cache, and each such operation involves a memory operation and use of the bus. Sooner or later, the process which is inside the critical section executes the release lock function, which sets L equal to 0; as a result, L is 0 in its cache, L is 0 in main memory, and all the cached copies of L in the other caches disappear. Subsequently, the first process among the remaining processors to test-and-set the lock, say the one on the third processor, executes test-and-set of L; it gets a copy of the block, which initially has the value L = 0, it then tests and sets it, as a result of which L becomes equal to 1, the other cached copies disappear, the main memory copy becomes equal to 1, and it can then enter the critical section. So one can see that the operation of the lock is correct.
Cache coherence has correctly ensured that the different processes running on the different processors do not incorrectly update the value of the variable L, which happens to be a lock, and that mutual exclusion is in fact guaranteed: it was not the case that a process was able to enter a critical section despite not holding the lock. The functioning of the program is entirely correct. But if you think about things more critically, we had a parallel program in which each of three processes was executing a while loop doing test-and-set of L; the processes running on processors P1, P2 and P3 were all executing the acquire lock function, and in that function they execute test-and-set of L repeatedly, a single instruction inside a while loop. Therefore the lock variable was going to move from one cache to another with each execution of test-and-set of L on any of the processors, and each of those operations involved a bus transaction and an update of main memory. And this was happening quite frequently, because each of the processes running on processors P1 through P3 was executing the acquire lock function, a very small while loop, executing test-and-set of L again and again until the exit condition was reached. Therefore, if these processors run on a one gigahertz clock, they are going to execute test-and-set of L many millions of times per second, once every 2 or 3 nanoseconds, because the loop is very small. The net effect is that if we use this implementation of the acquire lock function on a shared memory machine in which there are caches and cache coherence is enforced using a snoopy invalidate cache coherence protocol, like the Write Once Protocol, we create a new problem: the lock moves back and forth from cache to cache at a very high frequency. As I mentioned, once every few nanoseconds one or the other of these processors executes test-and-set of L as it spins in its busy-wait lock loop, and every time any one of them does so, the block containing the lock variable moves from cache to cache, updating main memory along the way. This has been referred to by some people as ping-ponging of the lock. You can think of the lock as a ping-pong ball: in a game of table tennis the ball moves at very high frequency between the two halves of the table, and if the players are very skilful it moves very fast from side to side. In this case, the lock moves at a very high frequency from cache to cache, among all the caches which are busy waiting on the lock, and hence the term ping-ponging is not inappropriate. The net effect is that the performance of the entire computer system is going to suffer.
The correctness of the individual programs is not at risk here, because we have seen that the cache coherence protocol enforces the correct operation of the lock and correct access to the shared variables; that is not the problem. What has been created through this artifact of the cache coherence protocol is that the shared bus and the main memory are kept busy with these test-and-set instructions. The bus utilization is going to be extremely high, because once every few nanoseconds, as I just pointed out, a processor conceivably initiates a bus transaction to update main memory in its attempt to acquire the lock. The bus utilization could be so high that the bus is rarely available for any other processor in the system; in effect, even if only one process, say P1, were doing test-and-set of L and P2 and P3 were not, the effect might be almost as bad: the bus may be fully utilized just by the activity of busy waiting on the lock by one or two processors. So the primary problem is this: we needed locks for the correct implementation of mutual exclusion for the critical sections of the parallel program, and we needed a cache coherence protocol to ensure that cached copies of data, including the locks, did not become invalid or stale. But when you put the two together, the cache coherence protocol can result in very high bus utilization, leading to the possibility that memory cannot be accessed at all by other processes of the same program running on other processors, or by other programs running on other processors, and this is a severe problem which affects the system as a whole, as I mentioned. Therefore, in the context of a shared memory multiprocessor with a snoopy cache coherence protocol such as the one we just saw, what should one conclude? One should conclude that this might not be the correct way to implement a lock. By implementing a lock using busy waiting on test-and-set of L, one causes a severe problem for the efficient execution of the system; one does not cause a problem for the correctness of the programs running on the system, but there is a problem for the efficient use and utilization of the resources of the system. So one has to think about alternative implementations of the acquire lock function in the light of cache coherence protocols. The problem with this particular implementation is that it was a busy-wait loop which was busy waiting on the test-and-set instruction itself. One could instead think of an implementation in which the acquire lock function does not busy wait on the test-and-set instruction, but rather busy waits on the value of the lock: as long as the value of the lock is equal to 1, it loops, until the value of the lock becomes equal to 0.
Now, we saw earlier that this was the beginning of a problem when we tried to implement a lock using only this mechanism, because there is a need for an indivisible read-modify-write update of the lock variable, and when we tried to implement the lock using only busy waiting on the value of the lock, we ended up with an incorrect implementation. We can overcome that problem by still using the test-and-set instruction to do the actual updating of the lock. So while we busy wait on a read of the lock, we have an outer busy-wait loop in which we do the test-and-set of the lock: we have a repeat ... until not test-and-set of L, in order to ensure the correct implementation of the lock function itself. Remember, what we are talking about here is how the acquire lock function is going to be implemented; this is the implementation of acquire lock. The objective of this implementation is to make sure that the busy waiting is primarily on the value of the lock itself and not on a test-and-set of L, realizing that the reason the lock ping-ponged between the different caches was that, in the busy-wait loop, every few nanoseconds we were testing and setting the lock, which is a modification of the lock. Here we overcome that problem by busy waiting on the value of the lock and not on an operation that modifies the value of the lock. Sooner or later a process will exit from the inner loop, because L becomes equal to 0 when it is reset by another of the processors, and at that point it comes to its outer loop, where it attempts to test-and-set the lock. Remember that there could be many processes, as in our previous example three or four, all trying to acquire the lock, so it could happen that more than one of them notices that L has become equal to 0. But we have the safety mechanism that only one of them will observe that it successfully changed the lock from 0 to 1; the remaining processes will notice that they changed the lock from 1 to 1, and will go back and continue to busy wait on the read of L. Therefore this continues to operate correctly, and it is a little better than the previous implementation, in that the bulk of the processes are busy waiting on a read of the lock rather than on a test-and-set of the lock, and the net effect is once again correct. There is a minor problem with this implementation: as I pointed out, more than one of the processes may notice that the lock has become equal to 0, because they may execute the check of L just after the moment at which L became 0. For example, ten processes could all notice that L has become equal to 0, and all of them could try to test-and-set L. Only one of them will succeed, which is why I say the first one of them to get its test-and-set onto the bus wins, and it will cause invalidation of the other cached copies. So the problem with this implementation is that, soon after a lock gets released, there will be a lot of bus activity, as in this example where nine or ten other processes all try to test-and-set the lock, again using a modifying instruction, which causes bus activity.
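A minimal sketch of this improved acquire function, often called test-and-test-and-set, again using a C11 atomic exchange as a stand-in for the hardware test-and-set instruction (my assumption about the notation; the lecture describes it as a repeat-until loop around a read loop):

```c
/* Test-and-test-and-set acquire: spin on an ordinary read of L, and only
 * when L is observed to be 0 attempt the atomic test-and-set. */
#include <stdatomic.h>

static atomic_int L = 0;                    /* 0 = available, 1 = held      */

void acquire_lock(void)
{
    do {
        while (atomic_load(&L) == 1)        /* busy wait on a read only:    */
            ;                                /* no bus writes while spinning */
    } while (atomic_exchange(&L, 1) == 1);  /* test-and-set; retry if we    */
                                             /* lost the race to another CPU */
}
```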
Therefore, one could improve on this implementation of the lock even more by setting things up so that, once again, one busy waits on a read of the lock, but rather than allowing all the processes which observe that the lock has become available to immediately attempt a test-and-set of L, which is what we had in the previous implementation, one arranges that, between exiting from the inner while loop and checking whether the lock can be acquired, the different processes wait for different amounts of time. This causes them to attempt the test-and-set of the lock not at approximately the same time but at different points in time, thereby reducing the intensity of the impact on bus utilization (a sketch of this backoff idea is shown at the end of this passage). So, as one thinks about things a little, one realizes that the implementation of the acquire lock function on a parallel computer with a cache coherence protocol along the lines of what we have seen could be a little more complicated than the simple implementation of a lock that we talked about earlier, and that the primary objective is to ensure the correct operation of the lock while not compromising the efficient operation of the system as a whole. So lock implementations could end up being a little more complicated than what we had imagined from our discussion of concurrent programming; this is in the context of parallel machines.

Now, with this we have a rough idea of what to expect in the different kinds of parallel machines. Remember that there are primarily two kinds of parallel machines: shared memory parallel machines and message passing parallel machines. We have seen a little about shared memory programming before, in the sense that when we talked about concurrent programming we talked about issues such as the need for mutual exclusion, the need for locks, the need for shared variables, and so on. All of these issues relating to concurrent programming between processes with shared variables also relate to parallel programs, whether we are talking about parallel programs with threads or with processes that have shared variables. What remains for us to see, therefore, is more about the message passing kind of parallel programming. But before getting to that, I would like to make a few comments about parallel programming in general. You will recall that, when we talked about parallel architecture, I talked about two kinds of classification: one was the classification of parallel machines as being good for either shared memory or message passing styles of parallel programming; before that, we talked about Flynn's classification, in which there is the Single Instruction stream Multiple Data stream (SIMD) kind of parallel machine and the Multiple Instruction stream Multiple Data stream (MIMD) kind of parallel machine. They differ in that, for example, in Flynn's SIMD kind of parallel machine, at any given point in time there is one instruction being executed by all the processors, or all the ALUs, in the system, and in the example that I used, each of them would be operating on a different set of data, on different operands.
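Returning to the backoff refinement described at the beginning of this passage, here is a minimal sketch. The particular policy shown, picking a random delay and doubling the bound after each failed attempt, is one common choice of mine, not something the lecture prescribes.

```c
/* Test-and-test-and-set with backoff: after losing the race for the lock,
 * wait a random amount of time (up to a bound that doubles on each loss)
 * before spinning again, so waiters do not all hit the bus at once. */
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int L = 0;                   /* 0 = available, 1 = held      */

void acquire_lock_backoff(void)
{
    unsigned limit = 16;                   /* initial backoff bound        */
    for (;;) {
        while (atomic_load(&L) == 1)       /* spin on a read, as before    */
            ;
        if (atomic_exchange(&L, 1) == 0)
            return;                        /* changed L from 0 to 1: won   */
        /* Lost the race: delay for a random time before trying again.    */
        unsigned delay = (unsigned)rand() % limit;
        for (volatile unsigned i = 0; i < delay; i++)
            ;                              /* crude delay loop             */
        if (limit < (1u << 16))
            limit *= 2;                    /* back off further each loss   */
    }
}
```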
So, if there were a thousand ALUs, each of them could have two different input operands for the one operation that they were all performing at that point in time, as indicated by the one instruction. The alternative was the Multiple Instruction Multiple Data kind of scenario, where we could think of a single program running with one process on each of the processors, or alternatively think of the hardware as something on which we could have multiple independent programs running on the multiple processors, independent of each other, not cooperating in any fashion whatsoever. Given this, one would expect that a program written to run on an SIMD machine would probably be different from a program written to run on an MIMD machine, and it is worth talking a little about the different kinds of programs, the different programming models, that might be involved at the highest level in thinking about SIMD versus MIMD. Let me remind you again of the property of SIMD. In an SIMD machine, at any given point in time there is one instruction being executed on all the processors of the parallel machine, and therefore the program is going to have its parallelism expressed in terms of the different data which are operated on by that one instruction on the different processors. One could therefore think of SIMD as having the property that the same program runs on each of the processors, but each processor uses a different piece of data, and one might think of the program, in effect, as a Single Program Multiple Data kind of program: the same instruction executes on each of the processors, they just happen to use different data; hence the name Single Program Multiple Data. The alternative, which one might talk of for an MIMD machine, is that each of the processors is capable of executing a different program, so one could have a Multiple Program Multiple Data kind of setup for writing programs for an MIMD machine. One may come across terminology like this when it comes to parallel programming. Now, in either of these cases the writing of the programs may involve cooperating activities. Once again, I am assuming that whether I am using the SIMD machine or the MIMD machine, I have written a parallel program with a single objective; I happen to have written the parallel program as multiple processes if I have an MIMD machine, or as instructions which operate on separate pieces of data if it is an SIMD machine. In both cases it is a parallel program in which there is a need for cooperation between the different parallel activities. We have seen that there are two main mechanisms for this cooperation: shared memory, which, when we talked about concurrent programs, I described primarily in terms of threads with shared variables, though we can also think of processes with shared variables; and message passing, which we briefly described as a mode of interaction between processes in which there is explicit communication from one process to the other, using something called a message, supported by functions provided by the operating system.
Now, in any of these cases a question which you may ask is: what is the benefit of running the application as a parallel program? Remember that we talked about parallel architecture as potentially giving a performance benefit, because it should take less time for a program to execute: if there are n processors in the system, I could have n instructions executing at the same time, completing at the same time, and therefore I could have a substantial improvement in the execution time of the program. So we could ask how many times faster the program could execute on a parallel machine. We had seen the term speedup when we talked about pipelines, and we can use the term speedup in very much the same way to quantify the extent to which parallel programming might reduce the execution time of a program. Essentially, we compare the execution time of the parallel program, which is the benefit we get by running the application on a parallel machine, with the execution time of the program if it was running on a single processor. We therefore talk about speedup as the ratio of the time to execute the application as a sequential, single-process program divided by the time to execute the application as a parallel program. We expect the denominator to be lower, because the execution time should come down if I have written the program to run effectively on the parallel machine, and therefore we should get speedups greater than 1. So, in general, when we talk about speedup, we talk about the ratio of the execution time on one processor to the execution time on the parallel machine. The first question which comes to mind is: if I have a parallel machine with n processors, what is the maximum possible speedup that I should be able to get? Clearly, the speedup that I can get depends on how effectively I write the parallel version of the program, how effectively I set up the communication between the parallel activities, and so on; but here we are trying to think about the issue of speedup in a limiting sense. We are going to assume that I have a parallel machine with n processors, but let us make some finer assumptions. If I have a program which was written to run sequentially, it might be the case that I take that sequential program and modify it to get a parallel program; in other words, I identify the different activities which can be done in parallel and cause them to execute as separate threads or separate processes and so on. So one could view the activity as happening in this fashion: we went from a sequential program to a parallel program. Suppose that, in looking at the sequential program, I realized that some parts of it simply could not be parallelized; there was no scope for more than one activity happening at a time in those parts. Suppose I find that the fraction of the sequential execution time which is of that kind is s; let me just explain what I mean here.
Suppose I look at a sequential program and figure out that 10 percent of the sequential program cannot be parallelized at all; then I would say that s for that program is equal to 10 percent, or 0.1. So we are going to assume that, by analyzing the sequential version of the program and understanding the work done in each region of the program, I am able to estimate what fraction of the sequential execution time cannot be parallelized. What this means is that the remaining 1 - s, in this example 90 percent or 0.9, can be parallelized. Further, I will assume that when I divide that parallelizable activity across the n processors of the parallel system, it gets parallelized perfectly; in other words, the time which was 1 - s gets equally divided among the n processors. With these two assumptions I can calculate what the speedup could be. What is the sequential execution time? It is the sum of the fraction that could not be parallelized and the fraction that could be parallelized, which is equal to 1, because we have reduced the execution time to these two components, each of which is a fraction, and the sum of the two fractions is 1. How do I calculate the parallel execution time? Of the sequential execution time, the fraction s, in this example 0.1, could not be parallelized, and therefore in the parallel program it still takes the time fraction s. However, the remaining time 1 - s gets perfectly divided across the n processors: each of them is simultaneously doing a share of the activity which used to take the fraction 1 - s of the sequential execution time, so each of them uses (1 - s)/n of the original program execution time. Therefore, I can calculate the speedup as 1 divided by s plus (1 - s)/n, where the denominator is the best possible execution time on a parallel machine on which I could not parallelize the fraction s but perfectly parallelized the remaining 1 - s. This is the best possible speedup that I could imagine. Now, in the ideal case, if I have an infinite number of processors, how much does the speedup amount to? I can calculate that by taking the limit as n tends to infinity: the term (1 - s)/n tends to 0, and I am left with 1 divided by s. So this is the answer we were looking for: if you have a sequential program in which the fraction s of its sequential execution time cannot be parallelized, and you take the remaining 1 - s and perfectly parallelize it across an infinite number of processors, then the best speedup that you can get is 1/s, even with an infinite number of processors. And what does this tell us?
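To summarize the derivation just given, this is Amdahl's law, with s the fraction of the sequential execution time that cannot be parallelized and n the number of processors:

```latex
% Amdahl's law, as derived above: the serial fraction s is unchanged in the
% parallel program, while the remaining 1 - s is split evenly over n processors.
\[
  \mathrm{speedup}(n) \;=\; \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}}
                      \;=\; \frac{1}{\,s + \frac{1-s}{n}\,},
  \qquad
  \lim_{n \to \infty} \mathrm{speedup}(n) \;=\; \frac{1}{s}.
\]
% For the lecture's example, s = 0.1 gives a limiting speedup of 1/0.1 = 10.
```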
This tells us that, in general, the maximum speedup that can be achieved on a parallel machine is limited by the sequential fraction of the sequential program; remember that s was the sequential fraction. When I looked at the sequential execution time, I figured out that 10 percent of it could not be parallelized, and it is just that 10 percent, or 0.1, which determines the best possible speedup that I could get on a perfect parallel machine with an infinite number of processors. This is a somewhat negative result: it tells us that even if we have an infinite amount of resources, an infinitely powerful parallel machine, the speedup we can get is limited by this fraction. In effect, if you work out the maximum speedup for s equal to 0.1, you will see that it is a very disappointing speedup of 10. In other words, if I ran this particular parallel program on a machine with ten thousand processors, the best speedup I could get would be 10; it could become at most 10 times as fast as the sequential program, even on a parallel machine with ten thousand processors. So this is, to some extent, to be viewed as a negative result, but it is a realistic way of looking at things. This particular formulation of speedup is known as Amdahl's law. It may not give us a precise idea of what happens to the parallel programs that we write for parallel machines, because it may be possible to improve on the way the sequential program did the work, rather than having to do exactly the same work on the parallel machine, and we may then get better speedups than it predicts, since it works under certain assumptions. But it does tell us that trying to work with as low a fraction s as possible is to our benefit; in other words, trying to structure the application in such a way that the bulk of it can be parallelized to some extent or another, rather than being completely unparallelizable. So Amdahl's law gives us some feel for the importance of the sequential part of the activity as far as writing a parallel application is concerned.

Now, our next objective: we have seen a little about programming with shared variables. I will remind you that we learned about programming with shared variables when we talked about concurrent programs; we saw issues such as the need to treat regions of a parallel program which access and modify shared variables as critical sections, the need to ensure mutual exclusion in those critical sections, the importance of locks, etcetera. We learned about these when we talked about concurrent programs, and the principles carry through into parallel programming. We have also seen that, while the principles carry through into parallel programming and parallel architecture, the implementation of some of the primitives, such as locks, may have to change; we saw that on certain kinds of parallel machines it would not be a good idea to use the same implementation of a lock that we assumed for concurrent programming on a machine with one processor. So we know something about programming with shared memory. We now need to know something about programming with message passing, which is what we will proceed to do.
The idea that we understand about message passing is this: message passing is a facility provided by the operating system for explicit communication of data values from one process to another process. Therefore, what is needed in order to do parallel programming with message passing is, first of all, some kind of mechanism to create processes to execute on different processors, which we have not talked about before. If I have a system with a thousand processors, and I want to run a parallel program which runs as one process on each processor, it may not be convenient for me to initiate the execution of the program on each of those thousand processors myself, because typing a.out on each of a thousand processors would take a fairly large amount of time. Therefore we probably need to assume that a mechanism to create the processes on the different processors is made available; this is not something we would want to do by hand. The second thing we need is help from the operating system: mechanisms provided by the operating system to send and receive messages. So there has to be a collection of operating system provided mechanisms, one to send a message and one to receive a message. We understood that send and receive are the names of the functions provided by the operating system, at the sending end to send a message and at the receiving end to receive a message, explicitly from one process to another. It seems obvious that the identity of the communicating processes must be explicitly specified, and that in the send function it must be possible to specify which process the message is to be sent to. As a simple example, it is possible that the operating system sets things up so that we refer to the processes by their process IDs. We know that in Linux or UNIX systems each process has a process ID, which is a small integer, and each process currently on a system has a unique ID. So it is possible that the process ID of one process is 13 and the process ID of another is 15; I am referring to them as P1 and P2 here, but we are really talking about small integers. Now, in general, this may have been a fair assumption on a system with a single operating system and a single processor, but remember that we are now talking about a situation with multiple processors, and it is conceivable that we need to view this as, for example, a situation where Linux is running on each of those processors; it then no longer makes complete sense to talk about the process ID as the Linux process ID associated with the process. Why? Because I could have a situation where process 13 running on processor 1 has to communicate with process 13 running on processor 3, and therefore just using the Linux process ID may not generalize anymore; we may need something a little more complicated than that. For the moment, I will just assume that we have a mechanism by which we can identify communicating processes uniquely on any processor of the system. But we expect that the send function, or the send system call, whatever it may be, is going to look something like this.
Process P1, if it wants to send a piece of data to process P2, will use send; it must explicitly specify that it wants to send to P2, and it must explicitly mention the data. In this particular example, it has put the value into a variable called x, and the address of the variable x is what is passed to the send function or the send system call. At the receiving end, I am also assuming that process P2, which is expecting to receive a value from process P1, must explicitly execute a receive function or receive system call, in which it explicitly mentions that it wishes to receive a message from process P1. Why is it necessary for process P2 to explicitly mention that it wants to receive a message from process P1? The way to view this is that, in general, when we write parallel programs using message passing, there could be more than two processes communicating, and it could be that process P1 needs to communicate with process P2, and possibly, as another part of the activity of the parallel program, process P3 needs to communicate with process P2. Therefore, within process P2 there are going to be separate calls, one for the receive from process P1 and one for the receive from process P3, and it seems necessary, in order to distinguish one from the other, to explicitly mention in each of these calls which sender the message is expected from. Once again, the data that was sent by process P1 will be received as part of the execution of the receive function and will be put into a variable of process P2: y is a declared variable of process P2, and the address of y is what is passed as a parameter to the receive function. You will notice that this is a setup for communication between process P1 and process P2 in which there are no shared variables between them; all that is necessary is a mechanism for sending and receiving messages between processes, a mechanism for creating these processes on different processors, and, in addition, a mechanism to specify the identity of another process, P1 identifying P2 as its communication partner and P2 identifying P1 as its communication partner. Now, different operating systems provide support for message passing with different sets of functions, and therefore it is a little difficult for us to see how we could write a parallel program which is going to run on, let us say, a thousand processors, where some of the processors might be running one version of Linux and some of the other processors might be running some other version of UNIX, and so on. It might be more productive for us to think about a mechanism which is abstracted out above the operating system, and that is what we will assume: in doing message passing programming, we might not use the operating system provided functions directly, such as the send and receive I have been talking about, but rather some kind of library, which makes the appropriate operating system calls as part of its activity, but makes it unnecessary for the programmer to know about the nature of the operating system support for message passing on each of the different variants of Linux or UNIX that the parallel program may have to run on.
Therefore, we expect that this abstraction is going to be provided by some kind of library, and if one looks back in time, one finds that at different points in the history of message passing, people have created new libraries for this purpose. If you look back into the 1980s, there was a library known as PVM, or Parallel Virtual Machine, which used to be fairly popular. Since the 90s, the dominant message passing library has been something called MPI; the library was created sometime in the 90s and continues to be a widely used library for message passing programming. We will therefore proceed to talk about message passing programming in terms of MPI; I am going to use MPI as the example of how one can do message passing programming. You will find that it is widely supported on almost any parallel system that you could talk about, and if not, one can download the MPI libraries, have them installed on the parallel systems that you are dealing with, and write message passing programs using MPI. In terms of references, I am giving you two good references from which to learn a lot about MPI; both of them are available through the URLs provided, and the second is a book which can be downloaded, or referred to chapter by chapter, over the internet. First of all, let me mention that MPI stands for Message Passing Interface, not a surprising name for a library that is primarily intended for writing message passing parallel programs. In effect, like any other library or interface, it provides a standard application programmer interface for doing message passing on a system, regardless of how the operating system may actually support the send and receive functionalities. So, in effect, MPI hides the hardware and software details of the underlying parallel machine; it is no longer necessary for the programmer to worry about what functions are available on the version of Linux or UNIX that the parallel machine happens to be running. In fact, the programmer need not be aware of the hardware details or the software details at all. This makes MPI programs portable: you could write a parallel application using MPI message passing and run it on one particular parallel machine, then take the same MPI program, run it on a cluster of workstations, and expect it to run correctly on the cluster of workstations. So the portability is good because MPI hides the hardware and software details, and there is also some flexibility which results. Now, as I mentioned, MPI is implemented as a library: it is a collection of functions, a collection of include files and the implementations of the functions which constitute the application programmer interface. From the perspective of your program, the assumption is that you have written a parallel program which is going to run as a collection of processes, and the processes are going to communicate with each other using message passing in order to achieve their common objective; this is going to be done using the MPI message passing interface functions, not using the operating system provided functions directly. So, as far as your program is concerned, your program is the topmost box over here; your program is written to use the MPI library.
Now, depending on what system you are running your program on, the MPI library may make use of some standard networking functionality, such as TCP/IP, or it could be making assumptions about some standard network hardware. That is one possibility for the setup on the parallel machine that you are dealing with. The other possibility is that there could be some custom software, some hand-crafted networking functionality, send and receive functionality, created for the system that you are dealing with; that would be called custom software, possibly because the hardware was also specially designed for that particular system. But as far as you as a programmer are concerned, you just have to know how to deal with MPI, and the implementation of MPI would be suitably selected by the person who constructed the system, so that the correct version of the MPI library is used; therefore your program can be written in this portable, flexible framework. As far as we are concerned, then, MPI can be viewed as a collection of functions, a library of functions, and let me just show you some of the key MPI functions and constants before we go ahead and talk about MPI itself in more detail. I am doing this just to demystify MPI. You will recall that when we talked about system calls, I told you that a system call is a very special part of the operating system and must be treated with great respect; one cannot expect to casually execute or modify system calls. But I got into the topic by telling you the names of some of the system calls, and you realized that you had actually used some of those functionalities before in your own programs. For the same reason, I am just going to mention some of the MPI functions, and then we will see that it is quite easy to write simple MPI programs. The first MPI function that any MPI programmer has to know about is MPI_Init, and as the name suggests, MPI_Init is a function which must be called once at the beginning of the MPI program to initialize the various activities of the MPI library. By the same token, you would expect that at the end of the MPI program there is a need to wrap up the various MPI activities, and therefore there is a function called MPI_Finalize, which must be one of the last things called within the MPI program. We expect that somewhere in between there is going to be the activity of causing processes to run on the different processors of the parallel machine and the communication between those processes; but one would expect to see MPI_Init at the beginning of the MPI program and MPI_Finalize towards the end. Now, one of the concepts about MPI that we are going to learn about later is the concept of rank, which abstracts out the notion of a process ID: rather than talking about the ID of a process in terms of the Linux process ID associated with that particular process, we abstract things out into something called an MPI rank. The number of processes that the program is running as is known as the size, and there are functions to determine these; a process can find out its own rank by calling the MPI function MPI_Comm_rank. I will say a little more about this later.
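As a small illustration of the functions named so far, here is a minimal sketch of the skeleton that essentially every MPI program has; the printed message is just an example of mine.

```c
/* Minimal MPI program skeleton: MPI_Init is called first, MPI_Finalize last,
 * and each process can discover its own rank with MPI_Comm_rank. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);                 /* initialize the MPI library */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?        */

    printf("Hello from MPI process with rank %d\n", rank);

    MPI_Finalize();                         /* wrap up all MPI activity   */
    return 0;
}
```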
Similarly, by calling the function MPI_Comm_size, any process of the MPI program can determine how many processes there are in this particular MPI program, in other words, how many processes this particular MPI program is running as. We expect that there are going to be functions in MPI for sending and receiving messages, and they go by the names MPI_Send and MPI_Recv; there are actually families of such functions, and I am just showing you representative examples. You will notice that each of the MPI functions identifies itself by starting with MPI and an underscore, and you will notice that there are various parameters to send and various parameters to receive, which we will have to learn more about. There are several other MPI functions, but I also mentioned that there are certain MPI constants. The MPI constants are declared constants which are available in the MPI library, and they too all start with MPI and an underscore. You will notice that some of them seem to be used to specify types; for example, MPI_CHAR is a constant which is used to indicate that a particular variable is of type character, and similarly there are MPI_INT, MPI_LONG and MPI_BYTE. And then there are two mysterious other constants called MPI_ANY_SOURCE and MPI_ANY_TAG, which we will need to understand more about. So, basically, the MPI interface is a collection of functions and a collection of constants, and once one has understood the functions and the constants, one can readily write MPI programs, which means that one can write parallel programs which use message passing for the communication and interaction between the processes, allowing them to achieve their common objective. In the lectures to follow, we will look at the MPI functions in more detail, we will understand how they can be used, and then we will look at some examples of writing applications as parallel programs using message passing as the mechanism for interaction. We will stop here for today and look at MPI in more detail in the lectures to follow. Thank you.
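As a preview of how these functions and constants fit together, here is a minimal sketch in which the process with rank 0 sends one integer to the process with rank 1; the value 42 and the message tag 0 are arbitrary choices of mine, and such a program would typically be launched with something like mpirun -np 2 ./a.out.

```c
/* Minimal point-to-point example: rank 0 sends one int to rank 1, using
 * MPI_Init/MPI_Finalize, MPI_Comm_rank/MPI_Comm_size, MPI_Send/MPI_Recv
 * and the type constant MPI_INT discussed above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank among the processes */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes   */

    if (rank == 0 && size > 1) {
        int x = 42;                         /* the value to communicate    */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1     */
    } else if (rank == 1) {
        int y;
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);        /* from rank 0, same tag 0     */
        printf("rank 1 received %d from rank 0\n", y);
    }

    MPI_Finalize();
    return 0;
}
```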