Papers We Love April 2016: João Taveira on Congestion Control

Captions
So let's get started. Normally I read the bio of the speaker and make something up, so I'm going to introduce Joe, who's over there; he actually spoke in the first months of our first year, so yay. Joe studied with John Rhodes, receiving his PhD in 1994 for generalising Krohn-Rhodes theory to finite algebras. Now he's applying semigroup theory to replication and convergence in distributed systems. Joe loves performing in musicals and has been an extra in Cats on numerous occasions. Let's give it up for Joe.

You all know how to add numbers with carry, at least if you've ever balanced your own checkbook, or maybe you don't do that, maybe you use an app for that or something. It's very easy: you add in the ones column, you get a number out of the ones column, you carry something over to the tens column, and so on for each of the columns. It's a very simple sequential process, very easy to automate in your head. The important thing to notice about this is that data only flows from right to left; it's only going in one direction. You do everything in the low-order digit independently of everything else, then you move to the mid-order digit, which depends on the low-order digit but nothing else, and then you proceed up to the next higher digits and so on. This is called a sequential coordinate system, in contrast to a parallel coordinate system, where there is no data flow among the coordinates and you can compute in each coordinate independently of the others. These are really familiar things: you've all used Cartesian coordinates for lots of things. In a vector space you have vector addition, and how do you add two vectors? You take the first coordinate of the first vector and the first coordinate of the second vector and add them, then add the second coordinates of the two vectors, and so on. It's much simpler; there's no data flow between the coordinates.

So we're going to look more at this sequential idea. Let's take our add-with-carry example, and instead of looking at it as a binary operation, where you take two vectors or two tuples and combine them somehow, we are instead going to think of it as a function that acts on tuples. This is important because you might want to make a network message out of this, one that packages up one of the operands but not the other: the x and y are unknown, but the two and the five are known in this example. You send a message over the network saying "add this pair, with carry, to the current state, whatever it is." So we need a way of representing what that function is doing; how do we express that if we're not looking at a binary operation? A natural way to do it is in terms of a table. This is kind of a brute-force approach, maybe not the simplest, but it gives us a lot of possibilities. We have a pair, and in one coordinate of the pair you see "add five to x", a very simple operation, just like we would do in one coordinate by itself. But in the other coordinate we've got this table, and the way the table works is that when we add five to x, we remember what x was, then we go look x up in the table: if x is 0, 1, 2, 3, or 4, the right-hand side of the table tells us which operation to use on the y coordinate, and if x is 5 through 9, we use something else, y plus 3. So this is just another way of representing what we did with the carry bit, except instead of adding an extra bit we're representing it as a table-driven process; we're encoding all that information into a table.
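To make the table-driven picture above concrete, here is a small Python sketch (my own, not from the talk's slides): an operation on a two-digit state is a function on the ones digit together with a table, keyed by the old ones digit, that picks the function to apply to the tens digit; composing two such operations yields another operation of exactly the same shape.

```python
# Sketch of the "sequential coordinate" / table-driven representation of
# add-with-carry. State is a pair (ones, tens), data flows only right-to-left.

def add_mod10(k):
    return lambda d: (d + k) % 10

class SequentialOp:
    """low_op acts on the ones digit; table, keyed by the OLD ones digit,
    chooses the operation applied to the tens digit."""
    def __init__(self, low_op, table):
        self.low_op = low_op          # function on Z10
        self.table = table            # dict: old ones digit -> function on Z10

    def __call__(self, state):
        low, high = state
        return (self.low_op(low), self.table[low](high))

    def then(self, other):
        """Compose: apply self, then other. The composite has the same form."""
        def combined(old_low):
            mid_low = self.low_op(old_low)
            f, g = self.table[old_low], other.table[mid_low]
            return lambda h: g(f(h))
        return SequentialOp(lambda d: other.low_op(self.low_op(d)),
                            {d: combined(d) for d in range(10)})

def add_with_carry(n):
    """Addition of n (0..99) on numbers stored as (ones, tens)."""
    lo, hi = n % 10, n // 10
    return SequentialOp(add_mod10(lo),
                        {d: add_mod10(hi if d + lo < 10 else hi + 1)
                         for d in range(10)})

# 38 + 25 = 63, with 38 stored as (ones=8, tens=3):
assert add_with_carry(25)((8, 3)) == (3, 6)
# Composition stays in the same form and still agrees with mod-100 addition:
assert add_with_carry(25).then(add_with_carry(49))((8, 3)) == (2, 1)   # 112 mod 100
```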
And you can check that this is doing addition correctly: we've taken two base-10 digits and we're adding base 100 correctly. That's why 0 through 4 behave one way and 5 through 9 behave another way. But the table could be filled with all kinds of other operations: we could fill it randomly with other additions modulo 10, y plus 3, y plus whatever we want. We won't build up addition modulo 100 that way, we'll build something else that's crazy, but just think of all those possibilities for a minute. Those functions still compose into other functions of the same form, so they're closed under composition. If you write down an example with two of these, it's very easy to see what the table looks like for the composite of the two functions expressed in this way. Who knows what this one does? We started with two base-10 addition operations, but now we're doing something weird and crazy; we're not doing base 100.

So we've constructed this semigroup, and I'll tell you more precisely in a minute what that is, but we constructed these functions using these tables and the sequential flow from right to left. This is called the wreath product, and in this case it's the wreath product of two groups of integers modulo ten. This is a huge object with lots of different functions in it; you can imagine filling in that table in arbitrary ways, and you get thousands of possibilities. But what we've shown by looking at the add-with-carry example is that, among all of those possible operations, you can pick out just what you need to do addition modulo 100. It's not exactly the right terminology, but I'm going to say, for the purposes of this talk, that we've embedded Z100, the group of integers modulo 100, in the wreath product of Z10 with itself.

So just to formalize: I'm sure you must have heard what a semigroup or a monoid is; they're pretty similar concepts. A semigroup is just a bunch of things that's closed under a composition operation, and the composition operation has to be associative: it doesn't matter where you put your parentheses, it only matters what sequence things are in. Not necessarily commutative, so you can't change the order of things, but you can group them arbitrarily. Associativity is easy to get if you start out with functions: if you have a bunch of functions and you close them under composition, that's always associative, because function composition is an associative operation. So let's look at some examples. We've seen some abelian groups, these groups of integers modulo 10, modulo n, whatever. Abelian groups in general are the ones that are commutative, and of course they have inverses, x minus x equals 0; those are the two things that make a group abelian. We're writing a plus and a minus there instead of multiplication just because that's the historical standard. Another important class are the semilattices. You can think of a semilattice as a piece of a boolean algebra: you've got a true and a false value, and you have the AND operation on them as a binary operation, or you have the OR operation; they're both semilattices, and they're kind of the same thing. They're commutative, and they also have another property: not inverses, but they're idempotent. You take any two of the same thing, operate on them, and you get the same thing back: true and true is true. So those are two of the very basic pieces we're talking about.
If you take arbitrary functions on a set, you get all semigroups, basically: any semigroup can be expressed as a set of functions on a set. If you take all permutations of a set, that's where you get groups; groups are semigroups that have inverses.

Okay, so with these basic pieces we're almost there; we need one more basic piece before we can state the theorem. This is called the flip-flop. You can think of it as a memory location, a kind of minimalistic memory location with one bit of storage: you can write a zero to it, you can write a one to it, or you can do a no-operation. So there are three operations you can do on it, and its state transition diagram is extremely simple. If you play around with it for a little while, you can see that whenever you multiply operations that are not the no-op, the result is always the same as just taking the second one: x times y equals y. It sort of forgets everything you've done before and just keeps the last one. And they're idempotent too, for the same reason.

Okay, now we have the theorem. Oh wait, I want to go back and say one more thing about groups. Groups are kind of the special snowflakes of the semigroup world: if you look at semigroups that have eight elements, there are only five groups but billions of semigroups, so groups are extremely unusual, in a sort of random-noise sense. This is part of why the theorem is so surprising. It says that if you start out with this one kind of quirky flip-flop, a minimal register, and you throw in all of these special-snowflake permutation groups, and you combine them with the table-driven construction that we saw before, you can construct any finite semigroup. It's really kind of an amazing property; nobody expected this to be true. And furthermore, you don't have to take just whatever groups you need to construct it: you can actually choose the groups to be things that occur naturally inside your starting semigroup. That means one way of reading this theorem is that semigroups decompose into their component pieces.

So Krohn and Rhodes submitted their dissertations to Harvard and MIT respectively; they submitted identical dissertations, and of course there was a big fuss about that: they weren't going to get their degrees, they were going to get kicked out or whatever. So they talked to the deans and argued, "this is worth four theses, we only want two", and the argument worked; they got their degrees. So what did they go on to do after that? The biggest problem they worked on is the question of group complexity. This is not really related to space complexity or time complexity; it goes off into a different dimension and talks about algebraic complexity instead. We have these alternating decompositions: how many groups do you actually need in them? Do you need arbitrarily many? How can you minimize them? That's known as the group complexity: the smallest number of groups that you actually need. Somebody worked out that if you take the 19-by-19 game of Go, which means you take the state of the board as your state and moves as your operations (I think you restrict it to moves that make sense, moves that a sane player would make), that's actually complexity at most 200, and I don't know whether anyone's come up with a lower bound for that.
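For reference, here is a tiny Python sketch (mine, not the speaker's) of the flip-flop described a moment ago: three operations on a one-bit register, where composing any two writes just keeps the second one, and every operation is idempotent.

```python
# The flip-flop: a one-bit register with write-0, write-1, and no-op.

WRITE0 = lambda bit: 0
WRITE1 = lambda bit: 1
NOOP   = lambda bit: bit

def compose(f, g):            # apply f first, then g
    return lambda bit: g(f(bit))

# For the writes, "x times y equals y": the second write wins...
assert compose(WRITE0, WRITE1)(0) == WRITE1(0) == 1
assert compose(WRITE1, WRITE0)(1) == WRITE0(1) == 0
# ...and each operation is idempotent: doing it twice is the same as once.
assert compose(WRITE1, WRITE1)(0) == WRITE1(0) == 1
```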
However, if you want something a little simpler: in biochemistry there's a sequence of reactions called the Krebs cycle, involving ATP and ADP and sugars and so on, the basic energy reaction in the cell, and the complexity of that reaction is only two, which is actually still pretty complex if you write it down. The other questions here: is complexity bounded? No, you can find semigroups that are as complex as you want them to be. Is it decidable? We don't really know for sure; it's been worked on for about 50 years, and everybody thinks it is, but the proof is still not quite there.

Okay, so unless there are any questions about the algebra, let's go back to distributed computing, which is why we're all here, I hope. In distributed computing there's this idea called a commutative, or convergent, replicated data type, and one example is the two-phase set. It's a way of handling set insertions and deletions in a distributed environment, and I'll tell you in a second how you make that work. You may have run into this in Cassandra, for example: they have something called tombstones, where you can mark something as being deleted. The idea is that you want a register that can be empty, holding the empty set, or you can add something to it, a little smiley face, and it's going to need one other state: it can be in an error state. So we have two operations: we can add smiley to the set, and we can remove smiley from the set, and the state transition diagram is really quite simple. If you add and then remove, you get into that state on the right, from which you can't get out; you're basically stuck there. If you remove before adding, you get into the error state, and again you're stuck there. One thing to note is that the state on the far left and the state on the far right both look like they have the empty set in them, but they're actually different states, because the machine remembers whether you've done anything yet: there's the initial state and then there's the final state. The reason they have the same value in them is that that's the queryable value, the thing an observer would be able to see, but they're still different states.

Okay. So you can study the composition of these operations: if you take two A's and compose them, that's the same as an A, and so on, and it turns out that all you can get by composing these things is AR and RA; there's nothing else. The multiplication table for those is also very simple; you can verify those observations, and that's about all you can do. So why do you want to do this? You have an asynchronous network, you have multiple people in the network sending messages, they arrive in different relative orders, they might get interleaved differently, and you want the result to be the same however they get interleaved. In this case you look at all the possible interleavings of Bob's message into Amy's sequence of messages, and those actually turn out to be the same: they're all AR, via the equations we had on the previous slide. So the server winds up in the same state regardless of how the messages are interleaved, and that of course is something you want for convergent replication, because the interleaving might be different on different servers.
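Here is a minimal Python sketch of the two-phase-set register and the interleaving argument. The particular example (Amy issues add-then-remove, Bob a concurrent add) is my reconstruction of the slide, not a quote from it.

```python
# Two-phase-set register for a single element, and a check that every
# interleaving of one replica's message into another's sequence composes
# to the same operation.

INIT, PRESENT, REMOVED, ERROR = "init", "{smiley}", "removed", "error"
STATES = [INIT, PRESENT, REMOVED, ERROR]

def A(s):   # add smiley: only moves the blank initial state forward
    return {INIT: PRESENT, PRESENT: PRESENT, REMOVED: REMOVED, ERROR: ERROR}[s]

def R(s):   # remove smiley: removing before ever adding is an error
    return {INIT: ERROR, PRESENT: REMOVED, REMOVED: REMOVED, ERROR: ERROR}[s]

def composite(ops):
    """Where each state ends up after applying ops left to right."""
    def run(s):
        for op in ops:
            s = op(s)
        return s
    return tuple(run(s) for s in STATES)

def interleavings(xs, ys):
    if not xs:
        yield list(ys)
        return
    if not ys:
        yield list(xs)
        return
    for tail in interleavings(xs[1:], ys):
        yield [xs[0]] + tail
    for tail in interleavings(xs, ys[1:]):
        yield [ys[0]] + tail

amy, bob = [A, R], [A]           # assumed example: add-then-remove vs. a concurrent add
print({composite(seq) for seq in interleavings(amy, bob)})   # a single composite
```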
Okay, so back to the algebra, and this is where things are really kind of interesting. We can decompose this little algebraic structure, and what it turns into is kind of revealing. We're going to think of the state that we're operating on as a triple of booleans: a flag that says "has there been an error", a flag that says "have we done a remove operation", and a flag that says "have we done an add operation". The way the two operations work on this is: add is very simple, it just flips up the add bit, the x component of that vector. The remove operation is more complicated: in two coordinates it's the same kind of thing, it flips up the remove bit, but in the z coordinate, the error coordinate, it has to do something special, and this little table here is the same thing we were looking at a minute ago when we wrote modulo-100 arithmetic in terms of modulo-10 arithmetic.

To see how it works, let's look at an example. There's our R again. Suppose we start out with the initial state 0 0 0, a blank slate. How does R apply to that? Well, it doesn't do anything to the x, and it flips the y up to a 1, because that signifies that a remove has been done. But what does it do in the error coordinate? To answer that we have to do a table lookup: we look up the x and y values, 0 0, in our little table, and the entry there is an operation, "z OR 1", which flips up the bit in that column. So we end up with a state that has a 1 in the z column, which means we're in an error state.

So what we've really done is decompose this two-phase-set semigroup into three semilattices. This is a special case of the Krohn-Rhodes decomposition, just a very simple one that doesn't have any groups in it.
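And a sketch of that decomposition in the same style (again my reconstruction, not the slide): the state is a triple of booleans, add touches only its own coordinate, and remove flips its own bit while consulting a little table to decide what to do to the error coordinate.

```python
# State is (added, removed, error), i.e. the talk's (x, y, z) coordinates.

def add(state):
    a, r, e = state
    return (True, r, e)                         # just raise the "added" bit

def remove(state):
    a, r, e = state
    # Table lookup on the lower coordinates: if nothing was ever added,
    # the error coordinate gets "z OR 1"; otherwise it is left alone.
    error_op = (lambda z: z or True) if not a else (lambda z: z)
    return (a, True, error_op(e))

s = (False, False, False)                       # blank slate
print(remove(s))                                # (False, True, True): error bit raised
print(remove(add(s)))                           # (True, True, False): clean remove after add
```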
Now why do we care about this? There's another thing you can prove in semigroup theory. I say theorem here, but it's actually an exercise in basic classes that's really easy to prove: you can decompose a semigroup S into semilattices if and only if there's a partial ordering on the elements of S for which multiplication is monotonic. What does monotonic mean? It just means that the product of two things is below the first thing, so as you multiply more things, you either go down or stay at the same level in that ordering. Why do we care about monotonicity? Monotonicity is necessary for confluence. I don't want to get into the definition of confluence here, but roughly the idea is that it's stronger than convergence: you also avoid oscillation, so as you're watching the system converge you can take a snapshot of it at any point and rely on whatever you saw there, because things are only going to get more so, or less so, whichever direction you're looking. So monotonicity is necessary for that; whether it's sufficient is a different question, and kind of a complicated one, which I think is not fully resolved as far as I can tell.

What this means is that if you have a confluent system, if you've built a computer that not only converges but doesn't oscillate while it's converging, with the state always monotonically progressing towards the converged state, then that computer could actually have been built in terms of semilattices using the wreath product. Things I like to think about in this picture of the world: we've got these algebraic constructions and we've got distributed-systems properties; how do they match up? What about other algebraic constructs, what if we start with other pieces besides semilattices? I think there are a lot of interesting things to discover there. Another thing is that we got up to a level where we're just talking about composition of operations, not how they act on state. We can talk about the wreath product of semilattices without talking about what those semilattices are actually transforming, so we're talking about composition of operations rather than operations as mappings on state, and it's interesting to me that, moving up to that functional level, we're still able to say quite a lot about the distributed-systems behaviour of these structures. And that's it. Any questions? That was not supposed to happen. Well, come find me later, I'll be around if you've got more questions. Okay, thank you, Joe.

All right, so let's introduce our next speaker. João is a network engineer at Fastly, where he's responsible for making dumb switches do clever things. In addition to writing software for network orchestration, João works on protocol design and performance, and holds a PhD from University College London, or something to that effect. In his younger years João was a teen star in the Portuguese version of Dawson's Creek; I will leave it to you to guess whether it was Dawson, Joey, or Pacey.

So, Ines has been asking me for a very, very long time to give a talk here, and for a very, very long time I resisted, because from the previous events this clearly has a distributed-systems kind of vibe, and I'm none of that, I'm a purely networking person. I finally succumbed because Ines is now sitting within stabbing distance of me. In doing this talk I'm going to make a massive assumption, which is that most of you here are not networking people. If you happen to be a networking expert, you're going to be disappointed. But I am going to assume that, whatever it is you do, you end up building services or systems on top of the network, and so most of you have some concept of how the network actually works, and you need to, because you rely on it intrinsically. This talk is going to be about how you share a network. If you know anything, you're probably thinking that you know this one, that it's TCP, and you'd be wrong, because I literally didn't say that: I said this talk would be about how you share a network, and TCP is an answer, maybe, probably not. This talk is actually about the question rather than a potential answer. Extrapolating from my assumption, I'm going to guess that, for whatever reason, when you tried to look into networking you were given an answer, and because you were given an answer you can't really understand it, and you can't fully understand it because you never actually went through the mental exercise of working through the question.

So this talk is going to be about four papers spanning almost 50 years, and they're all different interpretations of the exact same question; they're all variations on a theme. The first three are foundational papers in computer networking, which just means no one actually reads them anymore. The high-level objective is to give you, as application developers or systems developers, foundational knowledge on how we got here, with what assumptions, and at what cost. The nice thing about doing a sequence of papers like this, and the fact that this is actually one of the oldest open problems in distributed systems, is that you get a very nice narrative arc, which you see repeated in other systems research and in Hollywood movies. So I'll start at the very beginning, and the very beginning, in my opinion, is a paper written by Paul Baran in 1962 called On Distributed Communications Networks.
The opening phrase is pretty simple: it just asks you, as a reader, to consider the synthesis of a network which will allow several hundred communication stations to talk to each other after an enemy attack. The very beginning of the paper posits that essentially you only have three types of topologies: a centralized topology, which has an intrinsic single point of failure; a decentralized topology, which has much nicer properties and was very similar to the telephone network we had at the time; and something inherently new that Baran is proposing, which is a distributed network. One of the first conclusions is that the decentralized network actually works pretty well if you're assuming randomly distributed failures, but if your threat model assumes an actual enemy attack, then you have to do worst-case design. So the first half of the paper is effectively running simulations of the probability of a network partition if you attack it. This is great: he estimates the probability of kill for an attack of 288 weapons, then he goes on to test the sensitivity of the model to node destruction, then link destruction, then a combination of both link and node destruction. This is literally the most pessimistic paper I have ever read, but of course, it's only paranoia if they're not after you, and this paper was published the month before the Cuban Missile Crisis, so the notion of threat was very, very real, especially if you consider that actually wiping out a telecommunications network is not that hard when it's just a decentralized model. And of course, we don't actually live in that different an era; we just assess risk intrinsically differently.

The really curious thing is that, from Baran's perspective, most of you don't actually work on distributed systems, because Baran anchors the definition on the cost of destruction, and a data center in that regard is just a node: if you're up against a ballistic missile, it doesn't matter how many containers you spin up. So the first half of the paper asserts that a highly survivable system can be built even in the thermonuclear era. This is how you get funding in the 60s, by the way; that phrase is just gold. This would have been a nice paper on its own, kind of, but then he does this really wild extrapolation: he asserts that if you want a reliable system, then you have to do a distributed network, in which case you're going to have a large number of elements, and now all of a sudden it becomes about how cheap you can make this be. So he states that you'll need variable-data-rate links, because in a circuit-switched network you have fixed capacity, and expanding that network becomes incredibly expensive, because everything is in lockstep. But you won't only need that, you'll also need variable-data-rate users: you don't actually know what data is going to come into the network, and that sucks. But if you do these two things, well, you only need to do it once, and you could have a public utility for data transfer. The one thing you need, if you want a single network, is a magical packet format, a standard understood by everything that connects to this network, and he calls this the standardized message block.
And at this point you can throw everything you know about resource allocation in networks out the window, because nothing we knew actually applies anymore; all of that just completely disappears. So how would Baran actually share a network? His solution is priority marking: you have this standardized message block and you give it a mark depending on its priority. This makes total sense, because if it's supposed to be a military network, and you work for a defense contractor, then the thing you come up with is different levels of priority that map to separate military functions, and in case of an overload you just dial down the data rates for the lowest-priority traffic. And it's kind of silly that we still have this, because this is where the IP type-of-service field comes from. So this is one way of sharing a network, but it's silly.

This paper is incredibly hard to read if you don't take in its historical context, but it's actually incredibly important, because it really does a wild extrapolation from survivability to a massive, global, universal data network, while being firmly rooted in an old-world mentality. At the end of it, it has a very clear vision of what the internet is, because it's presented as this unavoidable thing due to market economics, but everything in between is left completely open. He has really naive proposals about how to do routing and things like that, and none of it actually works. Well, that's fine, because this is the first phase of any research area: everything is "wow, this is an exercise for the reader, I'm sure it'll be fine".

And this gets us to the next phase, which is the ARPANET, and you can trace the money that funded it straight back to Baran's feasibility assessment from 1964. Now, this is the first diagram of the ARPANET, and what no one ever notices is that it's historical because it's the only time anyone has drawn a network with a single host and a single router. It is the most pointless network of all time, literally, but it's perfectly representative of the next stage of research, which is scientific positivism: everything is good. If you work on containers, this is where you are.
Sorry, this photo is actually Dr. Jonas Salk, the guy who cured polio, which is an amazing feat over nature, but in hindsight it was actually the first victory of post-war medicine, so it turns out it was just low-hanging fruit. If you look at the photo, though, you'd think that smile could cure cancer or something. In computer networking, the maximum expression of scientific positivism is the 1974 paper by Vint Cerf and Robert Kahn. It presents TCP, and it makes them the fathers of the Internet in American literature. The first phrase is pretty simple: it presents a protocol that supports the sharing of resources that exist in different packet-switching networks. So, twelve years after Baran asks the question, this is the standardized message block. And this paper does a tremendous amount of important things: packet fragmentation, transmission failure, sequencing, flow control, error checking, connection setup, all the things you take for granted. I'm not going to talk about the things it solves; I'm going to talk about the other stuff, because if you know TCP and you go back and read this paper, some of these things are pretty off; it's very unfamiliar.

The first thing you'll notice, if you skim the figures, is that it has this thing called an internetwork header, and then it has this thing called a process header, but the sequence number is in the internetwork header, and so is the SYN flag. That's the first sign that there's something weird about this paper. The second red flag is that Vint Cerf is smoking a pipe. So if there's an internetwork header and there's a process header, then what the hell is TCP? Hopefully the paper explains that: within a host, we assume the existence of a transmission control program which handles transmission for other processes. It's a program, not a protocol. TCP is a user-space networking stack, around 40 years before you thought it was cool.

Then you get to the retransmission part, and this is my favourite part of the paper by far. It starts promisingly enough, stating that no transmission can be a hundred percent reliable, and we can all agree on that. Then, literally two sentences later, it states that the retransmission mechanism will not be called upon very often in practice; evidence already exists that individual networks can be effectively constructed without this feature. Of course, these sentences are not at odds with each other at all. So you have to understand that when TCP was originally designed, retransmissions were pathological; they were not the expected behaviour at all. And if you have a retransmission mechanism, you need a windowing mechanism, just to know how much data has actually been acknowledged or not, and that's fine, but how do you set the window size? Scroll down and you figure out that a suggested window, which the receiver can use to control the flow of data from the sender, is the main component of process flow control. So effectively there are only two things that control the window: either the sender-side buffer or the receiver-side buffer. When Vint Cerf explains that this is about sharing resources, the resources in his mind are the hosts; the network is never the bottleneck.
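As a rough illustration of that 1974 world-view, here is a sketch of a sender whose window is bounded only by the receiver's advertised buffer; the function and field names are mine, not the paper's.

```python
# Flow control only: the network never constrains the sending rate.

def sendable(next_seq, last_acked, receiver_window, buffered):
    """Bytes the sender may put on the wire right now, limited solely by
    the receiver's advertised window and the data it has buffered to send."""
    in_flight = next_seq - last_acked
    return min(max(receiver_window - in_flight, 0), buffered)

# With a large receiver buffer the sender will happily burst far more than
# any intermediate link can carry, which is exactly what fell over in 1986.
print(sendable(next_seq=200_000, last_acked=100_000,
               receiver_window=500_000, buffered=1_000_000))   # 400000 bytes
```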
The other interesting thing to understand about this paper is that there's no UDP. UDP actually came after TCP; they only truly broke TCP and IP apart when they figured out they needed something else. That's interesting because, if you think about QUIC (everyone's heard about QUIC), historically you can frame it as an attempt to reassert what the standard message block of the internet is, because if we had designed UDP before TCP, we would have layered TCP on top of UDP, since the reuse is so high. But anyway, how do you share a network according to Vint Cerf and Robert Kahn? Well, you just do flow control. And of course, he's a systems engineer: everything came from the perspective of the processes, of the actual application, so he had a completely top-down view of it. Historically this actually makes sense, because if TCP starts as a user-space stack, it was overlaid on top of something else, and the reason he assumed the network was never a problem was that he was building TCP on top of networking technologies like the ARPANET, which was very similar to X.25: it provides hop-by-hop reliability, it provides flow control, and it provides a ton of other expensive crap no one needs, and it was super slow. Well, that's fine; what could possibly go wrong? There's another amazing thing about this photo, by the way: there's another person in it, and he looks like he missed a paper deadline because he screwed up the time zones. That is phase three of every scientific research area; if you work in consistency, the CAP theorem kind of disappointed lots of people in the same way. That's fine.

In reality, that phase took a very different form in networking. If you scroll right to the bottom of the 1974 paper, the first person Vint Cerf and Robert Kahn acknowledge is R. Metcalfe, also known as Bob. Bob was a PhD student in the early 70s, one of the very early contributors to the ARPANET, and Bob and some of his colleagues had an amazing idea, and it's absolutely the kind of idea that has been making entrepreneurs cash-flow positive since the dawn of time: he thought, what if, instead of all this, I did nothing? And so Ethernet was born. I don't have time to explain how Ethernet works, but there's a small chance that it doesn't. Okay. So two things happen in the 80s. In 1979 Bob Metcalfe founds 3Com; by 1981 they ship their first Ethernet adapter; and by the end of the 80s, with the explosion of personal computing, Ethernet wins the link-layer battle, and as a result Bob Metcalfe becomes a multimillionaire, a venture capitalist, and, according to Google image search, a devil worshipper. The other thing that happens is that all these high-speed local area networks start connecting to the internet, which at this point is the digital equivalent of a collection of plastic straws. And so, by 1986, the internet stops working.

This is where the third paper comes in, and this is kind of the seminal paper of computer networking; everyone speaks of it with immense reverence, and for a reason, so I'm not actually going to make fun of this one. Van Jacobson just starts (it doesn't even have an abstract) and explains that in October '86 the internet had the first of what became a series of congestion collapses. Fascinated by the sudden factor-of-a-thousand drop in bandwidth, Van Jacobson and Mike Karels embarked on an investigation of why things had gotten so bad. Van Jacobson was not, at the time, considered a computer scientist; he was a physicist. To understand what was happening, take TCP as it was originally envisioned and put it on an Ethernet network, for example.
This is what a flow would look like: the y-axis is the packet sequence number and the x-axis is time. If a flow were behaving normally, you'd expect it to always be making forward progress, but that's not what happens, because, well, how do you share a network? You do flow control: you send at whatever rate the receiver is telling you to. So you send a burst of packets, the network can't actually handle it, and it doesn't provide reliability or flow control for you, so it drops half of it, and you time out and send essentially the same burst over and over again. It's an aggravated behaviour: you're just multiplying packets without actually getting any of them through.

Under this observation, Van Jacobson stated that for a connection to be stable it should obey a conservation-of-packets principle, and when he says conservation of packets, it's a concept that comes from physics: for a connection to be in equilibrium, the packet flow is what a physicist would call conservative; a new packet isn't put into the network until an old packet leaves. It's a really intuitive notion, but it was also completely alien to the networking world at that point. The amazing thing about this paper is that the entire paper fits on one page; the remaining eight pages are just for regular humans like me and you to understand. The whole thing is derived entirely from this statement. He observes that you can decompose it into three stages. One: you need to get to equilibrium first, so you do slow start, where you just exponentially ramp up until a certain threshold. Slow start is not actually slow, but it is way slower than what we used to do, which was just send at the advertised window. Once you get to that threshold, you move on to another phase, congestion avoidance, where effectively you are just probing and adapting to changes in equilibrium: you have a much more conservative ramp-up over time, in the hope that more free bandwidth is available, but if you hit a loss you back off, and if you time out entirely you just go straight back to slow start. The difference is as follows: the flow at the beginning is actually getting more and more packets per window, but after a while it just ramps up linearly, very slowly over time, and it makes forward progress at all times. Compare this to the earlier trace (the y-axis is the same scale), which is a pictorial representation of a drunk driver. These are worlds apart.

So the main contributions of the paper are that it gives slow start as an approach to equilibrium, and it fixes RTT estimation as well, which is quite a hefty part of the paper, not the most interesting to read, but it's actually a pretty nifty set of tricks he pulls off, all out of control theory. The RTT estimator is important because, if you have this notion of a system where you can only inject one packet when another leaves, then all of a sudden you need to know when the other packet leaves; you need a very accurate estimate of that. And not only that: the previous RTT estimator we were using didn't take variance into account, so under congestion the RTT would vary wildly and the sender was completely unable to keep track of it.
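For concreteness, here is the shape of the mean-plus-deviation RTT estimator the paper introduces, written in the usual RFC 6298 style; the constants below are the standard ones rather than a quote from the paper's appendix.

```python
# Smoothed RTT plus smoothed mean deviation; the retransmit timeout tracks both.

class RTTEstimator:
    def __init__(self):
        self.srtt = None        # smoothed round-trip time
        self.rttvar = None      # smoothed mean deviation

    def update(self, sample):
        if self.srtt is None:
            self.srtt, self.rttvar = sample, sample / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample)
            self.srtt = 0.875 * self.srtt + 0.125 * sample
        return self.rto()

    def rto(self):
        # Older TCPs used roughly 2 * srtt, which collapses when the RTT
        # starts varying wildly under congestion; the deviation term fixes that.
        return self.srtt + 4 * self.rttvar

est = RTTEstimator()
for sample in (100, 110, 300, 320):      # RTTs ballooning under congestion
    print(round(est.update(sample)))
```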
And then it introduces this notion of congestion avoidance. From here, everything is absolutely fine: this paper has Messianic properties in the networking world, because it took something that was working remarkably poorly and it made it scale, to this day. In fact, if you were to take the TCP from back then, it would perform remarkably badly today, but that doesn't matter: it wasn't about the solution, it was about this framework that integrated continuous control-theory mechanisms, so you could reason about the model entirely differently. Having said that, the best thing about the paper, other than the fact that it saved the internet and stuff, is that it is incredibly honest. Actually, no one reads the paper; they all just believe it at face value. It is completely devoid of the hubris of academia.

So, congestion avoidance: what do you do? Well, if there's no congestion, you increase the window at a linear rate, and Van Jacobson defines this as one: you increase by one segment per round-trip time. On congestion, if you detect a loss event, you do multiplicative decrease, and this decrease is 0.5. How come? This paper has amazing quotes. There is a phrase which simply says there is an analytical reason for this, but it's tedious to derive. This is the only scientific peer-reviewed paper I have ever seen in my life that presents a core part of the paper in a paragraph that opens with the expression "without justification". When explaining the values, the reason to use 0.5 as the decrease term, he uses the expression "hand-waving"; I thought that was a term from physics, and it's not. And as if that one weren't good enough, his next one is that the one-packet increase has less justification than the 0.5 decrease, and is almost certainly too large.
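Putting the quoted rules together, here is a minimal sketch of the window dynamics: exponential growth in slow start, roughly one segment per RTT in congestion avoidance, and a multiplicative decrease of 0.5 on loss. Timeouts and fast retransmit are deliberately omitted, so this is an illustration of the arithmetic, not a full congestion controller.

```python
# Window arithmetic in segments, driven per ACK.

def on_ack(cwnd, ssthresh):
    if cwnd < ssthresh:
        return cwnd + 1               # slow start: +1 per ACK, i.e. doubling per RTT
    return cwnd + 1.0 / cwnd          # congestion avoidance: about +1 segment per RTT

def on_loss(cwnd):
    ssthresh = max(cwnd / 2, 2)       # the "without justification" 0.5 decrease
    return ssthresh, ssthresh         # new (cwnd, ssthresh)

cwnd, ssthresh = 1.0, 64.0
trace = []
for rtt in range(20):
    for _ in range(int(cwnd)):        # roughly one ACK per segment of the old window
        cwnd = on_ack(cwnd, ssthresh)
    if rtt == 10:                     # pretend a loss is detected here
        cwnd, ssthresh = on_loss(cwnd)
    trace.append(round(cwnd, 1))
print(trace)                          # exponential ramp, linear probing, halving, linear again
```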
But yeah, how do you share a network? Well, it turns out, if you ask a physicist, it's fluid models: this was entirely borrowed from hydrodynamics, which of course everyone here... yeah, sure. But it gives you something else as well: if you use this packet-conservation principle, you get other properties with it, and one of them is flow-rate fairness, because what you're sharing, with the packet-conservation principle, is buffer space in the network. Every flow passing through a buffer that is a bottleneck will get the same proportion of that buffer, and so every flow will get the same rate. This is great. And so, for 30 years, the networking community has been incrementally improving TCP. TCP has a bad rep because it's considered a boring research area, and that's super unfair, because it's actually seen massive improvements: think of all the ways you can improve RTT estimation, or infer congestion reliably, or scale to large bandwidth-delay-product networks, or do faster window adaptation. TCP CUBIC looks nothing like the TCP of Van Jacobson's time. Another area that a lot of research went into at some point was: how do we get the network to enforce flow-rate fairness, how do we ensure fairness between flows?

And so we get to the final paper, from 2007, by Bob Briscoe, which has one of the best opening lines of all time, I think, for a paper: it just says that this paper is deliberately destructive, and then for eight pages it bashes flow-rate fairness. The paper is quite simple: it states that there are only three things wrong with flow-rate fairness, and they are the three words.

First, it shares the wrong thing. What do you share when you share a bit rate? From Bob Briscoe's perspective, there are only two things you could want to share: you're either sharing the benefits or you're sharing the costs. So, are you sharing the benefits? Not really, because the benefit, the utility you extract from a bit rate, depends on what you're doing: a voice call gets a lot more benefit from a given bit rate than a video call does. So, by definition, if you share the bit rate, you're almost certainly not sharing the utility, unless it's exactly the same application. That's a bust. Is it sharing the costs? That's slightly less intuitive, so imagine a link with a flow of data: what is the cost of using more data? The answer is that the marginal cost of bandwidth is zero, which is what most capacity-planning people ironically don't understand. You have the sunk cost of the link, which you can amortize over subscribers over time, kind of a leased-line thing, but bandwidth itself is an ephemeral commodity (that's actually the technical term in economics): you can't save up bandwidth and use it later; if you don't use it now, it's gone. So the incremental cost, the marginal cost, of using more bandwidth is zero. Now imagine you have two flows with wildly different bit rates coming through the link: the bit rates aren't equal, but the cost is, so there's no necessary relationship between the two; they might be correlated, but not strictly. So if you share the bit rate, you don't necessarily share the costs. The paper just asserts that flow rate is a hopelessly incorrect proxy for both benefit and cost: if the intent was to equalize benefits, equalizing flow rates doesn't do it, and even if the intent was to equalize costs, equalizing flow rates wouldn't achieve that either.

So sharing the bit rate is the wrong thing, but not only that: it shares the wrong thing amongst the wrong entity, and this one is super easy to understand. If you're sharing bit rate amongst flows, I can just create more flows and get a higher percentage of the bandwidth. This is effectively just economic free riding, and the analogy in the paper is great: it's the equivalent of claiming food rations are fair because the boxes are all the same size, irrespective of how many boxes each person gets or how often they get them. And if you share the wrong thing amongst the wrong entities, then by definition fairness is a non sequitur. We incorrectly use "fairness" in the networking community to mean equality, but that's not what fairness means in real life. All my best insults come from this paper, actually: it says that a fair allocation of rates between flows isn't based on any respected definition of fairness from philosophy or the social sciences; it has just gradually become the way things are done in networking. And so the conclusion is that, obscured by this broken idea, we wouldn't know a good solution from a bad one.
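A toy illustration of the "wrong entity" point above: under per-flow fairness at a bottleneck, a user who simply opens more flows takes a proportionally larger share of the link. The numbers are made up for illustration.

```python
# Per-flow fairness is trivially gamed by opening more flows.

def per_user_share(link_capacity, flows_per_user):
    total_flows = sum(flows_per_user.values())
    return {user: round(link_capacity * n / total_flows, 1)
            for user, n in flows_per_user.items()}

print(per_user_share(100, {"alice": 1, "bob": 1}))    # {'alice': 50.0, 'bob': 50.0}
print(per_user_share(100, {"alice": 1, "bob": 10}))   # {'alice': 9.1, 'bob': 90.9}
```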
This paper actually does go into what fair would be in a computer network. It doesn't go very deeply into the engineering details, and I wouldn't bother reciting them anyway because they're not that interesting, but what would fair look like? The basic premise is that all you need is to share the costs. You can't actually understand what the utility function of the application is unless you somehow write the application, and you're never going to be in a position where you write all the applications anyway, so all you need to do is understand what the costs are from the networking perspective, and the cost is congestion. The marginal cost of using an IP link is zero until you hit oversubscription, and then all of a sudden it's something else completely. The economic term for this is an externality: if the network becomes congested, each user restricts every other user, which can be interpreted as a cost to all.

But this isn't how we do things at all. If you look at pricing plans and things like that, what do you get nowadays? In the mobile world you'll normally get a volume cap. If you have heavy users and light users and you insert volume caps, what does that actually do? Well, it reduces the volume a heavy user gets through per month, but it only marginally improves performance for light users, because you're still bound by the shackles of flow-rate fairness anyway, and you just waste a bunch of bandwidth, which is ephemeral, so you can't actually recoup the sunk cost anyway. That's the mobile world, which is fantastic, but it turns out residential markets are exactly the same thing, you just do rate limiting instead, and it's exactly the same but along a different dimension: heavy users will just take a lot longer to download crap, but they will, because that's what they do, and you still get all the waste of not using the bandwidth; you don't actually get decent utilization.

What this paper actually defends is that the correct way of passing on costs is to use congestion volume as a metric. The technical details are kind of hard to get across if you don't understand packet marking, but the idea is that at the bottleneck you mark the packets that are above a certain threshold in the queue, and you start accounting for that as the metric of how much that user's packets cost everyone else. You can think of it like locks in queueing systems: the cost of using a lock is how many people are waiting for it. If you start passing on that cost, and you start using congestion allowances instead of volume caps or rate limiting, there's pretty sound economic evidence that this would actually maximize social welfare in a network. The nice thing about congestion volume is that it has a notion of time, whereas bit rate does not: with a cap on bit rate, you can just keep maxing out that bit rate throughout a month or throughout a day, and the network doesn't account for the difference between those two. So what the paper defends is the notion of a weighted cost: each user has a budget of congestion they can cause, but how you use it is completely up to you. You can have a heavy user who is very cautious and, whenever there's congestion, stops sending crap, but you could also have light users go way faster than flow-rate fairness would imply, even faster than slow start; it doesn't matter. If you have a megabyte, just send it all at once; as long as you have the congestion allowance to pay for that behaviour, it's fine, it's within budget.
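A hedged sketch of what accounting by congestion volume could look like: only bytes that were marked (or dropped) at a congested bottleneck count against a user's allowance, which refills at a fixed rate. The class and parameter names are mine, not re-ECN's.

```python
# Token-bucket congestion allowance: charge congestion volume, not raw volume.

class CongestionAllowance:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate, self.capacity = rate_bytes_per_s, burst_bytes
        self.tokens, self.last = burst_bytes, 0.0

    def account(self, now, bytes_sent, marked_fraction):
        # Refill the allowance, then charge only the congestion volume.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= bytes_sent * marked_fraction
        return self.tokens >= 0       # False: this user's traffic is first to be dropped

bucket = CongestionAllowance(rate_bytes_per_s=1_000, burst_bytes=50_000)
# 10 MB through an uncongested path costs nothing against the allowance...
print(bucket.account(now=1.0, bytes_sent=10_000_000, marked_fraction=0.0))    # True
# ...while 10 MB with 1% of packets marked is 100 kB of congestion volume.
print(bucket.account(now=2.0, bytes_sent=10_000_000, marked_fraction=0.01))   # False
```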
The really nice thing is that the notion of fairness ends up being local to each system you actually design it into. An example in the paper is that a university campus might decide that flow-rate fairness is the right way to go internally, but then two universities connecting to an ISP get charged by congestion allowance anyway, so you can have multiple interpretations of fairness. If you understand that, the Netflix-versus-Comcast thing in the news becomes a lot more interesting, because you can actually make sense of it, kind of. Effectively, Netflix causes disproportionate amounts of congestion (not on purpose, just by the nature of what they do), and Comcast has to deal with this somehow, and obviously they say it's to protect customers; everyone thinks they're joking, maybe they are, maybe they're not, I don't know. But if you have congestion in an access network, which is supposed to be multiplexed (and that's where all the cost efficiency of a data network comes from), how do you deal with it? You either rate-limit Netflix, or block them entirely if you're in the right country, or you somehow try to demand more money from Netflix, and you can demand more money either by explicitly negotiating with them, or indirectly, by stalling negotiations, or by not installing more bandwidth ports when they ask you to, and so on. It's great fun. And obviously Netflix will go whine to the FCC that this isn't fair, except we don't even know what fair is: in networking terms, everyone uses the word "fair" and doesn't actually stipulate what the hell they're talking about. So in the Bob Briscoe model of the world, both these companies are wrong, and they're just experiencing an intrinsic market failure.

The really cool thing is that, in the end, the main assertion of the paper is that it isn't realistic to create a system the size of the internet and define fairness within the system without reference to fairness outside the system. And this is interesting, because it posits that the role of engineering, the role of protocol design, is not to define the outcome; it's to make multiple outcomes possible. Except then you get to security and you end up with a different model, so I don't know. So, how do you share a network? It turns out it's costs, if you ask an economist, ironically. And you'll notice that every person along the way has a completely different way of sharing a network, depending on what their background is. Now, Bob Briscoe actually says that you can use all the previous models; you just have to adjust the market so that you can pass on costs to the correct users, and then you're maximizing social welfare, because you actually incentivize social behaviour, whereas right now (well, not right now, because no one uses peer-to-peer applications anymore, but whatever) you have applications which can free-ride on network bandwidth.

If you're an application developer, the only thing I actually want you to take away about TCP is this: how many workarounds do you do for broken resource allocation? How many distributed-systems papers are written on the assumption that TCP is bad with small flows?
How many times have you tried to do high-performance stuff by batching and reusing connections because of the congestion window, or by opening parallel connections to hog more of it? Or, even worse, artificial limits in multi-tenant data centers: you can't burst on the network, because the only way we know how to share is by limiting things artificially. That's great. Don't worry, though, it's not just you: if you look at HTTP/2, on the first page, one of the main design justifications for restricting HTTP/2 to a single TCP connection is that the resulting protocol is more "friendly" to the network, as if friendly were an engineering term. It's just fantastic.

So, in conclusion, with regards to how you share a network: the great thing is we still have no idea. I mean, we have a very strong theoretical basis, and Bob Briscoe's work has a pretty nice practical engineering side to it, but the question becomes how you change a market which is so thoroughly distorted; it's very, very hard to change anything the size of the internet. The good news is that we know what we have is wrong (it only took us 50 years), but it's also not broken enough to fix. This is the tragedy of truly successful systems: you solve every problem along the way with the means you have at your disposal, until the only problems you have left are the ones of your own making. So this is the final stage: it's broken, but it mostly works, so this is fine. Thanks.

All right, I have a microphone for questions.

So, you mentioned the Comcast and Netflix thing as an example of something that is a failure of a protocol, but connections between ISPs aren't determined by protocols; they're done by network engineers, right? I mean, the way two ISPs connect to each other is that two network engineers get to know each other and love each other very much, and they bring up a fiber link. That's how ISPs connect. I worked at a company that ran a video CDN, and I saw this in a big way: all of these connections between ISPs, or between applications and ISPs, are determined by policy set by individual people who are making ad-hoc handshake agreements and stuff. So how does that whole ecosystem of people showing up at NANOG meetings play into these papers?

If you're a network engineer and you own a link, what do you see? You know nothing about the traffic, you don't know how much congestion is actually going through it, you only see utilization. If you own a network, you actually have no way of knowing what costs you are incurring on others, because TCP/IP is an end-to-end protocol design: the actual network is incredibly dumb, and it's not even able to tell how many packets on a link are retransmissions at any given time, so even if you wanted to know, you couldn't, because you're in the middle of the network. And one thing I didn't talk about in these papers is that even our assumption of what would be priced was different: if you go to Vint Cerf's paper, he explains that packet fragmentation is a problem because pricing is per packet, and that made total sense in the 70s, because the processing cost of packets was way higher on routers, so you would pay per packet rather than per byte, because the whole cost is in processing the header.
And then you get to the 80s, and you kind of start doing 95th percentile. The proxy that the networking community made up, literally made up, for inferring what cost to bill, in the same sense that other people have made up other proxies, and which gets reflected to the user, was the 95th percentile, because it's kind of peak, but not quite. It turns out the 95th percentile is incredibly hard to model economically, and you can't even understand what the long-term impact is, because you have to do all sorts of weird fiddling to figure out what impact this flow today is going to have on your 95th percentile by the end of the month. It's super hard, and in most cases you can't even calculate it cumulatively. It's amazing. So really, the entire networking market is distorted, and you keep using proxy metrics; networking people don't even accept that their networks might have loss, which is kind of funny. But to answer your question with regards to NANOG: I think it's just people getting drunk together. Better tools are certainly better, but it's not a matter of better tools or better visibility: the internet is meant to be opaque, that's how it scales. You don't have that visibility, by design; this wasn't something someone forgot to do.

I'm sure better visibility would help, but there are certainly people who either aren't acting in good faith, or the incentives are such that, if I'm Comcast, for example, Comcast realistically is not using up all of their residential bandwidth, right?

Maybe, maybe not. Well, actually, what kind of system do you run? Do you run at capacity, or do you just buy more when you get to 50%? Yeah, so you're just like Comcast, really.

But fundamentally there are externalities: if I own a fixed asset, if I'm Comcast, I have something nobody else has, so I can rent-seek.

So the difference is that, okay, when an ISP (let's stop using Comcast as the example ISP) actually has to buy more capacity, it has to pass the costs across all users; this is a fundamental reality of having subscribers. Imagine an ISP where 90% of the people just use email, and then you have ten people who are free-riding on the entire network: it doesn't matter how much you increase capacity, at some point the 90% who use it for email are going to say that the cost is absolutely exorbitant. So the cost structure goes down all the way to the residential user. You probably have a skewed perception that all this capacity is free or whatever, but that's because you're a power user by definition, because otherwise you wouldn't be here; for most residential people, it's insane that you even pay something like $50 for connectivity in a first-world country.

I was just going to say, for the proportional fair... sorry, for cost allocation: to solve an optimization problem assigning costs, isn't part of the problem that you're actually going to want dynamic coupling across an application? So you're actually going to want to buy costs over time that are linked on an application basis, and you therefore need something like a forward options market on congestion in order to actually solve the allocation problem over the lifetime of a flow, or the lifetime of what an application will want.
I was just going to say, on proportional fairness, sorry, on cost allocation: do you essentially solve an optimization problem to assign costs? Isn't part of the problem that you actually want dynamic coupling across an application, so you want to buy costs over time that are linked on an application basis, and you therefore need something like a forward options market on congestion in order to actually solve the allocation problem over the lifetime of a flow, or the lifetime of what an application will want?

Yes, well, this is part of the problem with the actual deployment model, because people did kind of figure that out. The 95th percentile, I kind of bash it, but it's very intuitive; for the congestion approach you'd have to get some familiarity with what the application profile is.

But the question was whether there are ways to do that. I have a power systems background, and there you have locational marginal prices. You still need to forecast forward, and it's a contingent problem that everybody needs to solve; it's dynamically coupled, and the price now is going to be contingent on that. Has it been good enough for networks to do some deployment of this? Because power is even worse than this, so never go into power.

That is hilarious, because power systems get used as the shining example of this. The interesting thing about power systems is that the idea that you pay for what you get, or get what you pay for, is incredible in networking. But one of the main complaints about that system is that you then have a variable cost to the user. In networking it's slightly different, because you can do a token bucket approach where you have an allowance, and what happens if you exceed your congestion allowance is that all you start to suffer is preferential dropping: your traffic is simply the first to die. In power systems you can't do that, I believe.
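A minimal sketch of the token-bucket allowance just described, with invented class names, rates, and sizes of my own: congestion allowance accrues at a fixed rate, traffic that causes congestion spends from it, and traffic sent once the bucket is empty is only flagged as the first to drop, never blocked outright.

```python
import time

class CongestionAllowance:
    """Hypothetical token bucket for congestion volume rather than raw bytes.
    Tokens accrue at rate_bytes_per_s; only bytes that actually caused
    congestion (e.g. were ECN-marked) get charged against the bucket."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def charge(self, congested_bytes):
        """True if within the allowance; False means this traffic should be
        the first to be dropped under further congestion, not blocked."""
        self._refill()
        if self.tokens >= congested_bytes:
            self.tokens -= congested_bytes
            return True
        return False

# Invented numbers: 10 kB/s of congestion allowance with a 50 kB burst.
allowance = CongestionAllowance(rate_bytes_per_s=10_000, burst_bytes=50_000)
for congested in (40_000, 40_000):
    print("within allowance" if allowance.charge(congested) else "flag as drop-first")
```

The second charge exhausts the bucket, so that traffic is merely marked as the first candidate to drop, which is the "preferential dropping" behaviour the answer contrasts with power systems.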
I have a question with respect to how you pack the capacity of the network. It seems to me congestion is effectively a transition between laminar flow and turbulent flow in a network, so you could treat it as a knapsack problem: how big are the boxes you're trying to stuff down your pipe, how do they fit into that pipe, and could you use adaptive packet sizing to try to ensure that the capacity utilization you get is a function of the size of the packets you send?

First off, you have vastly overestimated how technical I am. Secondly, congestion volume is measured in bytes, not packets, so the adaptive element of packet size depends on how you reason about the units, and anyway we only have one MTU. But effectively what you're describing with adaptive packet size is factored into the congestion controller you use at the edge, because that operates at a minute scale and you have to reason about how you perform over the entire flow. So effectively you'd have a lower-than-best-effort congestion controller, and I believe that ends up having the same effect, stretched over time, as adapting the packet size.

What I'm thinking is that in the model discussed you've got a certain capacity and you assume it's fixed, whereas I'm assuming that at some point in the network your pipe is only so thick, so there's only so much you can stuff down it. And with respect to the previous question about having a lot of capacity at home, it doesn't take into account that you have congestion points where the network converges. This is really the issue Comcast has, or any ISP, it could be Verizon: it's often sending the same data to different users down the same path, and as you push packets down to your endpoints they have to converge through particular nodes on the tree, on your leaf, so you haven't got a truly distributed network.

I think you're reasoning about capacity when you really have to reason about the buffers; we can talk about this afterwards.

I'm curious: say you have two parties that are trying to solve this market failure problem. What are some practical ways they could do this if they were on a local network, or didn't have to worry about third parties free-riding?

With two parties, if you're within a data center, you can realistically do this. If you read Bob Briscoe's work, it assumes congestion marking on the switches or routers and it uses ECN, which has been implemented by basically every router for the last ten years at least, so you could realistically turn this on in a data center. You'd have to implement all the host-side functionality and then figure out whether it would actually work. Data centers are actually a good environment for this, because everything is effectively under the same ownership: you have multi-tenancy by nature, but you also have very little idea of what the costs are in terms of networking. There's also the fact that, for the most part, people don't assume they're network constrained; most people say the database is the problem, or storage. You only really get network constrained once the network engineer screws it up and all of a sudden you have two fewer links and nothing works, because it turns out you never defined any sort of fallback for what counts as critical traffic versus what is just backup crap running in the background. If you're interested in knowing more I can point you at more papers, but from the Bob Briscoe paper you can easily find something called re-ECN, which is a fully fledged specification for a protocol, and it only needs commodity switches; most of the actual logic is on the host side. And if you do it within a single data center, a lot of the really thorny problems around this disappear, namely how you incrementally deploy it across different networks, and the fact that no ISP in the world does congestion packet marking.
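The host-side accounting behind the ECN-based scheme mentioned here can be sketched very simply: congestion volume is just the bytes carried by packets that got marked Congestion Experienced, summed per sender. The tenants, packet records, and telemetry shape below are invented for illustration; a real deployment would follow the re-ECN specification rather than this toy.

```python
from collections import defaultdict

# ECN codepoints from the IP header (RFC 3168); CE means congestion experienced.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

def congestion_volume(packets):
    """Sum, per tenant, the bytes of packets marked CE.
    `packets` is an iterable of (tenant, size_bytes, ecn_codepoint) tuples,
    a stand-in for whatever telemetry the switches actually export."""
    volume = defaultdict(int)
    for tenant, size, ecn in packets:
        if ecn == CE:
            volume[tenant] += size
    return dict(volume)

# Invented traffic: both tenants send the same number of bytes, but only the
# backup job's packets get marked, so only it accrues congestion volume.
observed = [
    ("batch-backup", 1500, CE),
    ("batch-backup", 1500, CE),
    ("web-frontend", 1500, ECT_0),
    ("web-frontend", 1500, ECT_0),
]
print(congestion_volume(observed))   # {'batch-backup': 3000}
```

The marked bytes, not the total bytes, are what get charged against an allowance like the token bucket sketched earlier, which is what makes congestion, rather than raw traffic, the thing being priced.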
Maybe you answered this and it went right over my head, but is there any issue here about differences in demands for latency versus throughput on these links, and does that play into this economic model in any way?

It actually does, because you can start reasoning about more than just the amount of traffic you have. You could send every packet duplicated; you'd cause more congestion, but you could pay for it because you don't have that much traffic. So the trade-off between latency and throughput is completely left up to the applications in this model; you can actually start specifying different strategies. This is the thing that TCP, as a resource allocation mechanism, is absolutely terrible at: it doesn't give you a choice, so you end up doing weird stuff to work around its limitations. That's kind of easy if you just have a single application within the data center and everyone has the same patterns, but once you start mixing traffic patterns and classes it gets super hard to reason about. The point is that congestion volume becomes a currency, think about it that way, and what you spend it on is up to you.

All right, I think we have time for one more question.

Sorry, I know nothing about networking, so this is related to the physics analogy that was mentioned in, I guess, the second or third paper. A lot of these notions of rates versus volumes and so forth don't really seem to rely on the history of the inputs into the system; to an extent you could say they're memoryless. But for someone like Netflix there could be a periodic input of traffic: every day at 8 p.m. you know there's going to be a big spike of people watching TV. So do notions of prior history for people on that network have any additional implications, given that it's not really memoryless, because someone who's consuming the bandwidth will probably do so in a predictable way?

I'm sure lots of people have spent years of research on that, but it kind of violates the first paper, which is that if you assume nothing you can encompass everything. The whole point is that if you start building memory into the system you become progressively more like the phone network, designed to cater for a specific use case, and that's why people generally don't do it. Also, it's hard, but the first reason is much more eloquent.

All right, let's give it up for João.
Info
Channel: Fastly
Views: 1,431
Rating: 5 out of 5
Keywords: engineering, computers, networking, network engineering, software, software engineering
Id: zVwGfcD6cfI
Length: 76min 56sec (4616 seconds)
Published: Thu Jun 02 2016