Scaling the Infrastructure of Practical Blockchain Systems

Captions
I assume I can take it... yeah, so a little bit over here. Let's just make sure they can hear you. Yeah, you can be heard. Let me introduce... can you hear me all right? People online, can you hear Ted? Roberto? Yes. Okay, we're going to do the full intro, right? Yeah, do the whole thing. I'll introduce... let me start with you.

Okay, welcome everybody to Ted Maofan Yin's PhD defense. It's an amazing honor to be here. We just confirmed that it was about six years ago, in the summer of '15, when this super bright-eyed kid showed up in my lab and said, "Hey, I'm looking for a project." He was part of a visiting group from Shanghai Jiao Tong, and I thought, these Shanghai Jiao Tong students are really, really bright; I'm going to give him something really hard, and let's see if he manages to finish it. "We have three weeks," which was a lie: in my own mind I knew, even then, that when he showed up we had two and a half weeks. And "it's a three-week project" was also a lie; it wasn't a three-week project, it was a little longer. I thought there was no way he would finish it, and I remember very well, as we sat in the lab with my students at the time, one of them saying, "He's already done," a week afterwards. So we had an extra two weeks with at least the first part of the project he started, and I thought, ah, this is a really, really bright student. That was the beginning of a long collaboration that went everywhere and touched everything, from the very lowest levels of storage systems all the way to distributed coordination and consensus protocols. I'm especially proud of one aspect of the work we did, which I consider to be the latest, greatest step in the series of families of consensus protocols that have ever been invented. So that's my short introduction to Ted and his accomplishments in the time he spent with us here at Cornell; he's going to summarize the work he did very shortly.

I also want to talk a little bit about my situation here. I've found, in the 50 years that I've been alive, that life comes at you in funny spurts when you least expect it. Things just happen, you have to react, and your reaction defines who you are. You're driving down the highway and you see a car accident: you either press the brake and help, or you press the gas and say, "that's not my problem." One of the things that can happen, it turns out, is that your student invents something really, really significant, and you can say, "This is wonderful, we're going to work together and make something even more awesome out of this," or you can say, "I will take this and leave the student out to dry," so to speak. Another option one has, as an institution, when you see a student who might be in a vulnerable situation, is to say, "There is a situation here that needs to be managed and addressed, no matter what we do on paper." Or you can say, "You know what, on paper I'm going to separate myself; I'm going to abdicate my responsibility to manage this situation by making the student no longer the student of the professor with whom he worked so hard, with whom they built so many things." So now I know what I did, and I know what Cornell did.
So I come to you today as a member of the audience. Ted is not officially my student; he is actually Robbert van Renesse's student, because Cornell decided that the conflict created by having invented something so awesome meant we had to separate. Regardless, in the time that I worked with him we invented quite a few things, and I'm very proud of that work. After that time, when he was officially working with Robbert van Renesse, I continued to talk to him, and he has put his name on even more projects and even more advances in the space of distributed systems. So I'm really proud and honored to have worked with Ted, and I would like to now turn it over to his official advisor for the last few weeks.

Okay, well, I'll just go over a few formalities. Ted will have about an hour to present the work he's done over the last years, which is considerable; he obviously won't cover all of it in detail. I don't know how he is going to present it, because it's our tradition that the PhD student does the preparation himself, so hopefully I'll be mostly just as surprised as you are to see where this goes today. During the talk you can certainly ask if clarifications are needed, so feel free to speak up; if you have deeper questions, please hold those until the end, where we'll have some time for discussion. After an hour we are going to reduce this gathering to, officially, the people who are in the graduate field of computer science, which includes Gün and Bobby — but, and I might get into trouble for this, Dalia, if you want to stay, I'd be happy if you did, and any other members of the graduate field, here or online, are welcome to stay. Then we'll grill the candidate a little bit more, and after that we'll let Ted leave the room — but don't go too far, because hopefully we'll have you back in here fairly soon with some good news, or some bad news; that remains to be seen. Without further ado, take it from here.

Thank you, Professor Gün Sirer and Professor Robbert van Renesse. So here we go, the beginning of my talk. We have so much content to cover and so little time, so let's just get started. I already left all my words of gratitude in my dissertation, so take a look at that lengthy acknowledgement there.

Before you start, can you do one thing and introduce your committee? Sure. My current committee consists of Professor Robbert van Renesse, my external minor advisor Adrian Sampson, who is online, and my internal minor advisor Robert Kleinberg. Great. And of course we have two honorary members in attendance.

Okay, so during this one hour I'm going to go, really quickly, through three main projects that I've done throughout my PhD, which can be divided into two parts. The first part is the lengthiest, because it includes two BFT state machine replication protocols, namely HotStuff and Avalanche; the final part talks about my recent attempt at making a better persistent key-value store.

Before we dive right into the topic, let's spend a little more time defining the problem, because we need to know what we're really tackling here. The problem is Byzantine fault-tolerant state machine replication, in short BFT SMR.
It is an equivalent problem to its consensus counterpart: you may have already heard of the Byzantine fault-tolerant consensus problem, and the two can be converted to each other, so from now on I'll use the words replication and consensus interchangeably, and when I say them I mean Byzantine fault tolerance as the environment for the network. In such a problem we have n nodes participating in a network, where at most f of them may exhibit arbitrary behavior. Arbitrary behavior means those nodes can send arbitrary messages to correct nodes to try to overthrow the entire process; this is different from benign fault-tolerant protocols like Paxos or Raft, where nodes can only crash. Notably, though, Byzantine nodes cannot interfere with the inter-replica communication between correct nodes. The objective in such a system — for example, in this diagram with four nodes, one of which is Byzantine — is to make sure that despite the existence of the Byzantine node, the rest of the network still eventually achieves agreement on the replicated log of operations applied to an internal state machine. In plainer English, the correct replicas eventually perform the same sequence of operations, so you have a consistent state across the rest of the network despite the presence of Byzantine nodes.

Of course, this problem is unsolvable in general: it was shown in 1983 by Fischer, Lynch, and Paterson that for any protocol satisfying the agreement property — the consistency, or compatibility, of the logs — there exists at least one non-terminating execution. To get around it, researchers have thought of ways to amend the protocol model or the network model. For example, you can assume the protocol terminates only probabilistically; that doesn't sound totally unrealistic if the protocol eventually terminates with probability one, probably after several rounds of iteration. Another way is to strengthen the network assumptions somewhat: for example, you can assume any message across the network takes at most some delta amount of time to deliver, which gives the synchronous model. The most interesting option is to make a compromise between these two choices, which gives a model regarded as more practical: the partially synchronous model. Under this model, a protocol always guarantees safety no matter how asynchronous the network becomes, no matter how long message delivery takes, but it still terminates once the network stabilizes and behaves synchronously; we just don't know when that stabilization happens.

We focus mostly on these two models, asynchronous and partially synchronous, because they are the common models chosen by current replication protocols. A good example for the asynchronous model is the protocol proposed by Ben-Or in 1983. It uses randomization to break the symmetry in the network: imagine a network where different nodes each start with a different value and execute the same copy of the algorithm; there has to be a way to finally collapse the values into one univalent value in order to reach agreement, and the key insight is the introduction of randomness. For the partially synchronous model, the way to break the symmetry is instead to have a designated leader.
We first give a specific replica a special role, called the leader, and that node starts to dictate to the network what the next proposal should be. Good examples also include protocols outside the BFT world, such as Paxos and Raft, which tolerate crash failures.

These models have different practical properties. The asynchronous model progresses without a leader: everyone just proposes, and with randomization they finally reach agreement, which typically takes more time to go through the entire process; the best-known result was a cubic amount of message exchange to reach consensus, until recently, when the Validated Asynchronous Byzantine Agreement paper, published at PODC 2019, improved on that — a work partially inspired by the first project we're going to talk about. The partially synchronous model does it very differently, because with a leader we have a constant number of rounds on a good day: when the network is synchronized and the single leader is non-faulty, the leader can finish consensus in a constant number of steps. But even on a good day, the quadratic cost of, for example, PBFT is still far from perfect compared to benign counterparts like Paxos, where you only need a linear number of messages to reach consensus. These protocols are usually complicated, with subtle operational logic, and are notoriously difficult to implement. And the leader, although it solves the contention problem that exists in asynchronous protocols, can itself become the new bottleneck. Here's a diagram showing the network communication pattern of PBFT as an example, but many other quorum-based protocols in either model exhibit similar or even more complicated communication patterns: you can see one-to-all broadcasts and even all-to-all broadcasts. The x-axis is time, either logical or real, and the y-axis is the different replicas, or participants, in the system.

However, in 2008 someone named Satoshi Nakamoto proposed something called Bitcoin. It doesn't even mention the word consensus in the paper, but what it actually does is quite similar: it wants to make sure that, eventually, all correct nodes across the entire network roughly, or probabilistically, converge on a common prefix of a chain. This is usually called a blockchain, although in fact it's not a chain structure but a tree structure. What it does is a repeated mining process where each block contains the hash value of the previous block, chaining the entire verifiable history of blocks, and the blocks carry payloads like transactions. The trick is that each block is required to contain proof of work; since proof of work is easy to verify but very hard to mine, or construct, if we stipulate that everyone builds on the longest chain, the longest chain keeps getting longer, and under certain assumptions the longest chain will finally be the common chain among the rest of the network. So Bitcoin operates by a very different principle.
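As a rough sketch of the mechanism just described — each block committing to its predecessor's hash and carrying a proof of work that is cheap to verify but expensive to produce — here is a toy version in Rust. It is only an illustration under simplifying assumptions: `DefaultHasher` stands in for a cryptographic hash like SHA-256, and a real block would carry a Merkle root of transactions rather than a plain payload string.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy block: prev_hash chains the verifiable history, nonce is varied by
// miners until the proof-of-work condition holds.
#[derive(Hash)]
struct Block {
    prev_hash: u64,
    payload: String,
    nonce: u64,
}

fn hash_block(b: &Block) -> u64 {
    let mut h = DefaultHasher::new(); // stand-in for a cryptographic hash
    b.hash(&mut h);
    h.finish()
}

// Verifying proof of work takes one hash; producing it requires brute-forcing
// the nonce, which is why a long (high-work) chain is hard to forge.
fn meets_pow(b: &Block, difficulty_bits: u32) -> bool {
    hash_block(b).leading_zeros() >= difficulty_bits
}

fn mine(prev_hash: u64, payload: String, difficulty_bits: u32) -> Block {
    let mut b = Block { prev_hash, payload, nonce: 0 };
    while !meets_pow(&b, difficulty_bits) {
        b.nonce += 1;
    }
    b
}
```

The asymmetry between `mine` and `meets_pow` is the whole point: everyone can cheaply check the longest chain, but extending it faster than the honest majority requires out-computing them.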
In our first work, then, we wanted to bridge these two paradigms: a blockchain-style kind of consensus that is still quorum-based. We would like a protocol that still works on a blockchain kind of structure, where the history is immutable and verifiable; that means we reach consensus on a prefix of a path over the chain, where the path represents the trace of the underlying state machine replication. Furthermore, we would like no special treatment of view change, because quorum-based BFT protocols usually require a complicated view-change sub-protocol. In the end we managed to encode the consensus knowledge entirely onto the chain topology, and that gave us a lot of freedom in designing the protocol, reasoning about its correctness, and reducing its cost. We also solved an open problem: we achieve linear cost per leader failure, in both the optimistic case and the view-change case, while keeping responsiveness. This part can be a little subtle, so please refer to the dissertation for more discussion.

At a high level the protocol operates like this: compared to the previous all-to-all communication, we have an aggregator for each round — one per phase, but implicit within each block — and the leader can shift between different phases. In our final protocol there is really no explicit definition of phases: each block may represent a generic phase, and the leader for that block, or phase, receives the votes from the previous phase and aggregates them into a proof of 2f+1 votes for the next phase, so the entire process can even be pipelined. The protocol keeps track of three major state variables: bLock refers to the block leading the preferred branch; bExec refers to the last committed block, which may already have been executed by the state machine (the green part in the figure is the confirmed, committed part); and vheight is the height of the block this replica last voted for.

How does that preference work? There is a locking mechanism under which a replica, roughly, only tends to vote for a block on the more preferred branch, which is pretty similar to the longest-chain rule in Nakamoto consensus, in Bitcoin, but done differently because we don't have proof of work. What we're looking for is, among all the blocks like b4 that got 2f+1 votes — which can be proved by a quorum certificate, or QC — a block with two hops of QCs on top: block b5 carries a proof of a majority of votes for b4, and in turn block b6 carries another proof of a majority of votes for b5. Among all such blocks, the highest one gets the preference and leads the preferred branch. Another way to interpret this: it is the most recent block in the topology that has made it through two phases of voting.

Once we have the preference definition, we can define the voting mechanism, and it is very simple: most of the time you just check whether the new block has a greater height than the last block you voted for, and in addition you check whether the new block is on the same branch as the locked block. There is also an exception case that can conditionally unlock the preference, which is crucial for liveness; I defer that to the dissertation as well.

The final question is, when do we commit? We know how to aggregate, we know how to put votes onto the block topology by forming QCs, and we know the preference.
But when do we know we can actually advance the green part, which is the final output of the protocol? This is surprisingly simple, because in our protocol we do not need any time-dependent information other than the topology. By looking at the topology, suppose b4 gets three hops of QCs — three rounds of majority votes, where the first two hops also prove the direct parent relation — then we know for sure we can advance the green part to b4. A natural question is whether it's always guaranteed that the green part can reach that point: yes, it is guaranteed, and it is proven in the paper.

Another question for this work: you seem to use three rounds of voting, three QCs, to commit, but in PBFT, or in Paxos, or in other protocols, you only need two rounds. The reason behind this is very interesting. You can of course have a two-phase variant of HotStuff, but you may end up with a subtle liveness problem, and accordingly you can amend the preference rule to get a two-phase version; that two-phase version is deeply connected to the related work Tendermint — again, refer to the paper, it's very interesting. I think it's also possible to design a two-phase protocol that still has the standard liveness guarantee, but at the cost of the linear complexity. So the reason we keep three phases for the final commitment is partly the theoretical contribution: we want to keep it linear.

So what's the secret sauce? I dumped a bunch of things into your head, and they seem interesting, and you know the overall structure, but why is this protocol special; what does it do to solve the problem? If I can say only one thing about the protocol, it is that it tries to keep time-independent data structures. In older protocols you usually have a bunch of state variables: a map, an array, view numbers for different parts, sequence numbers — a bunch of variables to keep track of — and the protocol progresses through an intertwined space-time diagram. That makes the protocol difficult to analyze and difficult to improve. Instead, in our work we move most of the consensus-related knowledge onto the chain itself, so the chain becomes the space-time structure: we can derive a lot of useful information purely by looking at the topology of the chain, and we only need to maintain three main variables with clear definitions over time.

Overall, the main contribution of HotStuff, theory-wise, is that it is the first partially synchronous protocol that achieves linear cost even in the presence of leader failures, and it also inspired other protocols, such as the Validated Asynchronous Byzantine Agreement work I just mentioned and Sync HotStuff, so it impacted protocol design in the two other model categories that we won't cover in this talk. Engineering-wise, the proof of safety is just about checking some conditions on the tree topology, which is very different from traditional quorum-based protocols, and there is no explicit view-change handling needed for safety, so the safety part can be decoupled from liveness.
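To make the "safety is just a check over the tree topology" point concrete, here is a minimal sketch of the voting and commit checks in Rust. This is not the actual HotStuff implementation: the `Block` and `Replica` types, the bare-id `justify` field, and the three-chain walk below are simplified assumptions (real blocks carry commands, signatures, and an aggregated quorum certificate), but the shape of the rules matches the description above.

```rust
use std::collections::HashMap;

struct Block {
    id: u64,
    parent: u64,  // direct parent in the chain (tree) topology
    justify: u64, // id of the block certified by the QC this block carries
    height: u64,
}

struct Replica {
    blocks: HashMap<u64, Block>, // the chain/tree topology
    vheight: u64,                // height of the last block voted for
    b_lock: u64,                 // locked block: head of the preferred branch
    b_exec: u64,                 // last committed (executed) block
}

impl Replica {
    // Walk parent links to check that `b` lies on the branch of `ancestor`.
    fn extends(&self, mut b: u64, ancestor: u64) -> bool {
        while let Some(blk) = self.blocks.get(&b) {
            if b == ancestor { return true; }
            if blk.parent == b { break; } // genesis points to itself here
            b = blk.parent;
        }
        false
    }

    // Voting rule: vote only for a higher block that either extends the
    // locked branch or carries a QC newer than the lock (the conditional
    // "unlock" needed for liveness).
    fn safe_to_vote(&self, b: &Block) -> bool {
        let lock_h = self.blocks[&self.b_lock].height;
        let justify_h = self.blocks[&b.justify].height;
        b.height > self.vheight
            && (self.extends(b.id, self.b_lock) || justify_h > lock_h)
    }

    // Commit rule: advance the committed ("green") part once a block heads a
    // three-chain of QCs whose first two hops are over direct parents.
    fn try_commit(&mut self, b_star_id: u64) {
        let b_star = &self.blocks[&b_star_id];
        let b2 = &self.blocks[&b_star.justify];
        let b1 = &self.blocks[&b2.justify];
        let b0 = &self.blocks[&b1.justify];
        if b2.parent == b1.id && b1.parent == b0.id {
            self.b_exec = b0.id; // a full implementation would execute b0's branch
        }
    }
}
```

Everything here is a pure function of the block topology plus the three variables, which is what makes the safety argument so compact.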
In our paper we coined a concept called the pacemaker. The pacemaker decides when to create a block and when to elect a new leader, but however you design the pacemaker, it won't affect the safety of the protocol as long as the basic assumptions hold. The safety part can be written in about 200 lines of C code, which is pretty short and even feasible to proof-check. As an example, Facebook adopted our protocol in their blockchain platform: they derived their own variant, called LibraBFT and now DiemBFT, and it contains an instantiation of the pacemaker concept, which is exactly what we expected — based on their own product they have their own optimizations and their own policy for rotating the leader and deciding when to propose a block, so they have their own pacemaker. Lastly, we have a prototype implementation that is pedagogical and can be used for research. What we didn't anticipate is that Facebook would adopt the work and it would become very famous; but deep down I think this protocol is elegant and interesting by itself. We've also heard rumors and news about the political obstacles that Novi, or Diem, is going through, so here I just sincerely wish them the best of luck with their launch.

After the first work, let's take a brief pause and think about a question: what prevents quorum-based consensus from scaling? Even in HotStuff there is a leader, and of course a leader bottleneck, and the same holds for other non-BFT protocols like Paxos. I can think of two main reasons. The first is the strong consistency requirement of the standard state machine replication model itself: in standard state machine replication, all the replicas, as shown in the diagram, want to collaboratively write into one consistent log, which intuitively is not scalable — it's like a classroom where you want all the students to grab one pen to write down the next entry in the log. One way to resolve this, if you have a lot of replicas, is to adjust the model: you can weaken it, because many applications do not require total ordering. I think the first notable attempt in the consensus world was Egalitarian Paxos, back in 2013. They argued that the ordering of transactions may not matter at all, and you can reorder transactions that do not conflict with each other, creating more concurrency — their title was something like "there is more consensus in egalitarian parliaments," which is very interesting. The other major reason is that in all the quorum-based protocols we're currently seeing, at least some node — even just one node — has to exchange messages with the vast majority of the others. This sounds inevitable: you want to reach consensus, you want all correct nodes to participate, so you have to consider the opinions of the entire set. But is that really the case? Not quite. Think of an analogy in politics: in the U.S. we have a Congress, with a hundred senators plus more than 400 representatives in the House, and usually a bill has to be passed by those bodies with a simple majority or a super-majority, very similar to quorum-based consensus. But if you ask yourself whether the Congress model is scalable, of course it's not.
You cannot include every American in the U.S. Congress; it would be chaotic and the efficiency would be very low. However, in the real world we do seem to have some degree of consensus on social media. From the statistics there are over 200 million social media users in the U.S., and once news, or a rumor, comes out, magically, somehow, there is a tendency for the majority of the crowd to move toward one position — the rumor is fake, or the rumor is authentic. That is what we're actually interested in: could we have a loosely organized consensus with some probabilistic kind of safety guarantee? We don't have to be as secure or as deterministic as the U.S. Congress; we can go for the social media style. And yes, there is a way: we can do peer-to-peer gossip in the BFT environment by sampling. Of course, this is not a free lunch — we can't achieve the same level of democracy with social media as we would with the U.S. Congress — but it's a Bitcoin kind of trade-off: we trade deterministic safety for a probabilistic one, as long as we can make the failure probability arbitrarily low. And by gossiping you already assume a bit more synchrony in the network, though not as strong as in lock-step protocols, because when you gossip in a very large network you don't run round one, round two; it's hard to coordinate everyone.

In the first graph we can see the failure probability going up with the percentage of Byzantine nodes, up to 50 percent. For Bitcoin, 50 percent is the exact point where it breaks down — that's the 51-percent attack. Classical protocols usually require 3f+1 participants to tolerate f failures, so once you go over the one-third threshold the protocol immediately collapses, but below it the protocol never fails. Our protocol is more like a smooth threshold function that combines properties of both Bitcoin and classical protocols. If we take a closer look by log-scaling the y-axis, we can see that by configuring the protocol carefully we can achieve better safety than Bitcoin's six-block rule, and also be much faster, according to the evaluation results I'll show later.

The entire logic works by adding a feedback loop: we sample the network, we gossip, we adjust our preference, we sample again and adjust again. By doing this repeatedly, we hope to create positive feedback in the network, so the binary consensus topples over to one side — one color, zero or one — once the balance becomes hard to maintain, and indeed our protocol shows exactly this characteristic. In short, rather than asking everyone for their preference or opinion on something, we just take a constant-size sample whose size k depends only on the safety parameter you'd like to choose: no matter how large n grows, k can stay very small, which is what makes it scalable. As an example, say we have a bunch of circles representing nodes, or participants, in the protocol, each with a current color preference. Alice samples five other nodes and concludes blue might be the majority answer; at the same time Bob samples five other nodes and thinks yellow is the majority answer. For now, they could each flip their color to the opposite side.
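As a rough illustration of this sample-and-adjust loop, here is a minimal sketch in Rust. The `Network` trait, the helper `query_sample`, and the parameters k, alpha (the majority threshold within a sample), and beta (the number of consecutive successful rounds needed) are assumptions made for the sketch; the real Snow protocols layer confidence counters and DAG voting on top of this basic loop.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Color { Red, Blue }

// Hypothetical network handle: ask k peers, chosen uniformly at random,
// for their current color and return the replies.
trait Network {
    fn query_sample(&self, k: usize) -> Vec<Color>;
}

// Simplified Snowflake-style loop: keep sampling until the same color wins an
// alpha-majority in beta consecutive rounds, then decide on that color.
fn snowflake(net: &impl Network, mut color: Color, k: usize, alpha: usize, beta: usize) -> Color {
    let mut count = 0;
    while count < beta {
        let replies = net.query_sample(k);
        let red = replies.iter().filter(|c| **c == Color::Red).count();
        let (majority, votes) = if red >= replies.len() - red {
            (Color::Red, red)
        } else {
            (Color::Blue, replies.len() - red)
        };
        if votes >= alpha {
            if majority == color {
                count += 1;       // same color seen again: build up confidence
            } else {
                color = majority; // flip to the sampled majority
                count = 1;        // and restart the streak
            }
        } else {
            count = 0;            // no alpha-majority this round
        }
    }
    color
}
```

The point of the sketch is that the per-node work depends only on k, alpha, and beta (the safety parameters), not on the total number of participants n.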
This process cannot be maintained indefinitely, though. The property is that if you design the protocol carefully, you create a positive feedback loop: once the balance is off by a bit, either because of network fluctuation or the initial preferences, the entire network quickly converges toward one value. In our first BFT protocol of this kind, called Snowflake, we have a loop that samples k peers uniformly at random; if a majority agree on a color and that color is the same as the previously seen color, we increase a counter, otherwise we change our color and reset the counter. But this is not ideal, because the memory of each node is too short-lived: once I see a majority vote from the network whose color doesn't agree with my current preference, I immediately give up my preference and yield to the new color. To counter that, for liveness, we added the notion of confidence: we tally and increase the confidence for each color, and our current preference is defined by the distribution of confidence — we only prefer the color with the highest confidence. The rest of the protocol is very similar. There's a demo showing the dynamics of this second protocol, the Snowball protocol; we can play with it later at some point, to save time.

So we have a protocol that decides between two values, binary consensus, with a probabilistic safety guarantee. But what now? We want a usable system: say we would like to build something similar to Bitcoin, a decentralized payment system. The question is how you move from that binary protocol to an actual system. The key observation is that standard replication is not necessary for realizing a decentralized payment, because the Bitcoin white paper already proposed a model, the unspent transaction output (UTXO) model, which ensures a verifiable flow of spending by using digital signatures. The only missing piece of the whole story is how to resolve conflicts between transactions, to reject those transactions that do not obey the standard model.

Here's an example. In this diagram we have three transactions, A, B, and C. On the left-hand side of transaction C you can see two proofs; the proofs are signatures showing that the creator of C has the ability to spend the unspent outputs from transactions A and B. The entire structure is like a flow of cash: going from left to right, the spendable part of A is fully consumed by transaction C, the first spendable part of B is consumed by transaction C with one unspent output left over, and there are also unspent outputs left for transaction C. The unspent parts are usually called the balances of the virtual accounts in such a decentralized system, and the model is called the unspent transaction output model. Now suppose there is another transaction, D, which also tries to spend the same unspent output of A — and maybe the owner of that output is able to create these two transactions at the same time, because why not: they can create two proofs that prove the validity of spending the same output. Then things get a little complicated, because we have to choose between D and C.
Only one can really go through; otherwise you would make money out of thin air, and anyone could just claim to spend an already spent output to get more money. The resolution here requires something weaker than consensus. Weaker, because an honest spender never creates such a transaction D: creating it is quite deliberate. Imagine a system where each node checks transactions for conflicts; to double-spend, your program has to generate two conflicting transactions, submit them to two different nodes in the network, and hope that the majority of the network receives both in time for the conflict to matter. It's a very deliberate process. That's why in our work we argue that a malicious spender who tries to double-spend an unspent output may just get stuck forever — we don't care — so the protocol has weaker liveness than the consensus problem. As for the honest case: if I'm an honest spender and never do this, then my transaction is the only choice in the conflict resolution; that's like the trivial case in consensus, which can be heavily optimized.

Here we have different conflict sets: these three transactions conflict with each other, these two conflict, and this one is honest, with no conflicting transaction. What we can do, in addition to having multiple instances of binary consensus for the resolution, is to chain them up together to better utilize the voting mechanism: suppose I vote for T8 because it is the only preferred one in its set; then I also cascade votes to T4, T2, and all the way to T1. This is a simplified example, because in reality a transaction can contain more than one input, just as in the first diagram, and the problem is that if transaction A conflicts with B, and B conflicts with C, it is not guaranteed that A conflicts with C — they can have different inputs, and it is the inputs that conflict. To account for that correctly, we run the consensus instances at the transaction-input level. And there we go, we have the payment system: we repeatedly use binary Snowball instances, generalized to multiple values rather than deciding between just two; that for sure creates some liveness issues, but we don't care, because an honest spender will only create one value in the consensus process.

Then comes the evaluation. We evaluated the system in a local data center, from 100 up to 2,000 nodes. The two bars represent the throughput with and without signature verification for the transactions, because verifying proofs means verifying signatures. As we can see, it scales pretty well; you can barely notice any performance degradation as the number of nodes doubles. The latency is also surprisingly good, and by the way, the protocol was configured to be much, much better than Bitcoin in terms of safety — something like a 10^-9 or 10^-10 failure probability given around 10 to 20 percent Byzantine nodes — so the comparison is a bit unfair, but you get the point: it is possible to have a protocol that is way faster than Bitcoin, without proof of work, and that scales very well. The majority of latencies lie at sub-second, and the maximum latency is also sub-second.
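To make the conflict notion concrete: two transactions conflict exactly when they try to spend at least one common unspent output, and that relation is not transitive across transactions, which is why the instances are tracked per input rather than per transaction. A minimal sketch in Rust, with a hypothetical, stripped-down `Tx` type (a real UTXO transaction would also carry the outputs it creates and the signatures proving the right to spend each referenced output):

```rust
use std::collections::HashSet;

// Inputs are (id of the funding transaction, output index) pairs.
struct Tx {
    inputs: HashSet<(u64, u32)>,
    // outputs and signatures omitted in this sketch
}

// A and B may each conflict with C through different inputs while not
// conflicting with each other, so conflict sets live at the input level.
fn conflicts(a: &Tx, b: &Tx) -> bool {
    !a.inputs.is_disjoint(&b.inputs)
}
```

For an honest spender every input has a conflict set of size one, which is the trivial case the protocol optimizes for.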
We're more interested in a geo-replicated scenario, because that's more realistic: suppose we distribute these 2,000 nodes across 20 cities around the world, all nodes directly participate in the consensus process I just described, and we assume full signature verification — a very realistic system. The results are again very promising: we achieve thousands of transactions per second with only around one second of latency, and the maximum latency is around four seconds.

To conclude this work, we first introduced an interesting family of binary BFT protocols called the Snow family, which does consensus by sampling. It introduces a new paradigm, a new perspective on how to do consensus by changing the assumptions a little. This family is quorum-like, just like the quorum-based systems, because nodes also vote, but it has probabilistic safety, so the protocols can be loosely organized, like Nakamoto consensus, and they incur less communication cost, which leads to the scalability — all thanks to the use of sampling. On the other hand, we also built a prototype implementation of the Avalanche system, a decentralized peer-to-peer payment system: you combine multi-value Snowball instances with the UTXO model to get a fully functional payment system. This system has already been commercialized and deployed worldwide, with on the order of a thousand validators. What we didn't anticipate, again, is the Avalanche system itself: we built a startup around this work, the current system has been up and running since last September, and it is carrying a circulating supply worth around two billion US dollars. You can see somebody built a nice map showing the distribution of the nodes running this protocol; there were around 1,970-ish when I took the picture, and the volume is very decent, from my perspective, showing the system is actually used by many users around the world. Here is a screenshot of the block explorer showing real-world transactions under the UTXO model: you can see there are inputs and outputs, with one input generating two outputs. The Avalanche protocol runs what is called the X-Chain of the platform, which is the main part of the platform.

That is all for the first two works. For now I want to take a break, take a further step back, and think about optimizing, or scaling, the blockchain infrastructure as a whole. We managed to optimize the protocols — we offered two very different and interesting solutions — but once we have those solutions, the next question is where else the bottleneck could be, because in a system, once you fix one bottleneck, other bottlenecks emerge. My opinion is that the state machine itself might be the next bottleneck, because the replication can already be made faster than a typical so-called EVM implementation on the market. So now we're diving into the replica itself, the state machine's state, and one immediate problem is storage, based on our current ongoing research experience. So we did a totally different kind of work: it's not about replication — forget about BFT, forget about voting, all of that, for now.
We're talking about a storage system called CedrusDB. We had three main observations before we set out on this journey.

The first observation is that there have been many on-disk indexing schemes; these are the schemes commonly used by open-source or commercial key-value stores. One commonly used data structure is based on the B or B+ tree. I believe everyone has learned about balanced search trees at some point in their career: these trees are stable, with predictable degradation — as you insert more items the height increases logarithmically, with no surprises in the process. But they produce non-sequential small writes, because to maintain the tree structure you may need to amend tree nodes here and there, inducing non-sequential writes to the underlying persistent storage, like your hard drive, and tree balance maintenance creates extra amplification — in a B+ tree you rebalance by borrowing children from a sibling subtree, which also induces amplification. Another approach, which became very popular recently, is log-structured merge tree (LSM) solutions, exemplified by LevelDB and RocksDB. They are optimized for write-intensive workloads because they generate sequential writes, and the final storage layout is compact. But they have their own deficiencies: they perform periodic merge-sort compaction as their rebalancing, or adjustment, step, which is very different from the B-tree-based solutions — periodically you see a huge performance degradation, because data is read from disk into memory, fully sorted and combined, and written back to disk, which induces a lot of amplification; they also have higher read amplification. Both kinds of on-disk indexing scheme were intended for a scenario where the data cannot largely, or entirely, fit into memory and where secondary storage is much slower than memory. Nowadays, because of advancements in both hardware and data structures — not only faster secondary storage but also larger main memory capacity — neither of these assumptions may hold in some cases.

Meanwhile, the second observation is that recently there have been log- or slab-based solutions that do no on-disk indexing at all, because organizing indexes or data structures on disk is very expensive. They take a very different approach: they use fast in-memory indexing, which means they can use a hash table or some other not-so-storage-friendly data structure to keep track of the data, and they record all writes to a log. When you want to use the key-value store after a failure, or after you shut it down, you have to either replay all the changes against the initial image of the store, or periodically take checkpoints of the entire storage, so the recovery time can be very expensive — that's their main deficiency.

The third observation is that many applications do not require the sorted property. Whenever we use a key-value store, we usually just treat it as a persistent, on-disk kind of unordered map, like in C++: we input a key and a value to store, and then input the key to retrieve the corresponding value.
But in my opinion, and according to my experience, many applications that just manage user configurations do not care about the ordering of the keys; neither does some cloud infrastructure; and blockchain systems do not care most of the time, because their keys are hashes already.

Based on all these observations, we propose a new data structure called the lazy trie (or lazy "tree," depending on how you pronounce "trie"). It is a data structure that serves as an in-memory index but is at the same time storage-friendly, meaning it can be directly maintained on disk. That way we can achieve performance comparable to a state-of-the-art log-based approach like FASTER while maintaining a fast recovery speed, like RocksDB. The data structure is actually very simple — if you think about how to do a hash-based trie, you will get it: hash the user key into a fixed length of bytes, then partition those fixed-length bytes into several chunks and use them as indexes into the different levels of a radix tree. The properties of this basic structure, which is kind of trivial in some sense, are a logarithmic tree height and almost no collisions, if the chosen hash function is strong enough.

Of course, that basic structure is not yet practical. If you take a closer look at the tree, you will notice that for any two keys, since you already used a very strong hash function, it's very unlikely for them to share a long prefix, so you will see the structure on the left, where two keys have a large part of their paths as a singly-linked chain. Based on this statistical property we can do path compression on the suffix of a path, shrinking the left structure into the right one. But that doesn't solve the entire problem: radix-tree-style solutions face a dilemma. If you choose each tree node to be very small — say a Patricia tree, which is a binary radix tree — the tree can be very tall, which means a lot of disk I/Os. On the other hand, if you have a fat enough radix tree, the tree height is greatly reduced, but each tree node's footprint is very large, so the tree footprint can be even larger than the data set you store, which is kind of bizarre. For example, here is a 128-way lazy trie, the rudimentary version, as the solid blue line; the circle shows an example with a user data footprint of 23-byte keys and 128-byte values — 100 million small items. To store that, you need a much higher storage footprint just for the metadata that keeps track of this user data, and the reason is the low utilization of each tree node. The periodic jumps are due to increases in the tree height, or, to be more exact, the average path length.

(Question: what is the parameter s in those graphs? Oh, sorry — here we're looking at just one case, without the optimization I'm going to talk about on the next slide. Is it the log of the branching factor? No, it's a parameter introduced on the next slide. Okay, sorry for the confusion.)
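A minimal sketch of the basic lookup path just described — hash the user key into a fixed-length digest, cut the digest into fixed-width chunks, and use each chunk as the child index at one trie level. The helper below and its 64-bit digest (from `DefaultHasher`, standing in for a strong hash) are simplifications for illustration:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a user key to its path in the radix trie: one chunk of the hashed key
// per level. With 7 bits per level this yields indices into 128-way nodes.
fn key_path(key: &[u8], bits_per_level: u32) -> Vec<u32> {
    let mut h = DefaultHasher::new(); // stand-in for a strong hash function
    key.hash(&mut h);
    let digest = h.finish();          // 64 pseudo-random bits for the sketch
    let levels = 64 / bits_per_level;
    let mask = (1u64 << bits_per_level) - 1;
    (0..levels)
        .map(|i| ((digest >> (i * bits_per_level)) & mask) as u32)
        .collect()
}
```

Because the digests look uniformly random, two keys rarely share a long path prefix, which is exactly what makes the suffix path compression described above effective.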
So then we introduce a notion called sluggishness, or s in those graphs. The principle of sluggishness is this: without any sluggishness, as in the previous slide, we end up with a path-compressed tree where all the leaves are data nodes and there are some interior trie nodes, but each trie node can have hundreds or thousands of children, which consumes a lot of storage overhead. What we do instead is lazily expand the tree: rather than expanding all at once, we systematically allow some overflowing data items under a single child. In the example, if we allow s = 3, then d0 and d1 can both sit under this child to form a list, and d3, d4, d5 under that child to form another list; whenever we insert an extra item under a child, like d6, and it overflows the upper limit s = 3, we have to redistribute the whole list by introducing a new trie node to split it. The final structure in graph (c), compared to graph (a), has two fewer trie nodes in total, so it saves roughly fifty percent or so of the tree footprint. With a sluggishness of four, for the previous example of 100 million items — around 21 gigabytes of user data — we only need to spend 2 gigabytes on the tree structure instead of 31 gigabytes. It's a simple but elegant optimization that makes the whole structure practical.

(Question: how do you store this — based on the previous slide, is it in the same node, and how does the sluggishness come in? Oh, the sluggishness is a global parameter for the key-value store. Here, for example, I have a data store that allows a sluggishness of three, hence this linked list of three items; whenever you get a fourth item, you have to split the list further by introducing a new node. And how many disk accesses do you need for splitting this list of three? Good question — the overhead of splitting, the additional operation we introduce here, is negligible in practice, because the frequency of splitting is very low, thanks to the statistical properties the hash function gives the trie. — We're going into the last five minutes; you can choose what you want to leave time for. — Okay, it's around the end, it's around here.)
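Here is a rough sketch of the overflow-and-split behavior behind sluggishness, assuming hypothetical `Child`/`TrieNode` types and a `split` callback that redistributes an overflowing list one level deeper; the real structure does this directly on disk pages rather than on heap objects:

```rust
// A child slot is either a small "sluggish" list of up to s items, or a
// pointer to a deeper trie node created by an earlier split.
enum Child<K, V> {
    Items(Vec<(K, V)>),
    Node(Box<TrieNode<K, V>>),
}

struct TrieNode<K, V> {
    children: Vec<Option<Child<K, V>>>, // fanout-many slots
}

fn insert_sluggish<K, V>(
    slot: &mut Option<Child<K, V>>,
    key: K,
    val: V,
    s: usize,
    split: impl FnOnce(Vec<(K, V)>) -> TrieNode<K, V>,
) {
    match slot.take() {
        None => *slot = Some(Child::Items(vec![(key, val)])),
        Some(Child::Items(mut items)) => {
            items.push((key, val));
            *slot = if items.len() > s {
                // Overflow beyond the sluggishness bound: redistribute the
                // list by introducing one new trie node a level deeper.
                Some(Child::Node(Box::new(split(items))))
            } else {
                Some(Child::Items(items))
            };
        }
        Some(Child::Node(node)) => {
            // Descend using the next chunk of the hashed key (omitted here);
            // put the node back unchanged in this sketch.
            *slot = Some(Child::Node(node));
        }
    }
}
```

Because a strong hash spreads keys out, the overflow branch is rarely taken, which is why the splitting overhead is negligible in practice.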
In addition to the data structure, we also have an interesting system design for it. Given the time limit I'm not going to explain the engineering details in this talk, but here is the overall architectural look of the system. We divide the linear storage space into pages, group pages into the granularity of a region, and then group several regions into individual files, all by looking at different bits of the address — a page-table-style division of the space. Because our data structure is both memory- and storage-friendly, we can directly memory-map a region into memory; whenever you make changes to pages, you create dirty pages that go into the write buffer so you can have batched writes. There is some basic level of concurrency control to exploit the concurrency in this data structure, because a radix tree has a lot of potential for concurrency, and there is a separate disk thread that first does write-ahead logging and then schedules the block writes using asynchronous Linux I/O. That's the overall sketch of the implementation.

Overall, the evaluation results are very promising. Since we're a little short on time, let's just look at the graph on the left: for read-intensive workloads, from 90 percent reads to 100 percent pure reads, with four threads, our key-value store outperforms the other main competitors. For more write-intensive ones, FASTER, which is a log-based approach, is of course faster than everyone else, but even in this case we achieve comparable or superior write-intensive performance relative to the other on-disk indexing schemes — and we are an on-disk indexing scheme. Here is the scalability with respect to the number of threads; there are indeed some bottlenecks in our implementation, which we identified and discuss in the paper, but overall the trend looks very promising — if you check FASTER here, although we saturate over here, we're still comparable when you have a large number of threads.

And that's recovery, the last table. For the normal throughput under the given write-intensive workload, at roughly 50 percent updates: we, like FASTER, use a hash-based index, so we don't preserve the order of user keys, but our recovery time is close to the other on-disk index solutions, while our throughput is comparable to FASTER and KVell, a slab-based solution. Others, like Masstree, achieve very high throughput, but you either have to create checkpoints or suffer a very long recovery time.

The conclusion for this part: we propose the lazy trie data structure, which is friendly for both in-memory and on-disk access, grows dynamically with near-optimal tree height, and is efficient under concurrent access. We designed and implemented the full key-value store; we plan to open-source it very soon, as a ready-to-use Rust crate with C and Go language bindings, and I also separated the write-ahead logging out into a basic library, so it can be used for future research purposes.

That's it, and I would like to leave some final remarks, if I'm not over time. In short, many researchers, including me and my collaborators, have the feeling that consensus research is slowly converging — it's hard to do new things in this area. I'd say it's twofold: on one hand, yes, it's hard to do groundbreaking new research projects, but on the other hand that's a good thing for industry, because you have to wrap things up in order to move on. If you look from a wider perspective at an entire blockchain system, it's not only about consensus but about where consensus gets involved: it has something to do with the state machine, with the execution of operations on the state machine, and with storage. I envision that the next main bottleneck for such blockchain infrastructure lies in the state machine — the storage that preserves the state, and the execution — and there are many things to do there, and of course this all has some relation to consensus.
It's good to treat these parts as black boxes, but sometimes, to get more optimization, it is better to have knowledge of each part so you can integrate them better.

To conclude, my PhD journey started with uncertainty, went on with surprises, and finally finished with joy and fulfillment. I see myself both as a researcher and an engineer, but more in a bridging role that weds theory with practice; location or position does not affect my commitment to doing research, so I'm definitely looking forward to future collaboration, and this exam really changes my career trajectory. At last, I'd like to say there are too few blockchain systems researchers, to be honest — most researchers are either in crypto and security or in pure theory — and the importance of building decentralized infrastructure is largely undervalued by the systems community, which could probably explain why the Avalanche paper is still rejected even now. Thank you all for listening, and thank you for attending my defense.

Okay, well, technically we don't have time for questions, but let's have some questions nonetheless. Any questions in or outside the room?

Thank you for the talk. Previously you showed a figure where the y-axis is the failure probability and the x-axis is the ratio of Byzantine nodes; the x-axis stops at 0.5, and the failure probability at that point is 1. I'm wondering whether the phenomenon of the power law plays a role here: the power law says that in many systems a very few players may control a majority of the resources or power, and I'm asking because I've heard that, for example, in Bitcoin a large majority of the resources is controlled by very few mining parties.

That's a very insightful question. (Can we get the full question? Sure — basically, my question is whether the power law is at play in that figure.) At the time we drew that graph we had the same question, because strictly speaking, when we talk about the fault tolerance of Bitcoin, or Nakamoto consensus, we're talking in terms of computational power, not identities, but in the graph we mix them together because they look similar. In reality, of course, if you own the supercomputers of multiple labs in the U.S., you can overthrow Bitcoin while probably possessing just one identity. The graph more or less shows, for each particular protocol under its own assumptions and model, the portion of adversaries in the network: for Bitcoin that's computational power, but other protocols don't require that power, they just need identities, and that's handled either by some admission rule — like in this room, I invited everyone, but random people cannot come to this talk, so it's by administration, a permissioned system — or you can go permissionless with an argument similar to Bitcoin's, using proof of stake, where everyone who participates has to stake at least some amount of money. It's essentially the same thing. Good question, thank you.

Other questions? I did not fully get the connection between CedrusDB and blockchain — why is it going to be particularly good for that? Oh, that's a decent part of the entire story. First of all, the hash trie, the lazy trie structure, was — I forgot to say — actually inspired by the Merkle Patricia trie used in blockchains.
In blockchains you use Merkle trees to prove that some item belongs to a set. Before we did this research we were using an existing solution, LevelDB, in Avalanche — when I coded up the prototype I had to use a database — but I soon noticed that on commodity-level storage hardware the performance is not scalable, due to the compaction. So we set out to find a better key-value store for blockchains, and later we realized that this key-value store could also be used as a generic key-value store. That's why the tone of this part is somewhat separate from the first, but you can absolutely use it as a blockchain key-value store; it just means I'm trying out a new thing that can also contribute to the blockchain infrastructure, because, as I said, state machine storage is also a huge bottleneck.

An online question from Dalia: did you compare CedrusDB with Trillian? I'm not familiar with it, sorry — the quick answer is no, but we'd like to know more about it and do the comparison.

Any more questions, in or out? Okay, well, in that case let's do one more round of applause. Thank you all for coming, online and in person.
Info
Channel: Ted Yin
Views: 661
Rating: 5 out of 5
Id: TmDpN9VMnvo
Length: 73min 42sec (4422 seconds)
Published: Thu Jul 15 2021