XArray: One Data Structure To Rule Them All

Video Statistics and Information

Captions
[Mic check] Since I'm in New Zealand I thought I really must make more Tolkien references, but the idea genuinely is that this is the only data structure you need.

So what is it? It's an automatically resizing array of pointers, and you don't need to know anything about its implementation, because I'm going to change it — hopefully for the better. The important traits are these: it's an array of pointers; it's indexed by an unsigned long (some people have been asking for a 64-bit index — they're not getting it yet, although they might in the future); it starts out with every pointer NULL; and it occupies no memory to begin with. It has a spinlock embedded in it, so you don't need to worry about locking — it takes care of its own locking, at least if you're using the regular API, which is all I'm going to talk about today. There is an advanced API where you take care of the locking yourself; I've touched on that in other talks, and if you go and look at the Linux Plumbers talk from November in Vancouver you can find out more about it.

The other important performance aspect is that loads are store-free: a load never takes the spinlock. RCU — thanks, Paul McKenney — is the technique used to avoid that, so a modification can be happening to the array at the same time as your load, and you will atomically get either the pointer that was there before or the pointer that was just stored, but never a mixture of the two, and never something that was not stored at that index.

Most users only need three, or maybe four, functions: you load, you store, maybe you erase rather than storing a NULL (some people prefer that), and you can iterate over it. Counting from zero to infinity takes quite a long time, so we have xa_for_each, which executes the loop body once for each non-NULL entry in the XArray.

We also have functionality called marks. These are search keys — you get three marks per entry — so you can say "iterate over only the entries which have mark 0 set", or only the ones with mark 2 set. Users mostly define their own meanings for mark 0, mark 1 and mark 2, so that they can say "iterate over only the entries which are not yet written", or "only the ones which are currently being written", that kind of thing.

Then there are some parts of the API that are less used than load and store. Insert will only store if there was nothing already at that index. Compare-exchange is there if you want to be certain that nobody else was racing with you. Find will search through the array for the next entry which matches the filter. And reserve is, frankly, an awful part of the API, but what it does is this: if your locking is so complicated that you would otherwise have to mess around with allocation before you take any lock at all, you can say "reserve this index", and then later do a store without allocating memory.
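As a minimal sketch (not from the talk) of the basic calls and marks just described, written against the documented in-kernel XArray API — details such as the exact xa_for_each() argument order have changed across kernel releases since this talk, so treat it as illustrative:

    #include <linux/xarray.h>

    static DEFINE_XARRAY(my_array);         /* starts empty, spinlock embedded */

    static int xarray_demo(void *p)
    {
            unsigned long index;
            void *entry;

            /* Store p at index 5; the array takes its own spinlock and
             * returns the previous entry, or an xa_err() value on failure. */
            entry = xa_store(&my_array, 5, p, GFP_KERNEL);
            if (xa_is_err(entry))
                    return xa_err(entry);

            /* Loads are store-free: no spinlock, protected by RCU inside. */
            entry = xa_load(&my_array, 5);

            /* Tag the entry with one of the three marks; what a mark means
             * is entirely up to the user of this particular array. */
            xa_set_mark(&my_array, 5, XA_MARK_0);

            /* Run the loop body once per non-NULL entry ... */
            xa_for_each(&my_array, index, entry)
                    pr_info("index %lu holds %p\n", index, entry);

            /* ... or only for the entries with a given mark set. */
            xa_for_each_marked(&my_array, index, entry, XA_MARK_0)
                    pr_info("index %lu is marked\n", index);

            /* Erase; for a plain (non-allocating) array this is the same
             * as storing NULL. */
            xa_erase(&my_array, 5);
            return 0;
    }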
Speaking of allocations: as well as a store operation, where you say "here is the pointer I want to put at this index", there is store-with-allocation, where you say "here is a pointer, please find a NULL entry to put it in", and it tells you where it put it.

I said earlier that storing NULL is the same thing as doing an erase; that is not true with an allocating XArray, where it is otherwise very convenient to be able to just store a NULL. We actually have users which depend on the distinction: storing NULL keeps the entry allocated, so somebody else calling xa_alloc will not be able to allocate that index — it remains reserved, just as if you had used xa_reserve, which I mentioned on the previous slide. xa_erase, by contrast, will both set the pointer to NULL and make the index available for reallocation, and that's what most people want anyway, so it works well for the users we have.
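As a minimal sketch (not from the talk) of that allocating behaviour — the xa_alloc() signature shown is the later one that takes an xa_limit argument, and the "NULL keeps the index" semantics are as described in the talk, so check the documentation of the kernel you actually target:

    #include <linux/xarray.h>

    static DEFINE_XARRAY_ALLOC(ctx_array);  /* allocating XArray */

    static int ctx_register(void *ctx, u32 *out_id)
    {
            /* Find an unused index for ctx and report it through *out_id;
             * xa_limit_32b allows any index that fits in 32 bits. */
            return xa_alloc(&ctx_array, out_id, ctx, xa_limit_32b, GFP_KERNEL);
    }

    static void ctx_unregister(u32 id, bool keep_reserved)
    {
            if (keep_reserved)
                    /* Storing NULL in an allocating array leaves the index
                     * allocated, so xa_alloc() will not hand it out again. */
                    xa_store(&ctx_array, id, NULL, GFP_KERNEL);
            else
                    /* xa_erase() clears the entry and frees the index for
                     * reallocation - what most users want. */
                    xa_erase(&ctx_array, id);
    }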
Now, the XArray was developed as a replacement for the radix tree, and so I have converted all of the existing radix tree users to this new API — it's all in that git tree up there. I have not yet submitted a pull request for it, and in fact I never intend to submit a pull request for this branch: I'm going to submit the conversions to the individual maintainers and get them to look them over, because with something around 50 users of the radix tree API I've probably made a mistake or two, and I want the maintainers to be responsible for checking my work.

We have another data structure in the kernel, the IDR, which is where the whole allocation API comes from. The IDR has rather more users — something like 200 — and I haven't converted all of them yet, but I've converted a large enough sample to be convinced that the current API is good enough for them.

Another thing you can use it for today — and should — is replacing custom implementations of resizing arrays. A fairly common pattern you see in device drivers is people inventing their own resizing arrays. I just submitted a patch for the AIO code, which has its own: you start out with one context, and when you allocate a second context it doubles in size, then doubles again to four, then to eight. Really it's just a resizing array of pointers, and that is exactly what the XArray is. You can also replace some linked lists, which is kind of fun; I touched on that a little more in the Plumbers talk, and I won't go into it now because I want to get to the meat of this presentation.

So here is the interesting part: here is where it is not yet the one data structure to rule them all. If you try to use it for what I call sparse arrays — arrays with very sparsely distributed indices — it will not work well. If you try to replace a hash table, which again is like having a very sparse array, it won't work well. We have an API, which I didn't even put on the slides because I'm a little ashamed of it, for storing a single pointer across a range of indices, so that looking up any index between a and b returns the same pointer; the implementation is not good and the API isn't either, so I need to do another round of refinement on it before we add more users. We have a number of users of red-black trees in the kernel — don't try to use the XArray instead of an rbtree yet, unless you're very certain it will work well. And there's the file descriptor table, for reasons I will hopefully have time to get into later.

So why not? What goes wrong if somebody else is choosing your array indices for you, in order to force maximum damage? The radix tree data structure that the XArray currently uses as its backend really does not handle that well. Here's an example: storing a pointer at an index which is its own address. That is actually a fairly sensible thing to want to do — if you just want a list of all your pointers, then instead of inventing somewhere to store each one, you store it at its own address, and then you can iterate over all of them and you've got your list. Except that the radix tree really loses its mind when you do that, because the index is implied by the position of the pointer in the tree, so in order to store at an address-sized index — something up near the top of the 64-bit space — it has to allocate nodes all the way down to the bottom before it can store the pointer in one place. You end up with an incredibly deep tree with one pointer in it, having allocated about twelve kilobytes of memory to store a single pointer, and looking it up walks down the whole thing six bits at a time. This is not good; this is quite the drawback.
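A sketch (not from the talk, with hypothetical helper names) of that "list of all pointers" pattern — convenient for the caller, pathological for the current radix tree backend:

    #include <linux/xarray.h>

    static DEFINE_XARRAY(ptr_set);

    /* "Set of all my pointers": use each pointer's own address as its index.
     * The radix tree has to allocate a node for every six-bit chunk of the
     * huge index - on the order of 12KB of nodes to hold one pointer. */
    static int ptr_set_add(void *p)
    {
            return xa_err(xa_store(&ptr_set, (unsigned long)p, p, GFP_KERNEL));
    }

    static void ptr_set_dump(void)
    {
            unsigned long index;
            void *entry;

            xa_for_each(&ptr_set, index, entry)
                    pr_info("have pointer %p\n", entry);
    }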
And as I alluded to earlier, some people receive the index they're storing at from another machine. Maybe that machine has a different endianness — it isn't actually malicious, it's not trying to harm you — but it is nevertheless giving you an index which doesn't work well for your machine's endianness. There are a few examples of that scattered through network filesystems and network protocols: anything where somebody else hands you the index you're going to look things up with.

OK, so I'm pretty happy at this point with the XArray API, so it's time to look at what we do about the backend. As I said, we already have a red-black tree in the kernel. Red-black trees are not very cache efficient: at each level you make a choice, left or right, so if you've got a thousand entries your red-black tree is about ten levels deep, and every level you walk is basically a cache miss. It's like using a doubly linked list — or, as I sometimes say to annoy computer theorists, a trebly linked list, because each node links to its parent and to its left and right children. The interface to the red-black tree is also really hard to use. If we used it as the backend for the XArray it wouldn't be so hard to use, but right now converting something from the radix tree to the red-black tree involves writing a lot of code, because for each type of data you store you have to write your own search functions — and in particular we store struct page in the radix tree all over the place, and the last thing you want to be doing is talking to people about growing struct page; that is not a popular thing to suggest. On top of that, the red-black tree isn't RCU-safe, so you have to take some kind of lock every time you walk it. One of the big advantages of the radix tree is that it is RCU-safe, which is very important for the page cache — you really don't want to run the page cache on a modern machine without RCU. So the red-black tree was out.

On to the B-tree. We have a B-tree implementation inside the kernel. It's not as efficient for densely packed indices as the radix tree — about half as efficient, in fact, because for each entry you store not only the pointer but also the index at which that pointer sits. It doesn't support ranges, and the whole point behind my starting the XArray work, and the other work I've been doing on the page cache, is that we want to support huge pages in the page cache. What you want for a huge page in the page cache is to say that anything between index 512 and 1023 gets the same pointer, and the B-tree doesn't support that. The implementation inside the kernel isn't RCU-safe either, so I rapidly discarded the B-tree code we already have.

So my colleague Liam Howlett and I came up with something we're calling the maple tree. The reason it's called the maple tree is that we're both Canadians — very patriotic people. The maple tree is an adaptive tree, so we have several different types of node. A range node is probably what you'd find at the top of the tree. In a range node you have eight pointers and seven pivots, and you scan through this seven-entry array of pivots to see where the index you're looking for falls, then follow the corresponding pointer. You may be thinking: hang on, seven pivots and eight pointers, how does that work? Each node has an implicit minimum and maximum, so you really have nine indices to look at and eight pointers that fall between them; you don't need to store the minimum and maximum because either they were stored in the parent or, at the head of the tree, they are implicitly zero and ULONG_MAX.

Now, having only eight pointers per level of the tree doesn't sound like the most efficient way to do this, but when you get down to a leaf you can use a dense node, which just stores fifteen pointers. If you're looking at something like the allocating-array case, where the first entry goes at index zero, the second at one, and you keep going two, three, four, five, you only need dense nodes at the leaves — and it's actually sometimes more efficient than the radix tree was, simply because the nodes are smaller and there's less wasted space per node.
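Purely as an illustration of that lookup (not from the talk, and not the real maple tree node layout, which was still in flux at the time), this is the shape of the search through a range node with seven pivots and eight slots:

    #define RANGE_SLOTS     8
    #define RANGE_PIVOTS    (RANGE_SLOTS - 1)

    /* Hypothetical range node: pivot[i] is the highest index covered by
     * slot[i]; the node's own minimum and maximum are implicit (known from
     * the parent, or 0 and ULONG_MAX at the root). */
    struct range_node {
            unsigned long pivot[RANGE_PIVOTS];
            void *slot[RANGE_SLOTS];
    };

    /* Return the child (or entry) whose range contains @index. */
    static void *range_node_descend(struct range_node *node, unsigned long index)
    {
            int i;

            for (i = 0; i < RANGE_PIVOTS; i++)
                    if (index <= node->pivot[i])
                            return node->slot[i];
            return node->slot[RANGE_SLOTS - 1];  /* pivot[6]+1 .. implicit max */
    }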
An important aspect of Linux memory management is that it's based on pages. Yes, we use the slab allocator, but the slab allocator splits up pages, so the question is how many nodes you can allocate per page. With the radix tree you get seven nodes per page — I'm talking about 64-bit systems here; it's a different calculation on 32-bit systems, but for the sake of keeping this brief I'll stick to 64-bit. A radix tree node has 64 entries at each level, so you multiply 64 by 8 bytes and then add an extra 64 bytes, because there's some metadata associated with each level, including things like a parent pointer and the shift. We can cut most of that out for the maple tree, and we end up with the calculation on the slide: using just dense nodes, you get 448 entries per page with the radix tree and 480 entries per page with the maple tree.
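Spelling that comparison out (not from the talk), assuming 4 KiB pages and 8-byte pointers:

    radix tree node:  64 slots * 8 bytes + 64 bytes of metadata = 576 bytes
                      4096 / 576                -> 7 nodes per page
                      7 nodes * 64 entries      -> 448 entries per page

    maple tree node:  128 bytes
                      4096 / 128                -> 32 nodes per page
                      32 nodes * 15 entries     -> 480 entries per page (dense nodes)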
So that's good — roughly a 10% improvement, which is not inconsiderable. You do end up spending more nodes managing your other nodes, so you lose that 10% advantage fairly rapidly, but it stays comparable. And the nice thing is that the pathological example I showed you earlier — storing a pointer at an index which is its own address — now fits in a single node, because you can say "all of the entries up to this index are NULL, this index has the pointer, and everything after it is NULL as well", and all of that information fits into one node. That's pretty good.

Now, making this RCU-safe is tricky. I'm really glad Liam and I are working on it together, because it turns out that reasoning about when you can reuse parts of a node is very, very hard. We had some very complicated code that we ended up having to throw away after having written it, because we had not taken into account that two threads might move at different speeds — one thread might take an interrupt while another continues executing at full speed — and we managed to come up with scenarios where a reader protected by RCU could see data which was never stored at a particular index. That is unforgivable: you may see the earlier value at that index or the subsequent value, but you may not see something which never existed — a pointer which was never stored at that index is simply not allowed. So we threw all that code away, and we end up just having to allocate and copy a lot of nodes, and that is what has driven us to keep the nodes so small. The computer theoreticians will tell you that the larger the branching factor at each level, the more efficient your tree is, and they're absolutely right — but if you have to copy a kilobyte every time you do an insert that isn't an append into blank space at the end, if you're inserting into the middle, it will kill you. So we've kept the nodes very small: just 128 bytes.

In contrast to the B-tree literature, which tells you to keep your nodes balanced at all times, we say no: we do not feel the need to balance our nodes when the indices are so dense that there's no advantage to it. Take the example of filling up from zero. You allocate your first fifteen entries into a dense node, then allocate number 16 into a new dense node, and you have a node above them which says "0 to 15 over here, 16 down there". That's an unbalanced tree — a balanced tree would have 0 to 8 over here and 9 to 16 down there — but why would you choose to balance the tree just to satisfy the theoretical basis behind the B-tree when there's actually no advantage to it? In fact there are disadvantages, because now when you fill up the second node — 9 plus 15, so when you've done number 24 — you've got to go back and rebalance into the earlier node, or grow the node again. This is actually noted as a problem in some of the B-tree papers, and the recommended solutions are things like: when you're doing the initial load of the database into the B-tree, build it unbalanced and then do one grand rebalancing operation at the end. We don't have the luxury of saying "oh, this is just the initialisation phase", so we just skip that whole aspect of balancing: if it is clear that there is nowhere else to insert to the left, we leave it alone and start allocating to the right; we don't move entries over from the left nodes to the right nodes just to keep things balanced.

Supporting marks in the maple tree is a little bit trickier. The way this is done in the radix tree is that each node — you may remember I said each node has 64 bytes of metadata — well, 24 of those bytes are the search marks, and that's a lot. So we've started looking at how we can support marks in the maple tree. We're pretty committed to keeping the node at 128 bytes, and eight of those bytes are already allocated to the parent pointer, so it seems reasonable to start decreasing the number of pointers available at each level if you choose to have search marks. Right now with the XArray you get the marks whether you use them or not — they're part of the basic operation of the radix tree. What we're going to do instead is make users say, when they declare their XArray, "I'm going to use two marks" or "I'm going to use five marks", or however many they want, and we will then simply decrease the number of pointers at each level until there's enough space to store the number of marks they've requested. This actually solves a problem for a lot of people, because I've had requests from ext4, btrfs and XFS all wanting more search marks. I ask "how many?", and they say "well, four would be nice, five would be nice, but the thing is, when you give me five I'm definitely going to ask for six". So I said "how about 18 as a maximum?", and they said "yeah, that'll probably do us for a while". I don't know how much they're going to enjoy the performance hit of somewhat deeper trees, but that is now their problem and not my problem, so I'm happy about that.

The last bullet here: while we're working on this we obviously have the existing users of the XArray in mind, and we want to be sure we support them well, but we also want to support some new users. One of the current pain points in Linux is mmap_sem. What mmap_sem protects is the VMA tree: every time you call mmap you get a new VMA, and we keep those VMAs in a red-black tree, and as I've said, there's no way to walk a red-black tree under RCU, so you have to take the mmap semaphore in read mode for every page fault. If you have a very multi-threaded application with threads running across all your CPUs, all taking lots of page faults, they are all hitting the one cache line that contains the semaphore in order to take it in read mode, so it just bounces around between CPUs, each one making it dirty. That is not an efficient way of doing things. If we can walk the VMA tree using RCU, we get a fairly large win.
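The pattern being described looks roughly like this (a sketch, not from the talk; mmap_sem has since been renamed mmap_lock in later kernels):

    #include <linux/mm.h>

    /* Every page fault, among many other paths, does something like this
     * just to walk the rbtree of VMAs.  The semaphore is the cache line
     * that bounces between CPUs on multi-threaded workloads. */
    static struct vm_area_struct *lookup_vma(struct mm_struct *mm,
                                             unsigned long address)
    {
            struct vm_area_struct *vma;

            down_read(&mm->mmap_sem);
            vma = find_vma(mm, address);
            up_read(&mm->mmap_sem);

            /* Illustrative only: real code must not use the VMA after
             * dropping the semaphore. */
            return vma;
    }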
But the red-black tree does have support for efficient allocation of free space, and we need that: when you call mmap we have to search the VMA tree and find a hole in your address space of sufficient size for the mapping you've asked for, and a linear scan is not really recommended — if you've got a thousand VMAs you don't want to go looking through all of them to find a gap. So what we're going to do, in addition to supporting search marks, is also make it possible to record, alongside the pointers, what the largest gap below each of them is. That's going to be an optional feature — it's not something most users will want — because again it reduces the number of branches at each level, which leads to a deeper tree. It's all about trade-offs.

So is this the one true data structure? To my chagrin, no. There are still places where users want to have overlapping ranges, and that's not the metaphor I've gone for with the XArray: I've presented an API in which, for each index, there is one pointer. Think about things like file locks — there are some other good examples too — where they want to say "there are three ranges which match the search key you're looking for".

I mentioned that, because we're RCU-based, we often have to allocate and free nodes on an insert, and the cost of doing that may be too high for some users. So we're talking about maybe having a per-instance flag on the XArray that says "I don't do lookups under RCU". That makes inserts run faster, but it makes searches slower, because every search then has to take the spinlock — and for some users that's the right thing; some users don't need RCU in order to search the array.

Some users can't tolerate memory allocation on insert at all. With the red-black tree you don't need to allocate memory on every insert, because the rb_node is embedded within your own data structure, and some code is simply not written to tolerate a memory allocation at that point — if you're doing a GFP_ATOMIC allocation, it can fail. It's pretty rare, but it can happen.
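A sketch (not from the talk, with made-up structure and function names) of that contrast: with an rbtree the linkage is embedded in the caller's own structure, so insertion never allocates, whereas an XArray store may have to allocate internal nodes, and under GFP_ATOMIC that allocation can fail:

    #include <linux/rbtree.h>
    #include <linux/xarray.h>

    struct my_obj {
            struct rb_node node;   /* embedded: rbtree insertion allocates nothing */
            unsigned long key;
    };

    static DEFINE_XARRAY(objs);

    static int add_obj_atomic(struct my_obj *obj)
    {
            /* The XArray may need to allocate tree nodes to hold this entry;
             * with GFP_ATOMIC that allocation can fail, so the caller must
             * be prepared to handle an error here. */
            return xa_err(xa_store(&objs, obj->key, obj, GFP_ATOMIC));
    }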
Then, apparently, there are some people who store things other than pointers in arrays — you can have an array of things that aren't pointers. I don't know why you would do that, but it's something some people do, so we're also talking about maybe coming up with another API so that you can store, say, a 16-bit int instead of a pointer. That's getting ahead of ourselves, but we are considering it. And of course there are the people who want 64-bit indices — this is generally filesystem people: "I've booted this 32-bit kernel and I want to address exabytes of storage, so I want to look things up with 64-bit indices." I understand where you're coming from, but I don't want to write that code yet. These are things which are on my wish list.

So, with ten minutes to go: what have I glossed over that you would like me to explain better?

Audience: Lots of cool things — but have you looked at Judy arrays? I'm assuming you chose not to use them, so why not?

Wonderful question, thank you. There's a bit of personal history here: I actually sat across the cubicle wall from Doug Baskins when he was writing the Judy array at HP, back in — I guess that was 2000, 2001. Judy arrays are really cool. They are also radix trees, but implemented in a much more clever fashion than the radix trees we use in the kernel today, and they're adaptive. They're really good for storing strings, because each level uses eight bits instead of the six bits that we do, so with each level being a character you can index your way through a string really efficiently. The problem is that they don't compress very well. The radix tree that we have compresses itself, so if you're only storing very small indices you don't allocate very many nodes; the problem comes when you allocate really large nodes. With the Judy array, at each level you have to work out whether you're staying or going and do various comparisons. It's not the general-purpose data structure that we need, and it doesn't really have support for ranges in the way that we need it to. I think that's the key thing about the maple tree: it was designed to be range-based, because that was what we were missing. We really did focus on the VMA tree as our prototypical user; we actually started out creating a completely separate interface that was just for ranges, and it was only after we had got a little way into development that we realised it would also work really well for single ranges of length one. So I hope that answers your question. I'm not a huge expert on Judy arrays — there wasn't a whole lot of osmosis across that cube wall — but I'm a big fan of the Judy array idea. The implementation Doug came up with was a little over-complicated; somebody has since written a much better implementation that sits in user space, and I've read it over and I really like what they've done with it, but I think the maple tree is probably a better choice for us. That was a great question, thank you.

Audience (Tobin): You said that if there's no space you don't rebalance the tree. Does that mean that if you fill up from zero, then on the first delete you have to rebalance?

No. You never rebalance on a read — ever — because in order to rebalance you have to allocate memory, and on a read you're probably under the RCU lock, so you can't modify the tree at all. And you don't rebalance just to make the tree balanced unless it's possible to improve the width or the depth of the tree. You might rebalance in order to be able to delete one of the nodes, but you wouldn't rebalance if the number of nodes would stay the same and the pointers would just be distributed slightly differently between them. You try to keep the left side as heavy as possible, so that you can keep allocating out to the right as far as possible without doing any more rebalancing. Does that make sense?

Audience: When you need to split a range node, how do you decide whether to allocate at the same level or one level below?

OK, so a range node at some point in the tree is full and you need to do an insert into it. At that point you ascend to its parent and you split the parent. Now, where do you choose the pivot for that split? That's the key thing, and it comes back to what I was saying about the density of the indices.
If we can see there's an advantage to be gained by splitting in the middle, we'll split in the middle, but generally we try to keep the left as heavy as possible without putting ourselves in a situation where we would need to rebalance, and without creating what I call a spindly tree — a tree that no longer satisfies the basic properties of a B-tree. The basic property of a B-tree is that you're within a certain percentage of an optimal tree; we play pretty fast and loose with that rule, but we don't rebalance just for the sake of being balanced.

Audience: So when you're choosing the pivot in a parent node you might have to shift some of the nodes at the level below from left to right. How do you decide how many to shift?

We look at what the maximum index could be in that left node. If you've got one dense node, there is no point in ever splitting it at less than fifteen entries — you'd just be wasting space — so we would happily have 0 to 15 in one node and then 16 to 17 in the next node. That can create trees which look spindly, and then you go and analyse it and realise there was actually no way to have done it better: it could look different, it could look less spindly, but it would be no more efficient — at least not in any example I've been able to come up with. Now, I could be wrong, there could be bugs, and we may need to change how we do things, but we'll look at that. It's definitely better than the radix tree at this point; we haven't come up with any example where we do worse than the radix tree.

Audience: Some of the words you've used to describe the efficiency and the way you make those decisions seem a little bit rubbery, I guess is the word. Have you got any tests that actually quantify those results? Is there a test suite that analyses what kind of tree comes out at the end?

Yes, we have a test suite. The XArray has a test suite which at this point is mostly focused on being a regression suite — we have had bugs in the past, both in the radix tree and in the XArray rewrite of it, and the tests are very focused on making sure those bugs don't recur. Liam and I aren't done yet: the git tree is public, but we're not publicising it — that's why there's no link to it on the slides — because we're not ready for people to dive in and help at this point, and there are a lot of things still in flux. But one of the things we are definitely doing is adding tests to the test suite that check the tree which comes out is not inefficient.

Well, I think that's time, so thank you very much. That brings us to the end of the LCA 2019 kernel miniconf. Please thank all of our speakers, as well as all of our wonderful volunteers on the A/V team and the room monitors who have been helping us put this day together — and thank you all for coming. Enjoy the rest of the conference.
Info
Channel: linux.conf.au
Views: 17,047
Keywords: lca, lca2019, linux.conf.au, linux, foss, opensource, MatthewWilcox
Id: -Bw-HWcrnss
Length: 45min 33sec (2733 seconds)
Published: Tue Jan 22 2019