Alibaba: Using SPDK in Production

Captions
Hi everyone, I'm from Alibaba. My colleague will present the second half of these slides, so I will just cover the first part. In this talk we cover a few topics: why we chose SPDK in our production environment, particularly to provide a high-performance storage service in Alibaba's real deployments; some of the problems we have met while using SPDK in that environment; and the solutions we have built to resolve them. We also hope to get feedback from the community, and everyone is welcome to comment on what we have done. At the end we will talk about our future plans with the SPDK community.

Here is the outline. First, why we chose SPDK. Performance was the first consideration. As everyone knows, storage devices keep getting faster, offering lower latency and higher bandwidth to applications, so making the storage I/O stack more efficient and more lightweight is critical to exploiting these high-performance devices. That is why we chose SPDK: it provides a very efficient user-space implementation, a very low-latency I/O stack, and a flexible, extensible software stack that we can customize for our applications' requirements.

The second consideration is how easy it is to extend and customize. Because SPDK runs in user space, it gives us a lot of room to adapt the I/O stack to our specific usage model. For example, consider quality of service: we run different applications and services in the upper layers, and they have different performance requirements. Some are I/O intensive and need large bandwidth for their load-critical tasks; others are latency sensitive. For these different requirements we may want to add a specific I/O scheduler or QoS design on top of the I/O stack, and doing that in user space is much easier.

The last consideration is that SPDK has very strong community support. There are a lot of experts out there, so when we hit problems it is very convenient to get feedback or comments from the community. Those are the three basic reasons we chose SPDK.

In this slide I talk about some of the problems we have met when we adopted SPDK to deploy the storage service for our applications. One critical problem, which I think the previous speaker from Oracle also mentioned, is the limited number of NVMe queue pairs on the device side. In our environment we deploy a large number of applications on a single server, and even on a single device; the applications run in containers and other runtime environments and can move from device to device, so we cannot statically allocate a fixed number of NVMe queue pairs to each process. First, there simply aren't that many queue pairs to allocate: a typical NVMe device may only provide, say, 32 queue pairs, which limits us to running only about 32 application instances on the same device. If we want more applications to share the same device, one possible approach is to share the limited queue pairs between applications, but that introduces another problem: lock contention. SPDK is designed to be lock-free, so when an application submits I/O it does not need to lock the queue or contend with other threads or running processes. Even more important, once you share these critical data structures among processes, if one process takes the lock while submitting I/O and then crashes, the queue can no longer be accessed by any other process, because the lock itself lives in shared memory.
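To make the queue pair and locking discussion concrete, here is a minimal sketch of the normal SPDK NVMe usage pattern: each thread or process allocates its own I/O queue pair with spdk_nvme_ctrlr_alloc_io_qpair() and then submits and polls on it with no locking at all. This is generic upstream-API usage for illustration, not Alibaba's code, and error handling is trimmed down.

    /* Minimal sketch of SPDK's lock-free NVMe model (illustration only, not
     * Alibaba's code): each thread or process owns a private I/O queue pair
     * and polls it itself, so no locking is needed -- which is exactly why a
     * queue pair cannot simply be shared between processes without adding a
     * lock that would have to live in shared memory. */
    #include <stdbool.h>
    #include "spdk/nvme.h"
    #include "spdk/env.h"

    static void
    read_done(void *arg, const struct spdk_nvme_cpl *cpl)
    {
            (void)cpl;
            *(bool *)arg = true;
    }

    /* Runs on one worker thread with an already-attached controller/namespace. */
    static int
    per_thread_io(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
    {
            /* One qpair per thread: this is the resource a ~32-qpair device
             * runs out of when many application instances share it. */
            struct spdk_nvme_qpair *qpair =
                    spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
            if (qpair == NULL) {
                    return -1;              /* no free queue pairs left */
            }

            void *buf = spdk_dma_zmalloc(4096, 4096, NULL);
            if (buf == NULL) {
                    spdk_nvme_ctrlr_free_io_qpair(qpair);
                    return -1;
            }

            /* Submit and complete on the same private qpair -- no locks. */
            bool done = false;
            if (spdk_nvme_ns_cmd_read(ns, qpair, buf, 0 /* LBA */, 1 /* blocks */,
                                      read_done, &done, 0) == 0) {
                    while (!done) {
                            spdk_nvme_qpair_process_completions(qpair, 0);
                    }
            }

            spdk_dma_free(buf);
            spdk_nvme_ctrlr_free_io_qpair(qpair);
            return 0;
    }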
So we looked at whether there is an existing approach in SPDK we could reuse, and SPDK's architecture provides something very useful here: the vhost architecture. It adds a virtualization layer above the physical devices, so applications only talk to virtio devices, while the vhost target manages the actual physical device resources such as memory, queue pairs and so on. This looks like a good fit: an application just creates its virtqueues, which on the virtual device side are implemented as simple ring buffers, and the vhost target maps them onto physical queue pairs; we can allocate one queue pair per virtual device, or multiple queue pairs backing multiple virtio bdevs. However, the current implementation of the vhost-scsi target does not allow multiple connections to the same target, which means that if we want one vhost target shared by multiple applications, the existing implementation cannot be used. Moreover, when the vhost process crashes it has to restart, and the virtio devices have to reconnect to it, which today also requires restarting the application side; that is not acceptable in our usage model.

To overcome these issues we have made several patches so that vhost is usable in our real production environment. The first patch supports multi-process access to the same vhost target; with that we no longer hit the queue pair limit, and multiple processes can share the same target. The second patch adds an auto-recovery mechanism on the virtio-scsi initiator side: once vhost crashes we can detect it, and when vhost restarts we re-establish the connection, rebuild the virtio-side queues, and resend the pending I/Os, all transparently to the applications.

For the first patch we made these changes to the vhost target code. We changed the target's vdev member from a single device pointer into an array of vdev devices, so each target can manage multiple connections instead of just one. Then on the vhost side, the I/O worker and the main worker, when they handle I/O and poll the admin and request queues, loop through all the connected vdev devices and service each of them. When any virtio device is removed from the vhost target, we monitor that event and remove the disconnected device on the vhost side.
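The actual vhost target data structures differ between SPDK releases, so what follows is only a rough sketch of the first patch's idea, using made-up names (vhost_target, vhost_session, poll_session and so on) rather than the real SPDK types: the target keeps an array of connected device sessions instead of a single vdev pointer, the workers loop over every slot that is in use, and hot removal just clears a slot without disturbing the other connections.

    /* Hypothetical sketch of "multiple connections per vhost target".
     * These types and functions are illustrative only; they are not the
     * real SPDK vhost structures. */
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_SESSIONS 64

    struct vhost_session {
            bool  in_use;          /* slot holds a connected virtio device */
            void *vdev;            /* per-connection virtio device state   */
    };

    struct vhost_target {
            /* Before the patch this was a single vdev pointer; as an array,
             * many application processes can connect to one target. */
            struct vhost_session sessions[MAX_SESSIONS];
    };

    /* Per-session processing stub, standing in for the real work the vhost
     * I/O worker does (draining the request virtqueue, completing I/O). */
    static void
    poll_session(struct vhost_session *s)
    {
            (void)s;
    }

    /* The I/O worker's poll function: walk every connected session. */
    static void
    vhost_target_poll(struct vhost_target *tgt)
    {
            for (size_t i = 0; i < MAX_SESSIONS; i++) {
                    if (tgt->sessions[i].in_use) {
                            poll_session(&tgt->sessions[i]);
                    }
            }
    }

    /* Hot-remove handling: when a virtio device disconnects, free only its
     * slot so the other applications keep running undisturbed. */
    static void
    vhost_target_remove(struct vhost_target *tgt, size_t idx)
    {
            tgt->sessions[idx].in_use = false;
            tgt->sessions[idx].vdev = NULL;
    }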
For the second patch, we built an auto-recovery mechanism on the virtio side. This flow chart describes the logic. When the polling thread on the virtio initiator side polls for completion events, it also checks whether any I/O timeout has occurred on the pending I/Os. If a timeout is found, we check whether vhost is still alive by sending a heartbeat to the vhost monitor socket. The vhost process is protected as a systemd service (we register it as a daemon), so if vhost crashes, systemd restarts the vhost process. Once the vhost process has restarted, the initiator side notices that vhost is alive again and reconnects to it, and when the reconnection is done we resend all the pending I/Os. All of this is transparent to the applications; it happens inside the virtio initiator and bdev code. I think my colleague can cover the rest of the slides.

OK, so, actually I feel tired, I feel sleepy, and I guess you are the same as me, so just wake up: I only have two slides to talk about. Before I start, I first want to say thanks: at the very beginning of this project we had a lot of discussions with Ben, and he suggested the virtio and vhost architecture, and it actually works pretty well in our production system.

We run our product on Alibaba Cloud, so we care a lot about instance start time. Our target is to schedule an instance and start the application in less than 0.1 second, 100 milliseconds; right now we are at about 0.3 second. Here is an example. In my test system I use 10 GB of huge pages, and every application uses 256 MB of huge pages. With the original code, the current upstream code, it would scan all of the free huge pages even though the application only needs about 256 MB, and scanning all of them takes a long time: in this example about 7.5 seconds. So we made two changes, one to DPDK and one to virtio-scsi. The first one is the huge page scanning: we added a new memory parameter to DPDK so that it only scans as many huge pages as specified by the parameter. With this change the application start time dropped from 7.5 seconds to 1.8 seconds. The other change is to virtio-scsi. By default, when a virtio initiator connects to the vhost target, it scans the number of targets given by a hard-coded macro, 64, so it tries to scan 64 targets, and that sometimes adds more than a second. We added a new parameter to the virtio-scsi configuration file that specifies which target to scan; for example, if it is set to 0, this virtio-scsi initiator only scans target 0, because in our deployment one initiator only connects to one vhost target. Here is the patch. So now we get a 0.3 second application boot time, but our target is 0.1. Actually, last week I also heard that the latest DPDK has a new feature: with the latest DPDK code it will not pre-allocate all the huge pages up front, it allocates the memory you need on the fly. Maybe we will look at it and try to adopt it in our system, and with that we may be able to reach the 0.1 second target later.
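The exact parameter added to DPDK is not spelled out in the talk, so as a rough analogue here is how an upstream SPDK application can cap the hugepage memory it reserves at startup through spdk_app_opts (the -s command-line option), which keeps initialization from claiming far more memory than the application needs. The app name and callback here are made up, and the spdk_app_opts_init()/spdk_app_start() signatures vary slightly between SPDK releases, so treat this as a sketch.

    /* Rough analogue of the start-time fix, using upstream SPDK's application
     * framework rather than Alibaba's private DPDK patch: cap the hugepage
     * memory the process reserves so initialization does not touch 10 GB of
     * huge pages when the app only needs ~256 MB. */
    #include "spdk/event.h"

    static void
    app_start(void *arg)
    {
            (void)arg;
            /* bdev / virtio-scsi setup would go here */
            spdk_app_stop(0);
    }

    int
    main(int argc, char **argv)
    {
            (void)argc;
            (void)argv;

            struct spdk_app_opts opts;

            spdk_app_opts_init(&opts, sizeof(opts));
            opts.name = "alibaba_app";      /* hypothetical app name */
            opts.mem_size = 256;            /* reserve only 256 MB of hugepages */

            int rc = spdk_app_start(&opts, app_start, NULL);
            spdk_app_fini();
            return rc;
    }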
OK, the other thing we modified is the reactor thread, but only on the initiator side; it is kind of hard to change it on the server, the vhost side. On the initiator side we deploy tens of application instances on one single machine, probably fifty or sixty instances, so in our application we cannot afford the busy-polling reactor thread. We borrowed the code from SPDK and, with some changes, implemented our own sleepable reactor thread on the application side. The flow is actually quite straightforward. When the application submits a request, we have a counter that records how many requests are in flight, and then the request goes to the main reactor thread. One more thing: we first check whether the reactor is asleep, and if it is sleeping we wake it up using a pthread condition variable. In the reactor thread's main loop we check whether the counter is zero; if it is zero, it means we have handled all the requests, so the reactor can go to sleep with a pthread condition wait. If not, the main loop continues to handle the requests.

Now, in SPDK there is a parameter called max delay, but it does not work for us, because once the initiator is connected to the vhost side there is always at least one active poller, so even if you specify a max delay it still keeps polling and the CPU stays 100% busy. That is why we cannot just use the max delay parameter. I did not send this patch out, because, as I said, I borrowed the code from SPDK and we run it inside our own application; later maybe we can try to integrate this change back into SPDK, I mean, if the community thinks it is a good thing, we could try to do that.
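The sleepable reactor code itself was not upstreamed, so here is only a small sketch of the mechanism as described: an in-flight counter protected by a mutex, a condition variable the submit path signals, and a reactor loop that polls while work is pending and blocks once the counter reaches zero. All names here (reactor_notify_submit, run_pollers_once, and so on) are made up for illustration; they are not Alibaba's or SPDK's actual code.

    /* Sketch of the sleepable reactor described above (illustrative only). */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  g_wake = PTHREAD_COND_INITIALIZER;
    static unsigned        g_inflight;      /* requests not yet completed */
    static bool            g_running = true;

    /* Called by the application when it submits an I/O request. */
    void
    reactor_notify_submit(void)
    {
            pthread_mutex_lock(&g_lock);
            if (g_inflight++ == 0) {
                    pthread_cond_signal(&g_wake);   /* reactor may be asleep */
            }
            pthread_mutex_unlock(&g_lock);
    }

    /* Called from the completion path for every finished request. */
    void
    reactor_notify_complete(void)
    {
            pthread_mutex_lock(&g_lock);
            g_inflight--;
            pthread_mutex_unlock(&g_lock);
    }

    /* Ask the reactor to exit; wakes it if it is sleeping. */
    void
    reactor_stop(void)
    {
            pthread_mutex_lock(&g_lock);
            g_running = false;
            pthread_cond_signal(&g_wake);
            pthread_mutex_unlock(&g_lock);
    }

    /* Placeholder for the real work: run pollers, process completions. */
    static void run_pollers_once(void) { }

    void *
    reactor_main(void *arg)
    {
            (void)arg;
            while (g_running) {
                    pthread_mutex_lock(&g_lock);
                    while (g_inflight == 0 && g_running) {
                            /* Nothing in flight: sleep instead of burning a core. */
                            pthread_cond_wait(&g_wake, &g_lock);
                    }
                    pthread_mutex_unlock(&g_lock);

                    run_pollers_once();
            }
            return NULL;
    }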
So here is our future plan. We would like to contribute this work back to the community; that is why we have already sent a number of patches out on the mailing list. Because those patches were based on an old SPDK 17.x release, we have a colleague helping with the review, and we need to port the patches to the latest master branch; then we will send them out for official review. We would like to contribute them so the community can review them, and probably somebody can also test them; then we can do better in our own production system and make our code more robust and more stable. So, thank you. Any questions?

Q: When you put your reactor to sleep with the condition variable, how do you determine when to wake it up? What is the signal mechanism to wake the reactor up again in an efficient manner?

A: Sorry, let me go back to this slide. In the main loop, when the counter drops to zero, it means there are no pending requests, all the requests have been handled, so now the reactor can go to sleep.

Q: Right, but then how does that thread get woken up again?

A: Oh, yes. That is why I said it is easier to do this on the initiator side: the application knows when it submits a request, so the application side wakes the reactor up. It is much harder to do this on the vhost target side, because the target does not know when the I/Os are going to come in, so there you always need to keep polling.

Q: You are using vhost-scsi; why aren't you using vhost-nvme? I noticed you are using NVMe devices in your implementation.

A: You are asking why we don't use vhost-nvme. A few things. First, last year when we started the project there was no vhost-nvme yet. Second, there is still no virtio-nvme library on the initiator side, even now. A third point is performance: for us, virtio-blk would probably perform better than virtio-scsi, but last year we thought virtio-scsi was more stable, and at the beginning of this project there was no virtio-blk bdev either. So actually we had no choice; the best choice was vhost-scsi.

Q: You had no choice.

A: Yes. Any other questions? All right, thank you guys.
Info
Channel: Storage Performance Development Kit
Views: 584
Rating: 5 out of 5
Keywords: Ali, alibaba, Alibaba, SPDK, spdk, vhost
Id: AbySTDtEH-o
Length: 24min 17sec (1457 seconds)
Published: Thu Jun 14 2018