Kubernetes Storage Lingo 101 - Saad Ali, Google (Beginner Skill Level)

Captions
My name is Saad Ali. I'm a software engineer at Google, the tech lead of the GKE storage team, and I work primarily on the Kubernetes project. I'm a co-lead of the storage special interest group.

Today the agenda is pretty simple. The storage subsystem, the volume subsystem, of Kubernetes is pretty complicated. There's a ton of terminology, and if you're not familiar with it, it can get pretty overwhelming. By a show of hands, how many of you know all of the words that are on this screen? There should be at least one person. How many of you know some of these words? Okay. And how many of you don't know any of these words? All right, cool. By the end of this talk I'm hoping that you'll be familiar with all of this. We have a lot of material to cover, so I'm not going to take questions during the presentation; save those until the end. Let's get started.

One of the most important principles of Kubernetes is the idea that we want to make workloads portable across clusters. It doesn't matter what type of cluster you're running on; the idea is that we want to abstract away the implementation from your actual workload. What this allows you as an end user to do is write your application once and deploy it anywhere. You can write your application on a cloud provider like GCE, on Amazon if you'd like, or on your own on-premise deployment. Regardless of where you are, the way that you deploy your application is going to be consistent. For example, you can use a ReplicaSet or a pod, and you use the same exact YAML, regardless of what cluster you're actually running on, to deploy your workload.

The challenge now is with stateful services. Pods and ReplicaSets abstract away compute and memory, but they don't really address stateful services. The problem with state is that containers are inherently ephemeral: there is no way to persist state inside a container. Once a container is terminated, everything that you've written inside the container is gone, and containers can't really share data across container boundaries. So if you have a couple of containers working together, which would be the case if you're familiar with the concept of a Kubernetes pod, you could have, for example, one container pulling some static content from somewhere and a web server serving up that static content. Without a way to share state, you have no way to share data between these two containers.

Now, storage is a very big, nasty concept. There are a lot of different types of storage systems: you have object stores like S3, or GCS on Google; you've got SQL databases; you've got NoSQL databases; you've got pub/sub systems; you've got time-series databases; and of course file and block storage, and even file on block. So how do we tackle all of this? The answer is that we can't tackle all of it. We have to go back to the core principle of Kubernetes, which is workload portability. What we decided is that we're going to focus on file and block storage and not on everything else, not on the object stores, the SQL databases, and so on. The reason for that is that the data path for both file and block has been standardized: your Linux operating system basically takes care of writing block and file for you, so your application doesn't need to be aware of how to do those things. Whereas for the rest of these data services, your application has to be aware of it; you have to have some sort of SDK built into your application that understands how to read and write from these different sources.
The problem is that there is no common standard for any of those things yet. Once there is, we can talk about abstracting it away within Kubernetes. Kubernetes doesn't build standard data paths, but once a standard data path exists, we can do some really cool stuff with it. So let's see what we did with file and block.

The Kubernetes way to abstract away file and block is the volume plugin. A volume plugin is just a way to reference either a block device or a mounted filesystem and make it available to all the containers inside a pod. A volume plugin basically specifies how to make that volume available inside a pod, and the medium that backs it. The lifecycle of any given volume could be the same as the lifecycle of the pod, or it could extend beyond the lifecycle of a pod and be persisted beyond the lifecycle of any individual pod.

This is the set of volume plugins that Kubernetes currently supports. I break it up into five big areas. One is remote storage; the idea here is that this is network-attached storage that persists beyond the lifecycle of any individual pod. Then you have ephemeral storage, which has its lifecycle tied to a given pod. Then you have local persistent volumes, which enable local storage to be used in a persistent manner. Then you have these new out-of-tree volume plugins. And finally you have host paths, which you should never use; if there's one thing you take away from this talk, it's don't use hostPath.

So let's dive into this a little bit. Ephemeral storage is basically temporary scratch space that's borrowed from the underlying host machine and exposed to all the containers inside a pod. You can think of it as just scratch space: if you need to share files between two containers, you set up an emptyDir, and any files written onto that scratch space are visible from all containers that make up that pod. These volumes can only be referenced inline, meaning your pod definition has to actually specify the volume type emptyDir; you can't use a PersistentVolume/PersistentVolumeClaim to reference empty directories, and I'll talk about what PVs and PVCs are in a little bit. So the basic volume plugin you need to be aware of here is emptyDir.

Let's take a look at what that actually looks like when you're deploying a pod. This is a basic pod definition that has two containers, container one and container two. Both of them mount a single volume, an emptyDir called "shared scratch space", into a mount path inside the container at /shared. If either of these containers writes into that path, it's visible to the other container. Fairly straightforward. The cool thing about emptyDir is that it maintains this principle of workload portability: if I move this pod YAML to any other cluster, it'll work in exactly the same way, regardless of the cluster that you're actually running on.
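As a rough illustration of the kind of manifest being described (the image, names, and commands here are illustrative, not taken from the slides), a two-container pod sharing an emptyDir might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: shared-scratch-example
spec:
  containers:
  - name: container-1
    image: busybox                 # illustrative image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: shared-scratch-space
      mountPath: /shared           # files written here are visible to both containers
  - name: container-2
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: shared-scratch-space
      mountPath: /shared
  volumes:
  - name: shared-scratch-space
    emptyDir: {}                   # ephemeral scratch space tied to the pod's lifecycle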
There is a set of volume plugins built on top of emptyDir, specifically secret volumes, config map volumes, and downward API volumes. Basically what they do is create an emptyDir and pre-populate it with some data, in this case data from the Kubernetes API. So, for example, a secret volume allows a secret in the Kubernetes API to be exposed as a file to your pod. The reason we made these volume plugins is, again, a Kubernetes principle: meet users where they are. The idea here is that we don't want folks to have to rewrite their applications to be Kubernetes-aware; we want your applications to just work on Kubernetes. So if you had an application that understood how to read some credentials from a file, we don't want you to have to modify it to call out to the Kubernetes API and fetch secrets. Instead, you can use a secret volume; the secret volume will fetch the secret on your behalf and mount it inside the container as a file, and your workload doesn't need to be changed at all. So that's ephemeral storage.

Now let's move on to remote storage. Remote storage is storage that exists beyond the lifecycle of any one pod. This allows data to be persisted so that you can actually have stateful services. Examples of these volumes include GCE PDs, Amazon EBS, iSCSI, NFS, and there are a lot more. These remote volume plugins can be referenced either inline or through a PersistentVolume/PersistentVolumeClaim object, and I'll talk about that in a little bit. The beauty of remote storage is that this is what enables Kubernetes to shuffle your workloads around, because it decouples your storage from your compute. The pod that is serving up your service can be terminated on any one node for any reason, either because the node goes bad or because there are too many other workloads running on that node and there isn't enough CPU or memory available, and that workload gets moved somewhere else. Remote storage allows the persistent state for that application to be made available regardless of where your pod actually gets scheduled.

Now let's take a look at what using it would look like. You create a pod; in this case I have a single container, just a busybox container, and when it starts it's simply going to sleep for a while. I have a volume, in this case a GCE persistent disk, and I specify the name of the disk and the filesystem I want on it. Then, inside my container, I specify where it should be mounted, in this case /data. Now, when this container is started, any data written to that path is persisted to this persistent disk, and if this pod is terminated on one machine and moved to another machine, that data comes along with it, because the data is independent of the pod and moves along with the storage. Kubernetes will automatically take care of attaching the volume to the correct node and mounting it to make it available inside the container, so all of the pipe work is taken care of by Kubernetes automatically.
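A sketch of the inline pattern just described, assuming a pre-existing GCE persistent disk (the pod name, image, and sleep duration are illustrative). Note that the cloud-specific disk is referenced directly in the pod spec, which is exactly the problem discussed next:

apiVersion: v1
kind: Pod
metadata:
  name: sleepy-pod
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "6000"]      # illustrative; the container just idles
    volumeMounts:
    - name: data
      mountPath: /data              # anything written here lands on the persistent disk
  volumes:
  - name: data
    gcePersistentDisk:              # in-tree GCE PD plugin, referenced inline (not portable)
      pdName: panda-disk            # name of a pre-existing GCE PD; illustrative
      fsType: ext4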
But don't do this. Do not reference volumes directly in your pod. The problem with this is: what about workload portability? We've talked about this principle again and again, and the problem with the YAML I just showed you is that it references a GCE persistent disk directly inside the pod YAML. If I were to move that pod YAML to a cluster running on Amazon, it wouldn't work, because there are no GCE persistent disks. If I were to move it on-prem, it wouldn't work, because there are no GCE persistent disks. So this pod YAML is no longer portable across clusters. How do we fix that?

Persistent volumes and persistent volume claims. I've been saying those words over and over again; this is the solution to workload portability for storage across clusters. It decouples storage implementation from storage consumption. The way that it works is that your cluster administrator can be aware of the storage that exists within the cluster, but your end user shouldn't have to be. What your cluster administrator can do is come along and create PersistentVolume objects within the Kubernetes API to represent volumes that can then be used by end users. Inside this object they define the actual storage that will be used. In this case I'm going to define two PersistentVolume objects referencing two different disks: one is panda-disk and the other is panda-disk-2; one is 10 gigabytes and the other is 100 gigabytes, and they're both GCE persistent disks. If I happened to have two different storage systems within my cluster, let's say GCE persistent disks and NFS, then one of these could have referenced an NFS share as well. So this is basically the cluster administrator going ahead and pre-provisioning these volumes to make them available for consumption.

Now, when somebody comes along and wants to use the storage, what they're going to do is create a PersistentVolumeClaim object. If you'll notice, the PersistentVolumeClaim object doesn't contain any specific implementation details; it's simply a generic description of the type of storage that the user wants. In this case they describe that they want a minimum of 100 gigabytes of storage and they want it to be read/write once. Now, when the user creates this PersistentVolumeClaim object, the Kubernetes system automatically figures out what PVs are available and binds the claim to one of the available persistent volumes. The beauty of this is that your workload, your pod definition, can now be portable: instead of referencing the GCE persistent disk directly, the user simply references the persistent volume claim. If you were to move this pod YAML across to a different cluster that doesn't have GCE persistent disks, it would still work, as long as the cluster administrator has some PVs made available to the end user. So if I was running on Amazon, I could expose persistent volumes that point to Amazon EBS disks; if I was running on-prem, I could expose GlusterFS, Ceph, whatever I want.
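Pulled together, the objects just described might look roughly like this (disk names, sizes, and access modes follow the talk's example; the claim name and the pod itself are illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: panda-disk
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:            # the admin knows this cluster runs on GCE
    pdName: panda-disk
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: panda-disk-2
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: panda-disk-2
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim                # no implementation details, just a generic request
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi            # binds to panda-disk-2, the PV that can satisfy it
---
apiVersion: v1
kind: Pod
metadata:
  name: portable-pod
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "6000"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-claim       # the pod spec stays cloud-agnostic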
One of the problems you might have noticed with this is that having a cluster administrator pre-provision volumes is both painful and wasteful. A cluster administrator can't necessarily predict how much storage every single workload is going to use, and having to manually provision for every single workload is very painful. So in Kubernetes, and this is one of the unique things about Kubernetes, we have the ability to dynamically create volumes on demand when the user requests them. The way this works is through another Kubernetes API object called the storage class. The creation of a storage class is a signal from the cluster administrator saying: I want to enable dynamic provisioning. So as a cluster administrator, instead of creating PV objects, what I can do is create a storage class that points to a specific volume plugin, in this case GCE PD, and specify the set of parameters to use when a volume is provisioned. Think of it as a template: when a new volume needs to be created, these are the parameters that will be used to create it. In this case I created two storage classes, one that I'm going to call "slow" and one that I'm going to call "fast". You can call them whatever you want; for me, "slow" is something that translates into a standard GCE persistent disk, and "fast" is going to be an SSD GCE persistent disk.

Now, the cool thing is that you could create the same exact storage classes on your own cluster, which may be running on-prem with a completely different storage system. The names of the storage classes could still be "slow" and "fast", but you can point to your own volume plugin and set some other opaque parameters. These parameters are opaque to Kubernetes, so you can pass through whatever makes sense for that particular volume subsystem. This is basically how we got around the fact that these different storage systems have so many different knobs and settings that we couldn't encapsulate everything into the Kubernetes API: Kubernetes doesn't care what you put in there, only the volume plugin cares about it. And the cluster administrator creates these storage class objects, so the cluster administrator knows the type of storage running there and can fill in the parameters with whatever makes sense for their system. Now we're decoupling the underlying storage from the consumption of storage.

So now that we've made a storage class to allow volumes to be created dynamically, the next question is: how do you actually create a new volume? As an end user, very little changes. You still request storage in the same way: you create a PersistentVolumeClaim, a generic request for storage. The only thing that's different from before is that I specify the storage class that I want, in this case "fast". As an end user I don't care whether it's SSD or not, or a certain number of I/O ops or whatever; I just look at the storage classes that exist on this cluster and specify one to use. Once that persistent volume claim is created, Kubernetes will go out, look at the storage class object, and call out to the volume plugin that the storage class references to create a new volume. Once the new volume is created, Kubernetes automatically creates a PersistentVolume API object to represent it, and then it binds the persistent volume claim to that persistent volume. Everything is automated, and you then reference the volume in exactly the same way as before, with a persistent volume claim in your pod. And again, this is portable across clusters.

You can also, as a cluster administrator, choose to mark a specific storage class as default. If you mark a storage class as default, the end user no longer has to specify a storage class in their PersistentVolumeClaim object: even if they don't specify one, Kubernetes will automatically use the storage class that the cluster administrator marked as default to do dynamic provisioning. So it's a way for the cluster administrator to enable dynamic provisioning for everybody on their cluster. If you deploy on AWS using kube-up, we pre-install a default storage class for you that creates EBS volumes; on Google Cloud, if you use GCE or GKE, we have a default storage class as well that provisions GCE persistent disks; and we have a default storage class for OpenStack that creates Cinder volumes too.
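A sketch of the two classes and a claim that uses one of them, assuming the in-tree GCE PD provisioner (on a different cluster you would swap in your own plugin and parameters; the class names, sizes, and the optional default-class annotation are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/gce-pd     # in-tree GCE PD provisioner
parameters:
  type: pd-standard                   # opaque to Kubernetes, interpreted by the plugin
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # optional: marks this class as the cluster default
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-fast-claim
spec:
  storageClassName: fast              # the only change on the user's side
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi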
All right, so that was remote storage, and we talked about how to use it: don't reference it directly in your pod, use a PV/PVC so that you maintain workload portability.

Next we're going to talk very briefly about hostPath volumes. HostPath volumes are a way to expose a directory on your host machine directly into the pod, but there are problems with this. If you expose a directory from your host machine, what happens if your workload is killed and moved to a different node? If your application expects that data to be persisted, the data just changed underneath the application. So our recommendation is not to use hostPath unless you have a very specific need and you understand what it is that you're doing. Some people use hostPath along with things like node affinity to try to pin a pod to a specific node, but take a look at your use cases, see if it makes sense, and think twice before using hostPath.

This leads into local persistent volumes. Local persistent volumes are a way to expose either block or file from the local machine as a persistent volume. What we recognized was that some people were using hostPath to try to expose underlying storage from a host machine in a persistent way, but there were tons of challenges with doing that. So what we did is create a new volume plugin called the local persistent volume, which allows local storage to be used through a persistent volume and persistent volume claim. It's referenced in exactly the same way as the GCE PD examples I've been showing you; the only difference is that the PV object references a local storage volume. The cool thing about this is that Kubernetes is aware that local storage volumes are special, so it takes care of data gravity for you. What that means is that once your workload is created and it's using a local persistent volume, if it needs to be moved, Kubernetes is not going to move it anywhere else; it knows that this workload can only be fulfilled from this given node. That comes with drawbacks: if you're unable to move your workload, you have possibly reduced availability and reduced durability. So if you're going to use local storage, make sure you understand what the purpose is. There are primarily two purposes, as I understand it. One is building higher-level distributed storage systems on top of Kubernetes: you can aggregate all the storage available from each of the nodes, use local persistent volumes to expose it to an application that then aggregates it and exposes it as network-attached storage to the rest of the cluster. The other use case is high-performance caching: if you have some very high performance disks attached to specific machines and you want to use those as a caching layer, that's a perfectly valid use case. Michelle is giving a talk right after this, in this room, about local persistent volumes in depth, and she'll do it far more justice than I can.
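To make the "referenced in exactly the same way" point concrete, here is a rough sketch of what a local PersistentVolume object could look like (the path, capacity, class name, and node name are all illustrative); the claim and the pod reference it exactly like the earlier examples:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage      # illustrative; typically paired with a no-provisioner StorageClass
  local:
    path: /mnt/disks/ssd1              # illustrative path to a disk on the node
  nodeAffinity:                        # pins the volume (and any pod using it) to this node
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1                     # illustrative node name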
Next up, let's talk about volume plugins in general. The volume plugins that I've talked about so far, GCE persistent disks, Amazon EBS volumes, all of them are in-tree. What that means is that the code for these volume plugins actually resides in the Kubernetes code base, so all the Kubernetes components are built, compiled, and shipped with these volume plugins. The reason we did that initially was to allow us to move very quickly: we didn't need to expose an API for volume plugins, so we could modify that API because it was internal. Any time we needed to change it, we would modify the API internally and update the volume plugins, because they were also in-tree, and whenever we shipped a version of Kubernetes everything was compatible.

But there are drawbacks to in-tree volume plugins. Actually, before I talk about the drawbacks, let's discuss why they're awesome. Kubernetes volume plugins allow you to do dynamic provisioning, which no other cluster orchestration system allows you to do. They automatically attach and mount volumes for your workload wherever it lands, and they provide a powerful abstraction that decouples your storage from the workload consuming that storage. So the volume plugins are awesome, but they have drawbacks because they're built in-tree. The drawbacks are that they're painful for Kubernetes developers: we have to maintain these volume plugins, and in a lot of cases we don't have the resources to actually test and maintain them; they may have dependencies that we don't have, so we can't properly test them. In addition, any bugs in these volume plugins can affect core Kubernetes components; a bug in a volume plugin can cause the kubelet to crash. And because these volume plugins are built into Kubernetes, they implicitly get all the permissions that we give to the core Kubernetes components, which is not necessarily something we want from a security perspective. It's not just painful for Kubernetes developers to have in-tree volume plugins, it's also painful for the storage vendors who want to expose their volume plugins in Kubernetes. It means they have to be aligned with the Kubernetes release schedule; for things like patches, they have to check into the main Kubernetes repository and cherry-pick things back to the appropriate Kubernetes releases, which can be extremely painful and probably not the pace they want to move at. It also forces them to open-source their code whether or not they want to; there are some storage vendors who would choose not to if they had the option.

This is where out-of-tree volume plugins come in. You've probably heard of the Container Storage Interface; it went to beta in 1.10, which was the Kubernetes release last quarter. The idea with CSI is that it makes the volume layer truly extensible. It allows volume plugins to be deployed on top of Kubernetes like any other containerized workload, with kubectl create -f on some YAML, and the benefit is that the volume plugins are now completely decoupled from the core Kubernetes codebase. There's going to be a talk this afternoon by one of my co-authors on CSI, so if you're interested in learning more about it, please attend that.
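From the cluster administrator's point of view, consuming a CSI driver ends up looking much like the in-tree case. As a hedged sketch, assuming a hypothetical driver named csi-driver.example.com has already been deployed on the cluster, a storage class for it might look like this; claims and pods then reference it exactly as before:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-csi
provisioner: csi-driver.example.com   # hypothetical CSI driver name; the driver itself runs as pods on the cluster
parameters:
  type: ssd                           # opaque parameters handed through to the CSI driver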
The other out-of-tree volume plugin mechanism is called Flex volumes. Flex volumes were an earlier attempt by SIG Storage to do out-of-tree volume plugins. The big difference between Flex and CSI is that Flex is an exec-based model, meaning that whenever we need to do a mount operation or an attach operation, Kubernetes calls out to a Flex binary or script that exists on the machine. The drawback of this approach is that the driver binaries are files that must be copied to the node machines as well as the master machines, which makes deployment of Flex volume drivers much more difficult: you actually have to have access to copy these drivers onto the machines instead of just deploying a Kubernetes workload. It also means that for clusters that don't give access to the master, for example on GKE, where we prevent users from having access to the master because Google manages the masters for you, you cannot install Flex volume drivers onto the master, so you can't do things like attach. So Flex is limited, but we're going to continue to support it because there is a set of drivers that were written for Flex. The idea is that we'll keep it in maintenance mode and invest in the future with CSI: the CSI API is going to continue to expand, and Flex is going to be maintained as is.

And that's it. If you have any questions, or if you want to get involved, please join the Kubernetes storage special interest group. We have meetings every two weeks; if you go to that link you can find details about how to join. We have a mailing list, so if you have questions you can reach out to us there, and there's also a Slack channel. For the bi-weekly meetings there's an agenda doc attached to the meeting invite; feel free to edit it. If there's a bug you have that's not getting traction, or a design that you'd like to discuss, just add an item to that agenda doc and we'll find time in the next meeting to discuss it, and if we don't, we'll move it along to the meeting after that. And with that, I'll open it up to questions.

Yes? Excellent question. The question is: given the drawbacks of in-tree volume plugins, is there any plan to migrate the in-tree volume plugins to CSI? The answer is yes. The challenge is that the Kubernetes API has a very strict deprecation policy, so we can't easily deprecate the API objects that we expose, like GCE persistent disks and Amazon EBS. So the plan is not to actually deprecate the API, but instead to modify the internal logic to proxy through to CSI. The design for that is underway right now, and probably Q3 or Q4 we're going to start implementing it. Ideally, once this is complete, end users shouldn't notice that it happened; they'll still consume these volume plugins in exactly the same way, using the same APIs, but under the covers those requests will be fulfilled by an external CSI driver instead of the core Kubernetes binaries.

Yes? Sure, so the question is: does the abstraction of CSI being out of tree affect performance, latency, I/O, things like that? The answer is no, because Kubernetes is not in the data path; Kubernetes is strictly in the control path. The responsibility of Kubernetes is to set up a volume, expose it either as file or block into the container, and then get out of the way. So when you read or write from a filesystem, you're writing through the kernel to the underlying storage system.

Any other questions? Yes, no mention of the OpenStack Swift API. I'm not super familiar with OpenStack Swift, but what we tried to do with Kubernetes is evolve an API that makes sense for Kubernetes. We started from the user perspective first; the idea was to create something that would enable workload portability and things like dynamic provisioning, so we wanted to start with something that actually makes sense for Kubernetes. We started with in-tree volume plugins, and once we got that to a place we were comfortable with and could see that it was working, we promoted it to an external API, and that's CSI.

Okay, sorry, let's follow up offline; I think we're just about out of time. Thank you very much. If you have any questions, feel free to reach out to me.

[Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 15,164
Rating: 4.9401197 out of 5
Id: uSxlgK1bCuA
Length: 34min 35sec (2075 seconds)
Published: Sun May 06 2018