Single Sign-On for Kubernetes - Joel Speed, Pusher

Captions
Hi guys, the last few people are trickling in, so while we do that: thanks for coming, and I appreciate that you've all stuck around this late into the conference. I know it's the penultimate slot. Have you enjoyed the conference? Yeah, there's a good solid yes there, that's great. Right, okay, let's get going then.

So this talk is about a journey that Pusher went on. We started in October 2017 looking at how we could improve user authentication on our platform. Then in January, February, March we published three blog posts on The New Stack about what we'd done, and that in turn became this talk, which I've done a couple of times in London but never in America, so: first time in America.

I'm Joel, I'm an engineer, and I work for a company called Pusher. We're based primarily in London but also have a San Francisco office. How many of you have heard of Pusher? A smattering of you. How many of you could describe what we do to the person next to you? One guy, two guys. So we're a developer tools company. We do communication and collaboration APIs, and the whole point of that is making it simpler for developers to add real-time features to their applications. Our main product is a WebSockets-as-a-service gateway that processes billions of messages per day. That's about all I'm going to say about Pusher; this is more about our journey and not about the company. My Twitter handle is there, so if you've got any questions about any of this stuff afterwards feel free to tweet me. My email is there, and my blog link is there as well, which has links to the original blog posts, so if you prefer reading this kind of stuff you can go to my blog and read it instead.

So back in October 2017 my team lead came to me and said, "Joel, we should start using RBAC." Now, given this is in the security track, I would imagine most of you know what RBAC is already, but if you don't, it's role-based access control. It allows you to assign permissions to individual users: you define roles that say this person can create and update deployments but can't delete them, and then assign that to a user's account with a role binding.

The reason we wanted to do this is that we're somewhat multi-tenant: there are actually five or six different teams that use our Kubernetes platform, and while they won't maliciously interfere with each other, they could do so accidentally. It probably wouldn't make anyone popular if someone accidentally deleted another team's namespace, for instance. So we wanted to use permission sets to prevent any accidental interference between the teams.

Now, back when we started the Kubernetes project about two years ago, what we did was spin up our first cluster and generate a certificate. We gave that certificate the system:masters group, which, if you know RBAC, has cluster-admin, meaning it can do literally anything to the cluster. And then, like any good DevOps engineers, we distributed it to every single engineer on the team. I can't believe I got applause for that. There were 30 engineers at the time, and we all had this one certificate; I don't think it even had an expiry, even better. We were all logging in to every single cluster as cluster-admin. Not really where we wanted to be as we started growing the company.

So what did we want to change about that? To do RBAC we needed everyone to log in as themselves; we needed individual user accounts, so I would log in as Joel and my teammate Phil would log in as Phil.
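To make the RBAC idea concrete, here is a minimal sketch of the kind of role and per-user binding described above; the namespace, names, and user are hypothetical.

```yaml
# A minimal sketch of the permission set described above;
# namespace, names, and user are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-editor
  namespace: team-a
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "create", "update"]   # deliberately no "delete"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: joel-deployment-editor
  namespace: team-a
subjects:
- kind: User
  name: joel@pusher.com   # the username that authentication establishes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployment-editor
  apiGroup: rbac.authorization.k8s.io
```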
We wanted to be able to do this with group management. Obviously there were only 30 engineers at the time, but we're a growing company that wants to be 150 within a year, and we don't want to manage every single person's user account and permissions individually, so groups would be really helpful for that. We wanted to make sure that whatever we designed would scale: at the moment we have eleven Kubernetes clusters, but that's because we're only in one region and we've only just gone GA with our products on the platform, so we're probably going to expand, and over the next couple of years, as we add new regions and products, there'll be 40, 50, 60 clusters. We wanted to make sure there's not too much management overhead in that space. Finally, user experience. This is really important to us at Pusher: we're trying to build a platform that our developers want to use, and one of our teams actually uses Heroku, which we're trying to prise them off, so we're literally competing with Heroku on user experience at the moment. That was a big concern when building the system.

If you look at the Kubernetes documentation, it lists nine different ways you can authenticate an entity to the Kubernetes API. Not all of those are designed for humans to use. Bootstrap tokens, for instance, are for Kubernetes nodes to bootstrap themselves against the API and get their real credentials, and service account tokens are primarily for pods, for your workloads; neither is really going to be used by humans. We then looked at the remaining seven and actually ruled out a few more, things like the webhook tokens, the authentication proxies, and Keystone: at the time we couldn't really find any good documentation for them, and we couldn't find reference implementations. The only things we could find documentation and reference implementations for were client certificates and OpenID Connect, so those are the two we decided to look at and weigh up which would be better for us.

Starting with client certificates: you're probably very familiar with these if you've done any TLS ever. They're the same sort of certificates you use to secure an HTTPS website, but with the client auth usage set, so you can use them to authenticate a user, or even a service, to another service. These have a few problems. When they're issued, by a certificate authority or wherever, they're issued for a set lifetime. In the real world, when you're using TLS, there are certificate revocation lists and OCSP; Kubernetes does not support these. So the only way to revoke a certificate, if you happen to have issued it indefinitely, is to re-key your Kubernetes cluster, which, I assure you, is not fun. I've done it eleven times. Another thing: self-service with certificates is hard. If you let people self-sign their certificates, what are they going to do? They're just going to give themselves the system:masters group, and you've completely defeated the point of your RBAC. So you have to have some infrastructure for generating the certificates and verifying that people are who they say they are, which means that not only have you got an authentication system for your Kubernetes infrastructure, you've also got to have an authentication system for the infrastructure behind your authentication system for Kubernetes. That seemed like a bit of a pain. There are things out there now that can do this, Vault for instance, but at the time there wasn't much in that space, and it wasn't particularly great.
Another thing we found is that there's no Kubernetes dashboard support. I know a lot of people don't like the Kubernetes dashboard, but we use it a lot; our engineers like to see the little green dials and the CPU and memory graphs and generally get a visual overview of what's going on. You can't use a certificate to authenticate to the Kubernetes dashboard, so if we went that route we'd have to make it read-only and very much limit what it could actually be used for.

So next up was OpenID Connect, and if you'd asked me fourteen months ago what OpenID Connect was, I would have said I don't know. It's a single sign-on protocol, and as you can probably guess from the title of this talk, it's the option we chose. I'm a very big fan of single sign-on, and as Kelsey said in his keynote this morning, SSO is dope. There are a few problems with it, however. Like certificates, the tokens have a fixed lifetime and can't really be easily revoked; Google's OIDC implementation, for instance, gives you an hour by default for the ID tokens. That's less than a certificate typically would be, but it's still an hour. And there are really only a handful of providers: the Kubernetes documentation lists only three OIDC providers, Google, Salesforce, and Azure. There are a few others out there, Auth0 for instance, but there aren't many options. Most of the engineering stuff at Pusher was already behind single sign-on with GitHub, so we wanted to see if we could integrate GitHub with our Kubernetes infrastructure, and obviously that wasn't in the list, which was a bit of a problem for us. On the plus side, as I say, it's single sign-on, so we can reuse groups: every engineer is in the Pusher GitHub org, every engineer has a G Suite account, so if we could use one of those to authenticate our users, we wouldn't have to run our own identity management system, which would be great as well. OIDC also has a nice feature I didn't know about, which is automatic refreshing: when kubectl is set up for OIDC, if you pass it a refresh token, then when the ID token the OpenID Connect provider gave you expires, it can automatically refresh it for you, so you don't need to log in every hour. That's a pretty cool feature, and we liked the sound of it.

So how does it work? How many of you have used OAuth 2 before? Quite a lot of you, so you'll probably recognise the flow here. OIDC actually piggybacks on OAuth 2, so it's incredibly similar. You've got your website, you see the little "Sign in with Google" button or whatever it is, you click the button, you get redirected off to the sign-in page, you type in your credentials, you log in, and you get redirected back. Part of that redirect back is an authorization code in the query string. In OAuth 2 this code is used by the server to act on your behalf; in OpenID Connect, the server exchanges the authorization code for an ID token. This ID token is then what you use to say "yes, I am who I say I am" and provide to, say, the Kubernetes API.

These ID tokens come in the form of JSON Web Tokens, or JWTs. They're constructed of three parts: the metadata (header), the payload, and the signature. The metadata tells you how the signature was signed, the payload gives you a bunch of information (there are examples of that on the right of the screen), and the signature is used to verify that the payload hasn't been tampered with. If we look at the payload, you can see it's got a number of fields in there: the issuer (where it came from, the identity provider), the client it's expected to be consumed by (Kubernetes, in this case), when it was issued, when it expires, and then things like the email, the group information, and the name of the person.
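Decoded, a payload might look something like this; a sketch with hypothetical values (the real payload is JSON).

```yaml
# Sketch of a decoded ID token payload; the real thing is JSON,
# and all values here are hypothetical.
iss: https://dex.example.com   # issuer: the identity provider it came from
aud: kubernetes                # audience: the client expected to consume it
iat: 1544832000                # when it was issued (Unix time)
exp: 1544832900                # when it expires (15 minutes later here)
email: joel@pusher.com
email_verified: true
groups:                        # group information
- infrastructure@pusher.com
name: Joel Speed
```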
Once you've got one of these ID tokens, you can use it with kubectl: there's a --token flag you can pass it to, or you can put it into your kubeconfig. In the background, when you then make a request to the Kubernetes API, kubectl sets an Authorization header on the request, prefixing the ID token with "Bearer" to tell Kubernetes it's a bearer token. When the API server receives that token, it validates the JWT signature: it goes off to the issuer, fetches the keys to verify the signature, and once it's verified the payload hasn't been tampered with, it checks whether it's expired. If it hasn't expired, it uses the information from the payload to authenticate the user, then passes off to authorization, which in this case would be RBAC.

If we go back to this slide, there are still a few problems. There's this fixed-lifetime issue: Google issues tokens for an hour by default, which means you'd have to log in every hour if you're not using refreshing, and that might get in the way of our engineers, so we wanted to see what we could do about that. And there are only a handful of providers. As I said, we wanted to try to use GitHub: we've got GitHub teams for every team and organise all our repo permissions with them, so it would be nice to tie that in with our Kubernetes infrastructure.

So we looked for a solution, and this is where we came across a product called Dex. Dex, if you haven't heard of it, was a CoreOS project; it has recently been moved into its own GitHub org now that CoreOS have been acquired. What it does is proxy authentication. It's an identity provider in its own right, but it doesn't actually store any identity information; instead you configure it to talk to one of many upstream providers, such as LDAP, GitHub, SAML, or even another OIDC provider. We looked at GitHub for this, as I said, but found that the GitHub connector in Dex didn't give you that much information. It gave you the user's unique ID, which is a numerical value, and that's not particularly useful when you want to manage permissions and know who you're assigning them to; you don't want to assign them to a number. So in the end we used Google as our identity provider. As I mentioned, every Pusher engineer has a G Suite account, so we're using that now, along with Google Groups.

That may raise the question: Google is compatible, it does OIDC anyway, so why put Dex in the middle? There are a number of reasons. The first is control of the token lifetime. With Dex as the issuer, we can set the token lifetime to however long we want. Initially we weren't using refreshing with this system, so we set the tokens to last for eight hours; that meant an engineer could come in, log in once in the morning, and work all day without having to log in again. We're now down to 15 minutes, now that we've got refreshing set up, and that means that if someone loses their laptop or their phone gets stolen, we can disable their Google account, and within 15 minutes we know we've locked them out. So that has a number of benefits.
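For illustration, a Dex configuration along these lines might look roughly as follows; the hostnames, IDs, secrets, and exact values here are made up, not Pusher's real config.

```yaml
# A sketch of a Dex configuration along these lines; hostnames,
# IDs, and secrets are illustrative.
issuer: https://dex.example.com

storage:
  type: kubernetes
  config:
    inCluster: true

# Short-lived ID tokens; refreshing keeps engineers logged in.
expiry:
  idTokens: 15m

# Google as the single upstream identity provider.
connectors:
- type: oidc
  id: google
  name: Google
  config:
    issuer: https://accounts.google.com
    clientID: $GOOGLE_CLIENT_ID
    clientSecret: $GOOGLE_CLIENT_SECRET
    redirectURI: https://dex.example.com/callback

# Clients (audiences) that may ask Dex for tokens, kept in version
# control rather than configured through Google's UI.
staticClients:
- id: kubernetes
  name: Kubernetes
  secret: not-really-a-secret
  redirectURIs:
  - http://127.0.0.1:5555/callback
```

The point is that the token expiry, the upstream connector, and the client list all live in one file that we control.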
With control of the identity provider, it's actually possible to revoke tokens. I said earlier that the API server goes and gets keys from the identity provider: every OIDC provider serves a well-known openid-configuration document, and part of that is the JSON Web Keys (JWKS) URI, where you can get the public halves of the keys that sign these JWTs. Dex rotates these on a periodic basis. So if you control the provider, you can delete the keys and restart Dex: it will generate new keys, and when the API server receives a token that was signed with the old keys, it won't be able to fetch the public half, it won't validate the signature, and it will therefore treat that token as invalid and reject the request. So in a worst-case scenario you can log everyone out of the cluster, which can be helpful.

If you've set up an OAuth2 application in the past, you know that typically you'd have to go onto GitHub or Google, navigate through all the menus in the UI, find where you set up a new client, and put in a secret and the callback URLs. Dex, obviously, is configured by us; we run it locally on a centralised cluster, and we have all this configuration in code. That means that if we want to set up multiple websites protected by our single sign-on system, we can actually do that in version control. Rather than having different teams come to us saying "hey, can you set up a new app on GitHub", they can just submit a PR to our infrastructure repo and we can approve it there. Google doesn't even need to know: Google only knows about Dex, not the 20, 30, 40 websites Dex protects.

Finally, the project is open source, and the brilliance of open source is that you can extend it. We've actually added a couple of PRs to Dex and are running a fork at the moment. The first of these is refreshing. When we started with Dex, the reason we weren't using refreshing is that the implementation, rather than actually going back to Google to check the user was still allowed to authenticate, would just say "yeah, cool, that's fine, they logged in once, so they must be able to log in again". With refreshing implemented, it now goes off to Google every time and checks the person's account hasn't been disabled, which is how the 15-minute lockout works. Secondly, we've enriched it with Google Groups information. Google's OIDC API doesn't provide group information by default, so now, with this PR in Dex, after the OIDC flow Dex goes off to the Google Groups API, fetches the group information, and enriches the ID token, giving us the group management we wanted.

So how do you use all this? The first thing to do, obviously, is to deploy Dex, and I'm not going to go through that because the documentation in the readme is very, very good. They've got a bunch of YAMLs you can download and apply; obviously check those through first, it would kill me if I didn't say that. There's a link at the bottom of the readme about using Dex with Kubernetes that walks through the entire setup, and it's really, really simple. Once you've got your Dex instance set up, you need to connect it to the API server. The first thing the API server needs to know is where you've hosted Dex; that's so it can go off and get those keys. The second thing it needs to know is which client ID, which audience, it expects to receive: if it receives an ID token that doesn't match this audience, it will reject it straight away. It can only actually support one of these, which could be a problem if you're going straight to an OIDC provider and want multiple sources of identity; but with Dex in the middle, you can have people choose between Google, GitHub, whatever, and Kubernetes will only ever see the Dex token, so you can have multiple sources for your identities behind a single audience. Then you've got to tell the API server which claim from the payload to use as the username and which claim to use for groups.
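Concretely, the API server side of this wiring is a handful of flags; a hypothetical set might look like this.

```yaml
# Hypothetical kube-apiserver flags (for example in its static pod
# manifest) wiring OIDC authentication up to a Dex instance.
- --oidc-issuer-url=https://dex.example.com       # where to fetch the discovery document and keys
- --oidc-client-id=kubernetes                     # the audience an ID token must carry
- --oidc-ca-file=/etc/kubernetes/pki/dex-ca.pem   # CA used to trust the issuer
- --oidc-username-claim=email                     # payload claim to use as the username
- --oidc-groups-claim=groups                      # payload claim to use as the group list
```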
Once you've got the Dex instance and the API server set up, you can now start using this with kubectl. I'm sure you've all seen a kubeconfig before: normally you have a little pointer to a CA file and the client certificate identity you're using. To do OIDC, you give it a few more bits of information. You give it the client secret, which is not particularly a secret, more a pre-shared key. That's a part of the OIDC spec people often get confused about: they think it needs to be kept really secret, but you need a couple of other things to make it actually useful, like being hosted at the right callback address, and if your application is using HTTPS, which it should be, people can't really man-in-the-middle you when you've got TLS anyway, so it doesn't need to be treated as that secret. You then need to get yourself the ID token, the token that identifies you, that JWT I talked about earlier, and if you want to do refreshing, you'll probably need a refresh token too.

Now, kubectl doesn't have a login command; Kubernetes doesn't actually provide you a way to authenticate with an OIDC provider. That's something you have to go and solve yourself. Dex, on the other hand, have kind of solved this problem; they've helped us out a bit. They've got an example app that can interact with Dex, or any OIDC provider. It runs a small binary on your local machine, sets up a small web server, you go to your browser, you do the login, and it displays the information back to you as a web page. You can see at the top there it displays the ID token you need, then the information from the payload, and the refresh token at the bottom. Great: copy and paste that into your kubeconfig, and if you've got refreshing set up, you never need to do that again. But that's not great when you're rolling this out to 30, 50, 300 engineers, and we wanted the UX of this project to be really good, as I've said before. So we built our own tool. It's called k8s-auth internally, and it's actually an extension of that example app; the example app is only a couple of hundred lines, this is about three hundred, so it just enriches it a little. Instead of having to provide all of the configuration as command-line arguments, it uses the fact that our engineers all log in to Vault when they join the company, and stores the configuration in Vault. When it starts up, it goes off to Vault, fetches the client secret, the issuer URL, and the CA certificate it needs, then runs that same little web server, opens your browser to start the login process, you log in, it gets redirected back, exchanges the code for the ID token, and then builds your kubeconfig for you. So when a new engineer joins the company, all we need them to do is sign in to Vault, download the k8s-auth binary, and give it a list of clusters they might want to interact with, and it builds a complete kubeconfig for them, with contexts and clusters and the credentials they need.
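The user entry a tool like this writes into the kubeconfig would look roughly like the following, using kubectl's oidc auth-provider; every value here is a placeholder.

```yaml
# Sketch of a kubeconfig user entry for OIDC; all values are
# placeholders. kubectl's oidc auth-provider handles the refreshing.
users:
- name: joel@pusher.com
  user:
    auth-provider:
      name: oidc
      config:
        idp-issuer-url: https://dex.example.com
        client-id: kubernetes
        client-secret: not-really-a-secret
        idp-certificate-authority: /home/joel/.kube/dex-ca.pem
        id-token: eyJhbGciOiJSUzI1NiIs...   # the JWT from the login flow
        refresh-token: ChlvZmZs...          # enables automatic refreshing
```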
So this is where I'm now going to try a live demo, and I'm just going to switch over to... if it's there... it's not. Okay, we'll try it on the public Wi-Fi. How's the text size there? Good? Yep. So, to prove I'm not cheating, the first thing I'll do is delete any existing configuration. Then, if I do a get nodes, you'll see that it's trying to talk to localhost; that's the default behaviour for kubectl when it's got no configuration. Now I'm going to run our little k8s-auth binary, give it the name of one of our clusters, and see what it does. You can see it's pointed my browser at Google straight away. I can now log in, because I've got a pusher.com email, and of course do my 2FA, and you can see on the right there it says you've successfully authenticated, and on the left it says it's written a config to my user's kubeconfig and the login succeeded as joel.speed at pusher. So now I can set the context to that cluster I named, and let's have a quick look at the config it built: you can see there it's got the credentials, it's downloaded our CA certificate from Vault, and it's set up the contexts and the clusters. And now we can use this config to get nodes from an existing cluster, proving that I've logged in. Hopefully. I'll just exit out of that. Cool.

So that kind of wraps up the command-line experience for single sign-on. Obviously that k8s-auth tool is very specific to Pusher, especially the Vault integration, so what I've done is create a slightly more abstract version of it and put it up on GitHub. It's basically the same thing, but all the configuration that would come from Vault ends up as command-line arguments. If you want to build a similar sort of experience, that might be a good place to start.

The next step of this journey was the Kubernetes dashboard. Love it or hate it, and I know a lot of people say just don't use it, our engineers, as I said earlier, like to use it; they like the little green dials. So we wanted to make sure we could use it. In version 1.7 of the Kubernetes dashboard they introduced a login view: you can upload a kubeconfig if you want to, you can copy and paste a token in, or you can press the little skip button and not log in at all. Obviously that's not great. When you're using kubeconfigs you can only actually use basic auth, not certificate data or anything like that, and with tokens, yes, if you've got one in your kubeconfig you can copy and paste it, but who wants to do that ten times a day? There is a solution, however: the Authorization header I mentioned earlier when we were talking about how kubectl interacts with the API server. If you make a request to the Kubernetes dashboard with this auth header, it will proxy that header through when it makes requests up to the API server. So somehow we wanted to make sure that every request to the dashboard had this Authorization header on it.

This is where the oauth2_proxy comes into the story. We'd been using it for a while to protect a number of things, manuals and runbooks, a bunch of legacy infrastructure that has single sign-on, so we knew the project well. I knew roughly that it could do what we wanted, given that it had an OIDC provider, so I could connect it up to our Dex. I didn't know whether it was actually going to be Kubernetes-compatible or not, though. It turns out it wasn't, quite, but a couple of PRs later, now it is. With these PRs, we can set it up to provide that ID token from the OIDC provider in the Authorization header, which it can then proxy through to the Kubernetes dashboard.
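As a sketch, the proxy deployment's args might look something like the following; the endpoints and IDs are hypothetical, and the two authorization-header flags reflect the behaviour added by the PRs just mentioned.

```yaml
# Sketch of oauth2_proxy container args for this setup; endpoints and
# IDs are hypothetical, not Pusher's real configuration.
args:
- --provider=oidc
- --oidc-issuer-url=https://dex.example.com
- --client-id=kubernetes
- --client-secret=not-really-a-secret
- --cookie-secret=<random-seed>
- --email-domain=pusher.com
- --upstream=https://kubernetes-dashboard.kube-system.svc
- --pass-authorization-header=true   # send the ID token upstream as Authorization: Bearer
- --set-authorization-header=true    # also return it on the auth subrequest response
```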
You can use this in two ways: the upstream way and the auth_request way. In the upstream way, you have your ELB, the ingress controller, the oauth2_proxy, and the Kubernetes dashboard sort of in a line. What that means, though, is you have to have an oauth2_proxy for every instance of your Kubernetes dashboard, so if you have it on multiple clusters, you're going to have to configure Dex to know where each of those is, and if you want to protect more than just the dashboard, say Prometheus or Kibana, you're also going to have to put one in front of each of those. That's not particularly great. The alternative is the auth_request mode. The nginx ingress controller has something called the auth_request module, and when it receives a request to an ingress object that is protected by single sign-on, it makes a subrequest to your oauth2_proxy, and that subrequest tells nginx whether you are logged in or not. If you're logged in, great, it proxies the request through, along with the authentication header I talked about; if not, it redirects you off to start the login process. This is configured on the ingress object. As I say, I'm sure you're used to adding annotations to your ingress objects, and there are three you need for this. The first is the auth-url, the endpoint provided by the oauth2_proxy, which returns either a 202 or a 401. In the case of a 202, you're logged in and the response carries the Authorization header; a configuration snippet then takes that Authorization header and proxies it through to the dashboard. In the case of a 401, nginx sends a redirect to the sign-in URL, which is the second annotation.
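Pulled together, the ingress might look roughly like this; the hostnames are hypothetical, and the snippet shows one common way to copy the header across, so treat it as illustrative rather than exact.

```yaml
# Sketch of an Ingress protected via nginx's auth_request integration;
# hostnames are hypothetical and the snippet is illustrative.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kubernetes-dashboard
  annotations:
    # Subrequest endpoint on the oauth2_proxy: 202 means logged in, 401 means not.
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/oauth2/auth"
    # Where nginx redirects you on a 401 to start the login flow.
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.com/oauth2/start?rd=$escaped_request_uri"
    # Copy the Authorization header from the auth subrequest response onto
    # the request that gets proxied through to the dashboard.
    nginx.ingress.kubernetes.io/configuration-snippet: |
      auth_request_set $token $upstream_http_authorization;
      proxy_set_header Authorization $token;
spec:
  rules:
  - host: dashboard.example.com
    http:
      paths:
      - backend:
          serviceName: kubernetes-dashboard
          servicePort: 443
```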
So, time for another quick demo. Since I was logged in already, I'm going incognito for this one. What I'm going to do here is type, if I can type, the URL of one of our Kubernetes dashboards. You can see I don't see the dashboard straight away; I go straight to Google. Again, I can log in, do my 2FA, then I get redirected back, and I should land straight on the dashboard. You can see up here on the right, let me zoom in a bit, it says "logged in with auth header": the Kubernetes dashboard has detected that Authorization header and is proxying it through when it makes requests to the API server, and I can now go through and look at nodes and things. I am now logged in as me.

Okay, there is some sad news, however. About two weeks ago I got tagged in an issue on the oauth2_proxy: the Bitly team have actually decided they're no longer going to work on it and have archived the project. This issue is about finding a new home for it, and at Pusher we have decided to volunteer to take on that responsibility. [Applause] We've got a good understanding of the codebase; the couple of PRs I've written are quite extensive and touch a lot of it, so I've got a very good understanding of it myself, and another few people I've spoken to on the issue have also volunteered to help maintain the project. I've now spoken as well with Cheryl Hung, who is Director of Ecosystem for the CNCF, and Brian Grant, who's on the Technical Oversight Committee, and in the new year we're going to try to get it adopted as a CNCF project. So yeah, although I've mentioned a project that is currently archived, it is not dead. We are taking on that responsibility, we're going to try to get it into the CNCF, and hopefully we can build a new community around it. There are a couple of hundred PRs and issues open, people are using this thing, it's got four and a half thousand GitHub stars, and we want to keep it alive.

So, to quickly recap what I've gone through: every engineer at Pusher now has their own account when they log into our systems, which means we can assign RBAC permissions to them. We've got group information coming from the extensions we've added to Dex. We've got short-lived tokens; they only last 15 minutes, so if someone loses their laptop or gets their phone stolen, we can lock them out of Google and have them out of our infrastructure within 15 minutes. The thing we've built is scalable, so when we add new clusters we don't need to reconfigure anything. And in terms of UX, the binary our engineers download, they run once and never have to run again, and with the dashboard, whenever they're not logged in, they get automatically redirected. I think that's probably the best we could achieve; if you have any better ideas, please do let me know, because I am interested.

I would probably get shot if I didn't mention, and I know everyone is hiring, but Pusher are hiring. If you live in London, or fancy emigrating to London, we are hiring, we can sponsor visas, and we've got not only Kubernetes openings but openings for backend engineers in general. In terms of Kubernetes, we're doing lots of interesting stuff around custom controllers at the moment, especially using the Kubebuilder project; check out my Twitter, I tweet about the stuff we're doing loads, and come and chat to me if you're interested. This final slide has a bunch of links to the PRs I've mentioned in this talk, and my details again if you want to talk to me. Thank you very much for listening. We have time for questions. [Applause]

Yep. So the question there was: given the tokens are 15 minutes long, does the dashboard get refreshing? One of the PRs to the oauth2_proxy, 621, it says at the bottom there, yeah, 621, actually implements refreshing in the oauth2_proxy. It now stores the refresh token in the cookie as well, so inside that encrypted cookie you've got your ID token and the refresh token, and because every request to the dashboard has to go and get that 202 response, once you get past your 15 minutes it can then perform the refresh cycle.

So the question was how to revoke the tokens that already exist. If you're using an upstream provider like Google, you go and disable the Google account, and when it comes to refreshing, that refresh request will get denied. If you don't want to do that, you have to revoke the signing keys, which, if Dex is running in Kubernetes, is a case of deleting a CRD; if it's running outside of that, it will have something like Postgres or SQLite, and you can just go into the database and delete them there. There's no stored list of the ID tokens, so you can't just go and delete those. That's not what the check does: it doesn't go and check that the token still exists, it checks that the signature is valid. The tokens themselves are never stored in the system.

Anyone else? Yep. Oh, sorry, yes. So the question there was that it wasn't clear where we get the group information from. Google's OIDC implementation doesn't return you a groups list in the ID token, but there is a Google Groups API, and PR 1185 on Dex is the code for that. What it does is use a service account to talk to the Google Groups API, which is a completely separate API: it takes the username it gets from the ID token, lists their groups, and then uses that to enrich the ID token.

Yep. So the question was about using RBAC with groups. When you're defining an RBAC role binding or cluster role binding, in the subjects field there is a subject kind, I believe is one of the fields, which can be User or it can be Group. If it's Group, then the name is, in our case, the email of the group. So say we've got group1@pusher.com: we would put that as the subject group, and everyone in that group would then have the permissions from that role binding.
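A sketch of such a binding, with a hypothetical group and one of the built-in roles:

```yaml
# Sketch of binding a role to a group rather than a user; the group,
# namespace, and role choice are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: group1-edit
  namespace: team-1
subjects:
- kind: Group
  name: group1@pusher.com   # matched against the groups claim in the ID token
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                # one of the built-in aggregated roles
  apiGroup: rbac.authorization.k8s.io
```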
Yep. So the question was about logging in to different clusters. We actually host all of this centrally and use the same credentials for every single cluster. The thing is, with authentication, I am Joel; I'm always Joel, it doesn't matter which cluster I'm going to. So we decided, for now, that we would have a central authentication cluster, and all of the clusters use that same central auth. You could separate production, but is there really a need to prove who I am differently on different infrastructure? Given that the RBAC can be different on each cluster, it doesn't particularly matter.

Yeah, we originally ran them on separate infrastructure, on their own EC2 instances; we're now running them in Kubernetes, however. We have them in two clusters, two central clusters that we call "global", for stuff that doesn't need to be in every region, and in there we have Dex and the oauth2_proxy. That obviously does create a bootstrapping issue. We have a Terraform module that can create a certificate in a break-glass emergency: the infrastructure team have a KMS-encrypted secret they can use with it to get a long-lived certificate if we need one. So if our Dex stuff does go down, we can still get into the clusters using that break-glass approach.

Yeah, we should probably call it there, but I'll hang around outside if anyone's got any more questions.
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 7,875
Id: yaJnT6DNHHc
Length: 34min 27sec (2067 seconds)
Published: Sat Dec 15 2018