Building a Kubernetes on Bare-Metal Cluster - Alexandros Kosiaris & Giuseppe Lavagetto

Captions
Okay, I can speak. Hello. Together with Alex, we're going to talk about our experiences building a Kubernetes cluster on bare-metal hardware for Wikipedia.

First, some information about who we are. I think most of you are familiar with our main project, which is Wikipedia. The Wikimedia Foundation is a nonprofit organization that maintains the infrastructure for Wikipedia, but not just Wikipedia: also our sister projects like Wikidata, Wikivoyage and so on (not WikiLeaks). As you can see, we get quite a large amount of traffic monthly, so we are one of the top 10 websites as far as traffic is concerned. Still, our infrastructure is two primary data centers, where the application layer and the persistence layer live, and then we have three caching data centers, one in San Francisco, one in Amsterdam and a new one in Singapore, which basically just keep caches of the content so that we can serve it nearer to the user; it's kind of our own home-baked CDN.

Even with the amount of traffic that we have, our cluster is relatively small: we have about 1,200 hardware machines. We don't typically use VMs in production; just lately we're using a couple of Ganeti clusters with about a hundred VMs. And we're not just lean in terms of hardware, we're also lean in terms of the size of the organization: around 160 engineers overall, and the people that care about the application layer, which is what we're moving to Kubernetes, are just four.

So why are we moving to Kubernetes? Just a few words about this. Up until 2014 we just had the big monolithic application, which is MediaWiki, and one service called Parsoid that was helping with our visual editor. Fast forward four years to today: we have many more services, and we are slowly moving towards a microservice-oriented architecture. Even at that slow pace, the toil on our team, which is very small, keeps increasing, and we need a way both for us to work faster and for the developers to be able to deploy their services better and with less weight on our time.

But these are not the only reasons to move to Kubernetes. One big reason for us is elasticity: the ability to scale each function that we have up and down in a short amount of time in response to traffic bursts. It also frees us from having to deal too much with the failure of a physical node, because Kubernetes basically detects that by itself and reorganizes resources, instead of us having to do it. Containers are also a good thing because they allow developers to have basically the same consistent environment on their own development machine and in production, which reduces surprises when things reach production. And finally, we are giving the developers, the people that deploy code on the cluster, more control over their applications, because the configuration of their applications is no longer in Puppet, which is controlled by us, so they have much more flexibility in what they do.

But we also saw some challenges. The first one: Kubernetes is complex, as we've just seen in the previous talk; even just QoS handling is complex, everything else is complex, and there are a lot of moving parts. We didn't want to sacrifice the stability that we have just to get flexibility. Also, it's a new paradigm (has anyone heard "cloud native" in the last few days yet?). It's a good thing, but it's also a problem for all the tooling that we have around our cluster, which is based on the concept of having physical servers with services installed on them.
And finally, containers themselves are also a challenge for us, because we used to be very efficient at doing security upgrades, and everybody that has dealt with containers knows that this is still almost an unsolved problem with containers.

So why are we doing things on bare metal? Why are we at a cloud native conference talking about running things on bare metal? Well, there are some reasons why we can't use a public cloud. First and foremost, we value the privacy of our users very much, and the contract that we have with our users in terms of their privacy; we feel that if there is a third party actor in the middle, we would have to trust that third party fully in order to give our users the same guarantees, and that's not going to happen with a public cloud. Also, we already have our infrastructure and we are already maintaining it, so it's not like there is an extra cost for just maintaining a bare-metal cluster. And finally, cost: as I said in the first slide, we have quite some amount of traffic; think of what the bill would be for serving all of those bytes out of a cloud, and we don't really have that money, being a non-profit that relies on donations from individuals.

Why are we not doing a private cloud then? Well, we already do, actually, for another project that we run on top of OpenStack with Kubernetes, but that's a platform-as-a-service for editors, so they can build tools that help them edit the wikis. For the production path we felt it's not really something we want, because you want to reduce the number of moving parts, and OpenStack is actually more moving parts than Kubernetes, so we didn't see a reason to do that; the elasticity we need is already given to us by Kubernetes itself. Since we want to move the whole application layer to Kubernetes, we have a set of machines that are dedicated to that, and we just want to shift resources between one service and another, so we don't need the additional elasticity that a cloud would give us. This is it for my part; I'll let Alex talk about how we set up the cluster.

Hello everyone, nice to be here. So, talking about how our clusters are set up: our SRE team is heavily populated with Debian developers, there are a lot of those people there, and that's why we decided to go, of course, with the Debian packages way. We package our own Debian versions of Kubernetes, and very recently, at this conference, we found out that Google is also providing debs for people, so thank you for that; we aim to evaluate them and, if they are good enough, use them, and I'm hoping that they are, because maintaining that package is a little bit of a pain; anyone who has ever gone through the process of building Kubernetes can probably attest to that. We are currently at version 1.7, which of course, as you probably know, is unmaintained right now, because the Kubernetes team only maintains the last three releases, so it's 1.8, 1.9 and 1.10 that are currently maintained, and we need to upgrade; maybe we'll be here next year telling you how we upgraded through all these versions. We also use Calico, which for those of you who do not know is basically a CNI plugin for Kubernetes, and we are currently at version 2.2.0; of course 3.0 is out, and again we need to upgrade. We are also using etcd 2.2.1, and you're probably already asking why we are not on 3.x, and the answer is: yes, we need, again, to upgrade. We're packaging all this software on our own and we need to maintain it.
Our clusters: well, we're a very big Puppet shop. If you go to our Gerrit, or to GitHub where we have a mirror of our Gerrit, you can see our Puppet repo; it's, I think, the biggest open source Puppet repo out there, even though it doesn't have a license, but please don't judge us on that. We configure everything via Puppet, so we configured all of the Kubernetes components via Puppet as well, and that had one very nice effect: it allowed us to have TLS for all the components everywhere from the get-go, just by reusing the Puppet CA. It also showed a couple of problems with the Puppet CA in the process; we would like at some point to migrate away from it. What we do not maintain via Puppet, even though we pondered the idea, is the Kubernetes resources. There is even a Puppet module on Puppet Forge that allows you to do that, but thinking about it, we realized that maintaining such a Puppet module, making it generic and publishing it on Puppet Forge would be a pretty big ordeal, and we even realized later on that it would not make sense for our case, because we end up maintaining most of the resources, if not all of them, via Helm in the end.

A little bit more about how those clusters are set up. We have API servers, two per cluster, and they are behind a load balancer, an in-house load balancer which I will describe in a little more detail afterwards. We also run the kube-scheduler and the kube-controller-manager on the same node, and we configure those two components to talk to the API server on the unauthenticated local port, relying just on firewalling in order to protect our API. That allowed us to avoid, for now, the authentication and authorization that we would otherwise need for those two components to talk to the API server.

A little bit more about those production clusters. There are currently three; well, one is not really production, it's the staging cluster, which we are now tying into our build and deployment pipeline, and then we have one per primary DC (we were talking about the primary DCs previously), so we have a separate cluster for each of them. For every one of those Kubernetes clusters we of course use an etcd cluster of three, in order to maintain quorum. They are of course DC-local, in order to keep latency low, and funnily enough those are on VMs from Ganeti; Ganeti is a mature virtualization orchestration framework by Google, so thank you Google for that as well. As you can expect, we run kube-proxy in the default way, which is iptables mode, but we are evaluating IPVS; we are already a very heavy LVS-DR shop, so this is actually interesting for us and we already have expertise in it.

And we host our own registry, and we enforce it: we do not allow things off of Docker Hub running in our infrastructure. The reason for that is what you can probably guess, and a lot of people have said it at this conference: if you're pulling things from Docker Hub, you cannot guarantee what you're running, and most importantly whether you can trust the person maintaining it, whether they have updated the specific image that you are running. So what we do is we build and run all our images from scratch, and you can even go and check out our images; feel free to see what kind of images we have there. They are of course very specific to our use cases, but feel free; they are Debian-based, as I said, we're a Debian shop. The backend for this specific Docker registry, which uses the reference Docker registry from Docker, is OpenStack Swift, and it works remarkably well.
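For illustration only, a minimal configuration of the reference registry with a Swift storage backend might look roughly like the sketch below; the auth endpoint, credentials, container name and file paths are placeholders, not Wikimedia's actual settings.

```yaml
# Sketch of a Docker (distribution) registry config.yml using the
# OpenStack Swift storage driver; all values below are placeholders.
version: 0.1
storage:
  swift:
    authurl: https://keystone.example.org/v3   # placeholder Keystone auth endpoint
    username: registry-user                    # placeholder credentials
    password: CHANGEME
    container: docker-registry                 # Swift container that stores the image blobs
http:
  addr: :5000
  tls:
    certificate: /etc/docker/registry/tls.crt  # placeholder paths for TLS material
    key: /etc/docker/registry/tls.key
```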
We went with RBAC, role-based access control, from day one, and the reason for that is very simple: we're not a cloud, we cannot spin up Kubernetes clusters on demand, which means that we cannot assign a Kubernetes cluster per team or per service or whatever, so we had to have RBAC in order to protect the various tenants of the infrastructure from each other. We have a policy of one namespace per service, and the reason for that is Conway's law. For those of you who do not know Conway's law, it basically says that the structure of your organization is going to show up in your products, and what has happened in our organization is that there have been a lot of changes, quite a few reorganizations, in the last few years, so we decided, instead of going with a team-centric approach, to go for a product-centric approach, where the product is of course the service.
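As a rough sketch of what this one-namespace-per-service, RBAC-protected setup can look like (the namespace name, user name and resource lists here are hypothetical, not Wikimedia's actual policy):

```yaml
# Hypothetical per-service namespace with a deploy user that is only
# allowed to act inside that namespace. rbac.authorization.k8s.io/v1beta1
# matches the Kubernetes 1.7 era discussed in the talk.
apiVersion: v1
kind: Namespace
metadata:
  name: example-service
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: deployer
  namespace: example-service
rules:
- apiGroups: ["", "apps", "extensions"]
  resources: ["pods", "services", "deployments", "replicasets", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: deployer-binding
  namespace: example-service
subjects:
- kind: User
  name: example-service-deploy      # hypothetical per-service deploy user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
```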
Authentication-wise, we currently use the token auth file. We have evaluated all the various authentication modules; there is actually a task on our Phabricator (that's phabricator.wikimedia.org), search for it if you want, written by yours truly, which evaluates each one of those authentication methods and why we ended up with token auth, which is a little bit annoying to maintain. We would like at some point to go to the webhook authentication mode, but we first need to run a service that actually allows you to authenticate against it. Admission controllers: we have all the standard admission controllers that are recommended for our Kubernetes version, we have not deviated from that.

What we do have, however, is ferm, which is a wrapper around ip6tables and iptables that makes managing rules via these tools way, way easier. We have this across the entire fleet; there is not a box without it (well, that cannot be absolutely true, but most of the boxes have ferm), and they have a default policy of DROP, and that also applies to the Kubernetes nodes and of course the API servers. We were kind of worried about this, because we also apply ferm to our OpenStack boxes and, well, there are race conditions: you reload ferm, it of course flushes the entire ruleset via iptables, and then nova-network does not notice, because nova-network just executes commands and does not monitor anything, and of course you're left without the rules you need, with the default DROP, and you're out of order. The good thing is that kubelet, kube-proxy and the Felix agent, which is what Calico names the thing that runs on your node, all monitor and fix their rules, and this works way better than we feared.

So let's go a little bit into the network. This is our typical primary DC. There are four rack rows, and as you can see we're adding a fifth one; every rack row is comprised of eight racks, and every rack has one ToR, a top-of-rack switch, and they are stacked into being a single managed switch, which means that basically if you reboot that specific switch, the entire rack row is going to go down for the duration of the reboot, which, by the way, on those big Juniper switches does take a little bit of time. It's a little bit of a collapsed Clos topology, if you don't want to be very pedantic about it, in the sense that we have those two core switches, which are basically 10G port switches, acting as the spines, and the rack switches are the leaves. The thing to note in this specific diagram is that if you have a box in one rack row, in order to reach any other box in any other row, you have to go through the routers. Now, we might have quite a bit of traffic outbound, but the traffic between rows is not that huge, and it's fine for us to use those routers, because we feel that, at this specific point in time, Layer 3 is more debuggable and familiar to us.

A little bit more about it. Currently the clusters are rather small, four machines, so we have one per row, but we plan to increase this to 12 very soon. And we had this requirement since the beginning that the Kubernetes clusters needed to be fully compatible with the legacy infrastructure, where legacy is what we currently have, which meant that we could not do NAT or network encapsulation anywhere. We started looking at the various solutions, and two years ago, at FOSDEM 2016, there was a Calico talk; we looked at it and we said, whoa, this thing is what we need. What Calico does is that you have an agent on every one of your nodes and it talks BGP with either all the other nodes, or with your routers, or whatever you tell it, it's fully configurable, and it announces the pod IPs to the rest of the world. So we have it configured, again, via Puppet (well, partly via Puppet, partly manually, because there are two things that we do not currently do via Puppet in Calico, and those are the IP pools and the BGP peers, the routers basically), and it's working quite fine, we're pretty happy with it. We do not do a BGP full mesh, which would be all the nodes talking to each other, exchanging the routes to the pods, so that you would have only one hop for one pod to reach another, because, as I said, we have this rack-row-aware networking where you have to go via the router in order to reach a different rack row. We're thinking about a row-specific full mesh, though, in order to keep things more local and have more affinity within the rack row for inter-pod traffic.

We are using 10.x/8 IP addresses for the pods, and they're fully routable in our network thanks to Calico, and we are using the same IP space, but a different /24, for the service IPs. And this is the part where you'll say: service IPs are not really routable, they're not really IPs to start with in Kubernetes, they are basically tags, the pod just needs an IP to talk to. So those are effectively just reservations; we reserved IP space because we do not want to have 192.168 IPs in our network, because whenever we met those, it was due to either a misconfiguration or something very confusing going on, so it was just about maintaining the status quo.
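A Calico IP pool for fully routed pod addresses, with IP-in-IP encapsulation and outgoing NAT disabled, might be declared roughly like this in the calicoctl v2-era resource format (the CIDR below is a placeholder, not the real allocation):

```yaml
# Sketch of a calicoctl (v2.x) ipPool resource: pod IPs are announced
# over BGP and routed natively, with no encapsulation and no NAT.
apiVersion: v1
kind: ipPool
metadata:
  cidr: 10.64.64.0/21    # placeholder pod CIDR carved out of the 10/8 space
spec:
  ipip:
    enabled: false       # no IP-in-IP tunnelling, pure L3 routing
  nat-outgoing: false    # pods use their own routable addresses to talk out
  disabled: false
```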
One thing that Calico has allowed us to do, which was very nice (thank you Calico for that), is that even though Kubernetes does not support IPv6, at least in our version, our pods are IPv6-enabled. It was very easy to do that in the Calico IPAM configuration: just set assign_ipv4 to true and assign_ipv6 to true, and then we had to go to every node and set the two sysctl configurations that you see there. Now, for those of you who have done IPv4 forwarding but haven't done IPv6 forwarding on Linux: that setting does not do what it does in IPv4, go read the docs, they are completely different, they do different things. The second one is there because we wanted IPv6 routing on our nodes, but at the same time we did not want to go around and manage all the aspects of a router in Linux; that value there, accept_ra=2, allows the nodes to keep receiving router advertisements from our Juniper routers and continue working as the normal boxes that we know, while having the capability of forwarding IPv6 packets for the pods.

We also have network policies. For those who don't know, 1.7 does not support egress policies; 1.8 does, so we had to work around that, and we're hoping to drop the workaround in the future. Calico itself does support egress, so you may ask how we implemented the workaround. The second thing that we hit was the fact that 1.7 does not allow you to change a network policy: you have to bring it down and reinstall it in order to change it, and we had to do that via Helm. What we actually implemented is a patch to the Calico Kubernetes policy controller; the version was 0.6.0, the Python one, which is no longer there because it was rewritten in Go, so our solution is not needed anymore, and thankfully with 1.8 we don't even need it. It's really, really small: we have a default ingress policy, and our patch is very minimal, fourteen lines of code; all it does is read a ConfigMap and populate the Calico configuration.
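As a rough sketch of what a per-namespace default ingress policy can look like with the networking.k8s.io/v1 API that ships in 1.7 (the namespace, labels and port below are placeholders):

```yaml
# Select every pod in the namespace and only allow the ingress sources
# listed here; anything not matched is dropped for the selected pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-ingress
  namespace: example-service       # hypothetical namespace
spec:
  podSelector: {}                  # applies to all pods in the namespace
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: trusted-clients    # placeholder whitelist; adjust to real sources
    ports:
    - protocol: TCP
      port: 8080                   # placeholder application port
```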
What about Ingress? We evaluated it back in 2017 and said: nope, not yet, not ready, not for us. So how do we route traffic to our clusters, you ask? That's a pretty valid question. What we do is use an in-house Python daemon that we have had for, what is it now, eight years maybe; it's called PyBal, and there is a URL for you on GitHub if you want to go and take a look. What it does is manage and monitor LVS-DR entries in the kernel. LVS-DR is LVS direct routing: when a packet reaches the load balancer, the only thing the load balancer does to the IP packet is change the MAC address, it doesn't alter the IP headers or anything like that, so it's very, very lightweight, and the traffic goes directly to the backend. That means, however, that the backend needs to have that IP set up on its interfaces, otherwise it's not going to accept the traffic. The good thing is that the backend does not have to reply back through the load balancer, and the load balancer does not have to reply to the client: outgoing traffic leaves the backends and goes straight back to the client, instead of going through your load balancers, which is good in our case because we have minimal incoming traffic, basically a GET of an article, but we have a lot of outgoing traffic from those boxes. We have a lot of expertise with this software and we decided to reuse it, and we know from a blog post from GitHub that they have more or less a similar approach, a little bit different, but we're happy that at least we're not alone in this. And we're using NodePort services with external IPs; that changed, by the way, in Kubernetes 1.6 if I remember correctly, it used to be different in the API, which caught us by surprise for a little bit, but then we figured out that this is actually a better way.

About metrics collection: we actually collect metrics with Prometheus (thank you for that), and it works quite well; the only thing that we did was have Puppet populate the configuration. So what we have is that Prometheus discovers targets via the API and polls the kubelets and the kubelets' cAdvisor; it also polls the kubelet API itself, and we hit a bug in 1.7.3 where the API was a little bit different. The interesting thing was that the Prometheus community had already documented it in the example configuration, which was great. Applications, however, are a little bit more interesting, so we also use Prometheus for them. What we did: all our applications historically use StatsD, so we have statsd-exporter as a sidecar container in all our pods; applications talk to localhost and push all their StatsD metrics, and then Prometheus discovers all the pods, polls them, and gathers the information.
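A pod using the statsd-exporter sidecar pattern described here might look roughly like the following; the image names, ports, environment variable names and annotations are illustrative placeholders, not the actual charts.

```yaml
# The application pushes StatsD metrics to localhost; the sidecar
# translates them and exposes them over HTTP for Prometheus to scrape.
apiVersion: v1
kind: Pod
metadata:
  name: example-service
  annotations:
    prometheus.io/scrape: "true"   # common convention; depends on the scrape config in use
    prometheus.io/port: "9102"
spec:
  containers:
  - name: app
    image: registry.example.org/example-service:1.0      # placeholder image
    env:
    - name: STATSD_HOST            # hypothetical variable the app reads
      value: "localhost"
    - name: STATSD_PORT
      value: "9125"
  - name: statsd-exporter
    image: registry.example.org/statsd-exporter:latest   # placeholder image
    ports:
    - name: statsd
      containerPort: 9125          # StatsD input (UDP by default for statsd_exporter)
      protocol: UDP
    - name: metrics
      containerPort: 9102          # Prometheus metrics endpoint
```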
We visualize all of that through Grafana, and here's a link if you want to visit our Grafana installation. Big kudos to the Grafana team, this is excellent software, thank you very much; I'm pretty sure you all know it, it's just a shout-out.

Alerting is not that great currently, and this is mostly for historical reasons. We use Prometheus again, but partly because we're still on Icinga; for those of you who do not know it, it's monitoring software from the 90s, very node-centric, not service-centric, and we want to migrate off of it. We started by integrating a check_prometheus_metric script written in Bash, and of course we had problems with it, because Bash does not support floats, plus various other small corner cases, and we ended up rewriting it in Python. And if you add it in Puppet, which is how we manage all of our alerts, you can get a very, very weird-looking alert definition, like: what does this thing even do? We're fixing this; one of my colleagues actually, while we were here, submitted patches to fix it, so all of the alerts themselves are now in the Prometheus engine and we just poll the specific named alert. The last thing that we did is that we decided to go with a different way of monitoring applications than what Icinga did, because applications are a little bit difficult to monitor when it comes to APIs: each one of the endpoints can fail in a myriad of ways while all the other endpoints don't, so it's difficult to go around and write software that monitors all of these APIs and is Icinga-compatible and all that. So what we did is we extended our Swagger specs for all these applications with a very specific stanza that allowed us to create a checker that walks all the API endpoints and submits a test query to each specific endpoint. And that's where I pass the ball back to Giuseppe.

Okay, so we called the project Streamlined Service Delivery, because the main point for us was that Kubernetes allows you to change the way you manage the lifecycle of software, and specifically of microservices, from development to production. There is a big diagram of what we did, which is pretty detailed; I won't get into it, or I'd let you leave at six o'clock. I'm just going to concentrate on the details of the deploy stage; if you want to see the details of everything, it's all documented on-wiki, so if you search with enough tenacity you will find all the documentation about this. So let's concentrate on deployments. As we said, we have basically one namespace per service, and we will have several people who should be allowed to deploy that service, but we want people to be abstracted from the problem of knowing the details of how Kubernetes works. So we use Helm, which is working very well for us. How did we set up Helm in production so that we could secure it as much as possible? We have one Tiller instance running in each namespace, with specific RBAC rules which are basically what is suggested in the Tiller guide, apart from the fact that we also need to set up network policy rules for egress, because we basically whitelist the ability of a pod to make connections to other pods. And we have a deploy user, which is just allowed by RBAC to speak with Tiller, basically, so the only thing that somebody who has the credentials for the deploy user can do is use Helm on the cluster; they can't use kubectl. We restricted kubectl to administrators, and normal deployers just have to use Helm, so we don't have to worry about whether they can basically shoot themselves in the foot with kubectl.

All of this is implemented with a very simple, and kind of shameful at the moment, wrapper around Helm. This wrapper does some more things. As we said, we have two primary data centers, and most things, well, all things that we deploy on Kubernetes at the moment, are active-active between the two data centers, so we want a release to go to both data centers, and this wrapper does that. Also, we are working at the moment (we hoped to be finished before the conference, but we're not) on doing canary releases. The way we do it is pretty simple, we cheat basically: we have two Helm releases for each service, a canary release with a small set of replicas and a production release with a larger set of replicas, and the service definition has selectors that select all the pods from both releases. Whenever we do a new release, Helm first upgrades the canary release, we check the metrics and the logs to see if everything is green, and then we go on and upgrade the production part. So basically what we do is a proper canary: a number of users, proportional to the number of canary replicas in the two clusters, get the new version for a small amount of time, and if we don't roll back after a few minutes, everybody gets the new version. Finally, as I said, it's a shameful Bash script at the moment; what we plan to do is to try to rewrite everything, including this part, as Helm plugins, so that we can basically share it with the rest of the community. And finally, we have our very own and very new Helm repo, which at the moment has just one chart; it's a public repo, so that in the future anybody can run the services we run in production using our charts, and here's the link.
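Rendered out, the canary trick described above amounts to something like the sketch below: two Deployments that share a label, and one Service whose selector deliberately matches both. Names, image tags, replica counts and the external IP are placeholders.

```yaml
# Production release: the bulk of the replicas, upgraded last.
apiVersion: extensions/v1beta1     # Deployment API group commonly used in the 1.7 era
kind: Deployment
metadata:
  name: example-service-production
spec:
  replicas: 8
  template:
    metadata:
      labels:
        app: example-service       # shared label, matched by the Service below
        release: production
    spec:
      containers:
      - name: example-service
        image: registry.example.org/example-service:1.1   # current version
---
# Canary release: a small slice of the traffic, upgraded first.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: example-service-canary
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: example-service
        release: canary
    spec:
      containers:
      - name: example-service
        image: registry.example.org/example-service:1.2   # new version under test
---
# One Service selects pods from both releases, so traffic splits roughly
# in proportion to the replica counts; NodePort plus externalIPs matches
# the LVS-DR routing described earlier.
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  type: NodePort
  externalIPs:
  - 203.0.113.10                   # placeholder service IP announced via LVS
  selector:
    app: example-service           # deliberately omits the release label
  ports:
  - port: 8080
    targetPort: 8080
```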
And finally, since we're almost done and I think everybody's very tired at this point of the conference, I want to thank you for sticking with us until five o'clock. And a shameless plug: as I said, our team is too small and we are hiring at the moment, and I'm leaving this slide up because it has some links to the relevant resources that you can consult. Remember, we're a hundred percent free software shop, so everything we've talked about and everything we do is publicly available as free software, and you can take a look and even contribute if you want to. That concludes our talk, with a couple of minutes available, and if somebody has questions we are here to answer them. [Applause] [Music] [Applause]

Audience: The slide said you were thinking about moving from Docker to rkt; what's motivating that mindset? So the question is about why we evaluated rkt. I think rkt has some interesting characteristics, the most interesting of which, from our perspective, is the ability to use the TPM to verify the images and check consistency. We're very concerned with security; remember, our site accepts arbitrary input, both in text and binary form, because Wikimedia Commons accepts binary uploads as well, so we have to be very, very careful with security, and that was one thing that seemed appealing. I don't think the support for rkt was there when we decided to deploy Kubernetes in production, so we're sticking with Docker for now, and we have quite some tooling around Docker that we would have to adapt to rkt whenever we want to do that.

Audience: You called the session "installing on bare metal", so are you installing Kubernetes and all your Docker containers on bare metal or on virtual machines? The virtual machines are only the API servers; everything else is bare metal. Yes, the physical servers are used for running pods and all the applications, basically. Okay, I don't know if there are any other questions, I can't see anybody because I'm right by the lights, but if you have questions we are just around for a few more minutes. Thank you again. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 11,263
Rating: 4.9172416 out of 5
Keywords:
Id: 7rqvRwfZHF4
Channel Id: undefined
Length: 34min 32sec (2072 seconds)
Published: Sun May 06 2018